COMPANY PROFILE
Join our dynamic team at Lalaith Astor Technical Consulting House, LLC, where we specialize in providing cutting-edge technical solutions to government agencies. As a woman-owned small business (WOSB) and a member of the SBA 8(a) program, we are a small yet fast-growing federal IT contractor. We pride ourselves on a culture of innovation, excellence, and a commitment to delivering high-quality services in complex technical, Internet, and cybersecurity domains.
JOB SUMMARY
The Multi-cloud Site Reliability Engineer (SRE) Subject Matter Expert (SME) will provide our customer with the technical leadership, skills, and solutions needed to support next-generation efforts in this enterprise initiative. The SRE SME will apply their skills and experience to ensure the reliability, availability, and performance of the client's enterprise services in a high-availability environment. The SRE SME will work with the development and operations teams to build and maintain a scalable and robust infrastructure that supports the client’s mission and goals.
RESPONSIBILITIES AND DUTIES
Job responsibilities and duties will include, but are not limited to, the following:
- Design and implement highly available and scalable systems, ensuring the reliability and performance of the company's website and applications in a multi-cloud environment.
- Collaborate with cross-functional teams to define and establish service level objectives (SLOs) and service level agreements (SLAs) for critical systems.
- Participate in system design consulting, platform management, and capacity planning.
- Monitor systems and applications, proactively identifying and resolving any performance bottlenecks or availability issues.
- Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into the health and performance of systems deployed across AWS, GCP, and Azure.
- Develop and maintain automation scripts, configuration management tools, and infrastructure as code (IaC) templates to automate deployment, scaling, and monitoring tasks across multiple cloud platforms.
- Develop and implement guidelines for provisioning, configuring, and optimizing cloud resources to meet performance, scalability, and cost requirements.
- Conduct post-incident analyses to identify root causes and implement preventive measures to avoid future incidents.
- Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding.
- Create and maintain documentation for system architecture, configuration, and troubleshooting procedures.
- Perform capacity planning and resource allocation to ensure optimal system performance and scalability.
- Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability and performance standards.
- Stay up to date with industry best practices, new technologies, and emerging trends in site reliability engineering across major cloud service providers.
REQUIRED QUALIFICATIONS AND SKILLS
The selected candidate must have the following qualifications and skills:
- Strong knowledge of Linux/Unix and Windows systems and command line tools.
- Proficiency in scripting languages such as Python, JavaScript, Shell, or Perl.
- Experience with configuration management tools like Ansible, Puppet, or Chef.
- Familiarity with multiple cloud platforms, such as AWS, Azure, and/or Google Cloud.
- In-depth understanding of and expertise with native cloud tools and solutions.
- Deep understanding of the cloud infrastructure provided by various providers, such as AWS, Azure, and GCP.
- Understanding of networking principles and protocols (TCP/IP, HTTP, DNS, etc.).
- Knowledge of containerization technologies (Docker, Kubernetes) and orchestration tools.
- Expertise in monitoring and logging tools such as Prometheus, Grafana, ELK stack, or Splunk.
- Strong problem-solving and troubleshooting skills, with the ability to analyze and resolve complex technical issues.
- Excellent communication and collaboration skills to work effectively with cross-functional teams.
- Strong attention to detail and ability to work in a fast-paced, dynamic environment.
DESIRED QUALIFICATIONS AND SKILLS
- Experience in architecting and optimizing highly available and scalable systems specifically tailored for government agencies' needs.
- Proven track record of collaborating with government stakeholders to define and establish service level objectives (SLOs) and service level agreements (SLAs) aligned with agency mission objectives.
- Demonstrated expertise in leveraging native cloud tools and solutions provided by major cloud service providers to enhance the reliability and performance of enterprise services.
- Proficiency in containerization technologies and orchestration tools with a focus on ensuring compliance with government security standards and regulations.
- Strong familiarity with federal government compliance requirements and security protocols, including FedRAMP, FISMA, and NIST guidelines, ensuring seamless integration of security measures into multi-cloud environments.
REQUIRED EXPERIENCE
- Years of Industry Experience: 15+ years
- Proven experience as a Site Reliability Engineer or in a similar role.
- Solid understanding of software development methodologies and DevOps principles.
- Experience with agile and iterative development processes.
- Familiarity with continuous integration/continuous deployment (CI/CD) pipelines.
- Experience with source control systems such as Git.
- Knowledge of security best practices and experience implementing security measures in a production environment.
- Ability to work independently and handle multiple projects and priorities simultaneously.
- Strong analytical and problem-solving skills, with a focus on continuous improvement and automation.
Job Type: Full-time
Pay: $175,000.00 - $190,000.00 per year
Benefits:
- 401(k)
- 401(k) matching
- Dental insurance
- Health insurance
- Paid time off
- Parental leave
- Professional development assistance
- Referral program
- Vision insurance
Compensation package:
- Yearly pay
Experience level:
- 11+ years
Schedule:
- 8 hour shift
- Day shift
- Monday to Friday
- On call
Application Question(s):
- Are you willing to undergo a Public Trust background check and maintain the resulting clearance?
- Do you currently hold legal authorization to work for any employer in the United States? Please note that at this time, we are not in a position to sponsor or assume sponsorship of employment visas.
Experience:
- Site Reliability Engineering or Observability Engineering: 8 years (Required)
Work Location: Remote