Job Description
The Site Reliability Engineer (SRE) is a hands-on member of the Network Operations team responsible for collaborating with Engineering, Product Management and Network Operations teams to automate the deployment, testing, provisioning, monitoring, securing and management of Sandata's solutions and infrastructure.
The SRE will be instrumental in helping Sandata migrate from a purely on-prem infrastructure to a highly scalable hybrid cloud-colo configuration.
The ideal candidate will have extensive experience in automating all facets of infrastructure provisioning and management in a continuous integration and deployment environment, using a variety of open source and cloud-based tools that help development and operations with "last mile" delivery.
Duties
· Develop end-to-end automated software delivery and configuration management mechanisms for our CI/CD pipeline
· Create processes, procedures and tools for reporting and visualizing metrics and system health
· Implement automated testing frameworks into the delivery pipeline
· Perform ongoing monitoring and routine application maintenance tasks
· Evaluate new application packages and tools and perform research on best practices
· Participate in 24x7 system reliability support, troubleshooting and incident management activities
· Collaborate with Engineering teams for the development and deployment of microservices
· Help implement best practices to improve quality and time to market
· Help develop strategies for reducing our Recovery Time Objective (RTO)
· Continuously inspect and adapt methods for meeting scalability, reliability, security and performance objectives
Skills and Qualifications
· Strong experience with AWS services (VPC, EC2, IAM, RDS, ElastiCache, Systems Manager, DynamoDB, DocumentDB)
· Good understanding of distributed networks and application high availability and load balancing
· Experience with microservices using containerization tools (Kubernetes/Docker), IaC tools (especially Terraform), Jenkins/Bamboo for CI/CD, and configuration management tools (preferably Ansible)
· Hands-on skills in architecting and implementing end-to-end automation of a CI/CD pipeline, especially the "last mile" for full deployment / release automation with logging, monitoring, alerting, and auto-scaling
· Experience building AWS infrastructure, including security groups and VPCs, and automating its build and provisioning via Terraform or CloudFormation
· Strong scripting experience with Python and PowerShell
· Thorough understanding of *nix and how it works (cpu, mem, cron, ssh, ENV, .*rc, IP, DNS, proxy, top, SSL/TLS, HTTPS, SFTP, SCP, VPN)
· Excellent scripting skills in shell (Bash or other Linux shells) plus experience in scripting/coding in one or more languages including Java, Python, JavaScript (Node.js, Angular, React)
· Solid experience with Docker and container orchestration with Swarm/Compose and Kubernetes/Helm (preferred) in a virtualized (VMware) mixed OS environment (i.e. Windows .NET Core and Linux (RHEL, CentOS))
· Hands-on experience with deploying services such as Tomcat, Jetty, Nginx, Apache, Node.js, Mongo, Cassandra, MySQL, Oracle, MS-SQL, IIS, Redis, RabbitMQ, REST, SOAP, JSON, XML, Prometheus, Consul, Vault
· Ability to train and support development and operations team members in becoming self-sufficient with the CI/CD Pipeline (Jira, Bitbucket/Git, Bamboo, Nexus)
· Expert knowledge of Git and all things "* as Code" (Infra as Code, Config as Code, etc.)
· Security-minded individual with experience embedding security and code quality scanning tools into the development pipeline
· Experience with helping testing teams shift left in quality via automated continuous testing practices
· Strong collaboration and written / verbal communication skills
· Passionate, proactive and self-motivated to achieve results
· Ability to have fun, laugh and generally be a great person to work with
· Bachelor's Degree in Computer Science or equivalent work experience
· 5 years of analysis and programming experience in tooling and service integration
· 3+ years of hands-on experience as a Cloud DevOps Engineer or Site Reliability Engineer
· Experience working in an agile team environment
· Experience working in applications, systems or IT operations
· Experience with automation tools
· Strong troubleshooting and problem-solving skills
Physical Requirements/Work Environment:
Daily activities of an administrative nature. Work is primarily sedentary.
Important Notices: This job description is not an exclusive or exhaustive list of all job functions that a team member in this position may be asked to perform. Duties and responsibilities can be changed, expanded, reduced or delegated by management to meet business needs. The team member is required to sign this document in the space provided below, acknowledging receipt and comprehension of this job description.