Stefanini Group is looking for a Business Analyst for a globally recognized company!
Interested applicants can click the apply button or reach out to Alfher Hidalgo at (248) 728-2627/Alfher.Hidalgo@stefanini.com for faster processing. Thank you!
Responsibilities:
- Drive continuous improvement in software quality and infrastructure reliability and resilience.
- Perform analytics on previous incidents to understand root causes and better predict and prevent future issues.
- Create dashboards and reports to communicate key metrics.
- Deploy technology to improve performance, scalability, and stability of systems.
- Track performance against SLOs in partnership with monitoring teams or other stakeholders, and ensure systems continue to meet SLOs over time.
- Remain current with site reliability engineering methods and trends such as observability-driven development and chaos engineering.
- May oversee, design, implement, and manage DevOps capabilities using continuous integration/continuous delivery toolsets and automation.
- Collaborate with development teams to promote the concept of reliability engineering during all phases of the software development lifecycle to detect and correct performance issues and meet availability goals.
- Deliver software to automate manual operational work (i.e., “toil”).
- Work with stakeholders such as product owners to define service level objectives (SLOs) for system operations such as mean time to detect (MTTD), mean time to triage (MTTT), mean time to mobilize (MTTM), mean time to acknowledge (MTTA), and mean time to resolve (MTTR).
- Participate in operational support, including major incidents (MI), and on-call rotation shifts for supported systems and products.
- Conduct blameless postmortems to troubleshoot priority incidents.
- Use automation to reduce the probability and/or impact of problem recurrence.
- Identify, evaluate, and recommend monitoring tools and diagnostic techniques to improve system observability.
- Participate in system design consulting, platform management, capacity planning and launch reviews.
- Collaborate and share lessons learned regarding performance and reliability issues with all stakeholders including developers, other SMEs, operations teams, and project management teams.
- Participate in communities of practice to share knowledge and foster continuous improvement.
Qualifications:
- Bachelor's degree (or equivalent years of experience).
- Minimum 5 years of IT experience, including 3+ years of relevant work experience. SRE (Site Reliability Engineering) experience preferred.
- Prior experience in a corporate IT environment.
- Experience with incident and response management.
- Strong problem-solving and analytical skills, and strong interpersonal, written, and verbal communication skills.
- Highly adaptable to changing circumstances. Interest in continuously learning new skills and technologies.
- Experience with IT Service Management (ITSM) tools (e.g. ServiceNow, PagerDuty).
- Experience with IT enterprise architecture tools (e.g. LeanIX).
- Experience with working in cloud ecosystems (e.g. AWS, Microsoft Azure).
- Experience with monitoring and observability tools (e.g. Splunk, Nagios, SmartBear).
- Experience with programming and scripting languages (e.g. Java, C#, C++, Python, Bash, PowerShell).
- Experience with Agile and DevOps development methodologies.
- Experience with container technologies and supporting tools.
- Experience with configuration management systems (e.g. Puppet, Ansible, Chef, Salt, Terraform).
- Experience working with continuous integration/continuous deployment tools (e.g. Git, Jenkins).
Education: Bachelor (BA, BS...)
Employment Type: CONTRACTOR