Job description
Incident Management:
Lead and coordinate incident response activities, ensuring timely detection, escalation, and resolution of production issues.
Collaborate with cross-functional teams to mitigate the impact of incidents and prevent recurrence.

Production Monitoring:
Design, implement, and maintain robust production monitoring systems to proactively identify potential issues before they impact users.
Analyze monitoring data to identify trends, patterns, and areas for improvement in system reliability.

Operations Leadership:
Provide technical leadership to the SRE team, fostering a culture of continuous improvement and innovation.
Collaborate with development teams to integrate reliability best practices into the software development lifecycle.

Capacity Planning:
Work closely with infrastructure and capacity planning teams to ensure scalability and performance of systems.
Proactively identify and address potential capacity issues before they impact system performance.

Documentation:
Maintain comprehensive documentation of system architecture, configurations, and procedures to facilitate efficient incident response and knowledge sharing.

Collaboration:
Collaborate with cross-functional teams, including development, QA, and product management, to drive improvements in system reliability and performance.

Post-Incident Analysis:
Conduct thorough post-incident analyses to identify root causes and contributing factors, and implement preventive measures to avoid recurrence.