Software Engineer (Site Reliability) Operations Lead, Enterprise Systems

Company	Machine Learning And Ai
Address	Austin, TX
Category	Information Technology

Job description

Incident Management: Lead and coordinate incident response activities, ensuring timely detection, escalation, and resolution of production issues. Collaborate with cross-functional teams to mitigate the impact of incidents and prevent recurrence.
Production Monitoring: Design, implement, and maintain robust production monitoring systems to proactively identify potential issues before they impact users. Analyze monitoring data to identify trends, patterns, and areas for improvement in system reliability.
Operations Leadership: Provide technical leadership to the SRE team, fostering a culture of continuous improvement and innovation. Collaborate with development teams to integrate reliability best practices into the software development lifecycle.
Capacity Planning: Work closely with infrastructure and capacity planning teams to ensure scalability and performance of systems. Proactively identify and address potential capacity issues before they impact system performance.
Documentation: Maintain comprehensive documentation of system architecture, configurations, and procedures to facilitate efficient incident response and knowledge sharing.
Collaboration: Collaborate with cross-functional teams, including development, QA, and product management, to drive improvements in system reliability and performance.
Post-Incident Analysis: Conduct thorough post-incident analyses to identify root causes and contributing factors, and implement preventive measures to avoid recurrence.

Requirements

Proven experience as a Site Reliability Engineer or in a similar role, with a focus on operations management.
Demonstrated experience managing large-scale production outages and leading incident response.
Deep understanding of production monitoring systems, log analysis, and performance metrics.
Proficient in scripting languages (e.g., Python, Bash) and automation tools.
Strong leadership and communication skills with the ability to effectively collaborate with cross-functional teams.
Experience mentoring and coaching team members to enhance overall performance.
Strong analytical and problem-solving skills with a proactive approach to identifying and addressing potential issues.
Ability to thrive in a fast-paced, dynamic environment and adapt to evolving technologies and business needs.

Reference code: 8279950. Machine Learning And Ai - The previous day - 2024-02-21 12:12
