Primary Roles & Responsibilities:
In this Site Reliability Engineer role, you will work closely with several data centers, the entire Cloud organization, and IBM vendors to support, maintain, and operationally improve the IBM Cloud infrastructure. You will focus on the following key responsibilities:
- Monitor the health of production and test systems
- Respond promptly to production issues and alerts
- Execute changes in the production environment through automation and AI
- Partner with other SRE teams and program managers to deliver mission-critical services to the market
- Support development of new and existing capabilities for our compute, storage, and network infrastructure services
- Implement and automate infrastructure solutions that support IBM Cloud products and infrastructure
- Support the compliance and security integrity of the environment
- Automate health monitoring of the production and test systems
- Automate return to service procedures for Cloud Service delivery
- Partner with other teams, functional managers, and program managers to deliver mission-critical services to the market
- Create Power BI dashboards on historical and predicted data for client use cases
- Take part in designing and implementing the extraction of key entities from millions of unstructured files using Python NLP techniques and Apache Spark
- Expertise in data interpretation and visualization
- Define problems and opportunities in a complex business area
- Develop advanced analytics products
- Create and develop end-to-end data driven solutions to support and monitor the health of production and test systems
- Extract data from multiple varied sources and integrate it for analytics and application development
- Experience in machine learning engineering, developing self-running AI software that automates predictive models
- Experience designing machine learning systems and algorithms that generate accurate predictions
- Working knowledge of ServiceNow, JIRA, Confluence, and GitHub
- Working knowledge of container technologies: Kubernetes (preferred), Docker, etc.
- Hands-on knowledge of log aggregation software such as Splunk or ELK
- Ability to debug and analyze problems by examining logs and running Unix commands
- Provide an initial assessment of production issues and possible workarounds
- Troubleshoot and resolve production issues
- Identify and resolve issues
- Discuss and plan integration tasks
- Provide technical escalation support for other Infrastructure Operations teams