Job Description
- Implement and enhance monitoring of the hardware & software across our ecosystem
- Developing and improving instrumentations/integrations
- Providing guidance on monitoring best practices
- Providing guidance on monitoring specific hardware & software items (key points to monitor)
- Implement and enhance observability of hardware & software across our ecosystem
- Developing and improving instrumentation
- Providing guidance on key areas to observe
- Educating teams on how observability tools work
- Being responsible for ensuring we provide our internal customers with the best monitoring & observability possible to aid them in raising the quality, reliability & availability of IT corporate infrastructure
- Scripting / Infrastructure as Code for monitoring & observability implementations & enhancements
- Engineering degree or equivalent experience and familiarity with engineering best practices
- Working knowledge of how hardware & software interact in a corporate retail environment
- Familiar with Agile Scrum process
- Ability to interact with a variety of personalities and technical skill levels across multiple product & platform teams
- Proficient in developing and maintaining technical documentation
- Application Servers (IIS, Tomcat, WebSphere, jBoss, etc)
- Containerization (Kubernetes, VMWare, etc)
- Database (SqlServer, Postgres, DB2, Oracle, etc)
- Message Bus (IBM MQ, Kafka, Active MQ, Rabbit MQ)
- Networking (Cisco ACI, F5 Load Balancers, Firewalls, etc)
- Operating Systems (RedHat, Windows, etc)
- Programming (java, .net, pyton, etc)
- Storage Devices
- Web Servers (apache, nginx, etc)
- Nagios
- Datadog
- ServiceNow Event Management / Service Operations Workspace
- Confluence
- BigFix
- Knowledge on the Google Site Reliability Engineering model
- Terraform
- Ansible
- Jenkins
- Skills in troubleshooting production environments (this is not a day to day responsibility of this role but this experience will prove valuable as we build the tools those teams utilize)
- Strong ownership attitude / track record of taking responsibility