ABOUT THIS ROLE
Peloton is seeking an outstanding Platform Reliability Engineer with a Kubernetes (K8s) focus to join our Platform team. Our team builds and maintains a multi-cluster, multi-region, reliable, and highly scalable Kubernetes platform. In this role, you will have a rare opportunity to work with groundbreaking technologies that encourage innovation and keep workloads running reliably in a flexible, scalable, and secure way.
YOUR DAILY IMPACT AT PELOTON
- You will be a technical leader within your team, influencing and driving technical investments across partner teams with a "Platform Thinking" attitude. You will help others in design, execution, and problem-solving.
- Architect, develop, test, release, and support CI/CD systems such as Jenkins, GitHub Actions, Gradle, and Artifactory
- Adhere to best practices in architectural design, testing (unit, integration, visual, and regression), and scrum methodology
- Assist in planning, execution, and updating of technical roadmaps
- Host critical infrastructure that gives our developers the best possible experience running Kubernetes pods across multiple clusters
- Provide fast, automatic scaling for Connected Fitness devices and the eCommerce platform
- Develop and manage our Container Orchestration Platform, overseeing a diverse ecosystem of over 2,000 applications. This includes Multi-Cluster/Multi-tenant Kubernetes with 15+ clusters per environment, Istio Multi-cluster Mesh, and an AWS multi-account structure
- Design, improve, and implement additional services for our centralized Observability Platforms, ensuring efficient log management based on Splunk and effective monitoring and alerting powered by Datadog and PagerDuty
- Provide a platform for machine learning (and other exciting workloads) that allows developers to move quickly and experiment, without getting in the way
- Promote standard methodologies for building and operating highly reliable systems
- Consult in code and design reviews, planning, and technical discussions to ensure that the work is high quality, efficient, and well documented, and that it meets reliability and capacity requirements
- Automate everything, from infrastructure down to day-to-day tasks
- Follow the standard incident management process, conduct timely post-mortems of infrastructure incidents, and show sound judgment in knowing when to triage and when to dive into a root-cause analysis
- Assist with all aspects of operational security and compliance, seek out potential threats to security and reliability, and advocate solutions
- Participate in a rotating on-call duty schedule, providing support and assistance for the services within the Platform team's responsibility
YOU BRING TO PELOTON
- A degree in Computer Science, Engineering, or a similar field of study or equivalent work experience
- 3+ years of experience in software engineering, with a solid understanding of Kubernetes and Infrastructure as Code
- 1+ years of systems configuration and automation experience (e.g. Ansible, Chef, Puppet, Terraform)
- Extensive knowledge and hands-on experience in AWS Cloud infrastructure and services, including CI/CD and IaC provisioning tools (Jenkins, ArgoCD, Scalr, Terraform, and GitHub Actions)
- Experience in a cloud environment like AWS or GCP, and familiarity with running containerized services
- Experience with a programming language like Python, Golang, or Java
- Knowledge of standard practices in observability and monitoring for Kubernetes clusters at scale, with experience in cost optimization tools such as Kubecost and Goldilocks
- Knowledge of standard processes for securing a Kubernetes cluster and its deployments at scale
BONUS
- Passion for helping development teams make the transition to a container-native world
- Passion for reliable, scalable, observable software with a sense of ownership
- Experience designing and operating large, reliable, and scalable distributed systems
- Knowledge of network infrastructure basics, including DNS, DHCP, firewalling, and load balancing, to facilitate cross-functional collaboration