Platform Reliability Engineer

Company	Peloton
Address	New York, NY
Form of work	Full-Time
Category	Information Technology

Job description

ABOUT THIS ROLE

Peloton is seeking an outstanding Platform Reliability Engineer with a K8s (Kubernetes) focus to join our Platform team. Our team builds and maintains a multi-cluster, multi-region, reliable, and highly scalable Kubernetes platform. In this role, you will have a rare and great opportunity to work with groundbreaking technologies that encourage innovation and ensure the reliability of running workloads in a flexible, scalable, and secure way.

YOUR DAILY IMPACT AT PELOTON

You will be a technical leader within your team, influencing and driving technical investments across partner teams with a "Platform Thinking" attitude. You will help others in design, execution, and problem-solving
Architect, develop, test, release, and support CI/CD systems such as Jenkins, GitHub Actions, Gradle, and Artifactory
Adhere to best practices in architectural design, testing (unit, integration, visual, and regression), and scrum methodology
Assist in planning, execution, and updating of technical roadmaps
Host critical infrastructure that gives our developers the best possible experience running Kubernetes pods across multiple clusters
Provide fast, automatic scaling for Connected Fitness devices and the eCommerce platform
Develop and manage our Container Orchestration Platform, overseeing a diverse ecosystem of over 2,000 applications. This includes Multi-Cluster/Multi-tenant Kubernetes with 15+ clusters per environment, Istio Multi-cluster Mesh, and an AWS multi-account structure
Design, improve, and implement additional services for our centralized Observability Platforms, ensuring efficient log management based on Splunk and effective monitoring and alerting powered by Datadog and PagerDuty
Provide a platform for machine learning (and other exciting workloads)
Allow developers to move quickly and experiment, without getting in the way
Promote standard methodologies for building and operating highly reliable systems
Consult in code and design reviews, planning, and technical discussions to ensure the resulting work is high quality, efficient, well documented, and meets reliability and capacity requirements
Automate everything, from infrastructure down to day-to-day tasks
Follow the standard incident management process, conduct timely post-mortems of infrastructure incidents, and show sound judgment in knowing when to triage and when to dig into a root-cause analysis
Assist with all aspects of operational security and compliance, seek out potential threats to security and reliability, and advocate solutions
Participate in a rotating on-call duty schedule, providing support and assistance for the services within the Platform team's responsibility

YOU BRING TO PELOTON

A degree in Computer Science, Engineering, or a similar field of study or equivalent work experience
3+ years of experience in software engineering, with a solid understanding of Kubernetes and Infrastructure as Code
1+ years of systems configuration and automation experience (e.g. Ansible, Chef, Puppet, Terraform)
Extensive knowledge and hands-on experience in AWS Cloud infrastructure and services, including CI/CD and IaC provisioning tools (Jenkins, ArgoCD, Scalr, Terraform, and GitHub Actions)
Experience in a cloud environment like AWS or GCP, and familiarity with running containerized services
Experience with a programming language like Python, Golang or Java.
Knowledge of standard practices in observability and monitoring for Kubernetes clusters at scale, with experience in cost optimization tools like Kubecost, Goldilocks, etc.
Knowledge of standard practices for securing a Kubernetes cluster and its deployments at scale

BONUS

Passion for helping development teams make the transition to a container-native world
Passion for reliable, scalable, observable software with a sense of ownership
Experience designing and operating large, reliable, and scalable distributed systems
Knowledge of network infrastructure basics, including DNS, DHCP, firewalling, and load balancing, to facilitate multi-functional collaboration.

#LI-Hybrid

#LI-SW2

Refer code: 8164396. Peloton - The previous day - 2024-02-08 14:01
