At Apple, we work every day to create products that enrich people's lives. Our Advertising Platforms group makes it possible for people around the world to easily access informative and imaginative content on their devices while helping publishers and developers promote and monetize their work.
Our technology and services power advertising in Apple News and Ads in the App Store. Our systems are highly performant, deployed to handle high-volume events at scale, and set new standards for enabling effective advertising while protecting user privacy.
The Ad Platforms team is seeking a Senior Site Reliability Engineer. Our mission is to enable Ad Platforms to deliver advertisements reliably and at scale, resulting in great user experiences.
As a Site Reliability Engineer, you will provide the platform that keeps mission-critical ad-tech systems up, scales them seamlessly, and allows new applications and services to flourish.
The successful candidate will be highly self-motivated and passionate about excellence, quality, and detail. The SRE will not only support operations but also work closely with the developers and architects on the team, contributing to design and implementation to improve stability, security, and scalability.
Key Qualifications
5+ years managing clustered services, distributed systems, and production data stores
Expert understanding of Linux-based systems and deep expertise in Hadoop/YARN/Spark technologies
Hands-on experience with AWS services (EMR, S3, Glue, Athena) and Kubernetes infrastructure
Expertise in designing, implementing, and administering large Hadoop clusters and related infrastructure such as Hive, Spark, HDFS, HBase, Oozie, Presto, Flume, Airflow, and ZooKeeper
Experience managing the life cycle of data services, from inception and design through deployment, operation, migration, administration, and sunset
Experience running machine learning pipelines (model training, experimentation) and JupyterHub/GPU compute/PyTorch infrastructure
Experience managing Cloudera CDH5/CDH6/CDP clusters and prior capacity planning for large-scale multi-tenant clusters
Ability to code well in at least one language (Shell, Ruby, Python, Java, Perl)
Experience in setup/management of security infrastructure such as Kerberos
Strong work ethic and tenacious troubleshooting and analytical skills
Multi-datacenter deployment / Disaster Recovery experience is a plus
Prior experience with advertising and related data pipelines (click streams, etc.) is a plus
Description
Design and implement scalable data platforms for our customer-facing services.
Monitor production, staging, test, and development environments for multiple teams in an agile/dynamic, fast-paced engineering organization.
Deploy and scale Hadoop infrastructure to support data pipeline and related services.
Build infrastructure capabilities to improve the resiliency and efficiency of the systems and services at scale.
Drive data infrastructure/pipeline, services, and upgrade/migration projects from start to finish.
Support day-to-day operations, administration, and maintenance of the Hadoop/HDFS infrastructure.
Monitor and troubleshoot data clusters.
Perform capacity planning, management, and troubleshooting for HDFS, YARN/MapReduce, and Spark workloads.
Participate in a rotational on-call schedule.
Partner with program management, network engineering, and other cross-functional teams on larger initiatives.
Work simultaneously on multiple projects competing for your time and understand how to prioritize them accordingly.
Education & Experience
BS/MS in Computer Science or an equivalent field.