Tesla's Supercomputing/AI infrastructure team works directly with the high-performance computing and machine learning infrastructure on which our ML algorithms run; this includes virtual simulations, Autopilot hardware, silicon design, and Dojo. With the rapidly growing need for more data and optimized compute resources, cluster builds are getting larger and increasingly complex. Continued development/automation of deployment, monitoring, self-healing, and alerting processes is imperative to the success of our engineering groups. As the scope and impact of our Full Self-Driving (FSD) & Robotaxi efforts continue to scale, so does the value of this team and its work.
As a Site Reliability Engineer, you will be responsible for maintaining and improving our infrastructure to ensure engineering teams across Autopilot/AI & Dojo have the necessary tools and resources to be productive. This includes managing/operating our HPC clusters, monitoring compute/GPU/network metrics, Linux troubleshooting & performance tuning, and collaborating with our Data Center team to coordinate the smooth operation of hundreds of servers and to bring up new GPU capacity. Your work will directly facilitate neural network training at scale, streamline FSD development, and enable Dojo to become the most powerful supercomputer to date.