Company: Tesla

Address: Palo Alto, CA
Form of work: Full-time
Category: Engineering/Architecture/Scientific

Job description

Tesla's Supercomputing/AI infrastructure team works directly with the high-performance computing and machine learning infrastructure on which our ML algorithms run; this includes virtual simulations, Autopilot hardware, silicon design, and Dojo. With the rapidly growing need for more data and optimized compute resources, cluster builds are getting larger and increasingly complex. Continued development and automation of deployment, monitoring, self-healing, and alerting processes is imperative to the success of our engineering groups. As the scope and impact of our Full Self-Driving (FSD) & Robotaxi efforts continue to scale, so does the value of this team and its work.


As a Site Reliability Engineer, you will be responsible for maintaining and improving our infrastructure to ensure engineering teams across Autopilot/AI & Dojo have the necessary tools and resources to be productive. This includes managing/operating our HPC clusters, monitoring compute/GPU/network metrics, Linux troubleshooting & performance tuning, and collaborating with our Data Center team to coordinate the smooth operation of hundreds of servers & bring up new GPU capacity. Your work will directly facilitate neural network training at scale, streamline FSD development, and enable Dojo to become the most powerful supercomputer to date.

Requirements

  • Proficiency in Python, Golang and/or Bash
  • Proficiency with Linux fundamentals and performance optimizations (Ubuntu/RHEL OS)
  • Demonstrable knowledge of TCP/IP, IPoIB, Linux operating system internals, filesystems, disk/storage technologies and storage protocols
  • Experience collaborating with network and data center teams for large scale cluster builds
  • Experience with configuration management software (Ansible, etc.), systems monitoring & alerting (Prometheus, Grafana, Telegraf, Splunk, etc.), and/or administering HPC workload managers
  • Experience with high-throughput low-latency networks, GPU-based computing systems, and/or high performance storage systems
  • Experience with Slurm, LSF, and storage management of distributed parallel file systems is a plus
  • Bachelor's Degree in Computer Science, Computer Engineering, Electrical Engineering, Physics, or proof of exceptional skills in a related field
  • 3+ years of additional equivalent experience or evidence of exceptional ability related to the position
Reference code: 9424375. Tesla - The previous day - 2024-06-29 23:25

Related jobs

Site Reliability Engineer, AI & HPC Infrastructure

Principal Site Reliability Engineer (DevSecOps)

Oracle

Pleasanton, CA

2 days ago - seen

Site Reliability Engineer, ASE Block Storage

Software And Services

Cupertino, CA

6 days ago - seen

Site Reliability Engineer

Atlassian

San Francisco, CA

a week ago - seen

Staff Site Reliability Engineer - Remote, US

Earnest Current Job Openings

San Francisco, CA

a week ago - seen

Site Reliability Engineer

Adobe

San Jose, CA

2 weeks ago - seen

Site Reliability Engineer, Data Analytics

Software And Services

San Diego, CA

3 weeks ago - seen

Senior Staff Site Reliability Engineer

Nvidia

$164,000 - $310,500 a year

Santa Clara, CA

a month ago - seen

Senior Software Engineer, Site Reliability Engineering

Forward

$100,000 - $220,000 a year

San Francisco, CA

a month ago - seen

Cloud DevOps / Site Reliability Engineer, Applied Machine Learning

Software And Services

Sunnyvale, CA

a month ago - seen

Site Reliability Engineer - Redis

Software And Services

Cupertino, CA

2 months ago - seen

Site Reliability Engineer - Solr

Software And Services

Cupertino, CA

2 months ago - seen

Site Reliability Engineer (remote - CA locals only)

Culturetech Solutions

Sacramento, CA

2 months ago - seen

Senior Site Reliability Engineer (SRE) - ASE / iCloud

Software And Services

Cupertino, CA

2 months ago - seen

Lead Site Reliability Engineer

Job Board

San Francisco, CA

2 months ago - seen

DevOps & Site Reliability Engineer (SRE)

Hardware

Cupertino, CA

2 months ago - seen

Sr Site Reliability Engineer - Cross Functional

Software And Services

Cupertino, CA

3 months ago - seen

Site Reliability Engineer (SRE) - ASE / iCloud

Software And Services

Cupertino, CA

3 months ago - seen