The AI Systems & Accelerated Platforms (ASAP) engineering team designs, builds, brings up, tests and lands the hardware systems that power Meta’s products, deployed in our data centers worldwide. Designs are published for industry adoption through the Open Compute Project Foundation.

We are looking for highly skilled engineers with experience in AI Systems Engineering for our ASAP team. The ideal candidate will work as part of a team and operate in a highly multi-tasked, fast-paced and highly cross-functional engineering environment. They will have hands-on experience with the development and adoption of AI systems and general-purpose hardware system design, and with hardware, firmware and software integration for large-scale deployments. They will have deep knowledge and experience in the design of scalable and fault-tolerant risk mitigation frameworks for critical hardware infrastructure systems, with domain knowledge spanning server, storage and network technology. They will be data-driven and focus on the highest impact they can create as part of a world-class engineering team. This is an opportunity to join our team and help us build some of the world’s most open, efficient and advanced AI platforms.
- Collaborate with Hardware Engineering, Firmware and Software Engineering and company-wide AI Infrastructure teams to develop the solution roadmaps and specifications for our AI hardware systems.
- Work hands-on with cross-functional partners to architect and integrate AI system solutions. This includes integrating systems, achieving stability, performance and power requirements, and driving defects to resolution with our external supply chain and manufacturing partners and internal teams.
- Influence the direction of the landscape of AI systems through development and collaboration with internal and open source firmware and hardware communities.
- Perform in-depth analysis and modeling of Meta hardware infrastructure, determine deficiencies in current solutions for current and future Meta system designs, and develop technical strategies for meeting current and future AI solution requirements.
- Partner with external vendors on AI solutions and drive off-the-shelf system and component roadmaps.
- Coordinate Incident Response activities for critical hardware infrastructure systems.
- Partner with Data Center Site Operations teams and NPI tooling teams to understand installation, operation and maintenance considerations within Meta data centers and incorporate feedback into future hardware designs.
- Collaborate closely with software and hardware sub-system subject matter experts to bring disparate technologies together to produce highly efficient, reliable and secure solutions.
- Work as part of the ASAP team to design, develop, test and deploy AI solutions for Meta AI infrastructure spanning ASIC, accelerated and general purpose compute, storage and networking.
- BS or MS in Computer Science, Computer Engineering, Electrical Engineering or a related technical discipline or equivalent experience.
- 9+ years of industry experience in Hardware Systems Engineering.
- Expert level knowledge in AI Systems technology, including demonstrable depth in at least two areas out of the following technology domains: Compute Systems, Storage Systems, Networking Systems.
- Knowledge of hardware solutions spanning multiple technology domains and subsystems, and experience with complex, multi-subsystem system-level troubleshooting, system performance analysis and optimization practices, including the ability to dive into software, firmware and hardware problems (e.g. debug wherever the problem leads and have the confidence to engage cross-functional partners to support issue resolution).
- Ability to quickly learn new hardware technologies, protocols and frameworks, and to understand firmware and software concerns/requirements.
- Experience with architecture of disaggregated systems at scale.
- Proven attention to detail, with careful and balanced rapid execution in a fast-paced environment.
- Experience with CPLD, FPGA and/or ASIC development, definition of silicon-level security features and integration of complex logic with firmware.
- Experience in the specification, development and productionization of GPU and/or domain-specific accelerator-based ML systems.
- Knowledge of Compute systems and memory buses (DDR4, DDR5, HBM, UPI/QPI, etc.).
- Understanding of storage service types and presentation layers (BLOB, Block, File, etc.).
- Knowledge of typical system IO and management buses (PCIe, CXL, I2C/SMBus, LPC, NVMe, NVMeoF, SATA, SAS, etc.).
- Experience with typical high-performance computing networking/connectivity technologies: RDMA, InfiniBand, NVLink, CXL, Ethernet, IPv4/v6, etc.
- Experience in firmware development and debugging, along with knowledge of system firmware development and system firmware configuration (BIOS, EFI Drivers, coreboot, etc.).
- Domain expertise in Platform Security technologies such as UEFI Secure Boot, Measured Boot, Intel TXT and SGX-TEM, ARM TrustZone, etc., with in-depth understanding of Platform Security standards and specifications such as TCG, OPAL, NIST 800-193, PKCS and X.509.
- Familiarity with Linux operating system internals (e.g. kernel dev, tracing, profiling, scheduling, IO subsystems), x86-based server hardware, storage, networking and IO stacks, and large-scale infrastructure automation.