Job description
As a member of this team, the successful candidate will:
- Build features for our on-device inference stack to support the most relevant accuracy-preserving, general-purpose techniques that empower model developers to compress and accelerate SoTA models (e.g., LLMs) in apps
- Convert models from a high-level ML framework to a target device (CPU, GPU, Neural Engine) for optimal functional accuracy and performance
- Write unit and system integration tests to ensure functional correctness and avoid performance regressions
- Diagnose performance bottlenecks and work with HW Arch teams to co-design solutions that further improve the latency, power, and memory footprint of neural network workloads
- Analyze the impact of model optimizations (compression, quantization, etc.) on model quality by partnering with modeling and adaptation teams across diverse product use cases.