MBZUAI (Mohamed bin Zayed University of Artificial Intelligence) · 6 months ago

Machine Learning Infrastructure Engineer

The Institute of Foundation Models is a dedicated research lab focused on foundation models and AI development. It is seeking a Senior Machine Learning Infrastructure Engineer to extend and scale training systems, working closely with researchers and engineers to develop innovative AI solutions.

Artificial Intelligence (AI) · Higher Education · Universities

Responsibilities

Extend or modify distributed training frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod) to support new use cases and architectures
Translate mathematical optimizer specs into distributed implementations
Build robust config and launch systems across multi-node, multi-GPU clusters, including creating and debugging launch scripts with flexible batch sizes, parallelism strategies, and hardware targets
Own experiment tracking, metrics logging, and job monitoring, building systems that are usable by collaborators and researchers and provide external visibility
Improve training system reliability, maintainability, and performance
Write production-quality code and tests for ML infra in PyTorch or JAX; ensure reliability and maintainability at scale

Qualifications

Distributed ML frameworks · Multi-node experience · Software engineering fundamentals · Implementing algorithms · Large-scale ML workloads · Mixed-precision training · Performance profiling · Open-source contributions · CUDA experience · Custom training pipelines · Training infrastructure tuning

Required

5+ years of experience in ML systems, infra, or distributed training
Experience modifying distributed ML frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)
Strong software engineering fundamentals (Python, systems design, testing)
Proven multi-node experience (e.g., Slurm, Kubernetes, Ray) and debugging skills (e.g., NCCL/Gloo)
Ability to implement algorithms across GPUs/nodes based on mathematical specs
Experience working on an ML platform/infrastructure team and/or a distributed inference optimization team
Experience with large-scale machine learning workloads (strong ML fundamentals)

Preferred

Exposure to mixed-precision training (e.g., bf16, fp8) with accuracy validation
Familiarity with performance profiling, kernel fusion, or memory optimization
Open-source contributions or published research (MLSys, ICML, NeurIPS)
CUDA or Triton kernel experience
Experience with large-scale pre-training
Experience building custom training pipelines at scale and modifying them for custom needs
Deep familiarity with training infrastructure and performance tuning

Benefits

Comprehensive medical, dental, and vision
401(k) program
Generous PTO, sick leave, and holidays
Paid parental leave and family-friendly benefits
On-site amenities and perks: Complimentary lunch, gym access, and a short walk to the Sunnyvale Caltrain station

Company

MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)

Official account of Mohamed bin Zayed University of Artificial Intelligence. Dedicated to research, innovation, and empowering brilliant minds in AI.

Funding

Current Stage
Growth Stage
Total Funding
$0.04M
Key Investors
Llama
2024-09-24 · Grant · $0.04M

Leadership Team

Eric Xing
President
Company data provided by Crunchbase