MBZUAI (Mohamed bin Zayed University of Artificial Intelligence) · 6 months ago

Machine Learning Infrastructure Engineer

Sunnyvale, CA

Full-time

Onsite

Senior Level

$300K/yr - $600K/yr

5+ years exp

The Institute of Foundation Models is a dedicated research lab focused on foundation models and AI development. They are seeking a Senior Machine Learning Infrastructure Engineer to extend and scale training systems, working closely with researchers and engineers to develop innovative AI solutions.

Artificial Intelligence (AI) · Higher Education · Universities

Responsibilities

Extend distributed training frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)

Implement distributed optimizers from mathematical specs

Build robust config + launch systems across multi-node, multi-GPU clusters

Own experiment tracking, metrics logging, and job monitoring for external visibility

Improve training system reliability, maintainability, and performance

Extend or modify training frameworks (e.g., DeepSpeed, FSDP) to support new use cases and architectures

Translate mathematical optimizer specs into distributed implementations

Create and debug multi-node launch scripts with flexible batch sizes, parallelism strategies, and hardware targets

Build systems for experiment tracking, job monitoring, and logging usable by collaborators and researchers

Write production-quality code and tests for ML infra in PyTorch or JAX; ensure reliability and maintainability at scale

Qualifications

Distributed ML frameworks

Multi-node experience

Software engineering fundamentals

Implementing algorithms

Large-scale ML workloads

Mixed-precision training

Performance profiling

Open-source contributions

CUDA experience

Custom training pipelines

Training infrastructure tuning