Foundation Model DevOps Engineer jobs in United States

MBZUAI (Mohamed bin Zayed University of Artificial Intelligence) · 17 hours ago

Foundation Model DevOps Engineer

The Institute of Foundation Models is a dedicated research lab focused on building and understanding foundation models. The Foundation Model DevOps Engineer will ensure operational stability and build the tooling and infrastructure necessary for AI research, facilitating a friction-free environment for model development.

Artificial Intelligence (AI) · Higher Education · Universities

Responsibilities

You own the standard of our public presence. You ensure that every release (weights, code, training logs, data) is reproducible, meticulously documented, and packaged with the polish of a top-tier open-source product
Design and implement pipelines that automate the testing and packaging of complex model releases, moving us away from manual handovers to automated verification
Administer the organization’s GitHub Enterprise account, ensuring branch protection and clean versioning practices are enforced across the lab
Manage the efficiency of our large-scale GPU resources. You track utilization to identify idle nodes, 'zombie jobs,' or inefficient scheduling, ensuring we extract maximum value from our compute clusters
Manage the lifecycle of petabyte-scale datasets and checkpoint storage. You implement intelligent aging policies to solve the 'disk full' bottleneck without risking critical data loss
Proactively manage storage and compute quotas across research teams to prevent resource contention before it blocks a training run
Build and maintain the internal CLI tools and dashboards that allow researchers to launch, track, and organize jobs across thousands of GPUs
Set up real-time monitoring for interconnect throughput, GPU memory, and file system latency to catch performance degradation instantly
Work closely with infrastructure teams to optimize how we run synthetic data pipelines and large-scale evaluations, ensuring our tooling scales with our compute
Build the scripts and tooling that instantly provision compute environments, permissions, and storage namespaces for researchers (automating away the manual work)
Streamline SSH and node access protocols to ensure friction-free entry to our massive-scale compute clusters while maintaining security boundaries

Qualifications

DevOps · Release Engineering · Foundation Model Fluency · Linux/Unix Fluency · Version Control Admin · Scripting & Automation · HPC Schedulers · Cloud Storage · Problem-solving skills · Team collaboration

Required

A bachelor's degree in Computer Science, Information Technology, or a related field, or equivalent practical experience
3+ years of experience in DevOps, Release Engineering, or machine learning engineering (MLE), specifically within AI/ML or HPC environments
You understand the lifecycle of training large models (LLMs or diffusion models). You know what a checkpoint is, you understand the difference between pre-training and inference, and you are familiar with the artifacts required for a model release
You live in the command line. You have deep expertise in Bash scripting, file system permissions, and SSH configuration
Expert-level administration of GitHub Enterprise (managing teams, API limits, and repository security)
Proficiency in Python or Bash to automate repetitive administrative tasks

Preferred

Experience contributing to or managing high-profile open-source releases (Hugging Face libraries, model families, datasets)
Deep understanding of Slurm job scheduling and troubleshooting
Familiarity with cloud storage buckets (S3/GCS) and efficient data transfer tools

Benefits

Comprehensive medical, dental, and vision benefits
Bonus
401(k) Plan
Generous paid time off, sick leave and holidays
Paid Parental Leave
Employee Assistance Program
Life insurance and disability

Company

MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)

Official account of Mohamed bin Zayed University of Artificial Intelligence. Dedicated to research, innovation, and empowering brilliant minds in AI.

Funding

Current Stage: Growth Stage
Total Funding: $0.04M
Key Investors: Llama
2024-09-24 · Grant · $0.04M

Leadership Team

Eric Xing
President
Company data provided by Crunchbase