This job has closed.

NVIDIA · 3 days ago

Senior ML Platform Engineer - Lepton

Durham, NC

Full-time

Hybrid

Senior Level, Lead/Staff

$184K/yr - $288K/yr

8+ years exp

NVIDIA is at the forefront of innovations in Artificial Intelligence, High-Performance Computing, and Visualization. They are seeking a Senior ML Platform Engineer to architect, build, and scale high-performance ML infrastructure using modern Infrastructure-as-Code practices, enabling scientists and engineers to train and deploy advanced ML models.

AI Infrastructure, Artificial Intelligence (AI), Consumer Electronics, Foundational AI, GPU, Hardware, Software, Virtual Reality

Growth Opportunities

H1B Sponsor Likely

Responsibilities

Design, build, and maintain our core ML platform infrastructure as code, primarily using Ansible and Terraform, ensuring reproducibility and scalability across large-scale, distributed GPU clusters

Apply SRE principles to diagnose, troubleshoot, and resolve complex system issues across the entire stack, ensuring high availability and performance for critical AI workloads

Develop robust internal automation and tooling for ML workflow orchestration, resource scheduling, and platform operations, with a strong focus on software engineering best practices

Collaborate with ML researchers and applied scientists to understand infrastructure needs and build solutions that streamline their end-to-end experimentation

Evolve and operate our multi-cloud and hybrid (on-prem + cloud) environments, implementing monitoring, alerting, and incident response protocols

Participate in on-call rotation to provide support for platform services and infrastructure running critical ML jobs, driving root cause analysis and implementing preventative measures

Write high-quality, maintainable code (Python, Go) to contribute to the core orchestration platform and automate manual processes

Drive the adoption of modern GPU technologies (e.g., GB200, NVLink) and ensure smooth integration of next-generation hardware into ML pipelines

Qualifications

Infrastructure-as-Code, ML infrastructure, SRE principles, Python, Ansible, Terraform, Kubernetes, Docker, Linux systems, Go, CI/CD methodologies, GitOps practices, Distributed training, ML workflows

Required

BS/MS in Computer Science, Engineering, or equivalent experience

8+ years in software/platform engineering or SRE roles, including 3+ years focused on ML infrastructure or distributed compute systems

Strong proficiency in Infrastructure-as-Code (IaC) tools, specifically Ansible and Terraform, with a proven track record of building and managing production infrastructure

SRE-oriented mindset with extensive experience in diagnosing system-level issues, performance tuning, and ensuring platform reliability

Solid understanding of ML workflows and lifecycle—from data preprocessing to deployment

Proficiency in operating containerized workloads with Kubernetes and Docker

Strong software engineering skills in languages such as Python or Go, with a focus on automation, tooling, and writing production-grade code

Experience with Linux systems internals, networking, and performance tuning at scale

Preferred

Experience building or operating ML platforms supporting frameworks like PyTorch or TensorFlow at scale

Deep understanding of distributed training techniques (e.g., data/model parallelism, Horovod, NCCL)

Expertise with modern CI/CD methodologies and GitOps practices

Passion for building developer-centric platforms with great UX and strong operational reliability

Proven ability to contribute code to complex orchestration or automation platforms

Benefits

Equity

Benefits

Company

NVIDIA

Glassdoor: 4.6

NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.

Founded in 1993

Santa Clara, California, USA

10,001+ employees

https://www.nvidia.com

H1B Sponsorship

NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)

Distribution of Different Job Fields Receiving Sponsorship (chart not shown)

Trends of Total Sponsorships

2025 (1877)

2024 (1355)

2023 (976)

2022 (835)

2021 (601)

2020 (529)

Funding

Current Stage

Public Company

Total Funding

$4.09B

Key Investors

ARPA-E, ARK Investment Management, SoftBank Vision Fund

2023-05-09 · Grant · $5M

2022-08-09 · Post-IPO Equity · $65M

2021-02-18 · Post-IPO Equity

Leadership Team

Jensen Huang

Founder and CEO

Michael Kagan

Chief Technology Officer

Recent News

SiliconANGLE

Red Hat pledges day-zero support for Nvidia’s newest GPUs

2026-01-11

PitchBook

Map: The VCs benefiting from Nvidia’s M&A

2026-01-11

Digital News Asia

Republic Polytechnic accelerates AI transformation to develop future-ready learners and an AI-adept workforce

2026-01-11

Company data provided by Crunchbase