Staff ML Platform Engineer – Large Scale Training (LLMOps/MLOps) jobs in United States

TrueFoundry · 7 months ago

Staff ML Platform Engineer – Large Scale Training (LLMOps/MLOps)

TrueFoundry is an enterprise platform that helps teams build, deploy, and manage large language model applications at scale. The Staff ML Platform Engineer will be responsible for building and optimizing infrastructure for training and deploying large-scale ML models, ensuring reliability and performance in production environments.

Artificial Intelligence (AI), DevOps, Machine Learning
H1B Sponsor Likely

Responsibilities

Write clean, modular, and scalable Python code, with a strong emphasis on reliability and performance
Build a platform for training and fine-tuning large-scale ML models across multi-GPU, multi-node clusters with PyTorch, Kubeflow, and other orchestration tools
Own the infrastructure and code that enable high-throughput, low-latency inference pipelines for state-of-the-art models
Build a platform for developing, deploying, and evaluating agentic applications for our end customers
Help shape internal standards and best practices across the engineering team for high-scale ML workloads

Qualifications

ML Systems Engineering, Deep Learning, PyTorch, Kubernetes, Multi-GPU Training, Multi-Node Training, Kubeflow, Inference Engines, vLLM, TensorRT, Open-source LLM Training, Pragmatic Mindset

Required

5+ years of hands-on experience building and deploying ML systems at scale
5+ years of writing production-quality, high-performance code
Deep experience with multi-GPU/multi-node training, ideally with PyTorch as your primary framework
Experience working with torch, high-level ML frameworks, and inference engines (vLLM or TensorRT)
A pragmatic mindset—you know when to optimize and when to ship

Preferred

Experience with Kubernetes is highly preferred; exposure to Kubernetes-native tools is a huge plus
Bonus: Familiarity with open-source LLM training/fine-tuning

Benefits

Flexible hours
Learning credits

Company

TrueFoundry

TrueFoundry is a unified platform with an enterprise-grade AI Gateway that combines LLM, MCP, and Agent Gateways.

H1B Sponsorship

TrueFoundry has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship, with the job field similar to this job highlighted]
Trends of Total Sponsorships
2025 (3)
2024 (1)
2021 (1)

Funding

Current Stage: Early Stage
Total Funding: $21.3M
Key Investors: Intel Capital
2025-02-06 · Series A · $19M
2022-09-19 · Seed · $2.3M

Leadership Team

Nikunj Bajaj, Co-Founder & CEO
Abhishek Choudhary, Co-Founder
Company data provided by Crunchbase