TrueFoundry · 7 months ago
Staff ML Platform Engineer – Large Scale Training (LLMOps/MLOps)
TrueFoundry is redefining how ML teams train, deploy, and scale their models through its LLMOps and MLOps platform. The company is seeking a Staff ML Platform Engineer to write scalable Python code, build platforms for training large-scale ML models, and optimize high-throughput inference pipelines.
Artificial Intelligence (AI) · DevOps · Machine Learning
Responsibilities
Write clean, modular, and scalable Python code, with a strong emphasis on reliability and performance
Build a platform for training and fine-tuning large-scale ML models across multi-GPU, multi-node clusters with PyTorch, Kubeflow, and other orchestration tools
Own the infrastructure and code that enables high-throughput, low-latency inference pipelines for state-of-the-art models
Build a platform for developing, deploying, and evaluating agentic applications for our end customers
Help shape internal standards and best practices across the engineering team for high-scale ML workloads
Qualifications
Required
5+ years of hands-on experience building and deploying ML systems at scale
5+ years of writing production-quality, high-performance code
Deep experience with multi-GPU/multi-node training, ideally with PyTorch as your primary framework
Experience working with torch, high-level ML frameworks, and inference engines (vLLM or TensorRT)
Experience with Kubernetes is highly preferred; exposure to Kubernetes-native tools is a huge plus
A pragmatic mindset—you know when to optimize and when to ship
Preferred
Familiarity with open-source LLM training/fine-tuning
Benefits
Flexible hours
Learning credits
Company
TrueFoundry
TrueFoundry is a unified platform with an enterprise-grade AI Gateway - combining LLM, MCP, and Agent Gateway.
H1B Sponsorship
TrueFoundry has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart not shown)
Trends of Total Sponsorships
2025 (3)
2024 (1)
2021 (1)
Funding
Current Stage: Early Stage
Total Funding: $21.3M
Key Investors: Intel Capital
2025-02-06 · Series A · $19M
2022-09-19 · Seed · $2.3M
Recent News
2025-12-10
Metrovacesa launches MiA, a conversational virtual agent based on generative AI | CIO
2025-09-25
Company data provided by Crunchbase