TrueFoundry · 7 months ago
Staff ML Platform Engineer – Large Scale Training (LLMOps/MLOps)
TrueFoundry is an enterprise platform that helps teams build, deploy, and manage large language model applications at scale. The Staff ML Platform Engineer will be responsible for building and optimizing infrastructure for training and deploying large-scale ML models, ensuring reliability and performance in production environments.
Artificial Intelligence (AI) · DevOps · Machine Learning
Responsibilities
Write clean, modular, and scalable Python code, with a strong emphasis on reliability and performance
Build a platform for training and fine-tuning large-scale ML models across multi-GPU, multi-node clusters with PyTorch, Kubeflow, and other orchestration tools
Own the infrastructure and code that enables high-throughput, low-latency inference pipelines for state-of-the-art models
Build a platform for developing, deploying, and evaluating agentic applications for our end customers
Help shape internal standards and best practices across the engineering team for high-scale ML workloads
Qualifications
Required
5+ years of hands-on experience building and deploying ML systems at scale
5+ years of writing production-quality, high-performance code
Deep experience with multi-GPU/multi-node training, ideally with PyTorch as your primary framework
Experience working with torch, high-level ML frameworks, and inference engines (vLLM or TensorRT)
A pragmatic mindset—you know when to optimize and when to ship
Preferred
Experience with Kubernetes is highly preferred; exposure to Kubernetes-native tools is a huge plus
Bonus: Familiarity with open-source LLM training/fine-tuning
Benefits
Flexible hours
Learning credits
Company
TrueFoundry
TrueFoundry is a unified platform with an enterprise-grade AI Gateway that combines LLM, MCP, and Agent Gateways.
H1B Sponsorship
TrueFoundry has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (3)
2024 (1)
2021 (1)
Funding
Current Stage: Early Stage
Total Funding: $21.3M
Key Investors: Intel Capital
2025-02-06 · Series A · $19M
2022-09-19 · Seed · $2.3M
Recent News
2025-12-10
Metrovacesa launches MiA, a conversational virtual agent based on generative AI | CIO
Company data provided by Crunchbase