Staff / Principal ML Ops Engineer jobs in United States

Pragmatike · 1 week ago

Staff / Principal ML Ops Engineer

Pragmatike is a fast-growing AI startup recognized as a Top 10 GenAI company by GTM Capital, founded by MIT CSAIL researchers. They are seeking a Staff / Principal ML Ops Engineer to lead the design, implementation, and scaling of the company's ML infrastructure and production AI systems, collaborating closely with various teams to ensure robustness and efficiency.

Information Technology · Recruiting · Software

Responsibilities

Architect, build, and scale the end-to-end ML Ops pipeline, including training, fine-tuning, evaluation, rollout, and monitoring
Design reliable infrastructure for model deployment, versioning, reproducibility, and orchestration across cloud and on-prem GPU clusters
Optimize compute usage across distributed systems (Kubernetes, autoscaling, caching, GPU allocation, checkpointing workflows)
Lead the implementation of observability for ML systems (monitor drift, performance, throughput, reliability, cost)
Build automated workflows for dataset curation, labeling, feature pipelines, evaluation, and CI/CD for ML models
Collaborate with researchers to productionize models and accelerate training/inference pipelines
Establish ML Ops best practices, internal standards, and cross-team tooling
Mentor engineers and influence architectural direction across the entire AI platform

Qualifications

ML Ops · Distributed systems · Cloud infrastructure · Python · Containerization · ML frameworks · Technical strategy · Collaboration · Mentoring · Fast-paced environments

Required

Deep hands-on experience designing and operating production ML systems at scale (Staff/Principal-level expected)
Strong background in ML Ops, distributed systems, and cloud infrastructure (AWS, GCP, or Azure)
Proficiency with Python and familiarity with TypeScript or Go for platform integration
Expertise in ML frameworks and tooling (PyTorch, Transformers, vLLM, LLaMA-Factory, Megatron-LM), with a practical understanding of CUDA / GPU acceleration
Strong experience with containerization and orchestration (Docker, Kubernetes, Helm, autoscaling)
Deep understanding of ML lifecycle workflows: training, fine-tuning, evaluation, inference, model registries
Ability to lead technical strategy, collaborate cross-functionally, and operate in fast-paced environments

Preferred

Experience deploying and operating LLMs and generative models in production at enterprise scale
Familiarity with DevOps, CI/CD, automated deployment pipelines, and infrastructure-as-code
Experience optimizing GPU clusters, scheduling, and distributed training frameworks
Prior startup experience or comfort operating with ambiguity and high ownership
Experience working with data engineering, feature pipelines, or real-time ML systems

Benefits

Competitive salary & equity options
Sign-on bonus
Health, Dental, and Vision
401k

Company

Pragmatike

Pragmatike is a remote tech job platform that helps companies hire tech talent.

Funding

Current Stage
Early Stage

Leadership Team

Grégoire Clément
Co-Founder & CEO
Company data provided by Crunchbase