Pragmatike
Staff / Principal ML Ops Engineer
Pragmatike, founded by MIT CSAIL researchers, is a fast-growing AI startup recognized as a Top 10 GenAI company by GTM Capital. The company is seeking a Staff / Principal ML Ops Engineer to lead the design, implementation, and scaling of its ML infrastructure and production AI systems, collaborating across teams to keep those systems robust and efficient.
Information Technology · Recruiting · Software
Responsibilities
Architect, build, and scale the end-to-end ML Ops pipeline, including training, fine-tuning, evaluation, rollout, and monitoring
Design reliable infrastructure for model deployment, versioning, reproducibility, and orchestration across cloud and on-prem GPU clusters
Optimize compute usage across distributed systems (Kubernetes, autoscaling, caching, GPU allocation, checkpointing workflows)
Lead the implementation of observability for ML systems (monitor drift, performance, throughput, reliability, cost)
Build automated workflows for dataset curation, labeling, feature pipelines, evaluation, and CI/CD for ML models
Collaborate with researchers to productionize models and accelerate training/inference pipelines
Establish ML Ops best practices, internal standards, and cross-team tooling
Mentor engineers and influence architectural direction across the entire AI platform
Qualifications
Required
Deep hands-on experience designing and operating production ML systems at scale (Staff/Principal-level expected)
Strong background in ML Ops, distributed systems, and cloud infrastructure (AWS, GCP, or Azure)
Proficiency with Python and familiarity with TypeScript or Go for platform integration
Expertise in ML frameworks (PyTorch, Transformers, vLLM, LLaMA-Factory, Megatron-LM) and a practical understanding of CUDA / GPU acceleration
Strong experience with containerization and orchestration (Docker, Kubernetes, Helm, autoscaling)
Deep understanding of ML lifecycle workflows: training, fine-tuning, evaluation, inference, model registries
Ability to lead technical strategy, collaborate cross-functionally, and operate in fast-paced environments
Preferred
Experience deploying and operating LLMs and generative models in production at enterprise scale
Familiarity with DevOps, CI/CD, automated deployment pipelines, and infrastructure-as-code
Experience optimizing GPU clusters, scheduling, and distributed training frameworks
Prior startup experience or comfort operating with ambiguity and high ownership
Experience working with data engineering, feature pipelines, or real-time ML systems
Benefits
Competitive salary & equity options
Sign-on bonus
Health, Dental, and Vision
401(k)