Pragmatike
Staff / Principal ML Ops Engineer
Pragmatike, founded by MIT CSAIL researchers, is a fast-growing AI startup recognized as a Top 10 GenAI company by GTM Capital. The company is seeking a Staff / Principal ML Ops Engineer to lead the design, implementation, and scaling of its ML infrastructure and production AI systems, collaborating across teams to keep those systems robust and efficient.
Information Technology · Recruiting · Software
Responsibilities
Architect, build, and scale the end-to-end ML Ops pipeline, including training, fine-tuning, evaluation, rollout, and monitoring
Design reliable infrastructure for model deployment, versioning, reproducibility, and orchestration across cloud and on-prem GPU clusters
Optimize compute usage across distributed systems (Kubernetes, autoscaling, caching, GPU allocation, checkpointing workflows)
Lead the implementation of observability for ML systems (monitor drift, performance, throughput, reliability, cost)
Build automated workflows for dataset curation, labeling, feature pipelines, evaluation, and CI/CD for ML models
Collaborate with researchers to productionize models and accelerate training/inference pipelines
Establish ML Ops best practices, internal standards, and cross-team tooling
Mentor engineers and influence architectural direction across the entire AI platform
Qualifications
Required
Deep hands-on experience designing and operating production ML systems at scale (Staff/Principal-level expected)
Strong background in ML Ops, distributed systems, and cloud infrastructure (AWS, GCP, or Azure)
Proficiency with Python and familiarity with TypeScript or Go for platform integration
Expertise in ML frameworks (PyTorch, Transformers, vLLM, LLaMA-Factory, Megatron-LM) and a practical understanding of CUDA / GPU acceleration
Strong experience with containerization and orchestration (Docker, Kubernetes, Helm, autoscaling)
Deep understanding of ML lifecycle workflows: training, fine-tuning, evaluation, inference, model registries
Ability to lead technical strategy, collaborate cross-functionally, and operate in fast-paced environments
Preferred
Experience deploying and operating LLMs and generative models in production at enterprise scale
Familiarity with DevOps, CI/CD, automated deployment pipelines, and infrastructure-as-code
Experience optimizing GPU clusters, scheduling, and distributed training frameworks
Prior startup experience or comfort operating with ambiguity and high ownership
Experience working with data engineering, feature pipelines, or real-time ML systems
Benefits
Competitive salary & equity options
Sign-on bonus
Health, Dental, and Vision
401(k)