Senior Machine Learning Engineer - I (MLOps/LLMOps) jobs in United States

Sumo Logic · 3 hours ago

Senior Machine Learning Engineer - I (MLOps/LLMOps)

Sumo Logic, Inc. helps make the digital world secure, fast, and reliable by unifying critical security and operational data through its Intelligent Operations Platform. As a Senior Machine Learning Engineer - MLOps/LLMOps, you will design and build scalable infrastructure for ML and LLM systems, collaborating with teams to operationalize AI/ML solutions from prototype to production.

Analytics · Big Data · Cloud Data Services · Enterprise Software · SaaS
No H1B

Responsibilities

Design and implement scalable MLOps/LLMOps platforms supporting the full ML lifecycle: data versioning, model training, evaluation, deployment, and monitoring
Build and maintain CI/CD pipelines for ML models and LLM applications with automated testing, validation, and rollback capabilities
Develop infrastructure-as-code (IaC) for reproducible, version-controlled ML environments
Architect model serving infrastructure with auto-scaling, A/B testing, and canary deployment capabilities
Build platforms for LLM fine-tuning, prompt management, and experimentation at scale
Implement evaluation frameworks for LLM performance, quality, safety, and cost optimization
Design and deploy enterprise-grade AI agents and copilots with robust monitoring and guardrails
Establish LLM observability: token usage tracking, latency monitoring, prompt/response logging, and cost attribution
Own uptime, reliability, and performance of ML/LLM services (SLIs/SLOs)
Implement comprehensive monitoring, alerting, and incident response for ML systems
Participate in on-call rotations and drive post-incident reviews to improve system resilience
Build automation and tooling to reduce toil and accelerate ML development velocity
Partner with ML Engineers and Data Scientists to translate research into production-ready systems
Collaborate with platform and infrastructure teams on cloud architecture and resource optimization
Mentor team members on MLOps best practices, production ML patterns, and operational excellence
Drive technical decisions on tooling, frameworks, and architectural patterns

Qualifications

MLOps expertise · LLMOps expertise · Cloud experience · Software engineering · Containerization · CI/CD practices · Monitoring tools · Python · Incident management · Mentoring

Required

Education: B.S./M.S./Ph.D. in Computer Science, Engineering, or related technical field
Experience: 4+ years of software engineering experience with 2+ years focused on MLOps/LLMOps
Production experience with ML model serving frameworks (e.g., TensorFlow Serving, TorchServe, Triton)
Hands-on with ML experiment tracking and model registry tools (MLflow, Weights & Biases, Kubeflow)
Proficiency in workflow orchestration (Airflow, Prefect, Kubeflow Pipelines, Metaflow)
Experience with LLM deployment, fine-tuning, and evaluation frameworks (e.g., vLLM, LangChain, LlamaIndex)
Knowledge of prompt engineering, RAG architectures, and LLM application patterns
Familiarity with LLM observability tools (e.g., LangSmith, Arize, WhyLabs)
Strong experience with major cloud providers (AWS, GCP, or Azure) and ML-specific services (SageMaker, Vertex AI, Azure ML, Bedrock)
Proficiency in containerization (Docker, Kubernetes) and infrastructure-as-code (Terraform, CloudFormation, Pulumi)
Experience with microservices architecture and API development (REST, gRPC)
Strong programming skills in Python, Terraform, and Helm; familiarity with Go, Java, or Rust is a plus
Deep understanding of CI/CD practices and tools (GitHub Actions, GitLab CI, Jenkins, ArgoCD)
Experience with monitoring and observability stacks (Prometheus, Grafana, DataDog, ELK)
Track record of managing production systems with defined SLIs/SLOs
Experience with on-call rotations, incident management, and reliability engineering practices

Preferred

Experience building internal ML platforms or developer tooling used by multiple teams
Hands-on with distributed training frameworks (Ray, Horovod, DeepSpeed)
Knowledge of model optimization techniques (quantization, distillation, pruning)
Familiarity with feature stores (Feast, Tecton) and data versioning tools (DVC, LakeFS)
Understanding of ML security best practices, model governance, and compliance requirements
Experience with cost optimization and resource management for large-scale ML workloads
Contributions to open-source MLOps/LLMOps projects
Background in applied ML or data science with practical model development experience

Benefits

Certain roles are eligible to participate in our bonus or commission plans
Benefits offerings
Equity awards

Company

Sumo Logic

Sumo Logic is a provider of cloud-based machine data analytics that enables reliable and secure cloud-native applications.

Funding

Current Stage: Public Company
Total Funding: $340M
Key Investors: Battery Ventures, Sapphire Ventures, DFJ Growth
2023-02-09: Acquired
2020-09-16: IPO
2019-05-08: Series G · $110M

Leadership Team

Stewart Grierson
Chief Financial Officer
Aaron Feigin
Chief Communications & Brand Officer
Company data provided by Crunchbase