Sr ML Ops Engineer jobs in United States

Lucasfilm · 4 months ago

Sr ML Ops Engineer

Lucasfilm is seeking a highly skilled Sr ML Ops Engineer to build and maintain the infrastructure powering its machine learning and AI frameworks. This role will support the development of transformative audio solutions for speech processing and other media production workflows, ensuring reliable operation of AI solutions at scale.

Film · Software · TV Production · Video

Responsibilities

Develop, deploy, and maintain scalable infrastructure for machine learning model training, retraining, and inference
Design and optimize CI/CD pipelines specifically tailored for machine learning workflows, ensuring efficient delivery from research to production
Implement robust monitoring and logging systems to track model performance and identify potential issues in production environments
Collaborate with AI researchers and data scientists to ensure infrastructure aligns with project requirements and supports iterative experimentation
Manage compute resources (cloud and on-premises) to enable large-scale distributed training and inference tasks
Containerize machine learning models and applications using Docker and deploy them via Kubernetes or equivalent orchestration systems
Automate deployment workflows for serving ML models using frameworks such as TorchServe, TensorFlow Serving, and FastAPI
Implement model versioning, rollback strategies, and governance for maintaining production stability
Optimize cost efficiency and performance of machine learning workflows in cloud environments such as AWS, GCP, or Azure
Stay updated with emerging ML Ops tools and practices, integrating them into existing workflows to improve performance and reliability

Qualifications

ML Ops · CI/CD pipelines · Containerization (Docker) · Cloud infrastructure (AWS) · Cloud infrastructure (GCP) · Cloud infrastructure (Azure) · Kubernetes · Model deployment frameworks · Distributed training workflows · Monitoring tools · Scripting (Python) · Scripting (Bash) · Scripting (Go) · Security best practices · Data orchestration tools · Hyperparameter tuning · Open-source contributions

Required

Bachelor's in Computer Science, Engineering, or a related field
5+ years of experience in DevOps, Site Reliability Engineering, or a related role, with at least 2 years focused on ML Ops
Expertise in building and maintaining CI/CD pipelines for machine learning applications
Strong proficiency with containerization (Docker) and orchestration tools (Kubernetes)
Proficiency in deploying machine learning models using frameworks such as TensorFlow Serving, TorchServe, or custom APIs
Deep understanding of cloud infrastructure and services (AWS, GCP, or Azure) for ML workloads, including GPU and TPU utilization
Experience managing large-scale distributed training workflows and optimizing resource allocation
Familiarity with tools like MLflow, DVC, Weights & Biases, or similar for data and model tracking and versioning
Solid understanding of security best practices for machine learning systems and sensitive data handling
Strong scripting and programming skills in Python, Bash, or Go

Preferred

Experience with data orchestration tools such as DataChain or Weights & Biases for managing ML workflows
Hands-on experience with automated hyperparameter tuning and optimization frameworks
Familiarity with model monitoring tools like Prometheus, Grafana, or custom solutions for model drift and data quality checks
Experience integrating pre-trained foundation models and managing their deployment at scale
Contributions to open-source ML Ops projects or relevant research publications

Benefits

A bonus and/or long-term incentive units may be provided as part of the compensation package, in addition to the full range of medical, financial, and/or other benefits, dependent on the level and position offered.

Company

Lucasfilm

Lucasfilm produces original content, postproduction effects, and audio for external clients, licensed products, and the gaming industry. It is a sub-organization of The Walt Disney Company.

Funding

Current Stage: Late Stage
Total Funding: unknown
2012-10-30: Acquired

Leadership Team

Rob Bredow
SVP, Creative Innovation
Company data provided by Crunchbase