Cubiq Recruitment · 4 weeks ago

Staff ML Infrastructure Engineer

San Francisco Bay Area

Full-time

Onsite

Lead/Staff

Cubiq Recruitment is hiring a Staff ML Infrastructure Engineer for a company building one of the world’s leading generative video and multimodal AI platforms. The role covers designing and evolving infrastructure for large-scale generative video and multimodal model training, ensuring production reliability, and developing end-to-end CI/CD pipelines for machine learning.

Artificial Intelligence (AI) · Automotive · Autonomous Vehicles · Emerging Markets · Energy · Health Care · Machine Learning · Manufacturing · Renewable Energy · Software

Hiring Manager

Jack Cartlidge

Responsibilities

Core ML Platform Architecture: Design and evolve the infrastructure that supports large-scale generative video and multimodal model training, evaluation, and deployment

High-Throughput Compute Systems: Build and optimize GPU/TPU clusters, distributed training systems, and orchestration layers tailored for video-heavy pipelines

Production Reliability for Generative Models: Create the tooling and services needed to safely push frequent model updates while handling massive compute loads and long-running jobs

End-to-End CI/CD for ML: Lead the development of automated pipelines for model training, validation, artifact management, and production rollout

Multimodal Data Infrastructure: Build systems to ingest, version, transform, and serve large-scale video, audio, and text datasets with high reliability

Internal Developer Experience: Partner with research, product, and applied ML teams to build intuitive internal tooling for experiment tracking, model lineage, and resource scheduling

Technical Leadership: Mentor engineers, set platform standards, and influence long-term architectural direction

Qualifications

Cloud-scale systems · High-performance compute platforms · CI/CD pipelines · Distributed compute · Python · Kubernetes · AWS/GCP/Azure · Technical leadership · Mentoring · Collaboration

Required

Experience architecting and operating large-scale infrastructure at a cloud provider, hyperscaler, or leading AI company

Built or owned mission-critical CI/CD systems, high-capacity compute platforms, or data infrastructure supporting ML teams

Deep experience with distributed compute across GPUs/accelerators, Kubernetes, and cloud infrastructure (AWS/GCP/Azure)

Strong engineering fundamentals in Python, Go, or equivalent languages

Previous exposure to ML training pipelines—especially systems that handle heavy video, multimodal, or high-dimensional data

Demonstrated ability to lead complex cross-org initiatives and drive technical strategy

Preferred

Experience with video processing systems, large-scale media pipelines, or streaming architectures

Familiarity with modern multimodal or video-generation frameworks (PyTorch, JAX, diffusers, custom accelerators)

Experience with Ray, Triton, CUDA optimization, or specialized scheduling for ML workloads

Background working in high-growth AI startups or research-focused environments

Familiarity with security and compliance considerations for models that generate or process user content

Benefits

Over market average

Equity

Highly competitive compensation

Company

Cubiq Recruitment

Cubiq are a talent partner who specialise in niche areas of Technology & Engineering.

Founded in 2010

Manchester, Manchester, GBR

11-50 employees

https://www.cubiqrecruitment.com/

Funding

Current Stage

Early Stage

Recent News

The Business Desk

Tower 12 fully let after five-year deal with recruitment agency ...

2023-12-25

Company data provided by Crunchbase