
CoreWeave · 2 months ago

Director Engineering, AI/ML

CoreWeave is the AI Hyperscaler™, delivering a cloud platform of cutting-edge services for AI. The Director of Engineering will lead the development of a next-generation Large Scale Training Platform, overseeing a world-class engineering team to enhance GPU training services and ensure operational excellence.

AI Infrastructure · Artificial Intelligence (AI) · Cloud Computing · Cloud Infrastructure · Information Technology · Machine Learning
No H1B · U.S. Citizen Only

Responsibilities

Define and continuously refine the end-to-end Training Platform roadmap, prioritizing scalability, throughput, and cost optimization for training the largest AI models
Set technical standards for distributed training frameworks, model/data parallelism strategies, mixed-precision techniques (FP8, BF16), and advanced checkpointing/restart mechanisms (see the sketch after this list)
Design and implement a Kubernetes-native training control plane capable of managing multi-thousand GPU jobs with high reliability and efficiency
Build solutions for elastic distributed training, including job-aware autoscaling, dynamic GPU allocation, and multi-node communication optimizations using NCCL, SHARP, and RDMA
Integrate data pipeline optimizations, such as caching layers, streaming datasets, and sharded data loading to eliminate I/O bottlenecks
Implement state-of-the-art distributed training optimizations—including tensor/sequence parallelism, pipeline parallelism, optimizer sharding, activation checkpointing, and gradient compression—to push the limits of performance and scale
Establish SLO/SLA dashboards, real-time observability, and self-healing mechanisms for thousands of concurrent training jobs across multiple regions
Develop cost-performance trade-off tooling that enables customers to seamlessly select hardware configurations that minimize time-to-train while optimizing for cost
Build robust fault tolerance and automatic recovery workflows to handle large-scale preemption, checkpoint failures, or data pipeline interruptions
Hire, mentor, and grow a diverse team of engineers and managers passionate about building the world’s leading AI training platform
Foster a customer-obsessed, metrics-driven engineering culture with crisp design reviews, deep technical rigor, and blameless post-mortems
Partner closely with Product, Orchestration, Networking, and Storage teams to deliver a unified CoreWeave experience
Work directly with flagship customers training frontier models to gather feedback, optimize workflows, and shape the platform roadmap
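
The following is a minimal, illustrative sketch (not CoreWeave's implementation) of the training-loop mechanics several of these responsibilities refer to: an NCCL-backed DistributedDataParallel job launched with torchrun, BF16 autocast, non-reentrant activation checkpointing, and periodic checkpoint save/restore so a job can resume after preemption. The model, step count, and checkpoint path are placeholders chosen for the example.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.checkpoint import checkpoint


class Block(torch.nn.Module):
    """Toy feed-forward block whose activations are recomputed during backward."""
    def __init__(self):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
        )

    def forward(self, x):
        # Activation checkpointing: trade recompute for activation memory.
        return checkpoint(self.ff, x, use_reentrant=False)


def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE; NCCL handles GPU collectives.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(Block().cuda(), device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    ckpt_path = "/checkpoints/latest.pt"  # placeholder path
    step = 0
    if os.path.exists(ckpt_path):
        # Restart path: resume model/optimizer state after preemption or failure.
        state = torch.load(ckpt_path, map_location=f"cuda:{local_rank}")
        model.module.load_state_dict(state["model"])
        opt.load_state_dict(state["opt"])
        step = state["step"]

    while step < 1000:
        x = torch.randn(32, 1024, device="cuda")  # stand-in for a real data loader
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            loss = model(x).float().pow(2).mean()  # dummy loss for illustration
        loss.backward()
        opt.step()
        opt.zero_grad(set_to_none=True)
        step += 1

        # Periodic checkpointing from rank 0 so the job can be restarted cleanly.
        if step % 100 == 0 and dist.get_rank() == 0:
            torch.save(
                {"model": model.module.state_dict(), "opt": opt.state_dict(), "step": step},
                ckpt_path,
            )

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

Launched as, for example, torchrun --nnodes=2 --nproc_per_node=8 train.py. The control-plane work described above would layer job-aware scheduling, elastic scaling, and automatic restart of processes like this on top.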

Qualifications

Large Scale Training Platform · Distributed Training Frameworks · Kubernetes · GPU Resource Allocation · CI/CD Pipelines · Data Pipeline Optimization · Fault Tolerance · Leadership · Communication · Team Collaboration

Required

10+ years building large-scale distributed systems or HPC/cloud services, with 5+ years leading engineering teams
Proven success delivering mission-critical distributed training platforms or large-scale ML pipelines
Deep understanding of GPU/CPU resource allocation, NUMA-aware scheduling, interconnect topologies (NVLink, InfiniBand), and large-scale data handling
Experience with data, model, tensor, and pipeline parallelism, and advanced optimizer techniques for training massive models
Expertise in Kubernetes, service meshes, and CI/CD pipelines for ML workloads; familiarity with Slurm, Ray, or similar orchestration systems is a plus
Hands-on experience with PyTorch, DeepSpeed, Megatron-LM, or other large-scale training frameworks
Background in scaling pretraining of LLMs or multimodal models to thousands of GPUs
Excellent communicator capable of translating complex engineering trade-offs into clear business outcomes
Bachelor's or Master's degree in CS, EE, or related field (or equivalent practical experience)

Preferred

Experience operating multi-region training clusters for hyperscalers or large AI labs
Familiarity with open-source ML training frameworks (e.g., DeepSpeed, FSDP, Alpa, MosaicML)
Familiarity with observability stacks (Prometheus, Grafana, OpenTelemetry) and training-specific telemetry

Benefits

Medical, dental, and vision insurance - 100% paid for by CoreWeave
Company-paid Life Insurance
Voluntary supplemental life insurance
Short and long-term disability insurance
Flexible Spending Account
Health Savings Account
Tuition Reimbursement
Ability to Participate in Employee Stock Purchase Program (ESPP)
Mental Wellness Benefits through Spring Health
Family-Forming support provided by Carrot
Paid Parental Leave
Flexible, full-service childcare support with Kinside
401(k) with a generous employer match
Flexible PTO
Catered lunch each day in our office and data center locations
A casual work environment
A work culture focused on innovative disruption

Company

CoreWeave

CoreWeave is a cloud-based AI infrastructure company offering GPU cloud services to simplify AI and machine learning workloads.

Funding

Current Stage: Public Company
Total Funding: $23.37B
Key Investors: Jane Street Capital, Stack Capital, Coatue

2025-12-08 · Post-IPO Debt · $2.54B
2025-11-12 · Post-IPO Debt · $1B
2025-08-20 · Post-IPO Secondary

Leadership Team

Michael Intrator
Chief Executive Officer
Nitin Agrawal
Chief Financial Officer
Company data provided by Crunchbase