CoreWeave
Director Engineering, AI/ML
CoreWeave is the AI Hyperscaler™, delivering a cloud platform of cutting-edge services for AI. The Director of Engineering will lead the development of a next-generation Large Scale Training Platform, overseeing a world-class engineering team to enhance GPU training services and ensure operational excellence.
AI Infrastructure · Artificial Intelligence (AI) · Cloud Computing · Cloud Infrastructure · Information Technology · Machine Learning
Responsibilities
Define and continuously refine the end-to-end Training Platform roadmap, prioritizing scalability, throughput, and cost optimization for training the largest AI models
Set technical standards for distributed training frameworks, model/data parallelism strategies, mixed-precision techniques (FP8, BF16), and advanced checkpointing/restart mechanisms
Design and implement a Kubernetes-native training control plane capable of managing multi-thousand GPU jobs with high reliability and efficiency
Build solutions for elastic distributed training, including job-aware autoscaling, dynamic GPU allocation, and multi-node communication optimizations using NCCL, SHARP, and RDMA
Integrate data pipeline optimizations, such as caching layers, streaming datasets, and sharded data loading to eliminate I/O bottlenecks
Implement state-of-the-art distributed training optimizations—including tensor/sequence parallelism, pipeline parallelism, optimizer sharding, activation checkpointing, and gradient compression—to push the limits of performance and scale
Establish SLO/SLA dashboards, real-time observability, and self-healing mechanisms for thousands of concurrent training jobs across multiple regions
Develop cost-performance trade-off tooling that enables customers to seamlessly select hardware configurations that minimize time-to-train while optimizing for cost (a back-of-envelope sketch of this trade-off follows this list)
Build robust fault tolerance and automatic recovery workflows to handle large-scale preemption, checkpoint failures, or data pipeline interruptions (a minimal checkpoint/resume sketch follows this list)
Hire, mentor, and grow a diverse team of engineers and managers passionate about building the world’s leading AI training platform
Foster a customer-obsessed, metrics-driven engineering culture with crisp design reviews, deep technical rigor, and blameless post-mortems
Partner closely with Product, Orchestration, Networking, and Storage teams to deliver a unified CoreWeave experience
Work directly with flagship customers training frontier models to gather feedback, optimize workflows, and shape the platform roadmap
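To make the checkpoint/restart and mixed-precision responsibilities above concrete, here is a minimal, hypothetical PyTorch sketch, not CoreWeave platform code: BF16 autocast, activation checkpointing via torch.utils.checkpoint, and a periodic save plus resume-on-restart loop. The model, the local checkpoint path, and every hyperparameter are assumptions chosen for illustration.

```python
# Hypothetical sketch: BF16 autocast, activation checkpointing, and periodic
# checkpoint/resume. The model, "ckpt.pt" path, and hyperparameters are illustrative.
import os
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

CKPT_PATH = "ckpt.pt"  # assumed local path; a real platform would use durable object storage

class Block(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # Activation checkpointing: recompute the sub-module in backward instead of
        # storing its activations, trading compute for memory.
        return x + checkpoint(self.ff, x, use_reentrant=False)

def main():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Sequential(*[Block() for _ in range(4)]).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    step = 0

    # Resume from the last checkpoint if one exists (e.g. after preemption).
    if os.path.exists(CKPT_PATH):
        state = torch.load(CKPT_PATH, map_location=device)
        model.load_state_dict(state["model"])
        opt.load_state_dict(state["opt"])
        step = state["step"]

    while step < 100:
        x = torch.randn(8, 256, device=device)  # stand-in for a real sharded data loader
        with torch.autocast(device_type=device, dtype=torch.bfloat16):
            loss = model(x).pow(2).mean()
        loss.backward()
        opt.step()
        opt.zero_grad(set_to_none=True)
        step += 1

        if step % 20 == 0:  # periodic checkpoint so a restart loses at most 20 steps
            torch.save({"model": model.state_dict(), "opt": opt.state_dict(), "step": step},
                       CKPT_PATH)

if __name__ == "__main__":
    main()
```

On a real multi-thousand-GPU job, this pattern would run per rank under DDP or FSDP with NCCL collectives, and checkpoints would be sharded and written asynchronously to durable storage so saves do not stall the cluster.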
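The cost-performance trade-off tooling above rests on a simple relationship between training FLOPs, aggregate GPU throughput, and hourly price. A hypothetical back-of-envelope estimator is sketched below; the common ~6*N*D training-FLOPs heuristic, the per-GPU throughput, the MFU, and the price are all assumed values, not CoreWeave hardware specs or pricing.

```python
# Hypothetical back-of-envelope time-to-train vs. cost estimator.
# All constants (FLOPs heuristic, per-GPU throughput, MFU, hourly price) are assumptions.
from dataclasses import dataclass

@dataclass
class GpuConfig:
    name: str
    peak_flops: float    # sustained dense BF16 FLOP/s per GPU (assumed)
    hourly_price: float  # $/GPU-hour (assumed)

def estimate(params: float, tokens: float, num_gpus: int, cfg: GpuConfig, mfu: float = 0.40):
    total_flops = 6 * params * tokens                      # ~6*N*D training-FLOPs heuristic
    seconds = total_flops / (num_gpus * cfg.peak_flops * mfu)
    hours = seconds / 3600
    cost = hours * num_gpus * cfg.hourly_price
    return hours, cost

if __name__ == "__main__":
    cfg = GpuConfig("hypothetical-gpu", peak_flops=1.0e15, hourly_price=3.00)
    for n in (1024, 2048, 4096):
        hours, cost = estimate(params=70e9, tokens=2e12, num_gpus=n, cfg=cfg)
        print(f"{n:5d} GPUs: ~{hours / 24:,.1f} days, ~${cost:,.0f}")
```

In this idealized model total cost stays flat while wall-clock time shrinks with GPU count; a real tool would fold in measured scaling efficiency, interconnect topology, and reserved versus on-demand pricing.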
Qualifications
Required
10+ years building large-scale distributed systems or HPC/cloud services, with 5+ years leading engineering teams
Proven success delivering mission-critical distributed training platforms or large-scale ML pipelines
Deep understanding of GPU/CPU resource allocation, NUMA-aware scheduling, interconnect topologies (NVLink, InfiniBand), and large-scale data handling
Experience with data, model, tensor, and pipeline parallelism, and advanced optimizer techniques for training massive models
Expertise in Kubernetes, service meshes, and CI/CD pipelines for ML workloads; familiarity with Slurm, Ray, or similar orchestration systems is a plus
Hands-on experience with PyTorch, DeepSpeed, Megatron-LM, or other large-scale training frameworks
Background in scaling pretraining of LLMs or multimodal models to thousands of GPUs
Excellent communicator capable of translating complex engineering trade-offs into clear business outcomes
Bachelor's or Master's degree in CS, EE, or related field (or equivalent practical experience)
Preferred
Experience operating multi-region training clusters for hyperscalers or large AI labs
Familiarity with open-source ML training frameworks (e.g., DeepSpeed, FSDP, Alpa, MosaicML)
Familiarity with observability stacks (Prometheus, Grafana, OpenTelemetry) and training-specific telemetry (see the telemetry sketch after this list)
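As a small illustration of training-specific telemetry, the sketch below exports per-step timing and throughput via the prometheus_client library; the metric names, the port, and the synthetic train_step are assumptions, not an established schema.

```python
# Hypothetical sketch of training telemetry exposed for Prometheus scraping.
# Metric names, port, and the synthetic train_step are illustrative assumptions.
import random
import time

from prometheus_client import Gauge, Histogram, start_http_server

STEP_TIME = Histogram("train_step_seconds", "Wall-clock time per optimizer step")
TOKENS_PER_SEC = Gauge("train_tokens_per_second", "Aggregate training throughput")
LAST_CKPT_STEP = Gauge("train_last_checkpoint_step", "Step of the most recent successful checkpoint")

def train_step() -> int:
    """Stand-in for a real training step; returns tokens processed."""
    time.sleep(random.uniform(0.05, 0.15))
    return 8 * 2048  # assumed batch size x sequence length

def main():
    start_http_server(9400)  # metrics served at http://localhost:9400/metrics
    step = 0
    while True:
        start = time.perf_counter()
        tokens = train_step()
        elapsed = time.perf_counter() - start
        STEP_TIME.observe(elapsed)
        TOKENS_PER_SEC.set(tokens / elapsed)
        step += 1
        if step % 50 == 0:
            LAST_CKPT_STEP.set(step)  # pretend a checkpoint just succeeded

if __name__ == "__main__":
    main()
```

A Prometheus server would scrape the /metrics endpoint and feed Grafana dashboards or SLO alerts for stalled steps, throughput regressions, or stale checkpoints.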
Benefits
Medical, dental, and vision insurance - 100% paid for by CoreWeave
Company-paid Life Insurance
Voluntary supplemental life insurance
Short and long-term disability insurance
Flexible Spending Account
Health Savings Account
Tuition Reimbursement
Ability to Participate in Employee Stock Purchase Program (ESPP)
Mental Wellness Benefits through Spring Health
Family-Forming support provided by Carrot
Paid Parental Leave
Flexible, full-service childcare support with Kinside
401(k) with a generous employer match
Flexible PTO
Catered lunch each day in our office and data center locations
A casual work environment
A work culture focused on innovative disruption
Company
CoreWeave
CoreWeave is a cloud-based AI infrastructure company offering GPU cloud services to simplify AI and machine learning workloads.
Funding
Current Stage: Public Company
Total Funding: $23.37B
Key Investors: Jane Street Capital, Stack Capital, Coatue
2025-12-08 · Post-IPO Debt · $2.54B
2025-11-12 · Post-IPO Debt · $1B
2025-08-20 · Post-IPO Secondary
Company data provided by Crunchbase