
CoreWeave · 2 months ago

Director Engineering, AI/ML

CoreWeave is the AI Hyperscaler™, delivering a cloud platform of cutting-edge services for AI. The Director of Engineering will lead the development of a next-generation Large Scale Training Platform, overseeing a world-class engineering team to enhance GPU training services and ensure operational excellence.

AI Infrastructure · Artificial Intelligence (AI) · Cloud Computing · Cloud Infrastructure · Information Technology · Machine Learning
No H1B · U.S. Citizen Only

Responsibilities

Define and continuously refine the end-to-end Training Platform roadmap, prioritizing scalability, throughput, and cost optimization for training the largest AI models
Set technical standards for distributed training frameworks, model/data parallelism strategies, mixed-precision techniques (FP8, BF16), and advanced checkpointing/restart mechanisms (see the sketch after this list)
Design and implement a Kubernetes-native training control plane capable of managing multi-thousand GPU jobs with high reliability and efficiency
Build solutions for elastic distributed training, including job-aware autoscaling, dynamic GPU allocation, and multi-node communication optimizations using NCCL, SHARP, and RDMA
Integrate data pipeline optimizations, such as caching layers, streaming datasets, and sharded data loading to eliminate I/O bottlenecks
Implement state-of-the-art distributed training optimizations—including tensor/sequence parallelism, pipeline parallelism, optimizer sharding, activation checkpointing, and gradient compression—to push the limits of performance and scale
Establish SLO/SLA dashboards, real-time observability, and self-healing mechanisms for thousands of concurrent training jobs across multiple regions
Develop cost-performance trade-off tooling that enables customers to seamlessly select hardware configurations that minimize time-to-train while optimizing for cost
Build robust fault tolerance and automatic recovery workflows to handle large-scale preemption, checkpoint failures, or data pipeline interruptions
Hire, mentor, and grow a diverse team of engineers and managers passionate about building the world’s leading AI training platform
Foster a customer-obsessed, metrics-driven engineering culture with crisp design reviews, deep technical rigor, and blameless post-mortems
Partner closely with Product, Orchestration, Networking, and Storage teams to deliver a unified CoreWeave experience
Work directly with flagship customers training frontier models to gather feedback, optimize workflows, and shape the platform roadmap
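
The following is a minimal, illustrative sketch (not CoreWeave's implementation) of the training-loop mechanics several of these responsibilities refer to: an NCCL-backed DistributedDataParallel job launched with torchrun, BF16 autocast, non-reentrant activation checkpointing, and periodic checkpoint save/restore so a job can resume after preemption. The model, step count, and checkpoint path are placeholders chosen for the example.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.checkpoint import checkpoint


class Block(torch.nn.Module):
    """Toy feed-forward block whose activations are recomputed during backward."""
    def __init__(self):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
        )

    def forward(self, x):
        # Activation checkpointing: trade recompute for activation memory.
        return checkpoint(self.ff, x, use_reentrant=False)


def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE; NCCL handles GPU collectives.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(Block().cuda(), device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    ckpt_path = "/checkpoints/latest.pt"  # placeholder path
    step = 0
    if os.path.exists(ckpt_path):
        # Restart path: resume model/optimizer state after preemption or failure.
        state = torch.load(ckpt_path, map_location=f"cuda:{local_rank}")
        model.module.load_state_dict(state["model"])
        opt.load_state_dict(state["opt"])
        step = state["step"]

    while step < 1000:
        x = torch.randn(32, 1024, device="cuda")  # stand-in for a real data loader
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            loss = model(x).float().pow(2).mean()  # dummy loss for illustration
        loss.backward()
        opt.step()
        opt.zero_grad(set_to_none=True)
        step += 1

        # Periodic checkpointing from rank 0 so the job can be restarted cleanly.
        if step % 100 == 0 and dist.get_rank() == 0:
            torch.save(
                {"model": model.module.state_dict(), "opt": opt.state_dict(), "step": step},
                ckpt_path,
            )

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

Launched as, for example, torchrun --nnodes=2 --nproc_per_node=8 train.py. The control-plane work described above would layer job-aware scheduling, elastic scaling, and automatic restart of processes like this on top.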

Qualifications

Large Scale Training Platform · Distributed Training Frameworks · Kubernetes · GPU Resource Allocation · CI/CD Pipelines · Data Pipeline Optimization · Fault Tolerance · Leadership · Communication · Team Collaboration

Required

10+ years building large-scale distributed systems or HPC/cloud services, with 5+ years leading engineering teams
Proven success delivering mission-critical distributed training platforms or large-scale ML pipelines
Deep understanding of GPU/CPU resource allocation, NUMA-aware scheduling, interconnect topologies (NVLink, InfiniBand), and large-scale data handling
Experience with data, model, tensor, and pipeline parallelism, and advanced optimizer techniques for training massive models
Expertise in Kubernetes, service meshes, and CI/CD pipelines for ML workloads; familiarity with Slurm, Ray, or similar orchestration systems is a plus
Hands-on experience with PyTorch, DeepSpeed, Megatron-LM, or other large-scale training frameworks
Background in scaling pretraining of LLMs or multimodal models to thousands of GPUs
Excellent communicator capable of translating complex engineering trade-offs into clear business outcomes
Bachelor's or Master's degree in CS, EE, or related field (or equivalent practical experience)

Preferred

Experience operating multi-region training clusters for hyperscalers or large AI labs
Familiarity with open-source ML training frameworks (e.g., DeepSpeed, FSDP, Alpa, MosaicML)
Familiarity with observability stacks (Prometheus, Grafana, OpenTelemetry) and training-specific telemetry

Benefits

Medical, dental, and vision insurance - 100% paid for by CoreWeave
Company-paid Life Insurance
Voluntary supplemental life insurance
Short and long-term disability insurance
Flexible Spending Account
Health Savings Account
Tuition Reimbursement
Ability to Participate in Employee Stock Purchase Program (ESPP)
Mental Wellness Benefits through Spring Health
Family-Forming support provided by Carrot
Paid Parental Leave
Flexible, full-service childcare support with Kinside
401(k) with a generous employer match
Flexible PTO
Catered lunch each day in our office and data center locations
A casual work environment
A work culture focused on innovative disruption

Company

CoreWeave

CoreWeave is a cloud-based AI infrastructure company offering GPU cloud services to simplify AI and machine learning workloads.

Funding

Current Stage: Public Company
Total Funding: $23.37B
Key Investors: Jane Street Capital, Stack Capital, Coatue

2025-12-08 · Post-IPO Debt · $2.54B
2025-11-12 · Post-IPO Debt · $1B
2025-08-20 · Post-IPO Secondary

Leadership Team

Michael Intrator
Chief Executive Officer
Nitin Agrawal
Chief Financial Officer
Company data provided by Crunchbase