CoreWeave · 1 day ago
Principal Engineer, Cluster Orchestration
CoreWeave is The Essential Cloud for AI™, providing a platform that enables innovators to build and scale AI with confidence. As a Principal Engineer in AI Infrastructure, you will lead the design and evolution of cluster orchestration systems, influencing how efficiently GPUs are used and how reliably large models run.
AI Infrastructure, Artificial Intelligence (AI), Cloud Computing, Cloud Infrastructure, Information Technology, Machine Learning
Responsibilities
Define the long-term architecture for CoreWeave’s orchestration platforms across Kubernetes, Slurm, SUNK, Kueue, and related systems
Act as a technical authority on scheduling, quota enforcement, fairness, pre-emption, and multi-tenant GPU isolation
Make design decisions that balance performance, reliability, cost, and operational complexity
Lead the evolution of Kubernetes-native control planes, including SUNK and custom operators
Design systems that support workload admission, validation, and rollout, including model onboarding flows
Identify and remove scaling limits across schedulers, control planes, registries, networking, and storage
Set standards for reliability, observability, and operational readiness across orchestration services
Define SLOs, alerting, and incident response practices for platform-critical systems
Ensure systems behave predictably during failures, peak load, and rapid growth
Write and review production code for Kubernetes controllers, schedulers, admission logic, and internal tooling
Measure and improve scheduling latency, container startup time, image distribution, and cold-start performance
Lead architecture and design reviews across infrastructure teams
Mentor senior and staff engineers and help grow technical leaders
Influence platform, infrastructure, security, and product teams through clear technical judgment
Engage with customers and open-source communities on deep technical topics when needed
Qualifications
Required
15+ years of experience building and operating large-scale distributed systems
Deep, practical knowledge of Kubernetes and Slurm internals
Experience running GPU-heavy platforms for AI training, inference, or HPC workloads
Strong background in Go and cloud-native systems development
Proven ability to set technical direction across teams without direct authority
Comfortable making high-impact technical decisions in complex systems
Bachelor's or Master's degree in a relevant field, or equivalent experience
Preferred
Experience with systems such as Kueue, Kubeflow, Argo Workflows, Ray, Istio, or Knative
Background in ML platform engineering, model onboarding, or lifecycle management
Strong understanding of scheduling strategies, pre-emption, quota enforcement, and elastic scaling
Track record of operating highly reliable systems with clear SLOs and incident processes
Contributions to Kubernetes, ML infrastructure, or related open-source projects
Experience mentoring senior engineers and raising engineering standards
Benefits
Medical, dental, and vision insurance - 100% paid for by CoreWeave
Company-paid Life Insurance
Voluntary supplemental life insurance
Short- and long-term disability insurance
Flexible Spending Account
Health Savings Account
Tuition Reimbursement
Ability to participate in the Employee Stock Purchase Program (ESPP)
Mental Wellness Benefits through Spring Health
Family-Forming support provided by Carrot
Paid Parental Leave
Flexible, full-service childcare support with Kinside
401(k) with a generous employer match
Flexible PTO
Catered lunch each day in our office and data center locations
A casual work environment
A work culture focused on innovative disruption
Company
CoreWeave
CoreWeave is a cloud-based AI infrastructure company offering GPU cloud services to simplify AI and machine learning workloads.
Funding
Current Stage
Public Company
Total Funding
$26.87B
Key Investors
NVIDIA, Goldman Sachs, JP Morgan Chase, Morgan Stanley, MUFG Union Bank, Jane Street Capital
2026-01-26 · Post-IPO Equity · $2B
2025-12-08 · Post-IPO Debt · $2.54B
2025-11-12 · Post-IPO Debt · $2.5B
Company data provided by Crunchbase