
CoreWeave · 5 months ago

Principal Engineer, Inference

CoreWeave is the AI Hyperscaler™, delivering cutting-edge cloud services for AI. The company is seeking a Principal Engineer to lead development of its next-generation Inference Platform: architecting and building GPU inference services while partnering with teams across the organization to optimize AI applications.

AI Infrastructure · Artificial Intelligence (AI) · Cloud Computing · Cloud Infrastructure · Information Technology · Machine Learning
No H1B · U.S. Citizen Only

Responsibilities

Define the technical roadmap for ultra-low-latency, high-throughput inference
Evaluate and influence adoption of runtimes and frameworks (Triton, vLLM, TensorRT-LLM, Ray Serve, TorchServe) and guide build-vs-buy decisions
Design Kubernetes-native control-plane components that deploy, autoscale, and monitor fleets of model-server pods spanning thousands of GPUs
Implement advanced optimizations (micro-batching, speculative decoding, KV-cache reuse, early-exit heuristics, tensor/stream-parallel inference) to squeeze every microsecond out of large-model serving; a micro-batching sketch appears after this list
Build intelligent request routing and adaptive scheduling to maximize GPU utilization while guaranteeing strict P99 latency SLAs
Create real-time observability, live debugging hooks, and automated rollback/traffic-shift for model versioning
Develop cost-per-token and cost-per-request analytics so customers can instantly select the ideal hardware tier
Write production code, reference implementations, and performance benchmarks across gRPC/HTTP, CUDA Graphs, and NCCL/SHARP fast-paths
Lead deep-dive investigations into network, PCIe, NVLink, and memory-bandwidth bottlenecks
Coach engineers on large-scale inference best practices and performance profiling
Partner with lighthouse customers to launch and optimize mission-critical, real-time AI applications
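The micro-batching named above is the central latency/throughput trade in model serving: batching requests amortizes per-call GPU overhead at the price of a small queueing delay. Below is a minimal sketch of a dynamic micro-batcher, assuming a hypothetical forward_batch model call and illustrative queue parameters; this is not CoreWeave's implementation, only an illustration of the concept.

    import asyncio
    import time

    def forward_batch(prompts):
        # Hypothetical stand-in for a batched GPU model call; batching
        # amortizes per-request overhead across the whole batch.
        return [f"output for {p!r}" for p in prompts]

    class MicroBatcher:
        # Collects requests until max_batch_size is reached or max_wait_ms
        # elapses, then serves them as one batch. Values are illustrative.
        def __init__(self, max_batch_size=8, max_wait_ms=2.0):
            self.max_batch_size = max_batch_size
            self.max_wait = max_wait_ms / 1000.0
            self.queue = asyncio.Queue()

        async def submit(self, prompt):
            fut = asyncio.get_running_loop().create_future()
            await self.queue.put((prompt, fut))
            return await fut

        async def run(self):
            while True:
                item = await self.queue.get()      # wait for the first request
                batch = [item]
                deadline = time.monotonic() + self.max_wait
                while len(batch) < self.max_batch_size:
                    remaining = deadline - time.monotonic()
                    if remaining <= 0:
                        break
                    try:
                        batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                    except asyncio.TimeoutError:
                        break
                outputs = forward_batch([p for p, _ in batch])  # one "GPU" call
                for (_, fut), out in zip(batch, outputs):
                    fut.set_result(out)

    async def main():
        mb = MicroBatcher()
        worker = asyncio.create_task(mb.run())
        print(await asyncio.gather(*(mb.submit(f"prompt-{i}") for i in range(5))))
        worker.cancel()

    asyncio.run(main())

The tension between max_wait_ms and batch size is exactly the knob a production scheduler tunes against a P99 latency SLA: waiting longer fills larger batches and raises throughput, but every extra millisecond in the queue eats into the latency budget.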

Qualifications

GPU inference services · Kubernetes-native architecture · Real-time ML inference · PyTorch · TensorFlow · Micro-batch scheduling · Cost-per-token analytics · CI/CD pipelines · Communication · Mentorship · Collaboration

Required

10+ years building distributed systems or HPC/cloud services, with 4+ years focused on real-time ML inference or other latency-critical data planes
Demonstrated expertise in micro-batch schedulers, GPU resource isolation, KV caching, speculative decoding, and mixed precision (BF16/FP8) inference
Deep knowledge of PyTorch or TensorFlow serving internals, CUDA kernels, NCCL/SHARP, RDMA, NUMA, and GPU interconnect topologies
Proven track record of driving sub-50 ms global P99 latencies and optimizing cost-per-token / cost-per-request on multi-node GPU clusters; a cost-per-token sketch appears after this list
Fluency with Kubernetes (or Slurm/Ray) at production scale plus CI/CD, service meshes, and observability stacks (Prometheus, Grafana, OpenTelemetry)
Excellent communicator who influences architecture across teams and presents complex trade-offs to executives and customers
Bachelor's or Master's in CS, EE, or related field (or equivalent practical experience)
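The cost-per-token optimization in the requirement above reduces to simple arithmetic; the sketch below ranks two hypothetical hardware tiers using made-up prices and throughputs (not CoreWeave rates) purely to show how the comparison works.

    # All prices and throughput figures below are made-up placeholders.
    tiers = {
        "gpu-tier-a": {"usd_per_hour": 2.00, "tokens_per_second": 4_000},
        "gpu-tier-b": {"usd_per_hour": 4.50, "tokens_per_second": 11_000},
    }

    def cost_per_million_tokens(usd_per_hour, tokens_per_second):
        tokens_per_hour = tokens_per_second * 3600
        return usd_per_hour / tokens_per_hour * 1_000_000

    for name, t in tiers.items():
        c = cost_per_million_tokens(t["usd_per_hour"], t["tokens_per_second"])
        print(f"{name}: ${c:.3f} per 1M tokens")

    # gpu-tier-a: $0.139 per 1M tokens
    # gpu-tier-b: $0.114 per 1M tokens
    # The pricier GPU wins on cost-per-token here because its throughput
    # scales faster than its hourly price.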

Preferred

Code contributions to open-source inference frameworks (vLLM, Triton, Ray Serve, TensorRT-LLM, TorchServe)
Experience operating multi-region inference fleets or streaming-token services at a hyperscaler or AI research lab
Publications/talks on latency optimization, token streaming, or advanced model-server architectures

Benefits

Medical, dental, and vision insurance - 100% paid for by CoreWeave
Company-paid Life Insurance
Voluntary supplemental life insurance
Short- and long-term disability insurance
Flexible Spending Account
Health Savings Account
Tuition Reimbursement
Ability to participate in the Employee Stock Purchase Program (ESPP)
Mental Wellness Benefits through Spring Health
Family-Forming support provided by Carrot
Paid Parental Leave
Flexible, full-service childcare support with Kinside
401(k) with a generous employer match
Flexible PTO
Catered lunch each day in our office and data center locations
A casual work environment
A work culture focused on innovative disruption

Company

CoreWeave

CoreWeave is a cloud-based AI infrastructure company offering GPU cloud services to simplify AI and machine learning workloads.

Funding

Current Stage: Public Company
Total Funding: $23.37B
Key Investors: Jane Street Capital, Stack Capital, Coatue

2025-12-08 · Post-IPO Debt · $2.54B
2025-11-12 · Post-IPO Debt · $1B
2025-08-20 · Post-IPO Secondary

Leadership Team

Michael Intrator
Chief Executive Officer

Nitin Agrawal
Chief Financial Officer
Company data provided by Crunchbase