Senior Customer Reliability Engineer - Weights & Biases jobs in United States

Weights & Biases · 1 day ago

Senior Customer Reliability Engineer - Weights & Biases

Weights & Biases, a part of CoreWeave, is focused on delivering a powerful AI development platform. The Senior Customer Reliability Engineer will ensure that top-tier customers achieve exceptional stability and performance by acting as a technical owner and trusted partner, optimizing their use of the platform, and resolving complex issues.

AI Infrastructure · Artificial Intelligence (AI) · Data Visualization · Developer Tools · Generative AI · Machine Learning
Comp. & Benefits
No H1B · U.S. Citizen Only

Responsibilities

Serve as the primary technical partner for a portfolio of top-tier enterprise customers, owning reliability, performance, and operational success for their critical workloads
Develop a deep understanding of each customer’s architecture and ML workloads to provide proactive guidance, best practices, and optimization strategies
Troubleshoot and resolve complex problems across the Weights & Biases platform, APIs, integrations, and customer environments, driving both immediate fixes and long-term solutions
Reproduce, isolate, and document product issues, collaborating closely with engineering to ensure prioritization and sustainable resolution
Build tools, scripts, and automation to diagnose issues quickly and enhance internal and customer-facing troubleshooting workflows
Provide architectural and operational recommendations to help customers scale experimentation, training pipelines, and generative AI workloads efficiently
Mentor support engineers, ensuring consistent technical depth and high-quality guidance across customer interactions
Identify patterns across top-tier accounts and advocate for systemic improvements that enhance platform reliability and customer experience
Participate in incident response, postmortems, and internal documentation to continually elevate reliability standards
Participate in a 24/7 on-call rotation focused on supporting mission-critical customer workloads

Qualifications

Python · Kubernetes/GKE · AI/ML frameworks · Distributed systems · Monitoring tools · Cloud platforms · Customer-obsessed · Mentoring support engineers · Exceptional communication

Required

5+ years of experience in technical support, customer engineering, production engineering, reliability engineering, or a similar role supporting enterprise or strategic accounts
Expert in Python, with strong debugging, profiling, and production-grade development skills
Strong background in computer science or software engineering (B.S. in CS or equivalent experience)
Strong experience running or supporting large-scale, high-availability systems (Kubernetes/GKE, cloud services, distributed systems, or similar)
Deep familiarity with the AI/ML ecosystem: training frameworks (PyTorch, TensorFlow), generative AI stack (Hugging Face, LangChain, vector databases), and modern experimentation workflows
Skilled at diagnosing distributed systems, APIs, containerized environments, and multi-tenant cloud architectures
Exceptional communication skills, with the ability to interface effectively with customer engineering teams, executives, and internal stakeholders
Demonstrated success partnering with product and engineering teams to drive reliability improvements and influence roadmap priorities
Self-driven, customer-obsessed, and passionate about building reliable, scalable systems and great customer experiences
Proficient with monitoring and observability tools (Datadog, Prometheus/Grafana, OpenTelemetry, etc.) for debugging production environments

Preferred

Experience with Docker, Kubernetes, and cloud platforms (AWS, GCP, Azure)
Familiarity with GPU compute environments and distributed model training pipelines
Previous experience in SRE, incident management, or cloud platform reliability roles
Experience owning reliability for a specific major customer or 'tenant' (e.g., dedicated instances, VPC deployments, or on-prem/isolated environments)
Experience participating in or running on-call rotations and incident management processes

Benefits

Medical, dental, and vision insurance - 100% paid for by CoreWeave
Company-paid Life Insurance
Voluntary supplemental life insurance
Short- and long-term disability insurance
Flexible Spending Account
Health Savings Account
Tuition Reimbursement
Ability to Participate in Employee Stock Purchase Program (ESPP)
Mental Wellness Benefits through Spring Health
Family-Forming support provided by Carrot
Paid Parental Leave
Flexible, full-service childcare support with Kinside
401(k) with a generous employer match
Flexible PTO
Catered lunch each day in our office and data center locations
A casual work environment
A work culture focused on innovative disruption

Company

Weights & Biases

Weights & Biases is a developer-first MLOps platform that provides tools for visualizing machine learning performance.

Funding

Current Stage: Growth Stage
Total Funding: $250M
Key Investors: NVIDIA, Insight Partners, Coatue
2025-03-04: Acquired
2023-09-01: Secondary Market
2023-08-09: Series Unknown · $50M

Leadership Team

Chris Van Pelt, Co-Founder & CISO
Shawn Lewis, Founder/CTO
Company data provided by Crunchbase