Weights & Biases · 1 day ago
Senior Customer Reliability Engineer - Weights & Biases
Weights & Biases, a part of CoreWeave, is focused on delivering a powerful AI development platform. The Senior Customer Reliability Engineer will ensure that top-tier customers achieve exceptional stability and performance by acting as a technical owner and trusted partner, optimizing their use of the platform, and resolving complex issues.
AI Infrastructure, Artificial Intelligence (AI), Data Visualization, Developer Tools, Generative AI, Machine Learning
Responsibilities
Serve as the primary technical partner for a portfolio of top-tier enterprise customers, owning reliability, performance, and operational success for their critical workloads
Develop a deep understanding of each customer’s architecture and ML workloads to provide proactive guidance, best practices, and optimization strategies
Troubleshoot and resolve complex problems across the Weights & Biases platform, APIs, integrations, and customer environments, driving both immediate fixes and long-term solutions
Reproduce, isolate, and document product issues, collaborating closely with engineering to ensure prioritization and sustainable resolution
Build tools, scripts, and automation to diagnose issues quickly and enhance internal and customer-facing troubleshooting workflows
Provide architectural and operational recommendations to help customers scale experimentation, training pipelines, and generative AI workloads efficiently
Mentor support engineers, ensuring consistent technical depth and high-quality guidance across customer interactions
Identify patterns across top-tier accounts and advocate for systemic improvements that enhance platform reliability and customer experience
Participate in incident response, postmortems, and internal documentation to continually elevate reliability standards
Participate in a 24/7 on-call rotation focused on supporting mission-critical customer workloads
Qualifications
Required
5+ years of experience in technical support, customer engineering, production engineering, reliability engineering, or a similar role supporting enterprise or strategic accounts
Expert in Python, with strong debugging, profiling, and production-grade development skills
Strong background in computer science or software engineering (B.S. in CS or equivalent experience)
Strong experience running or supporting large-scale, high-availability systems (Kubernetes/GKE, cloud services, distributed systems, or similar)
Deep familiarity with the AI/ML ecosystem: training frameworks (PyTorch, TensorFlow), generative AI stack (Hugging Face, LangChain, vector databases), and modern experimentation workflows
Skilled at diagnosing distributed systems, APIs, containerized environments, and multi-tenant cloud architectures
Exceptional communication skills, with the ability to interface effectively with customer engineering teams, executives, and internal stakeholders
Demonstrated success partnering with product and engineering teams to drive reliability improvements and influence roadmap priorities
Self-driven, customer-obsessed, and passionate about building reliable, scalable systems and great customer experiences
Proficient with monitoring and observability tools (Datadog, Prometheus/Grafana, OpenTelemetry, etc.) for debugging production environments
Preferred
Experience with Docker, Kubernetes, and cloud platforms (AWS, GCP, Azure)
Familiarity with GPU compute environments and distributed model training pipelines
Previous experience in SRE, incident management, or cloud platform reliability roles
Experience owning reliability for a specific major customer or 'tenant' (e.g., dedicated instances, VPC deployments, or on-prem/isolated environments)
Experience participating in or running on-call rotations and incident management processes
Benefits
Medical, dental, and vision insurance - 100% paid for by CoreWeave
Company-paid Life Insurance
Voluntary supplemental life insurance
Short- and long-term disability insurance
Flexible Spending Account
Health Savings Account
Tuition Reimbursement
Ability to participate in the Employee Stock Purchase Program (ESPP)
Mental Wellness Benefits through Spring Health
Family-Forming support provided by Carrot
Paid Parental Leave
Flexible, full-service childcare support with Kinside
401(k) with a generous employer match
Flexible PTO
Catered lunch each day in our office and data center locations
A casual work environment
A work culture focused on innovative disruption
Company
Weights & Biases
Weights & Biases is a developer-first MLOps platform that builds machine learning performance visualization tools.
Funding
Current Stage: Growth Stage
Total Funding: $250M
Key Investors: NVIDIA, Insight Partners, Coatue
2025-03-04: Acquired
2023-09-01: Secondary Market
2023-08-09: Series Unknown · $50M
Recent News
Qualcomm Ventures (2026-01-20)
Dynamic Business (2026-01-20)
Company data provided by Crunchbase