CoreWeave
Software Engineer, Observability
CoreWeave is The Essential Cloud for AI™, delivering a platform that enables innovators to build and scale AI with confidence. The Software Engineer in Observability will be responsible for building, maintaining, and optimizing systems that support GPU-dense clusters and telemetry, enhancing the observability stack for AI workloads.
AI Infrastructure, Artificial Intelligence (AI), Cloud Computing, Cloud Infrastructure, Information Technology, Machine Learning
Responsibilities
Design, build, and maintain logging, tracing, and/or metrics platforms by writing production-quality code in languages like Go and Python, with guidance from senior engineers, contributing to the reliability and performance of our observability stack
Develop and refine monitoring and alerting to enhance system reliability, reduce mean time to detect, and improve incident response
Assist engineers across CoreWeave in developing effective usage patterns for observability systems, helping teams instrument services, tune dashboards, and set actionable alerts
Manage production and pre-production clusters, including deployments and configuration, and build tools that enable development teams to follow best practices
Participate in the team’s on-call rotation to support critical production systems, learning from incidents and contributing to long-term reliability improvements
Qualifications
Required
2+ years of experience in Software Engineering, Site Reliability Engineering, DevOps, or a related field
Proficiency in at least one programming or scripting language (e.g., Python, Go)
Experience working with Kubernetes, containerization, and microservices architectures
Experience participating in on-call rotations, including triaging and appropriately escalating production issues
Experience using observability systems at scale (e.g., metrics, logging, tracing) to understand and debug complex distributed systems
Strong problem-solving, analytical, and communication skills, with the ability to work effectively with other engineering teams
Preferred
Experience running a production observability database or tool (e.g., ClickHouse, Elastic, Loki, VictoriaMetrics, Prometheus, Thanos, OpenTelemetry, Grafana)
Familiarity with infrastructure-as-code tools like Terraform
Exposure to modern testing frameworks and progressive deployment strategies (e.g., canary, blue–green)
Hands-on experience using data-streaming systems (e.g., Kafka, Kafka Connect) for observability pipelines
Experience with modern AI platforms and workloads (e.g., large-scale training and inference, GPU-based infrastructure, MLOps tooling) is a plus
Benefits
Medical, dental, and vision insurance - 100% paid for by CoreWeave
Company-paid Life Insurance
Voluntary supplemental life insurance
Short- and long-term disability insurance
Flexible Spending Account
Health Savings Account
Tuition Reimbursement
Ability to participate in the Employee Stock Purchase Program (ESPP)
Mental Wellness Benefits through Spring Health
Family-Forming support provided by Carrot
Paid Parental Leave
Flexible, full-service childcare support with Kinside
401(k) with a generous employer match
Flexible PTO
Catered lunch each day in our office and data center locations
A casual work environment
A work culture focused on innovative disruption
Company
CoreWeave
CoreWeave is a cloud-based AI infrastructure company offering GPU cloud services to simplify AI and machine learning workloads.
Funding
Current Stage
Public Company
Total Funding
$26.87B
Key Investors
NVIDIA, Goldman Sachs, JP Morgan Chase, Morgan Stanley, MUFG Union Bank, Jane Street Capital
2026-01-26 · Post-IPO Equity · $2B
2025-12-08 · Post-IPO Debt · $2.54B
2025-11-12 · Post-IPO Debt · $2.5B
Company data provided by Crunchbase