CoreWeave · 2 weeks ago
Senior Software Engineer, Observability
CoreWeave is The Essential Cloud for AI™, providing a platform that enables innovators to build and scale AI with confidence. The role involves designing and building observability infrastructure, managing telemetry data, and improving the performance and reliability of observability services.
AI Infrastructure · Artificial Intelligence (AI) · Cloud Computing · Cloud Infrastructure · Information Technology · Machine Learning
Responsibilities
Design, build, and own core observability infrastructure, including highly scalable and reliable logging, metrics, and tracing platforms
Develop and implement scalable, high-throughput telemetry pipelines that ingest, transform, and expose observability data, ensuring high reliability, security, and transparent data migrations for platform consumers
Establish and build governance mechanisms and best practices to empower CoreWeave engineers to effectively manage the telemetry their services produce, fostering effective usage patterns and a self-service model
Continuously improve the performance, security, reliability, and scalability of observability services through software enhancements and new feature development
Participate in the team's on-call rotation to support critical production systems, focusing on root cause analysis and building durable solutions to prevent future incidents
Collaborate closely with internal engineering teams, applying a platform-as-a-product mindset to understand their needs and embed observability best practices and custom tooling into their systems
Contribute to the overall observability strategy, influencing the direction of our platform
Qualifications
Required
Six or more years of experience in software or infrastructure engineering, with a proven track record of designing, building, and operating large-scale distributed systems in production
Proficiency in Go (our primary language) or Python, with a strong ability to write clean, resilient, and testable code for production-grade software
Hands-on production Kubernetes experience (non-negotiable), including familiarity with containerization and microservices architectures and an understanding of the associated observability challenges
Proven track record of designing, building, and delivering robust, scalable production systems, with a commitment to operational excellence: writing high-quality code and applying best practices for system reliability, including effective testing and progressive release strategies
Ability to analyze and decompose complex problems in elastic architectures into manageable tasks
Comfortable with Helm and YAML configuration for deploying and managing services, including templating, automation, and infrastructure-as-code practices
A customer-obsessed mindset, eager to provide infrastructure as a service and apply a product lens when evaluating platform scale problems
Experience participating in an on-call rotation for critical production systems
Preferred
Direct, hands-on experience designing, operating, or scaling logging, tracing, and/or metrics platforms (e.g., Loki, ClickHouse, Elasticsearch, Prometheus, VictoriaMetrics, Grafana, Thanos)
Familiarity with data streaming systems (e.g., Kafka, Kafka Connect) for observability pipelines
Experience automating and provisioning infrastructure as part of the software development lifecycle, using tools like Terraform
Knowledge of Linux systems, shell scripting, and the Linux storage and networking stacks
Experience with OpenTelemetry for unified telemetry collection
Interest in contributing to open source projects
Benefits
Medical, dental, and vision insurance - 100% paid for by CoreWeave
Company-paid Life Insurance
Voluntary supplemental life insurance
Short and long-term disability insurance
Flexible Spending Account
Health Savings Account
Tuition Reimbursement
Ability to Participate in Employee Stock Purchase Program (ESPP)
Mental Wellness Benefits through Spring Health
Family-Forming support provided by Carrot
Paid Parental Leave
Flexible, full-service childcare support with Kinside
401(k) with a generous employer match
Flexible PTO
Catered lunch each day in our office and data center locations
A casual work environment
A work culture focused on innovative disruption
Company
CoreWeave
CoreWeave is a cloud-based AI infrastructure company offering GPU cloud services to simplify AI and machine learning workloads.
Funding
Current Stage: Public Company
Total Funding: $23.37B
Key Investors: Jane Street Capital, Stack Capital, Coatue
2025-12-08 · Post-IPO Debt · $2.54B
2025-11-12 · Post-IPO Debt · $1B
2025-08-20 · Post-IPO Secondary
Recent News
The Motley Fool
2026-01-09
Company data provided by Crunchbase