Senior Software Engineer, Observability jobs in United States

CoreWeave · 8 hours ago

Senior Software Engineer, Observability

CoreWeave is The Essential Cloud for AI™, providing a platform that enables innovators to build and scale AI. The Senior Software Engineer, Observability will design and build core observability infrastructure to help understand, troubleshoot, and optimize complex systems.
AI Infrastructure, Artificial Intelligence (AI), Cloud Computing, Cloud Infrastructure, Information Technology, Machine Learning
No H1B · U.S. Citizen Only

Responsibilities

Design, build, and own core observability infrastructure, including highly scalable and reliable logging, metrics, and tracing platforms
Develop and operate high-throughput telemetry pipelines that ingest, transform, and expose observability data, ensuring reliability, security, and transparent data migrations for platform consumers
Tackle observability challenges at extreme scale, supporting clusters of thousands of GPUs, petabyte-scale telemetry, and high-cardinality workloads
Continuously improve performance, security, reliability, and scalability of observability services through software enhancements, automation, and new feature development
Participate in the team’s on-call rotation to support critical production systems, focusing on root cause analysis and building durable solutions to prevent recurrence
Collaborate closely with internal engineering teams, applying a platform-as-a-product mindset to understand their needs and embed observability best practices and custom tooling into their systems
Contribute to the overall observability strategy, influencing the direction of our platform and the experience we provide to customers

Qualifications

Go, Kubernetes, Observability platforms, Python, Terraform, OpenTelemetry, Soft skills

Required

5+ years of experience in software or infrastructure engineering, with a proven track record of designing, building, and operating large-scale distributed systems in production
Proficient in Go (our primary language) or Python, with the ability to write clean, resilient, and testable production code
Hands-on Kubernetes experience in production, including containerization and microservices architectures, and familiarity with their observability challenges
Demonstrated experience designing, building, and delivering robust and scalable systems, with a commitment to operational excellence, high-quality code, effective testing, and progressive release strategies
Ability to analyze and decompose complex problems in elastic, distributed architectures into manageable, well-scoped work
Comfortable working with Helm and YAML-based configuration for deploying and managing services, including templating, automation, and infrastructure-as-code practices
A customer-obsessed, platform-minded engineer, eager to provide infrastructure as a service and apply a product lens when evaluating platform scale problems
Experience participating in an on-call rotation for critical production systems

Preferred

Direct, hands-on experience designing, operating, or scaling logging, tracing, and/or metrics platforms (e.g., Loki, ClickHouse, Elasticsearch, Prometheus, VictoriaMetrics, Grafana, Thanos)
Familiarity with data streaming systems (e.g., Kafka, Kafka Connect) for observability pipelines
Experience automating and provisioning infrastructure as part of the software development lifecycle, using tools like Terraform
Experience with OpenTelemetry for unified telemetry collection and instrumentation
Experience with modern AI platforms and workloads (e.g., large-scale training and inference, GPU-based infrastructure, MLOps tooling) is a plus

Benefits

Medical, dental, and vision insurance - 100% paid for by CoreWeave
Company-paid Life Insurance
Voluntary supplemental life insurance
Short- and long-term disability insurance
Flexible Spending Account
Health Savings Account
Tuition Reimbursement
Ability to Participate in Employee Stock Purchase Program (ESPP)
Mental Wellness Benefits through Spring Health
Family-Forming support provided by Carrot
Paid Parental Leave
Flexible, full-service childcare support with Kinside
401(k) with a generous employer match
Flexible PTO
Catered lunch each day in our office and data center locations
A casual work environment
A work culture focused on innovative disruption

Company

CoreWeave

CoreWeave is a cloud-based AI infrastructure company offering GPU cloud services to simplify AI and machine learning workloads.

Funding

Current Stage: Public Company
Total Funding: $26.87B
Key Investors: NVIDIA, Goldman Sachs, JP Morgan Chase, Morgan Stanley, MUFG Union Bank, Jane Street Capital
2026-01-26 · Post-IPO Equity · $2B
2025-12-08 · Post-IPO Debt · $2.54B
2025-11-12 · Post-IPO Debt · $2.5B

Leadership Team

Michael Intrator
Chief Executive Officer
Brannin McBee
Founder & CDO
Company data provided by Crunchbase