CoreWeave · 3 hours ago
Senior Software Engineer, Observability Insights
CoreWeave is The Essential Cloud for AI™, providing a platform that enables innovators to build and scale AI with confidence. The role involves leading the Observability Insights effort, building product experiences and interfaces on top of a telemetry layer to help understand and optimize complex AI systems.
AI Infrastructure · Artificial Intelligence (AI) · Cloud Computing · Cloud Infrastructure · Information Technology · Machine Learning
Responsibilities
Design and execute the development of highly available, multi-tenant APIs that expose telemetry and derived insights in a developer-obsessed way
Modernize how users interact with data by building agentic experiences, including MCP servers, agentic tools and API gateways that safely expose foundational telemetry
Build observability capabilities that enable agentic workflows for guided debugging, workload optimization, and incident summarization to empower CoreWeavers and customers alike
Develop and enforce best practices regarding the health of telemetry data pipelines, specifically focused on correlation primitives and aggregation services for RCA and performance detection
Improve the performance, security, reliability, and scalability of insights services, including SLO ownership and latency optimization, while participating in the team’s on-call rotation
Collaborate closely with internal engineering teams, applying a platform-as-a-product mindset to understand their needs and embed observability best practices and custom tooling into their systems
Contribute to the overall observability strategy, influencing the direction of our platform
Qualifications
Required
Six or more years of experience in software or infrastructure engineering, with a focus on building production-grade backend systems and distributed APIs
You are customer-obsessed, ecstatic to provide infrastructure as a service, and default to adopting a product lens when building developer-facing surfaces like SDKs and CLIs
Versed in reliability engineering concepts, including evaluation datasets for LLMs, error budgets for platform services, and fault-tolerant design for multi-tenant systems
Familiar with various observability systems like ClickHouse, Loki, VictoriaMetrics, Prometheus, and Grafana
Experienced in building agentic applications or LLM features, with a pragmatic approach to grounding, tool calling, and operational safety
Comfortable with the idea of using Go as your primary programming language, but capable of collaborating with Python components when required for agentic layers
Work with a passionate team of engineers in an iterative, high-trust agile environment to ensure the collection-to-insights pipeline works end-to-end
Preferred
Experience operating Kubernetes clusters at scale and debugging real-world AI workloads
Experience with logging, tracing, and metrics platforms in production and at scale, with a deep understanding of cardinality, indexing, and query performance
Experience running distributed systems and API services at cloud scale, including event streaming or data pipeline management
Experience building services and products with LLMs, MCP, and agentic frameworks such as LangChain and AgentCore
Benefits
Medical, dental, and vision insurance - 100% paid for by CoreWeave
Company-paid Life Insurance
Voluntary supplemental life insurance
Short and long-term disability insurance
Flexible Spending Account
Health Savings Account
Tuition Reimbursement
Ability to Participate in Employee Stock Purchase Program (ESPP)
Mental Wellness Benefits through Spring Health
Family-Forming support provided by Carrot
Paid Parental Leave
Flexible, full-service childcare support with Kinside
401(k) with a generous employer match
Flexible PTO
Catered lunch each day in our office and data center locations
A casual work environment
A work culture focused on innovative disruption
Company
CoreWeave
CoreWeave is a cloud-based AI infrastructure company offering GPU cloud services to simplify AI and machine learning workloads.
Funding
Current Stage: Public Company
Total Funding: $24.87B
Key Investors: Jane Street Capital, Stack Capital, Coatue
2025-12-08 · Post-IPO Debt · $2.54B
2025-11-12 · Post-IPO Debt · $2.5B
2025-08-20 · Post-IPO Secondary
Company data provided by Crunchbase