Senior Engineer, Network Observability jobs in United States

CoreWeave · 2 days ago

Senior Engineer, Network Observability

CoreWeave is The Essential Cloud for AI™, delivering a platform of technology and tools for innovators to build and scale AI. The Senior Engineer for Network Observability will design, develop, and maintain monitoring and observability systems for CoreWeave’s GPU cloud network, ensuring reliable operation and proactive issue resolution.

Artificial Intelligence (AI), Cloud Computing, Cloud Infrastructure, Information Technology, Machine Learning
No H1B, U.S. Citizen Only

Responsibilities

Develop, optimize, and maintain network observability platforms. Use your skills in Python and Golang to create and automate collectors, exporters, and dashboards that provide deep visibility into network health and performance
Collaborate with Network Engineering and Platform teams to ingest and unify logs, metrics, and events from a variety of platforms (Arista EOS, NVIDIA Cumulus Linux, Nokia SR OS, SR Linux, etc.) into a single observability pipeline
Design and implement scalable telemetry solutions using protocols like gNMI, SNMP, and streaming analytics. Ensure advanced alerting and anomaly detection with tools such as Prometheus, Grafana, and Alertmanager
Work closely with network developers, site reliability engineers, and security teams to integrate observability solutions across the broader infrastructure. Participate in design discussions, RFCs, and architectural decisions
Join a rotating on-call schedule to troubleshoot and resolve observability-related issues. Provide timely support to operations teams, quickly isolating and fixing problems when they arise
Guide junior team members, share best practices, and foster a culture of continuous learning and improvement within the observability domain

Qualifications

Python, Golang, Prometheus, Grafana, SNMP, gNMI, Kubernetes, Linux systems, Network Engineering, Ansible, Bash, Machine Learning, Network Certifications, Distributed Tracing

Required

Deep familiarity with Prometheus, Grafana, Alertmanager, gNMI, and SNMP. Experience writing or extending custom metric collectors/exporters is a plus
Experience as a Network Engineer, SRE, Software Developer, or Systems Administrator in large-scale environments. A track record of building and operating robust telemetry and monitoring solutions is a plus
Passion for automating tasks and processes. You find satisfaction in creating workflows that handle repetitive tasks and reduce human error to near zero
Comfortable containerizing solutions and designing, building, and deploying container-based workloads efficiently on Kubernetes
Proficient with Python, Go, and Bash, plus familiarity with configuration management and templating tools (e.g., Ansible, Jinja2)
Strong knowledge of Linux systems and IP networking concepts, with hands-on experience in routing, switching, and network troubleshooting
Practical experience with a variety of platforms, including Arista EOS, NVIDIA Cumulus Linux, Nokia SR OS, and SR Linux
Collaborative, humble, and always ready to help others while staying open to learning from more senior colleagues

Preferred

College Education: Bachelor's degree in Computer Science or a related field
Machine Learning for Anomaly Detection: Hands-on experience applying ML techniques or tools (e.g., TensorFlow, scikit-learn) to proactively detect performance or security anomalies in network traffic
Network Certifications: Certifications like CCNA, CCNP, or similar
Advanced Metrics & Analytics: Hands-on experience with data pipelines, event correlation, or anomaly detection in large-scale environments
Distributed Tracing: Familiarity with OpenTelemetry, Jaeger, or Zipkin for end-to-end tracing across microservices and network components

Benefits

Medical, dental, and vision insurance - 100% paid for by CoreWeave
Company-paid Life Insurance
Voluntary supplemental life insurance
Short- and long-term disability insurance
Flexible Spending Account
Health Savings Account
Tuition Reimbursement
Ability to Participate in Employee Stock Purchase Program (ESPP)
Mental Wellness Benefits through Spring Health
Family-Forming support provided by Carrot
Paid Parental Leave
Flexible, full-service childcare support with Kinside
401(k) with a generous employer match
Flexible PTO
Catered lunch each day in our office and data center locations
A casual work environment
A work culture focused on innovative disruption

Company

CoreWeave

CoreWeave is a cloud-based AI infrastructure company offering GPU cloud services to simplify AI and machine learning workloads.

Funding

Current Stage: Public Company
Total Funding: $23.37B
Key Investors: Jane Street Capital, Stack Capital, Coatue
2025-12-08 · Post-IPO Debt · $2.54B
2025-11-12 · Post-IPO Debt · $1B
2025-08-20 · Post-IPO Secondary

Leadership Team

Michael Intrator, Chief Executive Officer
Nitin Agrawal, Chief Financial Officer
Company data provided by Crunchbase