CoreWeave · 1 day ago
Production Engineer
CoreWeave is The Essential Cloud for AI™, providing a platform of technology and tools for innovators. The Production Engineer will be responsible for maintaining the reliability of CoreWeave’s cloud infrastructure, supporting incident response, and contributing to operational improvements.
Artificial Intelligence (AI) · Cloud Computing · Cloud Infrastructure · Information Technology · Machine Learning
Responsibilities
Assist in incident response efforts by helping identify and resolve service disruptions quickly, working under the guidance of more senior engineers
Help document incidents, assist with root cause analysis (RCA), and support post-incident reviews (PIRs) to identify lessons learned
Contribute to the development and maintenance of incident response playbooks to ensure preparedness for various failure scenarios
Participate in communication efforts during incidents, updating stakeholders and keeping clear records of incident activities
Monitor system performance and health using tools like Prometheus and Grafana, identifying any performance issues or potential incidents
Help implement automation and process improvements to enhance efficiency and reduce manual intervention in incident detection and recovery
Support the development of KPIs and SLAs for incident management and ensure alignment with team goals
Collaborate with engineers across teams to improve platform reliability, resilience, and disaster recovery
Work closely with other engineers to troubleshoot system issues, refine workflows, and support ongoing operational needs
Participate in knowledge-sharing activities, helping improve team processes and learning from senior team members
Take part in training and mentorship opportunities to build technical skills and grow into more advanced responsibilities within the team
Qualifications
Required
4 years of experience in cloud operations, site reliability engineering (SRE), or related technical roles
Understanding of cloud platforms (e.g., Kubernetes, AWS, GCP) and basic knowledge of cloud infrastructure
Familiarity with incident management practices and frameworks (e.g., ITIL, SRE best practices)
Experience with monitoring and alerting tools (e.g., Prometheus, Grafana) or willingness to learn
Basic experience with scripting or automation tools (e.g., Python, Bash, Terraform, Ansible)
Strong communication skills, with the ability to explain technical concepts clearly and concisely to both technical and non-technical team members
Ability to work in a fast-paced, high-pressure environment while learning and adapting quickly
Preferred
Exposure to Kubernetes, containerization, and distributed systems
Familiarity with change management processes and post-incident analysis
Experience with automated systems or self-healing infrastructure is a plus
A desire to learn and grow in the areas of cloud operations, reliability engineering, and incident management
Benefits
Medical, dental, and vision insurance - 100% paid for by CoreWeave
Company-paid Life Insurance
Voluntary supplemental life insurance
Short-term and long-term disability insurance
Flexible Spending Account
Health Savings Account
Tuition Reimbursement
Ability to Participate in Employee Stock Purchase Program (ESPP)
Mental Wellness Benefits through Spring Health
Family-Forming support provided by Carrot
Paid Parental Leave
Flexible, full-service childcare support with Kinside
401(k) with a generous employer match
Flexible PTO
Catered lunch each day in our office and data center locations
A casual work environment
A work culture focused on innovative disruption
Company
CoreWeave
CoreWeave is a cloud-based AI infrastructure company offering GPU cloud services to simplify AI and machine learning workloads.
Funding
Current Stage: Public Company
Total Funding: $23.37B
Key Investors: Jane Street Capital, Stack Capital, Coatue
2025-12-08 · Post-IPO Debt · $2.54B
2025-11-12 · Post-IPO Debt · $1B
2025-08-20 · Post-IPO Secondary
Company data provided by Crunchbase