Infrastructure Reliability Engineer, Bare Metal jobs in United States

CoreWeave · 23 hours ago

Infrastructure Reliability Engineer, Bare Metal

CoreWeave is The Essential Cloud for AI™, dedicated to providing a reliable and high-performance experience for clients running AI workloads at scale. The Infrastructure Reliability Engineer, Bare Metal will ensure the stability and performance of the bare metal infrastructure, collaborating with engineering teams to drive automation and improve operational strategies.

AI Infrastructure, Artificial Intelligence (AI), Cloud Computing, Cloud Infrastructure, Information Technology, Machine Learning
No H1B · U.S. Citizen Only

Responsibilities

Provide expert-level technical support and in-depth troubleshooting for a wide spectrum of hardware and associated software issues, encompassing server malfunctions, network outages, and performance degradations
Manage the lifecycle of our bare metal infrastructure, including overseeing deployment methodologies, executing maintenance procedures, coordinating upgrades, and managing hardware retirement processes
Architect and implement automation solutions through scripting and tooling to streamline repetitive operational tasks, enhance overall efficiency, and minimize manual intervention across the infrastructure
Lead the development and refinement of critical operational processes, comprehensive technical documentation (SOPs, TSGs, runbooks), and the establishment of engineering best practices to bolster team effectiveness and infrastructure resilience
Engage in close collaboration with Software, Network, and Data Center Operations Engineering teams to facilitate effective issue resolution, contribute to strategic project planning, and ensure the cohesive operation of the entire infrastructure ecosystem
Serve as a key technical point of contact for hardware and software vendors, managing technical support engagements, overseeing the RMA process, and driving the resolution of complex hardware-centric challenges
Design, deploy, and maintain sophisticated monitoring and alerting frameworks to proactively identify and mitigate potential infrastructure anomalies and performance deviations
Participate actively in incident response protocols, conduct thorough root cause analyses (RCAs) for infrastructure events, and contribute to problem management strategies aimed at preventing future occurrences
Contribute technical expertise to, and potentially lead, infrastructure-focused projects, including new hardware deployments, critical system upgrades, and the integration of new operational tooling
Mentor and guide junior engineering team members, fostering technical growth and contributing to the development of internal knowledge resources and training programs
Maintain the integrity of hardware asset tracking and related data within our infrastructure inventory systems (e.g., Snipe-IT)
Adhere to and promote stringent security protocols and best practices related to infrastructure access and maintenance activities

Qualifications

Bare metal infrastructure, Linux system administration, Automation solutions, Infrastructure monitoring tools, Scripting language (Python), Technical documentation, Problem-solving skills, Analytical skills, Communication skills, Collaboration abilities

Required

Bachelor's degree in Computer Science, Electrical Engineering, or related technical discipline
5+ years of experience in hands-on management and support of complex bare metal infrastructure environments and data center operations
Comprehensive understanding of modern server hardware architectures, including specialized compute accelerators (GPUs) and high-speed interconnect technologies from leading high-performance computing vendors such as NVIDIA, Dell, or HPE
Demonstrated expertise in Linux system administration, encompassing deep familiarity with command-line operations and system configuration
Proficiency in at least one high-level scripting language (e.g., Python) and practical experience with infrastructure and/or network automation tools, methodologies, and frameworks (e.g., Ansible)
Extensive experience with modern infrastructure monitoring and logging tools such as Prometheus, Grafana, and the ELK stack (Elasticsearch, Logstash, Kibana)
Working knowledge of enterprise ticketing systems (e.g., Jira) and an understanding of IT Service Management (ITSM) frameworks and best practices
Strong analytical and problem-solving skills, with the ability to systematically diagnose and resolve complex technical issues
Excellent communication and collaboration abilities, with experience working effectively across multidisciplinary technical teams
Self-motivated and proactive, with a demonstrated sense of ownership and a commitment to ensuring infrastructure reliability and performance
Proven ability to manage multiple tasks and priorities effectively in a fast-paced and dynamic environment

Preferred

You're curious about Kubernetes, Docker, and containerized infrastructure
You have strong problem-solving skills with a proactive and analytical mindset
You have excellent communication skills and a demonstrated ability to work collaboratively in a fast-paced environment

Benefits

Medical, dental, and vision insurance - 100% paid for by CoreWeave
Company-paid Life Insurance
Voluntary supplemental life insurance
Short- and long-term disability insurance
Flexible Spending Account
Health Savings Account
Tuition Reimbursement
Ability to participate in the Employee Stock Purchase Program (ESPP)
Mental Wellness Benefits through Spring Health
Family-Forming support provided by Carrot
Paid Parental Leave
Flexible, full-service childcare support with Kinside
401(k) with a generous employer match
Flexible PTO
Catered lunch each day in our office and data center locations
A casual work environment
A work culture focused on innovative disruption

Company

CoreWeave

CoreWeave is a cloud-based AI infrastructure company offering GPU cloud services to simplify AI and machine learning workloads.

Funding

Current Stage: Public Company
Total Funding: $23.37B
Key Investors: Jane Street Capital, Stack Capital, Coatue
2025-12-08 · Post-IPO Debt · $2.54B
2025-11-12 · Post-IPO Debt · $1B
2025-08-20 · Post-IPO Secondary

Leadership Team

Michael Intrator
Chief Executive Officer
Nitin Agrawal
Chief Financial Officer
Company data provided by Crunchbase