Site Reliability Engineer - Hardware Specialist jobs in United States

xAI · 2 months ago

Site Reliability Engineer - Hardware Specialist

xAI is focused on creating AI systems that understand the universe and aid humanity. The Site Reliability Engineer - Hardware Specialist will be responsible for ensuring hardware reliability, managing vendor relations, and resolving hardware issues to support datacenter operations.

Artificial Intelligence (AI) · Foundational AI · Generative AI · Information Technology · Machine Learning
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Analyze firmware packages and hardware specifications for upcoming releases to ensure compatibility, performance, and reliability in xAI's datacenter environment
Investigate and diagnose hardware failures, including "grey failures" (ambiguous or intermittent issues), and prove through rigorous testing and data analysis that they are true hardware defects
Manage vendor relationships, including initiating RMA (Return Merchandise Authorization) claims, negotiating beyond standard processes when necessary, and holding vendors accountable for resolutions
Collaborate with Datacenter Operations Technicians to troubleshoot, repair, and optimize hardware systems in real time
Research and evaluate next-generation hardware technologies that are not yet released, providing insights and recommendations to inform xAI's infrastructure roadmap
Develop and implement monitoring tools, scripts, and processes to detect hardware anomalies early and minimize downtime
Document failure modes, RMA outcomes, and hardware evaluations to build a knowledge base for the team
Participate in on-call rotations and incident response for hardware-related issues in the Memphis datacenter

Qualifications

Hardware reliability engineering · Firmware analysis · Vendor negotiations · Diagnostic software · Scripting languages · Datacenter hardware components · Emerging technologies · Certifications in hardware engineering · Problem-solving skills · Collaboration skills

Required

Bachelor's degree in Systems Engineering, Electrical Engineering, Computer Science, or a related field (or equivalent experience)
5+ years of experience in hardware reliability engineering, preferably in high-performance computing or datacenter environments
Proven expertise in firmware analysis, hardware specifications review, and release validation
Strong experience with RMA processes, including filing claims, vendor negotiations, and pushing for resolutions outside standard protocols
Demonstrated ability to diagnose and prove complex hardware failures, including grey or intermittent issues, using tools, logic analyzers, or diagnostic software
Familiarity with datacenter hardware components (e.g., servers, GPUs, networking equipment) and emerging technologies
Proficiency in scripting languages (e.g., Python, Bash) for automation and analysis
Excellent problem-solving skills with a data-driven approach to reliability engineering
Ability to work collaboratively with cross-functional teams, including operations technicians

Preferred

Experience in AI/ML infrastructure or supercomputing environments
Knowledge of vendor ecosystems (e.g., NVIDIA, Dell, HP, Supermicro) and supply chain management
Certifications in hardware engineering or reliability (e.g., CRE, CompTIA Server+)
Prior work in a fast-paced startup or tech company like xAI

Company

xAI

xAI is an artificial intelligence startup that develops AI solutions and tools to enhance reasoning and search capabilities.

H1B Sponsorship

xAI has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship]
[Chart: Trends of Total Sponsorships] 2025 (1)

Funding

Current Stage: Late Stage
Total Funding: $42.73B
Key Investors: Neptune Digital Assets, SpaceX, Morgan Stanley
2026-01-06 · Series E · $20B
2025-12-11 · Secondary Market · $0.3M
2025-07-13 · Corporate Round · $5.32B

Leadership Team

Toby Pohlen
Founding Member
Company data provided by crunchbase