xAI · 2 months ago
Site Reliability Engineer - Hardware Specialist
xAI is focused on creating AI systems that understand the universe and aid humanity. The Site Reliability Engineer - Hardware Specialist will be responsible for ensuring hardware reliability, managing vendor relations, and resolving hardware issues to support datacenter operations.
Artificial Intelligence (AI) · Foundational AI · Generative AI · Information Technology · Machine Learning
Responsibilities
Analyze firmware packages and hardware specifications for upcoming releases to ensure compatibility, performance, and reliability in xAI's datacenter environment
Investigate and diagnose hardware failures, including "grey failures" (ambiguous or intermittent issues), proving them to be true hardware defects through rigorous testing and data analysis
Manage vendor relationships, including initiating RMA (Return Merchandise Authorization) claims, negotiating beyond standard processes when necessary, and holding vendors accountable for resolutions
Collaborate with Datacenter Operations Technicians to troubleshoot, repair, and optimize hardware systems in real time
Research and evaluate unreleased next-generation hardware technologies, providing insights and recommendations to inform xAI's infrastructure roadmap
Develop and implement monitoring tools, scripts, and processes to detect hardware anomalies early and minimize downtime
Document failure modes, RMA outcomes, and hardware evaluations to build a knowledge base for the team
Participate in on-call rotations and incident response for hardware-related issues in the Memphis datacenter
Qualifications
Required
Bachelor's degree in Systems Engineering, Electrical Engineering, Computer Science, or a related field (or equivalent experience)
5+ years of experience in hardware reliability engineering, preferably in high-performance computing or datacenter environments
Proven expertise in firmware analysis, hardware specifications review, and release validation
Strong experience with RMA processes, including filing claims, vendor negotiations, and pushing for resolutions outside standard protocols
Demonstrated ability to diagnose and prove complex hardware failures, including grey or intermittent issues, using tools such as logic analyzers or diagnostic software
Familiarity with datacenter hardware components (e.g., servers, GPUs, networking equipment) and emerging technologies
Proficiency in scripting languages (e.g., Python, Bash) for automation and analysis
Excellent problem-solving skills with a data-driven approach to reliability engineering
Ability to work collaboratively with cross-functional teams, including operations technicians
Preferred
Experience in AI/ML infrastructure or supercomputing environments
Knowledge of vendor ecosystems (e.g., NVIDIA, Dell, HP, Supermicro) and supply chain management
Certifications in hardware engineering or reliability (e.g., CRE, CompTIA Server+)
Prior work at a fast-paced startup or tech company like xAI
Company
xAI
xAI is an artificial intelligence startup that develops AI solutions and tools to enhance reasoning and search capabilities.
H1B Sponsorship
xAI has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship. Highlighted segment represents the job field similar to this job.]
Trends of Total Sponsorships: 2025 (1)
Funding
Current Stage: Late Stage
Total Funding: $42.73B
Key Investors: Neptune Digital Assets, SpaceX, Morgan Stanley
2026-01-06 · Series E · $20B
2025-12-11 · Secondary Market · $0.3M
2025-07-13 · Corporate Round · $5.32B
Company data provided by Crunchbase