xAI · 2 days ago
Site Reliability Engineer - Storage
xAI is a company dedicated to creating AI systems that enhance human understanding of the universe. As a Site Reliability Engineer - Storage, you will ensure the reliability and performance of large-scale storage infrastructure while collaborating with various engineering teams to optimize storage solutions for AI workloads.
Artificial Intelligence (AI) · Foundational AI · Generative AI · Information Technology · Machine Learning
Responsibilities
Deploy, maintain, and scale exabyte-scale storage clusters with a focus on observability, zero-downtime upgrades, and integration with high-density GPU environments
Troubleshoot production storage issues across hardware-software stacks (NVMe/PCIe/RDMA paths, firmware bugs, BMC logs, disk failures), performing root cause analysis and automating preventive measures
Collaborate with storage teams to validate server specs, debug field problems and influence custom designs with vendors for cutting-edge AI storage
Evaluate and onboard new storage vendors and technologies; benchmark for cost, density and GPU-direct performance against AI training I/O patterns
Support storage SDEs by translating engineering requirements into reliable, observable systems; develop scripting and playbooks to reduce toil and enable self-service
Lead hardware refreshes for legacy X storage fleets, including migration, decommissioning, and designing repeatable processes for customized solutions
Participate in on-call rotations (follow-the-sun, generous stipend) for storage domains; respond to incidents, drive post-mortems, and forecast capacity for EiB+ growth
Create and maintain documentation, standard operating procedures, and monitoring for storage health in massive-scale AI pipelines
Qualifications
Required
Bachelor's degree in Computer Science, Engineering, or a related technical field (or equivalent experience)
3+ years in site reliability engineering, systems engineering, or storage operations at multi-PB+ scale
Hands-on experience with storage systems from vendors such as VAST, DDN, and Dell; parallel filesystems (such as Lustre, GPFS, Weka); and Linux storage stacks (kernel tuning, eBPF, blktrace, NVMe/RDMA/RoCE)
Proficiency in scripting for automation (Python/Bash); light programming experience (Go is a nice-to-have), with an emphasis on operational clarity over heavy coding
Strong troubleshooting skills across storage hardware, software, and firmware (e.g., hard drives, SSDs, NVMe drives, drive enclosures), plus experience with vendor qualification/refresh cycles
Experience with incident response, including on-call rotations, rapid resolution, root cause analysis, and implementation of preventative measures
Basic hardware knowledge for storage bring-up and debugging in data center environments
Excellent communication and documentation skills, with the ability to share knowledge concisely and accurately
Preferred
Light programming experience (Go is a nice-to-have)
Benefits
Generous stipend for follow-the-sun on-call rotations
Company
xAI
xAI is an artificial intelligence startup that develops AI solutions and tools to enhance reasoning and search capabilities.
H1B Sponsorship
xAI has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart; the highlighted field represents job fields similar to this job)
Trends of Total Sponsorships
2025 (1)
Funding
Current Stage: Late Stage
Total Funding: $42.73B
Key Investors: Neptune Digital Assets, SpaceX, Morgan Stanley
2026-01-06 · Series E · $20B
2025-12-11 · Secondary Market · $0.3M
2025-07-13 · Corporate Round · $5.32B
Company data provided by Crunchbase