x.ai · 2 months ago
Site Reliability Engineer - Storage
xAI is focused on creating AI systems that enhance humanity's understanding of the universe. The Site Reliability Engineer - Storage will ensure the reliability and performance of large-scale storage infrastructure, collaborating with various teams to optimize storage for AI workloads.
Artificial Intelligence (AI) · Internet · Scheduling · Software · Virtual Assistant
Responsibilities
Deploy, maintain, and scale exabyte-scale storage clusters with a focus on observability, zero-downtime upgrades, and integration with high-density GPU environments
Troubleshoot production storage issues across hardware-software stacks (NVMe/PCIe/RDMA paths, firmware bugs, BMC logs, disk failures); perform root cause analysis and automate preventive measures
Collaborate with storage teams to validate server specs, debug field problems and influence custom designs with vendors for cutting-edge AI storage
Evaluate and onboard new storage vendors and technologies; benchmark for cost, density and GPU-direct performance against AI training I/O patterns
Support storage SDEs by translating engineering requirements into reliable, observable systems; develop scripting and playbooks to reduce toil and enable self-service
Lead hardware refreshes for legacy X storage fleets, including migration, decommissioning, and designing repeatable processes for customized solutions
Participate in on-call rotations (follow-the-sun, generous stipend) for storage domains; respond to incidents, drive post-mortems, and forecast capacity for EiB+ growth
Create and maintain documentation, standard operating procedures, and monitoring for storage health in massive-scale AI pipelines
Qualifications
Required
Bachelor's degree in Computer Science, Engineering, or a related technical field (or equivalent experience)
3+ years in site reliability engineering, systems engineering, or storage operations at multi-PB+ scale
Hands-on experience with storage systems from vendors such as VAST, DDN, and Dell; parallel filesystems (such as Lustre, GPFS, Weka); and Linux storage stacks (kernel tuning, eBPF, blktrace, NVMe/RDMA/RoCE)
Proficiency in scripting for automation (Python/Bash); light programming experience (Go nice-to-have) but emphasis on operational clarity over heavy coding
Strong troubleshooting skills across storage hardware (e.g., hard drives, SSDs, NVMe drives, drive enclosures, and their software and firmware) and vendor qualification/refresh cycles
Experience with incident response, including on-call rotations, rapid resolution, root cause analysis, and implementation of preventative measures
Basic hardware knowledge for storage bring-up and debugging in data center environments
Excellent communication and documentation skills, with the ability to share knowledge concisely and accurately
Company
x.ai
x.ai is a tool that helps you and your team share ideal availability and schedule meetings.
H1B Sponsorship
x.ai has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart not shown; its legend marks job fields similar to this job)
Trends of Total Sponsorships
2025 (35)
2024 (9)
2023 (2)
Funding
Current Stage: Growth Stage
Total Funding: $44.29M
Key Investors: Pegasus Tech Ventures, Two Sigma Ventures, FirstMark
2021-06-03 · Acquired
2017-08-14 · Series B · $10M
2016-04-07 · Series B · $23M
Company data provided by Crunchbase