Site Reliability Engineer - Automation Specialist jobs in United States

xAI · 2 months ago

Site Reliability Engineer - Automation Specialist

xAI is dedicated to creating AI systems that enhance human understanding. The Site Reliability Engineer - Automation Specialist will automate firmware upgrades and develop scripting solutions while improving datacenter efficiency and supporting scalable AI infrastructure.

Artificial Intelligence (AI), Foundational AI, Generative AI, Information Technology, Machine Learning
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Develop and maintain scripts in Python and Bash for handling firmware packages, performing upgrades, and automating the entire process across Linux and Kubernetes environments
Work with hardware from vendors such as NVIDIA, Dell, Supermicro, and HP to ensure seamless firmware integration, testing, and deployment in the datacenter
Identify operational problems in real time, design automated fixes or workflows to resolve them, and implement scalable solutions to prevent recurrence
Collaborate with Datacenter Operations Technicians to deploy automation tools, troubleshoot firmware-related issues, and optimize processes for high-availability systems
Integrate automation scripts into CI/CD pipelines or orchestration tools like Kubernetes for efficient scaling and management
Monitor and refine automated processes, ensuring they align with datacenter reliability goals and minimize downtime
Document automation scripts, firmware upgrade procedures, and problem-solving approaches to build a reusable knowledge base for the team
Participate in on-call rotations and incident response, applying automation to accelerate resolutions in the Memphis datacenter

Qualifications

Python, Bash, Linux, Kubernetes, Firmware management, Problem-solving, Collaboration, High-performance computing, AI infrastructure, Vendor hardware integration, Automation tools

Required

Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience)
5+ years of experience in site reliability engineering or automation roles, preferably in datacenter or cloud environments
Proficiency in Python, Bash, Linux, and Kubernetes for scripting, automation, and orchestration
Hands-on experience with firmware packages, including writing scripts for upgrades and automating deployment processes
Familiarity with hardware from vendors like NVIDIA, Dell, Supermicro, and HP, including integration and troubleshooting in production settings
Strong problem-solving skills with a proven ability to identify issues and automate fixes to improve system efficiency
Experience in high-performance computing or AI infrastructure environments
Excellent collaboration skills for working with cross-functional teams in fast-paced settings

Preferred

Experience automating firmware management in large-scale datacenters or supercomputing clusters
Knowledge of additional tools such as Ansible, Terraform, ArgoCD, or other containerization tools for enhanced automation
Prior work in a startup or tech company like xAI, with contributions to scalable automation systems

Company

xAI

xAI is an artificial intelligence startup that develops AI solutions and tools to enhance reasoning and search capabilities.

H1B Sponsorship

xAI has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference. (Data Powered by US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship
Trends of Total Sponsorships: 2025 (1)

Funding

Current Stage: Late Stage
Total Funding: $42.73B
Key Investors: Neptune Digital Assets, SpaceX, Morgan Stanley
2026-01-06: Series E · $20B
2025-12-11: Secondary Market · $0.3M
2025-07-13: Corporate Round · $5.32B

Leadership Team

Toby Pohlen
Founding Member
Company data provided by Crunchbase