
x.ai · 2 months ago

Site Reliability Engineer - Automation

xAI's mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. As a Site Reliability Engineer in Automation, you will focus on automating firmware upgrades and enhancing datacenter efficiency while supporting scalable AI infrastructure. Your role involves scripting solutions, identifying operational issues, and collaborating with technicians to optimize processes.

Artificial Intelligence (AI), Internet, Scheduling, Software, Virtual Assistant
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Develop and maintain scripts in Python and Bash for handling firmware packages, performing upgrades, and automating the entire process across Linux and Kubernetes environments
Work with hardware from vendors such as NVIDIA, Dell, Supermicro, and HP to ensure seamless firmware integration, testing, and deployment in the datacenter
Identify operational problems in real time, design automated fixes or workflows to resolve them, and implement scalable solutions to prevent recurrence
Collaborate with Datacenter Operations Technicians to deploy automation tools, troubleshoot firmware-related issues, and optimize processes for high-availability systems
Integrate automation scripts into CI/CD pipelines or orchestration tools like Kubernetes for efficient scaling and management
Monitor and refine automated processes, ensuring they align with datacenter reliability goals and minimize downtime
Document automation scripts, firmware upgrade procedures, and problem-solving approaches to build a reusable knowledge base for the team
Participate in on-call rotations and incident response, applying automation to accelerate resolutions in the Memphis datacenter

Qualifications

Python, Bash, Linux, Kubernetes, Firmware management, Problem-solving, Collaboration, High-performance computing, AI infrastructure, Ansible, Terraform, ArgoCD

Required

Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience)
5+ years of experience in site reliability engineering or automation roles, preferably in datacenter or cloud environments
Proficiency in Python, Bash, Linux, and Kubernetes for scripting, automation, and orchestration
Hands-on experience with firmware packages, including writing scripts for upgrades and automating deployment processes
Familiarity with hardware from vendors like NVIDIA, Dell, Supermicro, and HP, including integration and troubleshooting in production settings
Strong problem-solving skills with a proven ability to identify issues and automate fixes to improve system efficiency
Experience in high-performance computing or AI infrastructure environments
Excellent collaboration skills for working with cross-functional teams in fast-paced settings

Preferred

Experience automating firmware management in large-scale datacenters or supercomputing clusters
Knowledge of additional tools such as Ansible, Terraform, and ArgoCD, or of containerization tools, for enhanced automation
Prior work in a startup or tech company like xAI, with contributions to scalable automation systems

Company

x.ai

x.ai is a tool that helps you and your team share ideal availability and schedule meetings.

H1B Sponsorship

x.ai has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship; the highlighted segment represents job fields similar to this job]
Trends of Total Sponsorships
2025 (35)
2024 (9)
2023 (2)

Funding

Current Stage: Growth Stage
Total Funding: $44.29M
Key Investors: Pegasus Tech Ventures, Two Sigma Ventures, FirstMark
2021-06-03 · Acquired
2017-08-14 · Series B · $10M
2016-04-07 · Series B · $23M
Company data provided by Crunchbase