Software Engineer - Reliability jobs in United States

x.ai · 5 months ago

Software Engineer - Reliability

xAI is dedicated to creating AI systems that enhance human understanding of the universe. The Software Engineer - Reliability will focus on ensuring the reliability, scalability, and performance of the high-performance computing infrastructure that supports AI research, collaborating with cross-functional teams to optimize system performance and develop automation tools.

Artificial Intelligence (AI) · Internet · Scheduling · Software · Virtual Assistant
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Design, implement, and maintain robust, scalable infrastructure for supercomputing environments
Monitor and optimize system performance, ensuring high availability and minimal downtime
Develop automation tools and scripts to streamline operations and improve system reliability
Troubleshoot complex issues across distributed systems, networks, and storage solutions
Collaborate with AI researchers and engineers to support compute-intensive workloads
Implement security best practices to protect sensitive data and infrastructure
Contribute to capacity planning and disaster recovery strategies
Participate in an on-call rotation to ensure 24/7 system reliability

Qualifications

Site Reliability Engineering · Linux Administration · Containerization · Cloud Platforms · Networking · Distributed Systems · Storage Technologies · HPC Environments · Infrastructure as Code · Monitoring Tools · Scripting · Programming Languages · Problem-Solving Skills · Communication Skills · Collaboration

Required

Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience)
3+ years of experience in site reliability engineering, DevOps, or systems engineering
Proficiency in Linux system administration and scripting (e.g., Python, Bash)
Experience with containerization (e.g., Docker, Kubernetes) and cloud platforms (e.g., AWS, GCP, Azure)
Strong understanding of networking, distributed systems, and storage technologies
Familiarity with HPC environments, GPU clusters, or large-scale data processing
Excellent problem-solving skills and ability to work in a fast-paced, dynamic environment
Strong communication skills and a collaborative mindset

Preferred

Experience with Infrastructure as Code (e.g., Terraform, Ansible) or monitoring tools (e.g., Prometheus, Grafana)

Benefits

Equity
Comprehensive medical, vision, and dental coverage
Access to a 401(k) retirement plan
Short & long-term disability insurance
Life insurance
Various other discounts and perks

Company

x.ai

x.ai is a tool that helps you and your team share ideal availability and schedule meetings.

H1B Sponsorship

x.ai has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data provided by the US Department of Labor.)
[Chart: Distribution of Different Job Fields Receiving Sponsorship]
Trends of Total Sponsorships
2025 (35)
2024 (9)
2023 (2)

Funding

Current Stage: Growth Stage
Total Funding: $44.29M
Key Investors: Pegasus Tech Ventures, Two Sigma Ventures, FirstMark
2021-06-03 · Acquired
2017-08-14 · Series B · $10M
2016-04-07 · Series B · $23M
Company data provided by Crunchbase