Site Reliability Engineer - xAI Technical Operations jobs in United States

xAI · 2 weeks ago

Site Reliability Engineer - xAI Technical Operations

xAI is dedicated to creating AI systems that aid humanity in understanding the universe. They are seeking Site Reliability Engineers to ensure the availability and reliability of their infrastructure and core services, focusing on incident management and performance metrics in distributed environments.

Artificial Intelligence (AI) · Foundational AI · Generative AI · Information Technology · Machine Learning
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Setting technical strategy and roadmap for infrastructure availability
Automating monitoring, alerting, and troubleshooting for high-availability services, and working with legacy systems to scale, improve, or deprecate them
Owning incident response and problem management, and conducting thorough root cause analyses (RCAs) to prevent recurrence and drive continuous improvement
Analyzing performance metrics and service health to identify, resolve, and mitigate bottlenecks or failures in distributed environments
Ensuring security, scalability, and resilience of production infrastructure supporting AI workloads

Qualifications

Site Reliability Engineering · Distributed Systems · Cloud Infrastructure · Incident Management · Python · Containerization · Networking Knowledge · Problem Solving · Communication Skills · Team Collaboration

Required

A minimum of 5 years of software, systems or reliability engineering experience
Experience managing services in distributed, internet-scale environments, including on-prem and cloud (e.g., AWS, GCP)
Development experience in Python, Scala, Java, C, or C++
Demonstrable knowledge of TCP/IP, HTTP, networking, and systems programming (e.g., bash and shell tools)
Familiarity with containerization and orchestration tools (e.g., Kubernetes, Docker, Mesos) and systems management (e.g., Puppet, Chef, Ansible)
Bachelor's degree in Computer Science, Electrical Engineering, or a related field (or equivalent experience)

Preferred

Experience in on-call rotations and incident response in high-stakes environments
Experience with AI/ML infrastructure and large-scale GPU clusters
Strong problem-solving skills and ability to thrive in a fast-paced, ambiguous setting
Comfortable with deployment, support, monitoring, administration, and troubleshooting across on-prem, cloud and hybrid infrastructures
Proven understanding of systems and application design, including operational trade-offs

Benefits

Equity
Comprehensive medical, vision, and dental coverage
Access to a 401(k) retirement plan
Short- and long-term disability insurance
Life insurance
Various other discounts and perks

Company

xAI

xAI is an artificial intelligence startup that develops AI solutions and tools to enhance reasoning and search capabilities.

H1B Sponsorship

xAI has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference (data powered by the US Department of Labor).
Trends of Total Sponsorships: 2025 (1)

Funding

Current Stage: Late Stage
Total Funding: $42.73B
Key Investors: Neptune Digital Assets, SpaceX, Morgan Stanley
2026-01-06 · Series E · $20B
2025-12-11 · Secondary Market · $0.3M
2025-07-13 · Corporate Round · $5.32B

Leadership Team

Toby Pohlen
Founding Member
Company data provided by Crunchbase