Site Reliability Engineer - Kubernetes Platform jobs in United States

xAI · 2 months ago

Site Reliability Engineer - Kubernetes Platform

xAI is a mission-driven company focused on creating AI systems that aid humanity in its pursuit of knowledge. It is seeking a highly skilled Site Reliability Engineer to design, build, and optimize Kubernetes clusters, improving the reliability and performance of its infrastructure for large-scale AI workloads.

Artificial Intelligence (AI) · Foundational AI · Generative AI · Information Technology · Machine Learning
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Develop and optimize software to provision and manage Kubernetes clusters on-premises, enabling xAI to scale efficiently
Enhance the reliability, performance, and cost-effectiveness of Kubernetes infrastructure to support large-scale AI and application workloads
Collaborate with xAI engineers to understand workload requirements and design tailored Kubernetes solutions to meet their needs
Implement robust observability, monitoring, and security practices to ensure the integrity, availability, and confidentiality of critical systems
Manage storage infrastructure using Infrastructure-as-Code (IaC) tools such as Pulumi, Terraform, or Ansible
Drive system reliability through incident management, postmortems, and the definition of clear SLAs and SLOs
Contribute to the Kubernetes stack, including expertise in CNI, CRI, CSI, and related components

Qualifications

Kubernetes management · Infrastructure-as-Code (IaC) · Incident management · Kubernetes stack expertise · Automation tools proficiency · Curiosity · Communication skills · Problem-solving · Ownership

Required

5+ years of experience as a Site Reliability Engineer or similar role, with a focus on building and maintaining reliable, scalable systems
Proven expertise in managing Kubernetes infrastructure using tools like Cluster API (CAPI) and kubeadm
Proficiency in managing storage infrastructure with IaC tools such as Pulumi, Terraform, or Ansible
Deep understanding of the Kubernetes stack, including CNI, CRI, CSI, and related components
Demonstrated ability to improve system reliability through incident management, postmortems, and defining SLAs/SLOs

Preferred

Experience with high-traffic web or mobile application workloads, including optimizing Kubernetes for large-scale deployments
Familiarity with chaos engineering, capacity planning, or similar practices for ensuring system resilience
Proficiency with tools such as Kyverno, ArgoCD, or Go programming for infrastructure automation
Strong sense of ownership, curiosity, and enthusiasm for tackling complex technical challenges
Passion for problem-solving and a proactive drive to deliver impactful results
A sense of adventure and humor to navigate challenges with a positive mindset

Benefits

Equity
Comprehensive medical, vision, and dental coverage
Access to a 401(k) retirement plan
Short & long-term disability insurance
Life insurance
Various other discounts and perks

Company

xAI

xAI is an artificial intelligence startup that develops AI solutions and tools to enhance reasoning and search capabilities.

H1B Sponsorship

xAI has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart not shown)
Trends of Total Sponsorships: 2025 (1)

Funding

Current Stage: Late Stage
Total Funding: $42.73B
Key Investors: Neptune Digital Assets, SpaceX, Morgan Stanley
2026-01-06 · Series E · $20B
2025-12-11 · Secondary Market · $0.3M
2025-07-13 · Corporate Round · $5.32B

Leadership Team

Toby Pohlen
Founding Member
Company data provided by Crunchbase