x.ai · 4 months ago
Site Reliability Engineer - Kubernetes Platform
xAI is focused on creating AI systems that enhance human understanding of the universe. They are seeking a highly skilled Site Reliability Engineer to design, build, and optimize Kubernetes clusters, ensuring the reliability and performance of their infrastructure for large-scale AI workloads.
Artificial Intelligence (AI) · Internet · Scheduling · Software · Virtual Assistant
Responsibilities
Develop and optimize software to provision and manage Kubernetes clusters on-premises, enabling xAI to scale efficiently
Enhance the reliability, performance, and cost-effectiveness of Kubernetes infrastructure to support large-scale AI and application workloads
Collaborate with xAI engineers to understand workload requirements and design tailored Kubernetes solutions to meet their needs
Implement robust observability, monitoring, and security practices to ensure the integrity, availability, and confidentiality of critical systems
Manage storage infrastructure using Infrastructure-as-Code (IaC) tools such as Pulumi, Terraform, or Ansible
Drive system reliability through incident management, postmortems, and the definition of clear SLAs and SLOs
Contribute to the Kubernetes stack, applying expertise in CNI, CRI, CSI, and related components
This is an in-person role based in Palo Alto, CA, with up to 25% travel required
Qualifications
Required
5+ years of experience as a Site Reliability Engineer or similar role, with a focus on building and maintaining reliable, scalable systems
Proven expertise in managing Kubernetes infrastructure using tools like Cluster API (CAPI) and kubeadm
Proficiency in managing storage infrastructure with IaC tools such as Pulumi, Terraform, or Ansible
Deep understanding of the Kubernetes stack, including CNI, CRI, CSI, and related components
Demonstrated ability to improve system reliability through incident management, postmortems, and defining SLAs/SLOs
Preferred
Experience with high-traffic web or mobile application workloads, including optimizing Kubernetes for large-scale deployments
Familiarity with chaos engineering, capacity planning, or similar practices for ensuring system resilience
Proficiency with tools such as Kyverno and ArgoCD, or with Go programming, for infrastructure automation
Strong sense of ownership, curiosity, and enthusiasm for tackling complex technical challenges
Passion for problem-solving and a proactive drive to deliver impactful results
A sense of adventure and humor to navigate challenges with a positive mindset
Benefits
Equity
Comprehensive medical, vision, and dental coverage
Access to a 401(k) retirement plan
Short- and long-term disability insurance
Life insurance
Various other discounts and perks
Company
x.ai
x.ai is a tool that helps you and your team share ideal availability and schedule meetings.
H1B Sponsorship
x.ai has a track record of offering H1B sponsorships. Please note that this does not
guarantee sponsorship for this specific role. Additional information is provided below
for your reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (35)
2024 (9)
2023 (2)
Funding
Current Stage: Growth Stage
Total Funding: $44.29M
Key Investors: Pegasus Tech Ventures, Two Sigma Ventures, FirstMark
2021-06-03 · Acquired
2017-08-14 · Series B · $10M
2016-04-07 · Series B · $23M
Company data provided by Crunchbase