OpenAI · 2 months ago
Site Reliability Engineer, Frontier Systems Infrastructure
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. The Site Reliability Engineer will operate and scale large Kubernetes clusters, automate infrastructure processes, and ensure the reliability of supercomputers used for model training.
Agentic AI · Artificial Intelligence (AI) · Foundational AI · Generative AI · Machine Learning · Natural Language Processing · SaaS
Responsibilities
Spin up and scale large Kubernetes clusters, including automation for provisioning, bootstrapping, and cluster lifecycle management
Build software abstractions that unify multiple clusters and present a seamless interface to training workloads
Own node bring-up from bare metal through firmware upgrades, ensuring fast, repeatable deployment at massive scale
Improve operational metrics such as reducing cluster restart times (e.g., from hours to minutes) and accelerating firmware or OS upgrade cycles
Integrate networking and hardware health systems to deliver end-to-end reliability across servers, switches, and data center infrastructure
Develop monitoring and observability systems to detect issues early and keep clusters stable under extreme load
Execute at the same level as a software engineer
Qualifications
Required
Experience as an infrastructure, systems, or distributed systems engineer in large-scale or high-availability environments
Strong knowledge of Kubernetes internals, cluster scaling patterns, and containerized workloads
Proficiency in cloud infrastructure concepts (compute, networking, storage, security) and in automating cluster or data center operations
Deep experience operating or scaling Kubernetes clusters or similar container orchestration systems in high-growth or hyperscale environments
Strong programming or scripting skills (Python, Go, or similar)
Familiarity with Infrastructure-as-Code tools such as Terraform or CloudFormation
Comfort with bare-metal Linux environments, GPU hardware, and large-scale networking
Enjoyment of solving fast-moving, high-impact operational problems and building automation to eliminate manual work
Ability to balance careful engineering with the urgency of keeping mission-critical systems running
Preferred
Background with GPU workloads
Firmware management
High-performance computing
Company
OpenAI
OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT. It is a sub-organization of the OpenAI Foundation.
H1B Sponsorship
OpenAI has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data powered by the US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship, highlighting the job field similar to this job]
Trends of Total Sponsorships
2025 (1)
2024 (1)
2023 (1)
2022 (18)
2021 (10)
2020 (6)
Funding
Current Stage: Growth Stage
Total Funding: $79B
Key Investors: The Walt Disney Company, SoftBank, Thrive Capital
2025-12-11 · Corporate Round · $1B
2025-10-02 · Secondary Market · $6.6B
2025-03-31 · Series Unknown · $40B
Company data provided by crunchbase