Site Reliability Engineer, Frontier Systems Infrastructure jobs in United States

OpenAI · 2 months ago

Site Reliability Engineer, Frontier Systems Infrastructure

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. The Site Reliability Engineer will operate and scale large Kubernetes clusters, automate infrastructure processes, and ensure the reliability of supercomputers used for model training.

Agentic AI · Artificial Intelligence (AI) · Foundational AI · Generative AI · Machine Learning · Natural Language Processing · SaaS
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Spin up and scale large Kubernetes clusters, including automation for provisioning, bootstrapping, and cluster lifecycle management
Build software abstractions that unify multiple clusters and present a seamless interface to training workloads
Own node bring-up from bare metal through firmware upgrades, ensuring fast, repeatable deployment at massive scale
Improve operational metrics such as reducing cluster restart times (e.g., from hours to minutes) and accelerating firmware or OS upgrade cycles
Integrate networking and hardware health systems to deliver end-to-end reliability across servers, switches, and data center infrastructure
Develop monitoring and observability systems to detect issues early and keep clusters stable under extreme load
Execute at the same level as a software engineer

Qualifications

Kubernetes · Distributed systems · Python · Infrastructure-as-Code · Linux · Cloud infrastructure · GPU workloads · Firmware management · High-performance computing · Automation

Required

Experience as an infrastructure, systems, or distributed systems engineer in large-scale or high-availability environments
Strong knowledge of Kubernetes internals, cluster scaling patterns, and containerized workloads
Proficiency in cloud infrastructure concepts (compute, networking, storage, security) and in automating cluster or data center operations
Deep experience operating or scaling Kubernetes clusters or similar container orchestration systems in high-growth or hyperscale environments
Strong programming or scripting skills (Python, Go, or similar)
Familiarity with Infrastructure-as-Code tools such as Terraform or CloudFormation
Comfort with bare-metal Linux environments, GPU hardware, and large-scale networking
Enthusiasm for solving fast-moving, high-impact operational problems and for building automation that eliminates manual work
Ability to balance careful engineering with the urgency of keeping mission-critical systems running

Preferred

Background with GPU workloads, firmware management, or high-performance computing

Company

OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT. It is a sub-organization of the OpenAI Foundation.

H1B Sponsorship

OpenAI has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (1)
2024 (1)
2023 (1)
2022 (18)
2021 (10)
2020 (6)

Funding

Current Stage: Growth Stage
Total Funding: $79B
Key Investors: The Walt Disney Company, SoftBank, Thrive Capital
2025-12-11 · Corporate Round · $1B
2025-10-02 · Secondary Market · $6.6B
2025-03-31 · Series Unknown · $40B

Leadership Team

Sam Altman
CEO & Co-Founder
Greg Brockman
President, Chairman, & Co-Founder
Company data provided by Crunchbase