OpenAI · 2 months ago

Site Reliability Engineer, Frontier Systems Infrastructure

San Francisco

Full-time

Onsite

Mid, Senior Level

$255K/yr - $490K/yr

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. The Site Reliability Engineer will operate and scale large Kubernetes clusters, automate infrastructure processes, and ensure the reliability of supercomputers used for model training.

Agentic AI, Artificial Intelligence (AI), Foundational AI, Generative AI, Machine Learning, Natural Language Processing, SaaS

Growth Opportunities

H1B Sponsor Likely

Responsibilities

Spin up and scale large Kubernetes clusters, including automation for provisioning, bootstrapping, and cluster lifecycle management

Build software abstractions that unify multiple clusters and present a seamless interface to training workloads

Own node bring-up from bare metal through firmware upgrades, ensuring fast, repeatable deployment at massive scale

Improve operational metrics such as reducing cluster restart times (e.g., from hours to minutes) and accelerating firmware or OS upgrade cycles

Integrate networking and hardware health systems to deliver end-to-end reliability across servers, switches, and data center infrastructure

Develop monitoring and observability systems to detect issues early and keep clusters stable under extreme load

Execute at the same level as a software engineer

Qualifications

Kubernetes, Distributed systems, Python, Infrastructure-as-Code, Linux, Cloud infrastructure, GPU workloads, Firmware management, High-performance computing, Automation

Required

Experience as an infrastructure, systems, or distributed systems engineer in large-scale or high-availability environments

Strong knowledge of Kubernetes internals, cluster scaling patterns, and containerized workloads

Proficiency in cloud infrastructure concepts (compute, networking, storage, security) and in automating cluster or data center operations

Deep experience operating or scaling Kubernetes clusters or similar container orchestration systems in high-growth or hyperscale environments

Strong programming or scripting skills (Python, Go, or similar)

Familiarity with Infrastructure-as-Code tools such as Terraform or CloudFormation

Comfort with bare-metal Linux environments, GPU hardware, and large-scale networking

Enthusiasm for solving fast-moving, high-impact operational problems and building automation to eliminate manual work

Ability to balance careful engineering with the urgency of keeping mission-critical systems running

Preferred

Background with GPU workloads

Firmware management

High-performance computing

Company

OpenAI

Glassdoor: 4.2

OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT. It is a sub-organization of OpenAI Foundation.

Founded in 2015

San Francisco, California, USA

201-500 employees

https://www.openai.com

H1B Sponsorship

OpenAI has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)

[Chart: Distribution of Different Job Fields Receiving Sponsorship]

Trends of Total Sponsorships

2025 (1)

2024 (1)

2023 (1)

2022 (18)

2021 (10)

2020 (6)

Funding

Current Stage

Growth Stage

Total Funding

$79B

Key Investors

The Walt Disney Company, SoftBank, Thrive Capital

2025-12-11 · Corporate Round · $1B

2025-10-02 · Secondary Market · $6.6B

2025-03-31 · Series Unknown · $40B

Leadership Team

Sam Altman

CEO & Co-Founder

Greg Brockman

President, Chairman, & Co-Founder

Recent News

Business Insider

This is the key breakthrough AI still requires to reach superintelligence, according to those building it

2026-01-09

Indian Express

AI showdown: ChatGPT web traffic slips as Gemini’s share rises 3.3%

2026-01-09

The Motley Fool

Why UiPath Stock Rocketed 29% Higher in 2025

2026-01-09

Company data provided by Crunchbase