Lambda · 1 month ago

Senior Site Reliability Engineer - Managed Kubernetes

San Francisco, CA

Full-time

Onsite

Senior Level

$240K/yr - $401K/yr

6+ years exp

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving a diverse range of customers. The Senior Site Reliability Engineer will be responsible for operating and maintaining Kubernetes clusters, handling incidents, and developing automation for cluster lifecycle management.

AI Infrastructure · Artificial Intelligence (AI) · Cloud Computing · Data Center · GPU · Machine Learning

Comp. & Benefits

H1B Sponsor Likely

Responsibilities

Operate and maintain bare-metal Kubernetes clusters, scaling up to thousands of nodes

Handle cluster degradation, recovery, resizing, and incident response using fleet management tools

Participate in a well-managed on-call rotation for critical incidents

Assist customers with Kubernetes questions, workload integration, storage, and authentication

Work closely with our HPC Ops and Datacenter Ops teams for low-level or cross-functional issues

Use Python and Golang to create tooling and automate the validation of platform quality

Design, build, and maintain scalable control plane services, operators, and custom controllers for Kubernetes

Develop automation for cluster lifecycle management: provisioning, upgrades, patching, and deletion

Define and implement SLOs and SLIs for Kubernetes services, workloads, and platform reliability

Qualifications

Kubernetes · Python · Golang · Linux systems · GitOps · Observability tools · Cluster lifecycle management · Customer support · Team collaboration

Required

6+ years of experience in an SRE, operations engineer, or similar role, with deep knowledge of running Linux clusters and systems

Strong programming skills in Go and Python; experience with GitOps (e.g., ArgoCD), Helm, and Kubernetes operators

Proven experience operating Kubernetes clusters in production environments (on-prem, EKS, GKE, or similar)

Can work either independently with limited direction or as part of a team

Can work with customers during incidents via tickets, live messaging, or as part of a larger call

Familiarity with observability tools like Prometheus, Grafana, and Fluent Bit, as well as CI/CD pipelines

Proven experience provisioning Kubernetes using tools such as kubeadm, Cluster API, or similar

Preferred

Deep Kubernetes expertise: CRDs, CSI, CNI, and experience writing Kubernetes operators

Exposure to HPC clusters, AI/ML workloads, or large-scale GPU clusters

Hybrid or multi-cloud Kubernetes environment experience

Contributions to CNCF projects or Kubernetes SIGs

Benefits

Health, dental, and vision coverage for you and your dependents

Wellness and commuter stipends for select roles

401(k) plan with 2% company match (USA employees)

Flexible paid time off plan that we all actually use

Company

Lambda

Lambda is a cloud-based platform that provides high-performance GPU hardware and cloud infrastructure for AI model training and inference.

Founded in 2012

San Jose, California, USA

501-1000 employees

https://lambda.ai

H1B Sponsorship

Lambda has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)

Trends of Total Sponsorships

2025 (16)

2024 (1)

2023 (3)

2022 (2)

2021 (2)

2020 (3)

Funding

Current Stage

Late Stage

Total Funding

$3.19B

Key Investors

TWG Global · JP Morgan · Macquarie Group

2025-11-18 · Series E · $1.5B

2025-08-19 · Debt Financing · $275M

2025-02-19 · Series D · $480M

Leadership Team

Stephen Balaban

Co-founder, CEO

Michael Balaban

Co-Founder / CTO

Recent News

SiliconANGLE

AI cloud provider Lambda reportedly raising $350M round

2026-01-11

Business Wire

Lambda Appoints Leonard Speiser as Chief Operating Officer

2026-01-09

Techmeme

Source: Lambda, which rents access to AI chips and is backed by Nvidia, is in talks to raise $350M+ led by Mubadala Capital, ahead of an IPO planned for H2 2026 (The Information)

2026-01-09

Company data provided by Crunchbase