Senior Site Reliability Engineer - Managed Kubernetes jobs in United States

Lambda · 1 month ago

Senior Site Reliability Engineer - Managed Kubernetes

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving a diverse range of customers. The Senior Site Reliability Engineer will be responsible for operating and maintaining Kubernetes clusters, handling incidents, and developing automation for cluster lifecycle management.

AI Infrastructure, Artificial Intelligence (AI), Cloud Computing, Data Center, GPU, Machine Learning
Comp. & Benefits
H1B Sponsor Likely

Responsibilities

Operate and maintain bare-metal Kubernetes clusters, scaling up to thousands of nodes
Handle cluster degradation, recovery, resizing, and incident response using fleet management tools
Participate in a well-managed on-call rotation for critical incidents
Assist customers with Kubernetes questions, workload integration, storage, and authentication
Work closely with our HPC Ops and Datacenter Ops teams for low-level or cross-functional issues
Use Python and Golang to create tooling and automate the validation of platform quality
Design, build, and maintain scalable control plane services, operators, and custom controllers for Kubernetes
Develop automation for cluster lifecycle management: provisioning, upgrades, patching, and deletion
Define and implement SLOs and SLIs for Kubernetes services, workloads, and platform reliability

Qualifications

Kubernetes, Python, Golang, Linux systems, GitOps, Observability tools, Cluster lifecycle management, Customer support, Team collaboration

Required

6+ years of experience in an SRE, operations engineer, or similar role, with deep knowledge of running Linux clusters and systems
Strong programming skills in Go and Python; experience with GitOps (e.g., ArgoCD), Helm, and Kubernetes operators
Proven experience operating Kubernetes clusters in production environments (on-prem, EKS, GKE, or similar)
Can work either independently with limited direction or as part of a team
Can work with customers during incidents either via tickets, live messaging, or as part of a larger call
Familiarity with observability tools such as Prometheus, Grafana, and Fluent Bit, and with CI/CD pipelines
Proven experience provisioning Kubernetes using tools such as kubeadm, Cluster API, or similar

Preferred

Deep Kubernetes expertise: CRDs, CSI, CNI, and experience writing Kubernetes operators
Exposure to HPC clusters, AI/ML workloads, or large-scale GPU clusters
Hybrid or multi-cloud Kubernetes environment experience
Contributions to CNCF projects or Kubernetes SIGs

Benefits

Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401(k) plan with 2% company match (USA employees)
Flexible paid time off plan that we all actually use

Company

Lambda

Lambda is a cloud-based platform that provides high-performance GPU hardware and cloud infrastructure for AI model training and inference.

H1B Sponsorship

Lambda has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor.)
Trends of Total Sponsorships
2025 (16)
2024 (1)
2023 (3)
2022 (2)
2021 (2)
2020 (3)

Funding

Current Stage: Late Stage
Total Funding: $3.19B
Key Investors: TWG Global, JP Morgan, Macquarie Group
2025-11-18 · Series E · $1.5B
2025-08-19 · Debt Financing · $275M
2025-02-19 · Series D · $480M

Leadership Team

Stephen Balaban: Co-founder, CEO
Michael Balaban: Co-founder, CTO
Company data provided by Crunchbase