Lambda · 1 month ago
Senior Site Reliability Engineer - Managed Kubernetes
Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving a diverse range of customers. The Senior Site Reliability Engineer will be responsible for operating and maintaining Kubernetes clusters, handling incidents, and developing automation for cluster lifecycle management.
AI Infrastructure · Artificial Intelligence (AI) · Cloud Computing · Data Center · GPU · Machine Learning
Responsibilities
Operate and maintain bare-metal Kubernetes clusters, scaling up to thousands of nodes
Handle cluster degradation, recovery, resizing, and incident response using fleet management tools
Participate in a well-managed on-call rotation for critical incidents
Assist customers with Kubernetes questions, workload integration, storage, and authentication
Work closely with our HPC Ops and Datacenter Ops teams for low-level or cross-functional issues
Use Python and Golang to create tooling and automate the validation of platform quality
Design, build, and maintain scalable control plane services, operators, and custom controllers for Kubernetes
Develop automation for cluster lifecycle management: provisioning, upgrades, patching, and deletion
Define and implement SLOs and SLIs for Kubernetes services, workloads, and platform reliability
Qualifications
Required
6+ years of experience in an SRE, operations engineer, or similar role, with deep knowledge of running Linux clusters and systems
Strong programming skills in Go and Python; experience with GitOps (e.g., ArgoCD), Helm, and Kubernetes operators
Proven experience operating Kubernetes clusters in production environments (on-prem, EKS, GKE, or similar)
Can work either independently with limited direction or as part of a team
Can work with customers during incidents either via tickets, live messaging, or as part of a larger call
Familiarity with observability tools such as Prometheus, Grafana, and Fluent Bit, and with CI/CD pipelines
Proven experience provisioning Kubernetes using tools such as kubeadm, Cluster API, or similar
Preferred
Deep Kubernetes expertise: CRDs, CSI, CNI, and experience writing Kubernetes operators
Exposure to HPC clusters, AI/ML workloads, or large-scale GPU clusters
Hybrid or multi-cloud Kubernetes environment experience
Contributions to CNCF projects or Kubernetes SIGs
Benefits
Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401(k) plan with 2% company match (USA employees)
Flexible paid time off plan that we all actually use
Company
Lambda
Lambda is a cloud-based platform that provides high-performance GPU hardware and cloud infrastructure for AI model training and inference.
H1B Sponsorship
Lambda has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference. (Data powered by US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship; the highlighted field is the one similar to this job]
Trends of Total Sponsorships
2025 (16)
2024 (1)
2023 (3)
2022 (2)
2021 (2)
2020 (3)
Funding
Current Stage: Late Stage
Total Funding: $3.19B
Key Investors: TWG Global, JP Morgan, Macquarie Group
2025-11-18 · Series E · $1.5B
2025-08-19 · Debt Financing · $275M
2025-02-19 · Series D · $480M
Company data provided by Crunchbase