Lambda · 1 month ago

Senior Site Reliability Engineer - Fleet Reliability

San Francisco, CA

Full-time

Onsite

Senior Level

$230K/yr - $345K/yr

7+ years exp

Lambda is a leader in AI cloud infrastructure serving a broad range of customers from AI researchers to enterprises. The Senior Site Reliability Engineer will be responsible for defining metrics for system availability, collaborating on monitoring systems, creating automated remediations, and participating in incident response.

AI Infrastructure, Artificial Intelligence (AI), Cloud Computing, Data Center, GPU, Machine Learning

H1B Sponsor Likely

Responsibilities

Define Fleet Health metrics and indicators to objectively measure and improve system availability

Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies

Create runbooks and automated remediations for common failure scenarios

Build in automation and auditing to ensure compliance and improve efficiency and productivity

Participate in on-call rotations and provide support for incident response and resolution

Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, and SumoLogic

Qualifications

Site Reliability Engineering, AI infrastructure, Linux-based systems, Python, Go, Monitoring tools, Automation tools, Cloud platforms, Continuous improvement, Machine learning experience, Containerization technologies, HPC resources, Chaos engineering, Compliance frameworks, Problem-solving skills, Communication skills, Collaboration skills

Required

7+ years of experience in Site Reliability Engineering, DevOps, or a similar role

Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization

Strong understanding of Linux-based systems in a distributed environment

Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling

Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)

Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)

Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)

Excellent problem-solving and troubleshooting skills

Strong communication and collaboration skills

Passion for continuous improvement and innovation

Preferred

Experience in the machine learning or computer hardware industry

Knowledge of containerization and orchestration technologies (e.g., Docker, Kubernetes)

Experience building and/or operating HPC resources

Background in chaos engineering or similar reliability testing methodologies

Understanding of compliance frameworks (SOC 2, ISO 27001, etc.)

Benefits

Health, dental, and vision coverage for you and your dependents

Wellness and commuter stipends for select roles

401(k) plan with 2% company match (USA employees)

Flexible paid time off plan that we all actually use

Company

Lambda

Lambda is a cloud-based platform that provides high-performance GPU hardware and cloud infrastructure for AI model training and inference.

Founded in 2012

San Jose, California, USA

501-1000 employees

https://lambda.ai

H1B Sponsorship

Lambda has a track record of offering H1B sponsorships, though this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)

Trends of Total Sponsorships

2025 (16)

2024 (1)

2023 (3)

2022 (2)

2021 (2)

2020 (3)

Funding

Current Stage

Late Stage

Total Funding

$3.19B

Key Investors

TWG Global, JP Morgan, Macquarie Group

2025-11-18 · Series E · $1.5B

2025-08-19 · Debt Financing · $275M

2025-02-19 · Series D · $480M

Leadership Team

Stephen Balaban

Co-founder, CEO

Michael Balaban

Co-founder, CTO

Recent News

SiliconANGLE

AI cloud provider Lambda reportedly raising $350M round

2026-01-11

Business Wire

Lambda Appoints Leonard Speiser as Chief Operating Officer

2026-01-09

Techmeme

Source: Lambda, which rents access to AI chips and is backed by Nvidia, is in talks to raise $350M+ led by Mubadala Capital, ahead of an IPO planned for H2 2026 (The Information)

2026-01-09

Company data provided by Crunchbase