Senior Site Reliability Engineer - Fleet Reliability jobs in United States

Lambda · 1 month ago

Senior Site Reliability Engineer - Fleet Reliability

Lambda is a leader in AI cloud infrastructure serving a broad range of customers from AI researchers to enterprises. The Senior Site Reliability Engineer will be responsible for defining metrics for system availability, collaborating on monitoring systems, creating automated remediations, and participating in incident response.

AI Infrastructure · Artificial Intelligence (AI) · Cloud Computing · Data Center · GPU · Machine Learning
Comp. & Benefits
H1B Sponsor Likely

Responsibilities

Define Fleet Health metrics and indicators to objectively measure and improve system availability
Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
Create runbooks and automated remediations for common failure scenarios
Build in automation and auditing to ensure compliance and improve efficiency and productivity
Participate in on-call rotations and provide support for incident response and resolution
Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, and SumoLogic

Qualifications

Site Reliability Engineering · AI infrastructure · Linux-based systems · Python · Go · Monitoring tools · Automation tools · Cloud platforms · Continuous improvement · Machine learning experience · Containerization technologies · HPC resources · Chaos engineering · Compliance frameworks · Problem-solving skills · Communication skills · Collaboration skills

Required

7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
Strong understanding of Linux-based systems in a distributed environment
Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
Excellent problem-solving and troubleshooting skills
Strong communication and collaboration skills
Passion for continuous improvement and innovation

Preferred

Experience in the machine learning or computer hardware industry
Knowledge of containerization and orchestration technologies (e.g., Docker, Kubernetes)
Experience building and/or operating HPC resources
Background in chaos engineering or similar reliability testing methodologies
Understanding of compliance frameworks (SOC 2, ISO 27001, etc.)

Benefits

Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible paid time off plan that we all actually use

Company

Lambda

Lambda is a cloud-based platform that provides high-performance GPU hardware and cloud infrastructure for AI model training and inference.

H1B Sponsorship

Lambda has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart)
Trends of Total Sponsorships
2025 (16)
2024 (1)
2023 (3)
2022 (2)
2021 (2)
2020 (3)

Funding

Current Stage: Late Stage
Total Funding: $3.19B
Key Investors: TWG Global, JP Morgan, Macquarie Group
2025-11-18 · Series E · $1.5B
2025-08-19 · Debt Financing · $275M
2025-02-19 · Series D · $480M

Leadership Team

Stephen Balaban: Co-founder, CEO
Michael Balaban: Co-founder, CTO
Company data provided by Crunchbase