Lambda · 1 month ago
Senior Site Reliability Engineer - Fleet Reliability
Lambda is a leader in AI cloud infrastructure serving a broad range of customers from AI researchers to enterprises. The Senior Site Reliability Engineer will be responsible for defining metrics for system availability, collaborating on monitoring systems, creating automated remediations, and participating in incident response.
AI Infrastructure, Artificial Intelligence (AI), Cloud Computing, Data Center, GPU, Machine Learning
Responsibilities
Define Fleet Health metrics and indicators to objectively measure and improve system availability
Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
Create runbooks and automated remediations for common failure scenarios
Build in automation and auditing to ensure compliance and improve efficiency and productivity
Participate in on-call rotations and provide support for incident response and resolution
Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, and SumoLogic
Qualifications
Required
7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
Strong understanding of Linux-based systems in a distributed environment
Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
Excellent problem-solving and troubleshooting skills
Strong communication and collaboration skills
Passion for continuous improvement and innovation
Preferred
Experience in the machine learning or computer hardware industry
Knowledge of containerization and orchestration technologies (e.g., Docker, Kubernetes)
Experience building and/or operating HPC resources
Background in chaos engineering or similar reliability testing methodologies
Understanding of compliance frameworks (SOC 2, ISO 27001, etc.)
Benefits
Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401(k) plan with 2% company match (USA employees)
Flexible paid time off plan that we all actually use
Company
Lambda
Lambda is a cloud-based platform that provides high-performance GPU hardware and cloud infrastructure for AI model training and inference.
H1B Sponsorship
Lambda has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (16)
2024 (1)
2023 (3)
2022 (2)
2021 (2)
2020 (3)
Funding
Current Stage: Late Stage
Total Funding: $3.19B
Key Investors: TWG Global, JP Morgan, Macquarie Group
2025-11-18 · Series E · $1.5B
2025-08-19 · Debt Financing · $275M
2025-02-19 · Series D · $480M
Company data provided by Crunchbase