Senior Site Reliability Engineer - Observability jobs in United States

Lambda · 1 month ago

Senior Site Reliability Engineer - Observability

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. The role involves deploying and operating observability platforms and collaborating with engineering teams to enhance system reliability and monitoring capabilities.

AI Infrastructure, Artificial Intelligence (AI), Cloud Computing, Data Center, GPU, Machine Learning
Comp. & Benefits
H1B Sponsor Likely

Responsibilities

Deploy and operate observability platforms for logging, metrics, and distributed tracing
Automate the deployment and operation of these observability systems
Set up monitoring for modern AI/HPC clusters
Develop platform software to make observability adoptable and improve system reliability across Lambda engineering
Lead members of other engineering teams to design and develop solutions for their monitoring challenges

Qualifications

Site Reliability Engineering, Go, Observability tools, Kubernetes, CI/CD pipelines, Linux fundamentals, Quality expectation, Collaboration, Problem-solving, Communication

Required

Have 8+ years of experience in software engineering, with 3+ years in Go
Have 5+ years of experience in Site Reliability Engineering practices
Possess a proven understanding of Observability tools and practices
Have experience with application deployment and monitoring using Kubernetes
Have experience building CI/CD pipelines
Expect quality and reliability from the solutions you build
Enjoy collaborating across team boundaries to help our engineering teams meet their observability needs

Preferred

Experience monitoring AI systems or HPC clusters
Experience with Prometheus and writing queries in PromQL
Experience with messaging systems like NATS
Understanding of the OpenTelemetry ecosystem and experience with both OTel instrumentation and the OTel collector
Experience with network monitoring (Ethernet and InfiniBand)
Understanding of dashboard design principles
Strong understanding of Linux fundamentals and system administration
Experience with infrastructure automation tooling such as Ansible and Terraform

Benefits

Generous cash & equity compensation
Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401(k) plan with 2% company match (USA employees)
Flexible paid time off plan that we all actually use

Company

Lambda

Lambda is a cloud-based platform that provides high-performance GPU hardware and cloud infrastructure for AI model training and inference.

H1B Sponsorship

Lambda has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data from the US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship]
Trends of Total Sponsorships
2025 (16)
2024 (1)
2023 (3)
2022 (2)
2021 (2)
2020 (3)

Funding

Current Stage
Late Stage
Total Funding
$3.19B
Key Investors
TWG Global, JP Morgan, Macquarie Group
2025-11-18 · Series E · $1.5B
2025-08-19 · Debt Financing · $275M
2025-02-19 · Series D · $480M

Leadership Team

Stephen Balaban
Co-founder, CEO
Michael Balaban
Co-founder, CTO
Company data provided by Crunchbase