Lambda · 1 month ago
Senior Site Reliability Engineer - Observability
Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. The role involves deploying and operating observability platforms and collaborating with engineering teams to enhance system reliability and monitoring capabilities.
AI Infrastructure · Artificial Intelligence (AI) · Cloud Computing · Data Center · GPU · Machine Learning
Responsibilities
Deploy and operate observability platforms for logging, metrics, and distributed tracing
Automate the deployment and operation of these observability systems
Set up monitoring for modern AI/HPC clusters
Develop platform software that makes observability easy to adopt and improves system reliability across Lambda engineering
Lead members of other engineering teams to design and develop solutions for their monitoring challenges
Qualifications
Required
Have 8+ years of experience in software engineering, with 3+ years in Go
Have 5+ years of experience in Site Reliability Engineering practices
Possess a proven understanding of observability tools and practices
Have experience with application deployment and monitoring using Kubernetes
Have experience building CI/CD pipelines
Expect quality and reliability from the solutions you build
Enjoy collaborating across team boundaries to help our engineering teams meet their observability needs
Preferred
Experience monitoring AI systems or HPC clusters
Experience with Prometheus and writing queries in PromQL
Experience with messaging systems like NATS
Understanding of the OpenTelemetry ecosystem and experience with both OTel instrumentation and the OTel collector
Experience with network monitoring for Ethernet and InfiniBand
Understanding of dashboard design principles
Strong understanding of Linux fundamentals and system administration
Experience with infrastructure automation tooling such as Ansible and Terraform
Benefits
Generous cash & equity compensation
Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401(k) plan with 2% company match (USA employees)
Flexible paid time off plan that we all actually use
Company
Lambda
Lambda is a cloud-based platform that provides high-performance GPU hardware and cloud infrastructure for AI model training and inference.
H1B Sponsorship
Lambda has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data powered by the US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship, with the job field similar to this job highlighted]
Trends of Total Sponsorships
2025 (16)
2024 (1)
2023 (3)
2022 (2)
2021 (2)
2020 (3)
Funding
Current Stage: Late Stage
Total Funding: $3.19B
Key Investors: TWG Global, JP Morgan, Macquarie Group
2025-11-18 · Series E · $1.5B
2025-08-19 · Debt Financing · $275M
2025-02-19 · Series D · $480M
Company data provided by Crunchbase