Lambda · 5 months ago
Senior Site Reliability Engineer - Observability
Lambda is a company focused on helping the smartest minds build Superintelligence through their advanced AI infrastructure. They are seeking a Senior Site Reliability Engineer with a strong background in observability to deploy and operate platforms for logging and metrics, automate deployment processes, and enhance system reliability across engineering teams.
AI Infrastructure, Artificial Intelligence (AI), Cloud Computing, Data Center, GPU, Machine Learning
Responsibilities
Deploy and operate observability platforms for logging, metrics, and distributed tracing
Automate the deployment and operation of these observability systems
Set up monitoring for modern AI/HPC clusters
Develop platform software to make observability adoptable and improve system reliability across Lambda engineering
Lead members of other engineering teams to design and develop solutions for their monitoring challenges
Qualifications
Required
Have 8+ years of experience in software engineering, with 3+ years in Go
Have 5+ years of experience in Site Reliability Engineering practices
Possess a proven understanding of observability tools and practices
Have experience with application deployment and monitoring using Kubernetes
Have experience building CI/CD pipelines
Expect quality and reliability from the solutions you build
Enjoy collaborating across team boundaries to help our engineering teams meet their observability needs
Preferred
Experience monitoring AI systems or HPC clusters
Experience with Prometheus and writing queries in PromQL
Experience with messaging systems like NATS
Understanding of the OpenTelemetry ecosystem and experience with both OTel instrumentation and the OTel collector
Experience with network monitoring (Ethernet and InfiniBand)
Understanding of dashboard design principles
Strong understanding of Linux fundamentals and system administration
Experience with infrastructure automation tooling such as Ansible and Terraform
Benefits
Health, dental, and vision coverage for you and your dependents
Wellness and Commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible Paid Time Off Plan that we all actually use
Company
Lambda
Lambda is a cloud-based platform that provides high-performance GPU hardware and cloud infrastructure for AI model training and inference.
H1B Sponsorship
Lambda has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (16)
2024 (1)
2023 (3)
2022 (2)
2021 (2)
2020 (3)
Funding
Current Stage: Late Stage
Total Funding: $3.19B
Key Investors: TWG Global, JP Morgan, Macquarie Group
2025-11-18 · Series E · $1.5B
2025-08-19 · Debt Financing · $275M
2025-02-19 · Series D · $480M
Company data provided by Crunchbase