Lambda · 2 months ago

Senior HPC Operations Engineer

San Francisco Office

Full-time

Onsite

Senior Level, Lead/Staff

$207K/yr - $401K/yr

10+ years exp

Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. They are seeking a Senior HPC Operations Engineer to remotely deploy and configure large-scale HPC clusters for AI workloads and troubleshoot any issues that arise.

AI Infrastructure, Artificial Intelligence (AI), Cloud Computing, Data Center, GPU, Machine Learning

Comp. & Benefits

H1B Sponsor Likely

Responsibilities

Remotely deploy and configure large-scale HPC clusters for AI workloads (up to many thousands of nodes)

Remotely install and configure operating systems, firmware, software, and networking on HPC clusters both manually and using automation tools

Troubleshoot and resolve HPC cluster issues working closely with physical deployment teams on-site

Provide clear and detailed requirements back to other engineering teams on gaps and improvement areas, specifically in the areas of simplification, stability, and operational efficiency

Contribute to the creation and maintenance of Standard Operating Procedures

Provide regular and well-communicated updates to project leads throughout each deployment

Mentor and assist less experienced team members

Stay up-to-date on the latest HPC/AI technologies and best practices

Qualifications

HPC cluster deployment, HPC/AI architecture, Linux systems, Bright Cluster Manager, SLURM, Kubernetes, Problem solving, Mentoring, Team collaboration, Attention to detail, Flexibility

Required

Deeply experienced HPC engineer comfortable with logical provisioning of a cluster

Strong understanding of HPC/AI architecture, operating systems, firmware, software, and networking

10+ years of experience in deploying and configuring HPC clusters for AI workloads

Innate attention to detail

Experience with Bright Cluster Manager or similar cluster management tools

Expert in configuring and troubleshooting SFP+ fiber, InfiniBand (IB), and 100 GbE network fabrics

Expert in configuring and troubleshooting Ethernet, switching, power infrastructure, GPUDirect, RDMA, NCCL, and Horovod environments

Expert in configuring and troubleshooting Linux-based compute nodes, including firmware updates and driver installation

Expert in SLURM, Kubernetes, or other job scheduling systems

Work well under deadlines and structured project plans, while knowing when and how to ask for changes to project timelines

Excellent problem solving and troubleshooting skills

Flexibility to travel to North American data centers as on-site needs arise or as part of training exercises

Able to work independently and as part of a team

Comfortable mentoring and supporting junior HPC engineers on cluster deployments

Preferred

Experience with machine learning and deep learning frameworks (PyTorch, TensorFlow) and benchmarking tools (DeepSpeed, MLPerf)

Experience with containerization technologies (Docker, Kubernetes)

Experience working with the technologies that underpin our cloud business (GPU acceleration, virtualization, and cloud computing)

Keen situational awareness in customer situations, employing diplomacy and tact

Bachelor's degree in EE, CS, Physics, Mathematics, or equivalent work experience

Benefits

Health, dental, and vision coverage for you and your dependents

Wellness and Commuter stipends for select roles

401(k) Plan with 2% company match (USA employees)

Flexible Paid Time Off Plan that we all actually use

Company

Lambda

Lambda is a cloud-based platform that provides high-performance GPU hardware and cloud infrastructure for AI model training and inference.

Founded in 2012

San Jose, California, USA

501-1000 employees

https://lambda.ai

H1B Sponsorship

Lambda has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by the US Department of Labor)

Trends of Total Sponsorships

2025 (16)

2024 (1)

2023 (3)

2022 (2)

2021 (2)

2020 (3)

Funding

Current Stage

Late Stage

Total Funding

$3.19B

Key Investors

TWG Global, JP Morgan, Macquarie Group

2025-11-18 · Series E · $1.5B

2025-08-19 · Debt Financing · $275M

2025-02-19 · Series D · $480M

Leadership Team

Stephen Balaban

Co-founder, CEO

Michael Balaban

Co-founder, CTO

Recent News

SiliconANGLE

AI cloud provider Lambda reportedly raising $350M round

2026-01-11

Business Wire

Lambda Appoints Leonard Speiser as Chief Operating Officer

2026-01-09

Techmeme

Source: Lambda, which rents access to AI chips and is backed by Nvidia, is in talks to raise $350M+ led by Mubadala Capital, ahead of an IPO planned for H2 2026 (The Information)

2026-01-09

Company data provided by Crunchbase