SRE, Compute - Decentralized High-Performance Computing Leader jobs in United States

Andiamo · 5 days ago

SRE, Compute - Decentralized High-Performance Computing Leader

Andiamo is a globally recognized staffing and consulting firm specializing in placing top technology professionals. They are seeking a Senior or Staff Site Reliability Engineer to enhance the reliability and performance of a large-scale compute platform for AI and high-performance computing workloads.

Consulting, Human Resources, Information Technology, Staffing Agency
Comp. & Benefits
H1B Sponsor Likely

Responsibilities

Push the limits of virtualization: Engineer hypervisors (KVM/QEMU) and fine-tune kernel subsystems, CPU topology, and NUMA configurations to drive down tail latencies for demanding AI and HPC workloads
Deploy and optimize at scale: Roll out new compute clusters with thousands of CPU and GPU nodes, validate offload capabilities on SmartNICs and DPUs, and fortify isolation across diverse workloads
Automate everything: Build intelligent telemetry systems and observability pipelines that surface kernel-to-orchestrator insights. Create automated incident-response tooling and rich performance dashboards to keep operations transparent and resilient
Diagnose the toughest issues: Lead deep-dive investigations into kernel crashes, kexec/kdump analyses, and performance regressions — distilling findings into actionable fixes, configuration improvements, or upstream contributions
Collaborate on the future of compute: Partner with hardware and kernel engineering teams to debug complex drivers, accelerate I/O pathways, and integrate emerging compute technologies such as TPUs and DPUs
Drive continuous improvement: Design chaos experiments, lead operational game days, and translate postmortems into meaningful SLOs that measure what truly impacts end users

Qualifications

Linux internals, Virtualization technologies, C programming, Infrastructure-as-Code, CI/CD systems, SmartNICs, DPUs, Continuous improvement, Collaboration, Problem-solving

Required

5+ years of experience in site reliability, kernel, or virtualization engineering within large-scale or compute-intensive environments
Expert understanding of Linux internals — from schedulers and memory management to device drivers and kernel debugging
Hands-on experience with virtualization technologies such as KVM, QEMU, Xen, or VMware in production settings
Strong programming skills in C, Go, or Rust, along with practical knowledge of Infrastructure-as-Code and CI/CD systems
Familiarity with SmartNICs, DPUs, or kernel-bypass networking technologies that enhance data throughput and reduce system overhead
Proven success scaling high-performance or HPC-grade infrastructure with measurable gains in reliability and efficiency

Company

The Talent Partners for the AI Revolution.

H1B Sponsorship

Andiamo has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2022 (2)
2021 (1)

Funding

Current Stage
Growth Stage

Leadership Team

Patrick McAdams, CEO & Co-Founder
Steven Kottler, CFO
Company data provided by Crunchbase