Andiamo · 5 days ago
SRE, Compute - Decentralized High-Performance Computing Leader
Andiamo is a globally recognized staffing and consulting firm specializing in placing top technology professionals. They are seeking a Senior or Staff Site Reliability Engineer to enhance the reliability and performance of a large-scale compute platform for AI and high-performance computing workloads.
Consulting, Human Resources, Information Technology, Staffing Agency
Responsibilities
Push the limits of virtualization: Engineer hypervisors (KVM/QEMU) and fine-tune kernel subsystems, CPU topology, and NUMA configurations to drive down tail latencies for demanding AI and HPC workloads
Deploy and optimize at scale: Roll out new compute clusters with thousands of CPU and GPU nodes, validate offload capabilities on SmartNICs and DPUs, and fortify isolation across diverse workloads
Automate everything: Build intelligent telemetry systems and observability pipelines that surface kernel-to-orchestrator insights. Create automated incident-response tooling and rich performance dashboards to keep operations transparent and resilient
Diagnose the toughest issues: Lead deep-dive investigations into kernel crashes, kexec/kdump analyses, and performance regressions — distilling findings into actionable fixes, configuration improvements, or upstream contributions
Collaborate on the future of compute: Partner with hardware and kernel engineering teams to debug complex drivers, accelerate I/O pathways, and integrate emerging compute technologies such as TPUs and DPUs
Drive continuous improvement: Design chaos experiments, lead operational game days, and translate postmortems into meaningful SLOs that measure what truly impacts end users
Qualifications
Required
5+ years of experience in site reliability, kernel, or virtualization engineering within large-scale or compute-intensive environments
Expert understanding of Linux internals — from schedulers and memory management to device drivers and kernel debugging
Hands-on experience with virtualization technologies such as KVM, QEMU, Xen, or VMware in production settings
Strong programming skills in C, Go, or Rust, along with practical knowledge of Infrastructure-as-Code and CI/CD systems
Familiarity with SmartNICs, DPUs, or kernel-bypass networking technologies that enhance data throughput and reduce system overhead
Proven success scaling high-performance or HPC-grade infrastructure with measurable gains in reliability and efficiency
Company
Andiamo
The Talent Partners for the AI Revolution.
H1B Sponsorship
Andiamo has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by US Department of Labor)
Trends of Total Sponsorships
2022 (2)
2021 (1)
Funding
Current Stage
Growth Stage
Company data provided by Crunchbase