Andiamo · 5 days ago

SRE, Compute - Decentralized High-Performance Computing Leader

New York, NY

Contract

Onsite

Senior Level, Lead/Staff

5+ years exp

Andiamo is a globally recognized staffing and consulting firm specializing in placing top technology professionals. They are seeking a Senior or Staff Site Reliability Engineer to enhance the reliability and performance of a large-scale compute platform for AI and high-performance computing workloads.

Consulting · Human Resources · Information Technology · Staffing Agency

Comp. & Benefits

H1B Sponsor Likely

Responsibilities

Push the limits of virtualization: Engineer hypervisors (KVM/QEMU) and fine-tune kernel subsystems, CPU topology, and NUMA configurations to drive down tail latencies for demanding AI and HPC workloads

Deploy and optimize at scale: Roll out new compute clusters with thousands of CPU and GPU nodes, validate offload capabilities on SmartNICs and DPUs, and fortify isolation across diverse workloads

Automate everything: Build intelligent telemetry systems and observability pipelines that surface kernel-to-orchestrator insights. Create automated incident-response tooling and rich performance dashboards to keep operations transparent and resilient

Diagnose the toughest issues: Lead deep-dive investigations into kernel crashes, kexec/kdump analyses, and performance regressions — distilling findings into actionable fixes, configuration improvements, or upstream contributions

Collaborate on the future of compute: Partner with hardware and kernel engineering teams to debug complex drivers, accelerate I/O pathways, and integrate emerging compute technologies such as TPUs and DPUs

Drive continuous improvement: Design chaos experiments, lead operational game days, and translate postmortems into meaningful SLOs that measure what truly impacts end users

Qualifications

Linux internals · Virtualization technologies · C programming · Infrastructure-as-Code · CI/CD systems · SmartNICs · DPUs · Continuous improvement · Collaboration · Problem-solving

Required

5+ years of experience in site reliability, kernel, or virtualization engineering within large-scale or compute-intensive environments

Expert understanding of Linux internals — from schedulers and memory management to device drivers and kernel debugging

Hands-on experience with virtualization technologies such as KVM, QEMU, Xen, or VMware in production settings

Strong programming skills in C, Go, or Rust, along with practical knowledge of Infrastructure-as-Code and CI/CD systems

Familiarity with SmartNICs, DPUs, or kernel-bypass networking technologies that enhance data throughput and reduce system overhead

Proven success scaling high-performance or HPC-grade infrastructure with measurable gains in reliability and efficiency

Company

Andiamo

Glassdoor rating: 4.0

The Talent Partners for the AI Revolution.

Founded in 2003

New York, New York, USA

201-500 employees

http://andiamogo.com

H1B Sponsorship

Andiamo has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference. (Data powered by the US Department of Labor)

Distribution of Different Job Fields Receiving Sponsorship

Trends of Total Sponsorships

2022 (2)

2021 (1)

Funding

Current Stage

Growth Stage

Leadership Team

Patrick McAdams

CEO & Co-Founder

Steven Kottler

CFO

Company data provided by Crunchbase