Andiamo · 1 day ago

SRE, Observability - Decentralized High-Performance Computing Leader

New York, NY

Contract

Onsite

Senior Level, Lead/Staff

7+ years of experience

Andiamo is a globally recognized staffing and consulting firm specializing in placing top technology professionals. They are seeking a Senior / Staff Site Reliability Engineer to design, build, and maintain telemetry infrastructure for a global AI cloud platform, ensuring performance and reliability of advanced machine learning workloads.

Consulting · Human Resources · Information Technology · Staffing Agency

Comp. & Benefits

H1B Sponsor Likely

Responsibilities

Architect large-scale observability systems: Design and operate telemetry pipelines for metrics, logs, and traces using modern observability stacks (Prometheus, Mimir, Loki, Tempo, Grafana) at petabyte scale

Ensure reliability and efficiency: Tune distributed telemetry systems for performance, cardinality control, and cost optimization while maintaining high availability across global deployments

Empower debugging and insight: Build tools and frameworks that give developers deep visibility into distributed ML training, inference pipelines, and infrastructure performance

Collaborate cross-functionally: Partner with platform, SRE, and infrastructure teams to extend observability coverage for Kubernetes clusters, Slurm schedulers, and GPU-based compute environments

Operational excellence: Establish SLOs, alerting policies, and observability standards that reduce noise, streamline incident response, and strengthen reliability culture across teams

Automate at scale: Develop clean, maintainable code in Go, Python, or Bash to extend observability tooling and automate operational workflows

Qualifications

Large-scale observability systems · Grafana ecosystem · Kubernetes · Programming in Go · Terraform · Linux internals · SLOs · SLIs · Communicative · Calm under pressure · Collaborative

Required

7+ years of total engineering experience, including at least 3 years building or operating large-scale observability or telemetry infrastructure (100M+ metric series, 10TB+/day logs)

Proven expertise with the Grafana ecosystem — Prometheus, Mimir, Loki, Tempo, Grafana, and Alertmanager — in production environments

Hands-on proficiency with Kubernetes, including Helm, Kustomize, custom CRDs, and multi-cluster federation

Experienced with Terraform (or Pulumi) and Infrastructure-as-Code best practices for hybrid or bare-metal provisioning

Strong programming ability in Go (preferred), with additional experience in Python or Bash for automation, data collection, and controller development

Deep knowledge of Linux internals — cgroups, namespaces, networking, and filesystem performance — plus foundational TCP/IP and TLS expertise

Experienced in defining and enforcing SLOs, SLIs, and alerting mechanisms that align engineering focus with real user impact

Calm and methodical under pressure — you've led incident response efforts, authored postmortems, and driven systemic improvements afterward

Communicative and collaborative — able to explain complex systems clearly and influence peers in dynamic, cross-functional environments

Preferred

Instrumentation of GPU-heavy or HPC clusters (NVIDIA A-/H-series, NVSwitch, DGX, RoCE, RDMA)

Observability for distributed ML workloads managed by Slurm, Ray, or Kubernetes-native batch schedulers

Hands-on with eBPF, Cilium, or Hubble for high-fidelity, low-overhead network visibility

Experience deploying OpenTelemetry, or migrating existing instrumentation to it, across metrics, logs, and traces

Operating service meshes like Istio or Linkerd and managing telemetry pipelines built on Envoy

Managing observability across distributed or multi-region environments (US/EU/APAC), optimizing for latency and cost

Implementing cost and resource monitoring using tools like Kubecost or Cloudability

Security observability: integrating Falco, GuardDuty, or auditd into telemetry pipelines

Contributions to open-source observability projects or thought leadership through blogs, talks, or community participation

Knowledge of high-performance storage systems (Ceph, Lustre, NVMe-oF) and telemetry integrations for throughput and latency analysis

Experience building custom backends with Kafka, ClickHouse, or VictoriaMetrics for large-scale data ingestion

Company

Andiamo

Glassdoor rating: 4.0

The Talent Partners for the AI Revolution.

Founded in 2003

New York, New York, USA

201-500 employees

http://andiamogo.com

H1B Sponsorship

Andiamo has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference (data from the US Department of Labor).

Distribution of Different Job Fields Receiving Sponsorship (chart not shown)

Trends of Total Sponsorships: 2022 (2), 2021 (1)

Funding

Current Stage

Growth Stage

Leadership Team

Patrick McAdams

CEO & Co-Founder

Steven Kottler

CFO

Company data provided by Crunchbase