Andiamo · 1 day ago
SRE, Observability - Decentralized High-Performance Computing Leader
Andiamo is a globally recognized staffing and consulting firm specializing in placing top technology professionals. They are seeking a Senior / Staff Site Reliability Engineer to design, build, and maintain telemetry infrastructure for a global AI cloud platform, ensuring performance and reliability of advanced machine learning workloads.
Consulting · Human Resources · Information Technology · Staffing Agency
Responsibilities
Architect large-scale observability systems: Design and operate telemetry pipelines for metrics, logs, and traces using modern observability stacks (Prometheus, Mimir, Loki, Tempo, Grafana) at petabyte scale
Ensure reliability and efficiency: Tune distributed telemetry systems for performance, cardinality control, and cost optimization while maintaining high availability across global deployments
Empower debugging and insight: Build tools and frameworks that give developers deep visibility into distributed ML training, inference pipelines, and infrastructure performance
Collaborate cross-functionally: Partner with platform, SRE, and infrastructure teams to extend observability coverage for Kubernetes clusters, Slurm schedulers, and GPU-based compute environments
Operational excellence: Establish SLOs, alerting policies, and observability standards that reduce noise, streamline incident response, and strengthen reliability culture across teams
Automate at scale: Develop clean, maintainable code in Go, Python, or Bash to extend observability tooling and automate operational workflows
Qualifications
Required
7+ years of total engineering experience, including at least 3 years building or operating large-scale observability or telemetry infrastructure (100M+ metric series, 10TB+/day logs)
Proven expertise with the Grafana ecosystem — Prometheus, Mimir, Loki, Tempo, Grafana, and Alertmanager — in production environments
Hands-on proficiency with Kubernetes, including Helm, Kustomize, CRDs, and multi-cluster federation
Experienced with Terraform (or Pulumi) and Infrastructure-as-Code best practices for hybrid or bare-metal provisioning
Strong programming ability in Go (preferred), with additional experience in Python or Bash for automation, data collection, and controller development
Deep knowledge of Linux internals — cgroups, namespaces, networking, and filesystem performance — plus foundational TCP/IP and TLS expertise
Experienced in defining and enforcing SLOs, SLIs, and alerting mechanisms that align engineering focus with real user impact
Calm and methodical under pressure — you've led incident response efforts, authored postmortems, and driven systemic improvements afterward
Communicative and collaborative — able to explain complex systems clearly and influence peers in dynamic, cross-functional environments
Preferred
Instrumentation of GPU-heavy or HPC clusters (NVIDIA A-/H-series, NVSwitch, DGX, RoCE, RDMA)
Observability for distributed ML workloads managed by Slurm, Ray, or Kubernetes-native batch schedulers
Hands-on with eBPF, Cilium, or Hubble for high-fidelity, low-overhead network visibility
Experience deploying and migrating OpenTelemetry across metrics, logs, and traces
Operating service meshes like Istio or Linkerd and managing telemetry pipelines built on Envoy
Managing observability across distributed or multi-region environments (US/EU/APAC), optimizing for latency and cost
Implementing cost and resource monitoring using tools like Kubecost or Cloudability
Security observability overlap — integrating Falco, GuardDuty, or auditd into telemetry pipelines
Contributions to open-source observability projects or thought leadership through blogs, talks, or community participation
Knowledge of high-performance storage systems (Ceph, Lustre, NVMe-oF) and telemetry integrations for throughput and latency analysis
Experience building custom backends with Kafka, ClickHouse, or VictoriaMetrics for large-scale data ingestion
Company
Andiamo
The Talent Partners for the AI Revolution.
H1B Sponsorship
Andiamo has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2022 (2)
2021 (1)
Funding
Current Stage
Growth Stage
Company data provided by Crunchbase