Hydra Host · 2 months ago

Site Reliability Engineer at Hydra Host

Miami, United States

Full-time

Onsite

Senior Level

$140K/yr - $200K/yr

5+ years exp

Hydra Host is a fast-growing bare-metal HPC infrastructure company providing reliable cloud systems for mission-critical training and inference. The Site Reliability Engineer will own QA systems, monitoring, and backend service delivery to ensure systems meet internal and customer SLAs, collaborating with datacenter, device, and engineering teams to maintain operational excellence.

Artificial Intelligence (AI) · Cloud Infrastructure · Developer APIs · Web Hosting

Responsibilities

Design, deploy, and maintain QA systems used by our development teams to test integration and live system responses across full-stack deployments in local, live, and ephemeral environments

Evaluate and integrate monitoring and QA tools to find the right tools for the job

Create a unified monitoring platform and processes that datacenter and device teams will integrate with to monitor their components (live servers, lifecycle, networks, power, etc.)

Maintain monitoring processes and dashboards to provide complete visibility into the health, performance, and reliability of our CI systems, software deployments, and testing platforms

Create and maintain a systems test suite, in collaboration with our product managers, to validate marketplace changes against all business functions in live and ephemeral QA environments

Integrate all of the aforementioned systems to create holistic reporting on platform health statistics

Design disaster-recovery processes in collaboration with the DevOps team

Ensure we are meeting uptime SLAs across all platform deployments

Work with datacenter and device teams to define service-level indicators (SLIs), service-level objectives (SLOs), and SLAs

Establish observability standards across the stack (logs, metrics, traces, and alerts), along with actionable on-call playbooks

Automate everything from monitoring setups to incident responses to eliminate manual toil and increase reliability

Drive incident response, root cause analysis, and post-mortems. Turn incident findings into tooling and process improvements

Establish the monitoring infrastructure and dashboards that enable everyone — from engineers to execs — to know what’s going on

Act as the reliability partner to engineering teams: review systems for reliability concerns, help design QA requirements and testing, and help teams meet reliability targets

Qualifications

Reliability Engineering · Monitoring tools · Service orchestration · Infrastructure as code · Scripting languages · Distributed tracing · Log aggregation · Post-mortem analysis · Incident response · Communication

Required

5–8+ years of experience in Reliability Engineering, DevOps, or infrastructure roles focused on large-scale, high-uptime production environments

Deep familiarity with monitoring and observability tooling: you have implemented and managed such systems, especially Prometheus, Grafana, and Zabbix

Strong experience with service orchestration in multi-region environments (Nomad, Kubernetes, cloud VMs, distributed databases)

Track record of managing production system uptime and SLAs and building tools to support them

Experience writing and reviewing post-mortems and using those findings to drive improvements in tools and processes

Proficient with scripting and programming languages (Python, Go, Bash, etc.) for automating operational tasks

Strong proficiency with infrastructure as code and DevOps workflows

Experience with distributed tracing, log aggregation, and alert tuning

Passion for building systems that fail gracefully, alert correctly, and empower others to operate confidently

Excellent communication skills: you can write clear documentation, drive incident reviews, and communicate reliability risks to technical and non-technical stakeholders