Doghouse Recruitment · 6 hours ago
Site Reliability Engineer
Doghouse Recruitment is focused on building a cloud platform for high-throughput, compute-heavy workloads. They are seeking a Senior Site Reliability Engineer to own production reliability, define SLIs/SLOs, and improve latency while working in a bare-metal environment.
Responsibilities
Define SLIs/SLOs
Run error budget conversations
Ship changes that reduce incidents and improve latency (p95/p99)
Build automation to kill toil
Raise deployment safety (canary/rollback)
Turn observability into signal instead of noise
Qualifications
Required
Senior-level experience in Site Reliability Engineering / Production Engineering running bare metal / on-prem / data center infrastructure (not public cloud only)
Deep hands-on expertise in Linux systems debugging and performance (CPU, memory, IO, kernel-level behaviors)
Strong understanding of networking (DNS/TCP/TLS, latency, packet loss, congestion, troubleshooting under load)
Strong Kubernetes experience beyond manifests: scheduler behavior, autoscaling edge cases, kubelet pressure/evictions, etcd/control plane
Experience with Terraform, Docker, Helm, and modern CI/CD practices
Some coding skills in Go, Python, and/or C++
Benefits
Additional bonus and stock
Company
Doghouse Recruitment
Recruitment for your technology teams. You don't need another agency flooding your inbox with mismatched candidates.
Funding
Current Stage
Early Stage
Company data provided by Crunchbase