CyRAD Solutions · 2 months ago
Site Reliability Engineer - SRE
CyRAD Solutions is a high-growth startup focused on building the next generation of software-defined networks for satellite megaconstellations and aerospace fleets. They seek a Strategic Site Reliability Engineer to design and manage the core reliability platform and ensure mission readiness through robust engineering practices.
Staffing & Recruiting
Responsibilities
Design the core reliability platform for the final frontier of space mesh networking
Architect mission-critical systems and drive platform maturity
Qualifications
Required
Deep, hands-on expertise in the architecture, scaling, and management of production observability stacks: Prometheus, OpenTelemetry, Grafana, Loki, and distributed tracing systems
Expert-level production experience with Kubernetes and GCP
Proven ability to define, implement, and manage robust SLOs, SLIs, and Error Budgets for high-availability distributed systems, crucial for mission readiness
Mastery of Infrastructure as Code (Terraform) and GitOps (ArgoCD) for automated deployment and scaling across complex cloud environments
Strong command of systems programming; fluency in Go and/or Python is required for developing and optimizing platform tooling
US Citizenship is required
Preferred
Expertise in multi-cloud environments, including AWS, is highly preferred
Experience with Service Mesh (Istio/Linkerd)
Experience instrumenting applications in Go or C++
Experience working with HPC environments (CPU/GPU workloads)
An active Secret security clearance or higher is strongly preferred