Rivian · 2 months ago
Sr. Staff Site Reliability Engineer, Factory Infrastructure & Systems
Rivian is on a mission to keep the world adventurous forever and is seeking a Sr. Staff Site Reliability Engineer to own reliability outcomes for factory digital systems. The role spans platform engineering, observability, and incident response, ensuring the reliability and performance of critical factory systems.
Automotive · Electric Vehicle · Manufacturing · Transportation
Responsibilities
Design and evolve reliable, scalable, and secure platform foundations across hybrid/on‑prem factory environments (e.g., Kubernetes/EKS, vSphere/ESXi, Linux/Windows server, industrial PCs), with clear reliability and cost guardrails
Codify production‑readiness standards and guardrails for factory systems (health checks, runbooks, SLOs/SLIs, deployment safety, failover patterns) aligned to Platform’s production readiness checklist
Advance Infrastructure‑as‑Code and configuration automation (e.g., Terraform/Terragrunt, Ansible) for factory workloads, including provisioning, secrets, policies, and change safety
Partner with Manufacturing Engineering, Factory IT, Security, and Networking to land pragmatic, operable designs; contribute to reference architectures and reusable patterns
Lead or contribute to reliability initiatives (e.g., self‑healing automation, safe rollouts/canaries, rollback strategies) appropriate to level
Raise the bar on end‑to‑end telemetry for factory systems: high‑signal metrics, logs, traces, and SLO‑driven alerts (e.g., Prometheus/Grafana, Loki/Tempo, Datadog, Splunk)
Establish consistent dashboards and service health views for shop/line‑level systems, including exporters for hypervisor/VM health and plant endpoints where feasible (e.g., vSphere exporters)
Improve alert quality and ownership: reduce noise, align escalation policies, and ensure actionable runbooks and health checks for critical services
Build internal tooling (CLI/SDKs, operators/controllers, remediation bots) that turns telemetry into prevention and rapid response
Act as technical incident responder for factory‑impacting events; lead fast triage and stabilize services
Drive post‑incident reviews that eliminate repeat failure modes; improve MTTR and availability through durable engineering fixes and process improvements
Drill on‑call readiness, escalation policies, and schedules using established incident tooling and practices (e.g., Rootly/alternatives), tuned for 24x7 manufacturing operations
Mentor peers through reliability deep dives, failover exercises, and simulation runbooks (breadth of mentorship scales with level)
Qualifications
Required
Production experience in SRE/Platform/DevOps or Operations, owning availability, performance, and cost for critical services
Strength in several of: Kubernetes/EKS and container networking; AWS primitives for resilient platforms; vSphere/ESXi and virtualization; Linux (and working Windows Server) administration; service discovery, load balancing, and DNS
Observability across metrics/logs/traces, SLO/error‑budget practice, and alert hygiene with tools like Prometheus/Grafana, Loki/Tempo, Datadog, Splunk
Production change safety: GitOps, progressive delivery, guardrails in CI/CD (GitLab preferred), automated rollbacks, and policy‑as‑code
Infrastructure automation: Terraform/Terragrunt, Ansible, scripting (Python/Bash), secrets management, and least‑privilege patterns
Incident leadership/participation in 24x7 environments; clear comms under pressure and a habit of converting learnings into durable fixes
Ability to partner across Factory IT, Manufacturing Engineering, Security, Networking, and application teams; communicate tradeoffs simply and drive decisions
Preferred
Industrial/OT‑adjacent experience (lineside HMIs, MES/SCADA integrations, PLC interfaces, ruggedized compute) and shop‑floor networking constraints
Experience building or integrating exporters (e.g., vSphere) or consolidating factory telemetry into plant‑wide health views
DR playbooks, capacity modeling, and cost/performance optimization for hybrid environments
Company
Rivian
Rivian is an automotive technology company that develops products and services to advance the shift to sustainable mobility.
H1B Sponsorship
Rivian has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (38)
2024 (70)
2023 (54)
2022 (79)
2021 (21)
Funding
Current Stage: Public Company
Total Funding: $21.93B
Key Investors: Volkswagen Group, US Department of Energy, Illinois Department of Commerce & Economic Opportunity
2025-06-30 · Post-IPO Equity · $1B
2024-11-25 · Post-IPO Debt · $6.6B
2024-05-02 · Grant · $827M
Company data provided by Crunchbase