This job has closed.

Rivian · 2 weeks ago

Sr. Staff Site Reliability Engineer, Factory Infrastructure & Systems

Rivian is on a mission to keep the world adventurous forever, focusing on emissions-free Electric Adventure Vehicles. The Site Reliability Engineer (SRE) role is responsible for ensuring reliability outcomes for factory digital systems, managing platform engineering, observability, and incident response.

Automotive · Electric Vehicle · Manufacturing · Transportation
H1B Sponsor Likely

Responsibilities

Design and evolve reliable, scalable, and secure platform foundations across hybrid/on‑prem factory environments (e.g., Kubernetes/EKS, vSphere/ESXi, Linux/Windows server, industrial PCs), with clear reliability and cost guardrails
Codify production‑readiness standards and guardrails for factory systems (health checks, runbooks, SLOs/SLIs, deployment safety, failover patterns) aligned to Platform’s production readiness checklist
Advance Infrastructure‑as‑Code and configuration automation (e.g., Terraform/Terragrunt, Ansible) for factory workloads, including provisioning, secrets, policies, and change safety
Partner with Manufacturing Engineering, Factory IT, Security, and Networking to land pragmatic, operable designs; contribute to reference architectures and reusable patterns
Lead or contribute to reliability initiatives (e.g., self‑healing automation, safe rollouts/canaries, rollback strategies) appropriate to level
Raise the bar on end‑to‑end telemetry for factory systems: high‑signal metrics, logs, traces, and SLO‑driven alerts (e.g., Prometheus/Grafana, Loki/Tempo, Datadog, Splunk)
Establish consistent dashboards and service health views for shop/line‑level systems, including exporters for hypervisor/VM health and plant endpoints where feasible (e.g., vSphere exporters)
Improve alert quality and ownership: reduce noise, align escalation policies, and ensure actionable runbooks and health checks for critical services
Build internal tooling (CLI/SDKs, operators/controllers, remediation bots) that turns telemetry into prevention and rapid response
Act as technical incident responder for factory‑impacting events; lead fast triage and stabilize services
Drive post‑incident reviews that eliminate repeat failure modes; improve MTTR and availability through durable engineering fixes and process improvements
Drill on‑call readiness, escalation policies, and schedules using established incident tooling and practices (e.g., Rootly/alternatives), tuned for 24x7 manufacturing operations
Mentor peers through reliability deep dives, failover exercises, and simulation runbooks (breadth of mentorship scales with level)

Qualifications

SRE/Platform/DevOps experience · Kubernetes/EKS · AWS primitives · Observability tools · Infrastructure automation · Production change safety · Incident leadership · Communication skills · Collaboration skills

Required

Production experience in SRE/Platform/DevOps or Operations, owning availability, performance, and cost for critical services
Strength in several of: Kubernetes/EKS and container networking; AWS primitives for resilient platforms; vSphere/ESXi and virtualization; Linux administration (and working knowledge of Windows Server); service discovery, load balancing, and DNS
Observability across metrics/logs/traces, SLO/error‑budget practice, and alert hygiene with tools like Prometheus/Grafana, Loki/Tempo, Datadog, Splunk
Production change safety: GitOps, progressive delivery, guardrails in CI/CD (GitLab preferred), automated rollbacks, and policy‑as‑code
Infrastructure automation: Terraform/Terragrunt, Ansible, scripting (Python/Bash), secrets management, and least‑privilege patterns
Incident leadership/participation in 24x7 environments; clear comms under pressure and a habit of converting learnings into durable fixes
Ability to partner across Factory IT, Manufacturing Engineering, Security, Networking, and application teams; communicate tradeoffs simply and drive decisions

Preferred

Industrial/OT‑adjacent experience (lineside HMIs, MES/SCADA integrations, PLC interfaces, ruggedized compute) and shop‑floor networking constraints
Experience building or integrating exporters (e.g., vSphere) or consolidating factory telemetry into plant‑wide health views
DR playbooks, capacity modeling, and cost/performance optimization for hybrid environments

Benefits

Rivian provides robust medical/Rx, dental and vision insurance packages for full-time employees, their spouse or domestic partner, and children up to age 26. Coverage is effective on the first day of employment, and Rivian covers most of the premiums.

Company

Rivian is an automotive technology company that develops products and services to advance the shift to sustainable mobility.

H1B Sponsorship

Rivian has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data Powered by US Department of Labor)
Trends of Total Sponsorships
2025 (38)
2024 (70)
2023 (54)
2022 (79)
2021 (21)

Funding

Current Stage
Public Company
Total Funding
$21.93B
Key Investors
Volkswagen Group · US Department of Energy · Illinois Department of Commerce & Economic Opportunity
2025-06-30 · Post-IPO Equity · $1B
2024-11-25 · Post-IPO Debt · $6.6B
2024-05-02 · Grant · $827M

Leadership Team

Robert Scaringe
Chief Executive Officer
Claire McDonough
Chief Financial Officer
Company data provided by Crunchbase