This job has closed.

Rivian · 2 weeks ago

Sr. Staff Site Reliability Engineer, Factory Infrastructure & Systems

Atlanta, GA

Full-time

Onsite

Senior Level, Lead/Staff

$199K/yr - $249K/yr

Rivian is on a mission to keep the world adventurous forever, focusing on emissions-free Electric Adventure Vehicles. The Site Reliability Engineer (SRE) role owns reliability outcomes for factory digital systems, spanning platform engineering, observability, and incident response.

Automotive · Electric Vehicle · Manufacturing · Transportation

H1B Sponsor Likely

Responsibilities

Design and evolve reliable, scalable, and secure platform foundations across hybrid/on‑prem factory environments (e.g., Kubernetes/EKS, vSphere/ESXi, Linux/Windows server, industrial PCs), with clear reliability and cost guardrails

Codify production‑readiness standards and guardrails for factory systems (health checks, runbooks, SLOs/SLIs, deployment safety, failover patterns) aligned to Platform’s production readiness checklist

Advance Infrastructure‑as‑Code and configuration automation (e.g., Terraform/Terragrunt, Ansible) for factory workloads, including provisioning, secrets, policies, and change safety

Partner with Manufacturing Engineering, Factory IT, Security, and Networking to land pragmatic, operable designs; contribute to reference architectures and reusable patterns

Lead or contribute to reliability initiatives (e.g., self‑healing automation, safe rollouts/canaries, rollback strategies) appropriate to level

Raise the bar on end‑to‑end telemetry for factory systems: high‑signal metrics, logs, traces, and SLO‑driven alerts (e.g., Prometheus/Grafana, Loki/Tempo, Datadog, Splunk)

Establish consistent dashboards and service health views for shop/line‑level systems, including exporters for hypervisor/VM health and plant endpoints where feasible (e.g., vSphere exporters)

Improve alert quality and ownership: reduce noise, align escalation policies, and ensure actionable runbooks and health checks for critical services

Build internal tooling (CLI/SDKs, operators/controllers, remediation bots) that turns telemetry into prevention and rapid response

Act as technical incident responder for factory‑impacting events; lead fast triage and stabilize services

Drive post‑incident reviews that eliminate repeat failure modes; improve MTTR and availability through durable engineering fixes and process improvements

Drill on‑call readiness, escalation policies, and schedules using established incident tooling and practices (e.g., Rootly/alternatives), tuned for 24x7 manufacturing operations

Mentor peers through reliability deep dives, failover exercises, and simulation runbooks (breadth of mentorship scales with level)

Qualifications

SRE/Platform/DevOps experience · Kubernetes/EKS · AWS primitives · Observability tools · Infrastructure automation · Production change safety · Incident leadership · Communication skills · Collaboration skills

Required

Production experience in SRE/Platform/DevOps or Operations, owning availability, performance, and cost for critical services

Strength in several of: Kubernetes/EKS and container networking; AWS primitives for resilient platforms; vSphere/ESXi and virtualization; Linux administration (plus working knowledge of Windows Server); service discovery, load balancing, and DNS

Observability across metrics/logs/traces, SLO/error‑budget practice, and alert hygiene with tools like Prometheus/Grafana, Loki/Tempo, Datadog, Splunk

Production change safety: GitOps, progressive delivery, guardrails in CI/CD (GitLab preferred), automated rollbacks, and policy‑as‑code

Infrastructure automation: Terraform/Terragrunt, Ansible, scripting (Python/Bash), secrets management, and least‑privilege patterns

Incident leadership/participation in 24x7 environments; clear comms under pressure and a habit of converting learnings into durable fixes

Ability to partner across Factory IT, Manufacturing Engineering, Security, Networking, and application teams; communicate tradeoffs simply and drive decisions

Preferred

Industrial/OT‑adjacent experience (lineside HMIs, MES/SCADA integrations, PLC interfaces, ruggedized compute) and shop‑floor networking constraints

Experience building or integrating exporters (e.g., vSphere) or consolidating factory telemetry into plant‑wide health views

DR playbooks, capacity modeling, and cost/performance optimization for hybrid environments

Benefits

Rivian provides robust medical/Rx, dental and vision insurance packages for full-time employees, their spouse or domestic partner, and children up to age 26. Coverage is effective on the first day of employment, and Rivian covers most of the premiums.

Company

Rivian

Glassdoor: 3.4

Rivian is an automotive technology company that develops products and services to advance the shift to sustainable mobility.

Founded in 2009

Plymouth, Michigan, USA

10001+ employees

http://www.rivian.com

H1B Sponsorship

Rivian has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by US Department of Labor)

Trends of Total Sponsorships

2025 (38)

2024 (70)

2023 (54)

2022 (79)

2021 (21)

Funding

Current Stage

Public Company

Total Funding

$21.93B

Key Investors

Volkswagen Group · US Department of Energy · Illinois Department of Commerce & Economic Opportunity

2025-06-30 · Post-IPO Equity · $1B

2024-11-25 · Post-IPO Debt · $6.6B

2024-05-02 · Grant · $827M

Leadership Team

Robert Scaringe

Chief Executive Officer

Claire McDonough

Chief Financial Officer

Recent News

Business Insider

Ford is throwing its hat into the ring alongside Rivian and making an AI companion in-house

2026-01-08

EIN Presswire

Why Mobile Charging Stations Are Crucial for Electric Car Mobility

2026-01-07

Business Insider

Inside Rivian's 'profound' push toward AI-defined vehicles

2026-01-07

Company data provided by Crunchbase