RunPod · 4 hours ago

Site Reliability Engineer

San Francisco, CA

Full-time

Remote

Mid Level

$152K/yr - $175K/yr

Artificial Intelligence (AI), Cloud Infrastructure

U.S. Citizen Only

Responsibilities

Design, implement, and maintain robust, scalable, and highly available systems

Troubleshoot and resolve complex issues in distributed environments

Develop and implement SLIs and SLOs to ensure system reliability and performance

Manage and optimize large-scale bare-metal fleets across multiple data centers

Implement and maintain secure practices for foundational systems

Collaborate with cross-functional teams to improve system design and operation

Automate processes to increase efficiency and reduce human error

Participate in on-call rotations to provide 24/7 support for critical systems

Qualifications

Linux kernel internals, Containerization (Docker), Virtualization (Kata/QEMU), Distributed system troubleshooting, SLIs, SLOs management, Large-scale bare-metal management, Python, Golang, Configuration management (Chef, Puppet), Secure best practices, AWS, GCP, Azure, Monitoring tools (Statsd, Grafana, Datadog, OpenTelemetry, VictoriaMetrics), AWS IAM permissions, Key distribution systems, OSI model Layers, GPU compute resource management

Required

Deep knowledge of Linux kernel internals, containerization (Docker), virtualization (Kata/QEMU), and networking components

Extensive experience with distributed system troubleshooting and design

Proficiency in at least one programming language, preferably Python or Golang

Proven experience implementing and managing SLIs and SLOs

Experience with pull-based configuration management tools such as Chef or Puppet

Demonstrated ability to manage large-scale bare-metal fleets (5,000+ machines) across multiple data centers

Strong background in implementing secure best practices for foundational systems, including secret management, AWS IAM permissions, and key distribution systems

Comprehensive understanding of OSI model Layers 3, 4, and 7

Successful completion of a background check