Site Reliability Engineer @ RunPod | Jobright.ai
Site Reliability Engineer jobs in San Francisco, CA
This job has closed.

RunPod · 4 hours ago

Site Reliability Engineer

Artificial Intelligence (AI) · Cloud Infrastructure
U.S. Citizen Only

Responsibilities

Design, implement, and maintain robust, scalable, and highly available systems
Troubleshoot and resolve complex issues in distributed environments
Develop and implement SLIs and SLOs to ensure system reliability and performance
Manage and optimize large-scale bare-metal fleets across multiple data centers
Implement and maintain secure practices for foundational systems
Collaborate with cross-functional teams to improve system design and operation
Automate processes to increase efficiency and reduce human error
Participate in on-call rotations to provide 24/7 support for critical systems

Qualifications

Linux kernel internals · Containerization (Docker) · Virtualization (Kata/QEMU) · Distributed system troubleshooting · SLIs · SLOs management · Large-scale bare-metal management · Python · Golang · Configuration management (Chef) · Configuration management (Puppet) · Secure best practices · AWS · GCP · Azure · Monitoring tools (Statsd) · Monitoring tools (Grafana) · Monitoring tools (Datadog) · Monitoring tools (OpenTelemetry) · Monitoring tools (VictoriaMetrics) · AWS IAM permissions · Key distribution systems · OSI model Layers · GPU compute resource management

Required

Deep knowledge of Linux kernel internals, containerization (Docker), virtualization (Kata/QEMU), and networking components
Extensive experience with distributed system troubleshooting and design
Proficiency in at least one programming language, preferably Python or Golang
Proven experience implementing and managing SLIs and SLOs
Experience with pull-based configuration management tools such as Chef or Puppet
Demonstrated ability to manage large-scale bare-metal fleets (5,000+ machines) across multiple data centers
Strong background in implementing secure best practices for foundational systems, including secret management, AWS IAM permissions, and key distribution systems
Comprehensive understanding of OSI model Layers 3, 4, and 7
Successful completion of a background check

Preferred

Bachelor's degree in Computer Science, Engineering, or a related field
Relevant industry certifications (e.g., AWS Certified DevOps Engineer, Certified Kubernetes Administrator)
Experience with cloud platforms (AWS, GCP, Azure)
Familiarity with monitoring and observability tools (e.g., Statsd, Grafana, Datadog, OpenTelemetry, VictoriaMetrics)
Experience with managing fleets of GPU compute resources at scale
Strong communication skills and ability to work effectively in a team environment

Benefits

Stock options
The flexibility of remote work with an inclusive, collaborative team.
An opportunity to grow with a company that values innovation and user-centric design.
Generous vacation policy to ensure work-life harmony and well-being.

Company

RunPod

RunPod is a cloud platform designed for GPUs, enabling developers to deploy customized full-stack AI applications.

Funding

Current Stage
Early Stage
Total Funding
$22M
2024-05-08 · Seed · $20M
2023-03-30 · Pre Seed · $2M

Leadership Team

Zhen Lu
Co-Founder and CEO
Pardeep Singh
CTO and Co-Founder
Company data provided by Crunchbase