RunPod · 4 hours ago
Site Reliability Engineer
Artificial Intelligence (AI) · Cloud Infrastructure
U.S. Citizen Only
Responsibilities
Design, implement, and maintain robust, scalable, and highly available systems
Troubleshoot and resolve complex issues in distributed environments
Develop and implement SLIs and SLOs to ensure system reliability and performance
Manage and optimize large-scale bare-metal fleets across multiple data centers
Implement and maintain security practices for foundational systems
Collaborate with cross-functional teams to improve system design and operation
Automate processes to increase efficiency and reduce human error
Participate in on-call rotations to provide 24/7 support for critical systems
Qualifications
Required
Deep knowledge of Linux kernel internals, containerization (Docker), virtualization (Kata/QEMU), and networking components
Extensive experience with distributed system troubleshooting and design
Proficiency in at least one programming language, preferably Python or Golang
Proven experience implementing and managing SLIs and SLOs
Experience with pull-based configuration management tools such as Chef or Puppet
Demonstrated ability to manage large-scale bare-metal fleets (5,000+ machines) across multiple data centers
Strong background in implementing security best practices for foundational systems, including secret management, AWS IAM permissions, and key distribution systems
Comprehensive understanding of OSI model Layers 3, 4, and 7
Successful completion of a background check
Preferred
Bachelor's degree in Computer Science, Engineering, or a related field
Relevant industry certifications (e.g., AWS Certified DevOps Engineer, Certified Kubernetes Administrator)
Experience with cloud platforms (AWS, GCP, Azure)
Familiarity with monitoring and observability tools (e.g., StatsD, Grafana, Datadog, OpenTelemetry, VictoriaMetrics)
Experience with managing fleets of GPU compute resources at scale
Strong communication skills and ability to work effectively in a team environment
Benefits
Stock options
The flexibility of remote work with an inclusive, collaborative team.
An opportunity to grow with a company that values innovation and user-centric design.
Generous vacation policy to ensure work-life harmony and well-being.
Company
RunPod
RunPod is a cloud platform designed for GPUs, enabling developers to deploy customized full-stack AI applications.
Funding
Current Stage: Early Stage
Total Funding: $22M
2024-05-08 · Seed · $20M
2023-03-30 · Pre Seed · $2M
Company data provided by Crunchbase