Replit · 2 months ago
Site Reliability Engineer
Replit is a software creation platform that enables users to build applications using natural language. The Site Reliability Engineer will ensure the reliability, scalability, and performance of Replit's infrastructure, implementing automation and best practices while designing robust monitoring solutions and leading incident response efforts.
Artificial Intelligence (AI)Cloud ComputingDeveloper ToolsInformation TechnologySoftware
Responsibilities
Design and Implement Observability Solutions: Develop comprehensive monitoring and alerting systems using modern observability tools. Create dashboards and metrics that provide real-time visibility into system health and performance. Implement logging strategies that enable quick problem identification and resolution
Drive Automation and Infrastructure as Code: Architect and implement infrastructure automation solutions using tools like Terraform, Ansible, or Pulumi. Design and maintain CI/CD pipelines that enable reliable and consistent deployments. Create self-healing systems that can automatically respond to common failure scenarios
Establish SLOs and SLIs: Work with product and engineering teams to define and implement Service Level Objectives (SLOs) and Service Level Indicators (SLIs). Build systems to track and report on these metrics, ensuring we maintain high reliability standards while balancing innovation speed
Incident Management and Response: Lead incident response efforts, conducting thorough post-mortems, and implementing improvements to prevent future occurrences. Develop and maintain runbooks for critical services. Build tools and processes that reduce Mean Time To Recovery (MTTR)
Performance Optimization: Identify and resolve performance bottlenecks across our infrastructure. Implement capacity planning strategies and optimize resource utilization. Work on reducing latency and improving system efficiency across global regions
Qualification
Required
4-8 years of experience in Site Reliability Engineering or similar roles (DevOps, Systems Engineering, Infrastructure Engineering)
Strong programming skills in languages commonly used for automation (Python, Go, or similar)
Deep understanding of distributed systems
Experience with container orchestration platforms (Kubernetes) and cloud-native technologies
Proven track record of implementing and maintaining monitoring/observability solutions
Strong incident management skills with experience leading incident response
Experience with infrastructure as code and configuration management tools
Preferred
Experience with Google Cloud Platform (GCP) services and tools
Knowledge of modern observability platforms (Prometheus, Grafana, Datadog, etc.)
Benefits
401(k) Program
Health, Dental, Vision and Life Insurance
Short Term and Long Term Disability
Paid Parental, Medical, Caregiver Leave
Commuter Benefits
Monthly Wellness Stipend
Autonoumous Work Environement
In Office Set-Up Reimbursement
Flexible Time Off (FTO) + Holidays
Quarterly Team Gatherings
In Office Amenities
Company
Replit
Replit is the most secure agentic platform for production-ready apps.
H1B Sponsorship
Replit has a track record of offering H1B sponsorships. Please note that this does not
guarantee sponsorship for this specific role. Below presents additional info for your
reference. (Data Powered by US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship
Represents job field similar to this job
Trends of Total Sponsorships
2025 (8)
2024 (5)
2023 (2)
2022 (2)
Funding
Current Stage
Growth StageTotal Funding
$472.02MKey Investors
Prysm CapitalCraft VenturesAndreessen Horowitz
2025-07-30Series C· $250M
2023-11-06Series B· $20M
2023-04-25Series B· $97.4M
Recent News
2026-01-19
Company data provided by crunchbase