Topstep · 1 day ago
Staff Site Reliability Engineer
Topstep is a company that emphasizes building resilient infrastructure and operational excellence. As a Staff Site Reliability Engineer, you'll be responsible for shaping the SRE practice, defining incident response culture, and optimizing AWS infrastructure for performance and cost.
Consulting · Education · Financial Services · Stock Exchanges · Trading Platform
Responsibilities
Set technical direction for reliability and observability across the entire engineering organization, influencing architectural decisions
Build and mature our SRE practice, defining SLOs, incident response protocols, and on-call standards
Own the observability stack using DataDog (primary platform for metrics, APM, logging) and CloudWatch (AWS-native monitoring), instrumenting distributed tracing and closing gaps that currently prevent diagnosis of production issues
Partner with engineering teams to embed reliability principles early in the design process and improve system resilience
Lead incident response and blameless post-mortems, turning outages into opportunities for systematic improvement
Mentor engineers across the organization on reliability practices, operational thinking, and production ownership
Champion a culture of transparency, continuous improvement, and shared ownership of production systems
Qualifications
Required
7+ years of professional experience in SRE, infrastructure, or platform engineering, with demonstrated impact building practices that scaled across multiple teams
Proven track record of either starting an SRE function from scratch or scaling an existing practice with measurable improvements to MTTR, MTTD, change failure rate, or availability
Strong proficiency with DataDog for end-to-end observability (metrics, APM, logs, distributed tracing) and building alerting that catches real issues without causing fatigue
Deep expertise with AWS infrastructure (EKS, ECS, EC2, and RDS) running production services at scale, and hands-on experience optimizing for both reliability and cost
Solid foundation in distributed systems, networking, database performance, and debugging complex system failures across service boundaries
Comfortable reading code, writing automation scripts, and contributing to infrastructure tooling when needed
Proficiency with infrastructure as code (Terraform) and GitOps practices
Track record of influencing engineering culture through documentation, tooling, mentorship, and technical leadership
Excellent communication skills with the ability to explain complex system behavior and trade-offs to varied audiences
Comfortable making pragmatic trade-offs between long-term platform vision and immediate business needs
Benefits
10 company-paid holidays and generous family leave.
Paid time off is accrued monthly.
Competitive 401(k) matching and health, dental, and vision insurance are offered to full-time employees.
Vacations are encouraged, with a bonus for taking 5 consecutive days off.
Employee referrals are eligible for a bonus.
Topstep offers a food and groceries budget and contributes toward health and wellness.
Company
Topstep
Topstep is a financial organization that provides resources for traders to develop their skills and learn about the trading industry.
Funding
Current Stage
Growth Stage
Recent News
2025-10-23
2025-10-22
Company data provided by Crunchbase