
Metropolis Technologies · 2 hours ago

Staff Software Engineer, Reliability

Metropolis is an artificial intelligence company that uses computer vision technology to enable frictionless, checkout-free experiences in the real world. They are seeking a Staff Software Engineer focused on Reliability to own reliability across the entire Metropolis platform, ensuring system availability, resilience, and observability for mission-critical mobility infrastructure.

Artificial Intelligence (AI), Computer Vision, Machine Learning, Parking, Software
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Own the overall reliability posture for the Metropolis platform, establishing practices, metrics, and systems that ensure 99.9%+ uptime across all services
Design and implement automatic failover mechanisms for critical external dependencies (Twilio for SMS/voice, Stripe for payments) with circuit breakers, retry policies, and degraded mode operations
Architect and build active-passive or active-active regional deployment strategies with database replication, automated failover, and DNS-based traffic routing, including disaster recovery planning and testing
Establish comprehensive monitoring using Datadog for APM, logs, and metrics correlation; implement synthetic monitoring, SLO-based alerting, on-call rotation, and escalation policies; build service health dashboards that show customer impact
Own the incident management process including workflows, tooling, post-mortem culture, runbook automation, and MTTR reduction initiatives – driving down mean time to recovery from detection to resolution
Drive adoption of resilience patterns across all services including health checks, graceful degradation, feature flags, rate limiting, backpressure mechanisms, and chaos engineering practices
Build and maintain local mirrors for critical dependencies (Maven/NPM/Docker registries) with artifact caching, dependency pinning, and vulnerability scanning to prevent build failures from upstream outages

Qualifications

Reliability engineering, Distributed systems, Java proficiency, Microservices architecture, Cloud platforms (AWS), Database knowledge, Container orchestration, Observability expertise, Incident response leadership, Chaos engineering, Performance optimization, Open source contributions, Technical communication

Required

8+ years of backend software engineering experience with deep focus on distributed systems and platform infrastructure
Expert-level Java proficiency with deep understanding of JVM performance, concurrency, and ecosystem tooling; Scala experience is a big plus
Production experience with microservices architecture, container orchestration (Kubernetes), and cloud platforms (AWS)
Strong systems thinking with proven ability to design and implement large-scale, high-availability distributed systems that handle significant load
Observability expertise including hands-on production experience with metrics, logging, tracing, and alerting systems in high-load environments
Database and data systems knowledge including relational databases, event streaming (Kafka, SQS), caching strategies, and data consistency patterns
Experience with AI-powered development tools such as Claude Code, GitHub Copilot, or similar agentic coding tools for enhanced productivity – context engineering in particular
Excellent technical communication with ability to design and document complex systems, lead technical discussions, and collaborate across multiple teams local to the New York City, Seattle, or Los Angeles area

Preferred

SRE or Reliability Engineering experience at companies known for operational excellence or high-growth startups where you built reliability practices from the ground up
Incident response leadership including experience building incident management processes, conducting blameless post-mortems, and driving MTTR reduction initiatives in production environments
Chaos engineering experience with tools like Chaos Monkey, Gremlin, or similar, including designing and executing game days and failure injection testing
Performance optimization experience with profiling, benchmarking, capacity planning, and system tuning at hyperscale, including experience optimizing high-throughput, low-latency systems
Open source contributions or technical blog writing that demonstrates depth of expertise in reliability engineering, distributed systems, or production operations

Benefits

Healthcare benefits
401(k) plan
Short-term and long-term disability coverage
Basic life insurance
A lucrative stock option plan
Bonus plans

Company

Metropolis Technologies

Metropolis is building AI for the real world.

H1B Sponsorship

Metropolis Technologies has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data from the US Department of Labor)
Trends of Total Sponsorships
2023 (4)
2021 (3)
2020 (1)

Funding

Current Stage: Late Stage
Total Funding: $3.53B
Key Investors: LionTree, JP Morgan Chase, 3L Capital
Funding Rounds:
2025-11-06 · Series D · $500M
2025-11-06 · Debt Financing · $1.1B
2023-10-05 · Series C · $1.05B

Leadership Team

Alexander Israel · Co-Founder & CEO
Travis Kell · Co-Founder & Chief Strategy Officer
Company data provided by Crunchbase