Metropolis Technologies · 2 hours ago

Staff Software Engineer, Reliability

Seattle, Washington, United States

Full-time

Onsite

Senior Level, Lead/Staff

$220K/yr - $250K/yr

8+ years exp

Metropolis is an artificial intelligence company that uses computer vision technology to enable frictionless, checkout-free experiences in the real world. They are seeking a Staff Software Engineer focused on Reliability to own reliability across the entire Metropolis platform, ensuring system availability, resilience, and observability for mission-critical mobility infrastructure.

Artificial Intelligence (AI) · Computer Vision · Machine Learning · Parking · Software

Growth Opportunities

H1B Sponsor Likely

Responsibilities

Own the overall reliability posture for the Metropolis platform, establishing practices, metrics, and systems that ensure 99.9%+ uptime across all services

Design and implement automatic failover mechanisms for critical external dependencies (Twilio for SMS/voice, Stripe for payments) with circuit breakers, retry policies, and degraded mode operations

Architect and build active-passive or active-active regional deployment strategies with database replication, automated failover, and DNS-based traffic routing, including disaster recovery planning and testing

Establish comprehensive monitoring using Datadog for APM, logs, and metrics correlation; implement synthetic monitoring, SLO-based alerting, on-call rotation, and escalation policies; build service health dashboards that show customer impact

Own the incident management process, including workflows, tooling, post-mortem culture, runbook automation, and initiatives that drive down mean time to recovery (MTTR) from detection to resolution

Drive adoption of resilience patterns across all services including health checks, graceful degradation, feature flags, rate limiting, backpressure mechanisms, and chaos engineering practices

Build and maintain local mirrors for critical dependencies (Maven/NPM/Docker registries) with artifact caching, dependency pinning, and vulnerability scanning to prevent build failures from upstream outages

Qualifications

Reliability engineering · Distributed systems · Java proficiency · Microservices architecture · Cloud platforms (AWS) · Database knowledge · Container orchestration · Observability expertise · Incident response leadership · Chaos engineering · Performance optimization · Open source contributions · Technical communication

Required

8+ years of backend software engineering experience with deep focus on distributed systems and platform infrastructure

Expert-level Java proficiency with deep understanding of JVM performance, concurrency, and ecosystem tooling; Scala experience is a big plus

Production experience with microservices architecture, container orchestration (Kubernetes), and cloud platforms (AWS)

Strong systems thinking with proven ability to design and implement large-scale, high-availability distributed systems that handle significant load

Observability expertise including hands-on production experience with metrics, logging, tracing, and alerting systems in high-load environments

Database and data systems knowledge including relational databases, event streaming (Kafka, SQS), caching strategies, and data consistency patterns

Experience using AI-powered development tools such as Claude Code, GitHub Copilot, or similar agentic coding tools to enhance productivity, with context engineering in particular

Excellent technical communication with ability to design and document complex systems, lead technical discussions, and collaborate across multiple teams local to the New York City, Seattle, or Los Angeles area

Preferred

SRE or Reliability Engineering experience at companies known for operational excellence or high-growth startups where you built reliability practices from the ground up

Incident response leadership including experience building incident management processes, conducting blameless post-mortems, and driving MTTR reduction initiatives in production environments

Chaos engineering experience with tools like Chaos Monkey, Gremlin, or similar, including designing and executing game days and failure injection testing

Performance optimization experience with profiling, benchmarking, capacity planning, and system tuning at hyperscale, including optimizing high-throughput, low-latency systems

Open source contributions or technical blog writing that demonstrates depth of expertise in reliability engineering, distributed systems, or production operations

Benefits

Healthcare benefits

401(k) plan

Short-term and long-term disability coverage

Basic life insurance

A lucrative stock option plan

Bonus plans

Company

Metropolis Technologies

Metropolis is building AI for the real world.

Founded in 2017

New York, New York, USA

10001+ employees

http://www.metropolis.io

H1B Sponsorship

Metropolis Technologies has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference. (Data powered by the US Department of Labor)

Trends of Total Sponsorships

2023 (4)

2021 (3)

2020 (1)

Funding

Current Stage

Late Stage

Total Funding

$3.53B

Key Investors

LionTree · JP Morgan Chase · 3L Capital

2025-11-06 · Series D · $500M

2025-11-06 · Debt Financing · $1.1B

2023-10-05 · Series C · $1.05B

Leadership Team

Alexander Israel

Co-Founder & CEO

Travis Kell

Co-Founder & Chief Strategy Officer

Company data provided by Crunchbase