Berkley Hunt · 11 hours ago
Site Reliability Engineer
Berkley Hunt has partnered with a high-growth fintech company to hire a Site Reliability Engineer to help build, operate, and scale a globally distributed, highly available cloud platform. This role focuses on reliability, automation, and operational excellence, working closely with engineering teams to ensure systems are resilient, scalable, and production-ready from day one.
Responsibilities
Architect and evolve cloud infrastructure to support a secure, highly available, and globally distributed fintech platform
Embed reliability best practices into the development lifecycle, influencing design decisions before code reaches production
Drive improvements in deployment workflows through GitOps and Infrastructure-as-Code methodologies
Enhance system visibility by building robust monitoring, logging, and alerting frameworks
Lead incident response efforts, conduct post-incident reviews, and implement preventative measures to strengthen platform resilience
Continuously refine Kubernetes environments to improve performance, scalability, and operational efficiency
Partner cross-functionally with engineering and product teams to balance speed of delivery with operational stability
Reduce operational toil by identifying automation opportunities and improving internal tooling
Qualifications
Required
You think in systems, not silos: you naturally connect infrastructure decisions to customer experience and business impact
You have strong experience running production environments at scale and understand what 'good' looks like in terms of uptime, latency, and reliability
You're confident operating Kubernetes in real-world production settings, not just deploying to it
You have a solid background in cloud architecture across AWS and GCP, and understand the trade-offs of distributed systems
You are proactive about identifying risk and eliminating single points of failure before they become incidents
You are comfortable working in fast-paced environments where priorities evolve and ownership is shared
You believe infrastructure should be repeatable, observable, and continuously improved