BNY · 3 months ago

Vice President, Site Reliability Engineer

240 Greenwich Street, New York, NY, 10286, US

Full-time

Onsite

Mid, Senior Level

$83K/yr - $150K/yr

5+ years exp

BNY is a leading global financial services company at the heart of the global financial system. The role of Vice President, Site Reliability Engineer involves driving reliability and performance, automating infrastructure and operations, and collaborating with cross-functional teams to build resilient services.

Financial Services

Responsibilities

Drive reliability and performance by defining SLOs/SLIs, improving observability, and proactively identifying and addressing system bottlenecks across cloud environments

Automate infrastructure and operations using Terraform, Kubernetes, and CI/CD tools to eliminate toil and enable scalable, fault-tolerant deployments

Collaborate cross-functionally with product, infrastructure, and DevOps teams to reduce incidents, build resilient services, and ensure architectural clarity

Lead incident management by participating in on-call rotations, conducting postmortems, and implementing automated recovery to minimize downtime

Build and maintain monitoring systems with tools like Prometheus, Grafana, AppDynamics, and Splunk to support real-time alerting and root cause analysis

Develop platform tooling and pipelines for container orchestration, third-party integrations, and cloud-native operations to improve system efficiency and reliability

Maintain and improve live services by measuring and monitoring latency and overall system health, working closely with tech support and operations teams

Define and use KPIs to understand service performance and identify corrective actions

Create, manage, and use dashboards for continuous monitoring and health checks of applications and underlying infrastructure

Design and implement solutions to customer friction points and improve the entire lifecycle of services from inception through sustainment

Assist in creating and maintaining automation to improve reliability and velocity in addressing issues during regular maintenance tasks

Mentor engineers and champion SRE best practices, embedding a reliability-first culture and ensuring technical excellence across engineering teams

Qualifications

Cloud infrastructure, Containerization, Infrastructure as Code, Observability tools, Programming skills, SRE principles, CI/CD methods, Agile experience, Collaboration skills, Communication skills

Required

Bachelor's degree in computer science or a related discipline, or equivalent work experience required

5-8 years of related experience; experience in the securities or financial services industry is a plus

Strong expertise in cloud infrastructure (Azure, AWS, or GCP), containerization (Docker, Kubernetes), and Infrastructure as Code (Terraform, Helm)

Proficiency in observability and monitoring tools such as Prometheus, Grafana, AppDynamics, Datadog, Splunk, and experience with incident response and on-call support

Solid programming and scripting skills in languages like Python, Go, or Java, with a focus on automation, tooling, and system integration

Deep understanding of SRE principles, including SLAs, SLOs, error budgets, postmortems, and reliability-focused system design

Familiarity with automated testing, DevSecOps practices, CI/CD methods, performance engineering, and security controls

Strong collaboration and communication skills, with experience working in Agile environments and partnering with cross-functional engineering, product, and operations teams

Proven track record in technical engineering, with coding experience beyond simple scripts