We Insure · 6 hours ago
Staff Platform Resilience Event Manager
Apex Fintech Solutions powers innovation and the future of digital wealth management by processing millions of transactions daily. The Staff Platform Resilience Event Manager is responsible for the strategic planning and execution of platform resilience events, ensuring the organization's resilience posture is continuously validated and improved.
Responsibilities
Develop and maintain the annual Platform Resilience Event Calendar spanning all disaster recovery tests, business continuity exercises, game days, and vendor coordination events
Align the event schedule with regulatory examination cycles, customer audit requests, and internal risk assessment priorities
Define success criteria and maturity progression for resilience events (e.g., tabletop → walkthrough → full failover → automated chaos)
Maintain our risk register with updates based on resilience event findings
Design and facilitate 'game day' exercises that inject controlled failures into production or staging environments to validate system resilience
Partner with Engineering, SRE, Product and Ops teams to develop realistic failure scenarios (database outages, network partitions, dependency failures, traffic spikes)
Build game day playbooks, observer guides, and scoring rubrics to measure system and team response effectiveness
Evolve game day maturity from scheduled events to surprise/unannounced exercises (with appropriate stakeholder buy-in)
Coordinate Platform preparation for and participation in DR/BC events alongside the Enterprise Risk team
In partnership with the Enterprise Risk team, organize and coordinate vendor-led DR tests, ensuring Apex participation and validation of vendor recovery capabilities
Ensure vendor DR documentation (runbooks, RTO/RPO commitments, contact lists) is current and accessible during incidents
Ensure the inventory of critical third-party vendors (cloud providers, tech vendors, service providers, SaaS/IaaS/PaaS services) is maintained along with their contractual DR/BC obligations
Serve as the primary liaison between Platform and Compliance, Legal, Enterprise Risk Management, and Internal Audit
Integrate Security incident response scenarios into resilience events (e.g., ransomware recovery, insider threat)
Translate technical resilience outcomes into compliance artifacts, audit evidence, and regulatory examination responses
Coordinate with Legal on customer contractual obligations for DR demonstrations and availability SLAs
Maintain compliance with FINRA Rule 4370 (Business Continuity Plans), SEC regulations, and state-level financial services resilience requirements
Produce request-ready documentation: game day results and findings, resilience metrics, and improvement tracking
Support regulatory examinations by providing examiner-requested evidence of resilience testing and improvement trends
Define and track key resilience metrics: RTO/RPO actuals, DR test success rates, mean time to failover, game day findings, vendor DR SLA compliance
Produce quarterly executive dashboards on resilience posture, event outcomes, and improvement initiatives
Maintain centralized repository of runbooks, event after-action reports, and lessons learned
Drive continuous improvement by converting event findings into actionable engineering backlogs and process improvements
Benchmark Apex resilience maturity against industry standards (e.g., Gartner, NIST, financial services peers)
Ensure resilience events validate and improve actual incident response capabilities (not just technical recovery)
Integrate platform events with ITSM Incident Management training to build muscle memory for real outages
Validate incident communication plans during events (customer notifications, executive escalations, status pages)
Use real incidents as inputs for future game day scenarios ('let's replay last quarter's outage in a controlled environment')
Qualifications
Required
Bachelor's degree in a technical field (or equivalent work experience)
10+ years in technology operations, site reliability engineering (SRE), DevOps, or infrastructure roles
3+ years in financial services technology (preferably broker-dealer, clearing, custody, or payments)
Hands-on experience with disaster recovery planning and execution in complex, distributed systems environments
Experience supporting regulatory examinations and producing compliance documentation
Working knowledge of FINRA, SEC, and financial services regulatory requirements for business continuity and disaster recovery
Understanding of third-party risk management in regulated environments
Understanding of cloud infrastructure (AWS/Azure/GCP), database failover, load balancing, and multi-region architectures
Familiarity with incident command systems, runbook automation, and monitoring/observability platforms
Proven ability to manage complex, cross-functional programs with multiple stakeholders and competing priorities
Experience leading high-stakes, time-sensitive events requiring real-time coordination and decision-making
Strong project management skills: planning, scheduling, resource coordination, status reporting
Comfort with ambiguity and ability to build new programs from the ground up
Executive presence: able to brief C-suite and board on resilience posture and event outcomes
Exceptional written communication: producing regulatory reports, audit evidence, executive summaries
Ability to influence without authority across technical and non-technical teams
Preferred
Incident command or crisis management experience
Certifications: CBCP (Certified Business Continuity Professional), CISSP, GCP/AWS/Azure certifications, ITIL
Chaos engineering experience (Chaos Monkey, Gremlin, etc.)
Background in internal audit, GRC, or compliance roles
Experience with tabletop exercises and red team/blue team scenarios
Benefits
Healthcare benefits (medical, dental and vision, EAP)
Competitive PTO
401k match
Parental leave
HSA contribution match
Paid subscription to the Calm app
Generous external learning and tuition reimbursement benefits
Hybrid work schedule