Staff Software Engineer - Resiliency and Platform Engineering jobs in United States

Choice Hotels International · 14 hours ago

Staff Software Engineer - Resiliency and Platform Engineering

Choice Hotels International is seeking a Staff Software Engineer for their SkyTouch Technology division, which provides a cloud-based hotel property management system. The role focuses on enhancing the resiliency and operability of a large-scale SaaS platform by improving foundational capabilities and developer experience.

Hospitality, Hotel, Travel
No H1B

Responsibilities

Design and implement platform-level capabilities including shared libraries, frameworks, tooling, automation, and guardrails that improve application resiliency, runtime safety, and developer experience across the ecosystem, favoring leverage and durability over short-term delivery
Strengthen foundational platform and runtime behavior by identifying and eliminating systemic failure modes such as JVM memory leaks, unsafe defaults, brittle error handling, poor failure propagation, and resource exhaustion
Improve how software is built and operated at scale by defining and rolling out developer-facing standards and paved roads for resiliency, observability, error handling, and operational readiness
Define, standardize, and evolve logging, monitoring, alerting, and observability practices that improve signal quality, reduce noise, and enable faster diagnosis and recovery
Partner closely with Principal Software Engineers, Solution Architects, and Engineering Managers to identify systemic risks and translate them into well-scoped platform and resiliency initiatives and technical work
Operate across software engineering resiliency, data engineering resiliency, and platform engineering teams to identify cross-cutting risks, design shared solutions, and raise the technical bar, rather than owning individual team backlogs
Engage directly in application codebases, particularly during ramp-up, to understand real-world system behavior, identify failure patterns, and validate resiliency improvements. Exit application-level work once learning is complete and systemic improvements are identified
Participate in incident postmortems and operational reviews to identify recurring patterns and ensure lessons learned are translated into durable platform or resiliency improvements, not one-off fixes
Evaluate, prototype, and introduce tools and technologies that measurably improve developer productivity, platform safety, and application resiliency, prioritizing adoption, simplicity, and long-term impact
Apply AI-assisted development, diagnostics, and operational tools where they demonstrably improve engineering productivity, root cause analysis, signal quality, or resiliency outcomes
Influence engineering practices and technical direction through design reviews, reference implementations, mentorship, and technical leadership rather than formal authority or delivery ownership

Qualifications

Java-based services, Cloud-native workloads, AWS public cloud, Application monitoring, AI-assisted tools, Site Reliability Engineering, Soft skills

Required

Bachelor's degree in computer science, or a related technical field, or equivalent practical experience building and operating production systems
Typically, 8–10+ years of hands-on experience designing, building, and supporting large-scale software systems in production environments
Hands-on experience designing, building, and operating Java-based services, including Spring Boot applications running in virtualized and containerized environments
Experience developing and supporting cloud-native and serverless workloads, including Python-based services and event-driven architectures
Strong practical experience working in AWS public cloud environments, with an understanding of how cloud-managed services influence reliability, scalability, and operational behavior
Working knowledge of relational and non-relational data stores, including how data persistence, availability, and failure characteristics impact system design and resiliency
Experience using application monitoring and observability platforms to understand system behavior in production, such as application performance monitoring, centralized logging, and cloud-native telemetry tools (for example, AppDynamics, OpenSearch, Amazon CloudWatch, or similar)
Comfortable diagnosing complex production issues by interpreting metrics, logs, traces, and runtime signals rather than relying solely on reactive incident handling
Solid understanding of Site Reliability Engineering (SRE) principles, with the judgment to apply them selectively to guide platform and resiliency improvements rather than adopting SRE practices as a one-size-fits-all operating model
Demonstrated ability to choose between software design changes, platform capabilities, or developer enablement as the most effective way to improve reliability and operability
Hands-on experience designing and delivering one or more platform-level capabilities such as shared libraries, frameworks, internal tooling, or enablement platforms used by multiple application teams
Experience creating and rolling out paved roads, guardrails, or standardized patterns that balance safety, usability, and developer autonomy
Experience using AI-assisted tools (such as code assistants, log/trace analysis, or incident analysis tools) to improve engineering effectiveness or system reliability
Proven ability to influence technical direction and engineering practices across teams without direct ownership of delivery backlogs
Successful candidates for this role consistently demonstrate strength in the following Korn Ferry competencies:
Manages Complexity – Navigates complex technical environments, synthesizes information across systems, and identifies systemic root causes
Decision Quality – Makes sound technical decisions under constraints, balancing immediate needs with long-term platform health
Drives Results – Delivers durable improvements in platform resiliency, stability, and developer effectiveness

Preferred

Cloud or technology certifications (such as AWS certifications or equivalent) are a plus and demonstrate commitment to building and operating reliable systems at scale

Benefits

Competitive compensation and benefits, including medical, dental, and vision coverage
Leave and paid time-off for holidays, vacation, personal, family, volunteer, sick, jury duty, bereavement, military, and religious observance
Financial benefits for retirement and health savings
Employee recognition programs
Discounts at Choice hotels worldwide

Company

Choice Hotels International

Choice Hotels International is a hospitality franchisor that provides businesses and travelers with a range of lodging options.

Funding

Current Stage: Public Company
Total Funding: $600M
2024-06-25 · Post-IPO Debt · $600M
1996-10-16 · IPO

Leadership Team

Judd Wadholm
Senior Vice President and General Manager, Core Brands
Noha Abdalla
Chief Marketing Officer
Company data provided by Crunchbase