Chamberlain Advisors · 23 hours ago
Associate Principal, Site Reliability Engineer
Chamberlain Advisors is partnering with a leading equity derivatives clearing organization to hire a highly skilled Senior Site Reliability Engineer (SRE) to support the reliability, availability, and performance of their next-generation cloud platforms. This role is critical to ensuring systems operate at scale with high resiliency while enabling development teams to deliver features quickly and safely.
Staffing & Recruiting
Responsibilities
Ensure the availability, performance, scalability, and reliability of production systems supporting the client’s cloud-based platforms
Partner with software development, operations, and infrastructure teams to design and operate production-ready services
Design and implement automation to improve incident response, reduce manual effort, and prevent recurring issues
Develop, maintain, and continuously improve runbooks and operational documentation for service outages and degradations
Assess production readiness of services by evaluating reliability, observability, scalability, and operational risk
Define, implement, and monitor key operational metrics related to system health, performance, and capacity
Architect, develop, and maintain shared reliability services and tooling used across the organization
Participate in incident management, root cause analysis, and post-incident reviews with a focus on long-term remediation
Contribute to continuous improvement through retrospectives, technical research, code reviews, and design discussions
Influence delivery timelines and technical expectations by identifying reliability risks and improvement opportunities
Mentor junior engineers and share knowledge through documentation and collaborative team engagement
Support Agile/Scrum delivery by contributing to sprint planning, backlog refinement, and story development
Qualifications
Required
Bachelor's degree in Management Information Systems, Computer Science, or a related field
Minimum of 4 years of experience in Site Reliability Engineering, DevOps, or a related engineering discipline
Proven experience supporting large-scale distributed production systems
Experience working in Agile/Scrum environments
Cloud Platforms: Public cloud experience with AWS (preferred), Azure, or GCP
Observability & AIOps: Monitoring, logging, alerting, and predictive analytics using tools such as Splunk, Datadog, AppDynamics, Prometheus, Grafana, Sysdig, or Stackdriver
Programming & Automation: Proficiency in Python, Java, Go, or Bash for automation and tooling
Containers & Orchestration: Experience with Kubernetes and container platforms such as Docker, Rancher, or Mesos
Distributed Systems: Messaging and event-driven platforms including Kafka, RabbitMQ, or ActiveMQ
CI/CD & DevOps: Pipeline and deployment tools such as Jenkins, Harness, Travis CI, AWS CodeBuild/CodePipeline, or AppVeyor
AI Enablement: Familiarity with using large language models (LLMs) to automate SRE workflows (e.g., scripting, incident analysis, reporting)
Resilience Engineering: Foundational exposure to Chaos Engineering and fault-injection tools (e.g., Gremlin, Chaos Monkey, AWS FIS)
Benefits
Comprehensive medical, dental, and vision coverage; PTO and paid holidays; 401(k) with match; professional development; a collaborative culture; and work-life balance