Staff Software Engineer - Site Reliability jobs in United States

Celonis · 1 day ago

Staff Software Engineer - Site Reliability

Celonis is the global leader in Process Intelligence technology and one of the fastest-growing SaaS firms. As a member of the Reliability Engineering team, you will ensure the health, performance, and resilience of the platform by applying advanced software engineering and SRE principles to drive system reliability and scalability.

Analytics, Artificial Intelligence (AI), Big Data, Business Intelligence, Business Process Automation (BPA), SaaS
No H1B

Responsibilities

Lead reliability efforts for a fleet of 80+ FedRAMP-compliant microservices running on Kubernetes, applying SRE principles to drive observability, automation, and incident prevention
Own high-priority application incident escalations, performing deep technical analysis and restoration within defined SLOs, while continuously improving detection and response mechanisms
Engineer solutions to enhance the availability, latency, and performance of production services, automating manual processes to eliminate toil and scale operational efficiency
Collaborate closely with platform and application engineering teams to conduct post-incident reviews, extract insights, and implement systemic changes that improve overall reliability
Document operational knowledge and runbooks, embedding SRE best practices into onboarding, incident response, and platform architecture standards

Qualifications

Site Reliability Engineering, Kubernetes, Cloud platforms, Java, Python, Spring framework, SRE principles, Problem-solving, Troubleshooting, Communication skills

Required

Bachelor's or Master's degree in Computer Science, Software Engineering, or a related technical field (or equivalent hands-on experience)
Minimum of 5 years of experience building and maintaining cloud-based software applications with at least one public cloud platform (AWS, Azure, or GCP)
Proficiency in Java, the Spring framework, and Python (or a similar scripting language) in a Linux environment
Prior experience contributing to Site Reliability Engineering initiatives or similar operational roles
Knowledge of SRE principles, including SLI/SLO design, error budgets, and toil reduction strategies
Proven expertise in developing and operating production-grade, scalable services using Kubernetes and elastic cloud architectures
Strong problem-solving and troubleshooting abilities in complex, distributed systems
Excellent written and verbal communication skills in English

Preferred

Familiarity with observability and monitoring tools (e.g., Datadog)
Experience with CI/CD pipelines and tools such as ArgoCD, GitHub Actions, or similar
Experience with Infrastructure as Code (IaC) tools such as Terraform and Kustomize
Exposure to incident management practices, on-call rotations, and postmortem culture

Benefits

Generous PTO
Hybrid working options
Company equity (RSUs)
Comprehensive benefits
Extensive parental leave
Dedicated volunteer days
Gym subsidies
Counseling
Well-being programs

Company

Celonis provides an execution management system that helps companies run their business processes.

Funding

Current Stage: Late Stage
Total Funding: $2.37B
Key Investors: Qatar Investment Authority, KeyBanc Capital Markets, Arena Holdings
2023-07-15: Secondary Market
2022-08-23: Series D, $400M
2022-08-23: Debt Financing, $600M

Leadership Team

Alexander Rinke, Co-CEO
Bastian Nominacher, Co-CEO / Co-Founder
Company data provided by Crunchbase