Site Reliability Engineer (SRE) - Azure | DevSecOps | IaC | Governance | Observability jobs in United States

Avaya · 2 days ago

Site Reliability Engineer (SRE) - Azure | DevSecOps | IaC | Governance | Observability

Avaya is an enterprise software leader that helps organizations forge unbreakable connections. They are seeking a Site Reliability Engineer (SRE) to drive stability, reliability, and performance across their Azure and GCP-based platforms, blending operational excellence with proactive incident management and collaboration with DevOps and Security teams.

Cloud Computing · Electronics · Information Services · Information Technology · Small and Medium Businesses · Software · Telecommunications · Wireless
No H-1B sponsorship · U.S. citizens only

Responsibilities

Serve as a key member of the 24×7 on-call rotation, responding to and managing incidents across production and pre-production environments
Lead incident bridges, coordinate root cause analysis (RCA), and ensure post-incident reviews drive systemic improvements
Maintain clear communication with cross-functional teams and leadership during major incidents
Build, tune, and maintain observability dashboards (Azure Monitor, GCP Operations Suite, Prometheus, Grafana, Datadog, Log Analytics)
Perform deep-dive troubleshooting of application and service-level issues using distributed tracing and log analysis (Grafana, Datadog) to pinpoint root causes beyond infrastructure
Define SLOs, SLIs, and error budgets to proactively identify and mitigate reliability risks before customer impact
Integrate AI-Ops tools for anomaly detection, predictive alerting, and automated incident correlation
Continuously enhance alert quality, reduce false positives, and automate runbooks for faster recovery
Analyze trends to prevent recurring issues and support teams in resilience engineering

Qualifications

Azure · GCP · IaC · CI/CD · Observability · Terraform · Ansible · Jenkins · GitHub Actions · Grafana · Datadog · Analytical skills · Troubleshooting · Continuous Improvement · Communication · Collaboration

Required

5+ years in site reliability, DevOps, cloud operations, or customer support roles
Demonstrated experience in application-level troubleshooting by analyzing logs and traces to identify bugs, performance bottlenecks, and error conditions
Expertise in Azure and GCP cloud operations and distributed system reliability
Understanding of Terraform, Ansible, and CI/CD pipelines (Jenkins, GitHub Actions)
Experience with observability and AI-Ops tools (Azure Monitor, GCP Operations Suite, Grafana, Prometheus, Datadog, etc.)
Solid grasp of incident management frameworks (P1–P3 handling, RCA, PIRs, on-call rotations)
Excellent analytical, troubleshooting, and communication skills

Benefits

Performance-related bonus
Benefits

Company

Avaya provides business communications and collaboration systems, applications, and services.

Funding

Current Stage
Public Company
Total Funding
$700M
2022-06-27 · Post-IPO Debt · $600M
2018-01-17 · IPO
2002-12-24 · Series Unknown · $100M

Leadership Team

Eric Rossman
VP Partners and Alliances
John Graybill
Director of Product Management
Company data provided by Crunchbase