Avaya · 2 days ago

Site Reliability Engineer (SRE) - Azure | DevSecOps | IaC | Governance | Observability

United States

Full-time

Remote

Mid, Senior Level

$122K/yr - $160K/yr

5+ years exp

Avaya is an enterprise software leader that helps organizations forge unbreakable connections. The company is seeking a Site Reliability Engineer (SRE) to drive stability, reliability, and performance across its Azure- and GCP-based platforms, blending operational excellence with proactive incident management and collaboration with DevOps and Security teams.

Cloud Computing · Electronics · Information Services · Information Technology · Small and Medium Businesses · Software · Telecommunications · Wireless

No H1B

U.S. Citizen Only

Responsibilities

Serve as a key member of the 24×7 on-call rotation, responding to and managing incidents across production and pre-production environments

Lead incident bridges, coordinate root cause analysis (RCA), and ensure post-incident reviews drive systemic improvements

Maintain clear communication with cross-functional teams and leadership during major incidents

Build, tune, and maintain observability dashboards (Azure Monitor, GCP Operations Suite, Prometheus, Grafana, Datadog, Log Analytics)

Perform deep-dive troubleshooting of application and service-level issues using distributed tracing and log analysis (Grafana, Datadog) to pinpoint root causes beyond infrastructure

Define SLOs, SLIs, and error budgets to proactively identify and mitigate reliability risks before customer impact

Integrate AI-Ops tools for anomaly detection, predictive alerting, and automated incident correlation

Continuously enhance alert quality, reduce false positives, and automate runbooks for faster recovery

Analyze trends to prevent recurring issues and support teams in resilience engineering

Qualifications

Azure · GCP · IaC · CI/CD · Observability · Terraform · Ansible · Jenkins · GitHub Actions · Grafana · Datadog · Analytical skills · Troubleshooting · Continuous Improvement · Communication · Collaboration

Required

5+ years in Site Reliability, DevOps, Cloud Operations, or customer support roles

Demonstrated experience in application-level troubleshooting by analyzing logs and traces to identify bugs, performance bottlenecks, and error conditions

Expertise in Azure and GCP cloud operations and distributed system reliability

Understanding of Terraform, Ansible, and CI/CD pipelines (Jenkins, GitHub Actions)

Experience with observability and AI-Ops tools (Azure Monitor, GCP Operations Suite, Grafana, Prometheus, Datadog, etc.)

Solid grasp of incident management frameworks (P1–P3 handling, RCA, PIRs, on-call rotations)

Excellent analytical, troubleshooting, and communication skills