PDS · 7 hours ago
Site Reliability Engineer
PDS is seeking an experienced Site Reliability Engineer with strong observability expertise to enhance transaction traceability, performance, and resiliency across a complex enterprise environment. The role focuses on building visibility into critical transaction flows and collaborating with cross-functional teams to implement observability frameworks and optimize system performance.
Computer · Information Technology · Software · Staffing Agency
Responsibilities
Design and implement observability frameworks for full transaction traceability across microservices, APIs, databases, and third-party integrations
Utilize tools such as Dynatrace, OpenTelemetry, ELK, and Grafana to visualize dependencies and build actionable dashboards, alerts, and real‑time performance insights
Monitor latency, throughput, and failures to identify bottlenecks
Use telemetry and distributed tracing to troubleshoot and optimize transaction performance
Partner with application and database teams to improve system efficiency
Work with architects, engineering teams, and stakeholders to define observability standards and resiliency requirements
Establish monitoring best practices and provide training across teams
Identify and prioritize business‑critical transaction paths
Implement redundancy, failover strategies, and fault‑tolerant architectures
Support chaos engineering initiatives and resiliency testing
Define and measure SLOs and SLIs for critical transaction paths
Maintain detailed documentation of transaction flows and monitoring configurations
Produce regular reporting on system performance, resiliency, and improvement initiatives
Create incident playbooks and reusable observability frameworks
Achieve a 30% reduction in MTTD and MTTR within the first year
Identify the offending service or root cause for at least 70% of incidents within one hour
Detect 90% of issues through automated monitoring
Contribute to a culture of continuous improvement and knowledge sharing
Qualifications
Required
5+ years in SRE, Observability, or related engineering roles
Hands-on experience with Dynatrace, ELK, Datadog, Splunk, OpenTelemetry, Jaeger, or similar tools
Strong background with AWS, Azure, or GCP
Solid understanding of microservices, APIs, and distributed systems
Proficiency with scripting or programming languages such as Python, Go, or Java
Preferred
Dynatrace Associate or Professional Certification
Experience with OpenTelemetry and observability standards
Familiarity with chaos engineering practices
Experience with AIOps and automation-driven monitoring