Largeton Group · 5 hours ago
Site Reliability Engineer (SRE) – Dynatrace Automation Specialist
Largeton Group is partnering with a premier client in Washington, D.C. to hire a Site Reliability Engineer specialized in Dynatrace Automation. The role focuses on automating the rollout of Dynatrace across large-scale environments and integrating observability into the CI/CD lifecycle to ensure system stability and performance.
Education · Consulting · Information Technology · Training
Responsibilities
Dynatrace Automation: Lead the automated deployment and scaling of Dynatrace across hybrid environments; standardize installations using Infrastructure-as-Code (Terraform/Ansible) rather than manual configuration
CI/CD Pipeline Integration: Build and optimize "Observability-as-Code" by integrating Dynatrace with GitHub Actions, Jenkins, or AWS CodePipeline to enable automated quality gates and performance tracking
Advanced Observability: Implement environment-aware configurations and distributed tracing with context propagation; enforce strict metadata and tagging standards through automated scripts
Resiliency & DR Testing: Design and execute automated resiliency test plans and disaster recovery simulations; these high-impact tests are often scheduled during maintenance windows outside normal business hours
SRE Operations: Apply SRE principles to manage SLIs, SLOs, and error budgets; develop automated anomaly detectors and self-healing scripts to reduce manual "toil"
Incident Response: Serve as a technical lead for production incidents, utilizing Dynatrace insights for rapid Root Cause Analysis (RCA) and documenting findings in ServiceNow
Qualifications
Required
2 to 4 years of experience in an SRE or DevOps role with a heavy emphasis on Dynatrace automation and CI/CD integration
Proven mastery of Terraform, CloudFormation, or AWS CDK for provisioning infrastructure and monitoring agents
Mid-level proficiency in Python or similar languages to develop self-service tools and automation hooks
Hands-on experience managing workloads in AWS and Azure using Docker, Kubernetes, or ECS
Solid understanding of Linux networking and configuration management via Ansible
Demonstrated ability and willingness to work outside standard business hours for overnight resiliency testing and on-call rotations
Strong ability to translate complex observability data into actionable technical documentation and knowledge articles