Datum Technologies Group · 2 days ago
Site Reliability Engineer (US Citizens and Green Card Holders Only, 100% Remote)
Datum Technologies Group is seeking a Site Reliability Engineer to design, implement, and manage cloud infrastructure on Azure. The role involves maintaining Kubernetes clusters, building CI/CD pipelines, and enhancing system reliability through monitoring and automation.
Responsibilities
Design, implement, and manage cloud infrastructure on Azure using Terraform and Terragrunt
Maintain and optimize Kubernetes clusters on Azure Kubernetes Service (AKS)
Build and manage CI/CD pipelines using GitHub Actions/Workflows and ArgoCD for GitOps deployments
Enhance system reliability by implementing monitoring, alerting, and observability solutions using Grafana
Automate operational tasks to reduce toil and improve team efficiency
Participate in on-call rotations, incident response, and post-mortem analysis
Collaborate with development teams to improve application performance, scalability, and resilience
Implement and advocate for SRE best practices, including SLIs, SLOs, and error budgets
Continuously improve system performance, cost efficiency, and security
Qualifications
Required
3+ years of experience in an SRE, DevOps, or cloud infrastructure role
Strong experience with Azure cloud services and infrastructure
Hands-on experience with Java, and with Terraform and Terragrunt for infrastructure-as-code
Proficiency with Kubernetes (preferably AKS) and Databricks
Experience with CI/CD tools, especially GitHub Workflows/Actions and ArgoCD
Solid understanding of observability tools such as Grafana (experience with Prometheus, Loki, or Tempo is a plus)
Bachelor's degree required
Preferred
Master's degree
Company
Datum Technologies Group
Datum Technologies Group provides technology solutions, managed services, government contracting, and IT staffing services.