Galent · 12 hours ago
Site Reliability Engineer
Galent is seeking a Site Reliability Engineer (SRE) to ensure the reliability, scalability, and operational excellence of its container orchestration platforms. The role involves architecting and managing production-grade Kubernetes clusters, implementing SRE practices, and automating operational tasks.
Responsibilities
Architect, operate, and scale production-grade Kubernetes clusters (EKS / AKS / self-managed)
Own cluster reliability, availability, performance, and security across multiple environments
Design and implement Kubernetes networking, ingress, and traffic management (CNI, Ingress Controllers, Load Balancers)
Build and enforce Kubernetes SRE practices including SLIs, SLOs, error budgets, and reliability KPIs
Automate cluster provisioning, upgrades, backups, and disaster recovery using Terraform and scripting
Lead incident management and deep root cause analysis for Kubernetes and container-related failures
Implement and optimize observability for Kubernetes workloads using Prometheus, Grafana, and Splunk
Tune resource management, autoscaling, and capacity planning (HPA, VPA, Cluster Autoscaler)
Partner with application and platform teams to define deployment, scaling, and resiliency patterns
Drive adoption of GitOps and Kubernetes best practices across teams
Qualifications
Required
6+ years in SRE, DevOps, or Platform Engineering roles with strong production ownership
Expert-level Kubernetes experience, including:
Control plane components, scheduling, etcd, and API server behavior
Pod lifecycle, deployments, stateful workloads
Networking (CNI), DNS, ingress, and service meshes
RBAC, secrets management, and security contexts
Hands-on experience managing Kubernetes on AWS (EKS) or Azure (AKS)
Strong expertise in Terraform for Kubernetes and cloud infrastructure provisioning
Deep experience with Kubernetes observability: Prometheus (metrics, alerts), Grafana (dashboards), Splunk (centralized logging)
Strong Linux fundamentals and container runtime knowledge (containerd, Docker)
Proficiency in scripting/programming (Go, Python, Bash)
Experience with CI/CD and GitOps tools (ArgoCD, Flux, Helm)
Preferred
Kubernetes certifications: CKA, CKAD, CKS
Company
Galent
Galent is an AI-native digital engineering firm at the forefront of the AI revolution, dedicated to delivering unified, enterprise-ready AI solutions that transform businesses and industries.
Funding
Current Stage
Late Stage