Site Reliability Engineer jobs in United States

Galent · 12 hours ago

Site Reliability Engineer

Galent is seeking a Site Reliability Engineer (SRE) to ensure the reliability, scalability, and operational excellence of its container orchestration platforms. The role involves architecting and managing production-grade Kubernetes clusters, implementing SRE practices, and automating cluster provisioning, upgrades, backups, and disaster recovery.

Computer Software
Hiring Manager
Abhinav Ravikumar

Responsibilities

Architect, operate, and scale production-grade Kubernetes clusters (EKS / AKS / self-managed)
Own cluster reliability, availability, performance, and security across multiple environments
Design and implement Kubernetes networking, ingress, and traffic management (CNI, Ingress Controllers, Load Balancers)
Build and enforce Kubernetes SRE practices including SLIs, SLOs, error budgets, and reliability KPIs
Automate cluster provisioning, upgrades, backups, and disaster recovery using Terraform and scripting
Lead incident management and deep root cause analysis for Kubernetes and container-related failures
Implement and optimize observability for Kubernetes workloads using Prometheus, Grafana, and Splunk
Tune resource management, autoscaling, and capacity planning (HPA, VPA, Cluster Autoscaler)
Partner with application and platform teams to define deployment, scaling, and resiliency patterns
Drive adoption of GitOps and Kubernetes best practices across teams

Qualifications

Kubernetes, AWS, Terraform, Kubernetes observability, Linux fundamentals, Scripting, CI/CD tools, Kubernetes certifications

Required

6+ years in SRE, DevOps, or Platform Engineering roles with strong production ownership
Expert-level Kubernetes experience, including: control plane components, scheduling, etcd, and API server behavior; pod lifecycle, deployments, and stateful workloads; networking (CNI), DNS, ingress, and service meshes; RBAC, secrets management, and security contexts
Hands-on experience managing Kubernetes on AWS (EKS) or Azure (AKS)
Strong expertise in Terraform for Kubernetes and cloud infrastructure provisioning
Deep experience with Kubernetes observability: Prometheus (metrics, alerts), Grafana (dashboards), Splunk (centralized logging)
Strong Linux fundamentals and container runtime knowledge (containerd, Docker)
Proficiency in scripting/programming (Go, Python, Bash)
Experience with CI/CD and GitOps tools (ArgoCD, Flux, Helm)

Preferred

Kubernetes certifications: CKA, CKAD, CKS

Company

Galent

Galent is an AI-native digital engineering firm at the forefront of the AI revolution, dedicated to delivering unified, enterprise-ready AI solutions that transform businesses and industries.

Funding

Current Stage
Late Stage
Company data provided by Crunchbase