Site Reliability Engineer jobs in United States

Programming.com · 5 hours ago

Site Reliability Engineer

Programming.com is seeking a Senior Site Reliability Engineer (SRE) with expertise in AWS and Kubernetes. The role involves designing and operating fault-tolerant systems, leading SRE practices, and providing incident response for production systems.

Consulting, Information Services, Information Technology, Software
Hiring Manager
Vishal Puri

Responsibilities

Design, build, and operate highly available, fault-tolerant systems supporting core banking, payments, and trading platforms
Lead SRE practices including SLIs, SLOs, error budgets, and reliability-driven engineering decisions
Provide L3/L4 incident response, root cause analysis (RCA), and post-incident remediation for production systems
Support and optimize Java-based microservices running on Kubernetes (EKS)
Implement and manage AWS-native services including EC2, EKS, RDS, DynamoDB, S3, IAM, and CloudWatch
Develop automation using Terraform for infrastructure provisioning and policy enforcement
Manage Kubernetes networking, storage, and service mesh integrations including Istio and Anthos Service Mesh
Implement advanced Kubernetes storage solutions using Portworx
Architect and maintain enterprise-grade CI/CD pipelines using GitLab CI/CD, Jenkins, and cloud-native tooling
Automate manual operational tasks using Python, Go, Bash, and infrastructure-as-code patterns
Implement monitoring, logging, and alerting using Prometheus, Datadog, Splunk, Kiali, and custom dashboards
Utilize eBPF for deep kernel-level observability and performance tuning
Support real-time data platforms using Kafka, KSQLDB, Kafka Streams, and Spark Streaming
Manage multi-cluster Kubernetes environments, including cluster federation
Optimize system performance, scalability, and latency under high transaction volumes
Enforce banking-grade security controls, IAM policies, secrets management, and least-privilege access
Support environments aligned with SOC 2, PCI-DSS, SOX, and internal banking security standards
Provide 24×7 operational support including rotational shifts, weekends, and on-call coverage

Qualifications

AWS Cloud, Kubernetes, Java, CI/CD tools, Docker, Terraform, Kafka, Monitoring tools, Linux/Unix, Python, Networking tools, Service Mesh, Virtualization, Regulatory compliance, Disaster recovery, Soft skills

Required

Java (JVM internals, GC tuning, microservices)
AWS Cloud: EKS, EC2, IAM, VPC, RDS, CloudWatch
Kubernetes with CKA/CKS-level depth
Docker and Terraform
CI/CD tools: GitLab CI/CD, Jenkins
Streaming platforms: Kafka, KSQLDB, Spark Streaming
Service Mesh: Istio, Anthos Service Mesh
Monitoring and Observability: Prometheus, Datadog, Splunk, Kiali
Linux/Unix systems and Bash scripting
Programming experience in Python or Go
Virtualization using VMware
Networking and performance tools including Nginx Controller, Seesaw, and eBPF
Experience supporting core banking systems, payment gateways, or trading platforms
Exposure to high-frequency transaction systems
Knowledge of regulatory audits and compliance controls
Experience with zero-downtime deployments and disaster recovery strategies
AWS Certified Solutions Architect – Professional or AWS DevOps Engineer – Professional
Certified Kubernetes Administrator (CKA) or Certified Kubernetes Security Specialist (CKS)

Company

Programming.com

Programming.com is a leading software development company, providing expertise in strategy, consulting, technology and IT operations.

Funding

Current Stage
Late Stage

Leadership Team

Shashank Munim
Managing Partner
Company data provided by Crunchbase