DYNE · 1 day ago

Senior Java SRE

McLean, VA

Contract

Hybrid

Senior Level, Lead/Staff

15+ years exp

DYNE is seeking a Senior Java Site Reliability Engineer (SRE) to design, build, and operate highly resilient, low-latency, enterprise-scale systems supporting core banking, payments, and trading platforms. This role requires deep expertise in Java microservices, Kubernetes, and AWS cloud infrastructure, with a focus on ensuring reliability, scalability, and production excellence.

Information Technology & Services

Responsibilities

Design, implement, and operate highly available, fault-tolerant, and scalable systems for mission-critical financial platforms

Lead SRE practices including SLIs, SLOs, error budgets, and reliability-driven engineering decisions

Provide L3/L4 production support, including incident management, root cause analysis (RCA), and post-incident remediation

Drive continuous improvement through blameless postmortems and operational excellence initiatives

Support and optimize Java-based microservices, including JVM internals, GC tuning, and performance optimization

Operate and scale workloads on Kubernetes (EKS) across multi-cluster environments

Implement and manage AWS services including EC2, EKS, IAM, VPC, RDS, DynamoDB, S3, and CloudWatch

Design and maintain zero-downtime deployment strategies and robust disaster recovery (DR) architectures

Build and manage infrastructure using Terraform and infrastructure-as-code best practices

Automate operational workflows using Python, Go, Bash, and cloud-native tooling

Architect and maintain enterprise-grade CI/CD pipelines using GitLab CI/CD, Jenkins, and Kubernetes-native integrations

Manage Kubernetes networking, storage, and ingress using Nginx Controller, Seesaw, and advanced networking patterns

Implement and operate service mesh solutions including Istio and Anthos Service Mesh

Design and manage Kubernetes storage solutions using Portworx

Support multi-cluster Kubernetes environments, including federation and cross-cluster communication

Implement monitoring, logging, and alerting using Prometheus, Datadog, Splunk, Kiali, and custom dashboards

Utilize eBPF for deep kernel-level observability, performance analysis, and system tuning

Optimize latency, throughput, and scalability under high-frequency transaction loads

Support real-time data platforms using Kafka, Kafka Streams, ksqlDB, and Spark Streaming

Ensure reliability and performance of streaming pipelines in high-volume, low-latency environments

Enforce banking-grade security controls, IAM policies, secrets management, and least-privilege access

Support platforms aligned with SOC 2, PCI-DSS, SOX, and internal banking security standards

Participate in regulatory audits, risk assessments, and compliance reviews

Participate in 24×7 on-call rotations, including nights and weekends, supporting U.S. time zones

Act as a senior escalation point during major incidents and platform outages

Qualifications

Java, AWS, Kubernetes, Terraform, Kafka, CI/CD, Service Mesh, Linux/Unix, Python, Observability, Go, Bash, VMware, Nginx Controller, eBPF, Regulatory Compliance, Disaster Recovery, High Availability

Required

15+ years of experience

Deep expertise across Java microservices, Kubernetes, AWS cloud infrastructure, and SRE best practices

Hands-on responsibility for reliability, scalability, and production excellence in high-transaction environments

Ability to operate at the L3/L4 production support level

Experience leading reliability engineering initiatives

Ability to work closely with platform, application, and security teams to ensure zero-downtime, compliance-aligned operations