VySystems · 19 hours ago

Site Reliability Engineer

New Jersey, United States

Contract

Onsite

Mid, Senior Level

VySystems is a company focused on Site Reliability Engineering. It is seeking a Site Reliability Engineer with deep expertise in distributed systems and production operations to improve its reliability and automation processes.

Apps, Consulting, Digital Marketing, Information Technology, Infrastructure, IT Infrastructure, IT Management, Web Development

Responsibilities

Experience in Site Reliability Engineering, Production Engineering, or equivalent roles

Deep expertise in distributed systems, resilience engineering, and large‑scale production operations

Strong proficiency with observability stacks: metrics, logs, traces, Splunk, ELK, New Relic, synthetic monitoring, and APM

Advanced experience with service‑level objectives (SLOs), SLIs, error budgets, and reliability governance

Expertise in Kubernetes, container orchestration, and workload reliability patterns

Strong skills in incident management, on‑call response, war‑room leadership, and RCA methodologies

Proven ability to engineer automation/self‑healing systems (auto‑remediation, failure‑mode detection)

Strong scripting/automation skills in Python, Bash, or similar languages

Solid understanding of traffic distribution, load balancing, session handling, and failure isolation

Expert debugging and performance troubleshooting across the full stack (network, compute, services)

Experience with AWS (EKS/ECS, SQS/SNS, S3, CloudFront, etc.)

Experience implementing AIOps, alert correlation, noise reduction, or automated RCA frameworks

Background in building paved paths, golden templates, or policy‑as‑code reliability gates

Experience with reverse proxy troubleshooting, including rate limits, affinity, and routing logic

Prior experience in high‑throughput government or regulated environments

Performance/load testing experience (designing tests, analyzing throughput, identifying bottlenecks)

Strong understanding of release reliability, risk recording, and continuous deployment safeguards

Familiarity with monitoring‑as‑code or dashboards‑as‑code practices

Hands‑on experience with infrastructure‑as‑code (Terraform preferred)

Qualification

Site Reliability Engineering, Distributed systems, Kubernetes, Incident management, Observability stacks, Automation/self-healing systems, AWS, Performance troubleshooting, Scripting/automation skills, Traffic distribution, Monitoring-as-code, Infrastructure-as-code

Required