Bayside Solutions · 1 day ago

SRE/DevOps

United States

Contract

Remote

Mid, Senior Level

$114K/yr - $135K/yr

5+ years exp

Bayside Solutions is preparing for a global-scale service rollout and is seeking a highly technical Site Reliability Engineer (SRE) / DevOps Engineer. The role involves owning infrastructure, tooling, and incident response for a mission-critical launch across multiple regions.

Information Technology · Staffing Agency · Telecommunications · Virtual Reality

Growth Opportunities

No H1B

Responsibilities

Global Rollout Execution

Lead deployment and operations in US East / West; extend to 5+ regions globally

Maintain consistent environment setup and infrastructure standards across regions

Core Systems Ownership

Operate and scale Amazon RDS, EKS clusters, and Ray distributed compute environments

Optimize cluster performance, scheduling, and workload orchestration

Operational Reliability

Serve as primary SRE for launch; manage incident triage, escalation, and postmortems

Implement and enforce vac-pairing schedules for global on-call coverage

Utilize AWS Systems Manager Incident Manager for automation and resolution

Tooling & Observability

Build and manage monitoring dashboards with Grafana; integrate logs and traces in Splunk

Develop automation and alerting pipelines for proactive incident detection

Ensure instrumentation is consistent across multi-region services

High-Severity Incident Management

Respond to Sev-0/Sev-1 incidents with immediate root-cause analysis

Conduct resilience testing, chaos drills, and failover validation

Protect our brand by meeting zero-downtime objectives during launch phases

Qualifications

AWS Expertise · Kubernetes / EKS · Incident Response & Ops · Observability & Monitoring · Infrastructure as Code · Programming & Automation · Machine Learning Infrastructure · Caching & Distributed Storage · Data Lake & Governance · Chaos Engineering · Security & Compliance · Operational Rigor

Required

AWS Expertise - Deep hands-on experience with Amazon RDS, EKS, IAM, VPC; proven track record in multi-region deployments, HA/DR, and failover strategies at enterprise scale

Kubernetes / EKS - 5+ years operating and scaling multi-cluster environments; advanced debugging and tuning at the cluster, pod, and network layers; Helm proficiency

Incident Response & Ops - Expert in Sev-0/1 incident triage and recovery; experience with at least one of PagerDuty, OpsGenie, or AWS Incident Manager; and large-scale runbook automation

Observability & Monitoring - Strong in Grafana, Splunk, Prometheus, and tracing systems; ability to design end-to-end observability pipelines across global workloads

Infrastructure as Code (IaC) - Production-grade Terraform, Crossplane, AWS CDK, or Ansible; ability to enforce parity across multiple AWS regions

Programming & Automation - Intermediate-level experience in Python, Go, or Bash for scripting, automation, and tooling development

Operational Rigor - Demonstrated ability to thrive in high-pressure, high-visibility environments; experience supporting global-scale product launches with strict zero-downtime objectives

Preferred

Machine Learning Infrastructure - Familiarity with Amazon SageMaker (deployment, monitoring), feature stores, and ML pipeline operations

Caching & Distributed Storage - Experience with Redis/ElastiCache and caching strategies for large-scale, high-throughput systems

Data Lake & Governance - Hands-on experience with AWS Lake Formation, Glue, or similar tools for secure, governed multi-region data access

Distributed Systems (Ray or equivalent) - Workload profiling, scheduling, and distributed compute optimization

Chaos Engineering - Background in resilience testing, chaos drills, and automated failover validation

Security & Compliance - Knowledge of multi-region security, compliance, and data protection frameworks for enterprise cloud workloads

AI/ML Ops - Experience operationalizing ML in production: monitoring drift, scaling inference endpoints, and integrating ML workloads into SRE practices