Bayside Solutions · 1 day ago
SRE/DevOps
Bayside Solutions is preparing for a global-scale service rollout and is seeking a highly technical Site Reliability Engineer (SRE) / DevOps Engineer. The role involves owning infrastructure, tooling, and incident response for a mission-critical launch across multiple regions.
Information Technology, Staffing Agency, Telecommunications, Virtual Reality
Responsibilities
Global Rollout Execution
Lead deployment and operations in US East / West; extend to 5+ regions globally
Maintain consistent environment setup and infrastructure standards across regions
Core Systems Ownership
Operate and scale Amazon RDS, EKS clusters, and Ray distributed compute environments
Optimize cluster performance, scheduling, and workload orchestration
Operational Reliability
Serve as primary SRE for launch; manage incident triage, escalation, and postmortems
Implement and enforce vac-pairing schedules for global on-call coverage
Utilize AWS Systems Manager Incident Manager for automation and resolution
Tooling & Observability
Build and manage monitoring dashboards with Grafana; integrate logs and traces in Splunk
Develop automation and alerting pipelines for proactive incident detection
Ensure instrumentation is consistent across multi-region services
High-Severity Incident Management
Respond to Sev-0/Sev-1 incidents with immediate root-cause analysis
Conduct resilience testing, chaos drills, and failover validation
Protect our brand by meeting zero-downtime objectives during launch phases
Qualifications
Required
AWS Expertise - Deep hands-on experience with Amazon RDS, EKS, IAM, VPC; proven track record in multi-region deployments, HA/DR, and failover strategies at enterprise scale
Kubernetes / EKS - 5+ years operating and scaling multi-cluster environments; advanced debugging and tuning at the cluster, pod, and network layers; Helm proficiency
Incident Response & Ops - Expert in Sev-0/1 incident triage and recovery; experience with at least one of PagerDuty, OpsGenie, or AWS Incident Manager; large-scale runbook automation
Observability & Monitoring - Strong in Grafana, Splunk, Prometheus, and tracing systems; ability to design end-to-end observability pipelines across global workloads
Infrastructure as Code (IaC) - Production-grade Terraform, Crossplane, AWS CDK, or Ansible; ability to enforce parity across multiple AWS regions
Programming & Automation - Intermediate-level experience in Python, Go, or Bash for scripting, automation, and tooling development
Operational Rigor - Demonstrated ability to thrive in high-pressure, high-visibility environments; experience supporting global-scale product launches with strict zero-downtime objectives
Preferred
Machine Learning Infrastructure - Familiarity with Amazon SageMaker (deployment, monitoring), feature stores, and ML pipeline operations
Caching & Distributed Storage - Experience with Redis/ElastiCache and caching strategies for large-scale, high-throughput systems
Data Lake & Governance - Hands-on experience with AWS Lake Formation, Glue, or similar tools for secure, governed multi-region data access
Distributed Systems (Ray or equivalent) - Workload profiling, scheduling, and distributed compute optimization
Chaos Engineering - Background in resilience testing, chaos drills, and automated failover validation
Security & Compliance - Knowledge of multi-region security, compliance, and data protection frameworks for enterprise cloud workloads
AI/ML Ops - Experience operationalizing ML in production: monitoring drift, scaling inference endpoints, and integrating ML workloads into SRE practices