Stratitech Services LLC · 11 hours ago

Senior Site Reliability Engineer

StratITech is hiring a Staff Site Reliability Engineer to help our client in San Francisco scale and harden a data-intensive platform powering machine learning, neural network workloads, and real-time analytics. The role involves building and maintaining scalable Linux-based infrastructure, improving system reliability, and ensuring ML workloads are production-ready.

Information Technology & Services
No H1B
Hiring Manager
Rebecca McCartney

Responsibilities

Build and maintain scalable Linux-based infrastructure supporting real-time analytics and ML workloads
Improve system reliability and performance through automation, observability, and proactive capacity planning
Own CI/CD pipelines, deployment automation, rollback strategies, and configuration management for production systems
Implement and operate monitoring, alerting, SLOs, runbooks, and incident response processes for critical services
Partner with engineering and data science teams to ensure ML workloads are production-ready and reliable by design
Ensure security, compliance, and operational readiness across infrastructure and deployment workflows
Lead post-incident reviews and drive measurable, long-term reliability improvements

Qualifications

Linux infrastructure, Machine learning systems, Docker, Kubernetes, Infrastructure-as-Code, Observability tools, CI/CD pipeline ownership, Scripting skills, Clear communicator

Required

Deep experience operating Linux infrastructure, systems, and networking in production
Proven impact as an SRE or DevOps Engineer supporting complex, distributed systems
Practical understanding of machine learning systems and neural network workloads in production
Hands-on experience with Docker and Kubernetes
Strong Infrastructure-as-Code experience
Strong scripting skills (Bash and/or Python)
Experience with observability tools (Prometheus, Grafana, Datadog, ELK, OpenTelemetry)
CI/CD pipeline ownership experience (GitHub Actions, ArgoCD, or similar)
Ability to debug systemic failures across infrastructure, deployments, and workloads
Clear communicator who works effectively across engineering and data teams

Preferred

Experience supporting ML platforms at scale (training and inference)
AWS or cloud-managed services experience
Familiarity with data platforms such as Spark, Airflow, or Kafka
Experience operating in SOC 2 or regulated environments

Company

Stratitech Services LLC

About StratITech

At StratITech, we help businesses scale, innovate, and transform through technology.

Funding

Current Stage
Early Stage
Company data provided by Crunchbase