Stratitech Services LLC · 11 hours ago

Senior Site Reliability Engineer

San Francisco Bay Area

Contract

Onsite

Senior Level, Lead/Staff

$200K/yr - $240K/yr

StratITech is hiring a Staff Site Reliability Engineer to help our client in San Francisco scale and harden a data-intensive platform powering machine learning, neural network workloads, and real-time analytics. The role involves building and maintaining scalable Linux-based infrastructure, improving system reliability, and ensuring ML workloads are production-ready.

Information Technology & Services

No H1B

Hiring Manager

Rebecca McCartney

Responsibilities

Build and maintain scalable Linux-based infrastructure supporting real-time analytics and ML workloads

Improve system reliability and performance through automation, observability, and proactive capacity planning

Own CI/CD pipelines, deployment automation, rollback strategies, and configuration management for production systems

Implement and operate monitoring, alerting, SLOs, runbooks, and incident response processes for critical services

Partner with engineering and data science teams to ensure ML workloads are production-ready and reliable by design

Ensure security, compliance, and operational readiness across infrastructure and deployment workflows

Lead post-incident reviews and drive measurable, long-term reliability improvements

Qualifications

Linux infrastructure, Machine learning systems, Docker, Kubernetes, Infrastructure-as-Code, Observability tools, CI/CD pipeline ownership, Scripting skills, Clear communicator

Required

Deep experience operating Linux infrastructure, systems, and networking in production

Proven impact as an SRE or DevOps Engineer supporting complex, distributed systems

Practical understanding of machine learning systems and neural network workloads in production

Hands-on experience with Docker and Kubernetes

Strong Infrastructure-as-Code experience

Strong scripting skills (Bash and/or Python)

Experience with observability tools (Prometheus, Grafana, Datadog, ELK, OpenTelemetry)

CI/CD pipeline ownership experience (GitHub Actions, ArgoCD, or similar)

Ability to debug systemic failures across infrastructure, deployments, and workloads

Clear communicator who works effectively across engineering and data teams

Preferred

Experience supporting ML platforms at scale (training and inference)

AWS or cloud-managed services experience

Familiarity with data platforms such as Spark, Airflow, or Kafka

Experience operating in SOC 2 or regulated environments

Company

Stratitech Services LLC

About StratITech

At StratITech, we help businesses scale, innovate, and transform through technology.

Founded in 2021

Greater Silicon Valley, CA, US

2-10 employees

https://www.stratitechservices.com

Funding

Current Stage

Early Stage

Company data provided by Crunchbase