Stratitech Services LLC · 8 hours ago
Senior Site Reliability Engineer
StratITech is hiring a Senior Site Reliability Engineer to help our client in San Francisco scale and harden a data-intensive platform powering machine learning, neural network workloads, and real-time analytics. The role involves building and maintaining scalable Linux-based infrastructure, improving system reliability, and ensuring ML workloads are production-ready.
Responsibilities
Build and maintain scalable Linux-based infrastructure supporting real-time analytics and ML workloads
Improve system reliability and performance through automation, observability, and proactive capacity planning
Own CI/CD pipelines, deployment automation, rollback strategies, and configuration management for production systems
Implement and operate monitoring, alerting, SLOs, runbooks, and incident response processes for critical services
Partner with engineering and data science teams to ensure ML workloads are production-ready and reliable by design
Ensure security, compliance, and operational readiness across infrastructure and deployment workflows
Lead post-incident reviews and drive measurable, long-term reliability improvements
Qualifications
Required
Deep experience operating Linux infrastructure, systems, and networking in production
Proven impact as an SRE or DevOps Engineer supporting complex, distributed systems
Practical understanding of machine learning systems and neural network workloads in production
Hands-on experience with Docker and Kubernetes
Strong Infrastructure-as-Code experience
Strong scripting skills (Bash and/or Python)
Experience with observability tools (Prometheus, Grafana, Datadog, ELK, OpenTelemetry)
CI/CD pipeline ownership experience (GitHub Actions, ArgoCD, or similar)
Ability to debug systemic failures across infrastructure, deployments, and workloads
Clear communicator who works effectively across engineering and data teams
Preferred
Experience supporting ML platforms at scale (training and inference)
AWS or cloud-managed services experience
Familiarity with data platforms such as Spark, Airflow, or Kafka
Experience operating in SOC 2 or regulated environments
Company
Stratitech Services LLC
About StratITech
At StratITech, we help businesses scale, innovate, and transform through technology.
Funding
Current Stage
Early Stage
Company data provided by Crunchbase