Milestone Systems · 20 hours ago
Lead Site Reliability Engineer - Infrastructure
Milestone Systems is seeking a Lead Site Reliability Engineer (Infrastructure) to join their fast-moving VSaaS engineering organization. This role involves technical leadership and operational execution of the Infrastructure SRE team, ensuring the reliability, scalability, and operability of the platform and production systems while mentoring senior and staff engineers.
Events, Information Services, Software, Training, Video
Responsibilities
Operate and evolve large-scale distributed systems, anticipating failure modes and proactively mitigating risks across production environments, while owning day-to-day production operations, including monitoring, alert triage, incident response, post-incident analysis, and critical incident coordination and documentation
Lead the design, build, and implementation of automation, orchestration, and operational tooling to improve efficiency, reliability, and signal-to-noise ratio, and to reduce recurring issues and service-impacting events
Set technical direction and influence platform strategy by defining platform architecture, system design, and documentation to guide development, testing, deployment, and long-term maintenance of complex distributed systems
Establish and enforce standards, operational rigor, and best practices for deploying, monitoring, managing, and operating cloud-native and distributed infrastructure environments
Lead the adoption and execution of modern CI/CD, GitOps, and cloud-native infrastructure practices, ensuring reliable, scalable, and traceable software and infrastructure releases
Mentor and develop senior and staff engineers, reinforcing SRE principles, DevOps practices, accountability, and operational excellence across the Infrastructure SRE team
Collaborate closely with product and engineering stakeholders, advocating for an SRE mindset and system-level thinking to maximize reliability, performance, availability, security, and scalability across shared platforms and services
Qualifications
Required
10+ years of experience in site reliability engineering, infrastructure, or systems engineering, with deep ownership of large-scale production systems and demonstrated leadership of SRE or infrastructure teams, including setting technical direction and mentoring senior engineers
Strong hands-on experience designing and building automation and operational tooling using Golang and/or Python, with expert-level proficiency in Linux/Unix systems, shell scripting, and production troubleshooting
Advanced expertise in cloud-native and IaaS architectures, distributed systems, and container orchestration in production environments, including compliance, security, and network considerations
Expertise in architecting modular Terraform frameworks and Infrastructure-as-code (IaC) design patterns
Deep understanding of SRE and DevOps principles, including incident management, SLA/SLO ownership, automation, and reliability engineering practices, with experience leading incident response, post-incident analysis, and preventive improvements
Strong experience with CI/CD pipelines, GitOps workflows, release tooling, and modern cloud-native infrastructure practices, ensuring reliable and traceable software and infrastructure changes
Hands-on experience operating Docker and Kubernetes environments, observability platforms (logging, monitoring, alerting), and SQL/NoSQL databases (e.g., Postgres, MongoDB, Graph DB), including performance tuning and operational troubleshooting
Preferred
Subject matter expertise in Google Cloud; experience with other public cloud providers is also valuable
Demonstrated expertise in microservices lifecycle management, including integration, testing, deployment, and operational best practices, supported by advanced knowledge of software release tooling and CI/CD platforms such as GitLab, Jenkins, Cloud Build, ArgoCD, and Spinnaker
Deep understanding of the Docker and Kubernetes ecosystem, including orchestration, cluster management, and image lifecycle optimization
Strong experience with observability, logging, and monitoring tools such as ELK Stack, Prometheus, Stackdriver, Datadog, New Relic, or Dynatrace
Hands-on experience with algorithms, data structures, complexity analysis, and software/system design for large-scale distributed environments
Experience driving automation for operational efficiency, alert noise reduction, recurring issue mitigation, performance testing, capacity planning, and system optimization in production environments
Experience implementing security best practices and compliance considerations in infrastructure and platform design, along with the ability to influence cross-functional teams, evangelize SRE and DevOps practices, and foster a culture of reliability and operational excellence
Benefits
Medical/dental benefits
FSA or HSA
401k with 6% Safe Harbor employer match
Paid parental leave
Generous PTO (20 days of vacation, 10 days of paid sick time, and 12 company holidays)
Fully paid short-term disability policy
Fully paid long-term disability policy
Life Insurance
Company
Milestone Systems
Milestone Systems develops open platform IP video management software, delivering easy-to-manage surveillance solutions for enterprises.
H1B Sponsorship
Milestone Systems has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided for reference. (Data powered by the US Department of Labor)
Total sponsorships by year: 2025 (2)
Funding
Current Stage: Late Stage
Total Funding: $27.01M
Key Investors: Index Ventures
2014-06-12: Acquired
2014-02-01: Seed · $0.01M
2008-07-07: Series A · $27M
Company data provided by Crunchbase