Intuitive · 1 week ago

Senior Site Reliability Engineer, AI/ML

Sunnyvale, CA

Full-time

Onsite

Senior Level

$179K/yr - $257K/yr

7+ years exp

Intuitive, a global leader in robotic-assisted surgery and minimally invasive care, is seeking a highly skilled Senior Site Reliability Engineer to join its Technical Operations team. The role leads reliability, scalability, and performance initiatives for AI/ML workloads across multi-cloud and on-prem environments, ensuring that the infrastructure supports advanced data science workflows.

Health Care · Manufacturing · Medical Device

Growth Opportunities

H1B Sponsor Likely

Responsibilities

Contribute to the deployment and maintenance of infrastructure across AWS, GCP, and Azure, as well as on-prem NVIDIA DGX systems

Implement and manage Infrastructure as Code (IaC) using Terraform and Ansible for automated provisioning and configuration

Support cloud and on-prem networking solutions for secure, high-performance connectivity

Manage and optimize Domino Data Lab workflows and Slurm clusters for distributed training and inference

Integrate and support NVIDIA Base Command for GPU-based compute environments

Develop automation scripts and tools in Python to streamline operations and improve reliability

Support CI/CD pipelines using GitLab, ensuring smooth deployments to UAT and production environments

Implement and maintain observability solutions (monitoring, logging, alerting) using tools like Prometheus, Grafana, and cloud-native services

Deploy and manage Kubernetes clusters (EKS, GKE) for scalable containerized workloads

Troubleshoot complex workflows and ensure high availability of critical systems

Collaborate with data science and engineering teams to optimize resource utilization and workflow efficiency

Drive best practices for incident response, capacity planning, and system reliability in multi-cloud and HPC environments

Administer and optimize ITSM platforms (e.g., Jira Service Management, ServiceNow) for release/change/incident workflows

Support tooling across CI/CD, monitoring, and ticketing systems to ensure traceability and automation

Maintain documentation and evidence for audits related to release/change/incident processes

Partner with Compliance and InfoSec teams to ensure controls meet HIPAA, HITRUST, FDA GxP, and ISO 27001 standards

Act as the primary liaison between engineering, product, support, and compliance teams for operational readiness

Facilitate regular status updates, incident reviews, RCAs, and change planning sessions with stakeholders

Support updates to onboarding materials and training sessions for engineers and product managers on release/change/incident protocols

Promote a culture of ownership and reliability through education and process transparency

Support retrospectives for major releases and incidents to identify process gaps and improvement opportunities

Track and report on KPIs such as change success rate, incident recurrence, and release velocity

Identify operational risks and escalate proactively to leadership

Maintain escalation matrices and ensure readiness for high-severity incidents

Qualifications

AWS · GCP · Terraform · Ansible · Kubernetes · Python · Domino Data Lab · Slurm · CI/CD · Prometheus · Grafana · NVIDIA Base Command · NFS · NetApp Data ONTAP · Linux systems · Communication skills · Collaboration skills

Required

5+ years of experience in Site Reliability Engineering or Cloud Infrastructure Engineering

Strong proficiency in AWS and GCP; working knowledge of Azure

Expertise in Terraform, Ansible, and IaC principles

Solid understanding of networking fundamentals, VPC design, and security best practices

Hands-on experience managing AI/ML workloads, including Domino Data Lab, Slurm, and GPU-based environments

Advanced scripting and automation skills in Python

Experience with CI/CD pipelines and release management using GitLab

Strong troubleshooting skills and experience with observability tools (Prometheus, Grafana, ELK)

Hands-on experience with Kubernetes in AWS (EKS) and GCP (GKE)

Proficiency with NFS and NetApp Data ONTAP

Strong Linux systems knowledge, including familiarity with file systems, kernel internals, cgroups, and environment variables

Experience using debugging tools and performing debugging and analysis for complex systems

Excellent communication and collaboration skills in cross-functional environments

Preferred

Familiarity with NVIDIA Base Command and GPU orchestration

Knowledge of container orchestration beyond Kubernetes (Docker, Helm)

Understanding of data security and compliance for AI/ML workloads

Exposure to MLOps best practices and ML lifecycle management

Master's degree or certifications in ITIL, DevOps, or regulatory compliance

7+ years in technical operations, SRE, or IT service management roles

Company

Intuitive

Intuitive designs and manufactures robotic-assisted surgical systems.

Founded in 1995

Sunnyvale, California, USA

10001+ employees

https://www.intuitive.com/

H1B Sponsorship

Intuitive has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data from the US Department of Labor.)

Trends of Total Sponsorships

2025 (339)

2024 (238)

2023 (181)

2022 (285)

2021 (145)

2020 (138)

Funding

Current Stage

Public Company

Total Funding

$5M

Key Investors

St. Cloud Capital

2003-04-30 · Post-IPO Equity

2000-06-23 · IPO

1996-01-01 · Seed · $5M

Leadership Team

Gillian Duncan

Senior Vice President, Professional Education & Program Services - Worldwide

Myriam Curet

Executive Vice President & Chief Medical Officer

Recent News

GlobeNewswire

Intuitive Announces Expanded Indications for da Vinci SP

2025-12-11

EIN Presswire

Trends and Analysis of the Artificial Intelligence-Powered Spinal Surgery Market by Application, with Forecasts 2029

2025-12-11

The Motley Fool

2 Healthcare Stocks for Beginner Investors With a 40-Year Time Horizon

2025-11-14

Company data provided by Crunchbase