Intuitive · 1 week ago
Senior Site Reliability Engineer, AI/ML
Intuitive is a global leader in robotic-assisted surgery and minimally invasive care, seeking a highly skilled Senior Site Reliability Engineer to join their Technical Operations team. This role focuses on leading reliability, scalability, and performance initiatives for AI/ML workloads across multi-cloud and on-prem environments, ensuring the infrastructure supports advanced data science workflows.
Health Care · Manufacturing · Medical Device
Responsibilities
Contribute to the deployment and maintenance of infrastructure across AWS, GCP, and Azure, as well as on-prem NVIDIA DGX systems
Implement and manage Infrastructure as Code (IaC) using Terraform and Ansible for automated provisioning and configuration
Support cloud and on-prem networking solutions for secure, high-performance connectivity
Manage and optimize Domino Data Lab workflows and Slurm clusters for distributed training and inference
Integrate and support NVIDIA Base Command for GPU-based compute environments
Develop automation scripts and tools in Python to streamline operations and improve reliability
Support CI/CD pipelines using GitLab, ensuring smooth deployments to UAT and production environments
Implement and maintain observability solutions (monitoring, logging, alerting) using tools like Prometheus, Grafana, and cloud-native services
Deploy and manage Kubernetes clusters (EKS, GKE) for scalable containerized workloads
Troubleshoot complex workflows and ensure high availability of critical systems
Collaborate with data science and engineering teams to optimize resource utilization and workflow efficiency
Drive best practices for incident response, capacity planning, and system reliability in multi-cloud and HPC environments
Administer and optimize ITSM platforms (e.g., Jira Service Management, ServiceNow) for release/change/incident workflows
Support tooling across CI/CD, monitoring, and ticketing systems to ensure traceability and automation
Maintain documentation and evidence for audits related to release/change/incident processes
Partner with Compliance and InfoSec teams to ensure controls meet HIPAA, HITRUST, FDA GxP, and ISO 27001 standards
Act as the primary liaison between engineering, product, support, and compliance teams for operational readiness
Facilitate regular status updates, incident reviews, RCAs, and change planning sessions with stakeholders
Help update onboarding materials and training sessions for engineers and product managers on release/change/incident protocols
Promote a culture of ownership and reliability through education and process transparency
Support retrospectives for major releases and incidents to identify process gaps and improvement opportunities
Track and report on KPIs such as change success rate, incident recurrence, and release velocity
Identify operational risks and escalate proactively to leadership
Maintain escalation matrices and ensure readiness for high-severity incidents
Qualifications
Required
5+ years of experience in Site Reliability Engineering or Cloud Infrastructure Engineering
Strong proficiency in AWS and GCP; working knowledge of Azure
Expertise in Terraform, Ansible, and IaC principles
Solid understanding of networking fundamentals, VPC design, and security best practices
Hands-on experience managing AI/ML workloads, including Domino Data Lab, Slurm, and GPU-based environments
Advanced scripting and automation skills in Python
Experience with CI/CD pipelines and release management using GitLab
Strong troubleshooting skills and experience with observability tools (Prometheus, Grafana, ELK)
Hands-on experience with Kubernetes in AWS (EKS) and GCP (GKE)
Proficiency with NFS and NetApp Data ONTAP
Strong Linux systems knowledge, including familiarity with file systems, kernel internals, cgroups, and environment variables
Experience using debugging tools to diagnose and analyze complex systems
Excellent communication and collaboration skills in cross-functional environments
Preferred
Familiarity with NVIDIA Base Command and GPU orchestration
Knowledge of container tooling beyond Kubernetes (Docker, Helm)
Understanding of data security and compliance for AI/ML workloads
Exposure to MLOps best practices and ML lifecycle management
Master's degree or certifications in ITIL, DevOps, or regulatory compliance
7+ years in technical operations, SRE, or IT service management roles
Company
Intuitive
Intuitive designs and manufactures robotic-assisted surgical systems.
H1B Sponsorship
Intuitive has a track record of offering H1B sponsorship, though this does not
guarantee sponsorship for this specific role. Additional information is presented
below for reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (339)
2024 (238)
2023 (181)
2022 (285)
2021 (145)
2020 (138)
Funding
Current Stage: Public Company
Total Funding: $5M
Key Investors: St. Cloud Capital
2003-04-30: Post-IPO Equity
2000-06-23: IPO
1996-01-01: Seed · $5M
Company data provided by Crunchbase