IS3 Solutions
AI Platform Engineer - GPU Orchestration & Run:AI
IS3 Solutions is seeking an experienced AI Platform Engineer specializing in Run:AI GPU orchestration and Kubernetes for AI/ML workloads. The role involves designing, operating, and optimizing enterprise GPU clusters for AI training, inference, and experimentation across multiple teams.
Cyber Security · Cloud Computing · Information Technology · Data Center · IT Infrastructure
Responsibilities
Design, deploy, and operate NVIDIA GPU clusters at scale
Administer GPU infrastructure across compute, networking, drivers, and storage
Ensure platform reliability, scalability, and performance
Partner with cloud, infrastructure, and AI teams to drive platform evolution
Serve as the Run:AI subject matter expert
Design and manage scheduling policies (queues, quotas, priorities, fairness)
Enable efficient multi-tenant GPU utilization
Monitor GPU usage, job performance, and platform efficiency
Operate and administer GPU-optimized Kubernetes clusters
Support containerized AI/ML workflows (training, batch, notebooks, inference)
Integrate NVIDIA drivers, operators, and AI tooling
Implement Kubernetes best practices for reliability and security
Manage the NVIDIA AI stack (CUDA, drivers, libraries)
Operate clusters using MIG, NVLink, and HPC interconnects
Troubleshoot GPU, driver, and workload performance issues
Collaborate across hardware, networking, and platform teams
Apply DevOps automation, CI/CD, and Infrastructure as Code
Develop runbooks, SOPs, and operational playbooks
Implement observability (logging, monitoring, alerting)
Lead incident response and root cause analysis
Train DevOps, Platform, and AI teams on GPU orchestration and Run:AI
Create onboarding and technical documentation
Act as a trusted advisor to AI/ML teams
Promote self-service, standardization, and operational maturity
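The scheduling duties above (queues, quotas, priorities, fairness) can be illustrated with a minimal, hypothetical sketch. The `Queue` class, `allocate_gpus` function, and the two-phase policy below are illustrative assumptions, not Run:AI's actual scheduling algorithm: guaranteed quotas are honored first, then spare capacity is shared out fairly.

```python
from dataclasses import dataclass

@dataclass
class Queue:
    """A tenant queue with a guaranteed GPU quota and pending demand (illustrative)."""
    name: str
    quota: int       # guaranteed GPUs for this tenant
    demand: int      # GPUs currently requested by pending jobs
    allocated: int = 0

def allocate_gpus(queues, total_gpus):
    """Two-phase allocation sketch: honor quotas first, then hand out
    spare GPUs one at a time to the least-served queue (fair sharing)."""
    # Phase 1: each queue gets min(quota, demand), up to remaining capacity.
    remaining = total_gpus
    for q in queues:
        q.allocated = min(q.quota, q.demand, remaining)
        remaining -= q.allocated
    # Phase 2: distribute spare GPUs to queues with unmet demand,
    # always favoring the queue holding the fewest GPUs so far.
    while remaining > 0:
        hungry = [q for q in queues if q.allocated < q.demand]
        if not hungry:
            break
        target = min(hungry, key=lambda q: q.allocated)
        target.allocated += 1
        remaining -= 1
    return {q.name: q.allocated for q in queues}
```

For example, with 10 GPUs and three queues (quotas 4/2/2, demands 6/2/5), every quota is met first and the two spare GPUs go to the most underserved queue.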
Qualifications
Required
7+ years in DevOps, SRE, Platform, or Infrastructure Engineering
Deep hands-on experience operating NVIDIA GPU clusters
Expert-level Run:AI scheduling and orchestration skills
Strong Kubernetes experience focused on AI/ML workloads
Proven runbook and operational documentation experience
Strong Linux, networking, and storage fundamentals
Experience with multi-tenant or shared platform support
Preferred
Experience with NVIDIA DGX, BasePOD, SuperPOD
Knowledge of MIG, NVLink, InfiniBand
Familiarity with AI/ML framework workloads
Hybrid/on-prem AI platform experience
Exposure to GPU cost optimization or capacity planning
SRE mindset, including defining and meeting SLAs/SLOs
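To give a concrete flavor of the MIG and Run:AI items above, here is a minimal sketch of a Kubernetes pod manifest requesting a MIG slice rather than a full GPU. It assumes the NVIDIA GPU Operator is installed with the mixed MIG strategy (which advertises slice resources such as `nvidia.com/mig-1g.5gb`) and that Run:AI's scheduler is deployed; the pod name, project label, and image tag are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference          # hypothetical workload name
  labels:
    project: team-a            # tenant/project label (illustrative)
spec:
  schedulerName: runai-scheduler   # route the pod through Run:AI's scheduler
  containers:
    - name: inference
      image: nvcr.io/nvidia/pytorch:24.01-py3   # example NGC image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1   # one MIG slice, not a whole GPU
```

Requesting a named MIG slice lets several isolated workloads share one physical GPU, which is central to the multi-tenant utilization goals this role describes.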