AI Platform Engineer - GPU Orchestration & Run:AI jobs in United States
This job has closed.

IS3 Solutions · 17 hours ago

AI Platform Engineer - GPU Orchestration & Run:AI

IS3 Solutions is seeking an experienced AI Platform Engineer specializing in Run:AI GPU orchestration and Kubernetes for AI/ML workloads. The role involves designing, operating, and optimizing enterprise GPU clusters for AI training, inference, and experimentation across multiple teams.
Cyber Security, Cloud Computing, Information Technology, Data Center, IT Infrastructure

Responsibilities

Design, deploy, and operate NVIDIA GPU clusters at scale
Administer GPU infrastructure across compute, networking, drivers, and storage
Ensure platform reliability, scalability, and performance
Partner with cloud, infrastructure, and AI teams to drive platform evolution
Serve as the Run:AI subject matter expert
Design and manage scheduling policies (queues, quotas, priorities, fairness)
Enable efficient multi-tenant GPU utilization
Monitor GPU usage, job performance, and platform efficiency
Operate and administer GPU-optimized Kubernetes clusters
Support containerized AI/ML workflows (training, batch, notebooks, inference)
Integrate NVIDIA drivers, operators, and AI tooling
Implement Kubernetes best practices for reliability and security
Manage the NVIDIA AI stack (CUDA, drivers, libraries)
Operate clusters using MIG, NVLink, and HPC interconnects
Troubleshoot GPU, driver, and workload performance issues
Collaborate across hardware, networking, and platform teams
Apply DevOps automation, CI/CD, and Infrastructure as Code
Develop runbooks, SOPs, and operational playbooks
Implement observability (logging, monitoring, alerting)
Lead incident response and root cause analysis
Train DevOps, Platform, and AI teams on GPU orchestration and Run:AI
Create onboarding and technical documentation
Act as a trusted advisor to AI/ML teams
Promote self-service, standardization, and operational maturity

Qualifications

Run:AI orchestration, NVIDIA GPU clusters, Kubernetes for AI/ML, Linux administration, DevOps automation, CI/CD pipelines, Networking fundamentals, Storage fundamentals, Multi-tenant support, Documentation experience

Required

7+ years in DevOps, SRE, Platform, or Infrastructure Engineering
Deep hands-on experience operating NVIDIA GPU clusters
Expert-level Run:AI scheduling and orchestration experience
Strong Kubernetes experience focused on AI/ML workloads
Proven runbook and operational documentation experience
Strong Linux, networking, and storage fundamentals
Experience with multi-tenant or shared platform support

Preferred

Experience with NVIDIA DGX, BasePOD, SuperPOD
Knowledge of MIG, NVLink, InfiniBand
Familiarity with AI/ML framework workloads
Hybrid/on-prem AI platform experience
Exposure to GPU cost optimization or capacity planning
SRE mindset including SLAs/SLOs

Company

IS3 Solutions

IS3 Solutions is an IT company that provides data centers, cloud, cyber security, IT infrastructure, and IT financing solutions.

Funding

Current Stage
Growth Stage

Leadership Team

John Marshall
CEO/Managing Partner
Company data provided by Crunchbase