IS3 Solutions
AI Platform Engineer - GPU Orchestration & Run:AI
IS3 Solutions is seeking an experienced AI Platform Engineer specializing in Run:AI GPU orchestration and Kubernetes for AI/ML workloads. The role involves designing, operating, and optimizing enterprise GPU clusters for AI training, inference, and experimentation across multiple teams.
Cyber Security · Cloud Computing · Information Technology · Data Center · IT Infrastructure
Responsibilities
Design, deploy, and operate NVIDIA GPU clusters at scale
Administer GPU infrastructure across compute, networking, drivers, and storage
Ensure platform reliability, scalability, and performance
Partner with cloud, infrastructure, and AI teams to drive platform evolution
Serve as the Run:AI subject matter expert
Design and manage scheduling policies (queues, quotas, priorities, fairness)
Enable efficient multi-tenant GPU utilization
Monitor GPU usage, job performance, and platform efficiency
Operate and administer GPU-optimized Kubernetes clusters
Support containerized AI/ML workflows (training, batch, notebooks, inference)
Integrate NVIDIA drivers, operators, and AI tooling
Implement Kubernetes best practices for reliability and security
Manage the NVIDIA AI stack (CUDA, drivers, libraries)
Operate clusters using MIG, NVLink, and HPC interconnects
Troubleshoot GPU, driver, and workload performance issues
Collaborate across hardware, networking, and platform teams
Apply DevOps automation, CI/CD, and Infrastructure as Code
Develop runbooks, SOPs, and operational playbooks
Implement observability (logging, monitoring, alerting)
Lead incident response and root cause analysis
Train DevOps, Platform, and AI teams on GPU orchestration and Run:AI
Create onboarding and technical documentation
Act as a trusted advisor to AI/ML teams
Promote self-service, standardization, and operational maturity
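The scheduling duties above (queues, quotas, priorities, fairness) can be illustrated with a minimal, hypothetical sketch. The `Queue` class, `allocate_gpus` function, and the two-phase policy below are illustrative assumptions, not Run:AI's actual scheduling algorithm: guaranteed quotas are honored first, then spare capacity is shared out fairly.

```python
from dataclasses import dataclass

@dataclass
class Queue:
    """A tenant queue with a guaranteed GPU quota and pending demand (illustrative)."""
    name: str
    quota: int       # guaranteed GPUs for this tenant
    demand: int      # GPUs currently requested by pending jobs
    allocated: int = 0

def allocate_gpus(queues, total_gpus):
    """Two-phase allocation sketch: honor quotas first, then hand out
    spare GPUs one at a time to the least-served queue (fair sharing)."""
    # Phase 1: each queue gets min(quota, demand), up to remaining capacity.
    remaining = total_gpus
    for q in queues:
        q.allocated = min(q.quota, q.demand, remaining)
        remaining -= q.allocated
    # Phase 2: distribute spare GPUs to queues with unmet demand,
    # always favoring the queue holding the fewest GPUs so far.
    while remaining > 0:
        hungry = [q for q in queues if q.allocated < q.demand]
        if not hungry:
            break
        target = min(hungry, key=lambda q: q.allocated)
        target.allocated += 1
        remaining -= 1
    return {q.name: q.allocated for q in queues}
```

For example, with 10 GPUs and three queues (quotas 4/2/2, demands 6/2/5), every quota is met first and the two spare GPUs go to the most underserved queue.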
Qualifications
Required
7+ years in DevOps, SRE, Platform, or Infrastructure Engineering
Deep hands-on experience operating NVIDIA GPU clusters
Expert-level Run:AI scheduling and orchestration skills
Strong Kubernetes experience focused on AI/ML workloads
Proven runbook and operational documentation experience
Strong Linux, networking, and storage fundamentals
Experience with multi-tenant or shared platform support
Preferred
Experience with NVIDIA DGX, BasePOD, SuperPOD
Knowledge of MIG, NVLink, InfiniBand
Familiarity with AI/ML framework workloads
Hybrid/on-prem AI platform experience
Exposure to GPU cost optimization or capacity planning
SRE mindset, including defining and meeting SLAs/SLOs
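To give a concrete flavor of the MIG and Run:AI items above, here is a minimal sketch of a Kubernetes pod manifest requesting a MIG slice rather than a full GPU. It assumes the NVIDIA GPU Operator is installed with the mixed MIG strategy (which advertises slice resources such as `nvidia.com/mig-1g.5gb`) and that Run:AI's scheduler is deployed; the pod name, project label, and image tag are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference          # hypothetical workload name
  labels:
    project: team-a            # tenant/project label (illustrative)
spec:
  schedulerName: runai-scheduler   # route the pod through Run:AI's scheduler
  containers:
    - name: inference
      image: nvcr.io/nvidia/pytorch:24.01-py3   # example NGC image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1   # one MIG slice, not a whole GPU
```

Requesting a named MIG slice lets several isolated workloads share one physical GPU, which is central to the multi-tenant utilization goals this role describes.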