AI Infrastructure Engineer – Slurm Platform jobs in United States

Advanced Microdevices Pvt. Ltd. (India) · 2 months ago

AI Infrastructure Engineer – Slurm Platform

Advanced Micro Devices, Inc. (AMD) builds innovative products for next-generation computing experiences. The company is seeking an AI Infrastructure Engineer to drive the delivery of software solutions for AI software development and optimization, with a focus on managing a Slurm-based GPU compute platform.

Biopharma, Biotechnology, Industrial, Manufacturing

Responsibilities

Design, deploy, and operate Slurm clusters across on-prem and multi-cloud GPU environments (Azure, OCI, Vultr, DigitalOcean, etc.)
Integrate Slurm with the broader orchestration ecosystem, enabling hybrid scheduling, unified authentication, and telemetry pipelines
Build platform features that improve developer experience — e.g., job submission APIs, automated environment setup, and metrics dashboards
Optimize cluster utilization and scheduling for GPU and CPU workloads; develop fair-share, QoS, and preemption policies
Monitor cluster health and performance, implementing observability pipelines using Prometheus, Grafana, and custom exporters
Collaborate with internal developers (framework, compiler, and application teams) to understand workload needs and translate them into scalable Slurm features
Contribute to storage and network integration, ensuring performant I/O (e.g., NFS, Lustre, Weka) and high-speed interconnect configuration (InfiniBand, NVLink)
Support the job lifecycle — from image builds and environment modules to debugging and performance tuning of Slurm jobs

Qualifications

Slurm cluster management
Linux systems
GPU workloads
Kubernetes integration
HPC storage technologies
Metrics visualization
CI/CD pipelines
Infrastructure automation
Troubleshooting skills
Machine learning workflows
Scripting
Collaboration skills

Required

8+ years of experience managing and automating HPC or Slurm clusters in production environments
Deep understanding of Linux systems, job schedulers (Slurm), and resource management for GPU-accelerated workloads
Strong troubleshooting skills across compute, storage, and network layers
Proven ability to collaborate with developers and researchers to design scalable HPC solutions

Preferred

Experience integrating Slurm with Kubernetes or other control planes
Experience with HPC storage and I/O technologies (Lustre, ZFS, WekaFS, NFS)
Familiarity with metrics collection and visualization using Prometheus, Grafana, and Thanos
Exposure to CI/CD pipelines and DevOps practices for scientific or ML workloads
Understanding of machine learning workflows and frameworks (PyTorch, vLLM, SGLang)
Experience with infrastructure automation (e.g., Ansible, Terraform) and scripting (Python, Bash)
Bachelor's or Master's degree in a related discipline preferred

Benefits

AMD benefits at a glance.

Company

Advanced Microdevices Pvt. Ltd. (India)

Advanced Microdevices (mdi) is a leader in innovative membrane technologies.

Funding

Current Stage
Late Stage

Leadership Team

Nalini Kant Gupta
Founder & Managing Director
Company data provided by Crunchbase