AI Infrastructure Engineer – Slurm Platform jobs in United States

AMD · 3 months ago

AI Infrastructure Engineer – Slurm Platform

AMD is a leading company in the computing industry, focused on delivering innovative products that enhance next-generation computing experiences. The AI Infrastructure Engineer will be responsible for driving the delivery of software solutions for AI software development, managing Slurm-based GPU compute platforms, and collaborating with various teams to optimize performance and usability.
AI Infrastructure, Artificial Intelligence (AI), Cloud Computing, Computer, Embedded Systems, GPU, Hardware, Semiconductor
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Design, deploy, and operate Slurm clusters across on-prem and multi-cloud GPU environments (Azure, OCI, Vultr, DigitalOcean, etc.)
Integrate Slurm with the broader orchestration ecosystem, enabling hybrid scheduling, unified authentication, and telemetry pipelines
Build platform features that improve developer experience — e.g., job submission APIs, automated environment setup, and metrics dashboards
Optimize cluster utilization and scheduling for GPU and CPU workloads; develop fair-share, QoS, and preemption policies
Monitor cluster health and performance, implementing observability pipelines using Prometheus, Grafana, and custom exporters
Collaborate with internal developers (framework, compiler, and application teams) to understand workload needs and translate them into scalable Slurm features
Contribute to storage and network integration, ensuring performant I/O (e.g., NFS, Lustre, Weka) and high-speed interconnect configuration (InfiniBand, NVLink)
Support the job lifecycle — from image builds and environment modules to debugging and performance tuning of Slurm jobs

Qualifications

HPC cluster management, Slurm, Linux systems, Kubernetes integration, HPC storage technologies, Metrics visualization, CI/CD pipelines, Infrastructure automation, Python, Bash scripting, Machine learning workflows

Required

8+ years of experience managing and automating HPC or Slurm clusters in production environments
Deep understanding of Linux systems, job schedulers (Slurm), and resource management for GPU-accelerated workloads
Strong troubleshooting skills across compute, storage, and network layers
Proven ability to collaborate with developers and researchers to design scalable HPC solutions

Preferred

Experience integrating Slurm with Kubernetes or other control planes
Experience with HPC storage and I/O technologies (Lustre, ZFS, WekaFS, NFS)
Familiarity with metrics collection and visualization using Prometheus, Grafana, and Thanos
Exposure to CI/CD pipelines and DevOps practices for scientific or ML workloads
Understanding of machine learning workflows and frameworks (PyTorch, vLLM, SGLang)
Experience with infrastructure automation (e.g., Ansible, Terraform) and scripting (Python, Bash)
Bachelor's or Master's degree in related discipline preferred

Benefits

Benefits offered are described in AMD benefits at a glance.

Company

Advanced Micro Devices is a semiconductor company that designs and develops graphics units, processors, and media solutions.

H1B Sponsorship

AMD has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (836)
2024 (770)
2023 (551)
2022 (739)
2021 (519)
2020 (547)

Funding

Current Stage
Public Company
Total Funding
unknown
Key Investors
OpenAI, Daniel Loeb
2025-10-06: Post-IPO Equity
2023-03-02: Post-IPO Equity
2021-06-29: Post-IPO Equity

Leadership Team

Lisa Su, Chair & CEO
Mark Papermaster, CTO and EVP
Company data provided by crunchbase