AMD · 3 months ago

AI Infrastructure Engineer – Slurm Platform

San Jose, California

Full-time

Hybrid

Senior Level, Lead/Staff

$175K/yr - $300K/yr

8+ years exp

AMD is a leading company in the computing industry, focused on delivering innovative products that enhance next-generation computing experiences. The AI Infrastructure Engineer will drive the delivery of software solutions for AI software development, manage Slurm-based GPU compute platforms, and collaborate with framework, compiler, and application teams to optimize performance and usability.

AI Infrastructure, Artificial Intelligence (AI), Cloud Computing, Computer, Embedded Systems, GPU, Hardware, Semiconductor

Growth Opportunities

H1B Sponsor Likely

Responsibilities

Design, deploy, and operate Slurm clusters across on-prem and multi-cloud GPU environments (Azure, OCI, Vultr, DigitalOcean, etc.)

Integrate Slurm with the broader orchestration ecosystem, enabling hybrid scheduling, unified authentication, and telemetry pipelines

Build platform features that improve developer experience — e.g., job submission APIs, automated environment setup, and metrics dashboards

Optimize cluster utilization and scheduling for GPU and CPU workloads; develop fair-share, QoS, and preemption policies

Monitor cluster health and performance, implementing observability pipelines using Prometheus, Grafana, and custom exporters

Collaborate with internal developers (framework, compiler, and application teams) to understand workload needs and translate them into scalable Slurm features

Contribute to storage and network integration, ensuring performant I/O (e.g., NFS, Lustre, Weka) and high-speed interconnect configuration (InfiniBand, NVLink)

Support the job lifecycle — from image builds and environment modules to debugging and performance tuning of Slurm jobs

Qualifications

HPC cluster management, Slurm, Linux systems, Kubernetes integration, HPC storage technologies, Metrics visualization, CI/CD pipelines, Infrastructure automation, Python, Bash scripting, Machine learning workflows

Required

8+ years of experience managing and automating HPC or Slurm clusters in production environments

Deep understanding of Linux systems, job schedulers (Slurm), and resource management for GPU-accelerated workloads

Strong troubleshooting skills across compute, storage, and network layers

Proven ability to collaborate with developers and researchers to design scalable HPC solutions

Preferred

Experience integrating Slurm with Kubernetes or other control planes

Experience with HPC storage and I/O technologies (Lustre, ZFS, WekaFS, NFS)

Familiarity with metrics collection and visualization using Prometheus, Grafana, and Thanos

Exposure to CI/CD pipelines and DevOps practices for scientific or ML workloads

Understanding of machine learning workflows and frameworks (PyTorch, vLLM, SGLang)

Experience with infrastructure automation (e.g., Ansible, Terraform) and scripting (Python, Bash)

Bachelor's or Master's degree in related discipline preferred

Benefits

Benefits offered are described in "AMD benefits at a glance."

Company

AMD

Glassdoor: 4.1

Advanced Micro Devices is a semiconductor company that designs and develops graphics units, processors, and media solutions.

Founded in 1969

Santa Clara, California, USA

10001+ employees

http://www.amd.com

H1B Sponsorship

AMD has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference. (Data powered by the US Department of Labor)

Trends of Total Sponsorships

2025 (836)

2024 (770)

2023 (551)

2022 (739)

2021 (519)

2020 (547)

Funding

Current Stage

Public Company

Total Funding

unknown

Key Investors

OpenAI, Daniel Loeb

2025-10-06: Post-IPO Equity

2023-03-02: Post-IPO Equity

2021-06-29: Post-IPO Equity

Leadership Team

Lisa Su

Chair & CEO

Mark Papermaster

CTO and EVP

Recent News

KitGuru.net

AMD confirms Microsoft’s next-gen Xbox for 2027

2026-02-06

The Next Platform

AMD Finally Makes More Money On GPUs Than CPUs In A Quarter

2026-02-06

semafor.com

Alphabet to double infrastructure spending as it bets on AI

2026-02-06

Company data provided by Crunchbase