Advanced Microdevices Pvt. Ltd. (India) · 2 months ago

AI Infrastructure Engineer – Slurm Platform

San Jose, CA

Full-time

Hybrid

Senior Level, Lead/Staff

8+ years exp

Advanced Micro Devices, Inc. builds products for next-generation computing experiences. The company is seeking an AI Infrastructure Engineer to drive the delivery of software solutions for AI software development and optimization, with a focus on managing a Slurm-based GPU compute platform.

Biopharma · Biotechnology · Industrial · Manufacturing

Responsibilities

Design, deploy, and operate Slurm clusters across on-prem and multi-cloud GPU environments (Azure, OCI, Vultr, DigitalOcean, etc.)

Integrate Slurm with the broader orchestration ecosystem, enabling hybrid scheduling, unified authentication, and telemetry pipelines

Build platform features that improve developer experience — e.g., job submission APIs, automated environment setup, and metrics dashboards

Optimize cluster utilization and scheduling for GPU and CPU workloads; develop fair-share, QoS, and preemption policies

Monitor cluster health and performance, implementing observability pipelines using Prometheus, Grafana, and custom exporters

Collaborate with internal developers (framework, compiler, and application teams) to understand workload needs and translate them into scalable Slurm features

Contribute to storage and network integration, ensuring performant I/O (e.g., NFS, Lustre, Weka) and high-speed interconnect configuration (InfiniBand, NVLink)

Support the job lifecycle — from image builds and environment modules to debugging and performance tuning of Slurm jobs

Qualifications

Slurm cluster management

Linux systems

GPU workloads

Kubernetes integration

HPC storage technologies

Metrics visualization

CI/CD pipelines

Infrastructure automation

Troubleshooting skills

Machine learning workflows

Scripting

Collaboration skills