AMD · 2 weeks ago

Senior Staff ML Infra Engineer - Distributed Systems

Santa Clara, CA

Full-time

Onsite

Senior Level, Lead/Staff

$192K/yr - $288K/yr

5+ years of experience

AMD is a company dedicated to building products that enhance next-generation computing experiences. They are seeking a Senior Staff AI Infra Engineer to lead technical initiatives and optimize performance for AI/ML workloads, particularly focusing on GPU-accelerated computing.

AI Infrastructure, Artificial Intelligence (AI), Cloud Computing, Computer, Embedded Systems, GPU, Hardware, Semiconductor

Growth Opportunities

H1B Sponsor Likely

Hiring Manager

Tressa Cooper (she/her)

Responsibilities

Lead technical initiatives and provide architectural guidance for AI/ML infrastructure and performance optimization

Optimize and accelerate LLM training and inference on AMD GPUs, improving kernel, communication, and end-to-end system efficiency

Develop and enhance infrastructure supporting LLMs, Agentic AI, and RAG systems

Design, build, and optimize AI workloads on GPU clusters, including large-scale training and inference orchestration, elastic scaling, and workload scheduling across heterogeneous hardware

Debug and resolve complex system-level performance issues across GPU, network, and runtime layers

Drive technical excellence, foster cross-team collaboration, and champion innovation within the organization

Qualifications

AI/ML infrastructure, C/C++, Python, Distributed systems, Transformer architectures, Kubernetes, GPU optimization, Technical ownership, Problem-solving skills, Communication

Required

5+ years of experience in AI/ML infrastructure, distributed systems, or performance-critical software development

Expert-level proficiency in C/C++ and Python

Solid understanding of transformer-based architectures and distributed training frameworks such as Megatron-LM, DeepSpeed, and PyTorch Distributed

Proven experience optimizing LLM training and inference pipelines, including TP/PP/DP/ZeRO parallelism, quantization, and mixed-precision techniques

Hands-on experience designing, building, and scaling training or inference platforms using Kubernetes, Ray, or Kubeflow

Familiarity with GPU architecture and distributed communication libraries (e.g., NCCL, RCCL, MPI), with the ability to analyze and optimize multi-GPU training performance

Experience with profiling and performance-analysis tools for GPU optimization and system-level debugging

Demonstrated technical ownership, strong communication, and problem-solving skills, with a proven record of delivering end-to-end AI/ML infrastructure solutions

Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or related field

Preferred

In-depth experience with the AMD ROCm ecosystem, including HIP kernel optimization for training and inference

Hands-on experience with model optimization techniques such as quantization, pruning, and distillation for efficient deployment

Knowledge of GPU architecture, memory hierarchy, and compiler-level optimization (e.g., kernel fusion, graph scheduling)

Familiarity with Agentic AI systems and autonomous AI workflows, including tool use, reasoning, and multi-agent orchestration for LLM-based applications

Benefits

AMD benefits at a glance.

Company

AMD

Glassdoor: 4.1

Advanced Micro Devices is a semiconductor company that designs and develops graphics units, processors, and media solutions.

Founded in 1969

Santa Clara, California, USA

10,001+ employees

http://www.amd.com

H1B Sponsorship

AMD has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)

Distribution of Different Job Fields Receiving Sponsorship (chart not shown; the highlighted field is the one similar to this job)

Trends of Total Sponsorships

2025 (836)

2024 (770)

2023 (551)

2022 (739)

2021 (519)

2020 (547)

Funding

Current Stage

Public Company

Total Funding

Unknown

Key Investors

OpenAI, Daniel Loeb

2025-10-06: Post-IPO Equity

2023-03-02: Post-IPO Equity

2021-06-29: Post-IPO Equity

Leadership Team

Lisa Su

Chair & CEO

Mark Papermaster

CTO and EVP

Recent News

Livemint.com

Physical AI dominates CES but humanity will still have to wait a while for humanoid servants

2026-01-09

GlobeNewswire

KunlunMeta Partners with AMD to Shine at CES

2026-01-09

The Register

AMD boasts 1000x higher AI perf by 2027 and pulls the lid off Helios compute tray ahead of 2H 2026 launch

2026-01-08

Company data provided by Crunchbase