Principal/Senior GPU Software Performance Engineer — Training at Scale jobs in United States

AMD · 1 week ago

Principal/Senior GPU Software Performance Engineer — Training at Scale

AMD builds products that accelerate next-generation computing experiences. The Principal/Senior GPU Software Performance Engineer will lead kernel-level performance engineering for multi-GPU training at scale, partnering with framework, library, and silicon teams to improve training efficiency.

AI Infrastructure · Artificial Intelligence (AI) · Cloud Computing · Computer · Embedded Systems · GPU · Hardware · Semiconductor
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Own kernel performance: Design, implement, and land high‑impact HIP/C++ kernels (e.g., attention, layernorm, softmax, GEMM/epilogues, fused pointwise) that are wave‑size portable and optimized for LDS, caches, and MFMA units
Lead profiling & tuning: Build repeatable workflows with timelines, hardware counters, and roofline analysis; remove memory bottlenecks; tune launch geometry/occupancy; validate speedups with A/B harnesses (a harness sketch follows this list)
Drive fusion & algorithmic improvements: Identify profitable fusions, tiling strategies, vectorized I/O, shared‑memory/scratchpad layouts, asynchronous pipelines, and warp/wave‑level collectives, while maintaining numerical stability (a fused‑softmax sketch follows this list)
Influence frameworks & libraries: Upstream or extend performance‑critical ops in PyTorch/JAX/XLA/Triton; evaluate and integrate vendor math libraries; guide compiler/codegen choices for target architectures
Scale beyond one GPU: Optimize P2P and collective comms, overlap compute/comm, and improve data/pipeline/tensor parallelism throughput across nodes
Benchmarking & SLOs: Define and own KPIs (throughput, time‑to‑train, $/step, energy/step); maintain dashboards, perf CI gates, and regression triage (the KPI arithmetic is sketched after this list)
Technical leadership: Mentor senior engineers, set coding/perf standards, lead performance “war rooms,” and partner with silicon/vendor teams on microarchitecture‑aware optimizations
Quality & reliability: Build reproducible perf harnesses, deterministic test modes, and documentation/playbooks so improvements persist release‑over‑release
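
For a concrete flavor of the "A/B harnesses" above, here is a minimal timing harness sketched in PyTorch (illustrative only; the softmax op, shapes, and the torch.compile stand-in for a hand-written kernel are assumptions, not part of the posting):

```python
import torch

def time_op(fn, *args, warmup=10, iters=100):
    """Median wall time of fn(*args) in milliseconds, via device events."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup):           # warm up caches, JIT, and clocks
        fn(*args)
    times = []
    for _ in range(iters):
        start.record()
        fn(*args)
        end.record()
        torch.cuda.synchronize()      # event timing is valid only after sync
        times.append(start.elapsed_time(end))
    times.sort()
    return times[len(times) // 2]

# A/B comparison: baseline vs. candidate implementation of the same op.
x = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
baseline = lambda t: torch.softmax(t, dim=-1)
candidate = torch.compile(baseline)   # placeholder for a hand-tuned kernel
print(f"baseline:  {time_op(baseline, x):.3f} ms")
print(f"candidate: {time_op(candidate, x):.3f} ms")
```

ROCm builds of PyTorch back the torch.cuda namespace with HIP, so the same harness runs on AMD GPUs.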
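
Likewise, the "profitable fusions" bullet is the kind of thing a short Triton kernel makes concrete: a one-pass row softmax that keeps the max/exp/sum/normalize chain in registers instead of materializing intermediates. A minimal sketch, assuming each row fits in one block; a production kernel would handle larger rows and numerics more carefully:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, n_cols, BLOCK: tl.constexpr):
    row = tl.program_id(0)                      # one program instance per row
    offs = tl.arange(0, BLOCK)
    mask = offs < n_cols
    x = tl.load(in_ptr + row * n_cols + offs, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)                   # subtract row max for stability
    num = tl.exp(x)
    den = tl.sum(num, axis=0)
    tl.store(out_ptr + row * n_cols + offs, num / den, mask=mask)

x = torch.randn(1024, 512, device="cuda")
y = torch.empty_like(x)
softmax_kernel[(x.shape[0],)](y, x, x.shape[1], BLOCK=triton.next_power_of_2(x.shape[1]))
torch.testing.assert_close(y, torch.softmax(x, dim=-1))
```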
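
And the KPI bullet's $/step and energy/step reduce to simple arithmetic over a measured step time; a sketch with purely hypothetical numbers:

```python
def train_kpis(step_time_s, tokens_per_step, node_cost_per_hr, nodes, avg_power_w):
    """Derive throughput, $/step, and energy/step from a measured step time."""
    throughput = tokens_per_step / step_time_s                      # tokens/s
    dollars_per_step = nodes * node_cost_per_hr * step_time_s / 3600.0
    energy_per_step_j = nodes * avg_power_w * step_time_s           # joules
    return throughput, dollars_per_step, energy_per_step_j

# All inputs below are made up for illustration.
tput, usd, joules = train_kpis(step_time_s=1.2, tokens_per_step=4_000_000,
                               node_cost_per_hr=32.0, nodes=16, avg_power_w=5600)
print(f"{tput:,.0f} tok/s | ${usd:.4f}/step | {joules / 1000:.1f} kJ/step")
```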

Qualifications

GPU kernel optimization · C++17+ · GPU programming model · Performance profiling · Distributed training · Linux fundamentals · Mixed precision · Compiler/IR knowledge · Cluster orchestration · Technical leadership · Mentoring · Documentation skills

Required

Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent

Preferred

Experience in systems/HPC/ML performance engineering, with hands‑on GPU kernel work and shipped optimizations in production training or HPC
Expert in modern C++ (C++17+) and at least one GPU programming model (CUDA, HIP, or SYCL/oneAPI) or a GPU kernel DSL (e.g., Triton); comfortable with templates, memory qualifiers, atomics, and warp/wave‑level collectives
Deep understanding of GPU microarchitecture: SIMT execution, occupancy vs. register/scratchpad pressure, memory hierarchy (global/L2/shared or LDS), coalescing, bank conflicts, vectorization, and instruction‑level parallelism
Proficiency with profiling & analysis: timelines and counters (e.g., Nsight Systems/Compute, rocprof/Omniperf, VTune/GPA or equivalents), ISA/disassembly inspection, and correlating metrics to code changes
Proven track record reducing time‑to‑train or $‑per‑step via kernel and collective‑comms optimizations on multi‑GPU clusters
Strong Linux fundamentals (perf/eBPF, NUMA, PCIe/links), build systems (CMake/Bazel), Python, and containerized dev (Docker/Podman)
Experience with distributed training (PyTorch DDP/FSDP/ZeRO/DeepSpeed or JAX) and GPU collectives
Expertise in mixed precision (BF16/FP16/FP8), numerics, and stability/accuracy validation at kernel boundaries (a validation sketch follows this list)
Background in compiler/IR (LLVM/MLIR) or codegen for GPU backends; ability to guide optimization passes with performance goals
Hands‑on with cluster orchestration (Slurm/Kubernetes), IB/RDMA tuning, and compute/communication overlap strategies (an overlap sketch follows this list)
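
To make the "accuracy validation at kernel boundaries" item concrete, a minimal PyTorch sketch that checks a BF16 execution against an FP32 reference with explicit tolerances (the op and tolerance values are assumptions for illustration):

```python
import torch

def check_kernel(op, x_fp32, rtol=1.6e-2, atol=1e-3):
    """Compare a BF16 run of `op` against its FP32 reference."""
    ref = op(x_fp32)                          # FP32 ground truth
    out = op(x_fp32.bfloat16()).float()       # low-precision path under test
    max_abs = (out - ref).abs().max().item()
    torch.testing.assert_close(out, ref, rtol=rtol, atol=atol)
    return max_abs

x = torch.randn(8192, 1024, device="cuda")
print("max abs err:", check_kernel(lambda t: torch.softmax(t, dim=-1), x))
```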
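
And compute/communication overlap, at its simplest, is an async collective whose wait is deferred past independent compute; a minimal torch.distributed sketch (process-group setup omitted; the function names are placeholders):

```python
import torch
import torch.distributed as dist

def overlapped_step(grad_bucket, independent_compute):
    """Start an async all-reduce, run unrelated compute, then wait on it."""
    handle = dist.all_reduce(grad_bucket, op=dist.ReduceOp.SUM, async_op=True)
    out = independent_compute()   # runs while the collective is in flight
    handle.wait()                 # gradients are summed across ranks here
    return out
```

DDP and FSDP apply this pattern per gradient bucket during backward, which is why bucket sizing and overlap tuning show up together in practice.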

Benefits

AMD benefits at a glance.

Company

Advanced Micro Devices is a semiconductor company that designs and develops graphics processing units, processors, and media solutions.

H1B Sponsorship

AMD has a track record of offering H1B sponsorship. Note that this does not guarantee sponsorship for this specific role; the information below is provided for reference. (Data powered by the US Department of Labor)
[Chart: distribution of job fields receiving sponsorship, highlighting fields similar to this job]
Trends of Total Sponsorships
2025: 836
2024: 770
2023: 551
2022: 739
2021: 519
2020: 547

Funding

Current Stage
Public Company
Total Funding
unknown
Key Investors
OpenAI, Daniel Loeb
2025-10-06: Post-IPO Equity
2023-03-02: Post-IPO Equity
2021-06-29: Post-IPO Equity

Leadership Team

Lisa Su, Chair & CEO
Mark Papermaster, CTO and EVP
Company data provided by Crunchbase.