Principal/Senior GPU Software Performance Engineer — Training at Scale jobs in United States

AMD · 1 week ago

Principal/Senior GPU Software Performance Engineer — Training at Scale

AMD builds products that accelerate next-generation computing experiences. The Principal/Senior GPU Software Performance Engineer will lead kernel-level performance engineering for multi-GPU training at scale, partnering with framework, library, and silicon teams to improve training efficiency.

AI Infrastructure · Artificial Intelligence (AI) · Cloud Computing · Computer · Embedded Systems · GPU · Hardware · Semiconductor
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Own kernel performance: Design, implement, and land high‑impact HIP/C++ kernels (e.g., attention, layernorm, softmax, GEMM/epilogues, fused pointwise) that are wave‑size portable and optimized for LDS, caches, and MFMA units
Lead profiling & tuning: Build repeatable workflows with timelines, hardware counters, and roofline analysis; remove memory bottlenecks; tune launch geometry/occupancy; validate speedups with A/B harnesses (a harness sketch follows this list)
Drive fusion & algorithmic improvements: Identify profitable fusions, tiling strategies, vectorized I/O, shared‑memory/scratchpad layouts, asynchronous pipelines, and warp/wave‑level collectives, while maintaining numerical stability (a fused‑softmax sketch follows this list)
Influence frameworks & libraries: Upstream or extend performance‑critical ops in PyTorch/JAX/XLA/Triton; evaluate and integrate vendor math libraries; guide compiler/codegen choices for target architectures
Scale beyond one GPU: Optimize P2P and collective comms, overlap compute/comm, and improve data/pipeline/tensor parallelism throughput across nodes
Benchmarking & SLOs: Define and own KPIs (throughput, time‑to‑train, $/step, energy/step); maintain dashboards, perf CI gates, and regression triage (the KPI arithmetic is sketched after this list)
Technical leadership: Mentor senior engineers, set coding/perf standards, lead performance “war rooms,” and partner with silicon/vendor teams on microarchitecture‑aware optimizations
Quality & reliability: Build reproducible perf harnesses, deterministic test modes, and documentation/playbooks so improvements persist release‑over‑release
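
For a concrete flavor of the "A/B harnesses" above, here is a minimal timing harness sketched in PyTorch (illustrative only; the softmax op, shapes, and the torch.compile stand-in for a hand-written kernel are assumptions, not part of the posting):

```python
import torch

def time_op(fn, *args, warmup=10, iters=100):
    """Median wall time of fn(*args) in milliseconds, via device events."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup):           # warm up caches, JIT, and clocks
        fn(*args)
    times = []
    for _ in range(iters):
        start.record()
        fn(*args)
        end.record()
        torch.cuda.synchronize()      # event timing is valid only after sync
        times.append(start.elapsed_time(end))
    times.sort()
    return times[len(times) // 2]

# A/B comparison: baseline vs. candidate implementation of the same op.
x = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
baseline = lambda t: torch.softmax(t, dim=-1)
candidate = torch.compile(baseline)   # placeholder for a hand-tuned kernel
print(f"baseline:  {time_op(baseline, x):.3f} ms")
print(f"candidate: {time_op(candidate, x):.3f} ms")
```

ROCm builds of PyTorch back the torch.cuda namespace with HIP, so the same harness runs on AMD GPUs.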
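
Likewise, the "profitable fusions" bullet is the kind of thing a short Triton kernel makes concrete: a one-pass row softmax that keeps the max/exp/sum/normalize chain in registers instead of materializing intermediates. A minimal sketch, assuming each row fits in one block; a production kernel would handle larger rows and numerics more carefully:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, n_cols, BLOCK: tl.constexpr):
    row = tl.program_id(0)                      # one program instance per row
    offs = tl.arange(0, BLOCK)
    mask = offs < n_cols
    x = tl.load(in_ptr + row * n_cols + offs, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)                   # subtract row max for stability
    num = tl.exp(x)
    den = tl.sum(num, axis=0)
    tl.store(out_ptr + row * n_cols + offs, num / den, mask=mask)

x = torch.randn(1024, 512, device="cuda")
y = torch.empty_like(x)
softmax_kernel[(x.shape[0],)](y, x, x.shape[1], BLOCK=triton.next_power_of_2(x.shape[1]))
torch.testing.assert_close(y, torch.softmax(x, dim=-1))
```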
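
And the KPI bullet's $/step and energy/step reduce to simple arithmetic over a measured step time; a sketch with purely hypothetical numbers:

```python
def train_kpis(step_time_s, tokens_per_step, node_cost_per_hr, nodes, avg_power_w):
    """Derive throughput, $/step, and energy/step from a measured step time."""
    throughput = tokens_per_step / step_time_s                      # tokens/s
    dollars_per_step = nodes * node_cost_per_hr * step_time_s / 3600.0
    energy_per_step_j = nodes * avg_power_w * step_time_s           # joules
    return throughput, dollars_per_step, energy_per_step_j

# All inputs below are made up for illustration.
tput, usd, joules = train_kpis(step_time_s=1.2, tokens_per_step=4_000_000,
                               node_cost_per_hr=32.0, nodes=16, avg_power_w=5600)
print(f"{tput:,.0f} tok/s | ${usd:.4f}/step | {joules / 1000:.1f} kJ/step")
```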

Qualifications

GPU kernel optimization · C++17+ · GPU programming model · Performance profiling · Distributed training · Linux fundamentals · Mixed precision · Compiler/IR knowledge · Cluster orchestration · Technical leadership · Mentoring · Documentation skills

Required

Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent

Preferred

Experience in systems/HPC/ML performance engineering, with hands‑on GPU kernel work and shipped optimizations in production training or HPC
Expert in modern C++ (C++17+) and at least one GPU programming model (CUDA, HIP, or SYCL/oneAPI) or a GPU kernel DSL (e.g., Triton); comfortable with templates, memory qualifiers, atomics, and warp/wave‑level collectives
Deep understanding of GPU microarchitecture: SIMT execution, occupancy vs. register/scratchpad pressure, memory hierarchy (global/L2/shared or LDS), coalescing, bank conflicts, vectorization, and instruction‑level parallelism
Proficiency with profiling & analysis: timelines and counters (e.g., Nsight Systems/Compute, rocprof/Omniperf, VTune/GPA or equivalents), ISA/disassembly inspection, and correlating metrics to code changes
Proven track record reducing time‑to‑train or $‑per‑step via kernel and collective‑comms optimizations on multi‑GPU clusters
Strong Linux fundamentals (perf/eBPF, NUMA, PCIe/links), build systems (CMake/Bazel), Python, and containerized dev (Docker/Podman)
Experience with distributed training (PyTorch DDP/FSDP/ZeRO/DeepSpeed or JAX) and GPU collectives
Expertise in mixed precision (BF16/FP16/FP8), numerics, and stability/accuracy validation at kernel boundaries (a validation sketch follows this list)
Background in compiler/IR (LLVM/MLIR) or codegen for GPU backends; ability to guide optimization passes with performance goals
Hands‑on with cluster orchestration (Slurm/Kubernetes), IB/RDMA tuning, and compute/communication overlap strategies (an overlap sketch follows this list)
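
To make the "accuracy validation at kernel boundaries" item concrete, a minimal PyTorch sketch that checks a BF16 execution against an FP32 reference with explicit tolerances (the op and tolerance values are assumptions for illustration):

```python
import torch

def check_kernel(op, x_fp32, rtol=1.6e-2, atol=1e-3):
    """Compare a BF16 run of `op` against its FP32 reference."""
    ref = op(x_fp32)                          # FP32 ground truth
    out = op(x_fp32.bfloat16()).float()       # low-precision path under test
    max_abs = (out - ref).abs().max().item()
    torch.testing.assert_close(out, ref, rtol=rtol, atol=atol)
    return max_abs

x = torch.randn(8192, 1024, device="cuda")
print("max abs err:", check_kernel(lambda t: torch.softmax(t, dim=-1), x))
```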
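
And compute/communication overlap, at its simplest, is an async collective whose wait is deferred past independent compute; a minimal torch.distributed sketch (process-group setup omitted; the function names are placeholders):

```python
import torch
import torch.distributed as dist

def overlapped_step(grad_bucket, independent_compute):
    """Start an async all-reduce, run unrelated compute, then wait on it."""
    handle = dist.all_reduce(grad_bucket, op=dist.ReduceOp.SUM, async_op=True)
    out = independent_compute()   # runs while the collective is in flight
    handle.wait()                 # gradients are summed across ranks here
    return out
```

DDP and FSDP apply this pattern per gradient bucket during backward, which is why bucket sizing and overlap tuning show up together in practice.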

Benefits

AMD benefits at a glance.

Company

Advanced Micro Devices is a semiconductor company that designs and develops graphics processing units, processors, and media solutions.

H1B Sponsorship

AMD has a track record of offering H1B sponsorship. Note that this does not guarantee sponsorship for this specific role; the information below is provided for reference. (Data powered by the US Department of Labor)
[Chart: distribution of job fields receiving sponsorship, highlighting fields similar to this job]
Trends of Total Sponsorships
2025: 836
2024: 770
2023: 551
2022: 739
2021: 519
2020: 547

Funding

Current Stage
Public Company
Total Funding
unknown
Key Investors
OpenAI, Daniel Loeb
2025-10-06: Post-IPO Equity
2023-03-02: Post-IPO Equity
2021-06-29: Post-IPO Equity

Leadership Team

Lisa Su, Chair & CEO
Mark Papermaster, CTO and EVP
Company data provided by Crunchbase.