AMD · 2 months ago
Principal/Senior GPU Software Performance Engineer — Training at Scale
AMD is a company dedicated to building products that power next-generation computing experiences. The Principal/Senior GPU Software Performance Engineer will lead kernel-level performance engineering to optimize training across multi-GPU clusters, collaborating with researchers and framework teams to improve training efficiency.
AI Infrastructure · Artificial Intelligence (AI) · Cloud Computing · Computer · Embedded Systems · GPU · Hardware · Semiconductor
Responsibilities
Own kernel performance: Design, implement, and land high-impact HIP/C++ kernels (e.g., attention, layernorm, softmax, GEMM/epilogues, fused pointwise) that are wave-size portable and optimized for LDS, caches, and MFMA units
Lead profiling & tuning: Build repeatable workflows with timelines, hardware counters, and roofline analysis; remove memory bottlenecks; tune launch geometry/occupancy; validate speedups with A/B harnesses
Drive fusion & algorithmic improvements: Identify profitable fusions, tiling strategies, vectorized I/O, shared-memory/scratchpad layouts, asynchronous pipelines, and warp/wave-level collectives—while maintaining numerical stability
Influence frameworks & libraries: Upstream or extend performance-critical ops in PyTorch/JAX/XLA/Triton; evaluate and integrate vendor math libraries; guide compiler/codegen choices for target architectures
Scale beyond one GPU: Optimize P2P and collective comms, overlap compute/comm, and improve data/pipeline/tensor parallelism throughput across nodes
Benchmarking & SLOs: Define and own KPIs (throughput, time-to-train, $/step, energy/step); maintain dashboards, perf CI gates, and regression triage
Technical leadership: Mentor senior engineers, set coding/perf standards, lead performance “war rooms,” and partner with silicon/vendor teams on microarchitecture-aware optimizations
Quality & reliability: Build reproducible perf harnesses, deterministic test modes, and documentation/playbooks so improvements persist release-over-release
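The roofline analysis called out in the profiling bullet above reduces to comparing a kernel's arithmetic intensity against the machine balance point. A minimal sketch, where the peak compute and bandwidth figures are illustrative placeholders rather than specs for any particular GPU:

```python
# Hedged sketch: classify a kernel as memory- or compute-bound with a
# simple roofline model. Peak numbers below are hypothetical placeholders.

PEAK_TFLOPS = 100.0   # hypothetical peak compute throughput, TFLOP/s
PEAK_BW_TBS = 2.0     # hypothetical peak memory bandwidth, TB/s
MACHINE_BALANCE = PEAK_TFLOPS / PEAK_BW_TBS  # FLOP/byte at the ridge point

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs executed per byte moved to/from global memory."""
    return flops / bytes_moved

def attainable_tflops(ai: float) -> float:
    """Roofline: attainable throughput = min(compute roof, bandwidth * AI)."""
    return min(PEAK_TFLOPS, PEAK_BW_TBS * ai)

# Example: an elementwise add on N fp32 values does N FLOPs and moves
# 12 bytes per element (two loads + one store), so AI = 1/12 FLOP/byte.
N = 1 << 20
ai = arithmetic_intensity(N, 12 * N)
print(f"AI = {ai:.3f} FLOP/byte, attainable = {attainable_tflops(ai):.3f} TFLOP/s")
print("memory-bound" if ai < MACHINE_BALANCE else "compute-bound")
```

Kernels that land far below the balance point are candidates for the fusion and vectorized-I/O work described above, since fusing ops raises FLOPs per byte moved.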
Qualifications
Required
Experience in systems/HPC/ML performance engineering, with hands-on GPU kernel work and shipped optimizations in production training or HPC
Expert in modern C++ (C++17+) and at least one GPU programming model (CUDA, HIP, or SYCL/oneAPI) or a GPU kernel DSL (e.g., Triton); comfortable with templates, memory qualifiers, atomics, and warp/wave-level collectives
Deep understanding of GPU microarchitecture: SIMT execution, occupancy vs. register/scratchpad pressure, memory hierarchy (global/L2/shared or LDS), coalescing, bank conflicts, vectorization, and instruction-level parallelism
Proficiency with profiling & analysis: timelines and counters (e.g., Nsight Systems/Compute, rocprof/Omniperf, VTune/GPA or equivalents), ISA/disassembly inspection, and correlating metrics to code changes
Proven track record reducing time-to-train or $-per-step via kernel and collective-comms optimizations on multi-GPU clusters
Strong Linux fundamentals (perf/eBPF, NUMA, PCIe/links), build systems (CMake/Bazel), Python, and containerized dev (Docker/Podman)
Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent
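The $-per-step and energy-per-step metrics referenced above are simple arithmetic over step time, cluster size, price, and power; the figures below are hypothetical, chosen only to show how a kernel-level speedup flows directly into both KPIs:

```python
# Hedged sketch: cost and energy KPIs from hypothetical cluster numbers.

def dollars_per_step(step_time_s: float, num_gpus: int,
                     dollars_per_gpu_hour: float) -> float:
    """Cost of one training step across the whole cluster."""
    return step_time_s / 3600.0 * num_gpus * dollars_per_gpu_hour

def energy_per_step_joules(step_time_s: float, num_gpus: int,
                           avg_watts_per_gpu: float) -> float:
    """Energy of one training step (joules = watts * seconds)."""
    return step_time_s * num_gpus * avg_watts_per_gpu

# A 10% step-time reduction shows up as a 10% cut in both KPIs.
base = dollars_per_step(1.0, 512, 2.0)  # 1 s/step, 512 GPUs, $2/GPU-hour
fast = dollars_per_step(0.9, 512, 2.0)
print(f"${base:.4f}/step -> ${fast:.4f}/step")
```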
Preferred
Experience with distributed training (PyTorch DDP/FSDP/ZeRO/DeepSpeed or JAX) and GPU collectives
Expertise in mixed precision (BF16/FP16/FP8), numerics, and stability/accuracy validation at kernel boundaries
Background in compiler/IR (LLVM/MLIR) or codegen for GPU backends; ability to guide optimization passes with performance goals
Hands-on with cluster orchestration (Slurm/Kubernetes), IB/RDMA tuning, and compute/communication overlap strategies
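The numerics validation at kernel boundaries mentioned above typically means comparing a low-precision result against a high-precision reference. A minimal sketch using float16 as a stand-in (NumPy has no native BF16/FP8), with a hypothetical `rel_error` helper:

```python
import numpy as np

# Hedged sketch: compare low-precision accumulation against an fp64
# reference, the kind of check run at a kernel boundary.
rng = np.random.default_rng(0)
x = rng.random(1 << 16).astype(np.float16)   # positive fp16 inputs

ref = np.sum(x.astype(np.float64))           # fp64 reference sum

naive = np.float16(0)                        # fp16 accumulator:
for v in x:                                  # stalls once the running sum's
    naive = np.float16(naive + v)            # ULP exceeds the addends

acc32 = np.sum(x, dtype=np.float32)          # fp32 accumulation of fp16 inputs

def rel_error(approx, reference):
    """Relative error versus the fp64 reference (hypothetical helper)."""
    return abs(float(approx) - reference) / max(abs(reference), 1e-30)

print("fp16 accumulate rel. error:", rel_error(naive, ref))
print("fp32 accumulate rel. error:", rel_error(acc32, ref))
```

The fp16 accumulator stops making progress once the running sum reaches 2048, where the fp16 spacing exceeds the sub-unit addends, which is exactly why accumulation precision is validated separately from storage precision.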
Benefits
AMD benefits at a glance
Company
AMD
Advanced Micro Devices is a semiconductor company that designs and develops graphics units, processors, and media solutions.
H1B Sponsorship
AMD has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference. (Data powered by the US Department of Labor.)
Trends of Total Sponsorships: 2025 (836), 2024 (770), 2023 (551), 2022 (739), 2021 (519), 2020 (547)
Funding
Current Stage: Public Company
Total Funding: unknown
Key Investors: OpenAI, Daniel Loeb
Funding Rounds:
2025-10-06: Post-IPO Equity
2023-03-02: Post-IPO Equity
2021-06-29: Post-IPO Equity
Recent News
Italian Startups - Startupbusiness.it (2026-01-07)
Company data provided by crunchbase