Advanced Microdevices Pvt. Ltd. (India) · 2 months ago
Principal/Senior GPU Software Performance Engineer — Training at Scale
Advanced Micro Devices, Inc is dedicated to building innovative products that enhance computing experiences. The Principal/Senior GPU Software Performance Engineer will focus on optimizing GPU performance for large model training across multi-GPU clusters, collaborating with various teams to improve efficiency and throughput.
BiopharmaBiotechnologyIndustrialManufacturing
Responsibilities
Own kernel performance: Design, implement, and land high‑impact HIP/C++ kernels (e.g., attention, layernorm, softmax, GEMM/epilogues, fused pointwise) that are wave‑size portable and optimized for LDS, caches, and MFMA units
Lead profiling & tuning: Build repeatable workflows with timelines, hardware counters, and roofline analysis; remove memory bottlenecks; tune launch geometry/occupancy; validate speedups with A/B harnesses
Drive fusion & algorithmic improvements: Identify profitable fusions, tiling strategies, vectorized I/O, shared‑memory/scratchpad layouts, asynchronous pipelines, and warp/wave‑level collectives—while maintaining numerical stability
Influence frameworks & libraries: Upstream or extend performance‑critical ops in PyTorch/JAX/XLA/Triton; evaluate and integrate vendor math libraries; guide compiler/codegen choices for target architectures
Scale beyond one GPU: Optimize P2P and collective comms, overlap compute/comm, and improve data/pipeline/tensor parallelism throughput across nodes
Benchmarking & SLOs: Define and own KPIs (throughput, time‑to‑train, $/step, energy/step); maintain dashboards, perf CI gates, and regression triage
Technical leadership: Mentor senior engineers, set coding/perf standards, lead performance “war rooms,” and partner with silicon/vendor teams on microarchitecture‑aware optimizations
Quality & reliability: Build reproducible perf harnesses, deterministic test modes, and documentation/playbooks so improvements persist release‑over‑release
Qualification
Required
Experience in systems/HPC/ML performance engineering, with hands-on GPU kernel work and shipped optimizations in production training or HPC
Expert in modern C++ (C++17+) and at least one GPU programming model (CUDA, HIP, or SYCL/oneAPI) or a GPU kernel DSL (e.g., Triton); comfortable with templates, memory qualifiers, atomics, and warp/wave-level collectives
Deep understanding of GPU microarchitecture: SIMT execution, occupancy vs. register/scratchpad pressure, memory hierarchy (global/L2/shared or LDS), coalescing, bank conflicts, vectorization, and instruction-level parallelism
Proficiency with profiling & analysis: timelines and counters (e.g., Nsight Systems/Compute, rocprof/Omniperf, VTune/GPA or equivalents), ISA/disassembly inspection, and correlating metrics to code changes
Proven track record reducing time-to-train or $-per-step via kernel and collective-comms optimizations on multi-GPU clusters
Strong Linux fundamentals (perf/eBPF, NUMA, PCIe/links), build systems (CMake/Bazel), Python, and containerized dev (Docker/Podman)
Experience with distributed training (PyTorch DDP/FSDP/ZeRO/DeepSpeed or JAX) and GPU collectives
Expertise in mixed precision (BF16/FP16/FP8), numerics, and stability/accuracy validation at kernel boundaries
Background in compiler/IR (LLVM/MLIR) or codegen for GPU backends; ability to guide optimization passes with performance goals
Hands-on with cluster orchestration (Slurm/Kubernetes), IB/RDMA tuning, and compute/communication overlap strategies
Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent
Benefits
AMD benefits at a glance.
Company
Advanced Microdevices Pvt. Ltd. (India)
Advanced Microdevices (mdi) is a leader in innovative membrane technologies.
Funding
Current Stage
Late StageLeadership Team
Nalini Kant Gupta
Founder & Managing Director
Recent News
2024-10-18
2024-10-16
Company data provided by crunchbase