AMD · 12 hours ago

Principal Software Engineer – PyTorch Training Frameworks

AMD is a company dedicated to building innovative products that enhance computing experiences across various domains. It is seeking a principal-level expert in PyTorch training frameworks to optimize AI training performance on AMD Instinct accelerators, collaborating closely with multiple teams to ensure high-quality execution and reliability.

AI Infrastructure, Artificial Intelligence (AI), Cloud Computing, Computer, Embedded Systems, GPU, Hardware, Semiconductor
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Act as a technical authority for PyTorch training at AMD, setting direction for performance, scalability, and reliability
Drive optimization of key PyTorch training workloads (LLMs/foundation models) across single-node and multi-node systems
Improve and debug training performance in areas such as DDP/FSDP, gradient checkpointing, mixed precision, memory planning, and communication/computation overlap
Partner with ROCm compiler/runtime, kernel, and driver teams to resolve performance bottlenecks and correctness issues across the full stack
Contribute to and influence upstream PyTorch (design discussions, code contributions, performance fixes, CI/debug)
Develop and maintain representative training benchmarks, profiling workflows, and performance regression detection for key models
Lead deep-dive investigations of performance regressions and hard correctness issues; drive cross-team resolution to closure
Mentor engineers and raise the bar on framework-quality code, performance engineering practices, and technical rigor
Engage with strategic customers/partners on training enablement, root-cause analysis, and best practices for AMD platforms

Qualifications

PyTorch, Distributed training, Performance engineering, Python, C/C++, Technical communication, Mentoring, Collaboration

Required

Bachelor's or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent

Preferred

Deep experience with PyTorch internals and training systems (Autograd, optimizers, dataloading, compilation paths, runtime behavior)
Strong distributed training expertise: DDP, FSDP, tensor/pipeline parallel concepts, collectives (NCCL/RCCL), multi-node debugging
Proven track record in performance engineering (profiling, tracing, kernel/runtime analysis, memory optimization, scaling studies)
Strong programming skills in Python and C/C++ (ability to land clean, maintainable changes in large codebases)
Familiarity with PyTorch ecosystem components such as TorchInductor / torch.compile, Triton, CUDA/HIP-style programming models, and performance tooling
Experience working across OS/hardware boundaries in Linux-based environments (containers, CI, drivers/runtimes are a plus)
Clear technical communication: design docs, code reviews, stakeholder updates, and cross-team coordination
Demonstrated ability to lead through influence (principal-level impact, mentoring, and architectural decision-making)

Benefits

AMD benefits at a glance.

Company

Advanced Micro Devices is a semiconductor company that designs and develops graphics units, processors, and media solutions.

H1B Sponsorship

AMD has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by US Department of Labor)
Trends of Total Sponsorships
2025 (836)
2024 (770)
2023 (551)
2022 (739)
2021 (519)
2020 (547)

Funding

Current Stage
Public Company
Total Funding
unknown
Key Investors
OpenAI, Daniel Loeb
2025-10-06: Post-IPO Equity
2023-03-02: Post-IPO Equity
2021-06-29: Post-IPO Equity

Leadership Team

Lisa Su
Chair & CEO
Mark Papermaster
CTO and EVP
Company data provided by Crunchbase