Principal Software Engineer – PyTorch Training Frameworks jobs in United States

Advanced Microdevices Pvt. Ltd. (India) · 3 days ago

Principal Software Engineer – PyTorch Training Frameworks

Advanced Micro Devices, Inc. (AMD) is a leader in building products that accelerate next-generation computing experiences. AMD is seeking a Principal Software Engineer specializing in PyTorch training frameworks to improve the performance, scalability, and correctness of AI training on AMD Instinct™ accelerators. The role works with multiple teams to optimize and debug training workloads.

Biopharma · Biotechnology · Industrial · Manufacturing

Responsibilities

Act as a technical authority for PyTorch training at AMD, setting direction for performance, scalability, and reliability
Drive optimization of key PyTorch training workloads (LLMs/foundation models) across single-node and multi-node systems
Improve and debug training performance in areas such as DDP/FSDP, gradient checkpointing, mixed precision, memory planning, and communication/computation overlap
Partner with ROCm compiler/runtime, kernel, and driver teams to resolve performance bottlenecks and correctness issues across the full stack
Contribute to and influence upstream PyTorch (design discussions, code contributions, performance fixes, CI/debug)
Develop and maintain representative training benchmarks, profiling workflows, and performance regression detection for key models
Lead deep-dive investigations of performance regressions and hard correctness issues; drive cross-team resolution to closure
Mentor engineers and raise the bar on framework-quality code, performance engineering practices, and technical rigor
Engage with strategic customers/partners on training enablement, root-cause analysis, and best practices for AMD platforms

Qualifications

PyTorch · Distributed training · Performance engineering · Python · C/C++ · Technical communication · Mentoring · Collaboration

Required

Deep hands-on experience with PyTorch training and solving complex systems problems (performance, scaling, memory efficiency, distributed communication)
Strong technical leadership and ability to influence architecture across teams
Comfortable turning ambiguous problems into crisp execution
Clear communication with both engineers and stakeholders
Ability to represent AMD credibly in upstream/open-source discussions
Bachelor's or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent

Preferred

Deep experience with PyTorch internals and training systems (Autograd, optimizers, dataloading, compilation paths, runtime behavior)
Strong distributed training expertise: DDP, FSDP, tensor/pipeline parallel concepts, collectives (NCCL/RCCL), multi-node debugging
Proven track record in performance engineering (profiling, tracing, kernel/runtime analysis, memory optimization, scaling studies)
Strong programming skills in Python and C/C++ (ability to land clean, maintainable changes in large codebases)
Familiarity with PyTorch ecosystem components such as TorchInductor / torch.compile, Triton, CUDA/HIP-style programming models, and performance tooling
Experience working across OS/hardware boundaries in Linux-based environments (containers, CI, drivers/runtimes are a plus)
Clear technical communication: design docs, code reviews, stakeholder updates, and cross-team coordination
Demonstrated ability to lead through influence (principal-level impact, mentoring, and architectural decision-making)

Benefits

AMD benefits at a glance.

Company

Advanced Microdevices Pvt. Ltd. (India)

Advanced Microdevices (mdi) is a leader in innovative membrane technologies.

Funding

Current Stage
Late Stage

Leadership Team

Nalini Kant Gupta
Founder & Managing Director
Company data provided by Crunchbase