Director of Machine Learning Engineering -- Training and Performance jobs in United States

AMD · 2 months ago

Director of Machine Learning Engineering -- Training and Performance

Advanced Micro Devices, Inc. (AMD) is a leading company in the technology sector focused on building innovative products for next-generation computing experiences. The company is seeking a Director of Machine Learning Engineering to define and execute the technical vision for distributed training of large-scale generative AI and recommendation models, guiding a world-class engineering team to optimize model performance and efficiency.

AI Infrastructure, Artificial Intelligence (AI), Cloud Computing, Computer, Embedded Systems, GPU, Hardware, Semiconductor
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Define and drive AMD’s distributed training strategy for large-scale generative and recommendation models
Architect and optimize distributed training pipelines (pre-training, SFT, RL, etc.) for large-scale models
Lead development of high-performance, reliable training pipelines that scale across thousands of GPUs
Partner with compiler, runtime, system software, and hardware architecture teams to co-design solutions that maximize end-to-end performance
Build, mentor, and empower a team of expert engineers focused on innovation, collaboration, and technical excellence
Drive AMD’s engagement in open-source communities through contributions to frameworks such as PyTorch, JAX, TorchTitan, and Megatron-LM
Stay ahead of emerging advances in distributed training, LLMs, recommendation systems, and AI infrastructure — and translate them into scalable engineering practices

Qualification

Distributed training, AI infrastructure, Machine learning applications, Python, C++, ML frameworks, Leadership experience, Communication skills, Problem-solving skills, Collaboration skills

Required

10+ years in machine learning, distributed systems, or AI infrastructure; 5+ years in technical leadership or management roles
Proven experience building and optimizing distributed training systems for large models
Strong familiarity with ML frameworks (PyTorch, JAX, TensorFlow) and distributed frameworks (TorchTitan, Megatron-LM)
Hands-on expertise with LLMs, recommendation systems, or ranking models
Proficiency in Python and C++, including performance profiling, debugging, and large-scale optimization
Experience collaborating across hardware, compiler, and system software layers
Excellent communication, leadership, and problem-solving skills with the ability to influence across organizations and external partners
Master's or Ph.D. in Computer Science, Artificial Intelligence, Machine Learning, or a related field

Preferred

Experience in both model-level and application-level development and optimization

Benefits

AMD benefits at a glance

Company

Advanced Micro Devices is a semiconductor company that designs and develops graphics units, processors, and media solutions.

H1B Sponsorship

AMD has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (836)
2024 (770)
2023 (551)
2022 (739)
2021 (519)
2020 (547)

Funding

Current Stage
Public Company
Total Funding
unknown
Key Investors
OpenAI, Daniel Loeb
2025-10-06: Post-IPO Equity
2023-03-02: Post-IPO Equity
2021-06-29: Post-IPO Equity

Leadership Team

Lisa Su
Chair & CEO
Mark Papermaster
CTO and EVP
Company data provided by Crunchbase