Director of Machine Learning Engineering -- Training and Performance jobs in United States

AMD · 13 hours ago

Director of Machine Learning Engineering -- Training and Performance

AMD builds products for next-generation computing experiences, including AI and data centers. The company is seeking a Director of Machine Learning Engineering to define and execute the technical vision for distributed training of large-scale generative AI and recommendation models on AMD GPUs. The role guides a world-class engineering team and collaborates across teams to drive innovation.

Artificial Intelligence (AI), Cloud Computing, Computer, Embedded Systems, GPU, Hardware, Semiconductor
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Define and drive AMD’s distributed training strategy for large-scale generative and recommendation models. Align technical initiatives with broader AI platform goals and business impact
Architect and optimize distributed training pipelines (pre-training, SFT, RL, etc.) for large-scale models. Explore new approaches for efficient training and inference of LLMs and ranking systems
Lead development of high-performance, reliable training pipelines that scale across thousands of GPUs. Ensure world-class efficiency, stability, and model convergence
Partner with compiler, runtime, system software, and hardware architecture teams to co-design solutions that maximize end-to-end performance
Build, mentor, and empower a team of expert engineers focused on innovation, collaboration, and technical excellence
Drive AMD’s engagement in open-source communities through contributions to frameworks such as PyTorch, JAX, TorchTitan, and Megatron-LM. Represent AMD’s leadership in AI system design across industry and research communities
Stay ahead of emerging advances in distributed training, LLMs, recommendation systems, and AI infrastructure, and translate them into scalable engineering practices

Qualifications

Distributed training, AI infrastructure, Machine learning applications, Technical leadership, ML frameworks, Python, C++, Communication skills, Problem-solving skills, Team leadership

Required

10+ years in machine learning, distributed systems, or AI infrastructure; 5+ years in technical leadership or management roles
Proven experience building and optimizing distributed training systems for large models
Strong familiarity with ML frameworks (PyTorch, JAX, TensorFlow) and distributed frameworks (TorchTitan, Megatron-LM)
Hands-on expertise with LLMs, recommendation systems, or ranking models
Proficiency in Python and C++, including performance profiling, debugging, and large-scale optimization
Experience collaborating across hardware, compiler, and system software layers
Excellent communication, leadership, and problem-solving skills with the ability to influence across organizations and external partners
Master's or Ph.D. in Computer Science, Artificial Intelligence, Machine Learning, or a related field

Preferred

Experience in both model-level and application-level development and optimization
Location: San Jose, CA or Bellevue, WA preferred. Other U.S. locations near AMD offices may be considered

Benefits

AMD benefits at a glance.

Company

Advanced Micro Devices is a semiconductor company that designs and develops graphics units, processors, and media solutions.

H1B Sponsorship

AMD has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by US Department of Labor)
Trends of Total Sponsorships
2025 (836)
2024 (770)
2023 (551)
2022 (739)
2021 (519)
2020 (547)

Funding

Current Stage: Public Company
Total Funding: unknown
Key Investors: OpenAI, Daniel Loeb
2025-10-06: Post-IPO Equity
2023-03-02: Post-IPO Equity
2021-06-29: Post-IPO Equity

Leadership Team

Lisa Su, Chair & CEO
Mark Papermaster, CTO and EVP
Company data provided by Crunchbase