Director of Machine Learning Engineering -- Training and Performance jobs in United States

Advanced Microdevices Pvt. Ltd. (India) · 2 months ago

Director of Machine Learning Engineering -- Training and Performance

Advanced Micro Devices, Inc. builds products that accelerate next-generation computing experiences. The company is seeking a Director of Machine Learning Engineering to define and execute the technical vision for distributed training of large-scale generative AI and recommendation models on AMD GPUs, while leading a world-class engineering team and driving innovation in AI systems.

Biopharma · Biotechnology · Industrial · Manufacturing

Responsibilities

Define and drive AMD’s distributed training strategy for large-scale generative and recommendation models
Architect and optimize distributed training pipelines (pre-training, SFT, RL, etc.) for large-scale models
Lead development of high-performance, reliable training pipelines that scale across thousands of GPUs
Partner with compiler, runtime, system software, and hardware architecture teams to co-design solutions that maximize end-to-end performance
Build, mentor, and empower a team of expert engineers focused on innovation, collaboration, and technical excellence
Drive AMD’s engagement in open-source communities through contributions to frameworks such as PyTorch, JAX, TorchTitan, and Megatron-LM
Stay ahead of emerging advances in distributed training, LLMs, recommendation systems, and AI infrastructure — and translate them into scalable engineering practices

Qualifications

Distributed training · AI infrastructure · Machine learning applications · ML frameworks · Python · C++ · Leadership · Communication · Problem-solving

Required

Master's or Ph.D. in Computer Science, Artificial Intelligence, Machine Learning, or a related field

Preferred

10+ years in machine learning, distributed systems, or AI infrastructure; 5+ years in technical leadership or management roles
Proven experience building and optimizing distributed training systems for large models
Experience in both model-level and application-level development and optimization
Strong familiarity with ML frameworks (PyTorch, JAX, TensorFlow) and distributed frameworks (TorchTitan, Megatron-LM)
Hands-on expertise with LLMs, recommendation systems, or ranking models
Proficiency in Python and C++, including performance profiling, debugging, and large-scale optimization
Experience collaborating across hardware, compiler, and system software layers
Excellent communication, leadership, and problem-solving skills with the ability to influence across organizations and external partners

Benefits

AMD benefits at a glance.

Company

Advanced Microdevices Pvt. Ltd. (India)

Advanced Microdevices (mdi) is a leader in innovative membrane technologies.

Funding

Current Stage
Late Stage

Leadership Team

Nalini Kant Gupta
Founder & Managing Director
Company data provided by crunchbase