Principal ML Engineer - Large Scale Training Performance Optimization jobs in United States

Advanced Microdevices Pvt. Ltd. (India) · 9 hours ago

Principal ML Engineer - Large Scale Training Performance Optimization

Advanced Micro Devices, Inc. builds products that enhance computing experiences across a range of domains. The company is seeking a Principal Machine Learning Engineer to join its Models and Applications team, with a focus on optimizing large-model training on GPUs and improving training efficiency.

Biopharma, Biotechnology, Industrial, Manufacturing

Responsibilities

Train large models to convergence on AMD GPUs at scale
Improve the end-to-end training pipeline performance
Optimize the distributed training pipeline and algorithm to scale out
Contribute your changes to open source
Stay up-to-date with the latest training algorithms
Influence the direction of the AMD AI platform
Collaborate across teams with various groups and stakeholders

Qualifications

Distributed training algorithms, ML/DL frameworks, GPU kernel optimization, Python programming, C++ programming, Large models experience, Communication skills, Problem-solving skills

Required

Experience with distributed training pipelines
Knowledge of distributed training algorithms (Data Parallel, Tensor Parallel, Pipeline Parallel, Expert Parallel, ZeRO)
Familiar with training large models at scale
A master's degree or PhD in Computer Science, Artificial Intelligence, Machine Learning, or a related field

Preferred

Experience with ML/DL frameworks such as PyTorch, JAX, or TensorFlow
Experience with distributed training and distributed training frameworks, such as Megatron-LM, MaxText, or TorchTitan
Experience with LLMs or computer vision, especially large models
Experience with GPU kernel optimization
Excellent Python or C++ programming skills, including debugging, profiling, and performance analysis at scale
Experience with ML infra at kernel, framework, or system level
Strong communication and problem-solving skills

Benefits

AMD benefits at a glance.

Company

Advanced Microdevices Pvt. Ltd. (India)

Advanced Microdevices (mdi) is a leader in innovative membrane technologies.

Funding

Current Stage
Late Stage

Leadership Team

Nalini Kant Gupta
Founder & Managing Director
Company data provided by crunchbase