Distributed Training Validations and Automation Engineer jobs in United States

Advanced Microdevices Pvt. Ltd. (India) · 2 months ago

Distributed Training Validations and Automation Engineer

Advanced Micro Devices, Inc. is a leading company in the computing industry focused on innovation and collaboration. It is seeking an AI solutions validation engineer to validate AI solutions for distributed training and inference workloads, build automation for these tasks, and design groundbreaking technologies.

Biopharma, Biotechnology, Industrial, Manufacturing

Responsibilities

Work with AMD's architecture specialists to validate AI solutions for distributed training and inference workloads with AMD's ROCm software
Build cluster-scale automation for distributed training and inference workloads
Publish reference designs and benchmark numbers for AI workloads
Apply a data-minded approach to target optimization efforts
Design and develop new groundbreaking AMD technologies
Participate in new ASIC and hardware bring-ups
Develop technical relationships with peers and partners

Qualification

AI infrastructure validation
Distributed training automation
Performance benchmarking
Python programming
Kubernetes
Linux operating system
Effective communication
Problem-solving skills

Required

Passionate about software engineering, system design, validation, and automation
Possess leadership skills to drive sophisticated issues to resolution
Able to communicate effectively and work well with different teams across AMD

Preferred

Solid experience with complex compute systems used in AI and HPC deployments, and with backend network designs in RDMA clusters
Experience validating complex AI infrastructure (GPUs, networking, RoCEv2, UEC) and running benchmark tests such as IBPerf and RCCL/NCCL
Experience running training and training performance benchmarks for LLMs, MoE models, image generation models, and recommendation models with frameworks such as PyTorch, TensorFlow, Megatron-LM, and JAX
Experience running inference workloads and inference performance benchmarks in AI clusters with frameworks such as vLLM and SGLang
Experience with distributed systems and schedulers such as Kubernetes and Slurm
Ability to write high-quality automation frameworks and scripts in Python or Golang
Experience with performance profiling of CPUs and GPUs, and with debugging complex compute, network, and storage problems
Experience with AMD ROCm is an added advantage
Experience with Linux and Windows operating systems
Effective communication and problem-solving skills
Bachelor's or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent

Benefits

AMD benefits at a glance.

Company

Advanced Microdevices Pvt. Ltd. (India)

Advanced Microdevices (mdi) is a leader in innovative membrane technologies.

Funding

Current Stage
Late Stage

Leadership Team

Nalini Kant Gupta
Founder & Managing Director
Company data provided by Crunchbase