Advanced Microdevices Pvt. Ltd. (India) · 2 months ago
Distributed Training Validations and Automation Engineer
Advanced Micro Devices, Inc. is a leading company in the computing industry focused on innovation and collaboration. They are seeking an AI Solutions Validation Engineer to validate AI solutions for distributed training and inference workloads, build automation for these tasks, and design groundbreaking technologies.
Biopharma · Biotechnology · Industrial Manufacturing
Responsibilities
Work with AMD’s architecture specialists to validate AI solutions for distributed training and inference workloads with AMD’s ROCm software
Build cluster-scale automation for distributed training and inference workloads
Publish reference designs and benchmark numbers for AI workloads
Apply a data-minded approach to target optimization efforts
Design and develop new groundbreaking AMD technologies
Participate in new ASIC and hardware bring-ups
Develop technical relationships with peers and partners
Qualifications
Required
Passionate about software engineering, system design, validation, and automation
Possess leadership skills to drive complex issues to resolution
Able to communicate effectively and work optimally with different teams across AMD
Preferred
Solid experience with complex compute systems used in AI and HPC deployments, and with backend network designs in RDMA clusters
Experience validating complex AI infrastructure (GPUs, networking, RoCEv2, UEC) and running benchmark tests such as IBPerf and RCCL/NCCL
Experience training LLMs, MoE models, image generation models, and recommendation models with frameworks such as PyTorch, TensorFlow, Megatron-LM, and JAX, and running training performance benchmarks
Experience running inference workloads in AI clusters with inference frameworks such as vLLM and SGLang, and running inference performance benchmarks
Experience with distributed systems and schedulers such as Kubernetes and Slurm
Ability to write high-quality automation frameworks and scripts using Python or Golang
Experience with performance profiling of CPUs and GPUs, and with debugging complex compute, network, and storage problems
Experience with AMD ROCm would be an added advantage
Experience with Linux and Windows operating systems
Effective communication and problem-solving skills
Bachelor's or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent
Benefits
AMD benefits at a glance.
Company
Advanced Microdevices Pvt. Ltd. (India)
Advanced Microdevices (mdi) is a leader in innovative membrane technologies.
Funding
Current Stage
Late Stage
Leadership Team
Nalini Kant Gupta
Founder & Managing Director
Company data provided by crunchbase