AMD · 2 months ago
Distributed Training Validations and Automation Engineer
AMD is a leading company focused on building innovative products that enhance next-generation computing experiences. They are seeking a Distributed Training Validations and Automation Engineer who will work on validating AI solutions, building automation for distributed training, and developing new technologies.
AI Infrastructure, Artificial Intelligence (AI), Cloud Computing, Computer, Embedded Systems, GPU, Hardware, Semiconductor
Responsibilities
Work with AMD’s architecture specialists to validate AI solutions for distributed training and inference workloads with AMD's ROCm software
Build cluster-scale automation for distributed training and inference workloads
Publish reference designs and benchmark numbers for AI workloads
Apply a data-minded approach to target optimization efforts
Design and develop new groundbreaking AMD technologies
Participate in new ASIC and hardware bring-ups
Develop technical relationships with peers and partners
Qualifications
Required
Passionate about software engineering, system design, validation, automation
Possess leadership skills to drive sophisticated issues to resolution
Able to communicate effectively and work optimally with different teams across AMD
Preferred
Solid experience with complex compute systems used in AI and HPC deployments, and with backend network designs in RDMA clusters
Experience validating complex AI infrastructure (GPUs, networking, RoCEv2, UEC) and running benchmark tests such as IBPerf and RCCL/NCCL
Experience training LLMs, MoE models, image generation models, and recommendation models with frameworks such as PyTorch, TensorFlow, Megatron-LM, and JAX, and running training performance benchmarks
Experience running inference workloads in AI clusters with inference frameworks such as vLLM and SGLang, and running inference performance benchmarks
Experience with distributed systems and schedulers such as Kubernetes and Slurm
Ability to write high-quality automation frameworks and scripts using Python or Golang
Experience with performance profiling of CPUs and GPUs, and with debugging complex compute, network, and storage problems
Experience with AMD ROCm would be an added advantage
Experience with Linux and Windows operating systems
Effective communication and problem-solving skills
Bachelor's or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent
Benefits
AMD benefits at a glance
Company
AMD
Advanced Micro Devices is a semiconductor company that designs and develops graphics units, processors, and media solutions.
H1B Sponsorship
AMD has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data provided by the US Department of Labor)
Trends of Total Sponsorships
2025 (836)
2024 (770)
2023 (551)
2022 (739)
2021 (519)
2020 (547)
Funding
Current Stage: Public Company
Total Funding: unknown
Key Investors: OpenAI, Daniel Loeb
2025-10-06: Post-IPO Equity
2023-03-02: Post-IPO Equity
2021-06-29: Post-IPO Equity
Company data provided by Crunchbase