AMD · 5 days ago
AI/ML and GPU Performance QA Engineer
AMD is a company focused on building innovative products that enhance computing experiences across various domains. They are seeking a Senior Technical Validation Engineer to lead validation and performance engineering for Machine Learning and High-Performance Computing frameworks, ensuring the delivery of high-quality software for AI and HPC workloads.
Responsibilities
Lead validation for ML/AI models: accuracy testing, performance benchmarking, regression, drift detection, A/B testing
Test ML frameworks: PyTorch, Hugging Face, MLflow experiment tracking
Validate a wide variety of AI models to ensure correctness in distributed training and inference
Perform GPU testing & profiling: ROCm/CUDA validation, performance profiling, memory/thermal analysis, multi-GPU scaling
Validate HPC frameworks, distributed runtimes, compilers, and GPU libraries
Build scalable CI/CD workflows for ML/HPC validation. Develop automated test pipelines using Docker, Kubernetes, GitHub Actions, Jenkins
Validate cloud-based AI workloads on AWS SageMaker, Lambda, and S3
Run benchmarks in containerized and virtualized GPU environments
Design and implement automated validation pipelines for ML frameworks (e.g., PyTorch, TensorFlow, JAX) across GPU platforms
Develop and maintain benchmarking suites for AI models and HPC workloads, focusing on performance, scalability, and regression detection
Lead multi-node validation efforts using orchestration tools (e.g., Slurm, MPI, Kubernetes) to simulate real-world distributed training and inference
Collaborate with hardware and software teams to validate GPU hardware platforms (NVIDIA CUDA, AMD ROCm) for ML and HPC readiness
Analyze performance metrics using profiling tools (e.g., rocprof) and provide actionable insights
Drive test content development for emerging AI workloads, including LLMs, vision models, and scientific computing benchmarks
Perform bottleneck analysis, hyperparameter validation, and competitive benchmarking
Mentor junior engineers and contribute to validation strategy, tooling, and best practices
Qualifications
Required
Good understanding of and experience with ROCm, CUDA, GPU architecture, ML frameworks, CI/CD systems, benchmarking, and competitive analysis
Preferred
Bachelor's or Master's degree in Computer Science, Electrical Engineering, or related field
8+ years of experience in validation engineering, ML infrastructure, or HPC performance testing
Strong hands-on experience with GPU platforms (NVIDIA CUDA, AMD ROCm) and their software ecosystems
Deep understanding of AI model architectures, training/inference workflows, and ML performance bottlenecks
Proven experience with CI/CD systems, Git, Docker, and automated test frameworks
Expertise in multi-node orchestration and distributed system validation
Familiarity with HPC benchmarks (e.g., HPL, HPCG, MLPerf) and AI model benchmarking methodologies
Proficiency in scripting and automation (Python, Bash, YAML) in Linux environments
Strong communication, documentation, and cross-functional collaboration skills
Benefits
AMD benefits at a glance.
Company
AMD
Advanced Micro Devices is a semiconductor company that designs and develops graphics units, processors, and media solutions.
H1B Sponsorship
AMD has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (836)
2024 (770)
2023 (551)
2022 (739)
2021 (519)
2020 (547)
Funding
Current Stage: Public Company
Total Funding: unknown
Key Investors: OpenAI, Daniel Loeb
2025-10-06: Post-IPO Equity
2023-03-02: Post-IPO Equity
2021-06-29: Post-IPO Equity
Company data provided by Crunchbase