
AMD · 5 days ago

AI/ML and GPU Performance QA engineer

AMD is a company focused on building innovative products that enhance computing experiences across various domains. They are seeking a Senior Technical Validation Engineer to lead validation and performance engineering for Machine Learning and High-Performance Computing frameworks, ensuring the delivery of high-quality software for AI and HPC workloads.

AI Infrastructure, Artificial Intelligence (AI), Cloud Computing, Computer, Embedded Systems, GPU, Hardware, Semiconductor
Growth Opportunities
H1B Sponsor Likely
Hiring Manager
Tressa Cooper (she/her)

Responsibilities

Lead validation for ML/AI models: accuracy testing, performance benchmarking, regression, drift detection, A/B testing
Test ML frameworks: PyTorch, Hugging Face, MLflow experiment tracking
Validate a wide variety of AI models to ensure correctness in distributed training or inference
Perform GPU testing & profiling: ROCm/CUDA validation, performance profiling, memory/thermal analysis, multi-GPU scaling
Validate HPC frameworks, distributed runtimes, compilers, and GPU libraries
Build scalable CI/CD workflows for ML/HPC validation. Develop automated test pipelines using Docker, Kubernetes, GitHub Actions, Jenkins
Validate cloud-based AI workloads on AWS SageMaker, Lambda, and S3
Test benchmarks in containerized and virtualized GPU environments
Design and implement automated validation pipelines for ML frameworks (e.g., PyTorch, TensorFlow, JAX) across GPU platforms
Develop and maintain benchmarking suites for AI models and HPC workloads, focusing on performance, scalability, and regression detection
Conduct multi-node validation using orchestration tools (e.g., Slurm, MPI, Kubernetes) to simulate real-world distributed training and inference
Collaborate with hardware and software teams to validate GPU hardware platforms (NVIDIA CUDA, AMD ROCm) for ML and HPC readiness
Analyze performance metrics using profiling tools (e.g., rocprof) and provide actionable insights
Drive test content development for emerging AI workloads, including LLMs, vision models, and scientific computing benchmarks
Perform bottleneck analysis, hyperparameter validation, and competitive benchmarking
Mentor junior engineers and contribute to validation strategy, tooling, and best practices

Qualifications

GPU architecture, ML frameworks, CI/CD systems, Performance benchmarking, ROCm, CUDA, Distributed systems, Scripting (Python), Scripting (Bash), Communication skills, Collaboration skills

Required

Good understanding of, and experience with, ROCm, CUDA, GPU architecture, ML frameworks, CI/CD systems, benchmarking, and competitive analysis

Preferred

Bachelor's or Master's degree in Computer Science, Electrical Engineering, or related field
8+ years of experience in validation engineering, ML infrastructure, or HPC performance testing
Strong hands-on experience with GPU platforms (NVIDIA CUDA, AMD ROCm) and their software ecosystems
Deep understanding of AI model architectures, training/inference workflows, and ML performance bottlenecks
Proven experience with CI/CD systems, Git, Docker, and automated test frameworks
Expertise in multi-node orchestration and distributed system validation
Familiarity with HPC benchmarks (e.g., HPL, HPCG, MLPerf) and AI model benchmarking methodologies
Proficiency in scripting and automation (Python, Bash, YAML) in Linux environments
Strong communication, documentation, and cross-functional collaboration skills

Benefits

AMD benefits at a glance.

Company

Advanced Micro Devices is a semiconductor company that designs and develops graphics units, processors, and media solutions.

H1B Sponsorship

AMD has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data Powered by US Department of Labor)
Trends of Total Sponsorships
2025 (836)
2024 (770)
2023 (551)
2022 (739)
2021 (519)
2020 (547)

Funding

Current Stage
Public Company
Total Funding
unknown
Key Investors
OpenAI, Daniel Loeb
2025-10-06: Post-IPO Equity
2023-03-02: Post-IPO Equity
2021-06-29: Post-IPO Equity

Leadership Team

Lisa Su, Chair & CEO
Mark Papermaster, CTO and EVP
Company data provided by Crunchbase