
Advanced Microdevices Pvt. Ltd. (India) · 4 weeks ago

AI/ML and GPU Performance QA Engineer

Advanced Micro Devices, Inc. is a company focused on building products that enhance next-generation computing experiences. The company is seeking a Senior Technical Validation Engineer to lead validation and performance engineering for Machine Learning and High-Performance Computing frameworks, ensuring the delivery of high-quality software for AI and HPC workloads.

Biopharma · Biotechnology · Industrial · Manufacturing

Responsibilities

Lead validation for ML/AI models: accuracy testing, performance benchmarking, regression, drift detection, A/B testing
Test ML frameworks: PyTorch, Hugging Face, MLflow experiment tracking
Validate a wide variety of AI models to ensure correctness in distributed training and inference
Perform GPU testing & profiling: ROCm/CUDA validation, performance profiling, memory/thermal analysis, multi-GPU scaling
Validate HPC frameworks, distributed runtimes, compilers, and GPU libraries
Build scalable CI/CD workflows for ML/HPC validation. Develop automated test pipelines using Docker, Kubernetes, GitHub Actions, Jenkins
Validate cloud-based AI workloads on AWS SageMaker, Lambda, and S3
Run benchmarks in containerized and virtualized GPU environments
Design and implement automated validation pipelines for ML frameworks (e.g., PyTorch, TensorFlow, JAX) across GPU platforms
Develop and maintain benchmarking suites for AI models and HPC workloads, focusing on performance, scalability, and regression detection
Conduct multi-node validation using orchestration tools (e.g., Slurm, MPI, Kubernetes) to simulate real-world distributed training and inference
Collaborate with hardware and software teams to validate GPU hardware platforms (NVIDIA CUDA, AMD ROCm) for ML and HPC readiness
Analyze performance metrics using profiling tools (e.g., rocprof) and provide actionable insights
Drive test content development for emerging AI workloads, including LLMs, vision models, and scientific computing benchmarks
Perform bottleneck analysis, hyperparameter validation, and competitive benchmarking
Mentor junior engineers and contribute to validation strategy, tooling, and best practices

Qualifications

GPU architecture · Machine Learning frameworks · CI/CD systems · Performance benchmarking · Docker · Kubernetes · Python · Distributed systems · Communication skills · Documentation skills · Collaboration skills

Required

Good understanding of and experience with ROCm, CUDA, GPU architecture, ML frameworks, CI/CD systems, benchmarking, and competitive analysis

Preferred

Bachelor's or Master's degree in Computer Science, Electrical Engineering, or related field
8+ years of experience in validation engineering, ML infrastructure, or HPC performance testing
Strong hands-on experience with GPU platforms (NVIDIA CUDA, AMD ROCm) and their software ecosystems
Deep understanding of AI model architectures, training/inference workflows, and ML performance bottlenecks
Proven experience with CI/CD systems, Git, Docker, and automated test frameworks
Expertise in multi-node orchestration and distributed system validation
Familiarity with HPC benchmarks (e.g., HPL, HPCG, MLPerf) and AI model benchmarking methodologies
Proficiency in scripting and automation (Python, Bash, YAML) in Linux environments
Strong communication, documentation, and cross-functional collaboration skills

Benefits

AMD benefits at a glance.

Company

Advanced Microdevices Pvt. Ltd. (India)

Advanced Microdevices (mdi) is a leader in innovative membrane technologies.

Funding

Current Stage
Late Stage

Leadership Team

Nalini Kant Gupta
Founder & Managing Director
Company data provided by Crunchbase