Advanced Microdevices Pvt. Ltd. (India) · 1 month ago

AI/ML and GPU Performance QA Engineer

Austin, TX

Full-time

Hybrid

Senior Level, Lead/Staff

8+ years exp

Advanced Micro Devices, Inc. builds products that enhance next-generation computing experiences. The company is seeking a Senior Technical Validation Engineer to lead validation and performance engineering for Machine Learning and High-Performance Computing frameworks, ensuring the delivery of high-quality software for AI and HPC workloads.

Biotechnology · Industrial · Pharmaceutical · Manufacturing · Biopharma

Responsibilities

Lead validation for ML/AI models: accuracy testing, performance benchmarking, regression, drift detection, A/B testing

Test ML frameworks: PyTorch, Hugging Face, MLflow experiment tracking

Validate a wide variety of AI models to ensure correctness in distributed training and inference

Perform GPU testing & profiling: ROCm/CUDA validation, performance profiling, memory/thermal analysis, multi-GPU scaling

Validate HPC frameworks, distributed runtimes, compilers, and GPU libraries

Build scalable CI/CD workflows for ML/HPC validation, and develop automated test pipelines using Docker, Kubernetes, GitHub Actions, and Jenkins

Validate cloud-based AI workloads on AWS SageMaker, Lambda, and S3

Run benchmarks in containerized and virtualized GPU environments

Design and implement automated validation pipelines for ML frameworks (e.g., PyTorch, TensorFlow, JAX) across GPU platforms

Develop and maintain benchmarking suites for AI models and HPC workloads, focusing on performance, scalability, and regression detection

Lead multi-node validation efforts using orchestration tools (e.g., Slurm, MPI, Kubernetes) to simulate real-world distributed training and inference

Collaborate with hardware and software teams to validate GPU hardware platforms (NVIDIA CUDA, AMD ROCm) for ML and HPC readiness

Analyze performance metrics using profiling tools (e.g., rocprof) and provide actionable insights

Drive test content development for emerging AI workloads, including LLMs, vision models, and scientific computing benchmarks

Perform bottleneck analysis, hyperparameter validation, and competitive benchmarking

Mentor junior engineers and contribute to validation strategy, tooling, and best practices

Qualifications

GPU architecture · Machine Learning frameworks · CI/CD systems · Performance benchmarking · Docker · Kubernetes · Python · Distributed systems · Communication skills · Documentation skills · Collaboration skills

Required

Good understanding of and experience with ROCm, CUDA, GPU architecture, ML frameworks, CI/CD systems, benchmarking, and competitive analysis