Senior Staff ML Infra Engineer- Distributed Systems jobs in United States

AMD · 2 weeks ago

Senior Staff ML Infra Engineer- Distributed Systems

AMD is a company dedicated to building products that enhance next-generation computing experiences. They are seeking a Senior Staff AI Infra Engineer to lead technical initiatives and optimize performance for AI/ML workloads, particularly focusing on GPU-accelerated computing.

AI Infrastructure, Artificial Intelligence (AI), Cloud Computing, Computer, Embedded Systems, GPU, Hardware, Semiconductor
Growth Opportunities
H1B Sponsor Likely
Hiring Manager
Tressa Cooper (she/her)

Responsibilities

Lead technical initiatives and provide architectural guidance for AI/ML infrastructure and performance optimization
Optimize and accelerate LLM training and inference on AMD GPUs, improving kernel, communication, and end-to-end system efficiency
Develop and enhance infrastructure supporting LLMs, Agentic AI, and RAG systems
Design, build, and optimize AI workloads on GPU clusters, including large-scale training and inference orchestration, elastic scaling, and workload scheduling across heterogeneous hardware
Debug and resolve complex system-level performance issues across GPU, network, and runtime layers
Drive technical excellence, foster cross-team collaboration, and champion innovation within the organization

Qualifications

AI/ML infrastructure, C/C++, Python, Distributed systems, Transformer architectures, Kubernetes, GPU optimization, Technical ownership, Problem-solving skills, Communication

Required

5+ years of experience in AI/ML infrastructure, distributed systems, or performance-critical software development
Expert-level proficiency in C/C++ and Python
Solid understanding of transformer-based architectures and distributed training frameworks such as Megatron-LM, DeepSpeed, and PyTorch Distributed
Proven experience optimizing LLM training and inference pipelines, including TP/PP/DP/ZeRO parallelism, quantization, and mixed-precision techniques
Hands-on experience designing, building, and scaling training or inference platforms using Kubernetes, Ray, or Kubeflow
Familiarity with GPU architecture and distributed communication libraries (e.g., NCCL, RCCL, MPI), with the ability to analyze and optimize multi-GPU training performance
Experience with profiling and performance-analysis tools for GPU optimization and system-level debugging
Demonstrated technical ownership, strong communication, and problem-solving skills, with a proven record of delivering end-to-end AI/ML infrastructure solutions
Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or related field

Preferred

In-depth experience with the AMD ROCm ecosystem, including HIP kernel optimization for training and inference
Hands-on experience with model optimization techniques such as quantization, pruning, and distillation for efficient deployment
Knowledge of GPU architecture, memory hierarchy, and compiler-level optimization (e.g., kernel fusion, graph scheduling)
Familiarity with Agentic AI systems and autonomous AI workflows, including tool use, reasoning, and multi-agent orchestration for LLM-based applications

Benefits

AMD benefits at a glance.

Company

Advanced Micro Devices is a semiconductor company that designs and develops graphics units, processors, and media solutions.

H1B Sponsorship

AMD has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship
Trends of Total Sponsorships
2025 (836)
2024 (770)
2023 (551)
2022 (739)
2021 (519)
2020 (547)

Funding

Current Stage
Public Company
Total Funding
unknown
Key Investors
OpenAI, Daniel Loeb
2025-10-06: Post-IPO Equity
2023-03-02: Post-IPO Equity
2021-06-29: Post-IPO Equity

Leadership Team

Lisa Su
Chair & CEO
Mark Papermaster
CTO and EVP
Company data provided by Crunchbase