HPC Systems Staff Engineer jobs in United States

AMD · 4 hours ago

HPC Systems Staff Engineer

AMD builds products that accelerate next-generation computing experiences. The company is seeking an HPC Systems Staff Engineer to design, develop, and administer High-Performance Computing infrastructure, GPU clusters, and AI workload schedulers.

AI Infrastructure, Artificial Intelligence (AI), Cloud Computing, Computer, Embedded Systems, GPU, Hardware, Semiconductor
Growth Opportunities
No H1B
Hiring Manager
Matthew Fesl

Responsibilities

Develop, implement, and maintain GPU-based clusters, ensuring optimal performance
Administer ML/AI platforms (distributed ML services, LLMs, and AI inferencing) by managing deployments, resource allocation, monitoring, and security
Automate system provisioning and cluster management end to end
Collaborate with cross-functional teams to address AI infrastructure requirements, support AI-related projects, and provide technical expertise
Monitor and evaluate the performance of AI systems and clusters, ensuring that they adhere to industry best practices and meet company standards
Use AI/ML to continuously improve the internal processes and tools the team uses for end-to-end delivery of its services

Qualifications

HPC infrastructure engineering, GPU cluster management, AI workload schedulers, Python, Kubernetes, SLURM, Automation tools, Problem-solving skills, Communication skills

Required

Design, development, and administration of High-Performance Computing (HPC) infrastructure
Development, implementation, and maintenance of GPU-based clusters
Administration of ML/AI platforms (distributed ML services, LLMs, and AI inferencing)
Automation of system provisioning and cluster management end to end
Collaborate with cross-functional teams to address AI infrastructure requirements
Monitor and evaluate the performance of AI systems and clusters
Use AI/ML to continuously improve internal processes and tools

Preferred

Experience in developing Python-based AI apps and UIs
HPC infrastructure engineering for the AI/HPC domain
SLURM and Kubernetes management
Managing GPU clusters and optimizing GPU-based services, tools, and software
Experience in creating web services with an HPC backend (such as AI)
Proficiency in RoCEv2, K8s, KVM, Ubuntu, Python, Shell, GPU drivers, and Cluster interconnect with 400G networking
Demonstrated experience with AI workload schedulers and allocation optimization
Automation and monitoring tools: Ansible/SaltStack, Terraform, Prometheus, Grafana
Strong organizational, problem-solving, and troubleshooting skills
Excellent verbal and written communication skills

Benefits

AMD benefits at a glance.

Company

Advanced Micro Devices is a semiconductor company that designs and develops graphics units, processors, and media solutions.

Funding

Current Stage: Public Company
Total Funding: unknown
Key Investors: OpenAI, Daniel Loeb
2025-10-06: Post-IPO Equity
2023-03-02: Post-IPO Equity
2021-06-29: Post-IPO Equity

Leadership Team

Lisa Su
Chair & CEO
Mark Papermaster
CTO and EVP
Company data provided by Crunchbase