AMD
AI Systems Engineer - HPC
AMD is a company focused on building products that accelerate next-generation computing experiences. The AI Systems Engineer will design, develop, and administer High-Performance Computing (HPC) infrastructure, GPU clusters, and AI workload schedulers, while collaborating with cross-functional teams to support AI-related projects.
Responsibilities
Develop, implement, and maintain GPU-based clusters, ensuring optimal performance
Administer ML/AI platforms (distributed ML services, LLMs, and AI inferencing) by managing deployments, resource allocation, monitoring, and security
Automate system provisioning and cluster management end to end
Collaborate with cross-functional teams to address AI infrastructure requirements, support AI-related projects, and provide technical expertise
Monitor and evaluate the performance of AI systems and clusters, ensuring that they adhere to industry best practices and meet company standards
Use AI/ML to continuously improve the internal processes and tools used in the team's end-to-end service delivery
Qualifications
Required
Experience in the design, development, and administration of High-Performance Computing (HPC) infrastructure, GPU clusters, and AI workload schedulers
Passion for learning and for large-scale distributed computing for AI and HPC workloads
Ownership of end-to-end outcomes
Desire to build scalable and highly performant HPC/AI/Data services with AMD hardware, software, people and processes
Curiosity to learn and improve scalable HPC systems
Significant experience working across a globally distributed organization
Preferred
Experience developing Python-based AI applications and user interfaces
HPC infrastructure engineering for the AI/HPC domain
SLURM and Kubernetes management
Managing GPU clusters and optimizing GPU-based services, tools, and software
Experience creating web services with an HPC backend (for example, AI)
Proficiency in RoCEv2, K8s, KVM, Ubuntu, Python, Shell, GPU drivers, and cluster interconnects with 400G networking
Demonstrated experience with AI workload schedulers and allocation optimization
Automation and monitoring tools: Ansible or SaltStack, Terraform, Prometheus, Grafana
Strong organizational, problem-solving, and troubleshooting skills, with the ability to manage multiple projects simultaneously
Excellent verbal and written communication skills, with the ability to collaborate effectively with team members and stakeholders at all levels of the organization
Benefits
AMD benefits at a glance.
Company
AMD
Advanced Micro Devices is a semiconductor company that designs and develops graphics units, processors, and media solutions.
Funding
Current Stage: Public Company
Total Funding: unknown
Key Investors: OpenAI, Daniel Loeb
2025-10-06: Post-IPO Equity
2023-03-02: Post-IPO Equity
2021-06-29: Post-IPO Equity
Company data provided by Crunchbase