AMD

AI Systems Engineer - HPC

San Jose, CA

Full-time

Onsite

Mid, Senior Level

$138K/yr - $208K/yr

AMD is a company focused on building products that accelerate next-generation computing experiences. The AI Systems Engineer will design, develop, and administer High-Performance Computing (HPC) infrastructure, GPU clusters, and AI workload schedulers, while collaborating with cross-functional teams to support AI-related projects.

Artificial Intelligence (AI), Cloud Computing, Computer, Embedded Systems, GPU, Hardware, Semiconductor

Growth Opportunities

No H1B

Hiring Manager

Matthew Fesl

Responsibilities

Develop, implement, and maintain GPU-based clusters, ensuring optimal performance

Administer ML/AI platforms (distributed ML services, LLMs, and AI inferencing) by managing deployments, resource allocation, monitoring, and security

Automate system provisioning and cluster management end to end

Collaborate with cross-functional teams to address AI infrastructure requirements, support AI-related projects, and provide technical expertise

Monitor and evaluate the performance of AI systems and clusters, ensuring that they adhere to industry best practices and meet company standards

Use AI/ML to continuously improve the internal processes and tools used in the team's end-to-end service delivery

Qualifications

HPC infrastructure engineering, GPU cluster management, AI workload schedulers, Python, Kubernetes management, Automation tools, Problem-solving skills, Communication skills

Required

Design, development, and administration of High-Performance Computing (HPC) infrastructure, GPU clusters, and AI workload schedulers

Passion for learning and the field of large-scale distributed computing in AI and HPC workloads

Responsibility for end-to-end outcomes of efforts

Desire to build scalable and highly performant HPC/AI/Data services with AMD hardware, software, people and processes

Curiosity to learn and improve scalable HPC systems

Significant experience in working across a globally distributed organization

Preferred

Experience developing Python-based AI apps and UIs

HPC infrastructure engineering for AI/HPC domain

SLURM and Kubernetes management

Managing GPU clusters and optimizing GPU-based services, tools, and software

Experience creating web services with an HPC backend (such as AI)

Proficiency in RoCEv2, K8s, KVM, Ubuntu, Python, Shell, GPU drivers, and cluster interconnects with 400G networking

Demonstrated experience with AI workload schedulers and allocation optimization

Automation and monitoring tools: Ansible/SaltStack, Terraform, Prometheus, Grafana

Strong organizational, problem-solving, and troubleshooting skills, with the ability to manage multiple projects simultaneously

Excellent verbal and written communication skills, with the ability to collaborate effectively with team members and stakeholders at all levels of the organization