Systems Design Engineer – AI Cluster Storage Architect jobs in United States

Advanced Microdevices Pvt. Ltd. (India) · 2 months ago

Systems Design Engineer – AI Cluster Storage Architect

Advanced Micro Devices, Inc. is focused on building great products that accelerate next-generation computing experiences. The role involves applying HPC expertise to shape AI infrastructure, creating reference architectures, and supporting internal teams and customers with informed hardware and software decisions.

Biopharma, Biotechnology, Industrial, Manufacturing

Responsibilities

Apply your HPC expertise to shape AI infrastructure by creating reference architectures, configuration guides, and deployment blueprints that help internal teams and customers make informed hardware and software decisions
Build a library of technical artifacts, including presentations, design documents, and “how it works” guides, to support pre-sales engineers and enable others to skill up from an HPC perspective
Perform deep technical evaluations of HPC and AI stacks with a specific focus on storage solutions, documenting how they work, where they fit, and the tradeoffs involved between storage technologies and vendors
Design and execute reproducible experiments and benchmarking harnesses to compare storage technologies and their fit-for-purpose across AI and HPC workloads
Develop small reference implementations and tools to validate performance hypotheses and analyze system behavior
Present findings through demos, documentation, and internal talks, and create templates and checklists to support repeatable evaluations and cluster designs

Qualifications

HPC expertise, AI infrastructure, Storage solutions, Linux fundamentals, Parallel filesystems, Orchestration systems, Comparative analysis, Documentation skills, Engineering mindset, Curiosity, Clear communication, Initiative

Required

Engineering mindset: Evidence of end-to-end systems thinking, debugging, and tradeoff decisions
Storage/data: parallel filesystems (Lustre, BeeGFS), object stores, RDMA, data pipeline throughput and caching strategies
Extensive knowledge of the current storage vendor landscape
AI/HPC cluster background: hands-on familiarity with schedulers and/or orchestration systems (e.g., Slurm, Kubernetes), MPI/OpenMP, distributed storage patterns, or performance analysis
Comparative analysis: experience writing evaluation docs/RFCs with clear criteria, benchmarks, risks, and recommendations
Strong Linux fundamentals: Linux operating systems, networking, filesystems, containers, performance tooling (perf, flamegraphs, nvprof/rocprof, basic eBPF)
Clear communication: ability to turn complex systems into accessible, structured documentation with diagrams and reproducible steps

Preferred

AMD ecosystem experience: ROCm, RCCL, Instinct GPUs, EPYC platforms, compiler/toolchain impacts, and performance tuning
Distributed training internals: DDP, collective comms, sharded/stateful optimizers; NCCL/RCCL behavior and transport considerations (PCIe, NVLink, IF)
Orchestration models: Slurm configuration patterns, Kubernetes for HPC/AI (GPU operators, device plugins), Apptainer/Singularity
Enterprise storage solutions (NAS, NFS), particularly large scale, and design patterns for federation/replication, backup, and DR
IaC literacy: Terraform/Ansible for reproducible blueprints—focused on design and sample configs, not running prod clusters
Documentation tooling: reproducible docs/workbooks, literate programming notebooks, CI for benchmarks

Benefits

AMD benefits at a glance.

Company

Advanced Microdevices Pvt. Ltd. (India)

Advanced Microdevices (mdi) is a leader in innovative membrane technologies.

Funding

Current Stage
Late Stage

Leadership Team

Nalini Kant Gupta
Founder & Managing Director
Company data provided by Crunchbase