Advanced Microdevices Pvt. Ltd. (India) · 2 months ago
Systems Design Engineer – AI Cluster Storage Architect
Advanced Micro Devices, Inc. is focused on building great products that accelerate next-generation computing experiences. The role involves applying HPC expertise to shape AI infrastructure, creating reference architectures, and supporting internal teams and customers with informed hardware and software decisions.
Biopharma · Biotechnology · Industrial · Manufacturing
Responsibilities
Apply your HPC expertise to shape AI infrastructure by creating reference architectures, configuration guides, and deployment blueprints that help internal teams and customers make informed hardware and software decisions
Build a library of technical artifacts, including presentations, design documents, and “how it works” guides, to support pre-sales engineers and help others build skills from an HPC perspective
Perform deep technical evaluations of HPC and AI stacks with a specific focus on storage solutions, documenting how they work, where they fit, and the tradeoffs involved between storage technologies and vendors
Design and execute reproducible experiments and benchmarking harnesses to compare storage technologies and their fit-for-purpose across AI and HPC workloads
Develop small reference implementations and tools to validate performance hypotheses and analyze system behavior
Present findings through demos, documentation, and internal talks, and create templates and checklists to support repeatable evaluations and cluster designs
Qualifications
Required
Engineering mindset: evidence of end-to-end systems thinking, debugging, and tradeoff decisions
Storage/data: parallel filesystems (Lustre, BeeGFS), object stores, RDMA, data pipeline throughput and caching strategies
Extensive knowledge of the current storage vendor landscape
AI/HPC cluster background: hands-on familiarity with schedulers and/or orchestration systems (e.g., Slurm, Kubernetes), MPI/OpenMP, distributed storage patterns, or performance analysis
Comparative analysis: experience writing evaluation docs/RFCs with clear criteria, benchmarks, risks, and recommendations
Strong Linux fundamentals: Linux operating systems, networking, filesystems, containers, performance tooling (perf, flamegraphs, nvprof/rocprof, basic eBPF)
Clear communication: ability to turn complex systems into accessible, structured documentation with diagrams and reproducible steps
Preferred
AMD ecosystem experience: ROCm, RCCL, Instinct GPUs, EPYC platforms, compiler/toolchain impacts, and performance tuning
Distributed training internals: DDP, collective comms, sharded/stateful optimizers; NCCL/RCCL behavior and transport considerations (PCIe, NVLink, Infinity Fabric)
Orchestration models: Slurm configuration patterns, Kubernetes for HPC/AI (GPU operators, device plugins), Apptainer/Singularity
Enterprise storage solutions (NAS, NFS), particularly at large scale, and design patterns for federation/replication, backup, and DR
IaC literacy: Terraform/Ansible for reproducible blueprints—focused on design and sample configs, not running prod clusters
Documentation tooling: reproducible docs/workbooks, literate programming notebooks, CI for benchmarks
Benefits
AMD benefits at a glance.
Company
Advanced Microdevices Pvt. Ltd. (India)
Advanced Microdevices (mdi) is a leader in innovative membrane technologies.
Funding
Current Stage
Late Stage
Leadership Team
Nalini Kant Gupta
Founder & Managing Director
Company data provided by crunchbase