
Advanced Microdevices Pvt. Ltd. (India)

Post-Training Platform Infrastructure Engineer

Advanced Micro Devices, Inc. builds products that power next-generation computing experiences. The company is seeking a systems-minded engineer to work on post-training and inference infrastructure, with an emphasis on performance optimization and distributed systems.
Biotechnology · Industrial · Pharmaceutical · Manufacturing · Biopharma

Responsibilities

- Research and deeply understand modern LLM inference frameworks, including:
  - Architecture and design tradeoffs of P/D (prefill/decode) disaggregation
  - KV cache lifecycle, memory layout, eviction strategies, and reuse
  - KV cache offloading mechanisms across GPU, CPU, and storage backends
- Analyze and compare inference execution paths to identify:
  - Performance bottlenecks (latency, throughput, memory pressure)
  - Inefficiencies in scheduling, cache management, and resource utilization
- Develop and implement infrastructure-level features to:
  - Improve inference latency, throughput, and memory efficiency
  - Optimize KV cache management and offloading strategies
  - Enhance scalability across multi-GPU and multi-node deployments
- Apply the same research-driven approach to RL frameworks:
  - Study post-training and RL systems (e.g., policy rollouts, inference-heavy loops)
  - Debug performance and correctness issues in distributed RL pipelines
  - Optimize inference, rollout efficiency, and memory usage during training
- Collaborate with research and applied ML teams to:
  - Translate model-level requirements into infrastructure capabilities
  - Validate performance gains with benchmarks and real workloads
- Document findings, architectural insights, and best practices to guide future system design
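To give a flavor of the KV cache lifecycle work described above, here is a toy two-tier cache sketch: a small "GPU" tier with LRU eviction that offloads evicted entries to a larger "CPU" tier instead of discarding them, so they can be reused later. All names are hypothetical illustrations, not AMD code or any real framework's API.

```python
from collections import OrderedDict

class ToyKVCache:
    """Toy two-tier KV cache: a bounded hot tier ("GPU") with LRU
    eviction that offloads victims to a spill tier ("CPU") for reuse."""

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()  # hot tier, kept in LRU order
        self.cpu = {}             # offload tier (unbounded here)

    def put(self, seq_id, kv_block):
        self.gpu[seq_id] = kv_block
        self.gpu.move_to_end(seq_id)              # mark most recently used
        while len(self.gpu) > self.gpu_capacity:
            victim, block = self.gpu.popitem(last=False)  # evict LRU entry
            self.cpu[victim] = block              # offload rather than drop

    def get(self, seq_id):
        if seq_id in self.gpu:
            self.gpu.move_to_end(seq_id)          # refresh recency on hit
            return self.gpu[seq_id]
        if seq_id in self.cpu:                    # hit in offload tier:
            block = self.cpu.pop(seq_id)          # promote back to hot tier
            self.put(seq_id, block)
            return block
        return None                               # full miss: caller recomputes
```

A real implementation would manage fixed-size pages of attention key/value tensors and overlap transfers with compute; this sketch only shows the eviction/offload/reuse state machine.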

Qualifications

LLM inference frameworks · Distributed systems · GPU-accelerated workloads · Python · C++ · Performance optimization · KV cache management · Analytical skills · Collaboration · Problem-solving

Required

Bachelor's or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or an equivalent field

Preferred

- Strong background in systems engineering, distributed systems, or ML infrastructure
- Hands-on experience with GPU-accelerated workloads and memory-constrained systems
- Solid understanding of LLM inference workflows, including:
  - Prefill vs. decode phases
  - Attention mechanisms and KV cache behavior
  - Multi-process / multi-GPU execution models
- Proficiency in Python and C++ (or similar systems languages)
- Experience debugging performance issues using profiling tools (GPU, CPU, memory)
- Ability to read, understand, and modify complex open-source codebases
- Strong analytical skills and comfort working in research-heavy, ambiguous problem spaces
- Direct experience with LLM inference frameworks or serving stacks
- Familiarity with GPU memory hierarchies (HBM, pinned memory, NUMA considerations)
- KV cache compression, paging, or eviction strategies
- Storage-backed offloading (NVMe, object stores, distributed file systems)
- Experience with distributed RL or post-training pipelines
- Knowledge of scheduling systems, async execution, or actor-based runtimes
- Contributions to open-source ML or systems projects
- Experience designing benchmarking suites or performance evaluation frameworks
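As a flavor of the benchmarking work mentioned above, a minimal latency/throughput harness might look like the following. This is an illustrative sketch, not any particular team's tooling: it times a callable over a batch of requests and reports p50/p95 latency and aggregate throughput.

```python
import statistics
import time

def benchmark(fn, requests, warmup=2):
    """Minimal harness: run fn over each request, record per-request
    wall-clock latency, and summarize p50/p95 latency and throughput."""
    for r in requests[:warmup]:
        fn(r)                                     # warm caches / lazy init
    latencies = []
    start = time.perf_counter()
    for r in requests:
        t0 = time.perf_counter()
        fn(r)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    ordered = sorted(latencies)
    p95_idx = max(0, int(0.95 * len(ordered)) - 1)
    return {
        "p50_s": statistics.median(latencies),
        "p95_s": ordered[p95_idx],
        "throughput_rps": len(requests) / elapsed,
    }
```

A production suite would add percentile interpolation, concurrency, and GPU-side timing (e.g., device events rather than wall clock), but the shape of the measurement loop is the same.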

Benefits

AMD benefits at a glance.

Company

Advanced Microdevices Pvt. Ltd. (India)

Advanced Microdevices (mdi) is a leader in innovative membrane technologies.

Funding

Current Stage
Late Stage

Leadership Team

Nalini Kant Gupta
Founder & Managing Director
Company data provided by Crunchbase