
Advanced Microdevices Pvt. Ltd. (India)

Post-Training Platform Infrastructure Engineer

Advanced Micro Devices, Inc. builds products that power next-generation computing experiences. The company is seeking a systems-minded engineer to work on post-training and inference infrastructure, with an emphasis on performance optimization and distributed systems.
Biotechnology · Industrial · Pharmaceutical · Manufacturing · Biopharma

Responsibilities

- Research and deeply understand modern LLM inference frameworks, including:
  - Architecture and design tradeoffs of P/D (prefill/decode) disaggregation
  - KV cache lifecycle, memory layout, eviction strategies, and reuse
  - KV cache offloading mechanisms across GPU, CPU, and storage backends
- Analyze and compare inference execution paths to identify:
  - Performance bottlenecks (latency, throughput, memory pressure)
  - Inefficiencies in scheduling, cache management, and resource utilization
- Develop and implement infrastructure-level features to:
  - Improve inference latency, throughput, and memory efficiency
  - Optimize KV cache management and offloading strategies
  - Enhance scalability across multi-GPU and multi-node deployments
- Apply the same research-driven approach to RL frameworks:
  - Study post-training and RL systems (e.g., policy rollouts, inference-heavy loops)
  - Debug performance and correctness issues in distributed RL pipelines
  - Optimize inference, rollout efficiency, and memory usage during training
- Collaborate with research and applied ML teams to:
  - Translate model-level requirements into infrastructure capabilities
  - Validate performance gains with benchmarks and real workloads
- Document findings, architectural insights, and best practices to guide future system design
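To give a flavor of the KV cache lifecycle work described above, here is a toy two-tier cache sketch: a small "GPU" tier with LRU eviction that offloads evicted entries to a larger "CPU" tier instead of discarding them, so they can be reused later. All names are hypothetical illustrations, not AMD code or any real framework's API.

```python
from collections import OrderedDict

class ToyKVCache:
    """Toy two-tier KV cache: a bounded hot tier ("GPU") with LRU
    eviction that offloads victims to a spill tier ("CPU") for reuse."""

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()  # hot tier, kept in LRU order
        self.cpu = {}             # offload tier (unbounded here)

    def put(self, seq_id, kv_block):
        self.gpu[seq_id] = kv_block
        self.gpu.move_to_end(seq_id)              # mark most recently used
        while len(self.gpu) > self.gpu_capacity:
            victim, block = self.gpu.popitem(last=False)  # evict LRU entry
            self.cpu[victim] = block              # offload rather than drop

    def get(self, seq_id):
        if seq_id in self.gpu:
            self.gpu.move_to_end(seq_id)          # refresh recency on hit
            return self.gpu[seq_id]
        if seq_id in self.cpu:                    # hit in offload tier:
            block = self.cpu.pop(seq_id)          # promote back to hot tier
            self.put(seq_id, block)
            return block
        return None                               # full miss: caller recomputes
```

A real implementation would manage fixed-size pages of attention key/value tensors and overlap transfers with compute; this sketch only shows the eviction/offload/reuse state machine.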

Qualifications

LLM inference frameworks · Distributed systems · GPU-accelerated workloads · Python · C++ · Performance optimization · KV cache management · Analytical skills · Collaboration · Problem-solving

Required

Bachelor's or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or an equivalent field

Preferred

- Strong background in systems engineering, distributed systems, or ML infrastructure
- Hands-on experience with GPU-accelerated workloads and memory-constrained systems
- Solid understanding of LLM inference workflows, including:
  - Prefill vs. decode phases
  - Attention mechanisms and KV cache behavior
  - Multi-process / multi-GPU execution models
- Proficiency in Python and C++ (or similar systems languages)
- Experience debugging performance issues using profiling tools (GPU, CPU, memory)
- Ability to read, understand, and modify complex open-source codebases
- Strong analytical skills and comfort working in research-heavy, ambiguous problem spaces
- Direct experience with LLM inference frameworks or serving stacks
- Familiarity with GPU memory hierarchies (HBM, pinned memory, NUMA considerations)
- KV cache compression, paging, or eviction strategies
- Storage-backed offloading (NVMe, object stores, distributed file systems)
- Experience with distributed RL or post-training pipelines
- Knowledge of scheduling systems, async execution, or actor-based runtimes
- Contributions to open-source ML or systems projects
- Experience designing benchmarking suites or performance evaluation frameworks
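As a flavor of the benchmarking work mentioned above, a minimal latency/throughput harness might look like the following. This is an illustrative sketch, not any particular team's tooling: it times a callable over a batch of requests and reports p50/p95 latency and aggregate throughput.

```python
import statistics
import time

def benchmark(fn, requests, warmup=2):
    """Minimal harness: run fn over each request, record per-request
    wall-clock latency, and summarize p50/p95 latency and throughput."""
    for r in requests[:warmup]:
        fn(r)                                     # warm caches / lazy init
    latencies = []
    start = time.perf_counter()
    for r in requests:
        t0 = time.perf_counter()
        fn(r)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    ordered = sorted(latencies)
    p95_idx = max(0, int(0.95 * len(ordered)) - 1)
    return {
        "p50_s": statistics.median(latencies),
        "p95_s": ordered[p95_idx],
        "throughput_rps": len(requests) / elapsed,
    }
```

A production suite would add percentile interpolation, concurrency, and GPU-side timing (e.g., device events rather than wall clock), but the shape of the measurement loop is the same.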

Benefits

AMD benefits at a glance.

Company

Advanced Microdevices Pvt. Ltd. (India)

Advanced Microdevices (mdi) is a leader in innovative membrane technologies.

Funding

Current Stage
Late Stage

Leadership Team

Nalini Kant Gupta
Founder & Managing Director
Company data provided by Crunchbase