Fellow Software Development Eng (MI-450 Fleet Management) jobs in United States

AMD · 1 day ago

Fellow Software Development Eng (MI-450 Fleet Management)

AMD is a company focused on building innovative products that enhance computing experiences across various domains. They are seeking a Fellow in Infrastructure Management who will lead the architectural direction of software managing GPU clusters and drive complex AI solutions.

Artificial Intelligence (AI), Cloud Computing, Computer, Embedded Systems, GPU, Hardware, Semiconductor
Growth Opportunities
H1B Sponsor Likely
Hiring Manager
Gearóid Ó.

Responsibilities

Lead the technical vision, strategy, and architectural direction of our Infrastructure management software, which manages the GPU clusters supporting large foundation model training, high-performance inference services, and multi-tenant GPU sharing and scheduling
Architect and implement GPU node orchestration, failure detection, auto-remediation, and auto-scaling of clusters. Design infrastructure software that can support different distributed training frameworks, such as PyTorch, Megatron, JAX, and TensorFlow, and different distributed inference frameworks, such as SGLang, vLLM, and Ray. Design and deliver software that can manage different scale-up and scale-out transport protocols and deliver the best network performance
Architect telemetry, observability, and profiling systems (Prometheus/Thanos, OpenTelemetry, Mimir) to measure GPU health and cluster efficiency. Architect monitoring systems that can investigate network congestion, latency spikes, scheduling inefficiencies, and system bottlenecks
Architect our Infrastructure management software for scale and efficiency, and deliver industry-leading GPU cluster utilization that is highly reliable and self-healing. Deliver infrastructure services that reduce job latencies for Slurm and Kubernetes clusters, improve scheduling efficiencies, and reduce operational cost
Design and deliver AI Agents that can troubleshoot complex infrastructure problems without human intervention, reducing OPEX and MTTR. Design and deliver AI Agents that can proactively identify nodes likely to fail before they do
Define our long-term infrastructure management roadmap, drive cross-team initiatives, and deliver on them. Mentor Principal Engineers and technical staff across teams. Work with external partners and vendors to develop and deliver the most comprehensive infrastructure solution for AMD

Qualifications

Infrastructure management, GPU orchestration, Distributed training frameworks, Telemetry, Observability, AI Agents design, Leadership skills, Effective communication, Collaboration

Required

Passionate about software engineering, system design, and infrastructure management
Possess leadership skills to drive sophisticated issues to resolution
Able to communicate effectively and work optimally with different teams across AMD
Lead the technical vision, strategy, and architectural direction of Infrastructure management software
Architect and implement GPU node orchestration, failure detection, auto-remediation, and auto-scaling of clusters
Design infrastructure software that can support different distributed training frameworks, such as PyTorch, Megatron, JAX, and TensorFlow
Design and deliver software that can manage different scale-up and scale-out transport protocols and deliver the best network performance
Architect telemetry, observability, and profiling systems (Prometheus/Thanos, OpenTelemetry, Mimir)
Architect monitoring systems that can investigate network congestion, latency spikes, scheduling inefficiencies, and system bottlenecks
Architect Infrastructure management software for scale and efficiency, and deliver industry-leading GPU cluster utilization
Deliver infrastructure services that reduce job latencies for Slurm and Kubernetes clusters, improve scheduling efficiencies, and reduce operational cost
Design and deliver AI Agents that can troubleshoot complex infrastructure problems without human intervention
Design and deliver AI Agents that can proactively identify nodes likely to fail before they do
Define the long-term infrastructure management roadmap, drive cross-team initiatives, and deliver on them
Mentor Principal Engineers and technical staff across teams
Work with external partners and vendors to develop and deliver a comprehensive infrastructure solution for AMD
Bachelor's or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent

Benefits

AMD benefits at a glance.

Company

Advanced Micro Devices is a semiconductor company that designs and develops graphics units, processors, and media solutions.

H1B Sponsorship

AMD has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship]
Trends of Total Sponsorships
2025 (836)
2024 (770)
2023 (551)
2022 (739)
2021 (519)
2020 (547)

Funding

Current Stage: Public Company
Total Funding: unknown
Key Investors: OpenAI, Daniel Loeb
2025-10-06: Post-IPO Equity
2023-03-02: Post-IPO Equity
2021-06-29: Post-IPO Equity

Leadership Team

Lisa Su, Chair & CEO
Mark Papermaster, CTO and EVP
Company data provided by Crunchbase