Member of Technical Staff, Hardware Health - MAI Superintelligence Team jobs in United States

Microsoft · 1 week ago

Member of Technical Staff, Hardware Health - MAI Superintelligence Team

Microsoft AI operates one of the world’s most advanced AI training infrastructures, featuring multi-gigawatt clusters and high-performance GPUs. The Member of Technical Staff, Hardware Health, will ensure the reliability and performance of these systems by developing predictive health models and collaborating with various engineering teams.

Agentic AI, Application Performance Management, Artificial Intelligence (AI), Business Development, DevOps, Information Services, Information Technology, Management Information Systems, Network Security, Software
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Design and develop next-generation hardware health monitoring and diagnostic frameworks for large GPU clusters (NVL16/NVL72/GB200+ scale)
Build predictive analytics pipelines leveraging telemetry, power, and thermal data to anticipate hardware degradation and systemic issues
Collaborate with silicon, firmware, and datacenter engineers to identify root causes and remediate large-scale hardware anomalies
Define system health KPIs (e.g., NIS/RIS, MTBF, failure domain analysis) and integrate them into real-time observability platforms
Lead incident triage for high-impact GPU, network, and cooling issues across distributed clusters
Drive automation in health management to reduce manual intervention to the top 5% of anomalies
Partner with cross-functional teams to influence hardware design for reliability, thermal efficiency, and serviceability

Qualifications

C, C++, Python, GPU systems, Hardware telemetry, Predictive maintenance, Reliability modeling, Soft skills

Required

Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience

Preferred

Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Experience working with large-scale HPC or GPU systems (NVIDIA H100/GB200 or equivalent)
Deep understanding of GPU architecture, high-speed interconnects (NVLink, InfiniBand, RoCE), and large datacenter topologies
Proficiency in hardware telemetry, diagnostics, or failure analysis tools
Experience with exascale-class systems or cloud-scale AI clusters
Familiarity with reliability modeling, machine learning-based anomaly detection, or predictive maintenance
Contributions to large-scale infrastructure operations, supercomputing centers, or AI hardware design

Company

Microsoft

Microsoft is a software corporation that develops, manufactures, licenses, supports, and sells a range of software products and services.

H1B Sponsorship

Microsoft has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (9192)
2024 (9343)
2023 (7677)
2022 (11403)
2021 (7210)
2020 (7852)

Funding

Current Stage
Public Company
Total Funding
$1M
Key Investors
Technology Venture Investors
2022-12-09: Post-IPO Equity
1986-03-13: IPO
1981-09-01: Series Unknown · $1M

Leadership Team

Satya Nadella
Chairman and CEO
Vukani Mngxati
Chief Executive Officer - Microsoft South Africa
Company data provided by Crunchbase