Microsoft · 1 month ago
Member of Technical Staff, Hardware Health
Microsoft AI operates one of the world’s most advanced AI training infrastructures and is seeking a Member of Technical Staff, Hardware Health, to ensure these systems deliver sustained reliability, performance, and availability. The role involves designing and developing hardware health monitoring frameworks and collaborating with cross-functional teams to influence hardware design for reliability and efficiency.
Agentic AI, Application Performance Management, Artificial Intelligence (AI), Business Development, DevOps, Information Services, Information Technology, Management Information Systems, Network Security, Software
Responsibilities
Design and develop next-generation hardware health monitoring and diagnostic frameworks for large GPU clusters (NVL16/NVL72/GB200+ scale)
Build predictive analytics pipelines leveraging telemetry, power, and thermal data to anticipate hardware degradation and systemic issues
Collaborate with silicon, firmware, and datacenter engineers to identify root causes and remediate large-scale hardware anomalies
Define system health KPIs (e.g., NIS/RIS, MTBF, failure domain analysis) and integrate them into real-time observability platforms
Lead incident triage for high-impact GPU, network, and cooling issues across distributed clusters
Drive automation in health management to reduce manual intervention to the top 5% of anomalies
Partner with cross-functional teams to influence hardware design for reliability, thermal efficiency, and serviceability
Qualifications
Required
Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Preferred
Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Experience working with large-scale HPC or GPU systems (NVIDIA H100/GB200 or equivalent)
Deep understanding of GPU architecture, high-speed interconnects (NVLink, InfiniBand, RoCE), and large datacenter topologies
Proficiency in hardware telemetry, diagnostics, or failure analysis tools
Experience with exascale-class systems or cloud-scale AI clusters
Familiarity with reliability modeling, machine learning-based anomaly detection, or predictive maintenance
Contributions to large-scale infrastructure operations, supercomputing centers, or AI hardware design
Benefits
Certain roles may be eligible for benefits and other compensation.
Company
Microsoft
Microsoft is a software corporation that develops, manufactures, licenses, supports, and sells a range of software products and services.
H1B Sponsorship
Microsoft has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. The information below is provided for your reference. (Data Powered by US Department of Labor)
Trends of Total Sponsorships
2025 (9192)
2024 (9343)
2023 (7677)
2022 (11403)
2021 (7210)
2020 (7852)
Funding
Current Stage: Public Company
Total Funding: $1M
Key Investors: Technology Venture Investors
2022-12-09: Post-IPO Equity
1986-03-13: IPO
1981-09-01: Series Unknown · $1M
Company data provided by Crunchbase