Microsoft · 4 days ago
MTS - Site Reliability Engineer
Microsoft is a leading technology company focused on innovation and accessibility in artificial intelligence. They are seeking an experienced Site Reliability Engineer (SRE) to join their infrastructure team, responsible for maintaining the reliability and efficiency of large-scale distributed AI systems.
Agentic AI, Application Performance Management, Artificial Intelligence (AI), Business Development, DevOps, Information Services, Information Technology, Management Information Systems, Network Security, Software
Responsibilities
Ensure uptime, resiliency, and fault tolerance of AI model training and inference systems
Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into model serving pipelines and infrastructure
Analyze system performance and scalability, optimize resource utilization (compute, GPU clusters, storage, networking)
Build automation for deployments, incident response, scaling, and failover in hybrid cloud/on-prem CPU+GPU environments
Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements
Ensure data privacy, compliance, and secure operations across model training and serving environments
Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows
Qualifications
Required
4+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles
Preferred
Strong proficiency in Kubernetes, Docker, and container orchestration
Knowledge of CI/CD pipelines for inference and ML model deployment
Hands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code
Expertise in monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.)
Strong programming/scripting skills in Python, Go, or Bash
Solid knowledge of distributed systems, networking, and storage
Experience running large-scale GPU clusters for ML/AI workloads (preferred)
Familiarity with ML training/inference pipelines
Experience with high-performance computing (HPC) and workload schedulers (Kubernetes operators)
Background in capacity planning & cost optimization for GPU-heavy environments
Work on cutting-edge infrastructure that powers the future of Generative AI
Collaborate with world-class researchers and engineers
Impact millions of users through reliable and responsible AI deployments
Benefits
Competitive compensation
Equity options
Comprehensive benefits
Company
Microsoft
Microsoft is a software corporation that develops, manufactures, licenses, supports, and sells a range of software products and services.
H1B Sponsorship
Microsoft has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship. The highlighted segment represents job fields similar to this job.]
Trends of Total Sponsorships
2025 (9192)
2024 (9343)
2023 (7677)
2022 (11403)
2021 (7210)
2020 (7852)
Funding
Current Stage: Public Company
Total Funding: $1M
Key Investors: Technology Venture Investors
2022-12-09: Post-IPO Equity
1986-03-13: IPO
1981-09-01: Series Unknown · $1M
Company data provided by crunchbase