Microsoft · 6 days ago

Service Engineer II

Redmond, WA

Full-time

Hybrid

Mid, Senior Level

$101K/yr - $199K/yr

2+ years exp

Microsoft is seeking a customer-obsessed and AI-curious Service Engineer II to join their Engineering Operations team. This role involves managing live-site incidents, enhancing customer experience across Azure, and collaborating with various teams to ensure service reliability and performance.

Agentic AI, Application Performance Management, Artificial Intelligence (AI), Business Development, DevOps, Information Services, Information Technology, Management Information Systems, Network Security, Software

Growth Opportunities

H1B Sponsor Likely

Responsibilities

Lead and manage high-severity incidents across Azure services, serving as the single point of accountability to ensure rapid detection, triage, resolution, and customer communication

Act as the central authority during live site incidents, driving real-time decision-making and coordination across Engineering, Support, PM, Communications, and Field teams

Contribute to the design of vNext architecture for cloud infrastructure services, based on customer and first-party engagements

Engage in major production triage efforts, working with other teams to identify the root cause of high-impact or complex issues as required; identify product gaps and work with Product teams to close them

Partner closely with software developers, product managers, architects, and infrastructure teams to drive delivery of sustainable, reusable design patterns, ensuring non-functional production support requirements are adopted early in migration and deployment

Promote a customer-first culture by prioritizing availability, reliability, and platform trust in every response

Participate in the on-call rotation

Analyze customer-impacting signals from telemetry, support cases, and feedback to identify root causes, drive incident reviews (RCAs/PIRs), and implement preventative service improvements

Drive continuous improvement of the Azure platform by incorporating learnings from live site events and customer feedback, ensuring improved reliability, observability, and supportability

Collaborate closely with Engineering and Product teams to influence and implement service resiliency enhancements, auto-remediation tools, and customer-centric mitigation strategies

Identify and advocate for customer self-service capabilities, improved documentation, and scalable solutions that empower customers to resolve common issues independently

Design and drive adoption of incident response playbooks, mitigation levers, and operational frameworks aligned to real-world support scenarios and strategic customer needs

Contribute to the design of next-generation architecture for cloud infrastructure services with a focus on reliability and strategic customer support outcomes

Build and maintain cross-functional partnerships, ensuring alignment across engineering, business, and support organizations

Be data-driven and results-focused, using metrics to evaluate incident response effectiveness and platform health

Bring an engineering mindset to operational challenges, balancing agility, scalability, and technical excellence

Exhibit strong cross-team collaboration, engineering mindset, and results-oriented execution under pressure

Qualifications

Azure Core Services, Incident Management, Cloud Architecture Patterns, Monitoring Tools, Automation Languages, ITIL Framework, Service Engineering, Problem Resolution, Communication Skills, Team Leadership

Required

Bachelor's Degree in Computer Science, Information Technology, Mechanical Engineering, Electrical Engineering, Aerospace Engineering, Data Science, Cybersecurity, or related field AND 2+ years technical experience in software engineering, network engineering, service engineering, systems engineering, or industrial controls, OR equivalent experience

2-4+ years of experience in roles such as cloud operations, incident response, SRE, or large-scale systems engineering, preferably on platforms like Azure, AWS, or GCP

Must have Service Engineering experience in a 24x7x365 enterprise environment

Exceptional command-and-control communication skills: able to drive clarity and direction with customers, internal Microsoft stakeholders, and third-party vendors during ambiguity and chaos

Deep understanding of cloud architecture patterns, microservices, and containerization

Demonstrated ability to make decisions quickly, under pressure, and with limited data—without compromising long-term reliability

Familiarity with monitoring and observability tools (e.g., Grafana, Prometheus, Datadog, Splunk, New Relic)

Fluency in one or more automation languages (PowerShell, Python, CLI, etc.)

Understanding of ITIL or other incident management frameworks is a must

Understanding of high availability, disaster recovery, business continuity, and performance tuning

Demonstrates strategic thinking, quantitative and analytical skills, team leadership, and collaboration

Excellent problem resolution, judgment, negotiating and decision-making skills

Desired: strong knowledge of Windows or Linux platforms and developer tools, and the ability to diagnose and debug user code

Effectively manage and prioritize multiple tasks in accordance with high level objectives/projects

Excellent communication skills (written and verbal) in English, especially in high-pressure scenarios

Ability to communicate with a variety of audiences, including high-profile customers, executive management, and engineering teams

Desired: BS/BA in Computer Science, Engineering, Math, or equivalent experience

Preferred

4+ years of demonstrated experience as an Incident Commander or Crisis Manager for critical, high-severity incidents in high-availability, distributed environments

Experience with SRE (Site Reliability Engineering) principles and practices

Exposure to chaos engineering, fault injection, or high availability architecture

AI/ML Experience: [Beginner to Intermediate]

Familiarity with how AI/ML models are integrated into cloud infrastructure and their potential failure modes

Experience using AI-powered tools for incident analysis, log correlation, or predictive alerting

An understanding of the challenges and risks associated with AI/ML systems in a production environment

Certifications: Relevant cloud certifications (e.g., AWS Certified DevOps Engineer, Azure Solutions Architect, GCP Professional Cloud Architect)

Certifications in ITIL, SRE, or other relevant frameworks

Company

Microsoft

Glassdoor: 4.3

Microsoft is a software corporation that develops, manufactures, licenses, supports, and sells a range of software products and services.

Founded in 1975

Redmond, Washington, USA

10001+ employees

https://www.microsoft.com

H1B Sponsorship

Microsoft has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference. (Data powered by the US Department of Labor)

Distribution of Different Job Fields Receiving Sponsorship

Trends of Total Sponsorships

2025 (9192)

2024 (9343)

2023 (7677)

2022 (11403)

2021 (7210)

2020 (7852)

Funding

Current Stage

Public Company

Total Funding

$1M

Key Investors

Technology Venture Investors

2022-12-09 Post-IPO Equity

1986-03-13 IPO

1981-09-01 Series Unknown · $1M

Leadership Team

Satya Nadella

Chairman and CEO

Vukani Mngxati

Chief Executive Officer - Microsoft South Africa

Recent News

The Motley Fool

Nebius Stock Tripled in 2025. Is There More Growth Ahead in 2026?

2026-01-15

Business Wire

Cadence Delivers Enterprise-Level Reliability with Next-Gen Low-Power DRAM for AI Applications Featuring Microsoft RAIDDR ECC Technology

2026-01-15

TechTrendsKE

Microsoft Kenya Country Manager Phyllis Migwi Exits After Three Years

2026-01-14

Company data provided by Crunchbase