57 applicants

Microsoft · 6 days ago

Site Reliability Engineer II

Aliso Viejo, CA

Full-time

Hybrid

Mid Level

$98K/yr - $193K/yr

4+ years exp

Data Management · Developer Tools

H1B Sponsorship

Actively Hiring

Growth Opportunities

Responsibilities

Demonstrates expertise in distributed systems design, interactions between cloud technology layers and components, common dependencies at scale, and the code that defines infrastructures. Can identify and recommend optimal configurations of cloud technology solutions and modify the code base that defines systems or cloud technologies to improve the reliability and operability of supported products with minimal guidance from other engineers.

Develops an understanding of the code, features, and operations of specific products at scale as required to contribute to incremental improvements in product availability, reliability, efficiency, observability, and/or performance; participates in on-boarding, code/design reviews, and regular meetings with the engineering teams that develop and/or manage those products.

Researches and maintains an awareness of industry trends, advances in distributed systems and cloud technologies, new tools, and/or processes for maintaining and improving product availability, reliability, efficiency, observability, and/or performance. Contributes to the implementation of new solutions within their team by identifying ways they can be applied to solve persistent problems.

Leverages technical expertise in large-scale distributed systems and specific products, as well as objective insights drawn from analyses of production telemetry data, to suggest changes or add-ons to product features or code to improve the availability, reliability, efficiency, observability, and performance of product components or features supported by their team.

Designing and implementing Service Reliability services, tooling and processes.

Generating software specifications, proofs of concept, and prototype solutions given high-level feature requirements.

Using data and telemetry to improve feature work and propose feature improvements to existing products.

Develops and tests basic changes to optimize code and improve the observability, reliability and operability of a defined range of platform, system, or product components or features with direction from other engineers.

Independently develops code or scripts that automate the performance of repetitive and easily scalable operations processes (e.g., monitoring, alerting, deploying products and updates) across components and features of products operating at scale.

Leverages technical expertise and telemetry analysis across a range of components and/or features to identify patterns and opportunities to implement configuration and data changes for one or more platforms, systems, or products in production using code, tooling, and automation.

Identifies opportunities to leverage existing tools and automation to enable product engineering teams to increase the velocity at which they can reliably and safely implement changes in production; monitors the effects of changes across multiple components or features within a single platform or system.

Designs, develops, and maintains telemetry pipelines and monitoring tools that detail operations metrics (e.g., availability, reliability, performance, efficiency) of product components and features operating at scale. Independently performs analyses using existing tools and/or models to identify insights and shares them with product engineering teams to directly contribute to improvements in product development and/or operations; monitors the impact of changes on operations metrics (e.g., Time-to-X).

Independently uses existing tools and/or models to troubleshoot problems or flaws affecting the availability, reliability, performance, and/or efficiency of components and features; proposes solutions that will resolve and prevent recurring issues and brings them to the attention of their Site Reliability Engineering (SRE) and/or product engineering teams.

Responds to incidents during regular on-call rotations by identifying the level of impact, troubleshooting issues, and deploying appropriate fixes to resolve root cause(s); alerts product teams and owners to major customer-impacting issues and escalates resolution of highly impactful issues affecting multiple components or features to other engineers or engineering teams as needed. Shares details related to incidents and their resolution through post-mortem reports and during regular review meetings.

Develops alerts and instrumentation across components and features to monitor product capacity and resource demands and analyze telemetry data using existing capacity planning models; draws insights from analyses of capacity and resource data to optimize component and feature code to manage resources and capacity across a limited range of use conditions and system parameters.

Utilizes insights from performance and resource monitoring tools to identify whether there is a need to optimize the efficiency of component and feature code, or if changes to compute resources are required; models the predicted effect of changes to code and/or compute resources across components or features to document the efficacy of proposed solutions.

Shares insights and practices that can be applied to improve development and operations of system, platform, or product components and features by participating in code/design reviews, incident drills and debriefs, and regular meetings, as well as interactions with more experienced SREs and members of product engineering teams.

Embody our culture and values

Qualification

Software Engineering · Network Engineering · Systems Administration · Microsoft Cloud Background Check · Service Reliability · Cloud Services · Flow Engineering · LLM Systems · Reporting Dashboards · Power BI

Required

4+ years technical experience in software engineering, network engineering, or systems administration

Bachelor's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration

Master's Degree in Computer Science, Information Technology, or related field

Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check

Preferred

Experience in developing highly scalable Service Reliability services and extensive experience using cloud online services

Experience in prompt flow engineering and LLM systems

Experience in developing reporting dashboards such as Power BI

Benefits

Health Insurance

Company

Microsoft

Glassdoor rating: 4.3

Microsoft is a software corporation that develops, manufactures, licenses, supports, and sells a range of software products and services.

Founded in 1975

Redmond, Washington, USA

10,001+ employees

https://www.microsoft.com

H1B Sponsorship

Microsoft has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference. (Data powered by the US Department of Labor)

Trends of Total Sponsorships

2023 (5862)

2022 (11005)

2021 (8174)

2020 (6856)