Site Reliability Engineer jobs in United States

Alibaba Cloud · 4 hours ago

Site Reliability Engineer

Alibaba Cloud is committed to delivering a cutting-edge MaaS platform and toolkits for application development through technological innovation. They are seeking a passionate and technically skilled Site Reliability Engineer (SRE) to build and maintain a highly available, high-performance model service platform, focusing on incident response, troubleshooting, and automation systems.

Cloud Data Services, Cloud Management, Data Center, Data Management, Foundational AI, Software
H1B Sponsor Likely
Hiring Manager
Moyin Hu

Responsibilities

Oversee the deployment, operation, maintenance, and continuous improvement of the standalone website and platform, including its initial construction and subsequent operational changes
Oversee monitoring and alerting for our platform and system applications, rapidly diagnosing and resolving network, service, and hardware-level failures to meet SLA targets
Design and optimize monitoring metrics, log collection, and alerting strategies to enhance system observability
Participate in the emergency response and handling of online incidents, conduct root cause analysis (RCA), and drive long-term solutions to prevent recurrence
Investigate and resolve customer-reported issues related to API service QoS (e.g., latency, performance, optimization), collaborating with development teams to identify flaws in application clusters, edge networks, or infrastructure
Develop tools and scripts (Python/Go) to automate deployment, scaling, fault recovery, and other operational workflows
Build automated diagnostic toolchains to accelerate issue resolution and improve customer satisfaction

Qualifications

Site Reliability Engineering, Cloud Computing, Python, Linux Systems, Golang, Kubernetes, Network Protocols, Monitoring Systems, Incident Management, Customer Issue Resolution, Fluency in Chinese, Fluency in English

Required

3+ years of experience in SRE, DevOps, or backend development, with expertise in distributed system operations. Experience with cloud computing, AI infrastructure, or Alibaba Cloud is a plus
Programming experience in at least one modern language such as Python, Golang, Java, or C++
Strong ability to work under pressure, manage critical incidents, and participate in an on-call rotation
Fluency in both Chinese and English for daily communication

Preferred

Familiarity with MaaS or related technologies
Deep knowledge of Linux systems, network protocols (TCP/HTTP), and databases, along with a deep understanding of cloud-native architecture design
Experience operating and maintaining large-scale container and Kubernetes clusters, with strong knowledge of cloud-native components (e.g., Prometheus, Istio, Calico)
Extensive experience in building large-scale monitoring systems and utilizing them for in-depth analysis and operations

Benefits

Medical, dental, and vision insurance
A 401(k) plan
Basic life insurance
Wellbeing benefits like FSA
Up to 12 paid holidays
Accrue up to 15 paid vacation days for this position
Receive up to 72 hours paid sick time (front-loaded) per calendar year

Company

Alibaba Cloud

Alibaba Cloud develops cloud computing and data management services. It is a sub-organization of Alibaba Group.

H1B Sponsorship

Alibaba Cloud has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship
Trends of Total Sponsorships
2025 (18)
2024 (14)
2023 (2)
2022 (1)

Funding

Current Stage: Late Stage
Total Funding: $1.2B
Key Investors: Alibaba Group
2015-07-29 · Series B · $1B
2012-09-20 · Series A · $200M
Company data provided by Crunchbase