Alibaba Cloud · 4 hours ago
Site Reliability Engineer
Alibaba Cloud is committed to delivering a cutting-edge MaaS platform and toolkits for application development through technological innovation. They are seeking a passionate and technically skilled Site Reliability Engineer (SRE) to build and maintain a highly available, high-performance model service platform, focusing on incident response, troubleshooting, and automation systems.
Responsibilities
Oversee the deployment, operation, maintenance, and continuous improvement of the standalone website and platform, including its initial construction and subsequent operational changes
Oversee monitoring and alerting for our platform and system applications, rapidly diagnosing and resolving network, service, and hardware-level failures to meet SLA targets
Design and optimize monitoring metrics, log collection, and alerting strategies to enhance system observability
Participate in the emergency response and handling of online incidents, conduct root cause analysis (RCA), and drive long-term solutions to prevent recurrence
Investigate and resolve customer-reported issues related to the QoS of the API service (e.g., latency, performance, optimization), collaborating with development teams to identify flaws in application clusters, edge networks, or infrastructure
Develop tools and scripts (Python/Go) to automate deployment, scaling, fault recovery, and other operational workflows
Build automated diagnostic toolchains to accelerate issue resolution and improve customer satisfaction
Qualifications
Required
3+ years of experience in SRE, DevOps, or backend development, with expertise in distributed system operations. Experience in cloud computing, AI infrastructure, or Alibaba Cloud is a plus
Programming experience in at least one modern language such as Python, Golang, Java, or C++
Strong ability to work under pressure, manage critical incidents, and participate in an on-call rotation
Fluency in both Chinese and English for daily communication
Preferred
Familiarity with MaaS or related technologies
Deep knowledge of Linux systems, network protocols (TCP/HTTP), and databases, plus a deep understanding of cloud-native architecture design
Experience operating and maintaining large-scale container and Kubernetes clusters, with strong professional knowledge of cloud-native components (e.g., Prometheus, Istio, Calico)
Extensive experience in building large-scale monitoring systems and utilizing them for in-depth analysis and operations
Benefits
Medical, dental, and vision insurance
A 401(k) plan
Basic life insurance
Wellbeing benefits like FSA
Up to 12 paid holidays
Accrue up to 15 paid vacation days for this position
Receive up to 72 hours paid sick time (front-loaded) per calendar year
Company
Alibaba Cloud
Alibaba Cloud develops cloud computing and data management services. It is a sub-organization of Alibaba Group.
H1B Sponsorship
Alibaba Cloud has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart not shown; the highlighted segment represents job fields similar to this job)
Trends of Total Sponsorships
2025 (18)
2024 (14)
2023 (2)
2022 (1)
Funding
Current Stage: Late Stage
Total Funding: $1.2B
Key Investors: Alibaba Group
2015-07-29 · Series B · $1B
2012-09-20 · Series A · $200M
Company data provided by Crunchbase