Alibaba Cloud · 1 day ago
Senior Network Engineer
Alibaba Cloud is a leading cloud computing and data intelligence company. They are seeking a Senior Network Engineer to develop and implement stability solutions, establish monitoring mechanisms, and enhance operational efficiency within their operations and maintenance platforms.
Responsibilities
Maintain a system-wide perspective on stability, and develop and implement stability solutions
Establish and continually optimize monitoring mechanisms for application operations and maintenance; develop and maintain corresponding monitoring platforms/tools
Establish and continuously optimize alerting mechanisms for application operations and maintenance, ensuring that faults can be quickly discovered, located, and addressed
Quickly analyze, diagnose, and locate problems, and collaborate with relevant personnel to resolve them; establish and improve rapid service recovery mechanisms to reduce business impact; keep business operations stable by identifying and eliminating potential risks through stability governance projects and architectural optimizations
Design, develop, and maintain reliable operations and maintenance platforms and tools, such as inspection systems, water-level (resource utilization) systems, delivery systems, and cost management systems, to address the delivery, performance, stability, and cost issues encountered by production systems, ensuring business availability and improving performance and efficiency
Responsible for data-driven analysis of operations and maintenance quality; analyze and study daily operations and maintenance metrics, issues, and risks to establish models and provide optimization suggestions for operations and maintenance
Establish operations and maintenance process specifications and standards (such as change standards, protection plans, and cloud product configuration standards) so that operations and maintenance work is consistent and standardized, thereby enhancing stability
Develop and implement emergency response specifications and standards for application operations and maintenance faults
Develop and implement alarm handling specifications and standards for application operations and maintenance, as well as Service Level Agreements (SLA)
Based on business requirements, handle budget preparation, capacity planning, and resource readiness, and coordinate with development teams to forecast and estimate consumption of resources such as storage and compute
Analyze business demands and, while ensuring stability, take water levels, specifications, and billing rules into account; verify that resource estimates in technical solutions are reasonable, and collaborate with development teams to reduce resource costs
Provide 24/7 emergency response, handle daily monitoring alerts and emergencies, and continuously identify and rectify existing issues
Responsible for operations and maintenance support during major events (such as National Day, Spring Festival, New Year's Day, and significant activities)
Develop and drill emergency plans, respond to emergencies, and handle faults
Establish a problem/fault record repository, conduct targeted analysis of the repository, and enhance and optimize the emergency plan repository and standard process repository
Responsible for system upgrades, such as kernel upgrades, architecture upgrades, cross-data-center service migration, and containerization
Responsible for the design and implementation of disaster recovery architecture, such as local disaster recovery and geographically distributed multi-active deployments
Qualifications
Required
Fluent in Chinese, with the ability to clearly articulate technical issues and solutions
Over 3 years of experience in operations and maintenance in related fields such as applications, networks, and containerization
Working proficiency in architecture design, performance optimization, and stability optimization
Able to apply intelligent and automated operations and maintenance platforms and tools, design and use complex workflows and daily operational templates, and quickly identify, locate, and resolve relatively complex faults to improve operational efficiency
Able to summarize and consolidate issues discovered in daily operations and maintenance into operational experience, and apply this knowledge to enhance capabilities within the operations and maintenance platform
Proficient in protocols such as TCP/IP, DNS, and HTTP, with the ability to perform preliminary analysis of network traffic and troubleshoot network issues
Familiar with at least one cloud service platform (such as AWS, Alibaba Cloud, or Azure) and its mainstream products (such as Flink, MaxCompute, Log Service, RDS, and Redis), and able to perform initial troubleshooting and resolve basic issues arising from the use of those products
Preferred
Familiarity with DPDK (Data Plane Development Kit) and experience in enhancing network processing performance
Some software development ability, to advance the automation of operations and maintenance
Strong business understanding, with the ability to independently handle complex issues, supported by real case examples
Sound independent judgment on business issues, with the ability to skillfully use processes and tools to identify risks and formulate solutions
Some influence within the business line, and the ability to earn recognition from neighboring teams
Company
Alibaba Cloud
Alibaba Cloud develops cloud computing and data management services. It is a sub-organization of Alibaba Group.
H1B Sponsorship
Alibaba Cloud has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart)
Trends of Total Sponsorships
2025 (18)
2024 (14)
2023 (2)
2022 (1)
Funding
Current Stage: Late Stage
Total Funding: $1.2B
Key Investors: Alibaba Group
2015-07-29: Series B, $1B
2012-09-20: Series A, $200M
Company data provided by Crunchbase