Alibaba Cloud · 2 days ago
Cloud Infrastructure – Site Reliability Engineer (SRE) – Sunnyvale
Alibaba Cloud builds messaging products and is seeking a Site Reliability Engineer to oversee the stability and performance of its cloud middleware systems. The role involves managing the lifecycle of containerized middleware on Kubernetes, leading incident response, and developing automation tools to improve operational efficiency.
Cloud Data Services · Cloud Management · Data Center · Data Management · Foundational AI · Software
Responsibilities
Oversee stability maintenance, performance tuning, and high-availability architecture design for cloud middleware, including messaging middleware (Kafka/RocketMQ)
Manage the containerized middleware lifecycle on Kubernetes clusters: implement deployments, auto-scaling, version upgrades, and resource optimization in K8s environments
Lead the troubleshooting of middleware-related incidents (e.g., message backlog, service registration failures) through log analysis, distributed tracing, and monitoring systems
Develop diagnostic tools using Java/Go to resolve production issues, performance bottlenecks, and compatibility challenges
Build Python/Go/Shell automation tools to standardize middleware deployment, monitoring, and disaster recovery workflows
Implement chaos engineering experiments, capacity planning strategies, and failover mechanisms to enhance system resilience
Qualifications
Required
Over 2 years of experience in distributed systems reliability engineering
Familiar with high-availability architecture design
Proficient in at least one of Python, Go, or Java
Experience with cluster management, message reliability assurance, and performance optimization for Kafka/RocketMQ
Hands-on experience deploying middleware on Kubernetes
Ability to convert operations experience into automated solutions
Familiarity with various message middleware, e.g., Kafka and RocketMQ
Strong scripting skills in Shell/Python
Experience with Infrastructure as Code (IaC) tools (Terraform preferred)
Preferred
Familiar with core SRE practices (incident review, error budgeting, chaos engineering)
Experienced in building automated risk control systems
Hands-on experience deploying middleware on Kubernetes (Helm/Operator preferred)
Company
Alibaba Cloud
Alibaba Cloud develops cloud computing and data management services. It is a sub-organization of Alibaba Group.
H1B Sponsorship
Alibaba Cloud has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart not shown)
Trends of Total Sponsorships
2025 (18)
2024 (14)
2023 (2)
2022 (1)
Funding
Current Stage: Late Stage
Total Funding: $1.2B
Key Investors: Alibaba Group
2015-07-29 · Series B · $1B
2012-09-20 · Series A · $200M
Company data provided by Crunchbase