Cloud Infrastructure – Site Reliability Engineer (SRE) jobs in United States

Alibaba Cloud · 2 weeks ago

Cloud Infrastructure – Site Reliability Engineer (SRE)

Alibaba Cloud is responsible for creating a stable and user-friendly messaging platform. The Site Reliability Engineer (SRE) will oversee the stability maintenance and performance tuning of cloud middleware, manage the lifecycle of containerized middleware, and lead incident response efforts.

Cloud Data Services · Cloud Management · Data Center · Data Management · Foundational AI · Software
H1B Sponsor Likely
Hiring Manager: Moyin Hu

Responsibilities

Oversee stability maintenance, performance tuning, and high-availability architecture design for cloud middleware, including messaging middleware (Kafka/RocketMQ)
Manage the containerized middleware lifecycle on Kubernetes clusters: implement deployments, auto-scaling, version upgrades, and resource optimization in K8s environments
Lead the troubleshooting of middleware-related incidents (e.g., message backlog, service registration failures) through log analysis, distributed tracing, and monitoring systems
Develop diagnostic tools using Java/Go to resolve production issues, performance bottlenecks, and compatibility challenges
Build Python/Go/Shell automation tools to standardize middleware deployment, monitoring, and disaster recovery workflows
Implement chaos engineering experiments, capacity planning strategies, and failover mechanisms to enhance system resilience

Qualifications

Kubernetes · Python · Go · Java · Kafka · RocketMQ · Terraform · Shell scripting · Incident response · Chaos engineering

Required

Over 2 years of experience in distributed systems reliability engineering
Familiar with high-availability architecture design
Proficient in at least one of Python, Go, or Java
Experience with cluster management, message reliability assurance, and performance optimization for Kafka/RocketMQ
Hands-on experience deploying middleware on Kubernetes
Ability to convert operations experience into automated solutions
Familiarity with multiple messaging middleware systems, e.g., Kafka and RocketMQ
Strong scripting skills in Shell/Python
Experience with Infrastructure as Code (IaC) tools (Terraform preferred)

Preferred

Familiar with core SRE practices (incident review, error budgeting, chaos engineering)
Experienced in building automated risk control systems
Hands-on experience deploying middleware on Kubernetes (Helm/Operator preferred)

Company

Alibaba Cloud

Alibaba Cloud develops cloud computing and data management services. It is a sub-organization of Alibaba Group.

H1B Sponsorship

Alibaba Cloud has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data from the US Department of Labor.)
Trends of Total Sponsorships
2025 (18)
2024 (14)
2023 (2)
2022 (1)

Funding

Current Stage
Late Stage
Total Funding
$1.2B
Key Investors
Alibaba Group
2015-07-29 · Series B · $1B
2012-09-20 · Series A · $200M
Company data provided by Crunchbase