Supermicro · 2 days ago

Sr. Reliability Engineer (26861)

San Jose, CA

Full-time

Onsite

Senior Level, Lead/Staff

$145K/yr - $165K/yr

8+ years exp

Supermicro is a top-tier provider of advanced server, storage, and networking solutions. As a Cloud Reliability Engineer, you will help deploy, scale, and ensure the high availability and performance of AI cloud platforms, while bridging Dev and Ops through automation and applying SRE best practices.

Artificial Intelligence (AI) · Cloud Computing · Cloud Infrastructure · Embedded Systems · Manufacturing · Software

H1B Sponsor Likely

Responsibilities

Cloud Infra Automation: Design and provision cloud infrastructure using Infrastructure as Code (Terraform, Ansible, or Helm) on bare metal or cloud platforms. Develop custom automation and tooling in Python or Go to extend deployment workflows and streamline operations

Platform Reliability: Deploy, scale, maintain, and optimize uptime for AI cloud services including GPU clusters, Kubernetes (K8s), and storage systems (e.g., Ceph, BeeGFS, or Weka). Understand the tools required to benchmark and assure consistent application performance

Monitoring & Alerting: Implement observability tools (e.g., Prometheus, Grafana, ELK, Loki, Fluentd) to monitor system health and alert on anomalies or performance degradation

Capacity Planning: Analyze usage trends and forecast infrastructure needs to support AI workloads and large-scale model training/inference

Incident Management: Lead root cause analysis and resolution for system outages or degraded performance. Define and maintain service level objectives (SLOs), indicators (SLIs), and agreements (SLAs) aligned with uptime and performance goals

CI/CD Integration: Collaborate with DevOps and MLOps teams to ensure reliable delivery pipelines using GitLab CI/CD, ArgoCD, or similar tools

Security & Compliance: Harden Linux systems, manage TLS certificates, and enforce secure access controls via Role-Based Access Control (RBAC), LDAP-integrated SSO, TLS, and network segmentation policies

Documentation & Playbooks: Maintain clear, version-controlled documentation, including architecture diagrams, runbooks, and incident response playbooks to support cross-team knowledge transfer and rapid onboarding

Qualifications

Linux proficiency · Kubernetes · Cloud Infra Automation · Observability tools · GPU compute clusters · Scripting skills · Network protocols · AI/ML architectures · ITIL processes · Certifications · Collaboration skills · Communication skills

Required

Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience), plus 8 years of experience in the areas below

Proficiency in Linux (Ubuntu, RHEL/CentOS), containers (Docker, Podman), and orchestration (Kubernetes)

Experience managing GPU compute clusters (NVIDIA / CUDA, AMD / ROCm)

Hands-on experience with observability tools (Prometheus, Grafana, Loki, ELK, etc.)

Strong scripting and coding skills (Bash, Python, or Go)

Exposure to secure multi-tenant environments and zero trust architectures

Familiarity with network protocols (DNS, DHCP, BGP, RoCEv2) and with InfiniBand or high-throughput Ethernet fabrics

Excellent collaboration and communication skills for cross-team, partner, and customer initiatives

Preferred

Understanding of AI/ML reference architectures and experience with workflows, MLflow, or Kubeflow

Familiarity with storage backends optimized for AI (CephFS, BeeGFS, WekaFS)

Prior experience in bare-metal provisioning via PXE, Ironic, or Foreman

Understanding of NVIDIA GPU telemetry and NCCL testing for performance benchmarking

Familiarity with ITIL processes or structured change management in production systems

Certifications: CKA, CKAD, Linux+, or related credentials

Benefits

Comprehensive benefits package

Participation in bonus and equity award programs

Company

Supermicro

Glassdoor: 2.9

Supermicro is a global leader in high-performance, high-efficiency server technology and innovation.

Founded in 1993

San Jose, California, USA

5001-10000 employees

http://www.supermicro.com

H1B Sponsorship

Supermicro has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)

Distribution of Different Job Fields Receiving Sponsorship (chart not shown)

Trends of Total Sponsorships

2025 (35)

2024 (33)

2023 (27)

2022 (29)

2021 (30)

2020 (42)

Funding

Current Stage

Public Company

Total Funding

$4.5B

2025-06-24 · Post-IPO Debt · $2.3B

2025-02-11 · Post-IPO Debt · $700M

2024-02-23 · Post-IPO Debt · $1.5B

Leadership Team

Matt Thauberger

Senior Vice President, Strategy and Business Development

Somik Behera

General Manager, Cloud, Cluster Mgmt, Datacenter & AI Software Products

Recent News

MarketScreener

Supermicro Unveils High-Density, Liquid-Cooled and Air-Cooled 6U SuperBlade® Powered by Intel® Xeon® 6900 Series Processors for Maximum Performance and Efficiency

2026-01-03

Financial IT

Supermicro Unveils High-Density, Liquid-Cooled And Air-Cooled 6U SuperBlade® Powered By Intel® Xeon® 6900 Series Processors for Maximum Performance And Efficiency

2026-01-03

Benzinga.com

Why Is Super Micro Stock Gaining Friday?

2026-01-03

Company data provided by Crunchbase