Super Micro Computer Spain, S.L. · 3 months ago
Sr. Reliability Engineer (26861)
Supermicro is a Top Tier provider of advanced server, storage, and networking solutions for various customers worldwide. The Cloud Reliability Engineer will be responsible for deploying, scaling, and ensuring the high availability and performance of AI cloud platforms, while bridging Dev and Ops through automation and best practices.
Data Storage · Internet of Things · Network Hardware · Software
Responsibilities
Cloud Infra Automation: Design and provision cloud infrastructure using Infrastructure as Code (Terraform, Ansible, or Helm) on bare metal or cloud platforms. Develop custom automation and tooling in Python or Go to extend deployment workflows and streamline operations
Platform Reliability: Deploy, scale, maintain, and optimize uptime for AI cloud services including GPU clusters, Kubernetes (K8s), and storage systems (e.g., Ceph, BeeGFS, or Weka). Understand the tools required to benchmark and assure consistent application performance
Monitoring & Alerting: Implement observability tools (e.g., Prometheus, Grafana, ELK, Loki, Fluentd) to monitor system health and alert on anomalies or performance degradation
Capacity Planning: Analyze usage trends and forecast infrastructure needs to support AI workloads and large-scale model training/inference
Incident Management: Lead root cause analysis and resolution for system outages or degraded performance. Define and maintain service level objectives (SLOs), indicators (SLIs), and agreements (SLAs) aligned with uptime and performance goals
CI/CD Integration: Collaborate with DevOps and MLOps teams to ensure reliable delivery pipelines using GitLab CI/CD, ArgoCD, or similar tools
Security & Compliance: Harden Linux systems, manage TLS certificates, and enforce secure access controls via Role-Based Access Control (RBAC), LDAP-integrated SSO, and network segmentation policies
Documentation & Playbooks: Maintain clear, version-controlled documentation, including architecture diagrams, runbooks, and incident response playbooks to support cross-team knowledge transfer and rapid onboarding
Qualifications
Required
Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience), plus 8 years of experience in the areas below
Proficiency in Linux (Ubuntu, RHEL/CentOS), containers (Docker, Podman), and orchestration (Kubernetes)
Experience managing GPU compute clusters (NVIDIA / CUDA, AMD / ROCm)
Hands-on experience with observability tools (Prometheus, Grafana, Loki, ELK, etc.)
Strong scripting and coding skills (Bash, Python, or Go)
Exposure to secure multi-tenant environments and zero trust architectures
Familiarity with network protocols (DNS, DHCP, BGP, RoCEv2) and with InfiniBand or high-throughput Ethernet fabrics
Excellent collaboration and communication skills for cross-team, partner, and customer initiatives
Preferred
Understanding of AI/ML reference architectures and experience with workflows, MLflow, or Kubeflow
Familiarity with storage backends optimized for AI (CephFS, BeeGFS, WekaFS)
Prior experience in bare-metal provisioning via PXE, Ironic, or Foreman
Understanding of NVIDIA GPU telemetry and NCCL testing for performance benchmarking
Familiarity with ITIL processes or structured change management in production systems is a plus
Certifications: CKA, CKAD, Linux+, or related credentials
Benefits
Comprehensive benefits package
Participation in bonus and equity award programs
Company
Super Micro Computer Spain, S.L.
Super Micro Computer Inc., founded in 1993 in California, USA, is a leading manufacturer of motherboards, chassis, and high-performance servers.
Funding
Current Stage
Early Stage
Company data provided by crunchbase