Fluidstack · 12 hours ago

Senior / Staff Site Reliability Engineer

San Francisco Bay Area

Full-time

Onsite

Senior Level, Lead/Staff

$175K/yr - $320K/yr

2+ years exp

Fluidstack is building the infrastructure for abundant intelligence, partnering with top AI labs and enterprises. The Senior / Staff Site Reliability Engineer will ensure the reliability and performance of the global GPU cloud, collaborating with various teams to tackle complex production issues and improve platform stability.

Cloud Computing, Cloud Storage, Generative AI, GPU, Information Technology, Machine Learning, Private Cloud, Software

H1B Sponsor Likely

Responsibilities

Deploying clusters of 1,000+ GPUs using custom-written playbooks, and modifying these tools as necessary to provide the right solution for each customer

Validating correctness and performance of underlying compute, storage, and networking infrastructure, and working with providers to optimize these subsystems

Migrating petabytes of data from public cloud platforms to local storage as quickly and cost-effectively as possible

Debugging issues anywhere in the stack, from “this server’s fan is blocked by a plastic bag” to “optimizing S3 dataloaders from buckets in different regions”

Building internal tooling to decrease deployment time and increase cluster reliability, including automation where the customer benefits clearly outweigh the implementation overhead

Qualifications

Kubernetes, Go, Python, Ansible, Terraform, Sysadmin, HPC engineering, Communication skills, Problem-solving, Team collaboration

Required

2+ years of SRE, DevOps, Sysadmin, and/or HPC engineering experience

Great verbal and written communication skills in English

Experience deploying and operating Kubernetes and/or SLURM clusters

Experience writing Go, Python, Bash

Experience using Ansible, Terraform, and other automation or IaC tools

Strong engineering background, preferably in Computer Science, Software Engineering, Math, Computer Engineering, or similar fields

Preferred

You have built and operated an AI workload at 1,000+ GPU scale

You have built multi-tenant, hyperscale Kubernetes-based services

You have physically deployed infrastructure in a datacenter and managed bare-metal hardware via MAAS, NetBox, or similar tools

You have deployed and managed multi-tenant InfiniBand or RoCE networks

You have deployed and managed petabyte-scale all-flash storage systems, such as DDN, VAST, and/or Weka, or open-source tools such as Ceph, Lustre, or similar

Benefits

Retirement or pension plan, in line with local norms.

Health, dental, and vision insurance.

Generous PTO policy, in line with local norms.

Company

Fluidstack

Fluidstack is an AI cloud platform for frontier labs and startups.

Founded in 2017

London, England, GBR

51-200 employees

https://www.fluidstack.io

H1B Sponsorship

Fluidstack has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data from the US Department of Labor)

Trends of Total Sponsorships

2025 (1)

2024 (2)

Funding

Current Stage

Growth Stage

Total Funding

unknown

Key Investors

Seedcamp

2025-06-01 · Undisclosed

2024-10-01 · Private Equity

2018-02-01 · Pre Seed

Leadership Team

Gary Wu

CEO, Co-Founder

Recent News

Cointelegraph

Riot Platforms offloads $161M in Bitcoin in December amid strategy shift

2026-01-07

Business Insider

The investor who blocked a $9 billion AI deal expects that bet to soon pay off

2026-01-06

TradingView

Hut 8 finishes 2025 strong despite difficult year for Bitcoin miners

2026-01-04

Company data provided by Crunchbase