Senior / Staff Site Reliability Engineer jobs in United States

Fluidstack · 12 hours ago

Senior / Staff Site Reliability Engineer

Fluidstack is building the infrastructure for abundant intelligence, partnering with top AI labs and enterprises. The Senior / Staff Site Reliability Engineer will ensure the reliability and performance of the global GPU cloud, collaborating with various teams to tackle complex production issues and improve platform stability.

Cloud Computing, Cloud Storage, Generative AI, GPU, Information Technology, Machine Learning, Private Cloud, Software
H1B Sponsor Likely

Responsibilities

Deploying clusters of 1,000+ GPUs using custom-written playbooks; modifying these tools as necessary to provide the perfect solution for a customer
Validating correctness and performance of underlying compute, storage, and networking infrastructure, and working with providers to optimize these subsystems
Migrating petabytes of data from public cloud platforms to local storage, as quickly and cost-effectively as possible
Debugging issues anywhere in the stack, from “this server’s fan is blocked by a plastic bag” to “optimizing S3 dataloaders from buckets in different regions”
Building internal tooling to decrease deployment time and increase cluster reliability, including automation where the customer benefits clearly outweigh the implementation overhead

Qualifications

Kubernetes, Go, Python, Ansible, Terraform, Sysadmin, HPC engineering, Communication skills, Problem-solving, Team collaboration

Required

2+ years of SRE, DevOps, Sysadmin, and/or HPC engineering experience
Great verbal and written communication skills in English
Experience deploying and operating Kubernetes and/or SLURM clusters
Experience writing Go, Python, and Bash
Experience using Ansible, Terraform, and other automation or IaC tools
Strong engineering background, preferably in Computer Science, Software Engineering, Math, Computer Engineering, or similar fields

Preferred

You have built and operated an AI workload at 1,000+ GPU scale
You have built multi-tenant, hyperscale Kubernetes-based services
You have physically deployed infrastructure in a datacenter, managed bare metal hardware via MAAS or NetBox, etc.
You have deployed and managed multi-tenant InfiniBand or RoCE networks
You have deployed and managed petabyte-scale all-flash storage systems, including DDN, VAST, and/or Weka; or Ceph, Lustre, or similar open-source tools

Benefits

Retirement or pension plan, in line with local norms.
Health, dental, and vision insurance.
Generous PTO policy, in line with local norms.

Company

Fluidstack

Fluidstack is an AI cloud platform for frontier labs and startups.

H1B Sponsorship

Fluidstack has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data from the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart not shown)
Trends of Total Sponsorships
2025 (1)
2024 (2)

Funding

Current Stage: Growth Stage
Total Funding: Unknown
Key Investors: Seedcamp
Funding Rounds:
2025-06-01: Undisclosed
2024-10-01: Private Equity
2018-02-01: Pre Seed

Leadership Team

Gary Wu
CEO, Co-Founder
Company data provided by Crunchbase