Senior / Principal Site Reliability Engineer jobs in United States

DataCrunch · 1 month ago

Senior / Principal Site Reliability Engineer

DataCrunch is building a European AI cloud that provides low-cost access to intelligence powered by renewable energy. They are seeking a Senior or Principal Site Reliability Engineer to work closely with European teams in scaling their cloud infrastructure and to set the standard for operational excellence as their first U.S. hire.

Artificial Intelligence (AI) · Information Technology · Machine Learning · Software

Responsibilities

Ensure the reliability, scalability, and performance of HPC and cloud systems
Build and maintain automation, observability, and monitoring frameworks for compute clusters
Collaborate with ML, data, and infrastructure teams to deliver high-availability systems
Develop and enhance CI/CD pipelines, deployment workflows, and on-call processes
Participate in architecture design and long-term infrastructure strategy discussions
Help establish local infrastructure and contribute to the setup of our future San Francisco office
Play a key role in recruiting and mentoring as our U.S. team grows

Qualifications

Site Reliability Engineering · High-Performance Computing · Cloud Platforms · Linux Expertise · Scripting · Automation · Infrastructure as Code · Networking Knowledge · Kubernetes Understanding · ML Model Training Familiarity

Required

7+ years in SRE, DevOps, or Infrastructure Engineering, preferably in HPC or large-scale distributed systems
Linux expertise (Ubuntu or Debian preferred)
Strong experience with scripting and automation (Python, Go, Bash)
Proven ability with cloud platforms (AWS, GCP, Azure, or modern HPC providers such as CoreWeave, Lambda, Nebius)
Deep understanding of networking (DNS/TCP) and infrastructure-as-code tools (Terraform, Ansible)
Experience managing Slurm-based HPC GPU clusters, diagnosing performance issues, and designing efficient HPC jobs
Familiarity with ML model training environments

Preferred

Understanding of Kubernetes (nice to have)

Benefits

Generous cash + equity compensation along with various fringe benefits (e.g., healthcare, lunch, wellbeing)

Company

DataCrunch

DataCrunch.io is a new cloud service provider whose main focus is providing its own infrastructure for machine learning.

Funding

Current Stage: Growth Stage
Total Funding: $78.56M
Key Investors: byFounders
2025-09-08 · Series A · $64.47M
2025-09-08 · Debt Financing
2024-10-21 · Seed · $7.6M
Company data provided by Crunchbase