Site Reliability Engineer (SRE) — AI Training & Inference Infrastructure jobs in United States

STACK Construction Technologies · 22 hours ago

Site Reliability Engineer (SRE) — AI Training & Inference Infrastructure

STACK Construction Technologies builds software that helps teams plan, build, and operate with clarity and speed. They are seeking a Site Reliability Engineer to own reliability for their model training and inference platforms, focusing on operating and evolving GPU-enabled clusters and improving developer experience for AI workloads.

Construction · Real Estate · Software
Growth Opportunities

Responsibilities

Build and operate AI compute platforms
Design, provision, and scale GPU-backed clusters for training and inference (Kubernetes-based and/or HPC-style schedulers)
Own cluster lifecycle management: provisioning, bootstrapping, upgrades, autoscaling/capacity scaling, and decommissioning
Build reliable abstractions so training jobs can run across multiple clusters/environments with minimal friction
Define and track SLIs/SLOs for training and inference systems (job success rate, queue latency, throughput, tail latency, GPU utilization, etc.)
Lead incident response and root-cause analysis; drive permanent fixes and “never again” automation
Improve recovery and maintenance workflows (e.g., reducing restart/upgrade times; safer rollouts)
Implement end-to-end monitoring across compute, networking, storage, and accelerators
Build dashboards, alerting, and anomaly detection that catch issues early—before they derail long runs
Tune performance and cost: GPU utilization, scheduling efficiency, I/O bottlenecks, and network hotspots
Partner with vendors and internal stakeholders on firmware/driver alignment and node health
Provide paved paths for training: reproducible environments, job templates, secure secrets, artifact storage, and dataset access patterns
Collaborate closely with ML researchers/engineers to understand workload needs and remove infrastructure bottlenecks

Qualifications

Kubernetes · Infrastructure-as-Code · Production infrastructure · Python · Linux/Unix · Operational mindset

Required

5+ years building/operating production infrastructure as an SRE, infrastructure engineer, or systems engineer
Strong Kubernetes experience (cluster operations, upgrades, networking, storage, and troubleshooting)
Proficiency in at least one programming/scripting language (Python, Go, etc.) for automation and tooling
Experience with Infrastructure-as-Code (Terraform preferred) and CI/CD for infra or platform components
Solid Linux/Unix fundamentals (performance, debugging, kernel/userland tooling)
Strong operational mindset: you care about reliability, safe change management, and measurable outcomes

Company

STACK Construction Technologies

STACK Construction Technologies is a construction software company.

Funding

Current Stage: Growth Stage
Total Funding: $29.3M
Key Investors: Level Equity Management, CincyTech
2025-08-14 · Series Unknown · $3M
2022-03-22 · Series Unknown · $17M
2020-05-27 · Series B · $2M

Leadership Team

Raymond DeZenzo · Chief Operating Officer & CFO
Dave Wagner · VP of Product Marketing and Partner Development
Company data provided by Crunchbase