Forward Deployed SRE jobs in United States

Baseten · 10 hours ago

Forward Deployed SRE

Baseten powers mission-critical inference for leading AI companies by providing flexible infrastructure and developer tooling. The Forward Deployed SRE is responsible for ensuring the smooth deployment and performance of machine learning workloads for strategic customers, managing escalations, and collaborating with teams across the company to drive product improvements.

Artificial Intelligence (AI) · Developer Tools · Machine Learning · Software · Software Engineering
H1B Sponsor Likely

Responsibilities

Diagnose and resolve runtime issues related to latency, memory behavior, GPU utilization, concurrency, and model lifecycle management
Debug infrastructure issues across Kubernetes (pods, controllers), networking, observability, and alerting systems
Lead incident response during outages or escalations, managing coordination between Product, FDE, Sales, and Engineering
Serve as the technical owner for top enterprise accounts with strict SLAs and high responsiveness expectations
Identify common failure modes and translate user feedback into roadmap signals, product improvements, internal runbooks, knowledge bases, and diagnostic best practices
Own project coordination end-to-end, including scoping, execution, communication, and stakeholder alignment across technical and non-technical teams, for work ranging from feature requests and new deployments to operational debugging issues

Qualifications

Kubernetes troubleshooting · Infrastructure debugging · Incident management · Project management · Communication skills · Executive presence · Customer-facing experience · AI model familiarity · Ticketing systems · CI/CD tooling

Required

Deep Kubernetes troubleshooting expertise, including advanced resource debugging, pod/runtime analysis, and log-based diagnostics using observability tooling such as Grafana, Loki, and Prometheus
Strong infrastructure debugging ability across container orchestration, networking, and service dependencies, with hands-on experience supporting production-grade clusters
Experience managing high-severity incidents with major customers, including SLAs, post-incident reviews, and clear communication throughout escalations
Proven project management and organizational skills with an ownership mindset, able to manage multiple complex, multi-stakeholder initiatives in parallel — including issue resolution, root-cause analysis, and feature delivery
Ability to translate recurring technical pain points into roadmap-level insights, documentation improvements, or product enhancements
Strong communication skills and executive presence during high-visibility situations, ensuring technical clarity and customer confidence
3+ years of experience in a fast-paced, high-growth, or customer-facing engineering environment

Preferred

Familiarity with running high-performance AI models and workloads, including troubleshooting ML pipelines from preprocessing through inference and serving
Experience implementing or managing ticketing and incident-response systems such as Zendesk or Pylon
Familiarity with Helm, Flux, CI/CD tooling, or scripting automations to improve deployment, release, or operational workflows

Benefits

Competitive compensation, including meaningful equity
100% coverage of medical, dental, and vision insurance for employees and dependents
Generous PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!)
Paid parental leave
Company-facilitated 401(k)
Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities

Company

Baseten

Baseten is an AI infrastructure company that integrates machine learning into business operations, production, and processes.

H1B Sponsorship

Baseten has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship]
Trends of Total Sponsorships
2025 (6)
2024 (8)
2023 (1)
2020 (1)

Funding

Current Stage: Late Stage
Total Funding: $285M
Key Investors: Bond, Greylock
2025-09-05 · Series D · $150M
2025-02-19 · Series C · $75M
2024-03-04 · Series B · $40M

Leadership Team

Aaron Relph
Design
Company data provided by Crunchbase