Cerebras · 1 month ago

Principal Engineer, AI Inference Reliability

United States

Full-time

Remote

Senior Level, Lead/Staff

7+ years exp

Cerebras Systems builds the world's largest AI chip and is seeking a hands-on Reliability Tech Lead to ensure the reliability of its AI inference service. The role involves defining reliability strategy, implementing fault-detection mechanisms, and collaborating across teams to maintain world-class reliability standards.

AI Infrastructure, Artificial Intelligence (AI), Computer, Hardware, Semiconductor, Software

Growth Opportunities

H1B Sponsor Likely

Responsibilities

Define and drive reliability strategy: establish SLOs and ensure alignment across engineering

Design and implement reliability mechanisms: build and evolve systems for fault detection, graceful degradation, failover, throttling, and recovery across multiple regions and data centers

Lead large-scale incident management: own postmortems, root-cause analysis, and prevention loops for reliability-related incidents

Architect for reliability and observability: influence system design for redundancy, durability, and debuggability

Develop reliability tooling: create internal tools and frameworks for chaos testing, load simulation, and distributed fault injection

Collaborate broadly: work across software, infrastructure, and hardware teams to ensure reliability is embedded into every layer of our inference service

Monitor and communicate reliability metrics: build dashboards and alerts that measure service health and provide actionable insights

Mentor and influence: guide engineers and set best practices for designing, testing, and operating reliable large-scale systems

Qualifications

Reliability engineering, Large-scale distributed systems, Backend programming, Incident response, SLO/SLI/SLA design, Mentoring engineers, Chaos testing, Load simulation, Distributed fault injection, Postmortem culture, AI infrastructure systems, Communication, Cross-functional leadership

Required

Bachelor's or master's degree in computer science or related field

7+ years of experience in backend, infrastructure, or reliability engineering for large-scale distributed systems

Strong programming skills in at least one popular backend programming language such as Python, C++, Go, or Rust

Deep, hard-earned experience with reliability principles: SLO/SLI/SLA design, incident response, and postmortem culture

Excellent communication and cross-functional leadership skills

Preferred

Prior experience building large-scale AI infrastructure systems

Benefits

Enjoy job stability with startup vitality.

A simple, non-corporate work culture that respects individual beliefs.

Company

Cerebras

Cerebras Systems provides the world's fastest AI inference. We are powering the future of generative AI.

Founded in 2016

Sunnyvale, California, USA

501-1000 employees

https://cerebras.ai

H1B Sponsorship

Cerebras has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by the US Department of Labor)

Trends of Total Sponsorships

2025 (31)

2024 (16)

2023 (18)

2022 (17)

2021 (34)

2020 (23)

Funding

Current Stage

Late Stage

Total Funding

$1.82B

Key Investors

Alpha Wave Ventures, Vy Capital, Coatue

2025-12-03 · Secondary Market

2025-09-30 · Series G · $1.1B

2024-09-27 · Series Unknown

Leadership Team

Andrew Feldman

CEO & Founder

Bob Komin

Chief Financial Officer

Recent News

Crunchbase News

Sector Snapshot: US Semiconductor Startup Funding Hits Record High

2026-01-06

Benzinga.com

Who's Going Public Next? Kalshi Bets Drop US IPO Clues Before 2027— And It's Not Just SpaceX Or OpenAI

2026-01-03

Foundation Capital

Foundation Capital Portfolio

2026-01-02

Company data provided by Crunchbase