Cerebras · 1 month ago
Principal Engineer, AI Inference Reliability
Cerebras Systems builds the world's largest AI chip and is seeking a hands-on Reliability Tech Lead to ensure the reliability of its AI inference service. The role involves defining reliability strategy, implementing fault-detection mechanisms, and collaborating across teams to maintain world-class reliability standards.
AI Infrastructure, Artificial Intelligence (AI), Computer, Hardware, Semiconductor, Software
Responsibilities
Define and drive reliability strategy: establish SLOs and ensure alignment across engineering
Design and implement reliability mechanisms: build and evolve systems for fault detection, graceful degradation, failover, throttling, and recovery across multiple regions and data centers
Lead large-scale incident management: own postmortems, root-cause analysis, and prevention loops for reliability-related incidents
Architect for reliability and observability: influence system design for redundancy, durability, and debuggability
Develop reliability tooling: create internal tools and frameworks for chaos testing, load simulation, and distributed fault injection
Collaborate broadly: work across software, infrastructure, and hardware teams to ensure reliability is embedded into every layer of our inference service
Monitor and communicate reliability metrics: build dashboards and alerts that measure service health and provide actionable insights
Mentor and influence: guide engineers and set best practices for designing, testing, and operating reliable large-scale systems
Qualifications
Required
Bachelor's or master's degree in computer science or a related field
7+ years of experience in backend, infrastructure, or reliability engineering for large-scale distributed systems
Strong programming skills in at least one popular backend programming language such as Python, C++, Go, or Rust
Deep, hard-earned experience with reliability principles: SLO/SLI/SLA design, incident response, and postmortem culture
Excellent communication and cross-functional leadership skills
Preferred
Prior experience building large-scale AI infrastructure systems
Benefits
Enjoy job stability with startup vitality.
Work in a simple, non-corporate culture that respects individual beliefs.
Company
Cerebras
Cerebras Systems provides the world's fastest AI inference. We are powering the future of generative AI.
H1B Sponsorship
Cerebras has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (31)
2024 (16)
2023 (18)
2022 (17)
2021 (34)
2020 (23)
Funding
Current Stage: Late Stage
Total Funding: $1.82B
Key Investors: Alpha Wave Ventures, Vy Capital, Coatue
2025-12-03: Secondary Market
2025-09-30: Series G, $1.1B
2024-09-27: Series Unknown
Company data provided by Crunchbase