Software Engineer, AI Reliability jobs in United States

Anthropic · 9 hours ago

Software Engineer, AI Reliability

Anthropic is a public benefit corporation focused on creating reliable, interpretable, and steerable AI systems. They are seeking a Software Engineer in AI Reliability to improve system reliability across critical serving paths and collaborate with various teams to enhance the robustness of their AI services.
Artificial Intelligence (AI), Foundational AI, Generative AI, Information Technology, Machine Learning
H1B Sponsored

Responsibilities

Develop appropriate Service Level Objectives for large language model serving systems, balancing availability and latency with development velocity
Design and implement monitoring and observability systems across the token path
Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers
Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements
Support the reliability of safeguard model serving -- critical for both site reliability and Anthropic's safety commitments

Qualifications

Distributed systems, Reliability engineering, Large-scale infrastructure, ML hardware accelerators, AI observability tools, Chaos engineering, Communication skills, Collaboration skills

Required

Bachelor's degree in a related field or equivalent experience
Strong background in distributed systems, infrastructure, or reliability
Curious and brave -- comfortable jumping into unfamiliar systems during an incident and helping drive resolution even when you don't have deep expertise yet
Think holistically about how systems compose and where the seams are
Can build lasting relationships across teams
Care about users and feel ownership over outcomes, even for systems you don't own
Excellent communication and collaboration skills
Diverse experience in building product stacks, scaling databases, running massive distributed systems, and everything in between

Preferred

Experience as an SRE, Production Engineer, or in a similar reliability-focused role on large-scale systems
Experience operating large-scale model serving or training infrastructure (>1000 GPUs)
Experience with one or more ML hardware accelerators (GPUs, TPUs, Trainium)
Understanding of ML-specific networking optimizations like RDMA and InfiniBand
Expertise in AI-specific observability tools and frameworks
Experience with chaos engineering and systematic resilience testing
Contributions to open-source infrastructure or ML tooling

Benefits

Competitive compensation and benefits
Optional equity donation matching
Generous vacation and parental leave
Flexible working hours
A lovely office space in which to collaborate with colleagues

Company

Anthropic

Anthropic is an AI research company that focuses on the safety and alignment of AI systems with human values.

H1B Sponsorship

Anthropic has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship
Trends of Total Sponsorships
2025 (105)
2024 (13)
2023 (3)
2022 (4)
2021 (1)

Funding

Current Stage: Late Stage
Total Funding: $33.74B
Key Investors: Fidelity, ICONIQ Capital, Lightspeed Venture Partners, Google
2025-09-02: Series F, $13B
2025-05-16: Debt Financing, $2.5B
2025-03-03: Series E, $3.5B

Leadership Team

Dario Amodei, Co-Founder and CEO
Daniela Amodei, President and Co-Founder
Company data provided by Crunchbase