krea.ai · 5 months ago

Distributed Systems Engineer

San Francisco

Full-time

Onsite

Mid Level

Krea is a company dedicated to building next-generation AI creative tools that empower human creativity. As a Distributed Systems Engineer, you will design and maintain large-scale distributed infrastructure to support AI research and real-time model serving, collaborating closely with ML engineers and researchers.

Art · Artificial Intelligence (AI) · Photo Editing · Video Editing

H1B Sponsor Likely

Responsibilities

Design, build, and maintain large-scale distributed infrastructure to reliably support AI research and real-time model serving

Own and scale our multi-thousand-node Kubernetes GPU clusters, ensuring efficient and fault-tolerant operations

Collaborate closely with ML engineers and researchers to architect systems that enable rapid experimentation and deployment

Improve network architecture, optimize load balancing, and streamline operational practices across multi-zone cloud deployments

Own and manage a large-scale Kubernetes cluster designed to run extensive ML training and inference workloads

Architect fault-tolerant systems ensuring uninterrupted model training and real-time inference despite individual node failures

Develop and implement optimized load-balancing strategies to efficiently distribute workloads across zones

Create comprehensive monitoring, alerting systems, and operational playbooks for high-availability clusters

Migrate existing deployments to Infrastructure as Code (Terraform) for reproducibility and scalability

Set up IP-based rate limiting to prevent GPU abuse

Qualifications

Kubernetes · Cloud infrastructure management · Fault-tolerant systems · Python · Infrastructure as Code · Low-level Linux administration · Debugging distributed systems · Load balancing

Preferred

Kubernetes at scale (thousands of nodes)

Cloud infrastructure management (AWS/GCP/Azure)

High-performance and fault-tolerant networking

Low-level Linux interfaces and administration

Debugging complex distributed systems in production

Python, Golang, Ruby, Rust, and similar systems languages

Bonus: Infrastructure as Code (e.g. Terraform)

Company

krea.ai

Founded in 2022

San Francisco, California, USA

2-10 employees

https://www.krea.ai

H1B Sponsorship

krea.ai has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)

[Chart: Distribution of Different Job Fields Receiving Sponsorship; the highlighted segment represents job fields similar to this job]

Trends of Total Sponsorships

2025 (2)

2021 (1)

Funding

Current Stage

Growth Stage

Total Funding

$83M

Key Investors

Bain Capital Ventures · Andreessen Horowitz · Abstract

2025-04-07 · Series B · $47M

2024-01-01 · Series A · $33M

2023-04-01 · Seed · $3M

Leadership Team

Víctor Perez

Co-founder & CEO

Diego Rodriguez

Co-founder & CTO

Recent News

Business Insider

These 13 creator economy startups pulled in about $2 billion in funding this year

2025-12-26

NDTV

Facing US Student Visa Issues In 2025? Expert Advice, Alternatives, And What Comes Next

2025-05-31

Venture Capital

Krea Secures $83 Million to Simplify Working with AI Models for Visual Creatives

2025-04-08

Company data provided by Crunchbase