krea.ai · 5 months ago
Distributed Systems Engineer
Krea is a company dedicated to building next-generation AI creative tools that empower human creativity. As a Distributed Systems Engineer, you will design and maintain large-scale distributed infrastructure to support AI research and real-time model serving, collaborating closely with ML engineers and researchers.
Art · Artificial Intelligence (AI) · Photo Editing · Video Editing
Responsibilities
Design, build, and maintain large-scale distributed infrastructure to reliably support AI research and real-time model serving
Own and scale our multi-thousand-node Kubernetes GPU clusters, ensuring efficient and fault-tolerant operations
Collaborate closely with ML engineers and researchers to architect systems that enable rapid experimentation and deployment
Improve network architecture, optimize load balancing, and streamline operational practices across multi-zone cloud deployments
Own and manage a large-scale Kubernetes cluster designed to run extensive ML training and inference workloads
Architect fault-tolerant systems ensuring uninterrupted model training and real-time inference despite individual node failures
Develop and implement optimized load-balancing strategies to efficiently distribute workloads across zones
Create comprehensive monitoring, alerting systems, and operational playbooks for high-availability clusters
Migrate existing deployments to Infrastructure as Code (Terraform) for reproducibility and scalability
Set up IP-based rate limiting to prevent GPU abuse
Qualifications
Preferred
Kubernetes at scale (thousands of nodes)
Cloud infrastructure management (AWS/GCP/Azure)
High-performance and fault-tolerant networking
Low-level Linux interfaces and administration
Debugging complex distributed systems in production
Python, Golang, Ruby, Rust, and similar systems languages
Bonus: Infrastructure as Code (e.g. Terraform)
Company
krea.ai
H1B Sponsorship
krea.ai has a track record of offering H1B sponsorships. Please note that this does not
guarantee sponsorship for this specific role. Additional information is provided below for
reference. (Data from the US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship; the marked field is the one similar to this job]
Trends of Total Sponsorships
2025 (2)
2021 (1)
Funding
Current Stage: Growth Stage
Total Funding: $83M
Key Investors: Bain Capital Ventures, Andreessen Horowitz, Abstract
2025-04-07 · Series B · $47M
2024-01-01 · Series A · $33M
2023-04-01 · Seed · $3M
Company data provided by crunchbase