
Cartesia · 2 months ago

Cluster Infrastructure Engineer

Cartesia is pioneering the next generation of AI, focusing on real-time, multimodal intelligence. They are seeking a Cluster Infrastructure Engineer to design and operate large-scale GPU clusters that support their foundation models, ensuring speed, reliability, and automation in their research and product development processes.

Artificial Intelligence (AI) · Real Time · Software
H1B Sponsor Likely

Responsibilities

Design and build large-scale GPU clusters for model training and low-latency inference
Develop automation for provisioning, scaling, and monitoring to ensure clusters are fast, resilient, and self-healing
Collaborate closely with research and product teams to enable distributed training at scale, optimizing for speed, reliability, and utilization
Implement robust observability and alerting systems to monitor GPU health, node stability, and job performance
Diagnose and triage hardware, networking, and distributed training issues across environments, coordinating with provider support as needed
Continuously improve cluster reliability, developer ergonomics, and overall system efficiency across Cartesia’s research and production workloads

Qualifications

GPU cluster management · Distributed systems · Infrastructure-as-Code · Observability tools · Debugging skills · Developer empathy · Performance engineering · Collaboration

Required

Strong engineering fundamentals and experience building and operating large-scale distributed systems
Deep familiarity with GPU cluster management using Kubernetes and Slurm
A blend of developer empathy and raw performance engineering, designing systems and tools that are intuitive to use and fast
Ability to balance principled engineering with the urgency of keeping mission-critical systems alive
Proficiency with Infrastructure-as-Code tools (Terraform, Ansible, etc.) and observability tools (Prometheus, Grafana, etc.)
Strong debugging skills: comfortable diagnosing NCCL issues, CUDA errors, and network or driver-level faults

Preferred

Experience optimizing large-scale distributed training frameworks such as DeepSpeed, Megatron-LM, or similar
Familiarity with advanced parallelization techniques such as FSDP, context parallelism, or tensor parallelism

Company

Cartesia

Cartesia provides real-time multimodal intelligence for all devices.

H1B Sponsorship

Cartesia has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (6)
2024 (2)

Funding

Current Stage: Early Stage
Total Funding: $86M
Key Investors: Kleiner Perkins, Index Ventures
2025-03-11 · Series A · $64M
2024-12-12 · Seed · $22M

Leadership Team

Karan Goel, Founder / CEO
Arjun Desai, Co-Founder
Company data provided by Crunchbase