Cartesia · 2 months ago
Cluster Infrastructure Engineer
Cartesia is pioneering the next generation of AI, focusing on real-time, multimodal intelligence. They are seeking a Cluster Infrastructure Engineer to design and operate large-scale GPU clusters that support their foundation models, ensuring speed, reliability, and automation in their research and product development processes.
Artificial Intelligence (AI) · Real Time · Software
Responsibilities
Design and build large-scale GPU clusters for model training and low-latency inference
Develop automation for provisioning, scaling, and monitoring to ensure clusters are fast, resilient, and self-healing
Collaborate closely with research and product teams to enable distributed training at scale, optimizing for speed, reliability, and utilization
Implement robust observability and alerting systems to monitor GPU health, node stability, and job performance
Diagnose and triage hardware, networking, and distributed training issues across environments, coordinating with provider support as needed
Continuously improve cluster reliability, developer ergonomics, and overall system efficiency across Cartesia’s research and production workloads
Qualifications
Required
Strong engineering fundamentals and experience building and operating large-scale distributed systems
Deep familiarity with GPU cluster management using Kubernetes and Slurm
A blend of developer empathy and raw performance engineering, designing systems and tools that are intuitive to use and fast
Ability to balance principled engineering with the urgency of keeping mission-critical systems alive
Proficiency with Infrastructure-as-Code tools (Terraform, Ansible, etc.) and observability tools (Prometheus, Grafana, etc.)
Strong debugging skills: comfortable diagnosing NCCL issues, CUDA errors, and network or driver-level faults
Preferred
Experience optimizing large-scale distributed training frameworks such as DeepSpeed, Megatron-LM, or similar
Familiarity with advanced parallelization techniques such as FSDP, context parallelism, or tensor parallelism
Company
Cartesia
Cartesia provides real-time multimodal intelligence for all devices.
H1B Sponsorship
Cartesia has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by US Department of Labor)
Trends of Total Sponsorships
2025 (6)
2024 (2)
Funding
Current Stage
Early Stage
Total Funding
$86M
Key Investors
Kleiner Perkins, Index Ventures
2025-03-11 · Series A · $64M
2024-12-12 · Seed · $22M
Company data provided by Crunchbase