Luma AI · 7 hours ago
Software Engineer - Reliability
Luma AI builds multimodal AI to enhance human capabilities, work that requires robust GPU infrastructure. The Software Engineer - Reliability will architect, maintain, and scale the company's infrastructure while ensuring high performance and security across multi-cloud environments.
Responsibilities
Architect for Reliability & Scale: Participate in critical re-architecture sessions to redesign our systems for higher efficiency and scale. You won't just maintain existing clusters; you will help define how our next-generation infrastructure operates.
Own Multi-Cloud GPU Clusters: Take end-to-end ownership of our production clusters for training and inference across AWS and OCI, ensuring high availability and peak performance
Drive Security & Compliance: Assist in achieving and maintaining security certifications (SOC 2 Type 1 & 2, ISO standards) by implementing robust infrastructure security practices in a fast-moving AI startup environment
Deep Linux Performance Tuning: Use your mastery of Linux systems to troubleshoot and optimize performance at the OS and kernel level
Build Robust Automation: Write high-quality tools and automation in Python, Go, or Bash to manage, monitor, and heal our infrastructure without relying on heavy operational toil
Debug Complex Hardware/Software Failures: Serve as the final escalation point for the most challenging GPU, networking (InfiniBand/RDMA), and system-level issues, often collaborating directly with hardware vendors like NVIDIA
Qualifications
Required
8+ years of experience as an SRE, production engineer, or infrastructure engineer in a fast-paced, large-scale environment
Deep Linux Mastery: You possess deep, hands-on expertise in Linux, containerized systems, and debugging low-level system performance
Cloud Infrastructure Expert: You have strong experience with providers like AWS or OCI
Tenacious Troubleshooter: You thrive on solving complex, low-level problems where hardware and software intersect
Startup DNA: You are energetic and thrive in a less structured, fast-paced environment
Security-Minded: You possess a working knowledge of security best practices and familiarity with compliance frameworks, such as SOC 2 and ISO
Expert in High-Performance Networking: You have practical experience with InfiniBand, RDMA, or RoCE and understand how to optimize throughput for massive distributed training jobs
Preferred
Deep expertise with GPU tooling for NVIDIA and AMD GPUs, such as DCGM or ROCm
Experience managing large-scale GPU clusters for AI/ML workloads (training or inference)
Familiarity with job management systems based on Kubernetes or orchestration frameworks like Ray
Company
Luma AI
Luma AI develops tools that let users generate photorealistic images and videos from text, image, or video prompts.
H1B Sponsorship
Luma AI has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship
Trends of Total Sponsorships
2025 (10)
2024 (3)
Funding
Current Stage: Growth Stage
Total Funding: $1.06B
Key Investors: HUMAIN, Andreessen Horowitz, Amplify Partners
2025-11-19 · Series C · $900M
2024-12-06 · Series B · $90M
2024-01-09 · Series B · $43M
Company data provided by Crunchbase