Cedana · 5 months ago
Senior Infrastructure Engineer - Systems, Kubernetes and SLURM Internals
Cedana is focused on solving the challenge of seamless, live migration of active CPU and GPU containers through innovative cloud infrastructure solutions. The role involves designing and implementing core components of the system, enhancing reliability, and collaborating with customers to address complex infrastructure challenges.
Artificial Intelligence (AI) · Developer Tools · Software
Responsibilities
Design and Build: Architect and implement core components of our system, leveraging our unique insights into checkpointing, virtualization, and container orchestration to create capabilities that don't exist anywhere else
Engineer Rock-Solid Reliability: Enhance the stability and performance of our entire system, from kernel-level interactions and hypervisor optimizations to our managed Kubernetes cloud platform
Partner with Customers: Work directly with customers to solve their most complex infrastructure challenges, acting as a trusted technical partner and gathering insights that drive our product roadmap
Develop Sophisticated Tooling: Build and refine our internal observability and alerting infrastructure to proactively identify and resolve issues anywhere in the stack, ensuring our systems meet the highest standards of performance and availability
Qualifications
Required
SLURM Internals: Experience writing SLURM plugins (e.g., sched, job_submit, prolog/epilog), or extending SLURM behavior via Lua or C
Fairshare & Scheduling: Deep understanding of SLURM's multifactor priority, fairshare decay, and QOS management
HPC & GPU Workloads: Deployed or managed GPU workloads under SLURM, with knowledge of workload isolation and accelerator resource accounting
Linux & Container Internals: You possess a fundamental understanding of Linux/UNIX (system libraries, services, networking, kernel/user-space interaction) and containerization tech (containerd/cri-o, runc, cgroups, namespaces, seccomp)
Understanding of Networking: You understand how packets flow in Kubernetes, and have hacked around or deployed tooling like CNI, Cilium and/or Istio
Production Experience: You have hands-on experience scaling infrastructure, managing production-level Kubernetes clusters, and working with infrastructure-as-code tools like Helm and Terraform
Low-Level Familiarity: You are comfortable with concepts in low-level systems programming
On-Call Ready: You understand the importance of reliability and are familiar with being on-call
Preferred
Contributed to open-source projects like Kubernetes, containerd, or the Linux kernel
Experience with virtualization in Kubernetes, like KubeVirt or Kata Containers
Experience checkpointing and restoring jobs within SLURM (e.g., DMTCP, BLCR, CRIU)
Worked on multi-cluster or federated SLURM setups
Built tooling to bridge SLURM and Kubernetes or run mixed workload environments
Contributed to open-source schedulers or job systems (SLURM, Flux, Torque, PBS, etc.)
Familiarity with HPC environments (SLURM, MPI, RDMA) or GPU-centric Kubernetes tooling (Kueue, KubeFlow, KServe)
A passion for debugging weird kernel panics just as much as you enjoy writing elegant Go or Rust code
Experience leading teams or mentoring other engineers in a remote environment
Have written your own container runtime!
Company
Cedana
Cedana is VMware for GPUs. We enable enterprises to orchestrate and operationalize intelligence precisely, reliably, and efficiently.
H1B Sponsorship
Cedana has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided for reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships: 2025 (1)
Funding
Current Stage: Early Stage
Total Funding: $0.5M
Key Investors: Y Combinator
2023-09-06 · Pre Seed · $0.5M
Recent News
GlobeNewswire · 2025-04-15
Company data provided by Crunchbase