Cedana · 5 months ago

Senior Infrastructure Engineer - Systems, Kubernetes and SLURM Internals

United States

Full-time

Remote

Senior Level

$120K/yr - $140K/yr

Cedana is focused on solving the challenge of seamless, live migration of active CPU and GPU containers through innovative cloud infrastructure solutions. The role involves designing and implementing core components of the system, enhancing reliability, and collaborating with customers to address complex infrastructure challenges.

Artificial Intelligence (AI) · Developer Tools · Software

H1B Sponsor Likely

Responsibilities

Design and Build: Architect and implement core components of our system, leveraging our unique insights into checkpointing, virtualization, and container orchestration to create capabilities that don't exist anywhere else

Engineer Rock-Solid Reliability: Enhance the stability and performance of our entire system, from kernel-level interactions and hypervisor optimizations to our managed Kubernetes cloud platform

Partner with Customers: Work directly with customers to solve their most complex infrastructure challenges, acting as a trusted technical partner and gathering insights that drive our product roadmap

Develop Sophisticated Tooling: Build and refine our internal observability and alerting infrastructure to proactively identify and resolve issues anywhere in the stack, ensuring our systems meet the highest standards of performance and availability

Qualifications

SLURM Internals · HPC & GPU Workloads · Linux & Container Internals · Production Experience · Networking · Low-Level Familiarity · Proven Collaborator · On-Call Ready · Creative Problem-Solver

Required

SLURM Internals: Experience writing SLURM plugins (e.g., sched, job_submit, prolog/epilog), or extending SLURM behavior via Lua or C

Fairshare & Scheduling: Deep understanding of SLURM's multifactor priority, fairshare decay, and QOS management

HPC & GPU Workloads: Deployed or managed GPU workloads under SLURM, with knowledge of workload isolation and accelerator resource accounting

Linux & Container Internals: You possess a fundamental understanding of Linux/UNIX (system libraries, services, networking, kernel/user-space interaction) and containerization tech (containerd/cri-o, runc, cgroups, namespaces, seccomp)

Understanding of Networking: You understand how packets flow in Kubernetes, and have hacked around or deployed tooling like CNI, Cilium and/or Istio

Production Experience: You have hands-on experience scaling infrastructure, managing production-level Kubernetes clusters, and working with infrastructure-as-code tools like Helm and Terraform

Low-Level Familiarity: You are comfortable with concepts in low-level systems programming

On-Call Ready: You understand the importance of reliability and are familiar with being on-call

Preferred

Contributed to open-source projects like Kubernetes, containerd, or the Linux kernel

Experience with virtualization in Kubernetes, such as KubeVirt or Kata Containers

Experience checkpointing and restoring jobs within SLURM (e.g., DMTCP, BLCR, CRIU)

Worked on multi-cluster or federated SLURM setups

Built tooling to bridge SLURM and Kubernetes or run mixed workload environments

Contributed to open-source schedulers or job systems (SLURM, Flux, Torque, PBS, etc.)

Familiarity with HPC environments (SLURM, MPI, RDMA) or GPU-centric Kubernetes tooling (Kueue, KubeFlow, KServe)

A passion for debugging weird kernel panics just as much as you enjoy writing elegant Go or Rust code

Experience leading teams or mentoring other engineers in a remote environment

Have written your own container runtime!

Company

Cedana

Cedana is VMware for GPUs. We enable enterprises to orchestrate and operationalize intelligence precisely, reliably, and efficiently.

Founded in 2023

New York, New York, USA

2-10 employees

https://www.cedana.ai

H1B Sponsorship

Cedana has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)

Distribution of Different Job Fields Receiving Sponsorship: [chart; highlighted segment represents job fields similar to this job]

Trends of Total Sponsorships: 2025 (1)

Funding

Current Stage

Early Stage

Total Funding

$0.5M

Key Investors

Y Combinator

2023-09-06 · Pre-Seed · $0.5M

Leadership Team

Niranjan Ravichandra

Co-Founder & CTO

Company data provided by Crunchbase