fal · 5 hours ago
Sr. Linux System Administrator
fal is hiring a Sr. Linux System Administrator to maintain the health, security, and performance of its Linux systems at scale. The role manages the bare-metal and OS-level foundation of fal's GPU cloud and is responsible for performance and security across the server fleet.
Artificial Intelligence (AI) · Software · Information Technology · AI Infrastructure · Developer Platform · Machine Learning
Responsibilities
Own the full lifecycle of our bare-metal GPU server fleet: provisioning, imaging, configuration management, patching, and decommissioning across multiple data centers and providers
Build and maintain our server automation stack using Ansible, Terraform, and custom tooling to manage OS configuration, kernel parameters, driver versions, and firmware updates at scale
Tune Linux systems for AI workloads: kernel parameters, NUMA topology, CPU pinning, hugepages, I/O schedulers, and GPU driver stack optimization (NVIDIA drivers, CUDA, container runtimes)
Manage and optimize distributed and local storage systems supporting model weights, checkpoints, and ephemeral scratch: NVMe arrays, NFS, parallel file systems, and object storage
Implement and enforce OS-level security: hardening baselines, SELinux/AppArmor policies, SSH key management, vulnerability scanning, and compliance automation
Own system observability: deploy and maintain node-level metrics collection, log aggregation, and alerting using Prometheus, node_exporter, Loki, and Grafana
Collaborate with the Compute platform team to ensure smooth integration between our infrastructure layer (K8s, Nomad, FluxCD) and the underlying Linux hosts
Qualifications
Required
8+ years of experience administering Linux systems at scale, ideally in GPU cloud, HPC, or large bare-metal environments
Deep expertise in Linux internals: systemd, kernel tuning (sysctl, cgroups, namespaces), boot process, package management, and performance profiling (perf, bpftrace, sar)
Strong experience with configuration management and infrastructure-as-code: Ansible, Terraform, cloud-init, PXE/iPXE, and custom imaging pipelines
Solid understanding of storage technologies: LVM, RAID, NVMe, NFS, Lustre or GPFS, and Linux I/O stack tuning
Familiarity with the NVIDIA GPU software stack: drivers, CUDA toolkit, nvidia-smi, MIG, and container runtimes (nvidia-container-toolkit)
Proficiency in Python and Bash scripting for automation, monitoring, and fleet management tooling
Excellent communication and a self-starter mindset—you take ownership and constantly seek improvement
Preferred
Experience operating Kubernetes on bare metal (kubeadm, Kubespray) and managing GPU scheduling in K8s (device plugins, MIG slicing)
Hands-on experience with BMC/IPMI/Redfish for out-of-band server management and firmware lifecycle automation
Familiarity with fleet-scale observability: Prometheus federation, Thanos, or VictoriaMetrics for multi-cluster monitoring
Contributions to open-source infrastructure tooling or Linux distributions
Experience with compliance frameworks relevant to cloud providers (SOC 2, ISO 27001)
Benefits
Competitive salary and equity
Health, dental, and vision insurance (US)
Regular team events and offsites
Company
fal
fal is a generative media platform that helps developers build applications using AI models.
Funding
Current Stage: Late Stage
Total Funding: $337M
Key Investors: Sequoia Capital, Meritech Capital Partners, Andreessen Horowitz, Notable Capital
2025-12-09 · Series D · $140M
2025-07-31 · Series C · $125M
2025-02-12 · Series B · $49M
Company data provided by Crunchbase