Sr. Linux System Administrator jobs in United States

fal · 19 hours ago

Sr. Linux System Administrator

fal is hiring a Sr. Linux System Administrator to maintain the health, security, and performance of its Linux systems at scale. The role manages the bare-metal and OS-level foundation of fal's GPU cloud, ensuring optimal performance and security across a fleet of servers.
Artificial Intelligence (AI), Software, Information Technology, AI Infrastructure, Developer Platform, Machine Learning
H1B Sponsored

Responsibilities

Own the full lifecycle of our bare-metal GPU server fleet: provisioning, imaging, configuration management, patching, and decommissioning across multiple data centers and providers
Build and maintain our server automation stack using Ansible, Terraform, and custom tooling to manage OS configuration, kernel parameters, driver versions, and firmware updates at scale
Tune Linux systems for AI workloads: kernel parameters, NUMA topology, CPU pinning, hugepages, I/O schedulers, and GPU driver stack optimization (NVIDIA drivers, CUDA, container runtimes)
Manage and optimize distributed and local storage systems supporting model weights, checkpoints, and ephemeral scratch: NVMe arrays, NFS, parallel file systems, and object storage
Implement and enforce OS-level security: hardening baselines, SELinux/AppArmor policies, SSH key management, vulnerability scanning, and compliance automation
Own system observability: deploy and maintain node-level metrics collection, log aggregation, and alerting using Prometheus, node_exporter, Loki, and Grafana
Collaborate with the Compute platform team to ensure smooth integration between our infrastructure layer (K8s, Nomad, FluxCD) and the underlying Linux hosts

Qualifications

Linux administration, Kernel tuning, Configuration management, Storage technologies, NVIDIA GPU software, Python scripting, Bash scripting, Communication, Self-starter mindset

Required

8+ years of experience administering Linux systems at scale, ideally in GPU cloud, HPC, or large bare-metal environments
Deep expertise in Linux internals: systemd, kernel tuning (sysctl, cgroups, namespaces), boot process, package management, and performance profiling (perf, bpftrace, sar)
Strong experience with configuration management and infrastructure-as-code: Ansible, Terraform, cloud-init, PXE/iPXE, and custom imaging pipelines
Solid understanding of storage technologies: LVM, RAID, NVMe, NFS, Lustre or GPFS, and Linux I/O stack tuning
Familiarity with the NVIDIA GPU software stack: drivers, CUDA toolkit, nvidia-smi, MIG, and container runtimes (nvidia-container-toolkit)
Proficiency in Python and Bash scripting for automation, monitoring, and fleet management tooling
Excellent communication and a self-starter mindset—you take ownership and constantly seek improvement

Preferred

Experience operating Kubernetes on bare metal (kubeadm, Kubespray) and managing GPU scheduling in K8s (device plugins, MIG slicing)
Hands-on experience with BMC/IPMI/Redfish for out-of-band server management and firmware lifecycle automation
Familiarity with fleet-scale observability: Prometheus federation, Thanos, or VictoriaMetrics for multi-cluster monitoring
Contributions to open-source infrastructure tooling or Linux distributions
Experience with compliance frameworks relevant to cloud providers (SOC 2, ISO 27001)

Benefits

Competitive salary and equity
Health, dental, and vision insurance (US)
Regular team events and offsites

Company

fal

fal is a generative media platform that helps developers create applications using AI models.

Funding

Current Stage: Late Stage
Total Funding: $337M
Key Investors: Sequoia Capital, Meritech Capital Partners, Andreessen Horowitz, Notable Capital
2025-12-09 · Series D · $140M
2025-07-31 · Series C · $125M
2025-02-12 · Series B · $49M

Leadership Team

Burkay Gur
Co-Founder
Gorkem Yurtseven
Co-Founder
Company data provided by Crunchbase