Zyphra · 5 hours ago
Machine Learning Systems Administrator - HPC Infrastructure
Zyphra is an artificial intelligence company based in San Francisco, California. The role involves maintaining and developing core infrastructure for machine learning research and production, ensuring smooth operations and scalable workflows.
Artificial Intelligence (AI), Cloud Computing, Machine Learning, Software
Responsibilities
Administration and automation of our Linux-based cluster environments
Managing user onboarding/offboarding, security auditing, and access control
Monitoring system resources and job scheduling
Supporting and improving developer workflows (e.g., VSCode compatibility, Docker)
Enabling and supporting AI/ML workloads, including large-scale training jobs
Operating comfortably across a wide range of infrastructure concerns, and owning and improving critical systems
You’ll have a significant impact on both developer productivity and training and inference performance
Qualifications
Required
Strong experience with Linux system administration, user and access management, and automation
Demonstrated expertise in scripting languages for system tooling and automation (Bash, Python, etc.)
Familiarity with containerized environments (e.g., Docker) and job scheduling systems like Slurm
Experience building tooling for cluster validation and reliability (GPU, networking, storage health checks)
Experience setting up and managing developer tools and third-party services (e.g., cloud storage providers, Docker Hub, Slack, Gmail, Telegraf, experiment trackers)
Excellent debugging and troubleshooting skills across compute, storage, and networking
Strong communication skills and ability to collaborate across technical and non-technical teams
Preferred
Experience with infrastructure as code (e.g., Ansible, Terraform)
Prior work supporting ML/AI infrastructure, including GPU management and workload optimization
Exposure to backend development for ML model serving (e.g., vLLM, Ray, SGLang)
Experience working with cloud platforms such as AWS, Azure, or GCP
Familiarity with containers (Docker, Apptainer) and their integration with scheduling systems (Slurm, Kubernetes)
Benefits
Comprehensive medical, dental, vision, and FSA plans
Competitive compensation and 401(k)
Relocation and immigration support on a case-by-case basis
On-site meals prepared by a dedicated culinary team; Thursday Happy Hours
Company
Zyphra
Zyphra is a superintelligence research and product company based in San Francisco, California.
H-1B Sponsorship
Zyphra has a track record of offering H-1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference (data from the US Department of Labor).
Total sponsorships by year: 2025 (1)
Funding
Current Stage: Growth Stage
Total Funding: $100M
2025-06-09 Series A · $100M
2023-06-09 Seed
2021-11-18 Pre Seed
Company data provided by Crunchbase