FirstPrinciples · 3 months ago
Member of Technical Staff, DevOps / Infrastructure Engineering
FirstPrinciples is a non-profit organization focused on advancing humanity's understanding of fundamental laws of nature through an autonomous AI Physicist. They are seeking a Member of Technical Staff in DevOps/Infrastructure Engineering to architect, automate, and scale the infrastructure for large-scale model training and research workflows, while collaborating closely with engineers and researchers.
Artificial Intelligence (AI), Non Profit
Responsibilities
Design and run large-scale pre-training experiments for both dense and MoE architectures, from experiment planning through multi-week production runs
Architect hybrid infrastructure solutions that span cloud and on-premises HPC environments seamlessly
Automate configuration management and drift detection using tools like Ansible, Salt, or Chef
Build systems that reduce operational toil and establish guardrails that let researchers focus on experiments, not operations
Build and own comprehensive CI/CD pipelines for training workflows, evaluation jobs, internal tools, and services with rollback capabilities, observability, and safety built in
Develop tooling for developer workflows including reproducible builds, ephemeral environments, secrets management, and cluster resource allocation
Create self-service infrastructure patterns that empower researchers and engineers
Design infrastructure that accelerates experimentation while maintaining reliability and reproducibility
Manage and extend HPC environments including GPU clusters, InfiniBand networks, job schedulers (Slurm/Kubernetes hybrid), and container orchestration
Operate containerized and scheduled workloads efficiently across Docker, Kubernetes, and Slurm environments
Optimize cluster scheduling and resource allocation for high-performance GPU workloads
Debug GPU driver quirks, Slurm job issues, and InfiniBand networking hiccups as they arise
Implement comprehensive monitoring, logging, and alerting across all infrastructure layers using Prometheus, Grafana, ELK/EFK, and OpenTelemetry
Establish SLOs/SLIs for infrastructure reliability and create observability dashboards for long-horizon training runs
Build observability stacks that provide visibility into both system health and job-level performance
Proactively detect and resolve infrastructure issues before they impact research workflows
Implement and manage secrets management and identity security solutions (Vault, KMS, IAM)
Champion security best practices, IAM policies, and compliance standards across hybrid infrastructure
Design infrastructure with least privilege principles and strong security hygiene from the start
Maintain zero-trust security posture and comprehensive auditing capabilities
Partner closely with training engineers and researchers to translate research needs into robust infrastructure solutions
Document best practices, create runbooks, and evangelize DevOps culture across the organization
Mentor teammates on infrastructure patterns, automation techniques, and operational excellence
Enable efficient pre-training runs and safe deployment of new infrastructure patterns through collaboration
Qualifications
Required
Bachelor's or Master's degree in Computer Science, Engineering, or related field
6-10+ years in DevOps, Infrastructure, or SRE roles with proven hands-on systems engineering experience (not just certification-based)
Deep Unix/Linux administration expertise including kernel tuning, networking, storage, and process control
Advanced Infrastructure-as-Code experience with Terraform, Pulumi, or CloudFormation
Expertise building CI/CD systems and reproducible build pipelines (GitHub Actions, GitLab CI, Jenkins, etc.)
Hands-on experience with AWS (EC2, S3, IAM, VPC, etc.) and cloud infrastructure management
Cluster orchestration and job scheduling experience with Kubernetes and Slurm
Strong monitoring and observability stack experience (Prometheus, Grafana, ELK/EFK, OpenTelemetry)
Demonstrated success scaling infrastructure for high-performance or GPU workloads
Track record of managing GPU-accelerated clusters or HPC infrastructure
Experience automating workflows to reduce toil and scaling deployments safely
Strong programming skills in at least one general-purpose language (Python, Go, or Rust) plus Bash fluency
Ability to work cross-functionally, with strong communication skills for simplifying complex topics for diverse audiences
Entrepreneurial & mission-driven, comfortable in a fast-growing, startup-style environment, and motivated by the ambition of tackling one of the greatest scientific challenges in history
Demonstrated passion for physics and for making scientific knowledge accessible and impactful
Preferred
Prior work with HPC vendors or AI compute providers (Buzz HPC, NVIDIA DGX, Lambda, CoreWeave)
Experience designing self-service infrastructure or internal developer platforms
Deep familiarity with GPU cluster management, scheduling, and high-throughput networking (InfiniBand)
Security and compliance expertise including zero-trust architectures, secrets management, and auditing frameworks
Cost management and optimization experience for large-scale compute infrastructure
Build system fluency and comfort with modern build tools (CMake, Bazel, Meson, Buck, Ninja)
Experience supporting AI/ML research environments and training pipeline infrastructure
"Automation first" mindset - you reduce toil by codifying repeatable operations
Deep understanding of DevOps philosophy, not just the tools - you live and breathe the culture
HPC comfort - you can debug Slurm jobs, GPU driver issues, or InfiniBand problems without hesitation
Cloud + HPC pragmatism - you know when to leverage AWS primitives versus optimizing HPC schedulers
Track record of mentoring and elevating teams, building collaboratively rather than in isolation
Passion for building state-of-the-art platforms with reproducibility and robust CI/CD at their core
Company
FirstPrinciples
Building AI to understand the nature of reality.
Funding
Current Stage
Early Stage