FirstPrinciples · 3 weeks ago
Member of Technical Staff, DevOps / Infrastructure Engineering
FirstPrinciples is a non-profit organization focused on developing an autonomous AI Physicist to advance our understanding of the universe. They are seeking a Member of Technical Staff in DevOps/Infrastructure Engineering to architect and automate the infrastructure for large-scale model training and research workflows, ensuring reliable and scalable operations for their AI initiatives.
Artificial Intelligence (AI), Non Profit
Responsibilities
Design and run large-scale pre-training experiments for both dense and MoE architectures, from experiment planning through multi-week production runs
Architect hybrid infrastructure solutions that span cloud and on-premises HPC environments seamlessly
Automate configuration management and drift detection using tools like Ansible, Salt, or Chef
Build systems that reduce operational toil and establish guardrails that let researchers focus on experiments, not operations
Build and own comprehensive CI/CD pipelines for training workflows, evaluation jobs, internal tools, and services with rollback capabilities, observability, and safety built in
Develop tooling for developer workflows including reproducible builds, ephemeral environments, secrets management, and cluster resource allocation
Create self-service infrastructure patterns that empower researchers and engineers
Design infrastructure that accelerates experimentation while maintaining reliability and reproducibility
Manage and extend HPC environments including GPU clusters, InfiniBand networks, job schedulers (Slurm/Kubernetes hybrid), and container orchestration
Operate containerized and scheduled workloads efficiently across Docker, Kubernetes, and Slurm environments
Optimize cluster scheduling and resource allocation for high-performance GPU workloads
Debug GPU driver quirks, Slurm job issues, and InfiniBand networking hiccups as they arise
Implement comprehensive monitoring, logging, and alerting across all infrastructure layers using Prometheus, Grafana, ELK/EFK, and OpenTelemetry
Establish SLOs/SLIs for infrastructure reliability and create observability dashboards for long-horizon training runs
Build observability stacks that provide visibility into both system health and job-level performance
Proactively detect and resolve infrastructure issues before they impact research workflows
Implement and manage secrets management and identity security solutions (Vault, KMS, IAM)
Champion security best practices, IAM policies, and compliance standards across hybrid infrastructure
Design infrastructure with least privilege principles and strong security hygiene from the start
Maintain zero-trust security posture and comprehensive auditing capabilities
Partner closely with training engineers and researchers to translate research needs into robust infrastructure solutions
Document best practices, create runbooks, and evangelize DevOps culture across the organization
Mentor teammates on infrastructure patterns, automation techniques, and operational excellence
Enable efficient pre-training runs and safe deployment of new infrastructure patterns through collaboration
Qualifications
Required
Bachelor's or Master's degree in Computer Science, Engineering, or related field
3-10+ years in DevOps, Infrastructure, or SRE roles with proven hands-on systems engineering experience
Strong Unix/Linux systems background including kernel tuning, networking, storage, and process control experience
Infrastructure-as-Code experience with Terraform, Pulumi, or CloudFormation
Expertise building CI/CD systems and reproducible build pipelines (GitHub Actions, GitLab CI, Jenkins, etc.)
Hands-on experience with AWS (EC2, S3, IAM, VPC, etc.) and cloud infrastructure fundamentals
Cluster orchestration and job scheduling experience with Kubernetes and Slurm
Monitoring and observability stack experience (Prometheus, Grafana, ELK/EFK, OpenTelemetry)
Demonstrated success scaling infrastructure for high-performance or GPU workloads
Track record of managing GPU-accelerated clusters or HPC infrastructure
Experience automating workflows to reduce toil and scaling deployments safely
Strong programming skills in at least one of Python, Go, or Rust, plus Bash fluency
Ability to work cross-functionally. Strong communicator who can simplify complex topics for diverse audiences
Entrepreneurial & mission-driven, comfortable in a fast-growing, startup-style environment, and motivated by the ambition of tackling one of the greatest scientific challenges in history
Demonstrated passion for physics and for making scientific knowledge accessible and impactful
Preferred
Prior work with HPC vendors or AI compute providers (Buzz HPC, NVIDIA DGX, Lambda, CoreWeave)
Experience designing self-service infrastructure or internal developer platforms
Deep familiarity with GPU cluster management, scheduling, and high-throughput networking (InfiniBand)
Cost management and optimization experience for large-scale compute infrastructure
Build system fluency and comfort with modern build tools (CMake, Bazel, Meson, Buck, Ninja)
Experience supporting AI/ML research environments and training pipeline infrastructure
Company
FirstPrinciples
Building AI to understand the nature of reality.
Funding
Current Stage
Early Stage
Company data provided by Crunchbase