FirstPrinciples · 3 months ago

Member of Technical Staff, DevOps / Infrastructure Engineering

United States

Full-time

Remote

Senior Level

6+ years exp

FirstPrinciples is a non-profit organization focused on advancing humanity's understanding of fundamental laws of nature through an autonomous AI Physicist. They are seeking a Member of Technical Staff in DevOps/Infrastructure Engineering to architect, automate, and scale the infrastructure for large-scale model training and research workflows, while collaborating closely with engineers and researchers.

Artificial Intelligence (AI) · Non Profit

Responsibilities

Design and run large-scale pre-training experiments for both dense and MoE architectures, from experiment planning through multi-week production runs

Architect hybrid infrastructure solutions that span cloud and on-premises HPC environments seamlessly

Automate configuration management and drift detection using tools like Ansible, Salt, or Chef

Build systems that reduce operational toil and establish guardrails that let researchers focus on experiments, not operations

Build and own comprehensive CI/CD pipelines for training workflows, evaluation jobs, internal tools, and services with rollback capabilities, observability, and safety built in

Develop tooling for developer workflows including reproducible builds, ephemeral environments, secrets management, and cluster resource allocation

Create self-service infrastructure patterns that empower researchers and engineers

Design infrastructure that accelerates experimentation while maintaining reliability and reproducibility

Manage and extend HPC environments including GPU clusters, InfiniBand networks, job schedulers (Slurm/Kubernetes hybrid), and container orchestration

Operate containerized and scheduled workloads efficiently across Docker, Kubernetes, and Slurm environments

Optimize cluster scheduling and resource allocation for high-performance GPU workloads

Debug GPU driver quirks, Slurm job issues, and InfiniBand networking hiccups as they arise

Implement comprehensive monitoring, logging, and alerting across all infrastructure layers using Prometheus, Grafana, ELK/EFK, and OpenTelemetry

Establish SLOs/SLIs for infrastructure reliability and create observability dashboards for long-horizon training runs

Build observability stacks that provide visibility into both system health and job-level performance

Proactively detect and resolve infrastructure issues before they impact research workflows

Implement and manage secrets management and identity security solutions (Vault, KMS, IAM)

Champion security best practices, IAM policies, and compliance standards across hybrid infrastructure

Design infrastructure with least privilege principles and strong security hygiene from the start

Maintain zero-trust security posture and comprehensive auditing capabilities

Partner closely with training engineers and researchers to translate research needs into robust infrastructure solutions

Document best practices, create runbooks, and evangelize DevOps culture across the organization

Mentor teammates on infrastructure patterns, automation techniques, and operational excellence

Enable efficient pre-training runs and safe deployment of new infrastructure patterns through collaboration

Qualifications

Unix/Linux administration · Infrastructure-as-Code · CI/CD systems · AWS management · HPC infrastructure management · Monitoring · GPU cluster management · Secrets management · Programming skills · Observability · Entrepreneurial mindset · Passion for physics · Collaboration · Communication

Required

Bachelor's or Master's degree in Computer Science, Engineering, or related field

6-10+ years in DevOps, Infrastructure, or SRE roles with proven hands-on systems engineering experience (not just certification-based)

Deep Unix/Linux administration expertise including kernel tuning, networking, storage, and process control

Advanced Infrastructure-as-Code experience with Terraform, Pulumi, or CloudFormation

Expertise building CI/CD systems and reproducible build pipelines (GitHub Actions, GitLab CI, Jenkins, etc.)

Hands-on experience with AWS (EC2, S3, IAM, VPC, etc.) and cloud infrastructure management

Cluster orchestration and job scheduling experience with Kubernetes and Slurm

Strong monitoring and observability stack experience (Prometheus, Grafana, ELK/EFK, OpenTelemetry)

Demonstrated success scaling infrastructure for high-performance or GPU workloads

Track record of managing GPU-accelerated clusters or HPC infrastructure

Experience automating workflows to reduce toil and scaling deployments safely

Strong programming skills in at least one general-purpose language (Python, Go, or Rust) plus Bash fluency

Ability to work cross-functionally and communicate clearly, simplifying complex topics for diverse audiences

Entrepreneurial & mission-driven, comfortable in a fast-growing, startup-style environment, and motivated by the ambition of tackling one of the greatest scientific challenges in history

Demonstrated passion for physics and for making scientific knowledge accessible and impactful

Preferred

Prior work with HPC vendors or AI compute providers (Buzz HPC, NVIDIA DGX, Lambda, CoreWeave)

Experience designing self-service infrastructure or internal developer platforms

Deep familiarity with GPU cluster management, scheduling, and high-throughput networking (InfiniBand)

Security and compliance expertise including zero-trust architectures, secrets management, and auditing frameworks

Cost management and optimization experience for large-scale compute infrastructure

Build system fluency and comfort with modern build tools (CMake, Bazel, Meson, Buck, Ninja)

Experience supporting AI/ML research environments and training pipeline infrastructure

"Automation first" mindset - you reduce toil by codifying repeatable operations

Deep understanding of DevOps philosophy, not just the tools - you live and breathe the culture

HPC comfort - you can debug Slurm jobs, GPU driver issues, or InfiniBand problems without hesitation

Cloud + HPC pragmatism - you know when to leverage AWS primitives versus optimizing HPC schedulers

Track record of mentoring and elevating teams, building collaboratively rather than in isolation

Passion for building state-of-the-art platforms with reproducibility and robust CI/CD at their core

Company

FirstPrinciples

Building AI to understand the nature of reality.

Founded in 2024

Toronto, Ontario, CAN

11-50 employees

https://firstprinciples.org

Funding

Current Stage

Early Stage
