FirstPrinciples · 3 weeks ago

Member of Technical Staff, DevOps / Infrastructure Engineering

United States

Full-time

Remote

Mid, Senior Level

3+ years exp

FirstPrinciples is a non-profit organization focused on developing an autonomous AI Physicist to advance our understanding of the universe. They are seeking a Member of Technical Staff in DevOps/Infrastructure Engineering to architect and automate the infrastructure for large-scale model training and research workflows, ensuring reliable and scalable operations for their AI initiatives.

Artificial Intelligence (AI) · Non Profit

Responsibilities

Design and run large-scale pre-training experiments for both dense and MoE architectures, from experiment planning through multi-week production runs

Architect hybrid infrastructure solutions that span cloud and on-premises HPC environments seamlessly

Automate configuration management and drift detection using tools like Ansible, Salt, or Chef

Build systems that reduce operational toil and establish guardrails that let researchers focus on experiments, not operations

Build and own comprehensive CI/CD pipelines for training workflows, evaluation jobs, internal tools, and services with rollback capabilities, observability, and safety built in

Develop tooling for developer workflows including reproducible builds, ephemeral environments, secrets management, and cluster resource allocation

Create self-service infrastructure patterns that empower researchers and engineers

Design infrastructure that accelerates experimentation while maintaining reliability and reproducibility

Manage and extend HPC environments including GPU clusters, InfiniBand networks, job schedulers (Slurm/Kubernetes hybrid), and container orchestration

Operate containerized and scheduled workloads efficiently across Docker, Kubernetes, and Slurm environments

Optimize cluster scheduling and resource allocation for high-performance GPU workloads

Debug GPU driver quirks, Slurm job issues, and InfiniBand networking hiccups as they arise

Implement comprehensive monitoring, logging, and alerting across all infrastructure layers using Prometheus, Grafana, ELK/EFK, and OpenTelemetry

Establish SLOs/SLIs for infrastructure reliability and create observability dashboards for long-horizon training runs

Build observability stacks that provide visibility into both system health and job-level performance

Proactively detect and resolve infrastructure issues before they impact research workflows

Implement and manage secrets management and identity security solutions (Vault, KMS, IAM)

Champion security best practices, IAM policies, and compliance standards across hybrid infrastructure

Design infrastructure with least privilege principles and strong security hygiene from the start

Maintain zero-trust security posture and comprehensive auditing capabilities

Partner closely with training engineers and researchers to translate research needs into robust infrastructure solutions

Document best practices, create runbooks, and evangelize DevOps culture across the organization

Mentor teammates on infrastructure patterns, automation techniques, and operational excellence

Enable efficient pre-training runs and safe deployment of new infrastructure patterns through collaboration

Qualifications

Unix/Linux systems · Infrastructure-as-Code · CI/CD systems · AWS · HPC infrastructure · Python · Kubernetes · Monitoring tools · GPU workloads · Entrepreneurial mindset · Passion for physics · Collaboration · Communication

Required

Bachelor's or Master's degree in Computer Science, Engineering, or related field

3-10+ years in DevOps, Infrastructure, or SRE roles with proven hands-on systems engineering experience

Strong Unix/Linux systems background including kernel tuning, networking, storage, and process control experience

Infrastructure-as-Code experience with Terraform, Pulumi, or CloudFormation

Expertise building CI/CD systems and reproducible build pipelines (GitHub Actions, GitLab CI, Jenkins, etc.)

Hands-on experience with AWS (EC2, S3, IAM, VPC, etc.) and cloud infrastructure fundamentals

Cluster orchestration and job scheduling experience with Kubernetes and Slurm

Monitoring and observability stack experience (Prometheus, Grafana, ELK/EFK, OpenTelemetry)

Demonstrated success scaling infrastructure for high-performance or GPU workloads

Track record of managing GPU-accelerated clusters or HPC infrastructure

Experience automating workflows to reduce toil and scaling deployments safely

Strong programming skills in at least one of Python, Go, or Rust, plus Bash fluency

Ability to work cross-functionally and to explain complex topics simply to diverse audiences

Entrepreneurial & mission-driven, comfortable in a fast-growing, startup-style environment, and motivated by the ambition of tackling one of the greatest scientific challenges in history

Demonstrated passion for physics and for making scientific knowledge accessible and impactful

Preferred

Prior work with HPC vendors or AI compute providers (Buzz HPC, NVIDIA DGX, Lambda, CoreWeave)

Experience designing self-service infrastructure or internal developer platforms

Deep familiarity with GPU cluster management, scheduling, and high-throughput networking (InfiniBand)

Cost management and optimization experience for large-scale compute infrastructure

Build system fluency and comfort with modern build tools (CMake, Bazel, Meson, Buck, Ninja)

Experience supporting AI/ML research environments and training pipeline infrastructure

Company

FirstPrinciples

Building AI to understand the nature of reality.

Founded in 2024

Toronto, Ontario, CAN

11-50 employees

https://firstprinciples.org

Funding

Current Stage

Early Stage
