FirstPrinciples · 3 months ago
Member of Technical Staff, Training Engineer (Large Scale Foundation Models)
FirstPrinciples is a non-profit organization building an autonomous AI Physicist designed to advance humanity's understanding of the fundamental laws of nature. We are seeking a Member of Technical Staff, Training Engineer to develop and lead end-to-end pre-training of large language models on GPU clusters, making critical modeling choices and guiding the development of data pipelines to revolutionize fundamental physics research.
Artificial Intelligence (AI) · Non-Profit
Responsibilities
Design and run large-scale pre-training experiments for both dense and MoE architectures, from experiment planning through multi-week production runs
Tune optimizer configurations (AdamW/Adafactor/Sophia variants), learning rate schedules with warmup strategies, dropout, gradient clipping, weight decay, EMA, and activation checkpointing to ensure stability at scale
Own model and training recipes end-to-end, making informed decisions about microbatch and global batch configurations
Run ablations and scaling-law studies to set optimal tokens-to-train targets, entropy/perplexity goals, and checkpoint cadence that optimize cost-to-quality ratios
Build and harden high-throughput data pipelines encompassing dataset curation, filtering, deduplication, pack-by-length optimization, and contamination control
Design and implement multilingual and multimodal data ingest systems with intelligent repeat scheduling (e.g., D4-style approaches)
Architect comprehensive data pipelines across diverse modalities (web/book/code/speech/vision) with filtering, heuristic and learned scoring, temperature sampling, multilingual balancing, and curriculum learning
Demonstrate measurable impact from data quality work including large-scale deduplication, contamination audits, and repeat/mixture scheduling that improves downstream accuracy
Operate distributed training infrastructure using FSDP/ZeRO, tensor/pipeline/expert/context parallelism, and high-speed interconnects (NCCL, NVLink/InfiniBand)
Choose and configure optimal distributed strategies (FSDP vs ZeRO; 3D/5D hybrid parallelism for MoE) and launch parameters, documenting trade-offs for future reference
Exploit modern kernels and mixed-precision training (FlashAttention-3, FP8 via NVIDIA Transformer Engine) to maximize tokens/sec while maintaining perplexity targets
Integrate performance primitives including FlashAttention-3, fused optimizers, and custom CUDA/Triton kernels while maintaining convergence guarantees
Write production-grade PyTorch and Triton/CUDA kernels when required to unlock critical performance gains
Debug complex distributed training issues including deadlocks, OOMs, divergence, and stragglers using tools like Nsight, py-spy, TensorBoard, and W&B
Build comprehensive observability systems for long-horizon runs tracking throughput/efficiency, gradient statistics, loss spikes, token-mix drift, data freshness, and evaluation dashboards
Manage multi-node GPU jobs (SLURM/Kubernetes/Ray), debug NCCL hangs, clock skew issues, and implement elastic restart mechanisms
Shepherd multi-week training jobs through completion, recover gracefully from failures, and deliver stable checkpoints with measurable evaluation wins
Define evaluation suites and red-team protocols to monitor scaling behavior and catch regression signals over long training runs
Partner with safety and alignment teams on SFT/RLAIF/DPO stages and evaluations, ensuring pre-training choices support downstream alignment objectives
Collaborate across research, infrastructure, product, and safety teams to turn research wins into robust model artifacts and services
Lead cross-functional efforts and mentor engineers on distributed training best practices and stabilization techniques
Write crisp RFCs and retrospectives to document learnings and establish institutional knowledge
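To make the recipe-tuning responsibilities above concrete, here is a minimal PyTorch sketch of the kind of training-loop plumbing involved: AdamW with decoupled weight decay, a linear-warmup-then-cosine learning-rate schedule, and norm-based gradient clipping. All hyperparameter values (peak LR, betas, warmup length, decay floor) are illustrative assumptions, not the team's actual recipe, and the linear layer stands in for a real Transformer.

```python
import math
import torch
from torch import nn

# Toy model standing in for a Transformer block (illustrative only).
model = nn.Linear(16, 16)

# AdamW with decoupled weight decay, common in LLM pre-training recipes.
opt = torch.optim.AdamW(model.parameters(), lr=3e-4,
                        betas=(0.9, 0.95), weight_decay=0.1)

warmup_steps, total_steps = 100, 1000

def lr_lambda(step: int) -> float:
    # Linear warmup, then cosine decay to 10% of the peak learning rate.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * progress))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

for step in range(total_steps):
    loss = model(torch.randn(8, 16)).pow(2).mean()
    loss.backward()
    # Norm-based clipping guards against loss spikes at scale.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    opt.zero_grad()
    sched.step()
```

At production scale the same schedule shape is typically driven per-token rather than per-step, and stability work layers EMA, activation checkpointing, and mixed precision on top of this skeleton.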
Qualifications
Required
Bachelor's or Master's degree in Computer Science, Engineering, or related field
7-12+ years of total experience, including 2+ years training large Transformers at scale (10B→100B+ parameters; MoE experience is a plus) with a track record of shipped models or published training methods
Hands-on experience with at least one frontier-style training run where you've shepherded multi-week training jobs, recovered from failures, and delivered stable checkpoints with measurable evaluation improvements
Expert-level proficiency in PyTorch (including compiled mode/torch.compile), with strong understanding of CUDA/Triton fundamentals
Deep facility with distributed frameworks (PyTorch FSDP or DeepSpeed ZeRO) and multi-dimensional parallelism (TP/PP/EP/DP/CP), ideally with Megatron-Core experience
Proven success operating multi-node GPU jobs with experience debugging NCCL hangs, clock skew, and elastic restarts
Demonstrated impact from data quality work, including deduplication/contamination mitigation and data-mix design that measurably improved evaluation metrics
Strong applied mathematics background for training stability (optimization, numerics, initialization, learning rate scaling) with excellent experiment design and statistical rigor
Ability to work cross-functionally
Strong communicator who can simplify complex topics for diverse audiences
Entrepreneurial & mission-driven, comfortable in a fast-growing, startup-style environment, and motivated by the ambition of tackling one of the greatest scientific challenges in history
Demonstrated passion for physics and for making scientific knowledge accessible and impactful
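The data-quality requirement above (deduplication and contamination mitigation) can be illustrated with a minimal near-duplicate filter based on word-shingle Jaccard similarity. This is a toy sketch with an assumed 0.8 threshold; corpus-scale pipelines use approximate methods such as MinHash/LSH rather than this quadratic pairwise check.

```python
def shingles(text: str, n: int = 3) -> set:
    # Word-level n-gram shingles of a document.
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n])
            for i in range(max(1, len(tokens) - n + 1))}

def jaccard(a: set, b: set) -> float:
    # Set-overlap similarity; 1.0 means identical shingle sets.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def dedup(docs: list, threshold: float = 0.8) -> list:
    # Keep a document only if it is not a near-duplicate of any kept one.
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```

The same shingle machinery extends naturally to contamination audits: comparing training shards against evaluation sets instead of against each other.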
Preferred
MoE pre-training experience including router design, load-balancing, expert capacity tuning, z-loss, auxiliary losses, and parallelism mapping across thousands of GPUs
Accelerator-aware optimization expertise (kernel fusion, TMA/warp-specialization, cache locality) and production adoption of FlashAttention-3 and FP8 training on Hopper/Blackwell architectures
Modern evaluation and safety exposure, including contamination detection and leakage/membership-inference awareness
Experience guiding model design decisions for inference efficiency (KV-cache strategies, quantization, speculative decoding)
Advanced throughput optimization techniques: sequence packing with dynamic padding, fused attention/MLP, gradient accumulation tuned to saturate interconnects
Expertise in stability at scale: BF16/FP8 mixed precision with delayed scaling, norm-based clipping, cosine decay with warmup, EMA on very-large runs
MoE reliability expertise: router jitter/noise management, capacity factor tuning, token-dropless routing, and expert parallel + tensor/pipeline co-design
Deep understanding of data quality impact: aggressive deduplication (near-dup & fuzzy matching), contamination audits, and intelligent repeat scheduling strategies versus one-epoch-over-everything approaches
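As a reference point for the MoE items above, here is a sketch of a Switch-Transformer-style load-balancing auxiliary loss, which pushes the router toward a uniform token-to-expert assignment. It is a toy illustration under hard top-1 routing; capacity factors, router jitter, and z-loss are deliberately omitted.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor,
                        num_experts: int) -> torch.Tensor:
    """Switch-style auxiliary loss over router_logits of shape
    (num_tokens, num_experts); minimized (at 1.0) by a balanced router."""
    probs = F.softmax(router_logits, dim=-1)      # soft router probabilities
    top1 = probs.argmax(dim=-1)                   # hard top-1 expert choice
    # f_i: fraction of tokens dispatched to expert i.
    f = F.one_hot(top1, num_experts).float().mean(dim=0)
    # P_i: mean router probability mass on expert i.
    p = probs.mean(dim=0)
    # N * sum_i f_i * P_i, following the Switch Transformer formulation.
    return num_experts * torch.sum(f * p)
```

In a real MoE stack this term is summed across layers, scaled by a small coefficient, and added to the language-modeling loss, with its expert-dispatch statistics feeding the observability dashboards described earlier.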
Company
FirstPrinciples
Building AI to understand the nature of reality.
Funding
Current Stage
Early Stage
Company data provided by Crunchbase