Member of Engineering (Pre-training and inference fault tolerance) jobs in United States

poolside · 1 month ago

Member of Engineering (Pre-training and inference fault tolerance)

Poolside is a company focused on building a world where AI drives economically valuable work and scientific progress. The role involves working in the pre-training team to enhance the reliability and fault tolerance of distributed training and inference for Large Language Models (LLMs), with responsibilities including troubleshooting hardware issues and developing tools for training recovery.

Artificial Intelligence (AI), Developer Platform, Information Technology, Infrastructure, Software
H1B Sponsor Likely

Responsibilities

Identify, study, and troubleshoot hardware problems during training at scale
Minimize GPU idle time during faults, both operationally and strategically
Design and develop tools and add-ons to accelerate training recovery
Improve the performance and reliability of checkpointing
Write high-quality Python (PyTorch), Cython, C/C++, and CUDA API code

Qualifications

Large Language Models, Distributed systems, Python with PyTorch, NVIDIA GPU architecture, Linux API, C/C++, NCCL, Engineering background, Deep learning fundamentals, Algorithmic skills, K8s stack, Critical thinking

Required

Strong engineering skills
Good knowledge of Torch
NVIDIA GPU architecture
Reliability concepts
Distributed systems
Best coding practices
Basic understanding of LLM training and inference principles
Fast learners who are prepared for a steep learning curve
Not afraid to step out of their comfort zone
Understanding of Large Language Models (LLM)
Basic knowledge of Transformers
Knowledge of deep learning fundamentals
Strong engineering background
Programming experience
Linux API
Linux kernel
Strong algorithmic skills
Python with numpy, PyTorch, or Jax
C/C++
NCCL
Use modern tools and are always looking to improve
Strong critical thinking and ability to question code quality policies when applicable
Reliability
Observability
Fault-tolerance
K8s stack

Benefits

Fully remote work & flexible hours
37 days/year of vacation & holidays
Health insurance allowance for you and dependents
Company-provided equipment
Wellbeing, always-be-learning and home office allowances
Frequent team get-togethers
Great, diverse & inclusive people-first culture

Company

poolside

Poolside is an artificial intelligence platform that offers foundation concepts and infrastructure for writing software code.

H1B Sponsorship

poolside has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data Powered by US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship; the highlighted segment represents job fields similar to this job]
Trends of Total Sponsorships: 2025 (1)

Funding

Current Stage: Growth Stage
Total Funding: $626M
Key Investors: Bain Capital Ventures, Redpoint
2024-10-02 · Series B · $500M
2023-08-24 · Series A · $100M
2023-05-14 · Seed · $26M

Leadership Team

Eiso Kant, Co-CEO & Co-Founder
Jason Warner, Co-CEO & Co-Founder
Company data provided by Crunchbase