poolside · 1 month ago
Member of Engineering (Pre-training and inference fault tolerance)
Poolside is a company focused on building a world where AI drives economically valuable work and scientific progress. The role involves working in the pre-training team to enhance the reliability and fault tolerance of distributed training and inference for Large Language Models (LLMs), with responsibilities including troubleshooting hardware issues and developing tools for training recovery.
Artificial Intelligence (AI), Developer Platform, Information Technology, Infrastructure, Software
Responsibilities
Identify, study, and troubleshoot hardware problems during training at scale
Minimize GPU idle time during faults, both operationally and strategically
Design and develop tools and add-ons to accelerate training recovery
Improve the performance and reliability of checkpointing
Write high-quality Python (PyTorch), Cython, C/C++, CUDA API code
Qualifications
Required
Strong engineering skills
Good knowledge of Torch
NVIDIA GPU architecture
Reliability concepts
Distributed systems
Best coding practices
Basic understanding of LLM training and inference principles
Fast learners who are prepared for a steep learning curve
Not afraid to step out of their comfort zone
Understanding of Large Language Models (LLM)
Basic knowledge of Transformers
Knowledge of deep learning fundamentals
Strong engineering background
Programming experience
Linux API
Linux kernel
Strong algorithmic skills
Python with numpy, PyTorch, or Jax
C/C++
NCCL
Use modern tools and are always looking to improve
Strong critical thinking and ability to question code quality policies when applicable
Distributed systems
Reliability
Observability
Fault-tolerance
K8s stack
Benefits
Fully remote work & flexible hours
37 days/year of vacation & holidays
Health insurance allowance for you and dependents
Company-provided equipment
Wellbeing, always-be-learning and home office allowances
Frequent team get-togethers
Great, diverse & inclusive people-first culture
Company
poolside
Poolside is an artificial intelligence platform that offers foundation concepts and infrastructure for writing software code.
H1B Sponsorship
poolside has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data powered by US Department of Labor)
Trends of Total Sponsorships: 2025 (1)
Funding
Current Stage: Growth Stage
Total Funding: $626M
Key Investors: Bain Capital Ventures, Redpoint
2024-10-02 · Series B · $500M
2023-08-24 · Series A · $100M
2023-05-14 · Seed · $26M
Company data provided by crunchbase