Clockwork Systems, Inc. · 5 months ago
Software Engineer - Distributed Training
Clockwork Systems is a pioneer in AI networking, redefining the foundations of distributed computing. They are seeking an experienced software engineer to build, optimize, and maintain large-scale distributed training infrastructure focusing on the PyTorch ecosystem, ensuring efficient and scalable training jobs.
Artificial Intelligence (AI) · Cloud Computing · Information Technology · Real Time · Software
Responsibilities
Develop and support distributed PyTorch training jobs using torch.distributed / c10d
Integrate and maintain frameworks like Megatron-LM, DeepSpeed, and related LLM training stacks
Diagnose and resolve distributed training issues (e.g., NCCL hangs, OOM, checkpoint corruption)
Optimize performance across communication, I/O, and memory bottlenecks
Implement fault tolerance, checkpointing, and recovery mechanisms for long-running jobs
Write tooling and scripts to streamline training workflows and experiment management
Collaborate with ML engineers to ensure compatibility with orchestration and container environments (e.g., Slurm, Kubernetes)
Qualifications
Required
Deep experience with PyTorch and torch.distributed (c10d)
Hands-on experience with at least one of: Megatron-LM, DeepSpeed, or FairScale
Proficiency in Python and Linux shell scripting
Experience with multi-node GPU clusters using Slurm, Kubernetes, or similar
Strong understanding of NCCL, collective communication, and GPU topology
Familiarity with debugging tools and techniques for distributed systems
Preferred
Experience scaling LLM training across 8+ GPUs and multiple nodes
Knowledge of tensor, pipeline, and data parallelism
Familiarity with containerized training environments (Docker, Singularity)
Exposure to HPC environments or cloud GPU infrastructure
Experience with training workload orchestration tools or custom job launchers
Comfort with large-scale checkpointing, resume/restart logic, and model I/O
Benefits
Competitive compensation
A great benefits package
Catered lunch
Company
Clockwork Systems, Inc.
Clockwork is the Software-Driven Fabric company for AI and high-performance workloads.
H1B Sponsorship
Clockwork Systems, Inc. has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by US Department of Labor)
Trends of Total Sponsorships
2025 (2)
2024 (8)
2023 (5)
2022 (1)
Funding
Current Stage: Early Stage
Total Funding: $41.58M
Key Investors: New Enterprise Associates
2025-09-10 · Series A · $20.57M
2022-03-16 · Series A · $21M
Company data provided by Crunchbase