Software Engineer - Distributed Training jobs in United States

Clockwork Systems, Inc. · 5 months ago

Software Engineer - Distributed Training

Clockwork Systems is a pioneer in AI networking, redefining the foundations of distributed computing. They are seeking an experienced software engineer to build, optimize, and maintain large-scale distributed training infrastructure in the PyTorch ecosystem and to keep training jobs efficient and scalable.

Artificial Intelligence (AI) · Cloud Computing · Information Technology · Real Time · Software
H1B Sponsor Likely

Responsibilities

Develop and support distributed PyTorch training jobs using torch.distributed / c10d
Integrate and maintain frameworks like Megatron-LM, DeepSpeed, and related LLM training stacks
Diagnose and resolve distributed training issues (e.g., NCCL hangs, OOM, checkpoint corruption)
Optimize performance across communication, I/O, and memory bottlenecks
Implement fault tolerance, checkpointing, and recovery mechanisms for long-running jobs
Write tooling and scripts to streamline training workflows and experiment management
Collaborate with ML engineers to ensure compatibility with orchestration and container environments (e.g., Slurm, Kubernetes)

Qualifications

PyTorch · torch.distributed · Multi-node GPU clusters · Python · Linux shell scripting · NCCL · Megatron-LM · DeepSpeed · Debugging tools · Containerized environments · Performance tuning

Required

Deep experience with PyTorch and torch.distributed (c10d)
Hands-on experience with at least one of: Megatron-LM, DeepSpeed, or FairScale
Proficiency in Python and Linux shell scripting
Experience with multi-node GPU clusters using Slurm, Kubernetes, or similar
Strong understanding of NCCL, collective communication, and GPU topology
Familiarity with debugging tools and techniques for distributed systems

Preferred

Experience scaling LLM training across 8+ GPUs and multiple nodes
Knowledge of tensor, pipeline, and data parallelism
Familiarity with containerized training environments (Docker, Singularity)
Exposure to HPC environments or cloud GPU infrastructure
Experience with training workload orchestration tools or custom job launchers
Comfort with large-scale checkpointing, resume/restart logic, and model I/O

Benefits

Competitive compensation
A great benefits package
Catered lunch

Company

Clockwork Systems, Inc.

Clockwork is the Software-Driven Fabric company for AI and high-performance workloads.

H1B Sponsorship

Clockwork Systems, Inc. has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data from the US Department of Labor)
Trends of Total Sponsorships
2025 (2)
2024 (8)
2023 (5)
2022 (1)

Funding

Current Stage: Early Stage
Total Funding: $41.58M
Key Investors: New Enterprise Associates
Funding Rounds:
2025-09-10 · Series A · $20.57M
2022-03-16 · Series A · $21M

Leadership Team

Suresh Vasudevan, Chief Executive Officer
Balaji Prabhakar, Co-Founder (Oct 2018 to present)
Company data provided by Crunchbase