Member of Technical Staff - Decentralized High-Performance Computing Leader jobs in United States

Andiamo · 1 day ago

Member of Technical Staff - Decentralized High-Performance Computing Leader

Andiamo is a globally recognized staffing and consulting firm specializing in placing top technology professionals. They are seeking a highly skilled Software Engineer to design and build systems for next-generation AI infrastructure, focusing on large-scale machine learning workloads and collaborating with teams to create scalable solutions.

Consulting, Human Resources, Information Technology, Staffing Agency
Comp. & Benefits
H1B Sponsor Likely

Responsibilities

Design and enhance job scheduling systems to increase GPU efficiency and throughput for large-scale machine learning workloads
Develop intuitive management interfaces and APIs that simplify cluster control and integration with frameworks like PyTorch, JAX, and TensorFlow
Build observability and monitoring systems to track performance, utilization, and progress across vast distributed training environments
Streamline data pipelines to accelerate both model training and inference processes, ensuring smooth and reliable data flow
Integrate deeply with ML tooling such as MLflow, Kubeflow, and Weights & Biases, developing seamless services and connectors that enhance developer productivity
Write high-performance libraries and internal utilities to automate deployment, scaling, and the management of distributed training workloads

Qualifications

Large-scale ML systems, GPU optimization, Cluster orchestration, APIs, SDKs, Distributed training workflows, ML infrastructure tools, Communication skills, Team collaboration, Customer-focused mindset, Problem-solving

Required

A customer-focused mindset and the ability to turn user needs into thoughtful, scalable solutions
A drive to take initiative, act decisively, and deliver results without waiting for perfect conditions
Comfort working in ambiguous, fast-evolving problem spaces with shifting priorities
Excellent communication skills and a collaborative approach that uplifts teammates and partners alike

Preferred

Developed or optimized systems for training or serving large-scale ML models, ideally across 1,000+ GPUs
Improved performance and efficiency of distributed training workflows spanning multiple nodes and accelerators
Built APIs, SDKs, or interfaces that simplify machine learning operations and enhance developer experience
Worked with cluster orchestration technologies such as Kubernetes or SLURM in the context of large-scale ML workloads
Contributed to or worked with ML infrastructure tools such as Ray, Horovod, or DeepSpeed, and used workflow systems like MLflow, Kubeflow, or Weights & Biases

Company

The Talent Partners for the AI Revolution.

H1B Sponsorship

Andiamo has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship]
Trends of Total Sponsorships: 2022 (2), 2021 (1)

Funding

Current Stage: Growth Stage

Leadership Team

Patrick McAdams, CEO & Co-Founder
Steven Kottler, CFO
Company data provided by crunchbase