Andiamo · 1 day ago
Member of Technical Staff - Decentralized High-Performance Computing Leader
Andiamo is a globally recognized staffing and consulting firm specializing in placing top technology professionals. They are seeking a highly skilled Software Engineer to design and build systems for next-generation AI infrastructure, focusing on large-scale machine learning workloads and collaborating with teams to create scalable solutions.
Consulting · Human Resources · Information Technology · Staffing Agency
Responsibilities
Design and enhance job scheduling systems to increase GPU efficiency and throughput for large-scale machine learning workloads
Develop intuitive management interfaces and APIs that simplify cluster control and integration with frameworks like PyTorch, JAX, and TensorFlow
Build observability and monitoring systems to track performance, utilization, and progress across vast distributed training environments
Streamline data pipelines to accelerate both model training and inference processes, ensuring smooth and reliable data flow
Integrate deeply with ML tooling such as MLflow, Kubeflow, and Weights & Biases, developing seamless services and connectors that enhance developer productivity
Write high-performance libraries and internal utilities to automate deployment, scaling, and the management of distributed training workloads
Qualifications
Required
A customer-focused mindset and the ability to turn user needs into thoughtful, scalable solutions
A drive to take initiative, act decisively, and deliver results without waiting for perfect conditions
Comfort working in ambiguous, fast-evolving problem spaces with shifting priorities
Excellent communication skills and a collaborative approach that uplifts teammates and partners alike
Preferred
Developed or optimized systems for training or serving large-scale ML models, ideally across 1,000+ GPUs
Improved performance and efficiency of distributed training workflows spanning multiple nodes and accelerators
Built APIs, SDKs, or interfaces that simplify machine learning operations and enhance developer experience
Experience with cluster orchestration technologies such as Kubernetes or SLURM in the context of large-scale ML workloads
Contributed to or worked with ML infrastructure tools such as Ray, Horovod, or DeepSpeed, and with workflow systems such as MLflow, Kubeflow, or Weights & Biases
Company
Andiamo
The Talent Partners for the AI Revolution.
H1B Sponsorship
Andiamo has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data powered by the US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship; the highlighted segment represents job fields similar to this job]
Trends of Total Sponsorships
2022 (2)
2021 (1)
Funding
Current Stage
Growth Stage
Company data provided by Crunchbase