Dyna Robotics · 5 months ago
Staff Machine Learning Infrastructure Engineer
Dyna Robotics is a pioneering company in robotic manipulation, leveraging advanced foundation models to automate tasks with intelligent robotic arms. They are seeking a Staff Machine Learning Infrastructure Engineer to design and maintain large-scale ML infrastructure, enhancing model training performance and system reliability across a growing GPU ecosystem.
Artificial Intelligence (AI) · Information Technology · Machine Learning · Robotics
Responsibilities
Architect and implement large-scale ML training pipelines that leverage parallel GPU processing on platforms like GCP or AWS
Enhance our existing infrastructure to fully exploit parallelism and design for future expansion, ensuring that our system is ready to support growth
Manage and optimize high-performance computing resources
Develop robust distributed computing solutions, addressing challenges like race conditions, memory optimization, and resource allocation
Optimize model training with techniques like mixed precision, ZeRO, LoRA, etc.
Design systems for job rescheduling, automated retries, and failure recovery to maximize uptime and training efficiency
Implement intelligent job queuing mechanisms to optimize training workloads and resource utilization
Evaluate tradeoffs between local and networked storage solutions, and implement the option that best improves data throughput and access
Develop strategies for caching training data to optimize performance
Work closely with ML researchers and data scientists to understand training requirements and bottlenecks
Continuously monitor system performance, identify areas for improvement, and implement best practices to enhance scalability and reliability
Qualifications
Required
Bachelor's degree or higher in Computer Science or a related field
At least 7 years of professional experience in the software industry, with a minimum of 2 years in a tech lead role
Proven experience with high-performance computing environments and distributed systems
Demonstrated ability to scale ML training systems and optimize resource utilization
Hands-on experience with job scheduling systems and managing cloud GPU environments (GCP, AWS, etc.)
Deep understanding of distributed computing concepts, including race conditions, memory optimization, and parallel processing
Hands-on experience in ML model tuning for performance
Experience with common ML training and inference tools including PyTorch, TensorRT, Triton, Accelerate, etc.
Strong analytical and problem-solving skills with the ability to troubleshoot complex system issues
Excellent communication skills to collaborate effectively with cross-functional teams
Preferred
Experience with container orchestration tools (e.g., Kubernetes) and infrastructure-as-code frameworks
Benefits
Competitive salary and equity in a seed-stage venture-backed startup
Comprehensive health, dental, and vision insurance
Professional growth and development through training, mentorship, and challenging projects
Daily catered lunches and dinners with a fully stocked kitchen
Company
Dyna Robotics
Dyna Robotics develops advanced robotic manipulation models to automate repetitive and stationary tasks.
H1B Sponsorship
Dyna Robotics has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data powered by US Department of Labor)
Trends of Total Sponsorships
2025 (7)
2024 (3)
Funding
Current Stage: Early Stage
Total Funding: $143.5M
2025-09-15 · Series A · $120M
2025-03-25 · Seed · $23.5M
Company data provided by Crunchbase