Staff Machine Learning Infrastructure Engineer jobs in United States

Dyna Robotics · 5 months ago

Staff Machine Learning Infrastructure Engineer

Dyna Robotics is a pioneering company in robotic manipulation, leveraging advanced foundation models to automate tasks with intelligent robotic arms. They are seeking a Staff Machine Learning Infrastructure Engineer to design and maintain large-scale ML infrastructure, enhancing model training performance and system reliability across a growing GPU ecosystem.

Artificial Intelligence (AI) · Information Technology · Machine Learning · Robotics
H1B Sponsor Likely

Responsibilities

Architect and implement large-scale ML training pipelines that leverage parallel GPU processing on platforms like GCP or AWS
Enhance our existing infrastructure to fully exploit parallelism and design for future expansion, ensuring that our system is ready to support growth
Manage and optimize high-performance computing resources
Develop robust distributed computing solutions, addressing challenges like race conditions, memory optimization, and resource allocation
Optimize model training with techniques like mixed precision, ZeRO, LoRA, etc.
Design systems for job rescheduling, automated retries, and failure recovery to maximize uptime and training efficiency
Implement intelligent job queuing mechanisms to optimize training workloads and resource utilization
Evaluate and implement tradeoffs between different local and networked storage solutions to improve data throughput and access
Develop strategies for caching training data to optimize performance
Work closely with ML researchers and data scientists to understand training requirements and bottlenecks
Continuously monitor system performance, identify areas for improvement, and implement best practices to enhance scalability and reliability

Qualifications

High-performance computing · Distributed systems · ML training systems · Cloud GPU environments · Job scheduling systems · ML model tuning · PyTorch · Analytical skills · Communication skills

Required

Bachelor's degree or higher in Computer Science or a related field
At least 7 years of professional experience in the software industry, with a minimum of 2 years in a tech lead role
Proven experience with high-performance computing environments and distributed systems
Demonstrated ability to scale ML training systems and optimize resource utilization
Hands-on experience with job scheduling systems and managing cloud GPU environments (GCP, AWS, etc.)
Deep understanding of distributed computing concepts, including race conditions, memory optimization, and parallel processing
Hands-on experience in ML model tuning for performance
Experience with common ML training and inference tools, including PyTorch, TensorRT, Triton, Accelerate, etc.
Strong analytical and problem-solving skills with the ability to troubleshoot complex system issues
Excellent communication skills to collaborate effectively with cross-functional teams

Preferred

Experience with container orchestration tools (e.g., Kubernetes) and infrastructure-as-code frameworks

Benefits

Competitive salary and equity in a seed-stage venture-backed startup
Comprehensive health, dental, and vision insurance
Professional growth and development through training, mentorship, and challenging projects
Daily catered lunches and dinners with a fully stocked kitchen

Company

Dyna Robotics

Dyna Robotics develops advanced robotic manipulation models to automate repetitive and stationary tasks.

H1B Sponsorship

Dyna Robotics has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart not available; the highlighted field represents the job field similar to this job)
Trends of Total Sponsorships
2025 (7)
2024 (3)

Funding

Current Stage: Early Stage
Total Funding: $143.5M
2025-09-15 · Series A · $120M
2025-03-25 · Seed · $23.5M
Company data provided by Crunchbase