Distributed Machine Learning Engineer jobs in United States

MBZUAI (Mohamed bin Zayed University of Artificial Intelligence) · 5 months ago

Distributed Machine Learning Engineer

Mohamed bin Zayed University of Artificial Intelligence is dedicated to research, innovation, and empowering brilliant minds in AI. The Distributed Machine Learning Engineer will optimize performance for machine learning software stacks, develop new systems, and work alongside researchers to tackle challenges in AI development.

Artificial Intelligence (AI) · Higher Education · Universities
H1B Sponsored

Responsibilities

Understand, analyze, profile, and optimize deep learning workloads on state-of-the-art hardware and software platforms, and guide the team on improving their efficiency at different levels of optimization
Design and implement performance benchmarks and testing methodologies to evaluate application performance
Build tools to automate workload analysis, workload optimization, and other critical workflows
Triage system issues and identify bottlenecks and inefficiencies by analyzing the sources of issues and their impact on hardware and network, and propose solutions to enhance GPU utilization
Support the team in developing appropriate kernels and systems for new model architectures and algorithms
Participate in, or lead, design reviews with peers and stakeholders to decide among available technologies
Review code developed by other developers and provide feedback to ensure best practices (e.g., style guidelines, checking code in, accuracy, testability, and efficiency)
Contribute to existing documentation or educational content and adapt content based on product/program updates and user feedback
Represent MBZUAI at industry conferences and events, showcasing the institution’s cutting-edge HPC and deep learning capabilities and establishing MBZUAI as a global leader in AI research and innovation
Perform all other duties as reasonably directed by the line manager that are commensurate with these functional objectives

Qualifications

Parallel Computing · System Level Coding · Large-scale Machine Learning · Deep Learning Optimization · Performance Benchmarking · Debug Methodologies · Workload Analysis · GPU Utilization · Design Reviews · Code Review · Documentation Contribution

Required

Ph.D. in CS, EE or CSEE with 1+ years of working experience, OR
Master's in CS, EE or CSEE, or equivalent experience, with 2+ years of working experience
Strong background in parallel computing
Hands-on experience in system level coding
Experience with debugging methodologies
Large-scale machine learning experience

Benefits

Comprehensive medical, dental, and vision benefits
Bonus
401K Plan
Generous paid time off, sick leave and holidays
Paid Parental Leave
Employee Assistance Program
Life insurance and disability

Company

MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)

Official account of Mohamed bin Zayed University of Artificial Intelligence. Dedicated to research, innovation, and empowering brilliant minds in AI.

Funding

Current Stage: Growth Stage
Total Funding: $0.04M
Key Investors: Llama
2024-09-24 · Grant · $0.04M

Leadership Team

Eric Xing
President
Company data provided by Crunchbase