Prime Intellect · 1 month ago

Research Engineer - Distributed Training

United States

Full-time

Hybrid

Mid, Senior Level

Prime Intellect is building the open superintelligence stack and is seeking a Research Engineer to work on distributed training. The role involves leading research to create a decentralized training orchestration solution and optimizing the performance of AI workloads.

Artificial Intelligence (AI), Cloud Computing

H1B Sponsored

Responsibilities

Lead and participate in novel research to build a massive-scale, highly reliable, and secure decentralized training orchestration solution

Optimize the performance, cost, and resource utilization of AI workloads by leveraging the most recent advances in compute and memory optimization techniques

Contribute to the development of our open-source libraries and frameworks for distributed model training

Publish research in top-tier AI conferences such as ICML & NeurIPS

Distill highly technical project outcomes into approachable technical blog posts for our customers and developers

Stay up to date with the latest advancements in AI/ML infrastructure, tools, and decentralized training research, and proactively identify opportunities to enhance our platform's capabilities and user experience

Qualifications

AI/ML engineering, Distributed training techniques, MLOps best practices, PyTorch Distributed, DeepSpeed, MosaicML's LLM Foundry, Ray, Experiment tracking, Continuous integration/deployment, Technical blogging

Required

Strong background in AI/ML engineering, with extensive experience in designing and implementing end-to-end pipelines for training and deploying large-scale AI models

Deep expertise in distributed training techniques, frameworks (e.g., PyTorch Distributed, DeepSpeed, MosaicML's LLM Foundry), and tools (e.g., Ray) for optimizing the performance and scalability of AI workloads

Experience in large-scale model training, including distributed training techniques such as data, tensor, and pipeline parallelism

Solid understanding of MLOps best practices, including model versioning, experiment tracking, and continuous integration/deployment (CI/CD) pipelines

Passion for advancing the state-of-the-art in decentralized AI model training and democratizing access to AI capabilities for researchers, developers, and businesses worldwide

Benefits

Competitive compensation, including equity incentives, aligning your success with the growth and impact of Prime Intellect.

Flexible work arrangements, with the option to work remotely or in-person at our offices in San Francisco.

Visa sponsorship and relocation assistance for international candidates.

Quarterly team off-sites, hackathons, conferences and learning opportunities.

Opportunity to work with a talented, hard-working and mission-driven team, united by a shared passion for leveraging technology to accelerate science and AI.