DeepRec.ai · 3 days ago
Senior/Staff Machine Learning Engineer
Responsibilities
Productionize Advanced ML Frameworks: Work closely with researchers to develop, test, and deploy parallelization and verification frameworks optimized for high-performance training and inference.
Convert Research into Production Code: Translate novel hybrid parallelization and verification methods from research concepts into production-grade code ready for real-world applications.
Optimize ML Systems at Scale: Implement and refine frameworks that support highly scalable training (e.g., FSDP, Megatron-LM, DeepSpeed) and production-scale inference (e.g., ONNX Runtime, TensorRT, NVIDIA Triton).
Qualifications
Required
Ability to operate in a research-heavy environment, making strategic trade-offs and working with ambiguity as you drive high-impact projects to completion.
Proven experience with parallelization frameworks for both training (e.g., FSDP, Megatron-LM, DeepSpeed) and inference (e.g., ONNX Runtime, DeepSpeed-Inference, NVIDIA Triton).
Strong foundation in either deep learning or distributed systems, enabling you to develop and optimize complex ML architectures.
Preferred
Background in fast-paced, high-growth environments, with a demonstrated ability to navigate rapid changes.
Proficiency in core networking protocols (e.g., IP, TCP, UDP, HTTP) and communication backends (e.g., NCCL, Gloo, MPI) essential for optimizing distributed ML systems.