Nebius · 1 day ago
Professional Services Engineer
Responsibilities
Design and implement distributed ML training and inference workflows: develop and maintain scalable, efficient and reliable ML training pipelines on K8s and Slurm, leveraging containerization (e.g. Docker) and orchestration (e.g. K8s).
Optimize ML training performance: collaborate with data scientists and engineers to optimize ML model training and inference performance.
Develop and contribute to the training and inference Solutions Library: design, deploy and manage K8s and Slurm clusters for large-scale ML training, leveraging our ready-to-deploy solutions.
Integrate with ML frameworks: integrate K8s and Slurm with popular ML frameworks like TensorFlow, PyTorch or MXNet, ensuring seamless execution of distributed ML training workloads.
Monitor and troubleshoot distributed training: develop monitoring and logging tools to track distributed training performance, identify bottlenecks and troubleshoot issues.
Develop automation scripts and tools: create automation scripts and tools to streamline ML training workflows, leveraging technologies like Ansible, Terraform or Python.
Stay up to date with industry trends: participate in industry conferences, meetups and online forums to follow the latest developments in MLOps, K8s, Slurm and ML.
Qualifications
Required
3+ years of experience in MLOps, DevOps or a related field
Strong experience with K8s and containerization (e.g. Docker)
Experience with Slurm or other distributed computing frameworks
Proficiency in Python, with experience in ML frameworks like TensorFlow, PyTorch or MXNet
Strong understanding of distributed computing concepts, including parallel processing and job scheduling
Experience with automation tools like Ansible, Terraform or Python
Excellent problem-solving skills with the ability to troubleshoot complex issues
Strong communication and collaboration skills, with experience working with cross-functional teams
Preferred
Experience with cloud providers like AWS, GCP or Azure
Knowledge of ML model serving and deployment
Familiarity with CI/CD pipelines and tools like Jenkins, GitLab CI/CD or CircleCI
Experience with monitoring and logging tools like Prometheus, Grafana or ELK Stack
Benefits
Health Insurance: 100% company-paid medical, dental, and vision coverage for employees and families.
401(k) Plan: Up to 4% company match with immediate vesting.
Parental Leave: 20 weeks paid for primary caregivers, 12 weeks for secondary caregivers.
Remote Work Reimbursement: Up to $85/month for mobile and internet.
Disability & Life Insurance: Company-paid short-term, long-term, and life insurance coverage.
Company
Nebius
Cloud platform specifically designed to train AI models
Funding
Current Stage: Public Company
Total Funding: unknown
IPO: 2024-10-21 · NASDAQ: NBIS
Company data provided by Crunchbase