Professional Services Engineer @ Nebius | Jobright.ai

Nebius · 1 day ago

Professional Services Engineer

Cloud Infrastructure, GPU
Growth Opportunities
Hiring Manager
Austin McQuay

Responsibilities

Design and implement distributed ML training and inference workflows: develop and maintain scalable, efficient and reliable ML training pipelines on K8s and Slurm, leveraging containerization (e.g. Docker) and orchestration (e.g. K8s).
Optimize ML training performance: collaborate with data scientists and engineers to optimize ML model training and inference performance.
Develop and contribute to the training and inference Solutions Library: design, deploy and manage K8s and Slurm clusters for large-scale ML training, leveraging our ready-to-deploy solutions.
Integrate with ML frameworks: integrate K8s and Slurm with popular ML frameworks like TensorFlow, PyTorch or MXNet, ensuring seamless execution of distributed ML training workloads.
Monitor and troubleshoot distributed training: develop monitoring and logging tools to track distributed training performance, identify bottlenecks and troubleshoot issues.
Develop automation scripts and tools: create automation scripts and tools to streamline ML training workflows, leveraging technologies like Ansible, Terraform or Python.
Stay up-to-date with industry trends: participate in industry conferences, meetups and online forums to stay up-to-date with the latest developments in MLOps, K8s, Slurm and ML.

Qualifications

MLOps, K8s, Slurm, Python, TensorFlow, PyTorch, MXNet, Docker, Ansible, Terraform, Distributed computing, AWS, GCP, Azure, CI/CD, Jenkins, GitLab CI/CD, CircleCI, Prometheus, Grafana, ELK Stack, Collaboration skills

Required

3+ years of experience in MLOps, DevOps or a related field
Strong experience with K8s and containerization (e.g. Docker)
Experience with Slurm or other distributed computing frameworks
Proficiency in Python, with experience in ML frameworks like TensorFlow, PyTorch or MXNet
Strong understanding of distributed computing concepts, including parallel processing and job scheduling
Experience with automation tools like Ansible, Terraform or Python
Excellent problem-solving skills with the ability to troubleshoot complex issues
Strong communication and collaboration skills, with experience working with cross-functional teams

Preferred

Experience with cloud providers like AWS, GCP or Azure
Knowledge of ML model serving and deployment
Familiarity with CI/CD pipelines and tools like Jenkins, GitLab CI/CD or CircleCI
Experience with monitoring and logging tools like Prometheus, Grafana or ELK Stack

Benefits

Health Insurance: 100% company-paid medical, dental, and vision coverage for employees and families.
401(k) Plan: Up to 4% company match with immediate vesting.
Parental Leave: 20 weeks paid for primary caregivers, 12 weeks for secondary caregivers.
Remote Work Reimbursement: Up to $85/month for mobile and internet.
Disability & Life Insurance: Company-paid short-term, long-term, and life insurance coverage.

Company

Nebius

Cloud platform specifically designed to train AI models

Funding

Current Stage
Public Company
Total Funding
unknown
2024-10-21 · IPO · NASDAQ: NBIS
Company data provided by crunchbase