Nebius · 2 days ago
Senior Support Engineer L2
Responsibilities
Diagnose and resolve escalated issues across Linux, networking, Kubernetes and data storage, minimizing downtime.
Lead complex troubleshooting efforts and document solutions for use across teams.
Apply advanced Linux skills for efficient OS management and problem resolution.
Utilize in-depth networking knowledge to troubleshoot and optimize network configurations.
Manage containerized applications within Kubernetes environments, handling complex deployments and ensuring service continuity.
Use advanced Python and Bash scripting to automate tasks, streamline workflows, and improve team efficiency.
Demonstrate deep understanding of data storage concepts to diagnose storage issues and optimize data management practices.
Lead, mentor, and develop a support team of 5+ engineers, sharing technical knowledge and best practices.
Collaborate with internal teams and provide guidance to L1 support to enhance overall service quality.
Foster a supportive team environment, promote continuous learning, and drive efficiency.
Ensure clear, professional updates to customers, explaining complex issues in a user-friendly way.
Oversee escalations to higher-level support or engineering teams, ensuring adherence to escalation protocols.
Create, update and oversee technical documentation, troubleshooting guides and knowledge base articles.
Identify recurring issues, recommend improvements, and implement best practices to enhance service reliability and team efficiency.
Qualifications
Required
7+ years in technical support with advanced skills in Linux and networking; experience managing and mentoring a support team of 5+ engineers.
Advanced expertise in Linux administration and troubleshooting.
Strong networking knowledge, including protocols, IP configurations and diagnostics.
Knowledge of Docker (for packaging ML workflows) and Kubernetes (for scaling and managing GPU workloads in cloud environments).
Proficient in Python and Bash for complex automation and task management.
In-depth understanding of data storage principles, types and management.
An understanding of how GPUs accelerate ML workloads.
The ability to assist with resource provisioning, scaling, and integration within ML workflows.
Familiarity with CUDA, Tensor Cores, and distributed training across multiple GPUs.
The ability to troubleshoot memory errors, driver/library mismatches, and GPU utilization bottlenecks.
The ability to debug common errors during model training (e.g., OOM errors, version compatibility issues).
Preferred
Bachelor’s degree in Computer Science, Information Technology or a related field.
Company
Nebius
Cloud platform specifically designed to train AI models
Funding
Current Stage: Public Company
Total Funding: $700M
2024-12-02: Post-IPO Equity, $700M
2024-10-21: IPO
Company data provided by Crunchbase