Sustainable Talent · 3 months ago

Senior Site Reliability Engineer

Santa Clara, CA

Full-time

Onsite

Senior Level

$75/hr - $90/hr

5+ years exp

On behalf of NVIDIA, Sustainable Talent is seeking a Senior Site Reliability Engineer to join NVIDIA's Infrastructure, Planning, and Process organization. In this role, you will manage and troubleshoot on-premises infrastructure to keep it reliable for multiple software engineering teams, monitor system performance, and drive the automation of monitoring processes.

Consulting · Human Resources · Information Technology

Growth Opportunities

Responsibilities

Work on systems deployed in NVIDIA's internal cloud, keeping them available and reliable for end users

Monitor system performance and troubleshoot issues related to CPU, memory, disk, and network utilization

Provide high-quality user support

Monitor KPIs and ensure that the team's SLAs are met

Manage and maintain production Kubernetes clusters

Drive automation of monitoring to gain more insight into applications and system health

Craft and implement critical metrics using various analytics methods and dashboards

Reuse AI techniques to extract useful signals about machines and jobs from the data generated

Qualifications

Site Reliability Engineering · Kubernetes · Linux OS troubleshooting · Monitoring tools · REST API · Scripting (Python) · Ansible · CI/CD (Jenkins) · Data analytics tools · Interpersonal skills · Problem-solving

Required

Proven SRE experience in an L1 support role with on-call responsibilities, ideally 5+ years

Proficient in troubleshooting Linux OS issues such as SSH and performance

Experience troubleshooting networking issues such as DNS and DHCP, plus familiarity with networking principles and protocols, including TCP/IP and VLANs

Hands-on experience with monitoring and alerting tools such as Prometheus, Grafana, Elastic, or similar

Strong understanding and practical experience with REST API calls

Proficiency in basic scripting; familiarity with Python or a similar programming language is a plus

Knowledge of Ansible roles and playbooks, Jenkins CI/CD processes, and deployment experience with Kubernetes

Experience with the Kickstart process for automated Linux installations

Experience managing and troubleshooting Linux systems, as well as managing systems in data centers, using tools like BMC (Redfish), KVM, and IPMI

Background in SQL databases such as MySQL and time-series databases such as Prometheus

Experience with data analytics and visualization tools like Kibana, Grafana, and Splunk

Proficient with source code management and binary repository systems like GitLab, GitHub, Artifactory, and Perforce

Advanced knowledge of security best practices

Bachelor's degree in Computer Science, Information Technology, or related field, or equivalent experience

Preferred

Working knowledge of OpenStack

Previous experience managing NVIDIA hardware such as GPUs and Tegras

Prior experience with large scale operations teams

Experience managing Windows server infrastructure

Outstanding interpersonal skills and ability to communicate effectively with all levels of management

Ability to analyze complex problems, design simple systems that function efficiently with minimal support, and thrive in a multi-tasking environment with evolving priorities