Sustainable Talent · 3 months ago
Senior Site Reliability Engineer
Sustainable Talent is supporting NVIDIA by seeking a Senior Site Reliability Engineer to join their Infrastructure, Planning, and Process organization. In this role, you will troubleshoot and manage on-premises infrastructure to ensure reliability for various software engineering teams while monitoring system performance and driving automation of monitoring processes.
Consulting · Human Resources · Information Technology
Responsibilities
Work on systems deployed in NVIDIA's internal cloud, making them available and reliable for our end users
Monitor system performance and troubleshoot issues related to CPU, memory, disk, and network utilization
Provide high-quality user support
Monitor KPIs and make sure that the team's SLAs are met
Manage and maintain production Kubernetes clusters
Drive automation of monitoring to gain more insight into applications and system health
Craft and implement critical metrics using various analytics methods and dashboards
Use AI techniques to extract useful signals about machines and jobs from the generated data
Qualifications
Required
Proven SRE experience in an L1 support role with on-call responsibilities, ideally 5+ years
Proficient in troubleshooting Linux OS issues, such as SSH and performance problems
Experience troubleshooting networking issues such as DNS and DHCP, and familiarity with networking principles and protocols, including TCP/IP and VLANs
Hands-on experience with monitoring and alerting tools such as Prometheus, Grafana, Elastic, or similar
Strong understanding and practical experience with REST API calls
Proficiency in basic scripting; familiarity with Python or a similar programming language is a plus
Knowledge of Ansible roles and playbooks, Jenkins CI/CD processes, and deployment experience with Kubernetes
Experience with the Kickstart process for automated Linux installations
Experience managing and troubleshooting Linux systems, as well as managing systems in data centers, using tools like BMC (Redfish), KVM, and IPMI
Background in databases, including SQL databases such as MySQL and time-series databases such as Prometheus
Experience with data analytics and visualization tools like Kibana, Grafana, and Splunk
Proficient with source code management and binary repository systems like GitLab, GitHub, Artifactory, and Perforce
Advanced knowledge of standard security practices
Bachelor's degree in Computer Science, Information Technology, or related field, or equivalent experience
Preferred
Working knowledge of OpenStack
Previous experience managing NVIDIA hardware such as GPUs and Tegras
Prior experience with large-scale operations teams
Experience managing Windows server infrastructure
Outstanding interpersonal skills and ability to communicate effectively with all levels of management
Ability to analyze complex problems, design simple systems that function efficiently with minimal support, and thrive in a multi-tasking environment with evolving priorities
Benefits
Full benefits
PTO
Amazing company culture
Company
Sustainable Talent
Sustainable Talent provides staffing, consulting and outsourcing services.