NVIDIA
Site Reliability Engineer, HPC and LSF
NVIDIA has been a leader in computer graphics and accelerated computing for over 25 years, now focusing on AI to shape the future of computing. As a Site Reliability Engineer, you will lead the design and implementation of high-performance compute clusters, ensuring their reliability and efficiency while improving engineering productivity through automation.
AI Infrastructure · Artificial Intelligence (AI) · Consumer Electronics · Foundational AI · GPU · Hardware · Software · Virtual Reality
Responsibilities
Troubleshoot incoming support requests in a large-scale HPC environment
Contribute enhancements to existing deployment automation, configuration management, observability, and operational monitoring, and streamline day-to-day operations through automation
Ensure compute servers are running the correct operating system and configuration
Troubleshoot Complex Issues: Perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency
Collaborate with specialist teams to drive issues to closure
Collaborate with domain experts to improve how our chip development process utilizes our infrastructure
Directly contribute to the overall quality of our next-generation chips and improve their time to market
Qualifications
Required
Proficient in administering CentOS/RHEL Linux distributions
Understanding of container technologies like Docker
Proficiency in Python and UNIX scripting languages such as bash
Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions
Excellent communication and teamwork skills, with the ability to work effectively with diverse teams and individuals
BS in Computer Science or a similar degree (or equivalent experience), with 2+ years of relevant post-degree experience
Solid understanding of cluster configuration management tools such as Ansible
Preferred
Understanding of key Linux technologies such as NFS, automounter, LDAP, DNS, and TCP/IP networking in Red Hat Linux distribution flavors
Familiarity with job scheduler administration (e.g., IBM Spectrum LSF or SLURM) and experience building and operating large-scale compute infrastructure
Knowledge of the FlexLM license management system
Proficiency in Perl for maintaining legacy automation scripts
Familiarity with high-speed networking (InfiniBand, RDMA, RoCE, etc.) and fast, distributed storage systems (Lustre, GPFS, etc.)
Benefits
Equity
Benefits
Company
NVIDIA
NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.
H1B Sponsorship
NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (1877)
2024 (1355)
2023 (976)
2022 (835)
2021 (601)
2020 (529)
Funding
Current Stage: Public Company
Total Funding: $4.09B
Key Investors: ARPA-E, ARK Investment Management, SoftBank Vision Fund
2023-05-09 · Grant · $5M
2022-08-09 · Post-IPO Equity · $65M
2021-02-18 · Post-IPO Equity
Company data provided by Crunchbase