Site Reliability Engineer, HPC and LSF jobs in United States

NVIDIA · 4 hours ago

Site Reliability Engineer, HPC and LSF

NVIDIA has been a leader in computer graphics and accelerated computing for over 25 years, now focusing on AI to shape the future of computing. As a Site Reliability Engineer, you will lead the design and implementation of high-performance compute clusters, ensuring their reliability and efficiency while improving engineering productivity through automation.

AI Infrastructure, Artificial Intelligence (AI), Consumer Electronics, Foundational AI, GPU, Hardware, Software, Virtual Reality
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Troubleshoot incoming support requests in a large-scale HPC environment
Contribute enhancements to existing deployment automation, configuration management, observability, and operational monitoring, and streamline day-to-day operations through automation
Ensure compute servers are running the correct operating system and configuration
Troubleshoot Complex Issues: Perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency
Collaborate with specialist teams to drive issues to closure
Collaborate with domain experts to improve how our chip development process utilizes our infrastructure
Directly contribute to the overall quality of our next-generation chips and improve their time to market

Qualifications

CentOS/RHEL Linux, Container technologies, Python, Cluster configuration management, UNIX scripting, Job scheduler administration, FlexLM license management, High-Speed Networking, Perl, Problem-solving, Communication skills

Required

Proficiency in administering CentOS/RHEL Linux distributions
Understanding of container technologies like Docker
Proficiency in Python and UNIX scripting languages such as bash
Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions
Excellent communication and teamwork skills, with the ability to work effectively with diverse teams and individuals
BS in Computer Science or a similar degree (or equivalent experience), with 2+ years of relevant post-degree experience
Solid understanding of cluster configuration management tools such as Ansible

Preferred

Understanding of key Linux technologies such as NFS, automounter, LDAP, DNS, and TCP/IP networking in Red Hat Linux distribution flavors
Familiarity with job scheduler administration (e.g. IBM Spectrum LSF or SLURM) and experience building and operating large-scale compute infrastructure
Knowledge of the FlexLM license management system
Proficiency in Perl for maintaining legacy automation scripts
Familiarity with high-speed networking (InfiniBand, RDMA, RoCE, etc.) and fast, distributed storage systems (Lustre, GPFS, etc.)

Benefits

Equity
Benefits

Company

NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.

H1B Sponsorship

NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (1877)
2024 (1355)
2023 (976)
2022 (835)
2021 (601)
2020 (529)

Funding

Current Stage: Public Company
Total Funding: $4.09B
Key Investors: ARPA-E, ARK Investment Management, SoftBank Vision Fund
2023-05-09 · Grant · $5M
2022-08-09 · Post-IPO Equity · $65M
2021-02-18 · Post-IPO Equity

Leadership Team

Jensen Huang, Founder and CEO
Michael Kagan, Chief Technology Officer
Company data provided by Crunchbase