Senior Software Engineer, Profiling Services jobs in United States

NVIDIA · 1 month ago

Senior Software Engineer, Profiling Services

NVIDIA is a leading technology company specializing in GPU performance analysis for Machine Learning workloads. The Senior Software Engineer will be responsible for designing, implementing, and leading the Always-On Profiling (AON) service, ensuring high standards in software development and mentoring engineers.

AI Infrastructure, Artificial Intelligence (AI), Consumer Electronics, Foundational AI, GPU, Hardware, Software, Virtual Reality
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Architect and Build Scalable Systems: Drive the design and implementation of the AON profiling service's core systems. You'll master inter-process communication (IPC), memory management, and building low-overhead architectures to handle profiling data from complex multi-node, multi-process, multi-GPU, and cluster environments.
Elevate Software Engineering Excellence: Promote high standards in software development, including design patterns, concurrency, parallelism, and advanced debugging for asynchronous systems. Our commitment to code quality and robust testing ensures a reliable profiling service.
Lead, Mentor, and Innovate: Guide and mentor engineers, provide impactful code reviews, and shape technical roadmaps. Proactively identify complex technical issues within the AON project, break them down, and craft innovative solutions. Your problem-solving prowess will be crucial for AON's success with ML workloads.
Architect and Build High-Performance Platforms: Transform user needs into clear requirements and design documents. Explore diverse approaches to problems, making well-reasoned recommendations. Lead end-to-end feature development, from planning and prototyping to implementation, testing, and customer evaluation. This involves hands-on development across user applications, drivers, performance counter libraries, and lower-level platform/hardware abstraction layers.
Collaborate Across Boundaries: Partner effectively with diverse internal and external teams. Exceptional communication and collaboration skills are key to integrating AON seamlessly into the broader profiling and ML ecosystem.

Qualifications

C++, Python, CUDA, Profiling Technologies, Machine Learning Frameworks, System Software Design, API Design, Technical Leadership, Problem Solving, Communication Skills

Required

BS or MS degree, or equivalent experience, in Computer Engineering, Computer Science, or a related field
8+ years of significant software development experience in C, C++, and Python
12+ years in system software design, operating systems fundamentals, computer architectures, performance analysis, and delivering production-quality software
Strong interpersonal, verbal, and written communication, demonstrating the ability to build cross-organizational partnerships and lead technical teams through complex challenges
Profiling & Performance Tools Expert: Extensive knowledge of profiling technologies (sampling, tracing), overhead analysis, and diverse profiling data (CPU/GPU events, performance counters, API traces, event correlation). Familiarity with existing profiling ecosystems and their limitations is a plus
GPU & CUDA Proficiency: In-depth knowledge of CUDA APIs, runtime, streams, kernels, and GPU architecture
ML Ecosystem & Performance Analysis: Familiarity with ML frameworks such as PyTorch and JAX, and knowledge of performance analysis for AI training/inference applications
Large-Scale System Development & Debugging: Experience developing and debugging across complex multi-layered software systems, including user mode and kernel drivers, with a proven ability to provide exceptional solutions and extend very large codebases (hundreds of millions of lines)
Proficiency in Designing APIs and Interfaces for Profiling Tools: Experience designing robust, flexible APIs and interfaces that enable seamless integration of profiling tools with various frameworks and custom code
Proficiency in Problem Simplification: A history of breaking down ill-defined problems in complex technical domains, crafting effective solutions, and leading teams to implement them

Preferred

Pioneering Low-Overhead Profiling Systems: A track record of designing and implementing profiling systems with minimal performance impact on target workloads, especially in complex multi-process and distributed environments
Deep Understanding of PyTorch Internals & CUDA Usage: A comprehensive grasp of how PyTorch uses CUDA, including tensor memory, operations, and distributed training functionalities
Proficiency in analyzing profiling data and translating it into concrete, actionable insights, especially within CUDA and ML Frameworks like PyTorch
Translating Customer Needs: Skilled at turning customer requests into actionable use cases and requirements
Strong understanding of system security principles

Benefits

Equity
Benefits

Company

NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.

H1B Sponsorship

NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (1877)
2024 (1355)
2023 (976)
2022 (835)
2021 (601)
2020 (529)

Funding

Current Stage
Public Company
Total Funding
$4.09B
Key Investors
ARPA-E, ARK Investment Management, SoftBank Vision Fund
2023-05-09 · Grant · $5M
2022-08-09 · Post-IPO Equity · $65M
2021-02-18 · Post-IPO Equity

Leadership Team

Jensen Huang
Founder and CEO
Michael Kagan
Chief Technology Officer
Company data provided by Crunchbase