Solutions Architect, AI Cloud Partner Performance jobs in United States

NVIDIA · 6 hours ago

Solutions Architect, AI Cloud Partner Performance

NVIDIA is a pioneering company in computer graphics and accelerated computing, now leveraging AI for the next era of computing. They are seeking a Solutions Architect to guide partners in adopting Reference Architectures, ensuring high performance, reliability, and security while creating training materials and sharing knowledge with internal teams.

AI Infrastructure · Artificial Intelligence (AI) · Consumer Electronics · Foundational AI · GPU · Hardware · Software · Virtual Reality
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Work closely with NVIDIA Cloud Partners (NCPs) as a compute and networking performance specialist, ensuring they meet high performance standards and accomplish their business goals
Enable NCPs to achieve Exemplar Cloud status through demonstration of performance capabilities with respect to reference benchmarks
Accelerate NCP onboarding time by resolving deviations from reference performance targets
Improve NVIDIA Cloud Partner cluster manageability and reliability by advising customers on how to apply available solutions
Scale knowledge, reach, and opportunities by educating internal teams and communities on NVIDIA Reference Architectures and the Exemplar Cloud program
Communicate feedback from the field to teams creating and maintaining Reference Architectures

Qualifications

Cloud Service Providers · Performance tuning · GPU clusters · Linux administration · Networking fundamentals · Debugging skills · Root cause analysis · Observability tooling

Required

Strong foundational expertise, from a BS, MS, or Ph.D. degree in Engineering, Mathematics, Physics, Computer Science, Data Science (or equivalent experience)
5+ years of proven experience with one or more Cloud Service Providers (AWS, Azure, GCP, or OCI), NCPs (CoreWeave, Lambda Labs, Crusoe, etc.), and cloud-native architectures and software
Experience leading joint debugging and optimization sessions with partners, driving the resolution of distributed training bottlenecks and fabric anomalies
Expertise in performance tuning of RDMA-enabled GPU clusters, including running performance benchmarks and diagnosing performance issues with compute and network tracing tools
Strong coding and outstanding debugging skills. Proven expertise in the following areas: LLM training and inference workloads, Slurm, Kubernetes, MPI, NCCL
Linux-based configuration, management, monitoring, and system administration with proficiency in problem-solving in both bare metal and virtual environments
Understanding of networking fundamentals (e.g. router, firewall, load balancer, DNS, VPN) for high performance infrastructure

Preferred

Ability to perform root cause analysis on distributed training failures using Nsight Systems and NCCL-tests, applying a detailed divide-and-conquer approach to isolate network/fabric issues
Experience running LLM Benchmarks, NCCL-tests, and automating RDMA diagnostic tools
Background in deploying and configuring observability tooling, including Grafana, Prometheus, W&B, Nagios, and Zabbix
Ability to take ownership when resolving cluster downtime or degraded performance with customers

Benefits

Equity
Benefits

Company

NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.

H1B Sponsorship

NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship
Trends of Total Sponsorships
2025 (1877)
2024 (1355)
2023 (976)
2022 (835)
2021 (601)
2020 (529)

Funding

Current Stage
Public Company
Total Funding
$4.09B
Key Investors
ARPA-E · ARK Investment Management · SoftBank Vision Fund
2023-05-09 · Grant · $5M
2022-08-09 · Post-IPO Equity · $65M
2021-02-18 · Post-IPO Equity

Leadership Team

Jensen Huang
Founder and CEO
Michael Kagan
Chief Technology Officer
Company data provided by Crunchbase