SF Tensor · 1 week ago

Founding GPU Kernel Engineer

SF Tensor builds software and infrastructure for modern AI and high-performance computing. The company is seeking a Founding GPU Kernel Engineer to hand-optimize GPU kernels for machine learning workloads and to develop automated compiler passes that improve performance across a range of GPU architectures.

Artificial Intelligence (AI) · Cloud Computing · Machine Learning · Software

Responsibilities

Write and hand-optimize GPU kernels for ML workloads (matmuls, attention, normalization, etc.) to set the performance ceilings
Profile at the microarchitectural level: look into SM utilization, warp stalls, memory bank conflicts, register pressure, instruction throughput
Debug performance issues by digging deep into things like clock speeds, thermal throttling, driver behavior, hardware errata
Turn your hand-optimization insights into automated compiler passes (working closely with our compiler team)
Develop performance models that predict how kernels will behave across different GPU architectures
Build tools and methods for systematic kernel optimization
Work with NVIDIA, AMD, and emerging AI accelerators, understanding what is common across vendors and what is vendor-specific
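To make the first responsibility concrete, here is a minimal sketch (illustrative only, not SF Tensor's code) of the kind of kernel this role would start from before tuning for bank conflicts, register pressure, and occupancy: a shared-memory tiled SGEMM. The kernel name, tile size, and the assumption that matrix dimensions are multiples of the tile size are all choices made for this example.

```cuda
#include <cuda_runtime.h>

#define TILE 32  // tile size chosen for illustration; real tuning is per-architecture

// C = A * B, row-major; M x K times K x N.
// Assumes M, N, K are multiples of TILE to keep the sketch short.
__global__ void sgemm_tiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < K; t += TILE) {
        // Cooperative loads into shared memory, coalesced along threadIdx.x
        As[threadIdx.y][threadIdx.x] = A[row * K + t + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();

        // Inner product over the tile held in shared memory
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```

A kernel like this is roughly where hand-optimization begins; closing the gap to cuBLAS then involves register tiling, vectorized loads, double buffering, and inspecting the generated SASS, which is exactly the profiling and microarchitectural work the role describes.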

Qualifications

GPU architecture · C++ · CUDA · Low-level profiling tools · Kernel optimization · PTX/SASS · Distributed training systems · ML operations mapping · HPC background · Driver development · MLIR · Publications in GPU optimization

Required

Deep expertise in GPU architecture
Proven track record of hand-writing kernels that match or beat vendor libraries (cuBLAS, cuDNN, CUTLASS)
Strong skills with low-level profiling tools: Nsight Compute, Nsight Systems, rocprof, or equivalents
Experience reading and reasoning about PTX/SASS or GPU assembly
Solid systems programming in C++ and CUDA (or ROCm/HIP)
Good understanding of how high-level ML operations map to hardware execution
Experience with distributed training systems: collective ops like all-reduce and all-gather, NCCL/RCCL, multi-node communication patterns

Preferred

HPC background: experience with large-scale scientific computing, MPI, or work in supercomputing
Background in electrical engineering, computer architecture, or hardware design
Driver development experience (NVIDIA, AMD, or other accelerators)
Experience with MLIR, LLVM, or compiler backends
Deep knowledge of distributed ML training: gradient accumulation, activation checkpointing, pipeline/tensor parallelism, ZeRO-style optimizations
Familiarity with custom accelerators: TPUs, Trainium, Inferentia, or similar
Knowledge of high-speed interconnects: NVLink, NVSwitch, InfiniBand, RoCE
Publications or contributions in GPU optimization, HPC, or ML systems
Experience at NVIDIA, AMD, a national lab, or an AI hardware/infrastructure company

Benefits

Bonus
Equity
Benefits
Relocation assistance

Company

SF Tensor

The San Francisco Tensor Company is reinventing the software and infrastructure stack for modern AI and HPC.

Funding

Current Stage
Early Stage
Total Funding
$0.5M
Key Investors
Y Combinator
2025-10-01 · Pre Seed · $0.5M
Company data provided by Crunchbase