SF Tensor
Founding GPU Kernel Engineer
SF Tensor builds software and infrastructure for modern AI and high-performance computing. They are seeking a Founding GPU Kernel Engineer to hand-optimize GPU kernels for machine-learning workloads and to develop automated compiler passes that improve performance across GPU architectures.
Artificial Intelligence (AI) · Cloud Computing · Machine Learning · Software
Responsibilities
Write and hand-optimize GPU kernels for ML workloads (matmuls, attention, normalization, etc.) that set the performance ceiling
Profile at the microarchitectural level: look into SM utilization, warp stalls, memory bank conflicts, register pressure, instruction throughput
Debug performance issues by digging deep into things like clock speeds, thermal throttling, driver behavior, hardware errata
Turn your hand-optimization insights into automated compiler passes (working closely with our compiler team)
Develop performance models that predict how kernels will behave across different GPU architectures
Build tools and methods for systematic kernel optimization
Work with NVIDIA, AMD, and emerging AI accelerators, learning what is common across vendors and what is vendor-specific
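The performance-modeling work described above is often framed in roofline terms: compare a kernel's arithmetic intensity against the machine's compute/bandwidth balance point. A minimal sketch of that idea; the peak-throughput and bandwidth constants are illustrative assumptions, not any vendor's specs, and `matmul_roofline` is a hypothetical helper, not part of any toolkit:

```python
# Roofline-style estimate for a square FP16 matmul (C = A @ B, all N x N).
# Hardware numbers below are illustrative assumptions, not vendor specs.
PEAK_FLOPS = 300e12   # assumed peak tensor-core throughput, FLOP/s
PEAK_BW = 2e12        # assumed HBM bandwidth, bytes/s

def matmul_roofline(n: int, bytes_per_elem: int = 2) -> dict:
    """Predict whether an n x n x n matmul is compute- or memory-bound."""
    flops = 2 * n ** 3                       # one multiply + one add per MAC
    traffic = 3 * n * n * bytes_per_elem     # read A, read B, write C (ideal caching)
    intensity = flops / traffic              # FLOP per byte moved
    ridge = PEAK_FLOPS / PEAK_BW             # machine balance point, FLOP/byte
    bound = "compute" if intensity >= ridge else "memory"
    time_s = max(flops / PEAK_FLOPS, traffic / PEAK_BW)
    return {"intensity": intensity, "bound": bound, "time_s": time_s}
```

Under these assumptions a large matmul (e.g. n = 4096, intensity n/3 ≈ 1365 FLOP/byte) sits well past the ridge point and is compute-bound, while a small one (n = 64) is memory-bound, which is why small-batch ML kernels live or die on memory-system behavior.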
Qualifications
Required
Deep expertise in GPU architecture
Proven track record of hand-writing kernels that match or beat vendor libraries (cuBLAS, cuDNN, CUTLASS)
Strong skills with low-level profiling tools: Nsight Compute, Nsight Systems, rocprof, or equivalents
Experience reading and reasoning about PTX/SASS or GPU assembly
Solid systems programming in C++ and CUDA (or ROCm/HIP)
Good understanding of how high-level ML operations map to hardware execution
Experience with distributed training systems: collective ops like all-reduce and all-gather, NCCL/RCCL, multi-node communication patterns
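The collective ops named above follow well-known communication patterns; ring all-reduce, for instance, is a reduce-scatter followed by an all-gather around a ring of ranks. A toy single-process simulation of that data flow (each "rank" is just a Python list; no real networking, purely illustrative of the pattern libraries like NCCL implement):

```python
# Toy single-process simulation of ring all-reduce (sum).
# Each "rank" is a Python list; this only illustrates the data flow.

def ring_all_reduce(buffers):
    """In place: every rank ends holding the elementwise sum of all buffers."""
    world = len(buffers)
    n = len(buffers[0])
    assert n % world == 0, "buffer length must divide evenly into chunks"
    chunk = n // world

    def span(c):  # index range covered by chunk c
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. After world-1 steps, rank r holds the
    # fully reduced chunk (r + 1) % world.
    for step in range(world - 1):
        for r in range(world):
            c = (r - step) % world       # chunk rank r forwards this step
            nxt = (r + 1) % world
            for i in span(c):
                buffers[nxt][i] += buffers[r][i]

    # Phase 2: all-gather. Completed chunks circulate around the ring
    # until every rank has every fully reduced chunk.
    for step in range(world - 1):
        for r in range(world):
            c = (r + 1 - step) % world
            nxt = (r + 1) % world
            for i in span(c):
                buffers[nxt][i] = buffers[r][i]
```

Each rank sends and receives only 2·(world−1)·(n/world) elements, which is why this pattern is bandwidth-optimal for large buffers regardless of ring size.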
Preferred
HPC background: experience with large-scale scientific computing, MPI, or work in supercomputing
Background in electrical engineering, computer architecture, or hardware design
Driver development experience (NVIDIA, AMD, or other accelerators)
Experience with MLIR, LLVM, or compiler backends
Deep knowledge of distributed ML training: gradient accumulation, activation checkpointing, pipeline/tensor parallelism, ZeRO-style optimizations
Familiarity with custom accelerators: TPUs, Trainium, Inferentia, or similar
Knowledge of high-speed interconnects: NVLink, NVSwitch, InfiniBand, RoCE
Publications or contributions in GPU optimization, HPC, or ML systems
Experience at NVIDIA, AMD, a national lab, or an AI hardware/infrastructure company
Benefits
Bonus
Equity
Benefits
Relocation assistance
Company
SF Tensor
The San Francisco Tensor Company is reinventing the software and infrastructure stack for modern AI and HPC.
Funding
Current Stage: Early Stage
Total Funding: $0.5M
Key Investors: Y Combinator
Pre Seed (2025-10-01) · $0.5M
Company data provided by Crunchbase