smallest.ai · 4 hours ago

Senior GPU Optimisation Engineer | San Francisco

San Francisco, CA

Full-time

Onsite

Mid, Senior Level

$200K/yr - $300K/yr

3+ years exp

smallest.ai is seeking a Senior GPU Optimization Engineer who has a deep understanding of GPUs and can optimize model architectures for real-time performance. The role involves working on CUDA kernels, model graph optimizations, and tuning models across various GPU architectures to enhance the performance of real-time speech models.

Artificial Intelligence (AI) · Generative AI · Information Technology · SaaS · Software

Responsibilities

Optimize model architectures (ASR, TTS, SLMs) for maximum performance on specific GPU hardware

Profile models end-to-end to identify GPU bottlenecks — memory bandwidth, kernel launch overhead, fusion opportunities, quantization constraints

Design and implement custom kernels (CUDA/Triton/tinygrad) for performance-critical model sections

Perform operator fusion, graph optimization, and kernel-level scheduling improvements

Tune models to fit GPU memory limits while maintaining quality

Benchmark and calibrate inference across NVIDIA, AMD, and potentially emerging accelerators

Port models across GPU chipsets (NVIDIA → AMD / edge GPUs / new compute backends)

Work with TensorRT, ONNX Runtime, and custom runtimes for deployment

Partner with the research and infra teams to ensure the entire stack is optimized for real-time workloads

Qualifications

GPU architecture · CUDA · Model optimization · PyTorch · TensorRT · ONNX Runtime · Kernel fusion · Profiling tools · Quantization · Audio/speech models · Problem-solving

Required

Strong understanding of GPU architecture — SMs, warps, memory hierarchy, occupancy tuning

Hands-on experience with CUDA, kernel writing, and kernel-level debugging

Experience with kernel fusion and model graph optimizations

Familiarity with TensorRT, ONNX, Triton, tinygrad, or similar inference engines

Strong proficiency in PyTorch and Python

Deep understanding of model architectures (transformers, convs, RNNs, attention, diffusion blocks)

Experience profiling GPU workloads using Nsight, nvprof, or similar tools

Strong problem-solving abilities with a performance-first mindset

3-5 years of specialized experience in GPU optimization, gained in academia or industry

Master's or PhD in GPU programming or a related field

Preferred

Experience with quantization (INT8, FP8, hybrid formats)

Experience with audio/speech models (ASR, TTS, SSL, vocoders)

Contributions to open-source GPU stacks or inference runtimes

Published work related to systems-level model optimization

Company

smallest.ai

Smallest.ai is a software development company building voice AI foundation models for enterprise deployment, sales, and support.

Founded in 2023

San Francisco, California, USA

2-10 employees

https://smallest.ai

Funding

Current Stage

Early Stage

Total Funding

$9M

Key Investors

Amazon Web Services · Sierra Ventures

2025-10-09 · Non-Equity Assistance · $1M

2025-09-22 · Seed · $8M

2025-03-20 · Pre-Seed

Leadership Team

Apoorv Sood

Global Head of Go-To-Market

Recent News

Inc42 Media

How Smallest.ai Is Leveraging Small Models To Fix Voice AI’s Latency Problem

2026-01-25

EIN Presswire

Calsoft pushes for Gen AI in software development automation

2026-01-22

Inc42 Media

2025 In Review: The Best Of Inc42’s 30 Startups To Watch

2026-01-03

Company data provided by Crunchbase