Pragmatike · 6 days ago

CUDA Kernel Engineer

Washington, DC

Full-time

Onsite

Mid, Senior Level

Pragmatike is a fast-growing AI startup founded by MIT CSAIL researchers and recognized as a Top 10 GenAI company by GTM Capital. The company is seeking a CUDA Kernel Engineer to develop and optimize NVIDIA CUDA kernels for large-scale AI systems, directly influencing GPU efficiency and performance.

Information Technology · Recruiting · Software

Responsibilities

Design, implement, and optimize custom CUDA kernels for NVIDIA GPUs, with a focus on maximizing occupancy, memory throughput, and warp efficiency

Profile GPU workloads using tools such as Nsight Compute, Nsight Systems, nvprof, and CUDA-MEMCHECK

Analyze and eliminate performance bottlenecks including warp divergence, uncoalesced memory access, register pressure, and PCIe transfer overhead

Improve GPU memory pipelines (global, shared, L2, texture memory) and ensure proper memory coalescing

Collaborate closely with AI systems, model acceleration, and backend distributed systems teams

Contribute to GPU architecture decisions, kernel libraries, and internal performance-engineering best practices

Qualifications

NVIDIA CUDA · GPU architecture · C++ · GPU profiling tools · Performance optimization · Multi-GPU systems · Model inference optimization · Soft skills

Required

Hands-on experience developing and optimizing NVIDIA CUDA kernels from scratch

Deep understanding of NVIDIA GPU architecture, memory hierarchy, warp-level execution, and profiling workflows

Proven track record of building NVIDIA CUDA kernels from scratch, not just calling existing libraries

Strong ability to optimize kernels (tiling strategies, occupancy tuning, shared memory design, warp scheduling)

Deep understanding of CUDA threads, warps, blocks, and grids; the GPU memory hierarchy and memory coalescing; and warp divergence (how to detect, analyze, and mitigate it)

Experience diagnosing PCIe bottlenecks and optimizing host-device transfers (pinned memory, streams, batching, overlap)

Familiarity with C++, CUDA runtime APIs, and GPU debugging/profiling tooling