Member of Technical Staff, Exceptional Generalist (Remote) jobs in United States

Inferact · 13 hours ago

Member of Technical Staff, Exceptional Generalist (Remote)

Inferact is on a mission to grow vLLM into the world's AI inference engine and accelerate AI progress. The company is seeking exceptional generalist engineers who can work across the entire vLLM stack, from optimizing GPU kernels to building the distributed systems that power AI inference.

Computer Software
H1B Sponsored

Responsibilities

Work across the entire vLLM stack: from low-level GPU kernels to high-level distributed systems
Optimize CUDA kernels one week, design distributed orchestration systems the next, and implement new model architectures the week after
Push the boundaries of LLM and diffusion model serving
Write the low-level kernels and optimizations that make vLLM the fastest inference engine in the world, running on hundreds of accelerator types
Build the distributed systems that power inference at global scale—design foundational layers enabling vLLM to serve models across thousands of accelerators with minimal latency
Build the operational backbone for cluster management, deployment automation, and production monitoring that enables teams worldwide to serve AI models without friction

Qualifications

GPU/accelerator programming · Distributed systems · ML infrastructure · CUDA kernels · Python with PyTorch · Kubernetes · High-performance systems · Asynchronous communication · Project completion · Technical blogging

Required

Bachelor's degree or equivalent experience in computer science, engineering, or similar
Demonstrated ability to work autonomously and drive projects to completion without close supervision
Excellent asynchronous communication skills and ability to collaborate effectively across time zones
Strong track record of shipping high-impact work in complex technical environments
Deep expertise in at least one of: systems programming, GPU/accelerator programming, distributed systems, or ML infrastructure
Technical depth, strong in at least two of the following:
- CUDA kernels or equivalent (Triton, TileLang, Pallas) with deep understanding of GPU architecture
- High-performance distributed systems in Rust, Go, or C++
- Python with PyTorch internals and LLM inference systems (vLLM, TensorRT-LLM, SGLang)
- Kubernetes, container orchestration, and infrastructure-as-code at scale
- Transformer architectures, KV-cache memory management, and model serving

Preferred

Contributions to vLLM or other major open-source ML/systems projects
Experience with multiple accelerator platforms (NVIDIA, AMD, TPU, Intel)
Knowledge of quantization techniques, ML-specific kernel optimization, or compiler technologies
Track record of improving system reliability and performance at scale
Widely shared technical blog posts or impactful side projects in the ML infrastructure space

Benefits

Health coverage where applicable

Company

Inferact

Inferact is a startup founded by creators and core maintainers of vLLM, the most popular open-source LLM inference engine.

Funding

Current Stage: Early Stage
Company data provided by Crunchbase