Senior Software Engineer - Model Performance

inference.net · United States · posted 22 hours ago

Inference.net trains and hosts specialized language models for companies that need high-quality AI. It is seeking a Senior Software Engineer to optimize its inference stack, with a focus on performance, efficiency, and cost-effective model serving.

Artificial Intelligence (AI) · Machine Learning · Software

H1B Sponsor Likely

Responsibilities

Implement and productionize optimization techniques including quantization, speculative decoding, KV cache optimization, continuous batching, and LoRA serving
Deep dive into inference frameworks (vLLM, SGLang, TensorRT-LLM) and underlying libraries to debug and improve performance
Profile and optimize CUDA kernels and GPU utilization across our serving infrastructure
Add support for new model architectures, ensuring they meet our performance standards before going to production
Experiment with novel inference techniques and bring successful approaches into production
Build tooling and benchmarks to measure and track inference performance across our fleet
Collaborate with applied ML engineers to ensure trained models can be served efficiently
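The KV cache optimization work described above usually starts from back-of-envelope sizing. As a minimal sketch (the model dimensions below are illustrative assumptions, not tied to any specific deployment), the per-sequence KV cache footprint for a transformer with grouped-query attention is:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Memory for one sequence's KV cache: keys and values (the leading
    factor of 2) stored per layer, per KV head, per token position."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 8B-class model with grouped-query attention and an fp16 cache:
per_seq = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                         seq_len=4096)
print(per_seq / 2**20)  # prints 512.0 (MiB per sequence)
```

Numbers like this are why techniques such as paged KV caches and continuous batching matter: at 512 MiB per 4k-token sequence, naive preallocation exhausts GPU memory after a handful of concurrent requests.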

Qualifications

ML systems · Inference optimization · GPU programming · Python · C++ · LLM inference frameworks · GPU architecture · LLM optimization techniques · PyTorch · CUDA programming · Docker · Kubernetes

Required

2+ years of experience in ML systems, inference optimization, or GPU programming
Strong proficiency in Python and familiarity with C++
Hands-on experience with LLM inference frameworks (vLLM, SGLang, TensorRT-LLM, or similar)
Deep understanding of GPU architecture and experience profiling GPU workloads
Familiarity with LLM optimization techniques (quantization, speculative decoding, continuous batching, KV cache management)
Experience with PyTorch and understanding of how models execute on hardware
Track record of measurably improving system performance
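Of the optimization techniques listed above, quantization is the most self-contained to sketch. Below is a toy pure-Python illustration of symmetric per-tensor int8 round-to-nearest quantization; it is an assumption-laden simplification, not how vLLM or TensorRT-LLM implement their quantization paths:

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor int8 quantization: the largest magnitude maps
    to 127; every weight is rounded to the nearest integer step."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # guard all-zero input
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

w = [0.51, -1.27, 0.003, 0.9]
q, scale = quantize_int8(w)
restored = dequantize(q, scale)
# Round-trip error is bounded by half a quantization step (scale / 2).
assert all(abs(a - b) <= scale / 2 for a, b in zip(w, restored))
```

Production schemes differ in the details (per-channel or per-group scales, calibration, activation quantization, formats like FP8/INT4), but the accuracy-vs-footprint trade-off is the same half-step error bound shown in the final assertion.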

Preferred

Experience with CUDA programming
Familiarity with serving non-LLM models (TTS, vision, embeddings)
Experience with distributed inference and multi-GPU serving
Contributions to open-source inference frameworks
Experience with Docker and Kubernetes

Benefits

Equity in a high-growth startup
Comprehensive benefits

Company

inference.net

Inference.net helps teams ship AI that’s faster, smarter, and dramatically more cost-efficient.

H1B Sponsorship

inference.net has a track record of offering H1B sponsorship. Note that this does not guarantee sponsorship for this specific role. The figures below are provided for reference. (Data powered by the US Department of Labor)
Total sponsorships by year: 2025 (1) · 2023 (1) · 2022 (1) · 2021 (1)

Funding

Current Stage: Early Stage
Total Funding: unknown
2023-05-03: Pre Seed round

Company data provided by Crunchbase