Senior Software Engineer - Model Performance

inference.net · United States · posted 22 hours ago

Inference.net trains and hosts specialized language models for companies that need high-quality AI. It is seeking a Senior Software Engineer to optimize its inference stack, with a focus on performance, efficiency, and cost-effective model serving.

Artificial Intelligence (AI) · Machine Learning · Software

H1B Sponsor Likely

Responsibilities

Implement and productionize optimization techniques including quantization, speculative decoding, KV cache optimization, continuous batching, and LoRA serving
Deep dive into inference frameworks (vLLM, SGLang, TensorRT-LLM) and underlying libraries to debug and improve performance
Profile and optimize CUDA kernels and GPU utilization across our serving infrastructure
Add support for new model architectures, ensuring they meet our performance standards before going to production
Experiment with novel inference techniques and bring successful approaches into production
Build tooling and benchmarks to measure and track inference performance across our fleet
Collaborate with applied ML engineers to ensure trained models can be served efficiently
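The KV cache optimization work described above usually starts from back-of-envelope sizing. As a minimal sketch (the model dimensions below are illustrative assumptions, not tied to any specific deployment), the per-sequence KV cache footprint for a transformer with grouped-query attention is:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Memory for one sequence's KV cache: keys and values (the leading
    factor of 2) stored per layer, per KV head, per token position."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 8B-class model with grouped-query attention and an fp16 cache:
per_seq = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                         seq_len=4096)
print(per_seq / 2**20)  # prints 512.0 (MiB per sequence)
```

Numbers like this are why techniques such as paged KV caches and continuous batching matter: at 512 MiB per 4k-token sequence, naive preallocation exhausts GPU memory after a handful of concurrent requests.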

Qualifications

ML systems · Inference optimization · GPU programming · Python · C++ · LLM inference frameworks · GPU architecture · LLM optimization techniques · PyTorch · CUDA programming · Docker · Kubernetes

Required

2+ years of experience in ML systems, inference optimization, or GPU programming
Strong proficiency in Python and familiarity with C++
Hands-on experience with LLM inference frameworks (vLLM, SGLang, TensorRT-LLM, or similar)
Deep understanding of GPU architecture and experience profiling GPU workloads
Familiarity with LLM optimization techniques (quantization, speculative decoding, continuous batching, KV cache management)
Experience with PyTorch and understanding of how models execute on hardware
Track record of measurably improving system performance
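Of the optimization techniques listed above, quantization is the most self-contained to sketch. Below is a toy pure-Python illustration of symmetric per-tensor int8 round-to-nearest quantization; it is an assumption-laden simplification, not how vLLM or TensorRT-LLM implement their quantization paths:

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor int8 quantization: the largest magnitude maps
    to 127; every weight is rounded to the nearest integer step."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # guard all-zero input
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

w = [0.51, -1.27, 0.003, 0.9]
q, scale = quantize_int8(w)
restored = dequantize(q, scale)
# Round-trip error is bounded by half a quantization step (scale / 2).
assert all(abs(a - b) <= scale / 2 for a, b in zip(w, restored))
```

Production schemes differ in the details (per-channel or per-group scales, calibration, activation quantization, formats like FP8/INT4), but the accuracy-vs-footprint trade-off is the same half-step error bound shown in the final assertion.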

Preferred

Experience with CUDA programming
Familiarity with serving non-LLM models (TTS, vision, embeddings)
Experience with distributed inference and multi-GPU serving
Contributions to open-source inference frameworks
Experience with Docker and Kubernetes

Benefits

Equity in a high-growth startup
Comprehensive benefits

Company

inference.net

Inference.net helps teams ship AI that’s faster, smarter, and dramatically more cost-efficient.

H1B Sponsorship

inference.net has a track record of offering H1B sponsorship. Note that this does not guarantee sponsorship for this specific role. The figures below are provided for reference. (Data powered by the US Department of Labor)
Total sponsorships by year: 2025 (1) · 2023 (1) · 2022 (1) · 2021 (1)

Funding

Current Stage: Early Stage
Total Funding: unknown
2023-05-03: Pre Seed round

Company data provided by Crunchbase