
Featherless AI · 17 hours ago

Machine Learning Engineer — Inference Optimization

Featherless AI is seeking a Machine Learning Engineer to optimize model inference performance at scale. The role involves profiling systems, implementing optimization techniques, and collaborating with research engineers to improve production-grade performance.

Artificial Intelligence (AI) · Cloud Computing · Database
H1B Sponsor Likely

Responsibilities

Optimize inference latency, throughput, and cost for large-scale ML models in production
Profile and bottleneck GPU/CPU inference pipelines (memory, kernels, batching, IO)
Implement and tune techniques such as:
  - Quantization (fp16, bf16, int8, fp8)
  - KV-cache optimization & reuse
  - Speculative decoding, batching, and streaming
  - Model pruning or architectural simplifications for inference
Collaborate with research engineers to productionize new model architectures
Build and maintain inference-serving systems (e.g. Triton, custom runtimes, or bespoke stacks)
Benchmark performance across hardware (NVIDIA / AMD GPUs, CPUs) and cloud setups
Improve system reliability, observability, and cost efficiency under real workloads
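As an aside (not part of the posting): the first technique in the list above, quantization, can be illustrated with a minimal symmetric int8 round-trip in plain Python. This is a toy sketch of the underlying arithmetic only; production inference work would use fp8/int8 kernels through a framework such as PyTorch or TensorRT.

```python
# Toy sketch of symmetric int8 post-training quantization:
# map floats to int8 codes with one shared scale, then recover
# approximate floats and measure the round-trip error.

def quantize_int8(values):
    """Quantize floats to int8 codes in [-127, 127] with a symmetric scale."""
    scale = max(abs(v) for v in values) / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float values from int8 codes."""
    return [c * scale for c in codes]

weights = [0.42, -1.27, 0.003, 0.98, -0.55]
codes, scale = quantize_int8(weights)
recovered = dequantize(codes, scale)
max_err = max(abs(w, ) if False else abs(w - r) for w, r in zip(weights, recovered))
print(codes)
print(f"max round-trip error: {max_err:.5f}")
```

The largest-magnitude weight pins the scale, so outliers directly limit precision for the rest of the tensor; this is why real quantization schemes use per-channel or per-block scales rather than one global scale.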

Qualifications

ML inference optimization · Deep learning internals · PyTorch · GPU performance tuning · Inference frameworks · Scaling inference · Open-source contributions · Distributed systems · Low-latency services

Required

Strong experience in ML inference optimization or high-performance ML systems
Solid understanding of deep learning internals (attention, memory layout, compute graphs)
Hands-on experience with PyTorch (or similar) and model deployment
Familiarity with GPU performance tuning (CUDA, ROCm, Triton, or kernel-level optimizations)
Experience scaling inference for real users (not just research benchmarks)
Comfortable working in fast-moving startup environments with ownership and ambiguity

Preferred

Experience with LLM or long-context model inference
Knowledge of inference frameworks (TensorRT, ONNX Runtime, vLLM, Triton)
Experience optimizing across different hardware vendors
Open-source contributions in ML systems or inference tooling
Background in distributed systems or low-latency services

Benefits

Competitive compensation + meaningful equity at Series A

Company

Featherless AI

We enable serverless inference via our GPU orchestration and model load-balancing system.

H1B Sponsorship

Featherless AI has a track record of offering H1B sponsorship. Note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor.)
Total sponsorships: 1 in 2025.

Funding

Current Stage
Early Stage
Total Funding
$5M
Key Investors
Airbus Ventures
2025-10-31 · Series A
2025-03-17 · Seed · $5M
Company data provided by Crunchbase