
Prime Intellect · 1 month ago

Member of Technical Staff - Inference

Prime Intellect is building the open superintelligence stack, facilitating the creation, training, and deployment of advanced AI models. The role focuses on optimizing and serving large language models (LLMs) efficiently at scale, integrating them into reinforcement learning systems, and enhancing the overall infrastructure for AI development.

Artificial Intelligence (AI) · Cloud Computing
H1B Sponsored

Responsibilities

Build a multi-tenant LLM serving platform that operates across our cloud GPU fleets
Design placement and scheduling algorithms for heterogeneous accelerators
Implement multi-region/zone failover and traffic shifting for resilience and cost control
Build autoscaling, routing, and load balancing to meet throughput/latency SLOs
Optimize model distribution and cold-start times across clusters
Integrate and contribute to LLM inference frameworks such as vLLM, SGLang, TensorRT‑LLM
Optimize configurations for tensor/pipeline/expert parallelism, prefix caching, memory management and other axes for maximum performance
Profile kernels, memory bandwidth and transport; apply techniques such as quantization and speculative decoding
Develop reproducible performance suites (latency, throughput, context length, batch size, precision)
Embed and optimize distributed inference within our RL stack
Establish CI/CD with artifact promotion, performance gates, and reproducible builds
Build metrics, logging, and tracing; establish structured incident response and SLO management
Document architectures, playbooks, and API contracts; mentor and collaborate cross‑functionally

Qualifications

Building ML Systems · Inference Backends · Distributed Serving Infra · Full-Stack Debugging · Python · PyTorch · Cloud & Automation · Kubernetes · GPU & Networking · Kernel-Level Optimization · Systems Performance Languages · Data & Observability · Infra & Config Automation · Open Source

Required

3+ years building and running large‑scale ML/LLM services with clear latency/availability SLOs
Hands‑on with at least one of vLLM, SGLang, TensorRT‑LLM
Familiarity with distributed and disaggregated serving infrastructure such as NVIDIA Dynamo
Deep understanding of prefill vs. decode, KV‑cache behavior, batching, sampling, speculative decoding, parallelism strategies
Comfortable debugging CUDA/NCCL, drivers/kernels, containers, service mesh/networking, and storage, owning incidents end‑to‑end
Python: Systems tooling and backend services
PyTorch: LLM inference engine development and integration, deployment readiness
AWS/GCP service experience, cloud deployment patterns
Running infrastructure at scale with containers on Kubernetes
GPU architecture, CUDA runtime, NCCL, InfiniBand; GPU‑aware bin‑packing and scheduling across heterogeneous fleets

Preferred

Familiarity with CUDA/Triton kernel development; Nsight Systems/Compute profiling
Rust, C++
Kafka/PubSub, Redis, gRPC/Protobuf; Prometheus/Grafana, OpenTelemetry; reliability patterns
Terraform/Ansible, infrastructure-as-code, reproducible environments
Contributions to serving, inference, or RL infrastructure projects

Benefits

Competitive compensation with significant equity incentives
Flexible work arrangement (remote or San Francisco office)
Full visa sponsorship and relocation support
Professional development budget
Regular team off-sites and conference attendance

Company

Prime Intellect

Find compute. Train Models. Co-own intelligence.