
Prime Intellect · 1 month ago

Member of Technical Staff - Inference

Prime Intellect is building the open superintelligence stack, facilitating the creation, training, and deployment of advanced AI models. The role focuses on optimizing and serving large language models (LLMs) efficiently at scale, integrating them into reinforcement learning systems, and enhancing the overall infrastructure for AI development.

Artificial Intelligence (AI) · Cloud Computing
H1B Sponsored

Responsibilities

Build a multi-tenant LLM serving platform that operates across our cloud GPU fleets
Design placement and scheduling algorithms for heterogeneous accelerators
Implement multi-region/zone failover and traffic shifting for resilience and cost control
Build autoscaling, routing, and load balancing to meet throughput/latency SLOs
Optimize model distribution and cold-start times across clusters
Integrate and contribute to LLM inference frameworks such as vLLM, SGLang, TensorRT‑LLM
Optimize configurations for tensor/pipeline/expert parallelism, prefix caching, memory management and other axes for maximum performance
Profile kernels, memory bandwidth and transport; apply techniques such as quantization and speculative decoding
Develop reproducible performance suites (latency, throughput, context length, batch size, precision)
Embed and optimize distributed inference within our RL stack
Establish CI/CD with artifact promotion, performance gates, and reproducible builds
Build metrics, logging, and tracing; establish structured incident response and SLO management
Document architectures, playbooks, and API contracts; mentor and collaborate cross‑functionally

Qualifications

Building ML Systems · Inference Backends · Distributed Serving Infra · Full-Stack Debugging · Python · PyTorch · Cloud & Automation · Kubernetes · GPU & Networking · Kernel-Level Optimization · Systems Performance Languages · Data & Observability · Infra & Config Automation · Open Source

Required

3+ years building and running large‑scale ML/LLM services with clear latency/availability SLOs
Hands‑on with at least one of vLLM, SGLang, TensorRT‑LLM
Familiarity with distributed and disaggregated serving infrastructure such as NVIDIA Dynamo
Deep understanding of prefill vs. decode, KV‑cache behavior, batching, sampling, speculative decoding, parallelism strategies
Comfortable debugging CUDA/NCCL, drivers/kernels, containers, service mesh/networking, and storage, owning incidents end‑to‑end
Python: Systems tooling and backend services
PyTorch: LLM inference engine development and integration, deployment readiness
AWS/GCP service experience, cloud deployment patterns
Running infrastructure at scale with containers on Kubernetes
GPU architecture, CUDA runtime, NCCL, InfiniBand; GPU‑aware bin‑packing and scheduling across heterogeneous fleets

Preferred

Familiarity with CUDA/Triton kernel development; Nsight Systems/Compute profiling
Rust, C++
Kafka/PubSub, Redis, gRPC/Protobuf; Prometheus/Grafana, OpenTelemetry; reliability patterns
Terraform/Ansible, infrastructure-as-code, reproducible environments
Contributions to serving, inference, or RL infrastructure projects

Benefits

Competitive compensation with significant equity incentives
Flexible work arrangement (remote or San Francisco office)
Full visa sponsorship and relocation support
Professional development budget
Regular team off-sites and conference attendance

Company

Prime Intellect

Find compute. Train Models. Co-own intelligence.