Member of Technical Staff - Inference
Prime Intellect is building the open superintelligence stack, facilitating the creation, training, and deployment of advanced AI models. The role focuses on optimizing and serving large language models (LLMs) efficiently at scale, integrating them into reinforcement learning systems, and enhancing the overall infrastructure for AI development.
Artificial Intelligence (AI) · Cloud Computing
Responsibilities
Build a multi-tenant LLM serving platform that operates across our cloud GPU fleets
Design placement and scheduling algorithms for heterogeneous accelerators
Implement multi-region/zone failover and traffic shifting for resilience and cost control
Build autoscaling, routing, and load balancing to meet throughput/latency SLOs
Optimize model distribution and cold-start times across clusters
Integrate and contribute to LLM inference frameworks such as vLLM, SGLang, TensorRT‑LLM
Optimize configurations for tensor/pipeline/expert parallelism, prefix caching, memory management and other axes for maximum performance
Profile kernels, memory bandwidth and transport; apply techniques such as quantization and speculative decoding
Develop reproducible performance suites covering latency, throughput, context length, batch size, and precision (see the benchmark sketch after this list)
Embed and optimize distributed inference within our RL stack
Establish CI/CD with artifact promotion, performance gates, and reproducible builds
Build metrics, logging, and tracing; establish structured incident response and SLO management
Document architectures, playbooks, and API contracts; mentor and collaborate cross‑functionally
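As a concrete illustration of the performance-suite work above, here is a minimal sketch of a reproducible latency/throughput probe against an OpenAI-compatible completions endpoint of the kind vLLM and SGLang expose. The endpoint URL, model name, and sweep values are illustrative assumptions, not details from the posting.

```python
"""Minimal latency/throughput probe for an OpenAI-compatible serving endpoint.

Hypothetical sketch: the endpoint URL, model name, and sweep values below are
placeholders. vLLM and SGLang both expose an OpenAI-compatible /v1/completions
route, which is what this assumes.
"""
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # assumed local vLLM/SGLang server
MODEL = "my-model"                                  # placeholder model name


def one_request(prompt: str, max_tokens: int) -> float:
    """Send a single completion request and return end-to-end latency in seconds."""
    start = time.perf_counter()
    resp = requests.post(
        ENDPOINT,
        json={"model": MODEL, "prompt": prompt, "max_tokens": max_tokens},
        timeout=120,
    )
    resp.raise_for_status()
    return time.perf_counter() - start


def run_sweep(batch_size: int, max_tokens: int, prompt: str = "Hello") -> dict:
    """Fire batch_size concurrent requests; report p50/p95 latency and request throughput."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        latencies = list(pool.map(lambda _: one_request(prompt, max_tokens), range(batch_size)))
    wall = time.perf_counter() - start
    latencies.sort()
    return {
        "batch_size": batch_size,
        "p50_s": statistics.median(latencies),
        "p95_s": latencies[int(0.95 * (len(latencies) - 1))],
        "requests_per_s": batch_size / wall,
    }


if __name__ == "__main__":
    for bs in (1, 4, 16):  # batch-size axis; a fuller suite would also sweep context length and precision
        print(run_sweep(batch_size=bs, max_tokens=64))
```

A real suite would extend the sweep to context length, precision, and streaming time-to-first-token, and record the server configuration alongside the results so runs stay reproducible.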
Qualifications
Required
3+ years building and running large‑scale ML/LLM services with clear latency/availability SLOs
Hands‑on with at least one of vLLM, SGLang, TensorRT‑LLM
Familiarity with distributed and disaggregated serving infrastructure such as NVIDIA Dynamo
Deep understanding of prefill vs. decode, KV‑cache behavior, batching, sampling, speculative decoding, parallelism strategies
Comfortable debugging CUDA/NCCL, drivers/kernels, containers, service mesh/networking, and storage, owning incidents end‑to‑end
Python: systems tooling and backend services
PyTorch: LLM inference engine development, integration, and deployment readiness
AWS/GCP: experience with cloud services and deployment patterns
Kubernetes: running containerized infrastructure at scale
GPU architecture, CUDA runtime, NCCL, InfiniBand; GPU-aware bin-packing and scheduling across heterogeneous fleets (a toy placement sketch follows this list)
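The bin-packing requirement can be illustrated with a toy, greedy best-fit placement of model replicas onto a heterogeneous GPU fleet. Node specs, replica memory footprints, and the fit heuristic below are assumptions for illustration, not anything specified in the posting.

```python
"""Toy GPU-aware placement sketch: greedy best-fit of model replicas onto a
heterogeneous fleet. Node specs, replica sizes, and the scoring rule are
illustrative placeholders."""
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    gpu_type: str
    free_gpu_mem_gb: float
    placements: list = field(default_factory=list)


def place(replicas: list[tuple[str, float]], nodes: list[Node]) -> list[Node]:
    """Assign each (model, mem_gb) replica to the node that fits it with the
    least leftover memory (best fit), placing the largest replicas first."""
    for model, mem in sorted(replicas, key=lambda r: r[1], reverse=True):
        candidates = [n for n in nodes if n.free_gpu_mem_gb >= mem]
        if not candidates:
            raise RuntimeError(f"no capacity for {model} ({mem} GB)")
        best = min(candidates, key=lambda n: n.free_gpu_mem_gb - mem)
        best.free_gpu_mem_gb -= mem
        best.placements.append(model)
    return nodes


if __name__ == "__main__":
    fleet = [Node("h100-0", "H100", 80.0), Node("a100-0", "A100", 40.0)]
    for node in place([("llm-70b", 70.0), ("llm-8b", 18.0), ("embedder", 6.0)], fleet):
        print(node.name, node.gpu_type, node.placements, f"{node.free_gpu_mem_gb:.0f} GB free")
```

A production scheduler would also weigh interconnect topology, colocation of tensor-parallel groups, and failure domains rather than GPU memory alone.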
Preferred
Familiarity with CUDA/Triton kernel development; Nsight Systems/Compute profiling
Rust, C++
Kafka/PubSub, Redis, gRPC/Protobuf; Prometheus/Grafana, OpenTelemetry; reliability patterns
Terraform/Ansible, infrastructure-as-code, reproducible environments
Contributions to serving, inference, or RL infrastructure projects
Benefits
Competitive compensation with significant equity incentives
Flexible work arrangement (remote or San Francisco office)
Full visa sponsorship and relocation support
Professional development budget
Regular team off-sites and conference attendance
Company
Prime Intellect
Find compute. Train models. Co-own intelligence.
Funding
Current Stage: Early Stage
Total Funding: $20.5M
Key Investors: Founders Fund
2025-02-28 · Seed · $15M
2024-04-22 · Seed · $5.5M