Perplexity · 1 day ago

Engineering Manager - Inference

San Francisco, CA

Full-time

Onsite

Senior Level

$300K/yr - $385K/yr

5+ years exp

Perplexity is seeking an Engineering Manager to lead its AI Inference team, which builds and scales the inference infrastructure behind its products and APIs. The role owns the technical direction of the inference systems and leads a team of engineers to enhance AI capabilities for millions of users.

Artificial Intelligence (AI) · Chatbot · Machine Learning · Natural Language Processing · Search Engine

H1B Sponsor Likely

Responsibilities

Lead and grow a high-performing team of AI inference engineers

Develop APIs for AI inference used by both internal and external customers

Architect and scale our inference infrastructure for reliability and efficiency

Benchmark and eliminate bottlenecks throughout our inference stack

Drive large sparse/MoE model inference at rack scale, including sharding strategies for massive models

Push the frontier by building inference systems that support sparse attention, disaggregated prefill/decode serving, etc.

Improve the reliability and observability of our systems and lead incident response

Own technical decisions around batching, throughput, latency, and GPU utilization

Partner with ML research teams on model optimization and deployment

Recruit, mentor, and develop engineering talent

Establish team processes, engineering standards, and operational excellence

Qualifications

ML systems · Inference frameworks · Technical leadership · Python · PyTorch · Kubernetes · Distributed systems · Technical communication · Cross-functional collaboration

Required

5+ years of engineering experience with 2+ years in a technical leadership or management role

Deep experience with ML systems and inference frameworks (PyTorch, TensorFlow, ONNX, TensorRT, vLLM)

Strong understanding of LLM architecture: Multi-Head Attention, Multi/Grouped-Query Attention, and common layers

Experience with inference optimizations: batching, quantization, kernel fusion, FlashAttention

Familiarity with GPU characteristics, roofline models, and performance analysis

Experience deploying reliable, distributed, real-time systems at scale

Track record of building and leading high-performing engineering teams

Experience with parallelism strategies: tensor parallelism, pipeline parallelism, expert parallelism

Strong technical communication and cross-functional collaboration skills

Preferred

Experience with CUDA, Triton, or custom kernel development

Background in training infrastructure and RL workloads

Experience with Kubernetes and container orchestration at scale

Published work or contributions to inference optimization research

Company

Perplexity

Perplexity is an AI-powered answer engine designed to provide accurate, real-time responses to user queries.

Founded in 2022

San Francisco, California, USA

201-500 employees

https://www.perplexity.ai

H1B Sponsorship

Perplexity has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference. (Data powered by the US Department of Labor)

[Chart: Distribution of Different Job Fields Receiving Sponsorship]

Trends of Total Sponsorships

2025 (12)

2024 (7)

2023 (2)

Funding

Current Stage

Late Stage

Total Funding

$1.68B

Key Investors

Cristiano Ronaldo · NuVentures · Accel

2025-12-05 · Undisclosed

2025-09-10 · Series Unknown · $400M

2025-08-15 · Secondary Market

Leadership Team

Aravind Srinivas

Co-Founder, President, CEO

Denis Yarats

Co-Founder & CTO

Recent News

bloomberglaw.com

News Outlets’ Perplexity AI Suits Strike at Existential Threat

2026-02-09

WSJ.com: US Business

Snap Sales Rise But Perplexity Deal Is Delayed

2026-02-05

hackernoon.com

In AI We Trust

2026-02-04

Company data provided by Crunchbase