
Cango Inc. · 3 weeks ago

AI System Solution Architect

Cango Inc. is a company focused on innovative AI solutions and is seeking an AI System Solution Architect to design and optimize the technical architecture for AI inference on GPU clusters. The role involves leading performance engineering efforts, engaging with clients, and guiding the engineering team in implementing advanced AI solutions.

Automotive
Hiring Manager
Logan Long

Responsibilities

Design end-to-end technical architecture for LLM and Diffusion model inference on large-scale GPU clusters
Develop innovative solutions in KV Cache management, distributed scheduling, pipelining/batching strategies, memory allocation, and P2P/IB communication
Architect a multi-tenant serving framework that balances throughput, latency, and cost
Define product positioning and differentiation based on industry trends and company strategy
Develop technical evolution plans (e.g., token streaming like vLLM, syntax parsing like SGLang, Diffusion acceleration)
Align closely with internal GPU infrastructure and business teams to ensure timely product delivery
Lead performance engineering efforts including NCCL tuning, NUMA binding, CUDA kernel optimization
Drive cross-team collaboration (GPU kernel, compiler, distributed system, frontend APIs) to ensure system stability and scalability
Organize benchmarking and performance testing against industry leaders (vLLM, SGLang, TensorRT, etc.)
Guide engineering team on implementation strategies, experimental methodologies, and optimization pathways
Engage with open-source communities and contribute core components to enhance technical influence
Communicate directly with North America-based clients to understand their needs for AI inference, training, and deployment
Translate customer needs into internal implementation plans and coordinate across operations, engineering, and delivery teams

Qualifications

GPU optimization · Deep learning systems · System architecture · PyTorch · CUDA · NCCL · Triton · TensorRT · MPI/IB/RDMA · Cross-functional communication · Architectural thinking · Open-source contributions

Required

5+ years of experience in computer infrastructure, GPU cloud, or large-scale cloud computing in the U.S., with a deep understanding of the North American tech ecosystem
Master's or Ph.D. in Computer Science, Electrical Engineering, or related fields preferred
5+ years of hands-on experience in deep learning systems or GPU optimization, including leading the design of at least one large-scale AI inference or training system
Proficiency with PyTorch, CUDA, NCCL, Triton, TensorRT, MPI/IB/RDMA, etc.
Deep understanding of projects like vLLM, SGLang, DeepSpeed, FasterTransformer
Practical experience in LLM inference optimization (e.g., KV Cache, P2P vs CPU routing, batching strategies)
Ability to integrate system-level optimization with product usability (API and Serving layers)
Strong architectural thinking and cross-functional communication skills to translate complexity into clear product roadmaps

Preferred

Open-source contributions (e.g., to vLLM, DeepSpeed, Ray, Triton-Server, SGLang, etc.)
Experience launching GPU cloud or AI infrastructure products (e.g., RunPod, Lambda, Modal, SageMaker)
Familiarity with emerging LLM inference trends such as speculative decoding, continuous batching, and streaming inference

Benefits

Competitive compensation package with equity incentives.

Company

Cango Inc.

Cango Inc. (NYSE: CANG) primarily operates a leading Bitcoin mining business.

Funding

Current Stage
Public Company
Total Funding
$10.63M
Key Investors
Enduring Wealth Capital · SOSV · GSMA Ecosystem Accelerator
2025-12-29: Post-IPO Equity · $10.5M
2018-07-26: IPO
2017-10-26: Seed
Company data provided by Crunchbase.