Cerebras · 3 weeks ago
Senior Research Engineer - Inference ML
Cerebras Systems builds the world's largest AI chip, enabling AI applications across a wide range of fields. As a Senior Research Engineer on the Inference ML team, you will adapt state-of-the-art language and vision models to run efficiently on the Cerebras architecture and contribute to cutting-edge inference research.
Artificial Intelligence (AI) · Computer Hardware · Semiconductor · Software
Responsibilities
Design, implement, and optimize state-of-the-art transformer architectures for NLP and computer vision on Cerebras hardware
Research and prototype novel inference algorithms and model architectures that exploit the unique capabilities of Cerebras hardware, with emphasis on speculative decoding, pruning/compression, sparse attention, and other forms of sparsity
Train models to convergence, perform hyperparameter sweeps, and analyze results to inform next steps
Bring up new models on the Cerebras system, validate functional correctness, and troubleshoot any integration issues
Profile and optimize model code using Cerebras tools to maximize throughput and minimize latency
Develop diagnostic tooling or scripts to surface performance bottlenecks and guide optimization strategies for inference workloads
Collaborate across teams, including software, hardware, and product, to drive projects from inception through delivery
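One of the techniques named above, speculative decoding, can be sketched in miniature. Below is a greedy-verification variant in plain Python: a cheap draft model proposes several tokens, and the target model accepts the longest prefix it agrees with. The `draft_model` and `target_model` functions are invented toy stand-ins, not Cerebras APIs; production systems use rejection sampling so the target distribution is preserved exactly.

```python
# Speculative decoding, greedy-verification variant (sketch).
# A cheap draft model proposes k tokens; the target model checks them
# and accepts the longest matching prefix, then substitutes its own
# token at the first disagreement.

def draft_model(ctx):
    # Toy draft model: predicts last token + 1, wrapping at 10.
    return (ctx[-1] + 1) % 10

def target_model(ctx):
    # Toy target model: agrees with the draft except after token 5,
    # where it predicts 0 instead of 6.
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10

def speculative_step(ctx, k=4):
    """Propose k draft tokens, verify with the target, return accepted tokens."""
    proposal = list(ctx)
    for _ in range(k):
        proposal.append(draft_model(proposal))
    drafted = proposal[len(ctx):]

    accepted = []
    verify_ctx = list(ctx)
    for tok in drafted:
        t = target_model(verify_ctx)
        if t == tok:            # target agrees: accept the draft token
            accepted.append(tok)
            verify_ctx.append(tok)
        else:                   # disagreement: take the target's token, stop
            accepted.append(t)
            break
    return accepted

print(speculative_step([3], k=4))  # drafts 4,5,6,7; target rejects 6 -> [4, 5, 0]
```

When the draft and target usually agree, each target verification pass yields several tokens instead of one, which is the source of the speedup.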
Qualifications
Required
Bachelor's degree in Computer Science, Software Engineering, Computer Engineering, Electrical Engineering, or a related technical field AND 7+ years of ML software development experience, OR
Master's degree in Computer Science or related technical field AND 4+ years of software development experience, OR
PhD in Computer Science or related technical field with 2+ years of relevant research or industry experience, OR
Equivalent practical experience
4+ years of experience testing, maintaining, or launching software products, including 2+ years of experience with software design and architecture
3+ years of experience in software development focused on machine learning (e.g., deep learning, large language models, or computer vision)
Strong programming skills in Python and/or C++
Experience with generative AI and machine learning systems
Proficiency with at least one major ML framework (e.g., PyTorch, Hugging Face Transformers, vLLM, or SGLang)
Deep understanding of transformer-based models in language and/or vision domains, with demonstrated experience implementing and optimizing them
Proven ability to implement custom layers, operators, and backpropagation logic
Strong foundation in performance optimization on specialized hardware (e.g., GPUs, TPUs, or HPC interconnects)
Deep understanding of modern ML architectures and strong intuition for optimizing their performance, particularly for inference workloads using sparse attention, pruning/compression, and speculative decoding
Track record of owning problems end-to-end and autonomously acquiring whatever knowledge is needed to deliver results
Self-directed mindset with a demonstrated ability to identify and tackle the most impactful problems
Collaborative approach with humility, eagerness to help colleagues, and commitment to team success
Genuine passion for AI and a drive to push the limits of inference performance
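The "custom layers, operators, and backpropagation logic" requirement above can be made concrete with a small example. This is a framework-free sketch (all names invented for illustration) of a leaky-ReLU layer with a hand-written backward pass, validated against a finite-difference gradient — the standard sanity check when implementing custom operators.

```python
# Hand-written forward/backward for a leaky-ReLU layer, verified
# against a numerical (finite-difference) gradient.

ALPHA = 0.1  # negative-slope hyperparameter

def forward(x):
    """Elementwise leaky ReLU."""
    return [xi if xi > 0 else ALPHA * xi for xi in x]

def backward(x, grad_out):
    """Chain rule: dL/dx_i = dL/dy_i * dy_i/dx_i, where dy_i/dx_i is 1 or ALPHA."""
    return [g * (1.0 if xi > 0 else ALPHA) for xi, g in zip(x, grad_out)]

def numerical_grad(x, grad_out, eps=1e-6):
    """Finite-difference check: perturb each input and measure how the
    loss moves. Loss here is sum(y * grad_out), whose gradient w.r.t. y
    is grad_out by construction."""
    grads = []
    for i in range(len(x)):
        xp = list(x); xp[i] += eps
        xm = list(x); xm[i] -= eps
        lp = sum(y * g for y, g in zip(forward(xp), grad_out))
        lm = sum(y * g for y, g in zip(forward(xm), grad_out))
        grads.append((lp - lm) / (2 * eps))
    return grads

x = [2.0, -3.0, 0.5]
g_out = [1.0, 1.0, 2.0]
analytic = backward(x, g_out)
assert all(abs(a - n) < 1e-4 for a, n in zip(analytic, numerical_grad(x, g_out)))
print(analytic)  # [1.0, 0.1, 2.0]
```

In PyTorch the same structure lives in a `torch.autograd.Function` subclass, with `forward` and `backward` static methods; the gradient-check discipline is identical.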
Preferred
Master's degree or PhD in Computer Science, Computer Engineering, or a related technical field
Experience independently driving complex ML or inference projects from prototype to production-quality implementations
Hands-on experience with relevant ML frameworks such as PyTorch, Transformers, vLLM, or SGLang
Experience with large language models, mixture-of-experts models, multimodal learning, or AI agents
Experience with speculative decoding, neural network pruning and compression, sparse attention, quantization, sparsity, post-training techniques, and inference-focused evaluations
Familiarity with large-scale model training and deployment, including performance and cost trade-offs in production systems
Triton/CUDA experience is a big plus
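Among the preferred techniques, pruning/compression is easy to illustrate. Here is a minimal sketch of unstructured magnitude pruning in plain Python (the `magnitude_prune` helper is invented for illustration): the smallest-magnitude weights are zeroed until a target sparsity is reached.

```python
# Magnitude pruning (sketch): zero out the smallest-magnitude weights
# until a target fraction of the weights is zero.

def magnitude_prune(weights, sparsity):
    """Return weights with the lowest-|w| fraction `sparsity` set to 0.
    Ties at the cutoff magnitude are all pruned (acceptable for a sketch)."""
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # Cutoff = magnitude of the n_prune-th smallest |w|.
    cutoff = sorted(abs(w) for w in weights)[n_prune - 1]
    return [0.0 if abs(w) <= cutoff else w for w in weights]

w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
print(magnitude_prune(w, 0.5))  # -> [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```

Real pipelines typically prune iteratively with fine-tuning in between, and often use structured patterns (e.g., 2:4 sparsity) so that hardware can actually exploit the zeros.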
Benefits
Enjoy job stability with startup vitality.
A simple, non-corporate work culture that respects individual beliefs.
Company
Cerebras
Cerebras Systems delivers the world's fastest AI inference. We are powering the future of generative AI.
Funding
Current Stage: Late Stage
Total Funding: $1.82B
Key Investors: Alpha Wave Ventures, Vy Capital, Coatue
2025-12-03: Secondary Market
2025-09-30: Series G · $1.1B
2024-09-27: Series Unknown
Company data provided by Crunchbase