Essential AI · 5 months ago
Member of Technical Staff: ML Infrastructure, Platform Engineer
Essential AI is building an open platform to fuel and accelerate AI breakthroughs globally. The ML Infra Platform Engineer will be responsible for architecting and building the compute infrastructure that powers the training and serving of models, optimizing distributed systems for throughput and robustness.
Artificial Intelligence (AI) · Information Technology · Software
Responsibilities
Design, build, and maintain scalable machine learning infrastructure to support our model training, inference and applications
Design and implement scalable machine learning and distributed systems that enable training and scaling of LLMs. Work on parallelism methods that make training fast and reliable
Help oversee and drive the vision of how we should build, test, and deploy models, while taking ownership of and transforming the state-of-the-art development experience for research
Develop tools and frameworks to automate and streamline ML experimentation and management
Collaborate with other researchers and product engineers to bring magical product experiences through large language models
Work on lower levels of the stack to build high-performing training and serving infrastructure, including researching new techniques and writing custom kernels as needed to achieve improvements
Be willing to optimize performance and efficiency across different accelerators
Qualifications
Required
A strong understanding of the architectures of new AI accelerators such as GPUs, TPUs, IPUs, and HPUs, and their tradeoffs
Knowledge of parallel computing concepts and distributed systems
Experience with kernels, low-precision training, and MoE
Prior experience in performance tuning of training and/or inference LLM workloads
Experience with MLPerf or internal production workloads will be valued
6+ years of relevant industry experience in leading the design of large-scale & production ML infra systems
Experience with communication libraries
Experience with training and building large language models using frameworks such as Megatron, DeepSpeed, etc
Experience with deployment frameworks like vLLM, TGI, TensorRT-LLM etc
Comfortable working under the hood with kernel languages like OpenAI Triton and Pallas, and compilers like XLA
Experience with INT8/FP8 training and inference, quantization and/or distillation
Knowledge of container technologies like Docker and Kubernetes and cloud platforms like AWS, GCP, etc
Intermediate fluency with network fundamentals like VPC, Subnets, Routing Tables, Firewalls etc
Company
Essential AI
Essential AI creates AI solutions that enhance efficiency through the automation of labor-intensive and repetitive workflows.
H1B Sponsorship
Essential AI has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference (data from the US Department of Labor).
Distribution of Different Job Fields Receiving Sponsorship (chart not shown)
Trends of Total Sponsorships: 2024 (1), 2023 (1)
Funding
Current Stage: Early Stage
Total Funding: $64.5M
Key Investors: March Capital, Thrive Capital
2023-12-12 · Series A · $56.5M
2023-05-04 · Seed · $8M
Recent News
2025-02-25
21st CENTURY CHRONICLE
Company data provided by crunchbase