Scale AI
AI Research Engineer, Enterprise Evaluations
Scale AI is seeking a technically rigorous and driven AI Research Engineer to join its Enterprise Evaluations team. This high-impact role focuses on building and maintaining AI evaluation systems that ensure the safety and reliability of LLM-powered workflows for enterprise clients.
AI Infrastructure · Artificial Intelligence (AI) · Data Collection and Labeling · Generative AI · Image Recognition · Machine Learning
Responsibilities
Partner with Scale’s Operations team and enterprise customers to translate ambiguity into structured evaluation data, guiding the creation and maintenance of gold-standard human-rated datasets and expert rubrics that anchor AI evaluation systems
Analyze feedback and collected data to identify patterns, refine evaluation frameworks, and establish iterative improvement loops that enhance the quality and relevance of human-curated assessments
Design, research, and develop LLM-as-a-Judge autorater frameworks and AI-assisted evaluation systems. This includes creating models that critique, grade, and explain agent outputs (e.g., RLAIF, model-judging-model setups), along with scalable evaluation pipelines and diagnostic tools
Pursue research initiatives that explore new methodologies for automatically analyzing, evaluating, and improving the behavior of enterprise agents, pushing the boundaries of how AI systems are assessed and optimized in real-world contexts
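The "LLM-as-a-Judge" autorater work described above can be sketched as a simple loop: for each rubric criterion, a judge model is prompted to score and explain an agent's output, and the per-criterion verdicts are aggregated. The rubric, prompt format, and `call_model` stub below are hypothetical illustrations, not Scale's actual system.

```python
# Minimal sketch of an LLM-as-a-Judge autorater loop (illustrative only).
# The rubric, prompt wording, and call_model stub are assumptions.

# Hypothetical expert rubric anchoring the evaluation.
RUBRIC = {
    "correctness": "Does the output answer the task accurately?",
    "safety": "Is the output free of harmful or policy-violating content?",
}

def build_judge_prompt(task: str, output: str, criterion: str, description: str) -> str:
    """Compose a grading prompt asking the judge model to score one criterion."""
    return (
        f"Task: {task}\n"
        f"Agent output: {output}\n"
        f"Criterion ({criterion}): {description}\n"
        "Score the output from 1 to 5 and explain your reasoning."
    )

def call_model(prompt: str) -> dict:
    # Stub for a real LLM API call; returns a canned verdict so the sketch runs.
    return {"score": 4, "explanation": "Accurate and on-topic."}

def judge(task: str, output: str) -> dict:
    """Grade one agent output against every rubric criterion and aggregate."""
    verdicts = {
        criterion: call_model(build_judge_prompt(task, output, criterion, desc))
        for criterion, desc in RUBRIC.items()
    }
    overall = sum(v["score"] for v in verdicts.values()) / len(verdicts)
    return {"per_criterion": verdicts, "overall": overall}
```

In practice, verdicts from such a judge would be calibrated against the gold-standard human-rated datasets mentioned above, closing the iterative improvement loop.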
Qualifications
Required
Bachelor's degree in Computer Science, Electrical Engineering, a related field, or equivalent practical experience
2+ years of experience in Machine Learning or Applied Research, focused on applied ML systems or evaluation infrastructure
Hands-on experience with Large Language Models (LLMs) and Generative AI in professional or research environments
Strong understanding of frontier model evaluation methodologies and the current research landscape
Proficiency in Python and major ML frameworks (e.g., PyTorch, TensorFlow)
Solid engineering and statistical analysis foundation, with experience developing data-driven methods for assessing model quality
Preferred
Advanced degree (Master's or Ph.D.) in Computer Science, Machine Learning, or a related quantitative field
Published research in leading ML or AI conferences such as NeurIPS, ICML, ICLR, or KDD
Experience designing, building, or deploying LLM-as-a-Judge frameworks or other automated evaluation systems for complex models
Experience collaborating with operations or external teams to define high-quality human annotator guidelines
Expertise in ML research engineering, stochastic systems, observability, or LLM-powered applications for model evaluation and analysis
Experience contributing to scalable pipelines that automate the evaluation and monitoring of large-scale models and agents
Familiarity with distributed computing frameworks and modern cloud infrastructure
Benefits
Comprehensive health, dental and vision coverage
Retirement benefits
A learning and development stipend
Generous PTO
Commuter stipend
Company
Scale AI
Scale’s mission is to develop reliable AI systems for the world’s most important decisions.
H1B Sponsorship
Scale AI has a track record of offering H1B sponsorship. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship
[Chart omitted; the highlighted field is similar to this job.]
Trends of Total Sponsorships
2025: 82 · 2024: 54 · 2023: 29 · 2022: 17 · 2021: 10 · 2020: 10
Funding
Current Stage: Late Stage
Total Funding: $15.9B
Key Investors: Meta, Accel, Tiger Global Management
2025-06-10: Corporate Round · $14.3B
2025-06-04: Series Unknown
2024-05-21: Series F · $1B
Company data provided by Crunchbase