Grafton Sciences · 2 hours ago
LLM Evals Engineering Lead
Grafton Sciences is building AI systems with general physical ability, aiming to push the frontier of physical AI. The Senior LLM Evals Engineer will be responsible for building the evaluation and verification layer for LLM systems, focusing on autonomous workflows and collaboration with various engineering teams.
Machine Learning · Robotics
Responsibilities
Build an eval harness for agentic LLM systems (offline, simulator-in-the-loop, and workflow-in-the-loop)
Design evals for long-horizon planning, specific agent-call correctness, recovery behavior, and safety/constraint adherence
Help with verifier-driven scoring (symbolic checks, simulation/twin checks, surrogate checks) and automated self-correction of the execution pipeline
Create regression gates and release criteria for model/prompt/toolchain changes; prevent capability and safety regressions
Define metrics for outlier identification and efficient question-asking that reduces uncertainty per unit time
Partner with training teams to turn eval failures into data (SFT/DPO/RL signals) and continuously improve the suite
Qualifications
Required
Strong experience building evaluation systems for ML models
Excellent software engineering skills (Python, data pipelines, test harnesses, distributed execution, reproducibility)
Deep understanding of agentic failure modes (tool misuse, hallucinated evidence, reward hacking, brittle formatting) and how to measure them
Ability to work across research and production systems in a fast-moving environment
Preferred
Experience with LLMs
Benefits
Meaningful equity
Benefits
Company
Grafton Sciences
Building systems of general physical ability to enable superintelligence
H1B Sponsorship
Grafton Sciences has a track record of offering H1B sponsorships. Please note that this does not
guarantee sponsorship for this specific role. Additional information is presented below for your
reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship
Trends of Total Sponsorships
2025 (2)
Funding
Current Stage
Early Stage
Company data provided by Crunchbase