DeepRec.ai
LLM Evaluation Engineering Lead
DeepRec.ai is a deep-tech AI company focused on building autonomous systems for complex environments. They are seeking an LLM Evaluation Engineering Lead to own the evaluation and verification processes for agentic LLM systems, ensuring that these systems improve over time and behave reliably.
Responsibilities
Build eval harnesses for agentic LLM systems (offline + in-workflow)
Design evals for planning, execution, recovery, and safety
Implement verifier-driven scoring and regression gates
Turn eval failures into training signals (SFT / DPO / RL)
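To make the responsibilities above concrete, here is a minimal sketch of what a verifier-driven regression gate might look like. All names (`EvalCase`, `pass_rate`, `regression_gate`, the toy echo model) are hypothetical illustrations, not DeepRec.ai's actual stack: each eval case carries a programmatic verifier, and a candidate model is gated on whether its pass rate regresses below a stored baseline.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalCase:
    # Hypothetical eval case: a prompt plus a programmatic verifier
    # that returns True if the model's output passes.
    prompt: str
    verifier: Callable[[str], bool]

def pass_rate(cases: List[EvalCase], model: Callable[[str], str]) -> float:
    # Score each output with its verifier and aggregate into a pass rate.
    passed = sum(case.verifier(model(case.prompt)) for case in cases)
    return passed / len(cases)

def regression_gate(rate: float, baseline: float, tolerance: float = 0.02) -> bool:
    # Gate fails (returns False) if the new pass rate drops more than
    # `tolerance` below the stored baseline.
    return rate >= baseline - tolerance

# Toy usage: a "model" that echoes its prompt, verified by substring checks.
cases = [
    EvalCase("say hello", lambda out: "hello" in out),
    EvalCase("say goodbye", lambda out: "goodbye" in out),
]
echo_model = lambda prompt: prompt
rate = pass_rate(cases, echo_model)  # both verifiers pass -> 1.0
```

In a real harness the verifiers would check tool calls, output schemas, or execution traces rather than substrings, and the failing cases would be logged as candidate training signals (SFT / DPO / RL) rather than discarded.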
Qualifications
Required
Strong experience building evaluation systems for ML models (LLMs strongly preferred)
Excellent software engineering fundamentals:
Python
Data pipelines
Test harnesses
Distributed execution
Reproducibility
Deep understanding of agentic failure modes, including:
Tool misuse
Hallucinated evidence
Reward hacking
Brittle formatting and schema drift
Ability to reason about what to measure, not just how to measure it
Comfortable operating between research experimentation and production systems
Benefits
High autonomy, strong technical peers, and meaningful equity