DeepRec.ai · 1 week ago
AI Evaluation Engineer
Responsibilities
Own the evaluation stack – design frameworks that define “good,” “risky,” and “catastrophic” outputs
Automate at scale – build data pipelines, LLM judges, and integrate with CI to block unsafe releases
Stress testing – red team AI systems with challenge prompts to expose brittleness, bias, or jailbreaks
Track and monitor – establish model/prompt versioning, build observability, and create incident response playbooks
Empower others – deliver tooling, APIs, and dashboards that put eval into every engineer’s workflow
Qualification
Required
Strong software engineering background (TypeScript a plus)
Deep experience with OpenAI API or similar LLM ecosystems
Practical knowledge of prompting, function calling, and eval techniques (e.g. LLM grading, moderation APIs)
Familiarity with statistical analysis and validating data quality/performance
Preferred
experience with observability, monitoring, or data science tooling