Intertru Inc
Founding AI Prompt Engineer
Intertru Inc is a method-driven interview-accuracy platform that uses AI to help companies hire the strongest candidates. The Founding AI Prompt Engineer will architect prompts that yield consistent outputs, analyze model performance, and design the evaluation metrics that define reliable AI.
Artificial Intelligence (AI) · Human Resources · Recruiting
Responsibilities
Audit all current production prompts and identify at least 3 key inconsistency risks across our core flows (accuracy, structure, or latency)
Create and document a standardized JSON schema format for all structured outputs and enforce it across at least 2 primary use cases
Build a "Golden Dataset" of 100+ validated prompt/input-output examples to serve as a foundation for regression testing
Launch automated regression testing for at least 3 production prompts, with a defined performance scoring system covering accuracy, structure adherence, and token cost (see the sketch after this list)
Benchmark at least 2 LLMs (e.g., GPT-4o vs Claude 3.5) for latency, cost, and output consistency on key use cases; present findings and a recommendation
Collaborate with engineering to integrate prompt versioning into Git or a prompt-ops platform; document the rollback protocol and train the team on it
Design and deploy a full prompt evaluation suite (evals) that triggers on prompt updates, validating outputs against the golden dataset
Optimize at least 2 prompts for lower token usage while maintaining ≥95% accuracy on critical structured outputs
Publish a 90-day prompt performance report tracking model behavior, format consistency, and improvement over time
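
For illustration only, a minimal sketch of the golden-dataset regression scoring described above. Here call_model is a hypothetical stand-in for the production LLM API, and the golden example and field names (score, verdict) are invented:

import json

# Invented golden example: validated input -> expected structured output.
GOLDEN = [
    {"input": "Candidate answer text...", "expected": {"score": 4, "verdict": "hire"}},
]

def call_model(prompt):
    # Hypothetical stub; in production this would call the real LLM API.
    return json.dumps({"score": 4, "verdict": "hire"})

def run_regression(prompt_template):
    structure_ok = accuracy_ok = 0
    for case in GOLDEN:
        raw = call_model(prompt_template.format(input=case["input"]))
        try:
            out = json.loads(raw)            # structure adherence: does it parse?
        except json.JSONDecodeError:
            continue
        structure_ok += 1
        if out == case["expected"]:          # accuracy: matches the golden answer?
            accuracy_ok += 1
    n = len(GOLDEN)
    return {"structure": structure_ok / n, "accuracy": accuracy_ok / n}

print(run_regression("Evaluate the interview answer below and return JSON.\n{input}"))

In practice the accuracy check would likely compare field by field rather than require an exact match, but the shape of the harness is the same.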
Qualifications
Required
Hands-on experience with OpenAI, Anthropic, or open-source LLMs via API, not just chat interfaces
Experience deploying prompt chains that use advanced techniques such as chain-of-thought (CoT), few-shot, and tree-of-thought (ToT) prompting to solve complex problems
Ability to enforce strict JSON/XML structures in responses and test them for integrity
Fluency in Python and the ability to build eval harnesses using tools like LangChain, DSPy, or custom logic
Experience benchmarking models, dialing in temperature, and debugging prompt behaviors under load
Real-world prompt engineering experience, not just demos
Ability to think in systems, not scripts
Experience creating evaluation frameworks to score prompt performance
Understanding of how to trade off between model cost, latency, and accuracy
Ability to explain how to regression-test prompts, or how to enforce a JSON schema 100% of the time (a minimal sketch follows)
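
As a concrete example of that last point, one common pattern is validate-and-retry: parse the model's output, check it against the schema, and re-prompt on failure. A minimal sketch, again assuming a hypothetical call_model wrapper and illustrative field names:

import json

SCHEMA = {"score": int, "verdict": str}      # required keys and their types

def call_model(prompt):
    # Hypothetical stub standing in for the real LLM API call.
    return '{"score": 4, "verdict": "hire"}'

def enforce_json(prompt, retries=3):
    # Validate-and-retry loop: keep re-prompting until the output both
    # parses as JSON and conforms to SCHEMA, or raise after `retries`.
    for _ in range(retries):
        raw = call_model(prompt)
        try:
            out = json.loads(raw)
        except json.JSONDecodeError:
            continue                         # malformed JSON: try again
        if all(isinstance(out.get(k), t) for k, t in SCHEMA.items()):
            return out
    raise ValueError("no schema-conformant JSON after retries")

print(enforce_json("Return only JSON with integer 'score' and string 'verdict'."))

Stricter guarantees usually layer this with provider-side structured-output features, but the validation loop is what makes the contract testable.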