
Guild.ai · 3 weeks ago

AI Engineer, Agents & Evaluation

Guild.ai is seeking its first AI Engineer focused on agents and evaluation to shape the development of intelligent systems. The role involves designing evaluation frameworks, orchestrating agent strategies, and collaborating across teams to enhance agent performance.

Artificial Intelligence (AI) · Developer Platform · Developer Tools · Generative AI · Software

Responsibilities

Create Task Evaluations That Matter: Design and implement task-specific evaluations that measure and improve agent quality. Each eval should both drive concrete iteration on our agents and spark broader innovation around the task itself
Define Tasks, Datasets, and Harnesses: Clearly specify tasks, collect and curate balanced datasets, and build robust evaluation harnesses that can be used across agents and modeling approaches. There is ample room for architectural design and systems thinking here
Build and Use a Reusable Evaluation Framework: Develop frameworks and tools for running evaluations at scale. Use these frameworks to tune existing agents and to guide the development of new ones in our environment (see the sketch after this list)
Explore Agent Orchestration Strategies: Investigate and implement orchestration patterns (tooling, routing, decomposition, multi-agent setups, etc.) that allow agents to tackle increasingly complex, multi-step, and long-horizon tasks
Apply Post-Training Techniques: Experiment with post-training approaches (e.g., fine-tuning, preference optimization, reward shaping, distillation) to produce high-performance models tailored to specific tasks and workflows
Run Experiments End-to-End: Design, run, and analyze experiments with rigor. Turn experimental results into clear recommendations and concrete changes to model configurations, prompts, and system design
Collaborate Deeply Across the Stack: Work closely with founders, product, and infrastructure engineers to ensure evaluations, agents, and platform primitives all reinforce each other
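
To make the evaluation-harness responsibilities above concrete, here is a minimal Python sketch of what a task-specific eval could look like: a curated dataset of cases, a pluggable grader, and a runner that is reusable across agents. Every name here (EvalCase, grade, run_eval, the toy agents) is an illustrative assumption, not Guild.ai's actual stack.

from dataclasses import dataclass
from typing import Callable

# An agent is anything that maps a task prompt to an answer string.
Agent = Callable[[str], str]

@dataclass(frozen=True)
class EvalCase:
    prompt: str    # task input shown to the agent
    expected: str  # reference answer used for grading

def grade(output: str, expected: str) -> float:
    """Toy grader: exact match after normalization. A real harness
    would swap in a task-specific or model-based grader here."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def run_eval(agent: Agent, dataset: list[EvalCase]) -> float:
    """Run one agent over a curated dataset and return its mean score."""
    scores = [grade(agent(case.prompt), case.expected) for case in dataset]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    dataset = [
        EvalCase(prompt="2 + 2 = ?", expected="4"),
        EvalCase(prompt="Capital of France?", expected="Paris"),
    ]
    # Two stand-in "agents" show the harness is reusable across approaches.
    agents: dict[str, Agent] = {
        "echo": lambda p: p,
        "hardcoded": lambda p: "4" if "2 + 2" in p else "Paris",
    }
    for name, agent in agents.items():
        print(f"{name}: {run_eval(agent, dataset):.2f}")

The grader is the deliberate seam in this design: swapping the exact-match check for a task-specific or model-based grader lets the same runner cover complex, open-ended tasks without changing anything else.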

Qualifications

Machine Learning · Large Language Models · Python · Experiment Design · Agent Orchestration · Evaluation Frameworks · TypeScript · Communication Skills · Self-Motivated

Required

M.S. or Ph.D. in a relevant field (e.g., Computer Science, Machine Learning, NLP) or equivalent practical experience
Strong background in machine learning and large language models, ideally including both research and hands-on implementation
2–5 years working with LLM technology, with familiarity across prompting and interaction patterns, agent and tool orchestration strategies, and evaluation strategies for complex, open-ended tasks
Proficiency writing production-quality code, especially in Python; comfort working with TypeScript or modern web/backend stacks
Experience designing and running experiments, and interpreting results in messy, real-world settings
Self-motivated, comfortable operating in an unstructured, high-ambiguity environment
Strong communication skills and the ability to translate vague goals into concrete, testable setups

Preferred

Experience building agentic systems (tool-using agents, workflows, or multi-agent systems) in real products
Prior work on model evaluation frameworks, benchmarking, or reliability/robustness testing
Familiarity with modern ML tooling (training/inference stacks, experiment tracking, data pipelines)
Contributions to open-source LLM, tooling, or evaluation projects
Experience at an early-stage startup or research lab where you owned projects end-to-end

Benefits

Significant equity in an early-stage, venture-backed startup
Comprehensive Health Benefits (Medical, Dental, Vision)
Flexible PTO to ensure you have the time you need to recharge

Company

Guild.ai

AI, Software Development

Funding

Current Stage: Early Stage
Total Funding: $0M
2025-09-01 · Seed · $0M
Company data provided by Crunchbase