Building a Production Eval System for AI Agents
Companion document: There’s an LLM-optimized version of this research designed to be fed as context to Claude, ChatGPT, or any model working on eval systems. Dense, structured, no narrative — just the principles, frameworks, and source citations.
The Problem
We run a multi-agent AI system built on Mastra. We wanted to move most agents from cloud inference (Claude Sonnet) to local inference (Gemma 27B on Ollama). Not all agents could make the switch — one stays on cloud because its multi-step tool chains are too complex for current local models.
That forced the question: how do we know the agents are still good after the switch? And more broadly: how do we build a practice where we can prove agents are getting better over time?
We had eval infrastructure — golden datasets, an LLM-as-judge harness — but no continuous quality loop. Before building more, we read what practitioners who’ve shipped production eval systems actually recommend.
What the Best Practitioners Agree On
We drew from five sources that converge on the same core principles.
| Source | Key Contribution |
|---|---|
| Braintrust | Flywheel model, aspirational evals |
| Hamel Husain | Domain expert calibration, binary scoring, 6 eval skills |
| Eugene Yan | AlignEval data-first workflow, LLM judge bias research |
| applied-llms.org | Intern test, Goodhart’s Law, hallucination baselines |
| Langfuse docs | Score API types, annotation queues |
Binary Pass/Fail Over Likert Scales
Every source states or implies this. Hamel: “A binary decision forces everyone to consider what truly matters.” Likert scales create ambiguity — is a 3 acceptable? Binary forces a decision. When you fail something, you write a critique explaining why. The critique is more valuable than any number.
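This can even be enforced structurally. A minimal sketch (the types and the `fail` helper are our own illustration, not from any framework): a fail verdict cannot exist without a critique.

```typescript
// Illustrative sketch: a binary verdict where "fail" is inseparable
// from its critique. Names are hypothetical, not from any library.
type Judgment =
  | { verdict: "pass" }
  | { verdict: "fail"; critique: string };

// Constructor that rejects empty critiques: if you fail something,
// you must say why.
function fail(critique: string): Judgment {
  if (critique.trim().length === 0) {
    throw new Error("A fail verdict requires a critique");
  }
  return { verdict: "fail", critique };
}
```

The point of the discriminated union is that downstream code (few-shot example builders, dashboards) can rely on every failure carrying its explanation.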
The Domain Expert is the Benevolent Dictator
One person’s judgment calibrates the system. Not a committee. Eval criteria drift as you see more data — that’s expected. But it must be one person’s evolving judgment. Their critiques become the few-shot examples that train LLM judges.
Look at the Data (60-80% of Effort)
Hamel: “The real value of this process is looking at your data and doing careful analysis.” Infrastructure is only valuable insofar as it removes friction from looking at data. The eval workbench is a trace review tool first, a scoring system second.
Custom Scorers, Not Generic Ones
“Generic metrics embed someone else’s requirements, not yours.” Off-the-shelf metrics (helpfulness, coherence) measure what someone else decided matters. Your agents have specific failure modes. Build judges calibrated to those.
Guardrails Are Not Evals
Guardrails run synchronously in the request path (fast, cheap, block bad output). Evals run asynchronously post-response (expensive, subjective, feed improvement loops). They serve different purposes: guardrails prevent harm, evals drive improvement.
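A minimal sketch of the split (the card-number regex and the in-memory queue are placeholders for illustration, not our actual checks): the guardrail may block the response before it leaves; the eval is merely enqueued and scored later.

```typescript
// Guardrail vs. eval, sketched. The regex check and queue are
// illustrative stand-ins only.
function handleResponse(output: string, evalQueue: string[]): string {
  // Guardrail: synchronous, in the request path, allowed to block.
  if (/\b\d{16}\b/.test(output)) {
    return "[blocked: possible card number in output]";
  }
  // Eval: asynchronous, off the request path; the user gets the
  // response immediately and scoring happens later.
  evalQueue.push(output);
  return output;
}
```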
The Flywheel
Production trace → human judgment → becomes eval case → hardens suite
→ automated judge learns from critiques → monitors production
→ surfaces new traces for review → repeat
Without this loop, your eval suite is frozen in time — it tests what you imagined users would do, not what they actually do.
Aspirational Evals
Write evals that today score around 10% but become viable as better models arrive. When a new model drops, aspirational evals already answer "is this model ready?", with no need to first decide what to test.
LLM Judge Biases
Eugene Yan’s survey of the research literature surfaced quantified biases:
| Bias | What Happens | Mitigation |
|---|---|---|
| Position bias | Prefers first response in pairwise comparison (~70% for Claude-v1) | Swap order, run twice |
| Verbosity bias | Prefers longer responses (>90% for both Claude and GPT) | Penalize unnecessary verbosity in judge prompt |
| Self-enhancement | Rates own outputs higher (GPT-4: 10%, Claude-v1: 25%) | Use different model family for judging vs generation |
| Criteria drift | "Good" changes as you see more data | Expected. Document evolution. Re-calibrate periodically. |
A panel of three smaller LLMs with majority voting outperformed GPT-4 alone as a judge. Hallucinations baseline at 5-10% and remain difficult to suppress below 2% — don’t target zero, target knowing your rate and monitoring for increases.
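Position bias in particular has a cheap mitigation: judge every pair twice with the order swapped and keep only verdicts that survive the swap. A sketch (the judge here is a synchronous stand-in for what would be an async LLM call):

```typescript
type Preference = "first" | "second";

// Run the pairwise judgment twice with the slots swapped; trust only
// verdicts that are stable under the swap, otherwise call it a tie.
function debiasedCompare(
  a: string,
  b: string,
  judge: (first: string, second: string) => Preference,
): "a" | "b" | "tie" {
  const winner1 = judge(a, b) === "first" ? "a" : "b"; // a in slot 1
  const winner2 = judge(b, a) === "first" ? "b" : "a"; // b in slot 1
  return winner1 === winner2 ? winner1 : "tie";
}
```

A judge that always prefers whichever answer is shown first produces all ties under this scheme, which is exactly the signal that its verdicts were positional, not substantive.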
The Review Practice
Based on Hamel’s “Critique Shadowing” and Braintrust’s cadence:
- Open the review workbench
- Queue shows traces sorted by “interestingness”
- For each trace: read the conversation, look at tool calls, judge the response
- Pass or Fail. If Fail: one-line critique + failure category tag
- Keyboard-driven: `p` for pass, `f` for fail, `j` for next
- 20-30 traces per session. Diminishing returns after that.
The goal is not to review everything — it’s to build a representative sample that can train an automated judge.
Smart Trace Sampling
Don’t review random traces. Score by interestingness:
| Signal | Score |
|---|---|
| New model (recently switched) | +100 |
| Tool errors | +80 |
| High latency outlier (>2x mean) | +60 |
| Low automated score (<0.5) | +40 |
| Unreviewed | +20 |
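The table translates directly into an additive scoring function. A sketch, where the `Trace` fields are assumptions about what a trace export exposes, not Langfuse's actual schema:

```typescript
// Illustrative trace shape; real exports will differ.
interface Trace {
  model: string;
  hadToolError: boolean;
  latencyMs: number;
  autoScore?: number; // 0..1 automated judge score, if one exists
  reviewed: boolean;
}

// Additive interestingness score per the table above.
function interestingness(
  t: Trace,
  recentlySwitchedModels: Set<string>,
  meanLatencyMs: number,
): number {
  let score = 0;
  if (recentlySwitchedModels.has(t.model)) score += 100;
  if (t.hadToolError) score += 80;
  if (t.latencyMs > 2 * meanLatencyMs) score += 60;
  if (t.autoScore !== undefined && t.autoScore < 0.5) score += 40;
  if (!t.reviewed) score += 20;
  return score;
}
```

Sort the queue descending by this score and the reviewer's first 20-30 traces are the ones most likely to teach you something.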
Writing Good Critiques
The critique becomes a few-shot example for the LLM judge:
- Name the failure mode: “Called gmail_search instead of gmail_read for a specific email” not “wrong tool”
- Explain why it matters: “This would send the user’s email to the wrong recipient”
- Be specific and short: One to two sentences.
Failure Categories
Let these emerge from data. Starting points: wrong_tool, hallucinated_tool, missing_action, wrong_args, poor_synthesis, tone_mismatch, incomplete_chain, wrong_delegation.
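One way to keep tags consistent while still letting the list evolve is a single source-of-truth array that the UI, the judgment API, and the judge prompt all share (the helper below is our illustration, not a library API):

```typescript
// Starting categories from above, in one canonical list. Adding a new
// failure mode means appending here and nowhere else.
const FAILURE_CATEGORIES = [
  "wrong_tool", "hallucinated_tool", "missing_action", "wrong_args",
  "poor_synthesis", "tone_mismatch", "incomplete_chain", "wrong_delegation",
] as const;

type FailureCategory = (typeof FAILURE_CATEGORIES)[number];

// Type guard so free-text tags from the review UI can be validated.
function isFailureCategory(tag: string): tag is FailureCategory {
  return (FAILURE_CATEGORIES as readonly string[]).includes(tag);
}
```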
What We Built
Two Tracks
Track 1: Infrastructure Safety Net — lives in the application, runs automatically:
- Per-agent model routing via environment variables, with fallback to the default router
- Async production scoring on sampled traffic, writing to Langfuse
- Failure alerting via Telegram, throttled per-agent
- Quality monitoring cron that alerts on rolling average drops
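The per-agent routing in the first bullet can be sketched like this (the `MODEL_*` variable naming scheme and the model strings are our own convention for illustration, not anything Mastra prescribes):

```typescript
// Resolve a model per agent from environment variables, falling back
// to a shared default. Naming scheme and values are illustrative.
function resolveModel(
  agentName: string,
  env: Record<string, string | undefined>,
): string {
  const key = `MODEL_${agentName.toUpperCase().replace(/-/g, "_")}`;
  return env[key] ?? env["MODEL_DEFAULT"] ?? "cloud-default";
}
```

The fallback chain means switching one agent to local inference is a one-variable change, and unsetting it rolls the agent back to the default router.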
Track 2: Eval Workbench — a keyboard-driven review UI:
- Smart-sampled review queue that surfaces interesting traces from Langfuse
- Judgment API that writes binary verdicts + critiques as Langfuse scores
- Calibration dashboard showing a 2×2 confusion matrix per automated scorer (precision, recall, F1, Cohen's kappa)
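The calibration numbers in that last bullet come from comparing the automated judge's binary verdicts against the human's on the same traces. A sketch of the math, treating `true` as pass:

```typescript
interface Calibration { precision: number; recall: number; f1: number; kappa: number; }

// human[i] and judge[i] are verdicts on the same trace.
function calibrate(human: boolean[], judge: boolean[]): Calibration {
  let tp = 0, fp = 0, fn = 0, tn = 0;
  for (let i = 0; i < human.length; i++) {
    if (judge[i] && human[i]) tp++;
    else if (judge[i] && !human[i]) fp++;
    else if (!judge[i] && human[i]) fn++;
    else tn++;
  }
  const n = human.length;
  const precision = tp / (tp + fp || 1);
  const recall = tp / (tp + fn || 1);
  const f1 = precision + recall > 0
    ? (2 * precision * recall) / (precision + recall)
    : 0;
  // Cohen's kappa: observed agreement corrected for chance agreement.
  const po = (tp + tn) / n;
  const pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n);
  const kappa = pe === 1 ? 1 : (po - pe) / (1 - pe);
  return { precision, recall, f1, kappa };
}
```

Kappa is the number to watch: a judge can show high raw agreement purely because most traces pass, while kappa near zero reveals it agrees with the human no better than chance.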
Judge Design
We use a different model family for judging than for generation — this avoids self-enhancement bias. All scores stored in Langfuse as the single source of truth.
Data Flow
User message → Agent response → Async scoring → Langfuse
→ Review workbench surfaces interesting traces
→ Human: pass/fail + critique
→ Critiques become few-shot examples for automated judge
→ Judge runs on future traces → repeat
Lessons Learned
Namespace your scores. Our eval harness wrote scores named `human-judgment` on traces. The review workbench checked for that same name to determine if a trace was reviewed. Every trace showed as "already reviewed." Fix: distinct names for different systems writing to the same store.
Your eval infra generates more traces than production. Out of 3,000+ Langfuse traces, only ~100 were real conversations. The rest were eval harness runs. The review queue needs aggressive filtering.
Don’t chase pass rates. From applied-llms.org: “When a measure becomes a target, it ceases to be a good measure.” A 70% pass rate with meaningful tests beats 100% with easy ones.
What’s Next
- Promote-to-eval-suite: One-click promotion of a reviewed trace into the golden dataset
- Per-agent custom judges: Need 20+ critiques per agent to build meaningful judges — requires real traffic
- CI quality gates: Assertion tests on every change, golden dataset smoke tests on prompt/model changes
- Judge calibration loop: Iterate on judge prompts based on human-judge disagreements
Sources
| Source | URL |
|---|---|
| Braintrust: “Evals are the new PRD” | braintrust.dev/blog/evals-are-the-new-prd |
| Braintrust: “Five hard-learned lessons” | braintrust.dev/blog/five-lessons-evals |
| Hamel Husain: “Your AI Product Needs Evals” | hamel.dev/blog/posts/evals/ |
| Hamel Husain: “LLM-as-Judge Complete Guide” | hamel.dev/blog/posts/llm-judge/ |
| Hamel Husain: “Field Guide to Rapidly Improving AI” | hamel.dev/blog/posts/field-guide/ |
| Hamel Husain: evals-skills | github.com/hamelsmu/evals-skills |
| Eugene Yan: “Product Evals in Three Steps” | eugeneyan.com/writing/product-evals/ |
| Eugene Yan: “Evaluating LLM-Evaluators” | eugeneyan.com/writing/llm-evaluators/ |
| Eugene Yan: “AlignEval” | eugeneyan.com/writing/aligneval/ |
| applied-llms.org | applied-llms.org |
| Langfuse Score API docs | langfuse.com/docs/scores/custom |