Building a Production Eval System for AI Agents
Companion document: There’s an LLM-optimized version of this research designed to be fed as context to Claude, ChatGPT, or any model working on eval systems. Dense, structured, no narrative — just the principles, frameworks, and source citations.
The Problem
We run a multi-agent AI system built on Mastra. We wanted to move most agents from cloud inference (Claude Sonnet) to local inference (Gemma 27B on Ollama). Not all agents could make the switch — one stays on cloud because its multi-step tool chains are too complex for current local models.
That forced the question: how do we know the agents are still good after the switch? And more broadly: how do we build a practice where we can prove agents are getting better over time?
We had eval infrastructure — golden datasets, an LLM-as-judge harness — but no continuous quality loop. Before building more, we read what practitioners who’ve shipped production eval systems actually recommend.
What the Best Practitioners Agree On
We drew from five sources that converge on the same core principles.
| Source | Key Contribution |
|---|---|
| Braintrust | Flywheel model, aspirational evals |
| Hamel Husain | Domain expert calibration, binary scoring, 6 eval skills |
| Eugene Yan | AlignEval data-first workflow, LLM judge bias research |
| applied-llms.org | Intern test, Goodhart’s Law, hallucination baselines |
| Langfuse docs | Score API types, annotation queues |
Binary Pass/Fail Over Likert Scales
Every source states or implies this. Hamel: “A binary decision forces everyone to consider what truly matters.” Likert scales create ambiguity — is a 3 acceptable? Binary forces a decision. When you fail something, you write a critique explaining why. The critique is more valuable than any number.
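This can even be enforced structurally. A minimal sketch (the types and the `fail` helper are our own illustration, not from any framework): a fail verdict cannot exist without a critique.

```typescript
// Illustrative sketch: a binary verdict where "fail" is inseparable
// from its critique. Names are hypothetical, not from any library.
type Judgment =
  | { verdict: "pass" }
  | { verdict: "fail"; critique: string };

// Constructor that rejects empty critiques: if you fail something,
// you must say why.
function fail(critique: string): Judgment {
  if (critique.trim().length === 0) {
    throw new Error("A fail verdict requires a critique");
  }
  return { verdict: "fail", critique };
}
```

The point of the discriminated union is that downstream code (few-shot example builders, dashboards) can rely on every failure carrying its explanation.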
The Domain Expert is the Benevolent Dictator
One person’s judgment calibrates the system. Not a committee. Eval criteria drift as you see more data — that’s expected. But it must be one person’s evolving judgment. Their critiques become the few-shot examples that train LLM judges.
Look at the Data (60-80% of Effort)
Hamel: “The real value of this process is looking at your data and doing careful analysis.” Infrastructure is only valuable insofar as it removes friction from looking at data. The eval workbench is a trace review tool first, a scoring system second.
Custom Scorers, Not Generic Ones
“Generic metrics embed someone else’s requirements, not yours.” Off-the-shelf metrics (helpfulness, coherence) measure what someone else decided matters. Your agents have specific failure modes. Build judges calibrated to those.
Guardrails Are Not Evals
Guardrails run synchronously in the request path (fast, cheap, block bad output). Evals run asynchronously post-response (expensive, subjective, feed improvement loops). They serve different purposes: guardrails prevent harm, evals drive improvement.
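A minimal sketch of the split (the card-number regex and the in-memory queue are placeholders for illustration, not our actual checks): the guardrail may block the response before it leaves; the eval is merely enqueued and scored later.

```typescript
// Guardrail vs. eval, sketched. The regex check and queue are
// illustrative stand-ins only.
function handleResponse(output: string, evalQueue: string[]): string {
  // Guardrail: synchronous, in the request path, allowed to block.
  if (/\b\d{16}\b/.test(output)) {
    return "[blocked: possible card number in output]";
  }
  // Eval: asynchronous, off the request path; the user gets the
  // response immediately and scoring happens later.
  evalQueue.push(output);
  return output;
}
```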
The Flywheel
Production trace → human judgment → becomes eval case → hardens suite
→ automated judge learns from critiques → monitors production
→ surfaces new traces for review → repeat
Without this loop, your eval suite is frozen in time — it tests what you imagined users would do, not what they actually do.
Aspirational Evals
Write evals that today score around 10% but become viable as better models arrive. When a new model drops, aspirational evals already answer "is this model ready?", with no need to first decide what to test.
LLM Judge Biases
Eugene Yan’s survey of the research literature surfaced quantified biases:
| Bias | What Happens | Mitigation |
|---|---|---|
| Position bias | Prefers first response in pairwise comparison (~70% for Claude-v1) | Swap order, run twice |
| Verbosity bias | Prefers longer responses (>90% for both Claude and GPT) | Penalize unnecessary verbosity in judge prompt |
| Self-enhancement | Rates own outputs higher (GPT-4: 10%, Claude-v1: 25%) | Use different model family for judging vs generation |
| Criteria drift | "Good" changes as you see more data | Expected. Document evolution. Re-calibrate periodically. |
A panel of three smaller LLMs with majority voting outperformed GPT-4 alone as a judge. Hallucinations baseline at 5-10% and remain difficult to suppress below 2% — don’t target zero, target knowing your rate and monitoring for increases.
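Position bias in particular has a cheap mitigation: judge every pair twice with the order swapped and keep only verdicts that survive the swap. A sketch (the judge here is a synchronous stand-in for what would be an async LLM call):

```typescript
type Preference = "first" | "second";

// Run the pairwise judgment twice with the slots swapped; trust only
// verdicts that are stable under the swap, otherwise call it a tie.
function debiasedCompare(
  a: string,
  b: string,
  judge: (first: string, second: string) => Preference,
): "a" | "b" | "tie" {
  const winner1 = judge(a, b) === "first" ? "a" : "b"; // a in slot 1
  const winner2 = judge(b, a) === "first" ? "b" : "a"; // b in slot 1
  return winner1 === winner2 ? winner1 : "tie";
}
```

A judge that always prefers whichever answer is shown first produces all ties under this scheme, which is exactly the signal that its verdicts were positional, not substantive.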
The Review Practice
Based on Hamel’s “Critique Shadowing” and Braintrust’s cadence:
- Open the review workbench
- Queue shows traces sorted by “interestingness”
- For each trace: read the conversation, look at tool calls, judge the response
- Pass or Fail. If Fail: one-line critique + failure category tag
- Keyboard-driven: `p` for pass, `f` for fail, `j` for next
- 20-30 traces per session. Diminishing returns after that.
The goal is not to review everything — it’s to build a representative sample that can train an automated judge.
Smart Trace Sampling
Don’t review random traces. Score by interestingness:
| Signal | Score |
|---|---|
| New model (recently switched) | +100 |
| Tool errors | +80 |
| High latency outlier (>2x mean) | +60 |
| Low automated score (<0.5) | +40 |
| Unreviewed | +20 |
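The table translates directly into an additive scoring function. A sketch, where the `Trace` fields are assumptions about what a trace export exposes, not Langfuse's actual schema:

```typescript
// Illustrative trace shape; real exports will differ.
interface Trace {
  model: string;
  hadToolError: boolean;
  latencyMs: number;
  autoScore?: number; // 0..1 automated judge score, if one exists
  reviewed: boolean;
}

// Additive interestingness score per the table above.
function interestingness(
  t: Trace,
  recentlySwitchedModels: Set<string>,
  meanLatencyMs: number,
): number {
  let score = 0;
  if (recentlySwitchedModels.has(t.model)) score += 100;
  if (t.hadToolError) score += 80;
  if (t.latencyMs > 2 * meanLatencyMs) score += 60;
  if (t.autoScore !== undefined && t.autoScore < 0.5) score += 40;
  if (!t.reviewed) score += 20;
  return score;
}
```

Sort the queue descending by this score and the reviewer's first 20-30 traces are the ones most likely to teach you something.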
Writing Good Critiques
The critique becomes a few-shot example for the LLM judge:
- Name the failure mode: “Called gmail_search instead of gmail_read for a specific email” not “wrong tool”
- Explain why it matters: “This would send the user’s email to the wrong recipient”
- Be specific and short: One to two sentences.
Failure Categories
Let these emerge from data. Starting points: wrong_tool, hallucinated_tool, missing_action, wrong_args, poor_synthesis, tone_mismatch, incomplete_chain, wrong_delegation.
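One way to keep tags consistent while still letting the list evolve is a single source-of-truth array that the UI, the judgment API, and the judge prompt all share (the helper below is our illustration, not a library API):

```typescript
// Starting categories from above, in one canonical list. Adding a new
// failure mode means appending here and nowhere else.
const FAILURE_CATEGORIES = [
  "wrong_tool", "hallucinated_tool", "missing_action", "wrong_args",
  "poor_synthesis", "tone_mismatch", "incomplete_chain", "wrong_delegation",
] as const;

type FailureCategory = (typeof FAILURE_CATEGORIES)[number];

// Type guard so free-text tags from the review UI can be validated.
function isFailureCategory(tag: string): tag is FailureCategory {
  return (FAILURE_CATEGORIES as readonly string[]).includes(tag);
}
```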
What We Built
Two Tracks
Track 1: Infrastructure Safety Net — lives in the application, runs automatically:
- Per-agent model routing via environment variables, with fallback to the default router
- Async production scoring on sampled traffic, writing to Langfuse
- Failure alerting via Telegram, throttled per-agent
- Quality monitoring cron that alerts on rolling average drops
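The per-agent routing in the first bullet can be sketched like this (the `MODEL_*` variable naming scheme and the model strings are our own convention for illustration, not anything Mastra prescribes):

```typescript
// Resolve a model per agent from environment variables, falling back
// to a shared default. Naming scheme and values are illustrative.
function resolveModel(
  agentName: string,
  env: Record<string, string | undefined>,
): string {
  const key = `MODEL_${agentName.toUpperCase().replace(/-/g, "_")}`;
  return env[key] ?? env["MODEL_DEFAULT"] ?? "cloud-default";
}
```

The fallback chain means switching one agent to local inference is a one-variable change, and unsetting it rolls the agent back to the default router.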
Track 2: Eval Workbench — a keyboard-driven review UI:
- Smart-sampled review queue that surfaces interesting traces from Langfuse
- Judgment API that writes binary verdicts + critiques as Langfuse scores
- Calibration dashboard showing a 2×2 confusion matrix per automated scorer (precision, recall, F1, Cohen's kappa)
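The calibration numbers in that last bullet come from comparing the automated judge's binary verdicts against the human's on the same traces. A sketch of the math, treating `true` as pass:

```typescript
interface Calibration { precision: number; recall: number; f1: number; kappa: number; }

// human[i] and judge[i] are verdicts on the same trace.
function calibrate(human: boolean[], judge: boolean[]): Calibration {
  let tp = 0, fp = 0, fn = 0, tn = 0;
  for (let i = 0; i < human.length; i++) {
    if (judge[i] && human[i]) tp++;
    else if (judge[i] && !human[i]) fp++;
    else if (!judge[i] && human[i]) fn++;
    else tn++;
  }
  const n = human.length;
  const precision = tp / (tp + fp || 1);
  const recall = tp / (tp + fn || 1);
  const f1 = precision + recall > 0
    ? (2 * precision * recall) / (precision + recall)
    : 0;
  // Cohen's kappa: observed agreement corrected for chance agreement.
  const po = (tp + tn) / n;
  const pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n);
  const kappa = pe === 1 ? 1 : (po - pe) / (1 - pe);
  return { precision, recall, f1, kappa };
}
```

Kappa is the number to watch: a judge can show high raw agreement purely because most traces pass, while kappa near zero reveals it agrees with the human no better than chance.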
Judge Design
We use a different model family for judging than for generation — this avoids self-enhancement bias. All scores stored in Langfuse as the single source of truth.
Data Flow
User message → Agent response → Async scoring → Langfuse
→ Review workbench surfaces interesting traces
→ Human: pass/fail + critique
→ Critiques become few-shot examples for automated judge
→ Judge runs on future traces → repeat
Lessons Learned
Namespace your scores. Our eval harness wrote scores named `human-judgment` on traces. The review workbench checked for that same name to determine if a trace was reviewed. Every trace showed as "already reviewed." Fix: distinct names for different systems writing to the same store.
Your eval infra generates more traces than production. Out of 3,000+ Langfuse traces, only ~100 were real conversations. The rest were eval harness runs. The review queue needs aggressive filtering.
Don’t chase pass rates. From applied-llms.org: “When a measure becomes a target, it ceases to be a good measure.” A 70% pass rate with meaningful tests beats 100% with easy ones.
What’s Next
- Promote-to-eval-suite: One-click promotion of a reviewed trace into the golden dataset
- Per-agent custom judges: Need 20+ critiques per agent to build meaningful judges — requires real traffic
- CI quality gates: Assertion tests on every change, golden dataset smoke tests on prompt/model changes
- Judge calibration loop: Iterate on judge prompts based on human-judge disagreements
Sources
| Source | URL |
|---|---|
| Braintrust: “Evals are the new PRD” | braintrust.dev/blog/evals-are-the-new-prd |
| Braintrust: “Five hard-learned lessons” | braintrust.dev/blog/five-lessons-evals |
| Hamel Husain: “Your AI Product Needs Evals” | hamel.dev/blog/posts/evals/ |
| Hamel Husain: “LLM-as-Judge Complete Guide” | hamel.dev/blog/posts/llm-judge/ |
| Hamel Husain: “Field Guide to Rapidly Improving AI” | hamel.dev/blog/posts/field-guide/ |
| Hamel Husain: evals-skills | github.com/hamelsmu/evals-skills |
| Eugene Yan: “Product Evals in Three Steps” | eugeneyan.com/writing/product-evals/ |
| Eugene Yan: “Evaluating LLM-Evaluators” | eugeneyan.com/writing/llm-evaluators/ |
| Eugene Yan: “AlignEval” | eugeneyan.com/writing/aligneval/ |
| applied-llms.org | applied-llms.org |
| Langfuse Score API docs | langfuse.com/docs/scores/custom |