2026-06-10 · Updated 2026-07-22

Evaluating AI Agents in Production: From Traces to Test Suites

A chatbot hands you one answer to grade. An agent hands you a whole tree of decisions: plans, tool calls, retries, and the moment it decided it was done.

That difference requires a different evaluation method. A final answer can look correct even when the agent skipped a required tool, repeated a call 17 times, misread a result, or followed a path that production policy forbids. Answer-only grading hides those failures.

TL;DR: Agent evals need three layers: outcome metrics, trajectory metrics, and component metrics. Build around this loop: trace -> label -> cluster -> dedupe -> versioned dataset -> CI gate -> online monitoring. Use deterministic checks for tool order, arguments, loops, and invariants. Use LLM judges only where the check depends on interpretation, shape those judges with Schema-Guided Reasoning (SGR), and calibrate them against human labels before trusting them.

Why agent evals are different

Traditional LLM evals usually score one input-output pair: relevance, faithfulness, correctness, safety, maybe style. Agents add planning, tool calls, retries, and termination checks, and each step is a new place to fail.

Take a refund agent. The transcript can end well while the trace is wrong:

lookup_order -> issue_refund -> final_answer

The output eval passes. A trajectory eval should fail because verify_identity never ran before issue_refund. For tool-using agents, answer-only evals are smoke tests: they catch total breakage and miss everything else.
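A trajectory check for this case can be a few lines of deterministic code. The sketch below assumes the trace has already been reduced to an ordered list of tool names; the function name `violates_precedence` is illustrative and not part of any library.

```python
def violates_precedence(tool_calls: list[str], guard: str, action: str) -> bool:
    """Return True if `action` runs before any `guard` call has been seen."""
    guard_seen = False
    for name in tool_calls:
        if name == guard:
            guard_seen = True
        elif name == action and not guard_seen:
            return True
    return False


trace = ["lookup_order", "issue_refund", "final_answer"]
print(violates_precedence(trace, guard="verify_identity", action="issue_refund"))  # True
```

The final answer never enters the check. The verdict comes entirely from the order of tool calls.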

There’s a second problem: errors compound. If a workflow has 20 required steps, each succeeds independently, and every step has the same 95% reliability, its end-to-end success rate lands around 36%:

$0.95^{20} \approx 0.36$

So the agent can look solid in isolated checks and still fail most full runs. The break is usually somewhere in the middle, and finding it takes component-level visibility, not another look at the answer.
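To see how quickly the compounding bites, the arithmetic can be run for a few per-step reliabilities. This is a sketch under the same simplifying assumptions as the formula: independent steps with identical reliability. The helper name is illustrative.

```python
def end_to_end_success(step_reliability: float, steps: int) -> float:
    """Success rate of a workflow whose steps all succeed independently."""
    return step_reliability ** steps


for reliability in (0.90, 0.95, 0.99):
    print(reliability, round(end_to_end_success(reliability, 20), 3))
# 0.9 0.122
# 0.95 0.358
# 0.99 0.818
```

Real steps are rarely independent, so treat the result as a rough planning number, not a prediction.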

A row versus a tree: where agent failures hide

Two research teams put numbers on this.

tau-bench gives an agent airline and retail customer-service tasks. The agent talks to a simulated user, calls APIs, and must follow domain policy. After the conversation, the grader checks whether the database reached the annotated goal state. A plausible transcript with the wrong rows still fails.

Under that grading, even GPT-4o succeeded on fewer than half of the tasks. The paper also introduced pass^k: run the same task k times and count a pass only if the agent succeeds in all k runs.

Retail scores that looked tolerable for one attempt dropped below 25% at k = 8. The same agent faced the same task eight times and produced mostly different outcomes. A one-run eval cannot expose that inconsistency.
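One way to estimate pass^k for a single task without running every possible batch of k is combinatorial: given n recorded trials with c successes, take the chance that k trials drawn without replacement all passed. The sketch below uses that form, which is my reading of the estimator in the tau-bench paper; check it against the paper before relying on it. The trial counts are made up for illustration.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Chance that k trials drawn without replacement from the recorded runs all passed."""
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and the number of trials")
    return comb(num_successes, k) / comb(num_trials, k)


# Hypothetical task: 8 recorded trials, 6 of them successful.
print(pass_hat_k(8, 6, 1))            # 0.75
print(round(pass_hat_k(8, 6, 4), 3))  # 0.214
print(pass_hat_k(8, 6, 8))            # 0.0
```

A 75% single-run pass rate collapses to zero at k = 8 because two of the eight runs failed. Average the per-task values to get a suite-level number.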

MAST studies why agents fail. The authors annotated over 1,600 execution traces from 7 popular multi-agent frameworks and sorted the failures into 14 recurring modes. The taxonomy includes vague role definitions (system design), one agent ignoring what another agent reported (inter-agent misalignment), and declaring success without checking the result (no verification). These failures implicate prompts, orchestration logic, and missing checks in the harness. A stronger base model cannot execute a verification step that was never built, so the evaluation target must include the harness around the model.

The adoption gap

LangChain’s State of Agent Engineering survey suggests that many respondents already have the raw material for better evals: 89% reported some observability, while only 52.4% ran offline evals and 37.3% ran online evals.

The same survey reports that 57.3% of respondents already have agents in production. When asked what blocks production, 32% named quality and 20% named latency. This is a vendor survey of its respondents, not a census of agent teams, but it exposes a useful gap between trace collection and systematic evaluation.

That leaves teams in an awkward middle state: they can inspect a bad run after the fact, then still ship the same failure twice.

Every diagnosed production failure should leave behind a trace, a label, a dataset row, and a scorer. A repeatable failure belongs in the regression suite.

Pick metrics by failure mode

The right metric depends on the failure mode, not on the framework. The useful split has three scopes:

Outcome evals answer whether the task succeeded.
Trajectory evals answer whether the path was valid, efficient, and policy-compliant.
Component evals answer which tool, retriever, sub-agent, or decision step broke.

Three levels of agent evaluation with their metrics

Each scope can run offline on fixed, replayable cases before release or online on sampled production traces after the response. The guardrails section below covers that split in detail. Offline evals can require goldens. Online evals should prefer invariants, distributions, and async checks that stay out of the request path.

Question	Metric family	Offline / online contract	Deterministic or judge?	Watch out for
Did the agent call the right tools?	Tool correctness: exact, in-order, or any-order match	Exact goldens offline; required-tool invariants and anomalies online	Deterministic	Exact match punishes valid alternate paths
Did it call them with the right inputs?	Argument correctness, schema validation, parameter match	Expected arguments offline; schema, range, and policy checks online	Both	Right tool plus wrong arguments is still broken
Did it waste steps?	Step efficiency, retry count, loop detection, cost and latency	Step and loop budgets offline; cost and latency drift online	Mostly deterministic	High task completion can hide expensive wandering
Did the task actually succeed?	Task completion, outcome grading, final state diff	Simulator or golden state offline; final state, user signal, or async judge online	Judge or state check	Grade the environment state when possible
Did it preserve context across turns?	Multi-turn fidelity, role adherence, conversation completeness	Scripted long-horizon cases offline; sampled long sessions online	Judge	Single-turn tests say nothing about turn 14
Did it stop at the right time?	Termination correctness, premature success, endless work	Scenario tests offline; loop, timeout, and false-success monitors online	Both	“Done” can be a hallucinated state
Did it interpret tool results correctly?	Tool-result understanding, downstream state checks	Adversarial tool outputs offline; downstream state checks and sampled review online	Both	The tool can be correct while the agent reads it wrong

Start with deterministic metrics. They’re cheap, fast, and they don’t drift.

Tool-call correctness

Tool correctness compares the called tools with the expected tools. Pick the strictness deliberately:

Exact match: the sequence must match exactly. Use this when order is policy, for example lookup_order -> verify_identity -> issue_refund.
In-order match: required tools must appear in the correct relative order, but extra harmless calls are allowed.
Any-order match: required tools must appear, but order can vary.

A small local scorer is enough to start:

from collections import Counter


def tool_correctness(called: list[str], expected: list[str], mode: str = "in_order") -> float:
    if not expected:
        return 1.0
    if mode == "exact":
        return float(called == expected)
    if mode == "any_order":
        matched = sum((Counter(called) & Counter(expected)).values())
        return matched / len(expected)

    rows = [[0] * (len(expected) + 1) for _ in range(len(called) + 1)]
    for i, tool in enumerate(called):
        for j, wanted in enumerate(expected):
            if tool == wanted:
                rows[i + 1][j + 1] = rows[i][j] + 1
            else:
                rows[i + 1][j + 1] = max(rows[i][j + 1], rows[i + 1][j])
    return rows[-1][-1] / len(expected)


called = ["lookup_order", "check_refund_policy", "issue_refund"]
expected = ["lookup_order", "verify_identity", "issue_refund"]

print(round(tool_correctness(called, expected, "exact"), 3))     # 0.0
print(round(tool_correctness(called, expected, "in_order"), 3))  # 0.667

The in_order score is longest-common-subsequence recall: what fraction of the required sequence survived, in the right order. Notice what it ignores. Junk calls don’t lower it, so an agent can score 1.0 here while making twice the calls it needed. When extra calls cost money or mutate state, track precision alongside it (matched required calls over total calls) and read the two together. Recall catches the missing step; precision catches the wandering.

DeepEval’s Tool Correctness metric exposes the same knobs through should_consider_ordering and should_exact_match.
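The precision companion to the in_order recall score can reuse the same longest-common-subsequence count and divide by the number of calls made instead of the number required. The sketch below is self-contained and uses a rolling-row LCS; `tool_recall_precision` is an illustrative name, not a DeepEval function.

```python
def lcs_length(called: list[str], expected: list[str]) -> int:
    """Length of the longest common subsequence, keeping one DP row at a time."""
    prev = [0] * (len(expected) + 1)
    for tool in called:
        curr = [0]
        for j, wanted in enumerate(expected):
            if tool == wanted:
                curr.append(prev[j] + 1)
            else:
                curr.append(max(prev[j + 1], curr[j]))
        prev = curr
    return prev[-1]


def tool_recall_precision(called: list[str], expected: list[str]) -> tuple[float, float]:
    matched = lcs_length(called, expected)
    recall = matched / len(expected) if expected else 1.0
    # With no calls at all, precision is perfect only if nothing was required.
    precision = matched / len(called) if called else float(not expected)
    return recall, precision


expected = ["lookup_order", "verify_identity", "issue_refund"]
wandering = [
    "lookup_order", "lookup_order", "verify_identity",
    "check_refund_policy", "verify_identity", "issue_refund",
]
print(tool_recall_precision(wandering, expected))  # (1.0, 0.5)
```

Recall alone would call this run perfect. Precision shows that half the calls were not needed.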

Argument correctness

Calling the right tool with the wrong arguments is often worse than calling the wrong tool because the trace looks normal.

For simple cases, validate JSON schema and exact values. For semantic cases, store expected arguments and grade the deltas:

{
    "trace_id": "tr_2417",
    "input": "Reschedule order A-100 for next Friday.",
    "expected_tools": ["lookup_order", "reschedule_delivery"],
    "expected_arguments": {
        "reschedule_delivery": {
            "order_id": "A-100",
            "date": "2026-06-19"
        }
    }
}

A tool-name metric can’t catch 2026-06-17 where the policy requires 2026-06-19. The dataset has to store arguments too.

The score that goes with that dataset is parameter-match: the fraction of expected (tool, key, value) triples the agent got right.

def argument_correctness(called_args: dict, expected_args: dict) -> float:
    total = matched = 0
    for tool, params in expected_args.items():
        for key, want in params.items():
            total += 1
            if called_args.get(tool, {}).get(key) == want:
                matched += 1
    return matched / total if total else 1.0

Exact equality is right for IDs, enums, and dates. It’s wrong for free text and floats, where == flags a correct answer as wrong. Grade those fields on their own terms: a normalized string match, a date parse, a numeric tolerance. The metric stays the same; the per-field comparator changes.
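A per-field comparator table is one way to do that. The sketch below keeps the parameter-match score and looks up a comparator by argument name, falling back to exact equality. The field names (`date`, `note`, `amount`) and the tolerances are assumptions for illustration; choose them per tool schema.

```python
from datetime import date
from math import isclose


def same_date(got, want) -> bool:
    try:
        return date.fromisoformat(str(got)) == date.fromisoformat(str(want))
    except ValueError:
        return False


def same_text(got, want) -> bool:
    # Case-insensitive match after collapsing runs of whitespace.
    return " ".join(str(got).lower().split()) == " ".join(str(want).lower().split())


def close_number(got, want) -> bool:
    try:
        return isclose(float(got), float(want), rel_tol=1e-6, abs_tol=1e-9)
    except (TypeError, ValueError):
        return False


# Hypothetical mapping from argument name to comparator.
COMPARATORS = {"date": same_date, "note": same_text, "amount": close_number}


def argument_correctness(called_args: dict, expected_args: dict, comparators: dict = COMPARATORS) -> float:
    total = matched = 0
    for tool, params in expected_args.items():
        for key, want in params.items():
            total += 1
            got = called_args.get(tool, {}).get(key)
            compare = comparators.get(key, lambda a, b: a == b)
            if got is not None and compare(got, want):
                matched += 1
    return matched / total if total else 1.0


expected_args = {"issue_refund": {"order_id": "A-100", "amount": 19.99, "note": "Damaged item"}}
called_args = {"issue_refund": {"order_id": "A-100", "amount": "19.990", "note": "  damaged   ITEM "}}
print(argument_correctness(called_args, expected_args))  # 1.0
```

With plain == on every field, the same call would score 1/3, because the amount arrives as a string and the note differs in case and spacing.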

Efficiency, loops, and dead ends

Completing the task after five redundant tool calls still signals a planning problem, and the run costs more than it should.

Cheap signals you should start with:

Redundant-call rate: identical tool calls with identical arguments repeated more than twice.
Trace shape anomalies: sudden spikes in depth, tool-call count, token count, latency, or cost.
Path convergence: how close the run is to the shortest known valid path for the task.
Termination correctness: whether the agent stopped early, kept working after success, or declared success without the required state change.
Plan adherence: if the agent writes a plan before acting, check whether the trace followed it. A good plan ignored and a bad plan followed perfectly both fail, for opposite reasons, and the diff between plan and trace tells you which.

Run these before a judge whenever you can. A loop detector is a few lines over the trace. It doesn’t need a model.
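As a minimal sketch of such a detector: count identical (tool, arguments) pairs and flag any that repeat more than twice. It assumes each tool call in the trace is a dict with `name` and `arguments` keys; that shape is an assumption, so adapt it to your trace format.

```python
import json
from collections import Counter


def redundant_calls(tool_calls: list[dict], max_repeats: int = 2) -> dict[str, int]:
    """Return identical (tool, arguments) calls that occur more than max_repeats times."""
    counts = Counter(
        # Serialize with sorted keys so argument order does not hide a repeat.
        (call["name"], json.dumps(call.get("arguments", {}), sort_keys=True))
        for call in tool_calls
    )
    return {f"{name}({args})": n for (name, args), n in counts.items() if n > max_repeats}


calls = [
    {"name": "get_status", "arguments": {"order_id": "A-100", "region": "eu"}},
    {"name": "get_status", "arguments": {"region": "eu", "order_id": "A-100"}},
    {"name": "get_status", "arguments": {"order_id": "A-100", "region": "eu"}},
    {"name": "issue_refund", "arguments": {"order_id": "A-100"}},
]
flagged = redundant_calls(calls)
print(flagged)  # {'get_status({"order_id": "A-100", "region": "eu"})': 3}
```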

Task completion and outcome grading

End-to-end, the question is “did the user get what they asked for?”

Two patterns work best:

Referenceless task-completion judging: extract the goal from the input and judge whether the trace plus final answer achieved it. This works online because production traffic rarely has golden outputs.
Environment-state grading: compare the final database rows, files, tickets, bookings, or records to an annotated goal state. This is more robust than transcript matching because agents can find valid paths you didn’t write down.

The second option is better when you can build it. The final state is the contract. The transcript is only evidence.

Two caveats keep this honest, both from the benchmark that popularized state grading. tau-bench can hand a passing score to an agent that does nothing on certain no-op tasks, because the starting state already satisfied the goal. And Anthropic reported an Opus 4.5 run that “failed” a tau2-bench booking task by finding a policy loophole that was actually the better outcome for the user. State grading beats transcript matching, but the goal state is still an annotation, and annotations have bugs. Audit the cases that pass too easily, not only the ones that fail.
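A state grader can surface the first caveat automatically. The sketch below diffs the final state against the goal state and also flags cases where the starting state already met the goal, so a do-nothing agent would pass. States are modeled as flat dicts, which is a simplification; the keys and values are hypothetical.

```python
def state_diff(state: dict, goal_state: dict) -> dict:
    """Map each mismatched goal key to (actual, wanted). Empty means the goal is met."""
    missing = object()
    diff = {}
    for key, want in goal_state.items():
        got = state.get(key, missing)
        if got != want:
            diff[key] = (None if got is missing else got, want)
    return diff


def grade_state(initial: dict, final: dict, goal: dict) -> dict:
    diff = state_diff(final, goal)
    return {
        "passed": not diff,
        "diff": diff,
        # True when the goal held before the agent acted: audit these passes.
        "trivial": not state_diff(initial, goal),
    }


initial = {"booking:42": "confirmed", "refund:42": "none"}
goal = {"booking:42": "cancelled", "refund:42": "issued"}
final = {"booking:42": "cancelled", "refund:42": "none"}
print(grade_state(initial, final, goal))
# {'passed': False, 'diff': {'refund:42': ('none', 'issued')}, 'trivial': False}
```

The diff doubles as the failure message: it names the row that did not reach the annotated state.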

The trace-to-eval flywheel

Mine production failures before brainstorming additional eval cases.

The loop:

Capture the full trace.
Label what failed.
Group similar failures.
Keep one representative golden per cluster.
Version the dataset.
Run it in CI.
Keep scoring sampled production traces online.

The companion repository trace2evals implements the full loop for a faulty support agent. It captures OpenTelemetry GenAI spans, detects failures with deterministic rules, deduplicates cases into a versioned golden dataset, and reruns each golden in CI. The scripted default requires no API key, so make demo reproduces the process offline.

Mine failures with error analysis

Hamel Husain and Shreya Shankar teach an error-analysis workflow for exactly this step; Hamel’s field guide walks through it. The first two steps borrow their names from qualitative research, but the method is straightforward: read traces, take notes, name the patterns.

Open coding: read 30 to 50 real traces and write freeform notes on what went wrong.
Axial coding: cluster those notes into 5 or 6 named failure categories.
Label everything against the taxonomy.
Build metrics for the largest buckets.

Don’t start with labels like reasoning_issue or tool_problem. They’re too vague to test. Use labels like missing_identity_verification, date_argument_mismatch, retried_same_tool_after_429, or stopped_before_database_update. A label that specific tells you exactly what the regression test should assert.

Deduplicate before you promote

The trace-mining loop has a trap: adding every bad trace forever. That creates a dataset that is large, expensive, and narrow. It passes on near-duplicates from March while missing the new shape of the same bug in June.

Group first. Promote one representative golden per cluster. Store the related trace IDs in metadata so a reviewer can inspect the production evidence later.

If a failure cluster recurs after a fix, the regression case did not generalize. Re-cluster and broaden the golden instead of adding 15 point examples.
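A first version of this promotion step can group labeled failures by a simple signature and keep one golden per group. The sketch below clusters on (failure mode, tool sequence) and picks the shortest input as the representative. Both choices are crude stand-ins for real clustering and reviewer judgment, and the record fields are assumptions.

```python
from collections import defaultdict


def promote_goldens(failures: list[dict]) -> list[dict]:
    """Keep one representative golden per (failure_mode, tool sequence) cluster."""
    clusters = defaultdict(list)
    for failure in failures:
        clusters[(failure["failure_mode"], tuple(failure["tools"]))].append(failure)

    goldens = []
    for (mode, tools), members in sorted(clusters.items()):
        # Shortest input as the representative: an arbitrary but stable choice.
        representative = min(members, key=lambda item: len(item["input"]))
        goldens.append({
            "id": f"{mode}:{'>'.join(tools)}",
            "input": representative["input"],
            "failure_modes": [mode],
            # Keep the production evidence for reviewers.
            "related_trace_ids": sorted(item["trace_id"] for item in members),
        })
    return goldens


failures = [
    {"trace_id": "tr_101", "failure_mode": "missing_identity_verification",
     "tools": ["lookup_order", "issue_refund"], "input": "Refund order A-100, it arrived broken."},
    {"trace_id": "tr_102", "failure_mode": "missing_identity_verification",
     "tools": ["lookup_order", "issue_refund"], "input": "Refund A-200."},
    {"trace_id": "tr_103", "failure_mode": "date_argument_mismatch",
     "tools": ["lookup_order", "reschedule_delivery"], "input": "Reschedule order A-100 for next Friday."},
]
goldens = promote_goldens(failures)
print(len(goldens))  # 2
```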

Version the dataset

Version datasets the way you version prompts and code. Whenever anything meaningful changes (model, prompt, tool schema, judge prompt, or app behavior), you want to run the same dataset version before and after.

The CI gate should pin:

dataset version
app version
prompt version
judge model
judge prompt
evaluator code version

If any of those moves, your before/after comparison gets muddy. A goldens-v3.json file in git is fine at small scale. Tool-native snapshots in Langfuse, Phoenix, Braintrust, or LangSmith help once the dataset becomes collaborative.
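One lightweight way to keep comparisons honest is to record the pins with every eval run and list which ones differ between two runs. The sketch below is illustrative: the `EvalPins` class and its field names mirror the list above but are not from any tool, and the version strings are made up.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class EvalPins:
    dataset_version: str
    app_version: str
    prompt_version: str
    judge_model: str
    judge_prompt: str
    evaluator_version: str


def changed_pins(before: EvalPins, after: EvalPins) -> list[str]:
    """Names of the pins that differ between two eval runs."""
    old, new = asdict(before), asdict(after)
    return [name for name in old if old[name] != new[name]]


baseline = EvalPins("goldens-v3", "app-1.8.0", "prompt-12", "judge-model-a", "judge-prompt-4", "evals-0.5.1")
candidate = replace(baseline, prompt_version="prompt-13")

moved = changed_pins(baseline, candidate)
print(moved)  # ['prompt_version']
```

If more than one name comes back, the score delta cannot be attributed to a single change.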

Gate CI

A failing metric must become a failing build, or the eval suite is just a dashboard nobody reads.

The test should rerun the current agent against the golden input. It shouldn’t merely replay the old failed trace:

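import pytest

# GOLDENS, run_agent_and_capture_trace, and flag_failures are assumed to be
# project helpers; tool_correctness is the scorer defined earlier.


@pytest.mark.parametrize("golden", GOLDENS, ids=[item["id"] for item in GOLDENS])
def test_agent_regression(golden: dict) -> None:
    # Rerun the current agent on the golden input instead of replaying the stored trace.
    answer, fresh_trace = run_agent_and_capture_trace(golden["input"])

    # Fail if any failure mode recorded for this golden fires again.
    refired = set(flag_failures(fresh_trace)) & set(golden["failure_modes"])
    assert not refired, f"failure mode regressed: {sorted(refired)}"

    # Fail if the tool sequence scores below the golden's threshold.
    assert tool_correctness(
        called=[call["name"] for call in fresh_trace["tool_calls"]],
        expected=golden["expected_tools"],
        mode=golden.get("tool_match", "in_order"),
    ) >= golden.get("tool_threshold", 1.0)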

This distinction is easy to get wrong. The dataset’s job is to catch the next version of the agent repeating an old failure, not to archive the failure itself.

Calibrate the judge before trusting it

LLM-as-judge helps. It’s also easy to fool yourself with.

G-Eval asks a judge to write explicit rubric steps before scoring. It then sums each rating level weighted by its token probability ($\text{score} = \sum_i p(s_i)\,s_i$). This protocol tracked human ratings better than the older automatic metrics it replaced.

The probability-weighted step needs judge logprobs, which some hosted models do not expose. The broader result still supports an explicit rubric path over a bare score.
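When logprobs are available, the weighted score is a short computation. The sketch below takes a mapping from rating to the logprob of that rating token and returns the expected rating. Renormalizing over only the rating tokens is my assumption, since returned top-logprobs seldom sum to one; the logprob values are invented.

```python
from math import exp, log


def weighted_score(rating_logprobs: dict[int, float]) -> float:
    """Expected rating: sum of p(s_i) * s_i, renormalized over the rating tokens."""
    probs = {rating: exp(logprob) for rating, logprob in rating_logprobs.items()}
    total = sum(probs.values())
    return sum(rating * p / total for rating, p in probs.items())


# Hypothetical judge output: most of the mass on 5, some on 4 and 3.
logprobs = {3: log(0.1), 4: log(0.2), 5: log(0.7)}
print(round(weighted_score(logprobs), 2))  # 4.6
```

Taking only the top token would report 5. The weighted score of 4.6 keeps the judge's uncertainty in the number.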

MT-Bench showed GPT-4 agreeing with human preferences about as often as humans agree with each other, helping make LLM judging mainstream. Later work exposed position, length, and self-preference biases. Judge scores can also shift when the prompt or model version changes.

JudgeBench built response pairs where one answer is objectively wrong across verifiable knowledge, reasoning, math, and code. GPT-4o scored 50.9% on that set, while the strongest judge tested reached roughly 64%. Confident but wrong answers remain a difficult case for model-based judges.

Treat the judge as a measurement instrument: calibrate it against human labels before it grades anything, and recheck it whenever the judge model or prompt moves.

Judge calibration loop

When a judge is required, make the verdict structured. Schema-Guided Reasoning (SGR) defines the judge’s reasoning path as a Pydantic schema. Structured Outputs or constrained decoding then requires fields such as evidence, passed_criteria, failed_criteria, failure_mode, and score.

Put evidence fields before the score. The judge then applies the same predefined rubric stages, in the same field order, on every run and across every compatible model. A reviewer can inspect named fields instead of parsing a paragraph. CI can diff a stable JSON object, while the calibration set exposes which rubric stage disagreed with the human label.
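The field ordering can be sketched without any dependency. The example below uses a stdlib dataclass as a stand-in for the Pydantic schema, so it validates a verdict after the fact and does not constrain decoding. The class name, the binary score, and the consistency rule are illustrative assumptions.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class JudgeVerdict:
    # Field order is the reasoning order: evidence first, score last.
    evidence: list
    passed_criteria: list
    failed_criteria: list
    failure_mode: Optional[str]
    score: int


def parse_verdict(raw: str) -> JudgeVerdict:
    data = json.loads(raw)
    expected = [field.name for field in fields(JudgeVerdict)]
    if list(data) != expected:
        raise ValueError(f"fields must appear in order {expected}, got {list(data)}")
    verdict = JudgeVerdict(**data)
    if verdict.score not in (0, 1):
        raise ValueError("score must be 0 (fail) or 1 (pass)")
    if verdict.score == 1 and verdict.failed_criteria:
        raise ValueError("a passing verdict cannot list failed criteria")
    return verdict


raw = json.dumps({
    "evidence": ["issue_refund was called at step 2", "verify_identity never appears"],
    "passed_criteria": ["order was looked up"],
    "failed_criteria": ["identity verified before refund"],
    "failure_mode": "missing_identity_verification",
    "score": 0,
})
verdict = parse_verdict(raw)
print(verdict.failure_mode, verdict.score)  # missing_identity_verification 0
```

With a real structured-output API, the same field order would be enforced at generation time, not checked afterward.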

It can also change the cost curve. Treat a cheaper model as a candidate, not an automatic replacement. Run it over the same human-labeled calibration set. Compare its agreement, false-pass rate, and false-fail rate with the larger judge. Use it for routine cases only if it clears the thresholds your application set. Keep the larger judge for disagreements, high-risk cases, or calibration runs.

Default judge hygiene checklist:

Prefer binary pass/fail where possible. Five-point scales invite fake precision.
Hand-label 30 to 50 trajectories before writing the final rubric.
Measure judge-human agreement with Cohen’s kappa (agreement corrected for chance, so a judge that always says “pass” scores near zero) or plain TPR/TNR.
Decompose coarse criteria. “Did the agent verify identity before the refund tool call?” beats “Was the trajectory good?”
Emit the verdict through an SGR schema with evidence, failed criteria, failure mode, and score.
Use a judge from a different model family than the generator when possible.
Randomize pairwise order and average both directions.
Penalize unsupported length in the rubric. A longer answer is not a better one.
Pin the judge model, prompt, dataset, schema, and app version.
Recalibrate after model, prompt, tool, policy, or schema changes.

For high-stakes scores, use a small jury instead of one big judge. PoLL tested a panel of smaller judges drawn from disjoint model families and pooled their verdicts. Across six datasets, the panel tracked human judgments better than a single GPT-4 judge. It also avoided the single judge’s self-preference bias and cost over seven times less. Keep human review for decisions that affect money, access, safety, or compliance.

If a judge agrees with humans at 0.55 kappa on your task, don’t use it to block deploys. Use it to sort review queues. If it sits near 0.75 and the failure cost is moderate, a CI gate is much easier to defend.
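Cohen’s kappa for binary pass/fail labels takes only a few lines to compute. The sketch below compares human and judge labels on the same calibration items; the label lists are invented for illustration.

```python
def cohens_kappa(human: list[bool], judge: list[bool]) -> float:
    """Chance-corrected agreement between two binary raters on the same items."""
    n = len(human)
    if n == 0 or n != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        return 0.0  # Both raters are constant: kappa is undefined, report no signal.
    return (observed - expected) / (1 - expected)


human = [True] * 6 + [False] * 4
always_pass = [True] * 10
mostly_right = [True] * 5 + [False] * 4 + [True]

print(cohens_kappa(human, always_pass))             # 0.0
print(round(cohens_kappa(human, mostly_right), 3))  # 0.583
```

The always-pass judge agrees with humans on 60% of items and still scores zero. The second judge agrees on 80% and lands at 0.583, which by the thresholds above is triage territory and not yet a deploy gate.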

Guardrails block inline, online evals observe afterward

People mix these up because both produce scores. The difference is placement: inline in the request path, before release, or after the response.

Guardrails versus online evals

Guardrails run inline. They are fast, deterministic, and user-visible. A guardrail can block a tool call, redact PII, reject prompt injection, or force a retry before the response leaves your system. A false positive is a production bug.

Offline evals run before release. They are reproducible. They gate prompts, models, tools, retrievers, and policies against a fixed dataset.

Online evals run after the response, usually on sampled traffic. They can use slower LLM judges because they are not in the latency path. Their job is to detect drift, find new failure clusters, and feed the next offline dataset.

Get the placement wrong and it hurts either way:

A judge in the request path adds latency and a new source of flakiness.
A guardrail relegated to async scoring lets policy violations reach users.

For high-volume systems, score a small sample with a stronger judge and a wider sample with cheaper classifiers. Alert on clusters and confidence bounds, not one noisy point estimate.
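For a sampled failure rate, a Wilson score interval is one reasonable way to get those bounds. The sketch below alerts only when the lower bound clears a baseline rate. The sample counts, the baseline, and the alert rule are illustrative assumptions.

```python
from math import sqrt


def wilson_interval(failures: int, sample_size: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a failure rate (z = 1.96 gives roughly 95%)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = failures / sample_size
    denom = 1 + z**2 / sample_size
    center = (p + z**2 / (2 * sample_size)) / denom
    half = z * sqrt(p * (1 - p) / sample_size + z**2 / (4 * sample_size**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def should_alert(failures: int, sample_size: int, baseline_rate: float) -> bool:
    low, _high = wilson_interval(failures, sample_size)
    return low > baseline_rate


# Hypothetical sample: 12 judged failures in 200 sampled traces, 2% baseline.
low, high = wilson_interval(12, 200)
print(round(low, 3), round(high, 3))
print(should_alert(12, 200, baseline_rate=0.02))  # True
print(should_alert(1, 10, baseline_rate=0.02))    # False
```

One failure in ten traces is a 10% point estimate, but the interval is too wide to clear a 2% baseline, so it does not page anyone.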

Tooling choices

No single tool owns the whole loop. Most serious stacks use two pieces: a trace store and a CI/eval runner.

Tool	Best fit	CI story	Self-hosting story	Trade-off
DeepEval	Pytest-native agent and LLM evals	Strong: `deepeval test run` fits CI	Core library is local/open source	Judge calls and cloud features can add cost
Inspect AI	Safety, frontier, sandboxed evaluations	CLI and Python API	Fully local/open source	Not a production trace platform
Phoenix	OTel/OpenInference tracing plus evals	Custom scripts	Strong self-host option	Managed alerting lives in the commercial layer
Langfuse	Trace store, datasets, prompt versions	Experiments and custom gates	Strong self-host option	Eval metrics are less batteries-included than DeepEval
LangSmith	LangChain/LangGraph tracing and evals	pytest, Vitest, GitHub workflows	Enterprise self-host	Closed source; watch per-seat and trace-volume pricing
Braintrust	Eval-driven product loop and PR review	Very strong managed PR regression flow	Enterprise/hybrid	Span volume, processed data, and score count can add up
Promptfoo	Prompt tests and red-team suites	GitHub, GitLab, Jenkins, CircleCI	Local/open source core	Great pre-release runner, not a trace hub

The trade-off notes describe where cost comes from, not what it is. Pricing pages move, and vendors count different things: traces, observations, spans, scores, users, retention, or processed data. Recheck live pricing before committing.

Decision shortcuts:

Need self-hosted tracing with OTel portability: start with Phoenix or Langfuse.
Need a code-first CI gate: start with DeepEval.
Already committed to LangGraph: LangSmith is convenient.
Want managed PR regression review: Braintrust is hard to beat.
Security and red-team cases dominate: Promptfoo is the focused tool.
Safety research or controlled benchmark work: Inspect AI is the better fit.

Tool choice is secondary. If production failures don’t become test cases, you’re mostly paying for trace storage.

A practical rollout checklist

Build the evidence pipeline before expanding the metric stack. Start by deciding where the examples will come from.

Collect historical runs first. If the agent already exists, pull traces, support tickets, bug reports, thumbs-down sessions, manual QA transcripts, and dogfooding notes before changing the implementation. If the agent does not exist yet, log every prototype and manual test run from day one.
Instrument the trace shape. Capture messages, tool calls, arguments, tool outputs, errors, token counts, latency, cost, user feedback, app version, prompt version, model version, tool schema version, and final environment state. Use OpenTelemetry GenAI conventions or OpenInference-style spans if you want portability. Use Langfuse, LangSmith, Phoenix, or Braintrust if you want a trace UI and dataset workflow immediately.
Turn real failures into seed cases. Read the traces before summarizing them with a model. For each useful failure, store the input, source trace ID, expected state, expected tool invariants, failure mode, severity, and reviewer note. Langfuse can link dataset items back to production traces; LangSmith can create datasets from traced runs. Keep the source link so the case remains auditable.
If there is no history, generate cold-start cases. Ask an LLM to draft tasks from product requirements, policies, tool schemas, state machines, and support macros. Cover happy paths and failures such as wrong permissions, missing identity checks, stale tool results, ambiguous dates, retries after rate limits, and contradictory tool output.
Do not trust synthetic cases until a human reviews them. Synthetic examples are useful for coverage, not truth. Mark them with source: synthetic and require a reviewer to approve the expected outcome. Run a known-good reference path when possible, and use different model families to generate the case and judge the result.
Build a small balanced dataset. Include successes, failures, refusals, boundary cases, long-turn cases, policy-sensitive cases, and valid alternate paths. Do not make the golden “the exact old transcript.” The golden should encode the required outcome, allowed invariants, and failure mode.
Add deterministic checks first. Required tool order where order is policy, required arguments, schema validation, final-state diffs, loop limits, token and latency ceilings, and task-specific invariants should run before any judge.
Add one SGR-shaped judge. Use it only for the part that needs interpretation. Calibrate it against human labels. If it cannot separate good and bad examples on the calibration set, fix the rubric before wiring it into CI.
Wire the loop. Run the small offline suite in CI, run the larger suite before release, score sampled production traffic online, and promote recurring online failure clusters back into the offline dataset.

Your first eval suite will be wrong in boring ways. Ship it anyway. A suite you run every day is easier to fix than a perfect design doc that never blocks a bad PR.
