2026-01-11 · Updated 2026-07-15

Enterprise RAG Challenge 3 (ECR3): Winning AI Agent Architectures

Enterprise RAG Challenge 3 (ECR3) asked agents to complete business tasks against a simulated company API. The frozen prize leaderboard is unusually useful because many entrants published not only a score, but also their architecture, model mix, cost, and failure notes.

I reviewed those public descriptions to answer a narrower question: which design choices recurred in strong submissions, and which of them are useful outside this benchmark?

TL;DR: There was no single winning topology. Strong entries ranged from a simple tool-calling agent to specialist pipelines and plan-execute systems. The recurring ideas were more specific: learn from failed traces, validate risky steps near the point of execution, make context policy explicit, and hide API hazards such as pagination behind reliable wrappers. The prize winner’s production prompt was its 80th automatically generated version.

What is the Enterprise RAG Challenge?

The Enterprise RAG Challenge 3 is a large-scale, crowdsourced research project that tests how autonomous AI agents handle complex business tasks. Unlike static benchmarks, ECR3 runs on the Agentic Enterprise Simulation (AGES), a discrete-event simulation that exposes a realistic enterprise API.

What the benchmark tests

Through AGES, agents work inside a simulated company that has:

Employee profiles with specific skills and departments
Projects with team assignments and customer relationships
Corporate wiki with business rules and permission hierarchies
Time tracking and financial operations

Each task spins up an isolated simulation. The company wiki is shared, but operational records vary by task, so an agent cannot solve the suite by memorizing one company state.

Read the scores as a snapshot

ECR3 now exposes both a frozen competition leaderboard and a public benchmark that continued to receive runs after the event. Those pages answer different questions. The figures below describe the prize leaderboard at the competition cutoff, not later best-performing sessions:

Metric	Competition snapshot
Prize submissions	38
Task set	103 business tasks
Highest prize score	0.718
Prize cutoff	December 9, 2025, 13:40 CET

The live benchmark page can show higher scores because it includes later runs. That makes the frozen leaderboard the right source for claims about what won the competition.

Types of tasks

The tasks span several skill areas:

Multi-hop reasoning, such as matching employee skills to project assignments.
Permission validation, such as blocking unauthorized salary changes or data access.
Ambiguous queries, including multilingual and paraphrased requests.
Strict output compliance, including mandatory entity links in responses.

What the submissions actually suggest

The public write-ups do not support a clean verdict such as “multi-agent beats single-agent.” The fourth-place prize entry was explicitly a simple single-agent design. They do support four narrower observations:

Decomposition was useful when it isolated a known failure boundary. Teams separated permission checks, step validation, code execution, or response formatting—not arbitrary “agent roles.”
Validation moved closer to irreversible actions. Several systems checked permissions before execution, reviewed individual steps, or guarded the final response.
Trace-driven iteration mattered. The prize winner converted failed runs into prompt revisions through an automated loop; other teams documented similarly concrete tool and prompt fixes.
Context policy was an architectural choice. Teams tried distillation, preloading, retrieval, and history compression. Their own reports disagree on whether compression helped, so there is no universal recipe.

Five informative approaches

These are not the top five in rank order. I selected them because their public descriptions expose five distinct ways to build the system: automated prompt revision, specialist stages, per-step validation, response guards, and plan-execute isolation. Where I interpret why a design helped, I label that interpretation rather than treating it as a leaderboard finding.

Team	Leaderboard context	Published score
VZS9FL	Prize, 1st	0.718
Lcnxuy	Prize, 8th	0.505
NLN7Dw	Prize, 2nd	0.621
J8Gvbi	Prize, 16th	0.437
key_concept_parallel	Ultimate, 3rd	0.670

1. Evolutionary prompt engineering (Team VZS9FL / @aostrikov)

The highest-scoring approach automated prompt engineering through a self-improvement loop.

[Figure: Evolutionary Prompt Engineering Pipeline]

Instead of hand-tuning the production prompt, the team built a three-agent loop that turned failed traces into candidate revisions.

Three-agent pipeline:

Agent	Role
Main Agent	Runs benchmark, logs all actions and failures
Analyzer Agent	Reviews failed tasks, formulates hypotheses about root causes
Versioner Agent	Generates new prompt version incorporating learnings

The result: The production prompt was the 80th auto-generated version. The team describes the loop as analyzing failed tasks, proposing causes, and deciding which suggestions to incorporate. The leaderboard establishes the final score and iteration count; it does not isolate how much of the gain came from automation rather than the models, tools, or accumulated benchmark feedback.
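The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the team's code: `TaskResult`, `evolve_prompt`, and the three callables are hypothetical names, and the `fake_*` functions are toy stand-ins for what would be model calls in a real system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TaskResult:
    task_id: str
    passed: bool
    trace: str  # logged actions and failures for this task


def evolve_prompt(
    prompt: str,
    run_benchmark: Callable[[str], list[TaskResult]],
    analyze_failures: Callable[[list[TaskResult]], list[str]],
    revise_prompt: Callable[[str, list[str]], str],
    max_versions: int = 5,
) -> tuple[str, float, int]:
    """Return (best prompt seen, its pass rate, its version number)."""
    best = (prompt, -1.0, 0)
    for version in range(1, max_versions + 1):
        results = run_benchmark(prompt)              # Main Agent role
        score = sum(r.passed for r in results) / len(results)
        if score > best[1]:
            best = (prompt, score, version)          # keep only improvements
        failed = [r for r in results if not r.passed]
        if not failed:
            break
        hypotheses = analyze_failures(failed)        # Analyzer Agent role
        prompt = revise_prompt(prompt, hypotheses)   # Versioner Agent role
    return best


# Toy stand-ins (assumptions) so the loop runs without any model calls.
def fake_benchmark(prompt: str) -> list[TaskResult]:
    return [
        TaskResult("t1", True, "ok"),
        TaskResult("t2", "paginate" in prompt, "read only the first page"),
    ]


def fake_analyzer(failed: list[TaskResult]) -> list[str]:
    return ["Always paginate list results." for r in failed if "first page" in r.trace]


def fake_versioner(prompt: str, hypotheses: list[str]) -> str:
    return prompt + "".join(f"\n- {h}" for h in hypotheses)


best_prompt, best_score, best_version = evolve_prompt(
    "Be helpful.", fake_benchmark, fake_analyzer, fake_versioner
)
```

Keeping the best-scoring version rather than the latest one is a design choice of this sketch: a revision that fixes one failure can break another task, so each candidate is scored before it replaces the incumbent.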

Stack: claude-opus-4.5 with the Anthropic Python SDK and native Tool Use.

2. Multi-agent sequential pipeline (Team Lcnxuy / @andrey_aiweapps)

This submission built a sequential workflow in which specialist components owned security checks, context extraction, execution, and entity-link formatting.

[Figure: Multi-Agent Sequential Pipeline]

The documented components:

Security Gate Agent: Pre-execution check that validates permissions against wiki rules before the main loop runs.
Context Extraction Agent: Pulls the critical rules out of massive prompts and preloads user, project, and customer data.
Execution Agent: ReAct-style planning with five internal phases (Identity → Threat Detection → Info Gathering → Access Validation → Execution).
LinkGeneratorAgent: Embedded inside the response tool, parses context to include the required entity links.

The LinkGeneratorAgent is the most transferable part. Putting it inside the response tool makes a benchmark requirement (mandatory entity links) a property of the interface rather than one more instruction the execution model can forget.
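To make the interface idea concrete, here is a rough sketch of a response tool that derives links itself. It is a deterministic stand-in, not the team's LinkGeneratorAgent (which is described as an agent that parses context). The ID shapes (`emp_`, `proj_`, `cust_`), the `Response` type, and the `respond` function are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field

# Hypothetical ID shapes; the real simulation's identifiers may differ.
ENTITY_PATTERN = re.compile(r"\b(?P<kind>emp|proj|cust)_[a-z0-9]+\b")


@dataclass
class Response:
    message: str
    links: list[dict[str, str]] = field(default_factory=list)


def respond(message: str, context: list[str]) -> Response:
    """Response tool: the caller supplies text, the tool attaches the links."""
    # Only IDs that actually appeared in earlier tool results are linkable.
    seen_in_context: set[str] = set()
    for chunk in context:
        seen_in_context.update(m.group(0) for m in ENTITY_PATTERN.finditer(chunk))

    links: list[dict[str, str]] = []
    for m in ENTITY_PATTERN.finditer(message):
        entity_id = m.group(0)
        already_linked = any(link["id"] == entity_id for link in links)
        if entity_id in seen_in_context and not already_linked:
            links.append({"kind": m.group("kind"), "id": entity_id})
    return Response(message, links)


reply = respond(
    "Assigned emp_42 to proj_7 (emp_99 was not found).",
    ["GET /employees returned emp_42", "GET /projects returned proj_7"],
)
```

Because the execution model never formats links, a missing link becomes a bug in one function that can be unit-tested, rather than a prompt-following failure.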

Stack: atomic-agents and instructor frameworks with gpt-5.1-codex-max, gpt-4.1, and claude-sonnet-4.5.

3. Schema-guided reasoning with step validation (Team NLN7Dw / Ilia Ris)

This team paired SGR with fast inference and a validator on every proposed step. The design makes revision cheap: reject a flawed step before it becomes a tool call, then ask the main flow to rework it with the validator’s comments.

[Figure: SGR with Step Validation]

Key components:

Component	Function
StepValidator	Inspects each proposed step. If something is off, sends it back for rework with comments.
Context Management	Full plan from previous turn, plus compressed history for older turns
Dynamic Enrichment	Auto-pulls user profile, projects, customers; LLM filters to inject only task-relevant data
Auto-pagination Wrappers	All list endpoints return complete results automatically

The team reported running gpt-oss-120b on Cerebras at up to roughly 3,000 tokens per second. Fast inference reduced the latency penalty of validation, although the leaderboard does not provide an ablation that separates speed from the rest of the design.
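The reject-and-rework cycle can be sketched as a small loop. This is an illustration under assumptions, not the team's StepValidator: `Step`, `Verdict`, `next_validated_step`, and the two toy callables are hypothetical, and in practice both `propose` and `validate` would be model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Step:
    tool: str
    args: dict = field(default_factory=dict)


@dataclass
class Verdict:
    ok: bool
    comments: str = ""


def next_validated_step(
    propose: Callable[[str], Step],
    validate: Callable[[Step], Verdict],
    max_reworks: int = 2,
) -> Optional[Step]:
    """Return the first proposal the validator accepts, or None if none passes."""
    comments = ""  # empty on the first attempt
    for _ in range(max_reworks + 1):
        step = propose(comments)   # main flow, given the validator's comments
        verdict = validate(step)   # runs before any tool call is made
        if verdict.ok:
            return step
        comments = verdict.comments
    return None


# Toy stand-ins (assumptions) for the two model calls.
def toy_propose(comments: str) -> Step:
    if "check permissions" in comments:
        return Step("get_employee", {"id": "emp_1"})
    return Step("update_salary", {"id": "emp_1", "amount": 1})


def toy_validate(step: Step) -> Verdict:
    if step.tool.startswith("update_"):
        return Verdict(False, "check permissions before mutating")
    return Verdict(True)


accepted = next_validated_step(toy_propose, toy_validate)
```

The `max_reworks` bound is this sketch's choice. Without a cap, a validator and a proposer that disagree can loop indefinitely, which is the same tool-calling loop failure the write-ups mention elsewhere.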

Stack: gpt-oss-120b on Cerebras, with a customized SGR NextStep implementation.

4. Enricher and guard system (Team J8Gvbi / @mishka)

This submission added non-blocking hints and a tiered guard system to an SGR base. As API responses came back, enrichers inspected them and added operational guidance to later context.

[Figure: Enricher & Guard System]

The enricher system:

More than 20 enrichers inspected API responses and injected contextual hints:

RoleEnricher: "You are LEAD of this project, proceed with update."
PaginationHintEnricher: "next_offset=5 means MORE results! MUST paginate."
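A minimal sketch of how such non-blocking enrichers might be wired, assuming each one is a plain function over an API response. The response fields (`next_offset`, `current_user_role`) and the function names are assumptions for illustration; the hint wording is adapted from the two examples above.

```python
from typing import Callable, Optional

Enricher = Callable[[dict], Optional[str]]


def pagination_hint(response: dict) -> Optional[str]:
    # Assumed convention: a non-null next_offset means more pages exist.
    if response.get("next_offset") is not None:
        return f"next_offset={response['next_offset']} means MORE results! MUST paginate."
    return None


def role_hint(response: dict) -> Optional[str]:
    # Assumed field name; the real API may expose the role differently.
    if response.get("current_user_role") == "lead":
        return "You are LEAD of this project, proceed with update."
    return None


def enrich(response: dict, enrichers: list[Enricher]) -> list[str]:
    """Collect hints for later context; a failing enricher never blocks the run."""
    hints: list[str] = []
    for enricher in enrichers:
        try:
            hint = enricher(response)
        except Exception:
            continue  # non-blocking: skip the broken enricher, keep the rest
        if hint:
            hints.append(hint)
    return hints


hints = enrich(
    {"items": ["proj_1"], "next_offset": 5, "current_user_role": "lead"},
    [pagination_hint, role_hint],
)
```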

Three-mode guard system:

Mode	Behavior
Hard block	Impossible actions blocked permanently
Soft block	Risky actions blocked on first attempt, allowed on retry
Soft hint	Guidance without blocking
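The three modes in the table can be modeled as a small state machine. This is a sketch under assumptions: `Mode`, `Decision`, `Guard`, and the rules in `toy_classify` are hypothetical, and the source does not say how the team keyed retries, so this version simply remembers the exact action string.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Mode(Enum):
    HARD_BLOCK = "hard_block"
    SOFT_BLOCK = "soft_block"
    SOFT_HINT = "soft_hint"


@dataclass
class Decision:
    allowed: bool
    note: str = ""


class Guard:
    def __init__(self, classify: Callable[[str], Optional[tuple[Mode, str]]]) -> None:
        self._classify = classify
        self._warned: set[str] = set()  # soft-blocked actions already seen once

    def check(self, action: str) -> Decision:
        finding = self._classify(action)
        if finding is None:
            return Decision(True)
        mode, reason = finding
        if mode is Mode.HARD_BLOCK:
            return Decision(False, reason)            # never allowed
        if mode is Mode.SOFT_BLOCK:
            if action in self._warned:
                return Decision(True, f"allowed on retry: {reason}")
            self._warned.add(action)
            return Decision(False, f"blocked once, retry to confirm: {reason}")
        return Decision(True, reason)                 # soft hint: guidance only


# Toy rules (assumptions), only to exercise the three modes.
def toy_classify(action: str) -> Optional[tuple[Mode, str]]:
    if action.startswith("delete_wiki"):
        return Mode.HARD_BLOCK, "wiki pages cannot be deleted"
    if action.startswith("update_salary"):
        return Mode.SOFT_BLOCK, "salary changes need an authorized requester"
    if action.startswith("list_"):
        return Mode.SOFT_HINT, "remember to read every page"
    return None


guard = Guard(toy_classify)
```

The soft block gives the model one forced pause to reconsider without making a legitimate action impossible.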

Hybrid RAG wiki: Three search streams—regex, semantic, and keyword—covered different query shapes against the company wiki.

Stack: qwen/qwen3-235b-a22b-2507 on the LangChain SGR framework.

5. Plan-execute REPL (Team key_concept_parallel)

This architecture put a hard wall between planning and execution and used a code-generating loop. It appeared on the broader Ultimate leaderboard rather than the frozen prize leaderboard, but its public description is useful because it shows a different form of decomposition: isolation by execution phase rather than by business role.

[Figure: Plan-Execute REPL Architecture]

Different models handled different jobs: one planned, one wrote Python, and a separate decision model chose what to do after each step.

Multi-model setup:

Stage	Model
Planning	openai/gpt-5.1
Code Generation	deepseek/deepseek-v3.2
Post-Step Decision	openai/gpt-4.1
Final Response	openai/gpt-4.1

The step completion REPL:

Planner creates a high-level step.
Code-gen model works in a fresh model context and writes a Python script for it.
Script executes in a task-scoped REPL whose variables persist across steps.
Decision model looks at the result and picks: continue, abort, or replan.

The replan path is the reusable idea. When a step partially fails, the decision model can preserve completed work and rewrite only the remaining plan.
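The step loop and its replan path can be sketched as follows. This is an illustration, not the team's implementation: `run_plan`, `write_code`, `decide`, and the toy scripts are hypothetical, and the sketch assumes a replan replaces the current step and everything after it while leaving earlier steps and their variables intact.

```python
from typing import Callable, Optional


def run_plan(
    plan: list[str],
    write_code: Callable[[str, list[str]], str],
    decide: Callable[[str, Optional[str], list[str]], tuple[str, list[str]]],
    max_replans: int = 2,
) -> tuple[dict, str]:
    """Run plan steps in one shared namespace; return (namespace, status)."""
    namespace: dict = {}  # task-scoped state that persists across steps
    replans, i = 0, 0
    while i < len(plan):
        visible = sorted(k for k in namespace if not k.startswith("__"))
        # Fresh context per step: the code writer sees only the step text
        # and the names of variables that already exist.
        code = write_code(plan[i], visible)
        try:
            exec(code, namespace)  # NOTE: real model output needs a sandbox
            error = None
        except Exception as exc:
            error = repr(exc)
        action, new_tail = decide(plan[i], error, plan[i + 1:])
        if action == "abort":
            return namespace, "aborted"
        if action == "replan":
            replans += 1
            if replans > max_replans:
                return namespace, "aborted"
            plan = plan[:i] + new_tail  # completed steps and variables are kept
            continue
        i += 1
    return namespace, "done"


# Toy stand-ins (assumptions) for the code-gen and decision models.
SCRIPTS = {
    "load hours": "hours = [8, 6, 7]",
    "sum hours": "total = sum(hourz)",  # deliberate typo to force a replan
    "sum hours (retry)": "total = sum(hours)",
}


def toy_write_code(step: str, visible: list[str]) -> str:
    return SCRIPTS[step]


def toy_decide(step: str, error: Optional[str], remaining: list[str]) -> tuple[str, list[str]]:
    if error is None:
        return "continue", remaining
    return "replan", ["sum hours (retry)"] + remaining


state, status = run_plan(["load hours", "sum hours"], toy_write_code, toy_decide)
```

In the toy run, the second step fails on a misspelled variable, the decision function rewrites only that step, and `hours` from the first step is still available to the retry.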

Patterns that recurred across submissions

The implementations differed, but several engineering concerns appeared repeatedly in the public descriptions.

Context management was explicit

No team could hand the model every rule, record, and prior step without making a policy choice. The interesting difference was where each system filtered information.

Context Management Strategies

Strategy	Approach	Best for
Rule Distillation	Pre-process wiki rules into compact instructions while preserving constraints	Lean prompts, fast startup
Aggressive Preloading	Load user/project/customer data before execution	Minimizing tool calls
Hybrid RAG	Regex + semantic + keyword search streams	Complex retrieval needs
History Compression	Keep recent turns full, compress older history	Long conversations

Trade-off: NLN7Dw compressed older turns, while another team, f1Uixf, reported that history compression hurt its results in experiments and kept the full conversation instead. Treat compression as a measured choice, not a default.
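One way to keep that choice measurable is to put compression behind a switch so both policies can run on the same task slice. The sketch below assumes turns are plain strings and that `summarize` is some caller-supplied function (in practice probably a model call). Both names, and `build_history` itself, are hypothetical.

```python
from typing import Callable


def build_history(
    turns: list[str],
    summarize: Callable[[list[str]], str],
    keep_recent: int = 2,
    compress: bool = True,
) -> list[str]:
    """Keep the most recent turns verbatim; optionally fold older ones into a summary."""
    if not compress:
        return list(turns)  # full-conversation policy
    split = max(len(turns) - keep_recent, 0)
    older, recent = turns[:split], turns[split:]
    if not older:
        return list(recent)  # nothing old enough to compress yet
    return [summarize(older)] + list(recent)


def toy_summarize(older: list[str]) -> str:
    # Stand-in for a model-written summary.
    return f"[summary of {len(older)} earlier turns]"


turns = ["t1", "t2", "t3", "t4"]
compressed = build_history(turns, toy_summarize, keep_recent=2)
full = build_history(turns, toy_summarize, compress=False)
```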

Guardrails were placed at different failure boundaries

Several teams placed checks before, during, or after the main loop. These mechanisms addressed different risks and should not be collapsed into one generic “critic agent.”

Guardrail Architecture

Guardrail Type	When	Example
Pre-Execution Gates	Before main loop starts	Security Gate Agent validates permissions against wiki rules
In-Loop Validators	During reasoning	StepValidator checks each proposed action, triggers rework if flawed
Post-Execution Guards	Before final submission	Three-Mode Guard System checks response outcomes against API evidence and policy

Smart tool wrappers

Several teams built abstraction layers around the raw API:

Auto-pagination: Wrappers loop through every page and return the complete dataset.
Fuzzy normalization: “Willingness to travel” gets translated to the will_travel API field.
Specialized reasoning tools: think, plan, and critic tools for controlled deliberation.
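The first two wrappers are easy to sketch. Everything API-shaped here is an assumption: the page format (`items` plus a `next_offset` that is null on the last page), the `fetch_page(offset, limit)` signature, and every entry in `FIELD_LABELS` except the `will_travel` mapping mentioned above. The fuzzy match uses the standard library's `difflib`, which may or may not resemble what any team used.

```python
import difflib
from typing import Callable, Optional


def fetch_all(fetch_page: Callable[[int, int], dict], limit: int = 5) -> list:
    """Follow next_offset until the API reports no further pages."""
    items: list = []
    seen: set[int] = set()
    offset: Optional[int] = 0
    while offset is not None:
        if offset in seen:  # defend against an API that never advances
            raise RuntimeError(f"pagination did not advance at offset {offset}")
        seen.add(offset)
        page = fetch_page(offset, limit)
        items.extend(page["items"])
        offset = page.get("next_offset")
    return items


FIELD_LABELS = {
    "willingness to travel": "will_travel",
    "department": "department",  # hypothetical second field
}


def normalize_field(phrase: str, cutoff: float = 0.6) -> Optional[str]:
    """Map a human phrase to an API field name, or None if nothing is close."""
    match = difflib.get_close_matches(
        phrase.strip().lower(), list(FIELD_LABELS), n=1, cutoff=cutoff
    )
    return FIELD_LABELS[match[0]] if match else None


# Toy endpoint (assumption) with 12 records, served 5 at a time.
def fake_list_employees(offset: int, limit: int) -> dict:
    records = [f"emp_{n}" for n in range(12)]
    end = offset + limit
    return {
        "items": records[offset:end],
        "next_offset": end if end < len(records) else None,
    }


everyone = fetch_all(fake_list_employees)
```

Returning `None` for an unmatched phrase, rather than the nearest guess, keeps the wrapper from silently querying the wrong field.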

Failure modes and the structural fixes teams reported

The write-ups repeatedly mention failures at API and policy boundaries. Their most reusable fixes moved the requirement into code or into a dedicated validation step:

Failure Mode	Description	Architectural Fix
Permission Bypass	Executing restricted actions without verifying user permissions	Pre-execution Security Gate Agent; mandatory Identity → Permissions → Execution sequence
Missing Entity Links	Correct text answer but missing required reference links	Embedded LinkGeneratorAgent in the response tool
Pagination Exhaustion	Processing only the first page of list results	Auto-pagination wrappers for all list endpoints
Tool-Calling Loops	Repeated calls with minor variations	Turn limits; clearer tool schemas; model choice tested on the actual workflow
Context Overloading	Filling context with irrelevant wiki sections	Rule distillation; dynamic context filtering
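Of the fixes in the table, turn limits are the simplest to express in code. The sketch below is one possible shape, not something any team documented: `CallBudget`, its thresholds, and the normalization rule (trim and lowercase string arguments so near-identical calls count as repeats) are all assumptions.

```python
import json
from collections import Counter
from typing import Optional


def _canonical(args: dict) -> str:
    # Collapse minor variations: surrounding whitespace and letter case.
    cleaned = {
        key: value.strip().lower() if isinstance(value, str) else value
        for key, value in args.items()
    }
    return json.dumps(cleaned, sort_keys=True)


class CallBudget:
    def __init__(self, max_turns: int = 30, max_repeats: int = 2) -> None:
        self.max_turns = max_turns
        self.max_repeats = max_repeats
        self.turns = 0
        self.counts: Counter = Counter()

    def register(self, tool: str, args: dict) -> Optional[str]:
        """Record a call; return a stop reason, or None if the call may proceed."""
        self.turns += 1
        if self.turns > self.max_turns:
            return "turn limit reached"
        key = (tool, _canonical(args))
        self.counts[key] += 1
        if self.counts[key] > self.max_repeats:
            return f"{tool} repeated with equivalent arguments"
        return None


budget = CallBudget(max_turns=10, max_repeats=2)
first = budget.register("search", {"q": "Anna"})
second = budget.register("search", {"q": " anna "})
third = budget.register("search", {"q": "ANNA"})
```

The stop reason is returned as text so it can be fed back to the model as an observation instead of ending the run outright.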

A practical adoption order

ECR3 is one simulated company, not a general agent ablation study. Use it as a source of design hypotheses, then test those hypotheses against your own traces. A sensible adoption order is:

Make API correctness deterministic first. Auto-paginate list endpoints, normalize fuzzy fields, validate schemas, and generate required links inside the response tool.
Add checks at real risk boundaries. Verify identity and permission before mutation; validate a step before execution only when the extra model call catches failures worth its cost.
Write down a context policy. Decide what is preloaded, retrieved, compressed, or kept verbatim. Measure the policy by task slice rather than token count alone.
Turn failed traces into regression cases. Classify the failure, change one mechanism, and rerun the affected slice. Automate prompt revision only after that loop is trustworthy.
Decompose when ownership becomes clearer. A separate component is justified when it can own a constraint, use a different model or tool, or be tested independently—not simply because “multi-agent” sounds more capable.

The larger lesson is not that one architecture won. It is that reliable submissions made hidden operational requirements visible in tools, validators, and evaluation loops.