The harness is the car: engineering reliability around a fixed model

The single most important shift in 2025-2026 agent engineering is that the scaffolding around the model now drives production outcomes more than the model itself. Anthropic's engineering team states it bluntly on their own landing page: "Infrastructure configuration can swing agentic coding benchmarks by several percentage points—sometimes more than the leaderboard gap between top models." This is no longer a community hypothesis — it is the stated operating assumption of the labs shipping frontier models, and it has a name. The "harness" is the code, configuration, permissions, tools, hooks, sessions, sandboxes, and filesystem conventions that wrap an LLM and turn it from a turn-based chatbot into a long-running worker. Parts 1-3 of this series covered reasoning loops, memory, and tool ergonomics. Part 4 is about everything else — the load-bearing infrastructure that determines whether an agent ships a feature or silently declares victory at 3 AM.

This report organizes the primary evidence by the seven research areas commissioned, privileging named sources, exact URLs, and verbatim language over secondary paraphrase.

From guardrails to harness: the term takes hold

The word "harness" entered agentic-AI discourse gradually through 2025, migrating from its older LM-evaluation meaning (think EleutherAI's lm-evaluation-harness) into a broader term for everything around the model. It displaced "guardrails" — which in practice had narrowed to input/output filters and policy checks — because practitioners needed a word that also covered tools, memory, sandboxing, orchestration, and lifecycle hooks.

The inflection point was a six-month cascade of primary posts. On July 18, 2025, Yichao "Peak" Ji at Manus published "Context Engineering for AI Agents: Lessons from Building Manus," with the line that became the community's rallying cry: "It's an experimental science—and we've rebuilt our agent framework four times, each time after discovering a better way to shape context." Ji nicknamed this process "Stochastic Graduate Descent" and warned: "If model progress is the rising tide, we want Manus to be the boat, not the pillar stuck to the seabed." (A later webinar updated the count to five rebuilds, which is how the number circulates in secondary write-ups; the original blog says four.) Four days earlier, on July 14, 2025, Geoffrey Huntley published the Ralph Wiggum post that reduced the harness to a single bash line. By December 2025, Vercel's engineering team had published "We Removed 80% of Our Agent's Tools" about their d0 text-to-SQL agent, reporting the specific numbers: tokens 145,463 → 67,483, steps 100 → 19, latency 724s → 141s, and, in the post's own words, "100% success rate instead of 80%. Fewer steps, fewer tokens, faster responses. All by doing less."

By early 2026 the labs had formalized the vocabulary. OpenAI's February 11, 2026 post by Ryan Lopopolo, "Harness engineering: leveraging Codex in an agent-first world," coined "harness engineering" as a named discipline. LangChain's March 10, 2026 "The Anatomy of an Agent Harness" by Vivek Trivedy supplied the most-quoted one-liner: "If you're not the model, you're the harness." Trivedy also offered the formulation that best captures the whole shift: "Agent = Model + Harness. Harness engineering is how we build systems around models to turn them into work engines." Phil Schmid's January 5, 2026 post "The importance of Agent Harness in 2026" added the operating-system analogy: "The Model is the CPU… Context Window is the RAM… Agent Harness is the Operating System" — the clearest mental model for senior engineers coming from systems backgrounds.

The claim from the original brief that "the model is the engine, the harness is the car" could not be verified to a specific named author in primary sources; it circulates widely in secondary coverage but appears to be a paraphrase of the OpenHarness tagline "The model is the agent. The code is the harness."

Anatomy of a harness: eight load-bearing components

The most rigorous component map in a single primary source is Trivedy's LangChain post, which names five: system prompts; tools, skills, MCPs and their descriptions; bundled infrastructure (filesystem, sandbox, browser); orchestration logic (subagent spawning, handoffs, model routing); and hooks/middleware for deterministic execution. HKUDS's OpenHarness expands that to eleven subsystems (engine, tools, skills, plugins, permissions, hooks, commands, mcp, memory, tasks, coordinator), each mapping cleanly onto a production failure mode. Synthesizing across primary sources, eight components carry real weight.

Permission systems enforce deterministic policy before any tool runs. Claude Code's docs specify five canonical modes: default, acceptEdits, plan, auto, and bypassPermissions, plus a sixth dontAsk mode for locked-down agents where anything not in allowed_tools is denied outright. The evaluation order is codified: deny rules first (they block even in bypassPermissions), then the active mode, then hooks, then canUseTool callbacks. The auto mode routes each action through a Sonnet 4.6 classifier that blocks deploys, mass deletion, force-push, and curl | bash patterns. Codex takes a parallel approach with three approval policies (auto, read-only, full-access) and three OS-level sandbox modes (read-only, workspace-write, danger-full-access) enforced by Seatbelt on macOS, Landlock+seccomp on Linux, and restricted tokens on Windows. Both designs express the same insight: permission is infrastructure, not prompt.
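The shared evaluation chain can be sketched in a few lines. This is a toy model, not either product's implementation: the mode names follow Claude Code's vocabulary, the rule and hook shapes are our own, and hooks are checked before the mode so that a hook deny blocks even bypassPermissions.

```python
from enum import Enum

class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    ASK = "ask"      # escalate to a human or a classifier

def evaluate(tool, args, *, deny_rules, hooks, mode, can_use_tool):
    """Toy permission chain: deny rules -> hooks -> mode -> callback."""
    # 1. Deny rules are absolute: they block even in bypassPermissions.
    if any(rule(tool, args) for rule in deny_rules):
        return Decision.DENY
    # 2. PreToolUse-style hooks may veto before the mode is consulted.
    for hook in hooks:
        if hook(tool, args) == Decision.DENY:
            return Decision.DENY
    # 3. The active permission mode sets the default disposition.
    if mode == "bypassPermissions":
        return Decision.ALLOW
    if mode == "acceptEdits" and tool in {"edit", "write"}:
        return Decision.ALLOW
    if mode == "plan" and tool not in {"read", "grep"}:
        return Decision.DENY
    # 4. Anything still undecided falls through to the callback.
    return can_use_tool(tool, args)
```

The point of the structure is that stages 1 and 2 are deterministic code that no mode can override, which is exactly the "permission is infrastructure, not prompt" claim.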

Human-in-the-loop controls are where LangGraph's interrupt() API sits. The official docs describe it as: "Interrupt the graph with a resumable exception from within a node. The interrupt function enables human-in-the-loop workflows by pausing graph execution." A critical behavior for engineers to internalize: "When execution resumes… the runtime restarts the entire node from the beginning — it does not resume from the exact line where interrupt was called." Interrupts require a checkpointer; resuming takes a Command(resume=value) or Command(update={...}, resume=...) to edit state mid-run. As of v0.4 interrupts can resume out of order, which matters for parallel tool calls.
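The restart-from-the-top semantics are easy to trip over, so here is a toy reimplementation (not LangGraph's code) that makes the documented behavior concrete: on resume, the whole node re-runs, and only the interrupt() call site returns the human's value instead of raising.

```python
class Interrupt(Exception):
    """Resumable exception: carries the payload surfaced to the human."""
    def __init__(self, payload):
        self.payload = payload

def run_node(node, state, resume=None, log=None):
    """Mimics the documented semantics: the *entire node* re-runs on resume;
    the interrupt() call returns the resume value instead of raising."""
    resumes = iter([resume] if resume is not None else [])
    def interrupt(payload):
        try:
            return next(resumes)       # resuming: hand back the human's answer
        except StopIteration:
            raise Interrupt(payload)   # first pass: pause execution here
    return node(state, interrupt, log)

# A node whose pre-interrupt work runs twice, once per pass.
def review_node(state, interrupt, log):
    log.append("expensive-setup")      # re-executed on resume!
    answer = interrupt({"question": "approve?"})
    return {**state, "approved": answer}

log = []
try:
    run_node(review_node, {}, log=log)
except Interrupt:
    pass                               # paused; payload goes to a human
state = run_node(review_node, {}, resume=True, log=log)
```

After the resumed call, log contains "expensive-setup" twice: the practical lesson is to keep side-effecting work out of the span before an interrupt, or make it idempotent.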

Context lifecycle management is where filesystem-as-memory becomes a first-class primitive. Manus's original post popularized the pattern of offloading tool outputs to files with reversible URL/path references. Anthropic's long-running-agent post codifies it: the claude-progress.txt file + git history is the bridge across context windows. Trivedy lists filesystems as the first harness primitive: "durable storage to interface with real data, offload information that doesn't fit in context, and persist work across sessions." Compaction is the fallback, not the strategy.
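A minimal sketch of the offloading pattern follows; the threshold and the file-naming scheme are our own inventions, and the key property is reversibility, as in Manus's URL/path references.

```python
import hashlib
import tempfile
from pathlib import Path

OFFLOAD_THRESHOLD = 2_000  # chars kept inline before spilling to disk (arbitrary)

def offload(tool_output: str, workdir: Path) -> dict:
    """Replace a large tool output with a reversible path reference,
    keeping only a short preview in the context window."""
    if len(tool_output) <= OFFLOAD_THRESHOLD:
        return {"inline": tool_output}
    digest = hashlib.sha256(tool_output.encode()).hexdigest()[:12]
    path = workdir / f"tool-output-{digest}.txt"
    path.write_text(tool_output)
    return {
        "path": str(path),                  # the agent can re-read this later
        "preview": tool_output[:200],
        "bytes": len(tool_output),
    }

workdir = Path(tempfile.mkdtemp())
ref = offload("x" * 10_000, workdir)
restored = Path(ref["path"]).read_text()    # reversible: nothing was lost
```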

Hooks and lifecycle events are where deterministic policy meets the model's non-determinism. Claude Code's documented hook events form a comprehensive lifecycle: PreToolUse, PostToolUse, PostToolUseFailure, PermissionRequest, PermissionDenied, UserPromptSubmit, Stop, SubagentStart, SubagentStop, SessionStart, SessionEnd, PreCompact, and Notification. A PreToolUse hook that returns permissionDecision: "deny" blocks a tool even in bypassPermissions — hooks outrank modes. The SessionStart hook firing on compact is the canonical pattern for re-injecting project context after compaction trims history.
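A PreToolUse-style hook might look like the sketch below. The permissionDecision field mirrors the wording quoted above, but hook event and output schemas vary by version, so treat the field names and patterns here as illustrative rather than as the documented contract.

```python
import json
import re
import sys

# Illustrative risky-command patterns; a real policy would be far stricter.
RISKY = [r"rm\s+-rf", r"git\s+push\s+--force", r"curl[^|]*\|\s*(ba)?sh"]

def decide(event: dict) -> dict:
    """PreToolUse-style gate: deny risky Bash commands, otherwise return
    nothing so the normal permission chain proceeds."""
    cmd = event.get("tool_input", {}).get("command", "")
    if event.get("tool_name") == "Bash" and any(re.search(p, cmd) for p in RISKY):
        return {"permissionDecision": "deny",
                "permissionDecisionReason": f"blocked pattern in {cmd!r}"}
    return {}

# Wired up as a hook script, this would read the event from stdin and
# print its decision as JSON:
#   print(json.dumps(decide(json.load(sys.stdin))))
```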

The Ralph Wiggum pattern is the minimalist extreme of hook-based orchestration. Geoffrey Huntley's original post defines it in one literal line: `while :; do cat PROMPT.md | claude-code ; done`. No orchestrator, no planner — the agent reads the plan file, does one task, commits, exits, and the bash loop restarts it with a fresh context. Huntley describes why it works: "Ralph is monolithic. Ralph works autonomously in a single repository as a single process that performs one task per loop." Anthropic shipped an official ralph-wiggum plugin in anthropics/claude-code that implements it via a Stop hook that intercepts the agent's attempt to exit and feeds the same prompt back. Ralph is not just a meme — Anthropic's own "Harness design for long-running application development" (Prithvi Rajasekaran, Anthropic Labs) cites it by name as evidence that "the broader developer community has converged on similar insights."

Tool orchestration, observability, and multi-agent coordination round out the set. The observability point is underrated: Codex's post describes "per-worktree ephemeral observability stack with LogQL/PromQL/TraceQL" wired into the agent runtime, and single Codex runs "work on a task for upwards of six hours." Multi-agent coordination — planner, generator, evaluator — is covered in depth below.

The long-running agent problem and Anthropic's three-stage evolution

Compaction alone fails for multi-hour tasks. The core failure modes, each named in primary sources, are premature completion ("one-shotting" the app), feature amnesia across sessions, duplicated work, and context anxiety — a term Cognition coined in its "Rebuilding Devin for Claude Sonnet 4.5" post (September 2025). Cognition's exact description: "This 'context anxiety' can actually hurt performance: we found the model taking shortcuts or leaving tasks incomplete when it believed it was near the end of its window, even when it had plenty of room left." Their workaround was specific and surprising — enable the 1M-token beta but cap usage at 200k, which convinced the model it had runway. The anxiety was a training artifact, not a window-size constraint.

Anthropic's public response played out across three engineering posts that together describe a three-stage evolution.

Stage one (solo agent): a single Claude Agent SDK loop. Anthropic's "Effective harnesses for long-running agents" acknowledges this baseline fails: "even a frontier coding model like Opus 4.5 running on the Claude Agent SDK in a loop across multiple context windows will fall short of building a production-quality web app if it's only given a high-level prompt."

Stage two (initializer + coder): the claude-progress.txt pattern. "We looked to human engineers for inspiration" — an initializer agent runs once, writes a JSON feature list, creates init.sh, makes the first git commit, and seeds claude-progress.txt. A coding agent then runs in shifts against that artifact. Anthropic's footnote is telling: "We refer to these as separate agents in this context only because they have different initial user prompts. The system prompt, set of tools, and overall agent harness was otherwise identical." The stage-two agents are not different programs — they are the same harness with different opening prompts.
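Hypothetically, one stage-two "shift" could be driven by a loop like the sketch below. The claude-progress.txt artifact and the JSON feature list come from the post; run_agent, the features.json filename, and the prompt text are our own stand-ins.

```python
import json
import shutil
import subprocess
from pathlib import Path

def run_shift(workdir: Path, run_agent) -> bool:
    """One coding shift: fresh context, next incomplete feature, then commit.
    run_agent stands in for the harness's actual entry point (assumption)."""
    features = json.loads((workdir / "features.json").read_text())
    todo = [f for f in features if not f["done"]]
    if not todo:
        return False                                  # feature list exhausted
    feature = todo[0]
    summary = run_agent(
        prompt=(f"Implement feature: {feature['name']}. Read claude-progress.txt "
                f"and git log first; update claude-progress.txt when done."))
    feature["done"] = True                            # persist state on disk...
    (workdir / "features.json").write_text(json.dumps(features, indent=2))
    with (workdir / "claude-progress.txt").open("a") as fh:
        fh.write(f"{feature['name']}: {summary}\n")   # ...not in the context window
    if shutil.which("git"):                           # each commit is a durable checkpoint
        subprocess.run(["git", "-C", str(workdir), "commit", "-qam",
                        feature["name"]], check=False, capture_output=True)
    return True
```

The loop terminates when the initializer's feature list is exhausted, which is exactly the "shifts against a shared artifact" structure the post describes.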

Stage three (planner / generator / evaluator): Prithvi Rajasekaran's "Harness design for long-running application development" introduces an explicit evaluator. The problem being solved is self-flattery: "When asked to evaluate work they've produced, agents tend to respond by confidently praising the work — even when, to a human observer, the quality is obviously mediocre." The evaluator navigates live pages via Playwright MCP, grades against four frontend criteria (design quality, originality, craft, functionality), and feeds detailed critique back to the generator. Iteration cycles run five to fifteen times, sometimes lasting four hours. This post also confirms Anthropic added explicit context resets (not just compaction) to address Sonnet 4.5's context anxiety: "Claude Sonnet 4.5 exhibited context anxiety strongly enough that compaction alone wasn't sufficient to enable strong long task performance."
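The control flow of stage three reduces to a short loop. The generate and evaluate callables below are stand-ins for full agent contexts (the real evaluator drives live pages through Playwright MCP); only the four criteria names and the five-to-fifteen cycle budget come from the post.

```python
CRITERIA = ("design quality", "originality", "craft", "functionality")

def improve(generate, evaluate, max_cycles=15):
    """Generator/evaluator loop: a separate evaluator context grades the
    artifact per criterion, and only the critique flows back to the generator."""
    artifact, critique = None, ""
    for cycle in range(1, max_cycles + 1):
        artifact = generate(critique)               # generator never self-grades
        grades = evaluate(artifact)                 # {criterion: (passed, note)}
        failing = {c: note for c, (passed, note) in grades.items() if not passed}
        if not failing:
            return artifact, cycle                  # evaluator signed off
        critique = "\n".join(f"{c}: {note}" for c, note in failing.items())
    return artifact, max_cycles
```

Separating the grader into its own context is what defeats the self-flattery problem: the generator only ever sees critique, never its own glowing self-assessment.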

Then Opus 4.5 shipped and the anxiety vanished. Anthropic's "Scaling Managed Agents: Decoupling the brain from the body" contains the most important paragraph for any engineer building on top of frontier models: "Harnesses encode assumptions that go stale as models improve… in prior work we found that Claude Sonnet 4.5 would wrap up tasks prematurely… We addressed this by adding context resets to the harness. But when we used the same harness on Claude Opus 4.5, we found that the behavior was gone." The exact harness scaffolding that was necessary three months earlier had become dead weight.

Managed Agents is Anthropic's response: a hosted service that virtualizes the three agent primitives — session (an append-only event log), harness (the loop that calls Claude and routes tool calls), and sandbox (the execution substrate). The explicit analogy is operating-system abstractions: "The abstractions outlasted the hardware. The read() command is agnostic as to whether it's accessing a disk pack from the 1970s or a modern SSD." Auth tokens never touch the agent — Git tokens are wired into the sandbox's git remote at init, and OAuth tokens for MCP tools live in a vault that a proxy queries per-session. "The harness is never made aware of any credentials." Session events are fetched by the brain via getEvents() with positional slices, letting the harness replay, rewind, or transform context arbitrarily before it reaches the model. This is the pets-vs-cattle pattern: sessions, harnesses, and sandboxes are disposable; interfaces are durable.
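A toy version of the session primitive shows why positional slices are enough to support replay, rewind, and arbitrary transformation. This is not Anthropic's API: the class and the render_context transform are illustrative sketches of the append-only-log idea.

```python
class Session:
    """Append-only event log: the durable interface between brain and harness."""
    def __init__(self):
        self._events = []

    def append(self, kind, payload):
        self._events.append(
            {"seq": len(self._events), "kind": kind, "payload": payload})

    def get_events(self, start=0, end=None):
        """Positional slice, after the getEvents() idea: the harness may
        replay, rewind, or transform any window before the model sees it."""
        return self._events[start:end]

def render_context(session, window=5, redact=("credential",)):
    """One hypothetical transform: only the newest window, sensitive kinds
    dropped -- the log itself is never mutated."""
    total = len(session.get_events())
    return [e for e in session.get_events(max(0, total - window))
            if e["kind"] not in redact]
```

Because the log is append-only and addressed by position, a replacement harness (or a replacement model) can re-render the same history however it likes, which is the pets-vs-cattle property in miniature.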

Community patterns filled the same need with much less infrastructure. BMAD-METHOD (Breakthrough Method for Agile AI-Driven Development, bmad-code-org/BMAD-METHOD) ships 21 specialized agent personas (PM, Architect, Developer, UX, Scrum Master, and a Test Architect module) with scale-adaptive planning tracks, a "Party Mode" for multi-persona sessions, and a codebase flattener that emits AI-consumable XML. SpecKit (github/spec-kit) codifies spec-driven development with slash commands /constitution, /specify, /clarify, /plan, /tasks, /analyze, /implement and a constitution.md file expressing non-negotiable principles — "All implementation MUST follow strict Test-Driven Development. No implementation code shall be written before: 1. Unit tests are written 2. Tests are validated and approved by the user 3. Tests are confirmed to FAIL (Red phase)." Both sit on top of existing harnesses rather than replacing them.

Reference implementations worth reading line-by-line

Claude Code is now the de facto reference harness. Six permission modes (the docs foreground five; dontAsk is a sixth for locked-down headless use). Thirteen-plus hook events. Settings hierarchy: enterprise managed → project local → project checked-in → user. Protected paths (.git/, .claude/) cannot be written to even with bypassPermissions as of v2.1.78. CLAUDE.md carries project-specific context; skills live in .claude/skills/ (protected since v2.1.81). The architecture is readable as code, not marketing.

OpenAI Codex expresses the same primitives differently. Config is ~/.codex/config.toml with sandbox_mode, approval_policy, [profiles.*], and [mcp_servers.*] blocks. Memory lives at ~/.codex/memories/ (inside writable roots by default). The key harness insight from Lopopolo's post is treating code as context: "So instead of treating AGENTS.md as the encyclopedia, we treat it as the table of contents." Roughly 100 lines of map, with a docs/ directory as the system of record; custom linters whose error messages inject remediation instructions; "golden principles" enforced by background Codex tasks that open auto-merged refactor PRs. The team shipped "roughly 1,500 pull requests… merged with a small team of just three engineers" — 3.5 PRs per engineer per day, first commit August 2025, about a million lines of code five months later. The often-cited phrase "knowledge not in the repo doesn't exist for the agent" does not appear verbatim in Lopopolo's post (flag as conceptual, not literal). The closest verbatim line is the "system of record" framing.
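As a hedged sketch of the shape of that config: the top-level keys and table names below are the ones named above, but the concrete values, the profile name, and the MCP entry layout are assumptions to verify against Codex's own documentation.

```toml
# ~/.codex/config.toml: keys as named above; values are illustrative.
sandbox_mode    = "workspace-write"   # read-only | workspace-write | danger-full-access
approval_policy = "auto"              # auto | read-only | full-access (per the post)

[profiles.locked_down]                # a profile bundles overrides under a name
sandbox_mode    = "read-only"
approval_policy = "read-only"

[mcp_servers.docs]                    # one MCP server entry (shape is an assumption)
command = "npx"
args    = ["-y", "some-docs-mcp-server"]
```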

AGENTS.md has become the interoperability surface. In its own words, it is "stewarded by the Agentic AI Foundation under the Linux Foundation" (formed around Dec 2025; platinum members include AWS, Google, Microsoft, OpenAI, Anthropic, Block, Bloomberg, Cloudflare). Precedence rule: "The closest AGENTS.md to the edited file wins; explicit user chat prompts override everything." The main OpenAI repo has 88 AGENTS.md files. Cursor, Windsurf, Codex, Copilot, Jules, goose, Aider, Zed, Warp, and VS Code all read it.
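The precedence rule is mechanical enough to sketch. collect_agents_md below is a hypothetical helper, not any tool's actual resolver; it returns files ordered root-first so the closest one comes last and wins on conflicts.

```python
from pathlib import Path

def collect_agents_md(edited_file: Path, repo_root: Path) -> list[Path]:
    """Gather AGENTS.md files on the path from repo root down to the edited
    file's directory, closest last. (Explicit user chat prompts would still
    override all of them.)"""
    root = repo_root.resolve()
    d = edited_file.resolve().parent
    chain = []
    while True:
        chain.append(d)
        if d == root or d == d.parent:   # stop at repo root (or filesystem root)
            break
        d = d.parent
    return [c / "AGENTS.md" for c in reversed(chain) if (c / "AGENTS.md").exists()]
```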

OpenHarness (HKUDS/OpenHarness, v0.1.0 April 2026, MIT) is the clearest pedagogical read. The README reports Claude Code at 512,664 LOC vs OpenHarness at 11,733 LOC — 44× lighter — and 1,884 files vs 163, implementing the same conceptual surface. Eleven subsystem directories, 43 tools, 54 commands, and a literal agent-loop pseudocode block (query → stream → tool-call → loop). The tagline is honest: "The model is the agent. The code is the harness."
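That pseudocode block reduces to something like the following sketch; the dict-based message and tool-call shapes are our own choosing, and real implementations stream tokens rather than returning whole replies.

```python
def agent_loop(model, tools, messages, max_turns=50):
    """The loop at the center of every harness: query the model, execute any
    tool calls it returns, append the results, repeat until it answers in text."""
    for _ in range(max_turns):
        reply = model(messages)                        # query (streaming elided)
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply["content"]                    # plain text: the loop ends
        messages.append(reply)
        for call in calls:                             # tool-call
            result = tools[call["name"]](**call["args"])
            messages.append({"role": "tool", "name": call["name"],
                             "content": str(result)})  # feed result back; loop
    raise RuntimeError("max_turns exceeded without a final answer")
```

Everything else in a harness (permissions, hooks, compaction, sandboxing) is interposed somewhere on this loop, which is why the whole surface fits in ~12k lines.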

AutoHarness (aiming-lab/AutoHarness, UNC, April 2026) takes a governance-first angle. It ships three pipeline modes — Core (6 steps), Standard (8), Enhanced (14) — around a YAML "constitution" file, multi-agent profiles (fork / swarm / background), JSONL audit trail, and init presets for default/strict/soc2/hipaa/financial. Note: two unrelated projects share the AutoHarness name — the UNC governance framework and a Google DeepMind arXiv paper (2603.03329) on automatic code-harness synthesis via tree search that eliminated all illegal moves across 145 TextArena games.

LangChain's middleware stack on create_agent (LangChain 1.0) is how the same concepts land in Python. Trivedy reports: "we improved our coding agent Top 30 to Top 5 on Terminal Bench 2.0 by only changing the harness." Cursor's .cursor/rules/*.mdc (MDC = Markdown with frontmatter, with description, globs, alwaysApply metadata) and Windsurf's three-tier rules (Global, Workspace, System-enterprise) plus Memories system all implement the same pattern at different tiers.

Principles, anti-patterns, and the Bitter Lesson

The canonical design principle is Anthropic's two-sentence paragraph from Building Effective Agents (December 2024): "When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed. This might mean not building agentic systems at all." Every subsequent primary source reaffirms it.

The deeper principle, articulated most cleanly in Anthropic's Managed Agents post, is that every harness component encodes an assumption about what the model can't do on its own — and those assumptions rot. Context resets were load-bearing for Sonnet 4.5 and obsolete three months later on Opus 4.5. Rajasekaran's closing paragraph states the corollary directly: "when a new model lands, it is generally good practice to re-examine a harness, stripping away pieces that are no longer load-bearing to performance." This is the Bitter Lesson applied to scaffolding — any clever control flow you wrote will eventually be rendered obsolete by a better model, and your job is to make it cheap to delete.

Anti-patterns follow from this principle. Monolithic instruction files that rot — Lopopolo's Codex team encoded this directly as the "table of contents" pattern. Pet containers that developers bond with — Managed Agents solves this by making sandboxes ephemeral. Over-constraining agents — Vercel's d0 team removed 15 of 16 tools and got better results. "Smart" orchestration that the next model makes unnecessary — Cognition rebuilt Devin for Sonnet 4.5 with simpler parallelism patterns and saw 18% planning improvement. Phil Schmid names the emerging bottleneck explicitly: "We see a new bottleneck being context durability" — the ability of a model to follow instructions across hundreds of tool calls over time — and predicts that harnesses will become the primary instrument for detecting model drift and feeding it back into training.

The consolidating academic view is the Preprints position paper "Harness Engineering for Language Agents" (He et al., March 2026), which proposes a Control / Agency / Runtime (CAR) decomposition and a HarnessCard reporting artifact analogous to Model Cards, with required fields: base model, control artifacts, runtime policy, action substrate, feedback stack, governance layer, evaluation protocol. The authors argue "many reported agent gains may be partly harness-sensitive rather than purely model-driven" — which is the academic way of saying every published benchmark number is partially a harness score.

Quantitative evidence that the harness is the lever

The numbers are now overwhelming enough to displace model-centric reasoning for production systems.

| Claim | Primary source | Number |
| --- | --- | --- |
| Infrastructure swings agentic benchmarks | Anthropic Engineering landing page | Several percentage points, sometimes exceeding the leaderboard gap between top models |
| Harness alone on fixed LLM | Stanford/MIT/KRAFTON Meta-Harness (arXiv 2603.28052) | 6× performance gap on the same benchmark |
| Vercel d0 tool reduction | vercel.com/blog (Dec 2025) | Success 80% → 100%; tokens 145,463 → 67,483; steps 100 → 19; latency 724s → 141s |
| LangChain Terminal Bench 2.0 | LangChain blog (Mar 2026) | Top 30 → Top 5 on the same model, harness changes only |
| DeepMind AutoHarness on TextArena | arXiv 2603.03329 | Illegal moves eliminated across 145 games; Gemini-2.5-Flash + AutoHarness beats Gemini-2.5-Pro |
| ContextCov executable checks | arXiv 2603.00822 | 46,000+ executable checks extracted from 723 repos, 99.997% syntax validity |
| Codex harness productivity | OpenAI (Feb 2026) | ~1,500 merged PRs, 3 engineers, 5 months, ~1M LOC |
| Claude Sonnet 4.5 SWE-bench | Zvi analysis / anthropic.com/news | Epoch scaffold 65% vs swebench.com scaffold 70.6%: 5.6 points on scaffold alone |

The Meta-Harness paper (Stanford/MIT/KRAFTON) adds a finding that should change how evaluations are run: on Terminal Bench 2, discovered harnesses surpassed the best hand-engineered Claude Haiku 4.5 baselines, and their agentic proposer reads up to ~10M tokens of prior candidate source code, scores, and traces per iteration — effectively treating harness code itself as the object to optimize. HumanLayer's blog summarizes the implication in one sentence that belongs on a whiteboard: "most agent failures are configuration problems, not model limitations."

What changes when you internalize the harness

The practical consequences for senior ML engineers and engineering leaders are concrete and immediate. First, benchmark numbers are partially harness scores. Sonnet 4.5 publishes at 65% or 70.6% on SWE-bench Verified depending on scaffold — that 5.6-point delta is larger than most model releases. Treat every published score as a lower bound on the model's capability and an upper bound on your current scaffolding. Second, harnesses are liabilities as well as assets. Every component you add is a standing assumption about the model's current weaknesses; when the next model lands, audit which components are still load-bearing and delete the rest. Rajasekaran's "re-examine a harness, stripping away pieces that are no longer load-bearing" is now table stakes. Third, the filesystem, git, and a text file named progress.txt outperform most orchestration frameworks for sessions longer than a single context window — a pattern so consistent across Manus, Anthropic, Vercel, Huntley's Ralph, and OpenHarness that it should be the default, not the exception. Fourth, the harness is the observability surface — permissions, hooks, sessions, and event logs are where you get cost tracking, audit, drift detection, and replayability.

The deepest lesson is the one Anthropic formalized in Managed Agents: abstractions outlast implementations. Sessions, harnesses, and sandboxes should be as stable as read() and write() were for the Unix file abstraction — so that the particular harness you're running today can be thrown away the day a better model ships, without taking your product down with it. The model is going to keep changing. Build the car so you can swap the engine.