
The Hands — Tool Ergonomics and the Agent-Computer Interface

Part 3 of the Engineering the Agentic Stack series

Part 1 covered reasoning loops, Part 2 covered memory. This post is about tools — how agents interact with the world, and why tool design matters more than most teams realize.

The tool landscape shifted dramatically in 2025–2026. MCP won the standards war — but production teams are discovering its security gaps and token overhead the hard way. The alternative gaining traction: agents that write and execute code instead of calling JSON-defined tools, achieving 98.7% token reductions and 20% higher task success rates.

This post covers all five tool modalities, the ACI design principles that govern them, and practical patterns I apply in the Market Analyst Agent.

TL;DR: Five tool modalities exist, each with a sweet spot. The practical 2026 stack: skills for expertise, CLI as default, MCP for external services, code execution for multi-step orchestration. ACI design principles from SWE-agent govern all of them.


Five Ways an Agent Can Use Tools

In Part 1, I showed how the cognitive engine decides how to think. Tool modalities decide how to act. There are now five distinct approaches, each with different trade-offs in token cost, flexibility, and security.

Tool Modalities Landscape

1. JSON Tool Calling — The Baseline

The original pattern: you define tool schemas as JSON, the LLM emits structured function calls, your code executes them. Simple, well-understood, and adequate for small toolsets.

# Traditional tool definition — each tool consumes ~550-1,400 tokens (Apideck benchmark)
tools = [
    {
        "name": "get_stock_price",
        "description": "Get the current stock price for a ticker symbol",
        "input_schema": {
            "type": "object",
            "properties": {
                "ticker": {"type": "string", "description": "Stock ticker (e.g., NVDA)"}
            },
            "required": ["ticker"]
        }
    }
]

With 5-10 tools, this works fine. The problem is scale: each tool definition costs 550-1,400 tokens. At 20 tools you're spending 15-25K tokens before the agent even starts reasoning.
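On the executor side, JSON tool calling is just a dispatch layer. A minimal sketch, with a hypothetical handler and stubbed data: the model emits a name/input pair, our code routes it to the matching function, and the result goes back to the loop as a JSON string.

```python
import json


def get_stock_price(ticker: str) -> dict:
    # Stub implementation for illustration; a real tool would call an API.
    return {"ticker": ticker, "price": 123.45}


HANDLERS = {"get_stock_price": get_stock_price}


def dispatch(tool_call: dict) -> str:
    """Route a model-emitted tool call to its handler.

    Returns a JSON string so the result can be appended to the
    conversation as a tool-result message. Errors are returned to
    the model rather than crashing the loop.
    """
    handler = HANDLERS.get(tool_call["name"])
    if handler is None:
        return json.dumps({"error": f"unknown tool: {tool_call['name']}"})
    try:
        return json.dumps(handler(**tool_call["input"]))
    except Exception as exc:  # surface the failure to the agent
        return json.dumps({"error": str(exc)})


print(dispatch({"name": "get_stock_price", "input": {"ticker": "NVDA"}}))
```

Returning errors as data, rather than raising, is what lets the agent reason about a failed call and try again.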

2. MCP — The Universal Standard (With Caveats)

The Model Context Protocol won the governance war. Anthropic donated it to the Linux Foundation under the Agentic AI Foundation (AAIF) in December 2025, co-founded with OpenAI and Block, and backed by Google, Microsoft, and AWS as platinum members. OpenAI adopted MCP support in its Responses API. There are now 10,000+ active MCP servers and 97M+ monthly SDK downloads.

Where MCP wins: Cross-vendor SaaS integration (Figma, Notion, Salesforce), enterprise governance with audit trails, services without CLI equivalents, and environments requiring OAuth orchestration. For these use cases, MCP is the clear choice.

But the production story has sharp edges.

The security surface is alarming. The Vulnerable MCP Project tracks 50 vulnerabilities across MCP servers, 13 rated Critical, documented by 32 security researchers. The attack classes span prompt injection, input validation failures, authentication gaps, and network security holes. The first real-world malicious MCP server appeared in September 2025 — a package called "postmark-mcp" that BCC'd every outgoing email to an attacker's address, affecting an estimated 300–500 organizations before discovery.

Tool poisoning is perhaps the most insidious attack class. Invariant Labs demonstrated that poisoned MCP tools can exfiltrate data even when never invoked — the model merely reading the tool's metadata is sufficient to trigger the attack. MCPTox benchmarks testing 20 LLM agents against 45 real-world MCP servers found attack success rates as high as 72.8%.

Token overhead is the operational concern. One team running MCP servers for GitHub, Slack, and Sentry (~40 tools total) found 55,000 tokens of schema definitions injected before a user asks anything. Another reported burning 143,000 of 200,000 available tokens (72%) on tool definitions alone.

Token Overhead Comparison

The mitigations are real but confirm the fundamental issue. Anthropic's Tool Search Tool reduces schema overhead by 85%. Dynamic tool loading (3-5 relevant tools per task) achieves 96% reduction. But these are workarounds for a protocol that wasn't designed for token-constrained contexts.
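The dynamic-loading mitigation can be sketched in a few lines. Given a list of schemas like the one in the first section, rank them against the user's query and inject only the top k; naive keyword overlap stands in here for the embedding similarity a production system would use.

```python
def select_tools(query: str, tools: list[dict], k: int = 3) -> list[dict]:
    """Return the k tool schemas most relevant to the query.

    Illustrative stand-in for dynamic tool loading: rank by keyword
    overlap between the query and each tool's description. Production
    systems use embedding similarity over an indexed tool catalog.
    """
    query_words = set(query.lower().split())

    def overlap(tool: dict) -> int:
        return len(query_words & set(tool["description"].lower().split()))

    return sorted(tools, key=overlap, reverse=True)[:k]
```

Only the selected schemas enter the context, so overhead scales with k rather than with the full catalog size.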

3. Skills (SKILL.md) — Expertise, Not Execution

A key development in late 2025 was the standardization of agent skills as an open format (launched October 2025, published as an open standard in December 2025). The distinction is fundamental: tools provide capabilities (what agents can do), while skills provide expertise (what agents know about how to accomplish complex tasks).

The SKILL.md standard defines a skill as a markdown file with YAML frontmatter:

---
name: deploy
description: Deploy the application to production
argument-hint: "[environment]"
user-invocable: true
---
Deploy the application to the $0 environment (default: staging).
Steps:
1. Run the test suite
2. Build the production bundle
3. Deploy using the deploy script
4. Verify the deployment health check

Skills use progressive disclosure: only metadata (~100 tokens for the YAML frontmatter) loads at startup; full instructions load only when activated. Compare this to MCP, where ~40 tools across typical services consume ~55,000 tokens before any reasoning begins. Cross-platform adoption as of March 2026 includes Claude Code, OpenAI Codex CLI, Cursor, GitHub Copilot, Gemini CLI, Goose, Windsurf, and Roo Code — a trend Victor Dibia explored in his analysis of the shift from tools back to code-driven agent actions.

When skills win: Domain expertise transfer, complex multi-step procedures, recurring workflows like database migrations or payment integrations, and any task where the agent needs to know how rather than just what.
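Progressive disclosure is easy to picture in code. A minimal frontmatter-only parser (flat key/value pairs only, for illustration; real loaders use a YAML library) reads just the metadata block and leaves the instruction body untouched until the skill activates.

```python
def skill_metadata(skill_text: str) -> dict:
    """Parse only the YAML frontmatter of a SKILL.md file.

    This is the cheap part (~100 tokens) loaded at startup; the
    instruction body after the closing '---' stays on disk until
    the skill is activated.
    """
    if not skill_text.startswith("---"):
        return {}
    frontmatter, _, _body = skill_text[3:].partition("\n---")
    meta = {}
    for line in frontmatter.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip().strip('"')
    return meta


deploy_skill = """---
name: deploy
description: Deploy the application to production
user-invocable: true
---
Deploy the application to the target environment.
"""
print(skill_metadata(deploy_skill)["name"])  # prints "deploy"
```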

4. CLI/Bash — The Unix Renaissance

The Unix terminal has re-emerged as the most token-efficient agent interface. The math is stark: a CLI tool manifest costs approximately 100 tokens, while the equivalent MCP server schemas consume 50,000+ tokens. In end-to-end task runs, Scalekit measured a 4-32x token difference across 75 benchmark runs (32x for tasks like repo language/license lookup).

Models are native CLI speakers. Training data includes billions of terminal interactions from Stack Overflow, GitHub repos, and documentation. They natively understand git, docker, kubectl, gh, curl, and jq without any schema definitions.

Ugo Enyioha's guide "Writing CLI Tools That AI Agents Actually Want to Use" codified eight design rules:

  1. Structured output is mandatory — support --json
  2. Exit codes are control flow — use distinct codes for different error types
  3. Commands should be idempotent
  4. Self-documenting --help with realistic examples
  5. Design for composability: --quiet for bare values, stdin support
  6. Provide --dry-run and --yes flags
  7. Support version introspection
  8. Handle auth via environment variables
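Several of these rules fit in a toy CLI. The sketch below (a hypothetical quote command with stubbed price data) shows structured output via --json, a bare value via --quiet for composability, and a distinct exit code for the not-found case.

```python
import argparse
import json
import sys


def main(argv=None) -> int:
    """Toy agent-friendly CLI: structured output, distinct exit codes."""
    parser = argparse.ArgumentParser(prog="quote")
    parser.add_argument("ticker")
    parser.add_argument("--json", action="store_true", help="structured output")
    parser.add_argument("--quiet", action="store_true", help="bare value only")
    args = parser.parse_args(argv)

    prices = {"NVDA": 123.45}  # stub data source for illustration
    price = prices.get(args.ticker.upper())
    if price is None:
        print(f"error: unknown ticker {args.ticker}", file=sys.stderr)
        return 2  # distinct exit code: not found (vs 1 for general failure)
    if args.json:
        print(json.dumps({"ticker": args.ticker.upper(), "price": price}))
    elif args.quiet:
        print(price)  # bare value, pipeable into other commands
    else:
        print(f"{args.ticker.upper()}: ${price:.2f}")
    return 0


main(["NVDA", "--quiet"])  # prints the bare price: 123.45
```

An agent can branch on the exit code without parsing prose, and pipe the --quiet output straight into the next command.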

The trade-offs are real. CLI lacks MCP's type safety, built-in OAuth orchestration, tool discovery, and audit trail. The emerging consensus: CLI as default for development and local operations, MCP for external service integration and enterprise governance.

5. Code Execution — The Big Shift

This is the most consequential architectural change in agent tooling. Instead of the LLM emitting structured JSON to invoke predefined functions one at a time, the agent writes a complete Python or bash script that calls multiple tools, processes results with loops and conditionals, and returns only final summaries to the model context.

Anthropic formalized this with Programmatic Tool Calling (PTC), now GA in the Claude API. The academic foundation comes from the CodeAct paper (Wang et al., ICML 2024), which tested across 17 LLMs and found code actions achieved up to 20% higher task success rates and 30% fewer steps than JSON alternatives.

Code Execution Flow

The production evidence is compelling:

  • Vercel rebuilt their d0 text-to-SQL agent by removing 80% of its tools (from 15 to 2: ExecuteCommand and ExecuteSQL). Task success jumped from 80% to 100%, execution time dropped 3.5x, and token usage fell 40%. Their key insight: "The best agents might be the ones with the fewest tools."

  • Cloudflare developed "Code Mode," letting agents write TypeScript to call their API rather than defining tool schemas — dramatically reducing context overhead. Their rationale: "LLMs have an enormous amount of real-world TypeScript in their training set, but only a small set of contrived examples of tool calls."

Here's the pattern from Anthropic's PTC documentation. Traditional tool calling for expense analysis requires 20+ separate inference passes — one per team member — with all intermediate data flowing through context. A workflow consuming ~150,000 tokens with direct tool calling was reimplemented for ~2,000 tokens — a 98.7% reduction. With code execution, the agent writes a single script:

# Agent generates this code, executes in sandbox
import json
members = get_team_members("engineering")
over_budget = []
for m in members:
    expenses = get_expenses(m["id"], "Q3")
    total = sum(e["amount"] for e in expenses)
    if total > 5000:
        custom = get_custom_budget(m["id"])
        limit = custom["limit"] if custom else 5000
        if total > limit:
            over_budget.append({"name": m["name"], "spent": total, "limit": limit})
# Only this final summary returns to the LLM context
print(json.dumps(over_budget))

The LLM sees only the final JSON summary — not the thousands of expense line items processed in the sandbox. This wins on every dimension: token efficiency, composability (native loops/conditionals), error handling (try/except vs. natural language reasoning about errors), and privacy (sensitive data stays in the sandbox).

When JSON tool calling still makes sense: Single atomic operations, when sandboxing infrastructure is unavailable, with smaller models that have weak code generation, or when auditability of every individual tool invocation is required.


The Comparison Table

| Dimension | JSON Tool Calling | MCP | Skills (SKILL.md) | CLI/Bash | Code Execution (PTC) |
|---|---|---|---|---|---|
| Best for | Simple, single actions | Cross-vendor SaaS | Domain expertise | Dev workflows, local ops | Multi-step orchestration |
| Token overhead | Medium (schemas per request) | Very high (550-1,400/tool) | Very low (~100 tokens) | Near zero | Low (2 meta-tools) |
| Task success rate | Baseline | Similar to baseline | N/A (expertise layer) | Higher for CLI-native tasks | +20% vs baseline |
| Composability | Low (sequential) | Low (sequential) | High (procedural knowledge) | High (pipes, chaining) | Very high (native code) |
| Security surface | Moderate | High (50+ CVEs) | Low (prompt-based) | High (shell access) | High (needs sandboxing) |
| Setup complexity | Low | Medium (server deployment) | Very low (markdown) | Very low (existing CLIs) | Medium (sandbox infra) |
| Latency per action | 1 inference/call | 1 inference + transport | 0 (context injection) | 1 inference/call | 1 pass for N calls |
| Debugging | Good (structured I/O) | Moderate (transport layer) | Good (readable markdown) | Excellent (visible) | Good (readable code) |

Composability — how easily multiple operations chain into a larger workflow. Low means each tool call requires a separate LLM round-trip; high means the modality natively supports chaining results (Unix pipes, code variables, procedural steps).


The Agent-Computer Interface (ACI)

The term "Agent-Computer Interface" (ACI) was coined by John Yang, Carlos E. Jimenez, and colleagues at Princeton in their SWE-agent paper (NeurIPS 2024). The core insight: just as humans benefit from well-designed interfaces (HCI), LM agents are "a new category of end users with their own needs and abilities, and would benefit from specially-built interfaces."

The ablation results proved this empirically: SWE-agent with its full ACI achieved 18.0% on SWE-bench Lite versus 7.3% with only a standard Linux shell — a 10.7 percentage point improvement solely from interface design. The linting guardrail alone accounted for 3 percentage points: 51.7% of agent edits had at least one error caught by the linter before it could propagate.

ACI Design Principles

Anthropic adopted ACI as a foundational concept in their "Building Effective Agents" guide, listing it as one of three core principles: "Carefully craft your agent-computer interface through thorough tool documentation and testing." Their practical guidance: "One rule of thumb is to think about how much effort goes into human-computer interfaces, and plan to invest just as much effort in creating good agent-computer interfaces."

The Four Principles in Practice

1. Actions should be simple and easy to understand. The most common mistake is wrapping API endpoints one-to-one. Instead of list_users, list_events, create_event, implement schedule_event that finds availability and schedules in one call. Instead of read_logs, implement search_logs that returns only relevant lines with context.

2. Actions should be compact and efficient. Consolidate important operations into as few actions as possible. In the Market Analyst Agent, I combine price fetching with basic metrics into a single get_stock_snapshot tool rather than requiring separate calls for price, volume, market cap, and PE ratio.

3. Environment feedback should be informative but concise. Avoid returning raw HTML or full API payloads. Resolve cryptic IDs to semantic names. Anthropic's testing showed that a response_format enum, letting agents request either concise (~72 tokens) or detailed (~206 tokens) responses (a 3x token cost difference), significantly improved real-world performance.
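The enum pattern is straightforward to apply. A minimal sketch with stubbed data and hypothetical field names:

```python
from enum import Enum


class ResponseFormat(str, Enum):
    CONCISE = "concise"
    DETAILED = "detailed"


def get_quote(ticker: str, fmt: ResponseFormat = ResponseFormat.CONCISE) -> dict:
    """Let the agent choose its token budget per call.

    Stubbed data and illustrative field names; the point is the
    concise/detailed split, not the quote schema.
    """
    data = {"ticker": ticker, "price": 123.45, "change_pct": 1.8,
            "volume": 1_000_000, "market_cap_b": 3000.0}
    if fmt is ResponseFormat.CONCISE:
        # Small shape: one ready-to-use line for routine checks
        return {"summary": f"{ticker} at $123.45 (up 1.8%)"}
    return data  # full metrics, only when the agent asks for them
```

Defaulting to concise means the agent pays the detailed price only when it explicitly needs the extra fields.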

4. Guardrails should mitigate error propagation. Automatic error detection helps agents recognize and correct mistakes quickly. In SWE-agent, a custom file editor with integrated linting automatically rejects syntax errors — catching 51.7% of agent edit errors before they could compound. I apply this same principle in the Market Analyst Agent by validating tool arguments with Pydantic schemas before execution:

from pydantic import BaseModel, Field, field_validator

class StockQuery(BaseModel):
    """Validated input for stock queries.

    Pydantic catches malformed tickers before the API call,
    preventing error propagation through the reasoning loop.
    """
    ticker: str = Field(description="Stock ticker symbol (e.g., NVDA)")
    period: str = Field(default="1mo", description="Time period: 1d, 5d, 1mo, 3mo, 1y")

    @field_validator("ticker")
    @classmethod
    def validate_ticker(cls, v: str) -> str:
        v = v.upper().strip()
        if not v.isalpha() or len(v) > 5:
            raise ValueError(f"Invalid ticker format: {v}")
        return v

    @field_validator("period")
    @classmethod
    def validate_period(cls, v: str) -> str:
        valid = {"1d", "5d", "1mo", "3mo", "6mo", "1y", "5y"}
        if v not in valid:
            raise ValueError(f"Invalid period: {v}. Must be one of {valid}")
        return v

Tool Design Patterns That Work

Anthropic's "Writing effective tools for agents" guide frames tools as "a new kind of software reflecting a contract between deterministic systems and non-deterministic agents." Here are the patterns that have crystallized from production experience.

Treat Tool Descriptions as Prompt Engineering

Descriptions should be 3-4+ sentences minimum, explaining when to use the tool, required versus optional parameters, output format, and edge cases. Namespacing with prefixes (asana_search, jira_search) has "non-trivial effects" on tool selection accuracy. In Anthropic's experiments, Claude-optimized tool descriptions outperformed expert human-written ones on evaluation benchmarks.

# Bad: vague, no context for when to use
tools = [{
    "name": "search",
    "description": "Search for items",
}]

# Good: specific, with input examples and edge cases
tools = [{
    "name": "stock_search_news",
    "description": (
        "Search for recent news articles about a specific stock or company. "
        "Use this tool when the user asks about recent events, earnings, "
        "announcements, or market-moving news for a specific ticker. "
        "Returns up to 10 articles sorted by relevance. "
        "For broad market news (not ticker-specific), use market_overview instead."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Search query. Examples: 'NVDA earnings Q3 2025', 'Tesla delivery numbers'"
            },
            "max_results": {
                "type": "integer",
                "description": "Max articles to return (1-10, default 5)",
                "default": 5
            }
        },
        "required": ["query"]
    }
}]

Internal testing showed input examples (input_examples field) substantially improved accuracy on complex parameter handling.

Return High-Signal, Machine-Readable Output

Avoid low-level identifiers (uuid, mime_type). Resolve cryptic IDs to semantic names. Structure the response so the agent can reason about it without parsing boilerplate:

# Bad: raw API response dumped to agent
def get_stock_price(ticker: str) -> dict:
    response = api.get(f"/v1/quotes/{ticker}")
    return response.json()  # 500+ tokens of nested JSON

# Good: high-signal summary the agent can immediately reason about
def get_stock_price(ticker: str) -> dict:
    data = api.get(f"/v1/quotes/{ticker}").json()
    return {
        "ticker": ticker,
        "price": data["regularMarketPrice"],
        "change_pct": round(data["regularMarketChangePercent"], 2),
        "volume": data["regularMarketVolume"],
        "market_cap_b": round(data["marketCap"] / 1e9, 1),
        "pe_ratio": data.get("trailingPE"),
        "summary": f"{ticker} at ${data['regularMarketPrice']:.2f} "
                   f"({'up' if data['regularMarketChangePercent'] > 0 else 'down'} "
                   f"{abs(data['regularMarketChangePercent']):.1f}%)"
    }

Design Error Handling for Resilience

The proven production pattern is four-layer fault tolerance:

  1. Retry with exponential backoff for transient errors
  2. Model fallback chains for provider outages
  3. Error classification routing — transient errors retry, LLM-recoverable errors return to the agent with context, human-required errors escalate
  4. Checkpoint recovery for crash survival

Teams report unrecoverable failures dropping from 23% to under 2% with this four-layer approach, as documented in Anthropic's tool resilience patterns.

import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
)
def call_stock_api(ticker: str) -> dict:
    """Fetch stock data with automatic retry on transient failures.

    Layer 1: Exponential backoff handles rate limits and network blips.
    If all retries fail, the error propagates to the agent with
    enough context to decide whether to try a different approach.
    """
    response = httpx.get(
        f"https://api.example.com/v1/quotes/{ticker}",
        timeout=10.0,
    )
    response.raise_for_status()
    return response.json()
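Layer 3, error classification routing, can be sketched as a mapping from exception type to recovery route. The mapping below is illustrative, not Anthropic's; a real classifier would inspect HTTP status codes and provider-specific error types.

```python
from enum import Enum


class Route(Enum):
    RETRY = "retry"  # transient: hand back to the backoff layer
    AGENT = "agent"  # recoverable: return to the agent with context
    HUMAN = "human"  # unrecoverable: escalate for human intervention


def classify(exc: Exception) -> Route:
    """Decide how a failed tool call should be handled.

    Timeouts and connection drops are worth retrying; bad arguments
    are something the agent itself can correct; everything else
    escalates rather than looping forever.
    """
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return Route.RETRY
    if isinstance(exc, (ValueError, KeyError)):
        return Route.AGENT
    return Route.HUMAN
```

The orchestrator switches on the route: RETRY re-enters the backoff decorator, AGENT serializes the error into the tool result, HUMAN pauses the run.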

Applying These Patterns: The Market Analyst Agent

Let me show how these principles come together in the Market Analyst Agent from Part 1.

Tool Consolidation

The original design had 10+ tools: get_stock_price, get_company_metrics, get_market_cap, get_pe_ratio, get_volume, and so on. Each was a thin wrapper around one API endpoint. The agent had to reason about which combination to call for every request.

I consolidated these into 5 high-level tools following the ACI principle of compact, efficient actions:

| Before (10+ tools) | After (5 tools) | Why |
|---|---|---|
| get_stock_price + get_company_metrics + get_pe_ratio | get_stock_snapshot | Single call returns everything needed for basic analysis |
| get_price_history + get_volume_history | get_price_history | Combined with configurable period and indicators |
| search_news + search_press_releases | search_news | Unified search with source filtering |
| search_competitors + get_sector_data | search_competitors | Returns competitors with relative metrics |
| get_financials + get_balance_sheet + get_cash_flow | get_financials | Unified with statement_type parameter |

This reduced the tool schema overhead significantly while making the agent's tool selection more reliable because there are fewer ambiguous choices.

Structured Outputs for Tool Results

Every tool in the Market Analyst Agent returns a Pydantic-validated response. This applies the ACI guardrails principle at the tool boundary:

class StockSnapshot(BaseModel):
    """Structured tool response — the agent never sees raw API noise."""
    ticker: str
    price: float
    change_pct: float
    volume: int
    market_cap_b: float
    pe_ratio: float | None
    summary: str  # Human-readable one-liner for direct use in reports

class NewsResult(BaseModel):
    """Each news item is pre-processed for agent consumption."""
    headline: str
    source: str
    date: str
    relevance_score: float  # Pre-ranked so the agent doesn't waste tokens sorting
    key_points: list[str]  # Extracted by the tool, not the agent

The summary field is particularly important — it gives the agent a ready-to-use string that can go directly into a report without additional processing. The key_points in NewsResult are extracted server-side, saving the agent from spending inference tokens parsing article bodies.


Trade-offs and Considerations

Beyond the modality-specific caveats above, there are cross-cutting concerns that affect your choice:

  • Operational cost varies by dimension. Code execution saves tokens but adds sandbox cold-start latency. MCP saves development time for SaaS integrations but adds server deployment overhead. CLI is free to start but harder to govern at scale. Optimize for your bottleneck — token cost, latency, or operational complexity.

  • Team skills matter. Code execution assumes your agents (and the models behind them) can generate reliable Python/TypeScript. CLI assumes familiarity with Unix conventions. MCP requires understanding transport protocols and OAuth flows. Match the modality to your team's strengths.

  • Tool consolidation can go too far. If you merge everything into one mega-tool, the parameter space becomes too complex for the agent to navigate. The sweet spot is 5-10 well-scoped tools for most agent systems.

  • Skills are prompt-based, not enforced. A skill is instructions the agent should follow, not guardrails it must follow. For critical workflows, combine skills with deterministic validation.

  • Audit requirements shape the choice. If you need a complete log of every tool invocation and its result, MCP and JSON tool calling provide structured audit trails out of the box. Code execution produces a script and its output — useful, but harder to decompose into individual actions for compliance reporting.


Where Tool Ergonomics Is Heading

Three trends are worth watching:

Tool RAG for scaling. As toolsets grow to hundreds or thousands, naive tool selection accuracy degrades to 13.62%. The RAG-MCP paper showed that applying retrieval-augmented generation to tool selection — indexing tool descriptions in a vector database and retrieving only relevant tools per query — achieves 43.13% accuracy (a 3.2x improvement) while cutting prompt tokens by ~50%.

Agents creating their own tools. The LATM framework ("LLMs As Tool Makers") established a two-phase paradigm where a powerful LLM creates reusable Python functions and a lightweight LLM uses them. ToolMaker (ACL 2025) autonomously transforms GitHub repositories into LLM-compatible tools with an 80% success rate. The frontier is moving from tool use to tool creation to tool library management.

The A2A + MCP dual-protocol stack. Google's Agent2Agent Protocol (A2A) addresses what MCP cannot: agent-to-agent communication. MCP handles agent-to-tool integration; A2A handles agent-to-agent discovery, negotiation, and task delegation. The combined formula: build with your framework, equip with MCP, communicate with A2A.


Key Takeaways

  1. The tool landscape has five distinct modalities, each with a clear sweet spot. Don't default to MCP for everything — match the modality to the task.

  2. Code execution replacing JSON tool calling is the biggest shift in agent tooling. Anthropic's PTC, Vercel's tool reduction, and Cloudflare's Code Mode all converge on the same insight: let agents write code instead of calling schemas.

  3. Token overhead is the operational bottleneck. 55K+ tokens for ~40 MCP tools versus ~100 tokens for CLI manifests is not a marginal difference — it's the difference between an agent having 65% versus 95% of its context for reasoning.

  4. ACI design principles matter more than the protocol. SWE-agent proved that interface design alone accounts for 10.7 percentage points on benchmarks. Simple actions, concise feedback, and built-in guardrails apply regardless of whether you use MCP, CLI, or code execution.

  5. Consolidate tools aggressively. Five well-designed tools outperform twenty granular ones. Every tool is a choice you're forcing the model to make — reduce the decision space.

  6. Treat tool descriptions as prompt engineering. Input examples substantially improve accuracy. Claude-optimized descriptions outperform expert human-written ones. Invest the same effort in tool UX as you would in human UX.

  7. The practical 2026 stack: Skills + CLI + MCP + Code Execution, layered by use case. Skills for expertise, CLI for local ops, MCP for external services, code execution for multi-step orchestration.


What's Next

In Part 4, I'll cover safety layers and the Guardian Pattern — how to build guardrails that prevent your agent from taking destructive actions, hallucinating tool calls, or leaking sensitive data through tool responses.


Demo Project


The complete Market Analyst Agent code, including the tool designs described in this post, is available on GitHub. Star the repo and follow along as I build the full production stack.

Series: Engineering the Agentic Stack

  • Part 1: The Cognitive Engine — Choosing the right reasoning loop
  • Part 2: The Cortex — Architecting memory for AI agents
  • Part 3: The Hands (this post) — Tool ergonomics and the ACI
  • Part 4: Safety Layers — The Guardian Pattern (coming soon)
  • Part 5: Production Deployment