Skip to content

Substack Announcement: The Guardians — Why Agent Security Is Not LLM Safety

Subject Line Options

  1. Six Incidents. Zero Content Filter Alerts.
  2. LLM Guardrails Don't Guard Agents
  3. The Security Layer Below the Model

Post Body

Edge of Context covers practical AI engineering — what you're actually dealing with when you build systems, not what the benchmarks measure.


In 2024, we shipped guardrails. NeMo, Bedrock, Lakera, the whole content-filter category — wrapping model inputs and outputs and asking: is the model saying the right thing?

Then we gave the model a tool loop, a filesystem, a shell, and the authority to act. The threat model changed. The guardrails didn't notice. (The guardrails were, as far as anyone can tell, still watching the tokens.)

Six serious incidents in eighteen months — EchoLeak, the Amazon Q Developer compromise, the Azure MCP Server disclosure, Claude Code CVE-2025-59536, the axios 1.14.1 remote access trojan, and the Trivy Actions tag hijack — and not one of them tripped a content filter. Because none of them went through the model. The attacker talked to the tool, the hook, the npm registry, or the CI tag.

TL;DR: Content filters watch what the model says. Agent guardians watch what the system tries to do. Every major 2025–2026 incident exploited the loop, not the generation stream. The 2026 policy stack has six layers: permission ladders, pre-tool hooks, OS sandboxes, human-in-the-loop interrupts, audience-bound MCP tokens, and the OWASP ASI Top 10 as a threat map.

What Part 4 covers:

  • 93% of agent permission prompts get clicked through. That's not oversight — that's approval fatigue with a logging layer on top.
  • 8 of the 10 OWASP ASI categories are harness concerns. Your content filter covers the other two.
  • The lethal trifecta isn't a corner case. Private data access + untrusted content + outbound messaging is the default shape of every useful corporate agent.
  • Audience-bound MCP tokens (RFC 8707) are what actually closed CVE-2025-59536-class attacks. A token scoped to one server can't be replayed elsewhere — small cryptographic fact, large policy hole closed.

The full post walks through all six incidents, maps each to the OWASP ASI Top 10, and builds a four-layer guardian stack for the Market Analyst Agent from Part 1 — deny-list hook, input canary, output validator, and an interrupt gate on anything outbound. There's also an honest section on what that stack still doesn't cover, because every layer has a floor.

[Read Part 4: The Guardians]


Image Suggestions

Header image (1456 × 816 px): Export the llm-vs-agent-guardian.svg from the post — it shows the key architectural distinction (guardrail wrapping a model call vs. guardian wrapping the full loop) and reads clearly as a thumbnail.

Inline image 1: The lethal-trifecta.svg — three overlapping circles showing private data, untrusted content, and external communication. Works well after the TL;DR section to ground the concept visually.

Inline image 2: The owasp-asi-top-10.svg — table of the ten categories, useful just before the "what this stack doesn't cover" section. Shows readers where the remaining gaps are.