AI Agent Memory: Schema-Guided Typed State for Long-Running Systems

Agents rarely fail because they forgot everything. They fail because they remembered an old fact and treated it as current state.

Imagine a long-running assistant serving a user named Mira. In one episode, Mira says her passport deadline is July 15. Later, she corrects it to June 30. A vector hit can find the old note. It cannot know, by itself, which deadline is current. That is the difference between recall and memory.

TL;DR: Treat durable agent memory as typed application state. Extract memory candidates through a structured-output boundary, store records with tenant scope, validity windows, supersession, provenance, and schema versioning, then retrieve the smallest current slice on the read path. Vector search can stay in the system. It should not be the authority for mutable facts.

The context-window trap

The context window is a working set. It is not durable memory, and it is not a database.

Long-running agents need to remember preferences, task status, customer facts, tool decisions, compliance notes, and prior mistakes. The easy version is to append summaries or dump old notes into a vector store. That works until one of the remembered facts changes.

Now the agent has two passport deadlines, two preferred formats, or two project decisions. Semantic search may retrieve both. A summary may overwrite one. A long context may include the stale one next to the active one. None of those behaviors gives you a memory contract.

The contract has to answer concrete questions:

What is true now?
What was true on June 2?
Who said it?
Which tenant does it belong to?
Which older fact did this fact replace?
Can I delete or expire it?

That is where Schema-Guided Agent Memory (SGAM) starts.

What SGAM means

The name needs a short cleanup before the architecture makes sense.

Schema-Guided Dialogue (SGD) is the 2019 Google task-oriented dialogue dataset. Its schema describes service APIs, intents, and slots so a dialogue model can track state for services it has not seen before. Useful precedent, different layer.

Schema-Guided Memory (SGM) is the research term used by Mei et al. in According to Me: Long-Term Personalized Referential Memory QA. The paper compares free-text Descriptive Memory (DM) with fixed-schema key-value memory items. The source information is the same; the representation changes.

Schema-Guided Agent Memory (SGAM) is the engineering pattern this article is naming: use schemas to govern durable agent memory writes, updates, retrieval, and deletion. The schema is not just a nicer note format. It is the contract for state.

ATM-Bench makes the motivation concrete. It uses roughly four years of personal memory data from emails, images, and videos, then asks questions that need personal references, location awareness, multiple evidence items, and updates over time. Current memory systems perform poorly on the hard split, and SGM improves over DM because fields like time, source, location, entities, and tags are available to the retriever and answerer as structure instead of prose.

That is why the next comparison is SGAM versus SGR. Schema-Guided Reasoning is not a random detour. It is the write gate SGAM usually needs.

SGR is the write gate

Schema-Guided Reasoning (SGR) and Schema-Guided Agent Memory (SGAM) solve different parts of the same production problem.

I have used SGR as a recurring building block in this blog. The original vLLM and XGrammar article covered the mechanism. Later posts applied it to structured judges, planners and routers, and reproducible RAG rubrics. In all of those cases, SGR constrained the call so the response was valid enough for code to consume.

SGAM borrows that discipline but changes the lifetime. The output is no longer just a verdict, route, or plan for the current request. It becomes a candidate memory write that may affect future sessions and tool calls. That is why the similarities and differences matter.

SGR constrains one model call. It forces the model to produce a valid object for the current inference step, often with Pydantic, JSON Schema, provider-native structured outputs, or a guided decoding runtime such as XGrammar.

SGAM decides what happens after that object exists. Should it be stored? Does it supersede an older fact? Which tenant can see it? Is it current or historical? Which source episode backs it?

Dimension	SGR	SGAM
Scope	One inference step or tool call	Durable memory lifecycle
Format	Pydantic or JSON Schema for the call	Typed records, schemas, and relations
Enforcement	Structured output or decode-time constraints	Validation, database constraints, conflict rules
Lifetime	Usually discarded after the response	Persists between sessions and runs
Failure mode	Invalid or semantically weak extraction	Stale, polluted, unscoped, or unauditable state

The pattern is:

structured extraction on write -> typed memory at rest -> scoped retrieval on read

SGR makes the extraction reliable enough to enter memory. SGAM makes the memory safe enough to reuse later.

from datetime import datetime
from pydantic import BaseModel, Field


class MemoryDelta(BaseModel):
    tenant_id: str = Field(description="Isolation boundary, e.g. acme")
    subject: str = Field(description="Normalized entity ID, e.g. mira")
    attribute: str = Field(description="Property being updated")
    value: str = Field(description="New value")
    valid_from: datetime
    source_episode_id: str

That object is not the memory system. It is the boundary between messy conversation and the memory ledger.
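To make the boundary concrete, here is a minimal stdlib sketch of the gate a raw model response could pass through before it touches the ledger. The `Delta` dataclass mirrors `MemoryDelta`. The `parse_delta` helper and its rules (non-empty strings, a timezone-aware `valid_from`) are my assumptions for illustration. In a real stack, Pydantic validation or the structured-output runtime would do this work.

```python
import json
from dataclasses import dataclass
from datetime import datetime

FIELDS = ("tenant_id", "subject", "attribute", "value", "valid_from", "source_episode_id")


@dataclass(frozen=True)
class Delta:
    """Stdlib stand-in for MemoryDelta (hypothetical, for illustration only)."""
    tenant_id: str
    subject: str
    attribute: str
    value: str
    valid_from: datetime
    source_episode_id: str


def parse_delta(raw: str) -> Delta:
    """Turn one raw model response into a Delta, or raise ValueError."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("payload must be a JSON object")
    for name in FIELDS:
        if not isinstance(data.get(name), str) or not data[name].strip():
            raise ValueError(f"{name} must be a non-empty string")
    valid_from = datetime.fromisoformat(data["valid_from"])  # ValueError if malformed
    if valid_from.tzinfo is None:
        raise ValueError("valid_from must carry a UTC offset")
    fields = {name: data[name] for name in FIELDS}  # unknown keys are dropped
    fields["valid_from"] = valid_from
    return Delta(**fields)


raw = json.dumps({
    "tenant_id": "acme",
    "subject": "mira",
    "attribute": "passport_deadline",
    "value": "2026-06-30",
    "valid_from": "2026-06-03T10:00:00+00:00",
    "source_episode_id": "e2",
})
delta = parse_delta(raw)
```

Anything that raises here never becomes a candidate write. Whether a rejected payload is retried, logged, or dropped is a policy choice this sketch leaves open.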

The two flows

The workflow has two separate flows. Mixing them is how diagrams become spaghetti.

The ingestion flow is the write path:

Capture a raw episode from messages, tool results, or business events.
Extract typed candidates through structured output.
Validate the schema and reject malformed writes.
Reconcile conflicts, close stale facts, and keep provenance.
Commit the record to the SGAM store.
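The five write-path steps can be sketched in plain Python. This is a toy under stated assumptions: the ledger is an in-memory list, the `extract` callable stands in for the structured-output model call, and the `Fact`, `Ledger`, and `ingest` names are hypothetical. The e1 timestamp is also assumed, since the example in this article only fixes the one for e2.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

REQUIRED = ("subject", "attribute", "value", "valid_from")


@dataclass
class Fact:
    tenant_id: str
    subject: str
    attribute: str
    value: str
    valid_from: str  # ISO 8601 UTC string, so string order matches time order
    source_episode_id: str
    valid_to: Optional[str] = None


@dataclass
class Ledger:
    episodes: dict = field(default_factory=dict)
    facts: list = field(default_factory=list)


def ingest(ledger: Ledger, tenant_id: str, episode_id: str, text: str,
           extract: Callable[[str], list]) -> list:
    """Run the five write-path steps for one episode; return the committed facts."""
    ledger.episodes[episode_id] = text                      # 1. capture
    committed = []
    for cand in extract(text):                              # 2. extract
        if not all(cand.get(k) for k in REQUIRED):          # 3. validate
            continue  # rejected; a real system would log the rejection
        new = Fact(tenant_id, cand["subject"], cand["attribute"],
                   cand["value"], cand["valid_from"], episode_id)
        for old in ledger.facts:                            # 4. reconcile
            same_key = (old.tenant_id, old.subject, old.attribute) == (
                new.tenant_id, new.subject, new.attribute)
            if same_key and old.valid_to is None:
                old.valid_to = new.valid_from  # close, never delete
        ledger.facts.append(new)                            # 5. commit
        committed.append(new)
    return committed


ledger = Ledger()
ingest(ledger, "acme", "e1", "deadline is 2026-07-15", lambda _: [
    {"subject": "mira", "attribute": "passport_deadline",
     "value": "2026-07-15", "valid_from": "2026-06-01T09:00:00Z"}])
ingest(ledger, "acme", "e2", "deadline is now 2026-06-30", lambda _: [
    {"subject": "mira", "attribute": "passport_deadline",
     "value": "2026-06-30", "valid_from": "2026-06-03T10:00:00Z"},
    {"subject": "mira", "attribute": "", "value": "junk", "valid_from": ""}])
current = [f for f in ledger.facts if f.valid_to is None]
```

After both episodes, the ledger holds two deadline facts and only the e2 one is open. The malformed third candidate never reaches the store.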

The request flow is the read path:

Start with the user question.
Decide whether the question needs current state or point-in-time state.
Filter by tenant, memory type, subject, attribute, and validity window.
Add vector or graph expansion only if the exact state lookup is not enough.
Assemble the smallest cited context for the model.
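The first three read-path steps reduce to one scoped query. The sketch below assumes a simplified `memory_facts` table, UTC timestamps stored as same-format ISO strings (so string comparison matches time order), and a hypothetical `read_state` helper. The e1 timestamp and the `globex` row are invented for the demo.

```python
import sqlite3
from typing import Optional

db = sqlite3.connect(":memory:")
db.row_factory = sqlite3.Row
db.execute("""
    create table memory_facts (
        tenant_id text, subject text, attribute text, value text,
        valid_from text, valid_to text, source_episode_id text
    )
""")
db.executemany(
    "insert into memory_facts values (?, ?, ?, ?, ?, ?, ?)",
    [
        ("acme", "mira", "passport_deadline", "2026-07-15",
         "2026-06-01T09:00:00Z", "2026-06-03T10:00:00Z", "e1"),
        ("acme", "mira", "passport_deadline", "2026-06-30",
         "2026-06-03T10:00:00Z", None, "e2"),
        ("globex", "mira", "passport_deadline", "2027-01-01",
         "2026-05-01T00:00:00Z", None, "g1"),
    ],
)


def read_state(db, tenant_id, subject, attribute, as_of: Optional[str] = None):
    """Return the current fact, or the fact valid at as_of, for one scoped key."""
    sql = """
        select value, source_episode_id from memory_facts
        where tenant_id = ? and subject = ? and attribute = ?
    """
    params = [tenant_id, subject, attribute]
    if as_of is None:
        # Current state: the one record whose window is still open.
        sql += " and valid_to is null"
    else:
        # Point-in-time state: half-open window [valid_from, valid_to).
        sql += " and valid_from <= ? and (valid_to is null or valid_to > ?)"
        params += [as_of, as_of]
    row = db.execute(sql, params).fetchone()
    return dict(row) if row else None


now = read_state(db, "acme", "mira", "passport_deadline")
then = read_state(db, "acme", "mira", "passport_deadline", as_of="2026-06-02T00:00:00Z")
```

The tenant filter is part of the key, not an afterthought. The `globex` row about another Mira can never appear in an `acme` answer.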

[Figure: Schema-Guided Agent Memory architecture]

Read that diagram from left to right in two lanes. The top lane writes memory. The bottom lane reads memory. The store is shared, but the responsibilities are not.

What belongs in a memory schema

A minimal SGAM record needs more than text.

tenant_id
memory_id
subject
attribute
value
memory_type
schema_version
valid_from
valid_to
supersedes_memory_id
source_episode_id
confidence
retention_policy

That shape makes ordinary operations explicit. A new passport deadline can close the previous deadline without deleting history. A query can ask for current state or point-in-time state. A response can cite the source episode. A migration can know which schema version produced a record. A deletion flow can know which indexes need cleanup.
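One possible mapping of that field list to a table, as a sketch only. The column types, the check constraints, and the partial unique index are my assumptions, not a prescribed layout. The index is the interesting part: it lets the database itself refuse a second open fact for the same tenant, subject, and attribute. This sketch follows the field list and calls the key `memory_id`; the later excerpt in this article calls the same column `fact_id`.

```python
import sqlite3

DDL = """
create table memory_facts (
    memory_id            integer primary key,
    tenant_id            text not null,
    subject              text not null,
    attribute            text not null,
    value                text not null,
    memory_type          text not null,
    schema_version       integer not null,
    valid_from           text not null,
    valid_to             text,            -- null means "still current"
    supersedes_memory_id integer references memory_facts(memory_id),
    source_episode_id    text not null,
    confidence           real check (confidence between 0 and 1),
    retention_policy     text not null,
    check (valid_to is null or valid_to > valid_from)
);
-- At most one open fact per (tenant, subject, attribute).
create unique index one_open_fact
    on memory_facts (tenant_id, subject, attribute)
    where valid_to is null;
"""

db = sqlite3.connect(":memory:")
db.executescript(DDL)


def insert_open(db, value, valid_from, episode):
    """Insert an open fact for the demo key without closing anything first."""
    db.execute(
        """
        insert into memory_facts
            (tenant_id, subject, attribute, value, memory_type,
             schema_version, valid_from, source_episode_id, retention_policy)
        values ('acme', 'mira', 'passport_deadline', ?, 'fact', 1, ?, ?, 'default')
        """,
        (value, valid_from, episode),
    )


insert_open(db, "2026-07-15", "2026-06-01T09:00:00Z", "e1")
try:
    # Skipping the "close the previous fact" step is now a hard error.
    insert_open(db, "2026-06-30", "2026-06-03T10:00:00Z", "e2")
    second_open_rejected = False
except sqlite3.IntegrityError:
    second_open_rejected = True
```

The design choice is to push the "one current value" rule below the application. A writer that forgets to reconcile fails loudly instead of leaving two current deadlines behind.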

This is where SGAM and RAG separate. RAG retrieves documents. SGAM maintains state.

SGAM is not anti-vector. Vectors are useful for fuzzy recall, clustering, and expansion. But the current value of mira.passport_deadline should come from a scoped memory record, not whichever chunk happened to rank first.

A stale-fact example

The smallest useful SGAM demo is a ledger for that Mira example with two episodes.

e1: Mira prefers concise answers. Her passport deadline is 2026-07-15.
e2: Mira corrected the deadline. It is now 2026-06-30.

A text-memory search can return e1 because it contains the right words. A typed store should return e2 for current state and keep e1 for a historical query.

The write-side behavior is small:

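from __future__ import annotations  # keeps the MemoryFact hint lazy in this excerpt

import sqlite3


def close_previous_fact(db: sqlite3.Connection, fact: MemoryFact) -> int | None:
    """Close the open fact for this tenant, subject, and attribute; return its id, if any."""
    # Assumes db.row_factory = sqlite3.Row, so columns can be read by name.
    # An open fact is one whose valid_to is still null.
    row = db.execute(
        """
        select fact_id
        from memory_facts
        where tenant_id = ?
          and subject = ?
          and attribute = ?
          and valid_to is null
        order by valid_from desc
        limit 1
        """,
        (fact.tenant_id, fact.subject, fact.attribute),
    ).fetchone()
    if not row:
        return None  # nothing to supersede

    # End the old validity window exactly where the new fact begins.
    db.execute(
        "update memory_facts set valid_to = ? where fact_id = ?",
        (fact.valid_from, row["fact_id"]),
    )
    return int(row["fact_id"])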

The expected behavior is the whole point:

Naive text memory:
  returned episode: e1 -> passport deadline is 2026-07-15

SGAM current state:
  mira.passport_deadline = 2026-06-30
  valid_from=2026-06-03T10:00:00Z, source=e2

SGAM point-in-time state:
  on 2026-06-02, mira.passport_deadline = 2026-07-15

In production, pair this transaction with structured extraction on the write path. The database closes stale state. The model should not have to guess which old note still applies.
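"Transaction" is the operative word. A minimal sketch of the close-and-insert step as one atomic unit follows, assuming SQLite, ISO timestamp strings, and a hypothetical `supersede` helper with a `supersedes_fact_id` link column. The e1 timestamp is invented. The middle call passes a null value on purpose, to show that a failed insert also rolls back the close.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.row_factory = sqlite3.Row
db.execute("""
    create table memory_facts (
        fact_id integer primary key,
        tenant_id text not null,
        subject text not null,
        attribute text not null,
        value text not null,
        valid_from text not null,
        valid_to text,
        supersedes_fact_id integer,
        source_episode_id text not null
    )
""")


def supersede(db, tenant_id, subject, attribute, value, valid_from, episode_id):
    """Close the open fact (if any) and insert its replacement atomically."""
    with db:  # commits on success, rolls back if anything below raises
        row = db.execute(
            """
            select fact_id from memory_facts
            where tenant_id = ? and subject = ? and attribute = ?
              and valid_to is null
            """,
            (tenant_id, subject, attribute),
        ).fetchone()
        previous_id = row["fact_id"] if row else None
        if previous_id is not None:
            db.execute(
                "update memory_facts set valid_to = ? where fact_id = ?",
                (valid_from, previous_id),
            )
        cur = db.execute(
            """
            insert into memory_facts
                (tenant_id, subject, attribute, value, valid_from,
                 supersedes_fact_id, source_episode_id)
            values (?, ?, ?, ?, ?, ?, ?)
            """,
            (tenant_id, subject, attribute, value, valid_from, previous_id, episode_id),
        )
        return cur.lastrowid


key = ("acme", "mira", "passport_deadline")
first = supersede(db, *key, "2026-07-15", "2026-06-01T09:00:00Z", "e1")
try:
    supersede(db, *key, None, "2026-06-02T00:00:00Z", "bad")  # violates NOT NULL
except sqlite3.IntegrityError:
    pass
# The failed write must not have closed e1.
open_after_failure = db.execute(
    "select value from memory_facts where valid_to is null").fetchone()["value"]
second = supersede(db, *key, "2026-06-30", "2026-06-03T10:00:00Z", "e2")
```

Without the shared transaction, a crash between the update and the insert would leave Mira with no current deadline at all. That failure is quieter than a stale one.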

Tooling map

The agent memory market has not settled on one name for this pattern. The pieces show up as memory stores, context graphs, profiles, long-term stores, graph RAG, and stateful agents.

Tool or framework	Main storage layer	Temporal handling	Schema mechanism	Practical niche
Zep / Graphiti	Neo4j, FalkorDB, Neptune, legacy Kuzu support	High	Pydantic entity and edge types, temporal edges, provenance	Temporal graph memory
LangGraph / LangMem	LangGraph stores, Postgres-backed stores	Medium	JSON stores plus Pydantic profile or collection extraction	Agent apps already built on LangGraph
Mem0	Managed stack, Valkey / Redis / vector backends in OSS setups	Medium	Memory types, custom categories, extraction prompts	User, agent, and session memory as a service
Letta / MemGPT	Database-backed agent state and memory blocks	Low	Editable labeled memory blocks	Stateful agents with OS-style context management
Cognee	Graph plus vector and relational backends	Medium	Ontology-oriented extraction and validation	Enterprise knowledge graph memory
LlamaIndex property graph	Property graph stores plus vector stores	Low to medium	`SchemaLLMPathExtractor` with allowed entities and relations	Graph extraction over documents and traces

Graphiti is the clearest open-source reference if your memory is relational and temporal. It tracks facts as they change, keeps provenance to source episodes, and supports hybrid retrieval. LangGraph is a good application layer because it separates thread checkpoints from cross-thread stores. Mem0 is useful when you want managed memory operations instead of owning all of the storage plumbing. Letta is closer to editable context blocks than field-level SGAM, but the stateful-agent framing is relevant.

Do not start with the fanciest graph unless the domain needs it. For many teams, a relational table with JSON payloads, validity columns, tenant indexes, and a vector sidecar is enough for the first version.

Implementation playbook

The first implementation decision is not "graph or vector store?" It is what the product is allowed to remember.

A support agent might remember account tier, open cases, and durable contact preferences. It should not promote every frustrated aside into profile state. A coding agent might remember repo conventions and unresolved tasks. It should not keep a private note forever because the note happened to be retrieved once.

Start with the write path and treat memory as a small state mutation:

Name the memory type, subject, tenant scope, and retention class.
Extract candidate records with structured output.
Validate the payload with Pydantic or the schema layer your stack already uses.
Resolve conflicts before insert, including whether the new record supersedes an old one.
Keep a source pointer to the raw episode, tool result, file, ticket, or user confirmation that produced the record.
Write the schema version with the record, not only in application code.

The first SGAM store can be a relational table with a JSON column and a few indexes; a temporal knowledge graph can wait. The graph becomes useful when relationships matter: customer-to-account, account-to-policy, task-to-artifact, user-to-preference, project-to-decision.

Hot path and background writes

Immediate extraction is worth it when the next turn depends on the new memory. If the user says "remember that I prefer short answers," the system should not need a nightly job before it behaves differently.

Most turns are not like that. The cheaper default is to write the raw episode quickly, attach tenant/session/tool metadata, and let a background worker extract candidate memories later. Recurrence-based consolidation is one version of that policy: buffer low-signal turns and promote a fact only when similar evidence repeats or the user explicitly confirms it. The trade-off is freshness lag. That is acceptable for "user often asks for CSV exports" and less acceptable for "customer changed the delivery address."
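Recurrence-based consolidation fits in a few lines. In this sketch the `ConsolidationBuffer` class, the `min_support` threshold of 3, and the exact-match candidate key are all assumptions. A real system would likely need fuzzier matching to decide that two pieces of evidence are "similar".

```python
from collections import defaultdict


class ConsolidationBuffer:
    """Hold low-signal candidates until evidence repeats or the user confirms."""

    def __init__(self, min_support: int = 3):
        self.min_support = min_support
        self._evidence = defaultdict(set)  # candidate key -> distinct episode ids
        self._promoted = set()

    def observe(self, key: tuple, episode_id: str, confirmed: bool = False) -> bool:
        """Record one sighting; return True only when the candidate is first promoted."""
        if key in self._promoted:
            return False  # already written to durable memory
        self._evidence[key].add(episode_id)  # repeats inside one episode count once
        if confirmed or len(self._evidence[key]) >= self.min_support:
            self._promoted.add(key)
            return True
        return False


buf = ConsolidationBuffer(min_support=3)
csv_pref = ("acme", "mira", "export_format", "csv")
sightings = [buf.observe(csv_pref, ep) for ep in ("e1", "e1", "e2", "e3", "e4")]

# An explicit "remember that..." skips the waiting period.
short_pref = ("acme", "mira", "answer_style", "short")
confirmed_now = buf.observe(short_pref, "e5", confirmed=True)
```

Counting distinct episodes rather than raw mentions is deliberate. One chatty session should not be enough to promote a passing remark into profile state.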

The read path should be less clever than the write path. First answer "which state am I allowed to use?" Then ask whether fuzzy retrieval can add useful context.

Filter by tenant, memory type, and validity window.
Retrieve exact structured state before semantic neighbors.
Use vector or graph expansion for supporting evidence, related entities, and examples, not as the authority for current facts.
Assemble the smallest cited context that can answer the question.
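That ordering can be sketched as a small context builder. Everything here is illustrative: `build_context`, `cite`, and the `expand` callable (a stand-in for vector or graph search) are hypothetical names, and the facts are plain dicts rather than database rows.

```python
from typing import Callable, Optional


def cite(fact: dict) -> dict:
    """Render one record as a context item that carries its source and status."""
    return {
        "text": f"{fact['subject']}.{fact['attribute']} = {fact['value']}",
        "source": fact["source_episode_id"],
        "current": fact["valid_to"] is None,
    }


def build_context(facts: list, tenant_id: str, subject: str, attribute: str,
                  expand: Optional[Callable[[list], list]] = None,
                  max_items: int = 3) -> list:
    # 1. Scope first: nothing outside the tenant is visible to later steps.
    scoped = [f for f in facts if f["tenant_id"] == tenant_id]
    # 2. Exact structured state is the authority for the current value.
    exact = [f for f in scoped
             if (f["subject"], f["attribute"]) == (subject, attribute)
             and f["valid_to"] is None]
    context = [{"role": "state", **cite(f)} for f in exact]
    # 3. Fuzzy expansion only adds labelled support, up to a small cap.
    if expand is not None:
        for hit in expand(scoped):
            if len(context) >= max_items:
                break
            if hit not in exact:
                context.append({"role": "support", **cite(hit)})
    return context


FACTS = [
    {"tenant_id": "acme", "subject": "mira", "attribute": "passport_deadline",
     "value": "2026-07-15", "valid_to": "2026-06-03T10:00:00Z", "source_episode_id": "e1"},
    {"tenant_id": "acme", "subject": "mira", "attribute": "passport_deadline",
     "value": "2026-06-30", "valid_to": None, "source_episode_id": "e2"},
    {"tenant_id": "acme", "subject": "mira", "attribute": "answer_style",
     "value": "concise", "valid_to": None, "source_episode_id": "e1"},
    {"tenant_id": "globex", "subject": "mira", "attribute": "passport_deadline",
     "value": "2027-01-01", "valid_to": None, "source_episode_id": "g1"},
]


def toy_expand(records: list) -> list:
    """Toy stand-in for semantic search: everything about the subject."""
    return [r for r in records if r["subject"] == "mira"]


context = build_context(FACTS, "acme", "mira", "passport_deadline", expand=toy_expand)
```

The stale July deadline can still show up, but only as support with `current` set to false. The model sees it as history, not as a competing answer.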

Treat schema migration as product work. When a memory record changes shape, the agent's behavior changes too: what it can recall, what it can cite, what it can delete, and which old facts still count as current. Keep migration scripts, backfills, dual-read windows, and deletion behavior in the same release plan as the product change.
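A dual-read window can be as small as an upgrade-on-read function. The sketch below invents a version change for illustration: v1 payloads look like `{"deadline": ...}` and v2 payloads use `{"value": ..., "value_type": ...}`. The `upgrade` helper and the `UPGRADES` table are hypothetical names.

```python
import json

CURRENT_VERSION = 2


def _v1_to_v2(payload: dict) -> dict:
    # Hypothetical change: v1 stored {"deadline": ...}; v2 uses a generic shape.
    return {"value": payload["deadline"], "value_type": "date"}


UPGRADES = {1: _v1_to_v2}  # version N -> function producing version N + 1


def upgrade(record: dict) -> dict:
    """Return a copy of the record at CURRENT_VERSION; the stored row is untouched."""
    version = record["schema_version"]
    if version > CURRENT_VERSION:
        raise ValueError(f"record v{version} is newer than reader v{CURRENT_VERSION}")
    payload = json.loads(record["payload"])
    while version < CURRENT_VERSION:
        payload = UPGRADES[version](payload)  # apply one step at a time
        version += 1
    return {**record, "schema_version": version,
            "payload": json.dumps(payload, sort_keys=True)}


old_row = {"memory_id": 1, "schema_version": 1,
           "payload": json.dumps({"deadline": "2026-06-30"})}
new_row = {"memory_id": 2, "schema_version": 2,
           "payload": json.dumps({"value": "2026-07-15", "value_type": "date"},
                                 sort_keys=True)}
upgraded = [upgrade(r) for r in (old_row, new_row)]
```

While the backfill runs, readers accept both versions and see one shape. This only works because the version was written with the record; a reader cannot guess it from the payload.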

Where SGAM earns its keep

SGAM is a good fit when memory is stateful and changes over time:

user preferences that can be updated or revoked;
customer or account facts with audit requirements;
task state for long-running assistants;
coding-agent project memory;
multi-agent shared state;
compliance notes where provenance matters;
temporal questions such as "what did we believe before the migration?"

It is overkill when the memory is short-lived, exploratory, or cheap to recompute. If the agent only needs a few turns of continuity, a checkpoint and trimmed message history are enough. If the task is static document QA, RAG may be enough. If your schema changes every day because the domain is still vague, SGAM will slow you down.

Evaluation checklist

Do not evaluate memory only by reading the final answer. A memory system can answer politely while it wrote the wrong fact, retrieved a stale one, or crossed a tenant boundary on the way there.

The shape is close to the evaluation split in my RAG evaluation article: measure the pipeline stage where the failure can happen, not only the generated text at the end. It also overlaps with the trace discipline from the agent evaluation article: a memory bug is often visible in the run history before it becomes visible in the answer.

For SGAM, the cleanest test is replay. Feed a fixed sequence of episodes into the memory writer, inspect the ledger after each meaningful turn, then ask current-state and point-in-time questions against the resulting store.

Layer	Failure you are looking for	Measures
Write extraction	The agent missed a fact, invented one, or produced invalid shape	Schema-valid write rate, extraction precision/recall, source episode coverage
Conflict handling	A stale fact stayed current or a valid old fact was overwritten	Supersession correctness, duplicate rate, stale-fact invalidation correctness
Isolation and policy	Memory leaked across users or survived past its policy window	Tenant isolation failures, deletion correctness, retention compliance
Read retrieval	The right record exists but the reader did not fetch it	Current-state accuracy, point-in-time accuracy, recall@k over memory records
Answer grounding	The answer used memory without support or cited the wrong source	Claim support against source episodes, citation accuracy, conflict-resolution correctness
Operations	The memory path is too slow, too stale, or too expensive	p95 write latency, freshness lag, read latency, cost per query
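The replay test and two of the measures above fit in a short harness. This sketch assumes an in-memory ledger and hypothetical helper names (`apply_write`, `lookup`, `run_replay`). It reports combined accuracy over current-state and point-in-time cases, plus a count of keys left with more than one open fact. The e1 timestamp is invented.

```python
from collections import Counter


def apply_write(ledger: list, fact: dict) -> None:
    """Toy memory writer: close the open fact for the key, then append the new one."""
    for old in ledger:
        if old["key"] == fact["key"] and old["valid_to"] is None:
            old["valid_to"] = fact["valid_from"]
    ledger.append({**fact, "valid_to": None})


def lookup(ledger: list, key: tuple, as_of=None):
    """Current value when as_of is None, otherwise the value valid at as_of."""
    for fact in ledger:
        if fact["key"] != key:
            continue
        if as_of is None:
            if fact["valid_to"] is None:
                return fact["value"]
        elif fact["valid_from"] <= as_of and (
                fact["valid_to"] is None or as_of < fact["valid_to"]):
            return fact["value"]
    return None


def run_replay(episodes: list, cases: list) -> dict:
    ledger, snapshots = [], []
    for fact in episodes:
        apply_write(ledger, fact)
        snapshots.append([dict(f) for f in ledger])  # ledger state after each turn
    passed = [lookup(ledger, key, as_of) == expected for key, as_of, expected in cases]
    open_per_key = Counter(f["key"] for f in ledger if f["valid_to"] is None)
    return {
        "accuracy": sum(passed) / len(passed),
        "keys_with_multiple_open_facts": sum(1 for n in open_per_key.values() if n > 1),
        "snapshots": snapshots,
    }


KEY = ("acme", "mira", "passport_deadline")
EPISODES = [
    {"key": KEY, "value": "2026-07-15", "valid_from": "2026-06-01T09:00:00Z", "source": "e1"},
    {"key": KEY, "value": "2026-06-30", "valid_from": "2026-06-03T10:00:00Z", "source": "e2"},
]
CASES = [
    (KEY, None, "2026-06-30"),                                  # current state
    (KEY, "2026-06-02T00:00:00Z", "2026-07-15"),                # point-in-time state
    (("globex", "mira", "passport_deadline"), None, None),      # tenant isolation
]
report = run_replay(EPISODES, CASES)
```

Swap `apply_write` for the real writer and keep the episodes and cases fixed. A regression in supersession then shows up as a number, not as a strange answer three weeks later.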

Benchmarks such as LoCoMo, LongMemEval, and ATM-Bench give useful external signals. They are not replacements for a domain test suite. Memory is workload-shaped: a coding assistant, a customer support bot, and a compliance copilot need different schemas, filters, retention rules, and failure tests.

Caveats

SGAM is my label for a pattern, not a standard. The industry already uses overlapping names for pieces of the same design space. LangGraph memory and LangMem talk about short-term and long-term stores, profiles, collections, hot-path writes, and background memory managers. Zep's Graphiti calls the graph-shaped version a temporal Context Graph. Letta frames the system as stateful agents with persisted memory blocks. Mem0 calls itself a managed memory layer. Microsoft GraphRAG, LlamaIndex property graphs, and Cognee use knowledge-graph language for related retrieval and ontology problems.

Those names are not interchangeable. Some systems manage user profiles. Some manage episodes. Some build graph context over documents. Some give the agent tools to edit its own memory. SGAM is the stricter version of the claim: when durable memory represents current application state, it needs schema, validity, provenance, conflict handling, retention, and migration.

Typed memory can still be wrong. A schema makes bad writes easier to inspect; it does not make them trustworthy. You still need source trust, user confirmation for sensitive facts, conflict policy, deletion, and monitoring.

Schema migration is work. Once memory becomes state, you own versioning, backfills, old records, and deletion behavior. That is the price of getting reliable current state instead of a pile of plausible notes.

Key takeaways

Context is a working set. Durable agent memory needs a separate state contract.
SGR and SGAM are complementary: validate the write, then preserve typed memory lifecycle at rest.
Vector recall is useful, but current facts need scope, validity, supersession, and provenance.
Start with a simple relational or JSON-backed memory ledger before reaching for a graph.
Evaluate memory by write correctness, temporal retrieval, grounding, tenant isolation, and deletion behavior.

References

According to Me: Long-Term Personalized Referential Memory QA - Mei et al. paper introducing ATM-Bench and Schema-Guided Memory.
Towards Scalable Multi-Domain Conversational Agents: The Schema-Guided Dialogue Dataset - Rastogi et al. paper on the SGD dataset.
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory - Wu et al. benchmark for long-term memory abilities.
Graphiti: Build Temporal Context Graphs for AI Agents
Zep: A Temporal Knowledge Graph Architecture for Agent Memory
Mem0 Platform Overview
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
LangGraph Memory Overview
LangGraph Persistence
LangMem documentation
Letta: Introduction to Stateful Agents
Microsoft GraphRAG documentation
LlamaIndex: Using a Property Graph Index
Cognee Documentation
Pydantic model validation docs
Python sqlite3 documentation