AI Agent Memory: Schema-Guided Typed State for Long-Running Systems
Agents rarely fail because they forgot everything. They fail because they remembered an old fact and treated it as current state.
Imagine a long-running assistant serving a user named Mira. In one episode, Mira says her passport deadline is July 15. Later, she corrects it to June 30. A vector hit can find the old note. It cannot know, by itself, which deadline is current. That is the difference between recall and memory.
TL;DR: Treat durable agent memory as typed application state. Extract memory candidates through a structured-output boundary, store records with tenant scope, validity windows, supersession, provenance, and schema versioning, then retrieve the smallest current slice on the read path. Vector search can stay in the system. It should not be the authority for mutable facts.
The context-window trap
The context window is a working set. It is not durable memory, and it is not a database.
Long-running agents need to remember preferences, task status, customer facts, tool decisions, compliance notes, and prior mistakes. The easy version is to append summaries or dump old notes into a vector store. That works until one of the remembered facts changes.
Now the agent has two passport deadlines, two preferred formats, or two project decisions. Semantic search may retrieve both. A summary may overwrite one. A long context may include the stale one next to the active one. None of those behaviors gives you a memory contract.
The contract has to answer concrete questions:
- What is true now?
- What was true on June 2?
- Who said it?
- Which tenant does it belong to?
- Which older fact did this fact replace?
- Can I delete or expire it?
That is where Schema-Guided Agent Memory (SGAM) starts.
What SGAM means
The name needs a short cleanup before the architecture makes sense.
Schema-Guided Dialogue (SGD) is the 2019 Google task-oriented dialogue dataset. Its schema describes service APIs, intents, and slots so a dialogue model can track state for services it has not seen before. Useful precedent, different layer.
Schema-Guided Memory (SGM) is the research term used by Mei et al. in "According to Me: Long-Term Personalized Referential Memory QA". The paper compares free-text Descriptive Memory (DM) with fixed-schema key-value memory items. The source information is the same; the representation changes.
Schema-Guided Agent Memory (SGAM) is the engineering pattern this article is naming: use schemas to govern durable agent memory writes, updates, retrieval, and deletion. The schema is not just a nicer note format. It is the contract for state.
ATM-Bench, the benchmark introduced in the same paper, makes the motivation concrete. It uses roughly four years of personal memory data from emails, images, and videos, then asks questions that need personal references, location awareness, multiple evidence items, and updates over time. Current memory systems perform poorly on the hard split, and SGM improves over DM because fields like time, source, location, entities, and tags are available to the retriever and answerer as structure instead of prose.
That is why the next comparison is SGAM versus SGR. Schema-Guided Reasoning is not a random detour. It is the write gate SGAM usually needs.
SGR is the write gate
Schema-Guided Reasoning (SGR) and Schema-Guided Agent Memory (SGAM) solve different parts of the same production problem.
I have used SGR as a recurring building block in this blog. The original vLLM and XGrammar article covered the mechanism. Later posts applied it to structured judges, planners and routers, and reproducible RAG rubrics. In all of those cases, SGR constrained the call so the response was valid enough for code to consume.
SGAM borrows that discipline but changes the lifetime. The output is no longer just a verdict, route, or plan for the current request. It becomes a candidate memory write that may affect future sessions and tool calls. That is why the similarities and differences matter.
SGR constrains one model call. It forces the model to produce a valid object for the current inference step, often with Pydantic, JSON Schema, provider-native structured outputs, or a guided decoding runtime such as XGrammar.
SGAM decides what happens after that object exists. Should it be stored? Does it supersede an older fact? Which tenant can see it? Is it current or historical? Which source episode backs it?
| Dimension | SGR | SGAM |
|---|---|---|
| Scope | One inference step or tool call | Durable memory lifecycle |
| Format | Pydantic or JSON Schema for the call | Typed records, schemas, and relations |
| Enforcement | Structured output or decode-time constraints | Validation, database constraints, conflict rules |
| Lifetime | Usually discarded after the response | Persists between sessions and runs |
| Failure mode | Invalid or semantically weak extraction | Stale, polluted, unscoped, or unauditable state |
The pattern is:
structured extraction on write -> typed memory at rest -> scoped retrieval on read
SGR makes the extraction reliable enough to enter memory. SGAM makes the memory safe enough to reuse later.
from datetime import datetime

from pydantic import BaseModel, Field


class MemoryDelta(BaseModel):
    tenant_id: str = Field(description="Isolation boundary, e.g. acme")
    subject: str = Field(description="Normalized entity ID, e.g. mira")
    attribute: str = Field(description="Property being updated")
    value: str = Field(description="New value")
    valid_from: datetime
    source_episode_id: str
That object is not the memory system. It is the boundary between messy conversation and the memory ledger.
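A minimal sketch of that boundary, written against the standard library only so it runs without dependencies. It is a stand-in, not the Pydantic model above: `MemoryCandidate`, `parse_candidate`, and the sample payloads are hypothetical names for illustration. With Pydantic v2 the equivalent gate would be `MemoryDelta.model_validate_json(raw)` wrapped in a `ValidationError` handler.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Field names mirror MemoryDelta; everything else here is illustrative.
REQUIRED_FIELDS = (
    "tenant_id", "subject", "attribute",
    "value", "valid_from", "source_episode_id",
)


@dataclass(frozen=True)
class MemoryCandidate:
    tenant_id: str
    subject: str
    attribute: str
    value: str
    valid_from: datetime
    source_episode_id: str


def parse_candidate(raw: str) -> Optional[MemoryCandidate]:
    """Return a typed candidate, or None when the payload must be rejected."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Every field must be present and a non-empty string.
    for name in REQUIRED_FIELDS:
        field = data.get(name)
        if not isinstance(field, str) or not field.strip():
            return None
    try:
        valid_from = datetime.fromisoformat(data["valid_from"])
    except ValueError:
        return None
    # Assumption: naive timestamps are rejected so validity windows compare cleanly.
    if valid_from.tzinfo is None:
        return None
    return MemoryCandidate(
        tenant_id=data["tenant_id"],
        subject=data["subject"],
        attribute=data["attribute"],
        value=data["value"],
        valid_from=valid_from,
        source_episode_id=data["source_episode_id"],
    )


good = parse_candidate(json.dumps({
    "tenant_id": "acme",
    "subject": "mira",
    "attribute": "passport_deadline",
    "value": "2026-06-30",
    "valid_from": "2026-06-03T10:00:00+00:00",
    "source_episode_id": "e2",
}))
# Missing fields: the write is rejected instead of being stored half-formed.
bad = parse_candidate('{"tenant_id": "acme", "value": "2026-06-30"}')
```

The gate only proves shape. Whether the candidate is true, or whether it supersedes something, is decided later in the ledger.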
The two flows
The workflow has two separate flows. Mixing them is how diagrams become spaghetti.
The ingestion flow is the write path:
- Capture a raw episode from messages, tool results, or business events.
- Extract typed candidates through structured output.
- Validate the schema and reject malformed writes.
- Reconcile conflicts, close stale facts, and keep provenance.
- Commit the record to the SGAM store.
The request flow is the read path:
- Start with the user question.
- Decide whether the question needs current state or point-in-time state.
- Filter by tenant, memory type, subject, attribute, and validity window.
- Add vector or graph expansion only if the exact state lookup is not enough.
- Assemble the smallest cited context for the model.
Picture the two flows as two lanes of one diagram, read from left to right. The top lane writes memory. The bottom lane reads memory. The store is shared, but the responsibilities are not.
What belongs in a memory schema
A minimal SGAM record needs more than text.
tenant_id
memory_id
subject
attribute
value
memory_type
schema_version
valid_from
valid_to
supersedes_memory_id
source_episode_id
confidence
retention_policy
That shape makes ordinary operations explicit. A new passport deadline can close the previous deadline without deleting history. A query can ask for current state or point-in-time state. A response can cite the source episode. A migration can know which schema version produced a record. A deletion flow can know which indexes need cleanup.
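One way to pin that shape down is a relational table. The sketch below assumes SQLite; the table name `memory_records`, the index name, the column types, the `insert_open` helper, and the default values for `memory_type`, `schema_version`, `confidence`, and `retention_policy` are all illustrative choices, not part of the pattern. The partial unique index is the interesting part: it lets the database refuse a second open record for the same tenant, subject, and attribute.

```python
import sqlite3

SCHEMA = """
create table memory_records (
    memory_id            integer primary key,
    tenant_id            text not null,
    subject              text not null,
    attribute            text not null,
    value                text not null,
    memory_type          text not null,
    schema_version       integer not null,
    valid_from           text not null,
    valid_to             text,
    supersedes_memory_id integer references memory_records(memory_id),
    source_episode_id    text not null,
    confidence           real not null check (confidence between 0 and 1),
    retention_policy     text not null
);
-- At most one open (valid_to is null) record per tenant, subject, attribute.
create unique index one_open_record
    on memory_records (tenant_id, subject, attribute)
    where valid_to is null;
"""

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)


def insert_open(db, tenant_id, subject, attribute, value, valid_from, episode_id):
    """Insert an open record with placeholder type, version, and policy values."""
    cur = db.execute(
        """
        insert into memory_records (
            tenant_id, subject, attribute, value, memory_type, schema_version,
            valid_from, source_episode_id, confidence, retention_policy
        ) values (?, ?, ?, ?, 'fact', 1, ?, ?, 1.0, 'default')
        """,
        (tenant_id, subject, attribute, value, valid_from, episode_id),
    )
    return cur.lastrowid


insert_open(db, "acme", "mira", "passport_deadline",
            "2026-07-15", "2026-06-01T09:00:00Z", "e1")
try:
    # A second open record for the same key violates the partial index.
    insert_open(db, "acme", "mira", "passport_deadline",
                "2026-06-30", "2026-06-03T10:00:00Z", "e2")
    second_open_rejected = False
except sqlite3.IntegrityError:
    second_open_rejected = True
```

With this constraint in place, a writer that forgets to close the previous record fails loudly instead of leaving two current values behind.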
This is where SGAM and RAG separate. RAG retrieves documents. SGAM maintains state.
SGAM is not anti-vector. Vectors are useful for fuzzy recall, clustering, and expansion. But the current value of mira.passport_deadline should come from a scoped memory record, not whichever chunk happened to rank first.
A stale-fact example
The smallest useful SGAM demo is a ledger for that Mira example with two episodes.
e1: Mira prefers concise answers. Her passport deadline is 2026-07-15.
e2: Mira corrected the deadline. It is now 2026-06-30.
A text-memory search can return e1 because it contains the right words. A typed store should return e2 for current state and keep e1 for a historical query.
The write-side behavior is small:
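import sqlite3
from dataclasses import dataclass


@dataclass
class MemoryFact:
    # Only the fields this function reads; the full record carries more.
    tenant_id: str
    subject: str
    attribute: str
    valid_from: str  # assumed here to be an ISO-8601 UTC timestamp string


def close_previous_fact(db: sqlite3.Connection, fact: MemoryFact) -> int | None:
    """Close the open fact for this key, if any, and return its fact_id."""
    row = db.execute(
        """
        select fact_id
        from memory_facts
        where tenant_id = ?
          and subject = ?
          and attribute = ?
          and valid_to is null
        order by valid_from desc
        limit 1
        """,
        (fact.tenant_id, fact.subject, fact.attribute),
    ).fetchone()
    if not row:
        return None
    # The superseded fact stops being valid exactly when the new one starts.
    db.execute(
        "update memory_facts set valid_to = ? where fact_id = ?",
        (fact.valid_from, row[0]),
    )
    # Positional access works with or without db.row_factory = sqlite3.Row.
    return int(row[0])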
The expected behavior is the whole point:
Naive text memory:
  returned episode: e1 -> passport deadline is 2026-07-15

SGAM current state:
  mira.passport_deadline = 2026-06-30
  valid_from=2026-06-03T10:00:00Z, source=e2

SGAM point-in-time state:
  on 2026-06-02, mira.passport_deadline = 2026-07-15
In production, pair this transaction with structured extraction on the write path. The database closes stale state. The model should not have to guess which old note still applies.
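The read side of the same ledger can be sketched in a few lines. This assumes SQLite, a trimmed `memory_facts` table, and timestamps stored as ISO-8601 UTC strings in one fixed format so that string comparison matches time order. The helper names `record_fact`, `current_value`, and `value_as_of` are hypothetical, and the timestamp for e1 is invented for the demo, since only the e2 timestamp appears in the expected output.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.row_factory = sqlite3.Row
db.execute(
    """
    create table memory_facts (
        fact_id           integer primary key,
        tenant_id         text not null,
        subject           text not null,
        attribute         text not null,
        value             text not null,
        valid_from        text not null,
        valid_to          text,
        source_episode_id text not null
    )
    """
)


def record_fact(db, tenant_id, subject, attribute, value, valid_from, episode_id):
    """Close any open fact for the key and insert the new one atomically."""
    with db:  # commits on success, rolls back if either statement fails
        db.execute(
            "update memory_facts set valid_to = ? "
            "where tenant_id = ? and subject = ? and attribute = ? "
            "and valid_to is null",
            (valid_from, tenant_id, subject, attribute),
        )
        db.execute(
            "insert into memory_facts (tenant_id, subject, attribute, value, "
            "valid_from, source_episode_id) values (?, ?, ?, ?, ?, ?)",
            (tenant_id, subject, attribute, value, valid_from, episode_id),
        )


def current_value(db, tenant_id, subject, attribute):
    """Return (value, source_episode_id) for the open fact, or None."""
    row = db.execute(
        "select value, source_episode_id from memory_facts "
        "where tenant_id = ? and subject = ? and attribute = ? "
        "and valid_to is null",
        (tenant_id, subject, attribute),
    ).fetchone()
    return (row["value"], row["source_episode_id"]) if row else None


def value_as_of(db, tenant_id, subject, attribute, as_of):
    """Return the fact whose validity window contains as_of, or None."""
    row = db.execute(
        "select value, source_episode_id from memory_facts "
        "where tenant_id = ? and subject = ? and attribute = ? "
        "and valid_from <= ? and (valid_to is null or valid_to > ?)",
        (tenant_id, subject, attribute, as_of, as_of),
    ).fetchone()
    return (row["value"], row["source_episode_id"]) if row else None


key = ("acme", "mira", "passport_deadline")
record_fact(db, *key, "2026-07-15", "2026-06-01T09:00:00Z", "e1")  # e1 time is assumed
record_fact(db, *key, "2026-06-30", "2026-06-03T10:00:00Z", "e2")

now = current_value(db, *key)                        # ("2026-06-30", "e2")
then = value_as_of(db, *key, "2026-06-02T00:00:00Z")  # ("2026-07-15", "e1")
```

Both answers come from the same rows. Nothing was deleted to make the current answer correct.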
Tooling map
The agent memory market has not settled on one name for this pattern. The pieces show up as memory stores, context graphs, profiles, long-term stores, graph RAG, and stateful agents.
| Tool or framework | Main storage layer | Temporal handling | Schema mechanism | Practical niche |
|---|---|---|---|---|
| Zep / Graphiti | Neo4j, FalkorDB, Neptune, legacy Kuzu support | High | Pydantic entity and edge types, temporal edges, provenance | Temporal graph memory |
| LangGraph / LangMem | LangGraph stores, Postgres-backed stores | Medium | JSON stores plus Pydantic profile or collection extraction | Agent apps already built on LangGraph |
| Mem0 | Managed stack, Valkey / Redis / vector backends in OSS setups | Medium | Memory types, custom categories, extraction prompts | User, agent, and session memory as a service |
| Letta / MemGPT | Database-backed agent state and memory blocks | Low | Editable labeled memory blocks | Stateful agents with OS-style context management |
| Cognee | Graph plus vector and relational backends | Medium | Ontology-oriented extraction and validation | Enterprise knowledge graph memory |
| LlamaIndex property graph | Property graph stores plus vector stores | Low to medium | SchemaLLMPathExtractor with allowed entities and relations | Graph extraction over documents and traces |
Graphiti is the clearest open-source reference if your memory is relational and temporal. It tracks facts as they change, keeps provenance to source episodes, and supports hybrid retrieval. LangGraph is a good application layer because it separates thread checkpoints from cross-thread stores. Mem0 is useful when you want managed memory operations instead of owning all of the storage plumbing. Letta is closer to editable context blocks than field-level SGAM, but the stateful-agent framing is relevant.
Do not start with the fanciest graph unless the domain needs it. For many teams, a relational table with JSON payloads, validity columns, tenant indexes, and a vector sidecar is enough for the first version.
Implementation playbook
The first implementation decision is not "graph or vector store?" It is what the product is allowed to remember.
A support agent might remember account tier, open cases, and durable contact preferences. It should not promote every frustrated aside into profile state. A coding agent might remember repo conventions and unresolved tasks. It should not keep a private note forever because the note happened to be retrieved once.
Start with the write path and treat memory as a small state mutation:
- Name the memory type, subject, tenant scope, and retention class.
- Extract candidate records with structured output.
- Validate the payload with Pydantic or the schema layer your stack already uses.
- Resolve conflicts before insert, including whether the new record supersedes an old one.
- Keep a source pointer to the raw episode, tool result, file, ticket, or user confirmation that produced the record.
- Write the schema version with the record, not only in application code.
For many teams, the first SGAM store can be a relational table with a JSON column and a few indexes. You do not need to start with a temporal knowledge graph. The graph becomes useful when relationships matter: customer-to-account, account-to-policy, task-to-artifact, user-to-preference, project-to-decision.
Hot path and background writes
Immediate extraction is worth it when the next turn depends on the new memory. If the user says "remember that I prefer short answers," the system should not need a nightly job before it behaves differently.
Most turns are not like that. The cheaper default is to write the raw episode quickly, attach tenant/session/tool metadata, and let a background worker extract candidate memories later. Recurrence-based consolidation is one version of that policy: buffer low-signal turns and promote a fact only when similar evidence repeats or the user explicitly confirms it. The trade-off is freshness lag. That is acceptable for "user often asks for CSV exports" and less acceptable for "customer changed the delivery address."
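Recurrence-based consolidation can be sketched as a small buffer in front of the ledger. Everything here is an assumption made for illustration: the `PromotionBuffer` class, the threshold of three distinct episodes, and the rule that an explicit confirmation promotes immediately.

```python
from collections import defaultdict


class PromotionBuffer:
    """Hold low-signal candidates until evidence repeats or the user confirms."""

    def __init__(self, min_episodes: int = 3):
        self.min_episodes = min_episodes
        # candidate key -> set of distinct episode ids that mentioned it
        self._seen = defaultdict(set)

    def observe(self, tenant_id, subject, attribute, value, episode_id,
                confirmed: bool = False) -> bool:
        """Return True when the candidate should be written to durable memory."""
        key = (tenant_id, subject, attribute, value)
        self._seen[key].add(episode_id)
        if confirmed or len(self._seen[key]) >= self.min_episodes:
            del self._seen[key]  # promoted, so stop buffering it
            return True
        return False


buffer = PromotionBuffer(min_episodes=3)

# Low-signal habit: promoted only on the third distinct episode.
decisions = [
    buffer.observe("acme", "mira", "export_format", "csv", episode_id)
    for episode_id in ("e10", "e11", "e11", "e12")
]

# Explicit correction: promoted immediately, no waiting for recurrence.
address_promoted = buffer.observe(
    "acme", "mira", "delivery_address", "12 Harbor St", "e13", confirmed=True
)
```

Counting distinct episodes rather than raw mentions keeps one chatty session from promoting a fact on its own.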
The read path should be less clever than the write path. First answer "which state am I allowed to use?" Then ask whether fuzzy retrieval can add useful context.
- Filter by tenant, memory type, and validity window.
- Retrieve exact structured state before semantic neighbors.
- Use vector or graph expansion for supporting evidence, related entities, and examples, not as the authority for current facts.
- Assemble the smallest cited context that can answer the question.
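The ordering in that list can be made explicit in the context builder. The sketch below works on in-memory data; `Record`, `assemble_context`, the `neighbors` tuples standing in for vector search results, and the label strings are all hypothetical. The point it illustrates is that structured state goes first and is labeled as authoritative, while fuzzy hits are capped, tenant-filtered, and labeled as supporting only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Record:
    tenant_id: str
    subject: str
    attribute: str
    value: str
    valid_to: Optional[str]  # None means the record is current
    source_episode_id: str


def assemble_context(
    records: List[Record],
    neighbors: List[Tuple[str, str, str]],  # (tenant_id, episode_id, text)
    tenant_id: str,
    subject: str,
    attribute: str,
    max_neighbors: int = 2,
) -> List[str]:
    """Build cited context lines: exact current state first, then fuzzy hits."""
    lines = []
    for r in records:
        if (r.tenant_id, r.subject, r.attribute) == (tenant_id, subject, attribute) \
                and r.valid_to is None:
            lines.append(
                f"[state] {r.subject}.{r.attribute} = {r.value} "
                f"(source: {r.source_episode_id})"
            )
    # Fuzzy hits never cross the tenant boundary and never outrank state.
    allowed = [n for n in neighbors if n[0] == tenant_id][:max_neighbors]
    for _, episode_id, text in allowed:
        lines.append(f"[supporting, may be stale] {text} (source: {episode_id})")
    return lines


records = [
    Record("acme", "mira", "passport_deadline", "2026-07-15",
           "2026-06-03T10:00:00Z", "e1"),
    Record("acme", "mira", "passport_deadline", "2026-06-30", None, "e2"),
]
neighbors = [
    ("acme", "e1", "Her passport deadline is 2026-07-15."),
    ("globex", "g7", "Passport renewal takes six weeks."),
]
context = assemble_context(records, neighbors, "acme", "mira", "passport_deadline")
```

The stale note from e1 can still appear, but only below the current value and with a label that tells the model not to treat it as state.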
Treat schema migration as product work. When a memory record changes shape, the agent's behavior changes too: what it can recall, what it can cite, what it can delete, and which old facts still count as current. Keep migration scripts, backfills, dual-read windows, and deletion behavior in the same release plan as the product change.
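A dual-read window can be as simple as upgrading old records on the way out of the store. In this sketch the version numbers, the `UPGRADES` registry, and the v1-to-v2 change (making `retention_policy` mandatory with a default) are invented for illustration; a real migration would also need a backfill and a plan for deletion behavior.

```python
CURRENT_VERSION = 2


def v1_to_v2(record: dict) -> dict:
    # Hypothetical change: v2 requires retention_policy on every record.
    return {
        **record,
        "retention_policy": record.get("retention_policy", "default"),
        "schema_version": 2,
    }


# One upgrade step per source version, applied in sequence.
UPGRADES = {1: v1_to_v2}


def upgrade(record: dict) -> dict:
    """Return a copy of the record at CURRENT_VERSION without mutating the input."""
    current = dict(record)
    while current["schema_version"] < CURRENT_VERSION:
        current = UPGRADES[current["schema_version"]](current)
    return current


old = {
    "schema_version": 1,
    "subject": "mira",
    "attribute": "passport_deadline",
    "value": "2026-06-30",
}
upgraded = upgrade(old)
```

Because the stored `schema_version` travels with each record, readers can handle both shapes during the window and the backfill can run at its own pace.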
Where SGAM earns its keep
SGAM is a good fit when memory is stateful and changes over time:
- user preferences that can be updated or revoked;
- customer or account facts with audit requirements;
- task state for long-running assistants;
- coding-agent project memory;
- multi-agent shared state;
- compliance notes where provenance matters;
- temporal questions such as "what did we believe before the migration?"
It is overkill when the memory is short-lived, exploratory, or cheap to recompute. If the agent only needs a few turns of continuity, a checkpoint and trimmed message history are enough. If the task is static document QA, RAG may be enough. If your schema changes every day because the domain is still vague, SGAM will slow you down.
Evaluation checklist
Do not evaluate memory only by reading the final answer. A memory system can answer politely while it wrote the wrong fact, retrieved a stale one, or crossed a tenant boundary on the way there.
The shape is close to the evaluation split in my RAG evaluation article: measure the pipeline stage where the failure can happen, not only the generated text at the end. It also overlaps with the trace discipline from the agent evaluation article: a memory bug is often visible in the run history before it becomes visible in the answer.
For SGAM, the cleanest test is replay. Feed a fixed sequence of episodes into the memory writer, inspect the ledger after each meaningful turn, then ask current-state and point-in-time questions against the resulting store.
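A replay test can be sketched with an in-memory ledger. The `apply_fact`, `current`, `as_of`, and `replay` helpers, the second tenant `globex`, and every timestamp except the one for e2 are assumptions made up for the demo. The harness checks the expected current value after every episode, not only at the end.

```python
def apply_fact(ledger, fact):
    """Close the open fact for the same key, then append the new one."""
    for old in ledger:
        if old["key"] == fact["key"] and old["valid_to"] is None:
            old["valid_to"] = fact["valid_from"]
    ledger.append({**fact, "valid_to": None})


def current(ledger, key):
    open_values = [f["value"] for f in ledger
                   if f["key"] == key and f["valid_to"] is None]
    # Invariant: supersession must never leave two open facts for one key.
    assert len(open_values) <= 1, "more than one open fact for a key"
    return open_values[0] if open_values else None


def as_of(ledger, key, ts):
    for f in ledger:
        if f["key"] == key and f["valid_from"] <= ts \
                and (f["valid_to"] is None or f["valid_to"] > ts):
            return f["value"]
    return None


def replay(episodes, expected_current, key):
    """Apply episodes in order; collect (step, got, want) for every mismatch."""
    ledger, failures = [], []
    for step, (fact, want) in enumerate(zip(episodes, expected_current)):
        apply_fact(ledger, fact)
        got = current(ledger, key)
        if got != want:
            failures.append((step, got, want))
    return ledger, failures


ACME = ("acme", "mira", "passport_deadline")
GLOBEX = ("globex", "mira", "passport_deadline")  # same subject, other tenant

EPISODES = [
    {"key": ACME, "value": "2026-07-15",
     "valid_from": "2026-06-01T09:00:00Z", "episode": "e1"},
    {"key": GLOBEX, "value": "2026-09-01",
     "valid_from": "2026-06-02T09:00:00Z", "episode": "g1"},
    {"key": ACME, "value": "2026-06-30",
     "valid_from": "2026-06-03T10:00:00Z", "episode": "e2"},
]
# Expected current value for the ACME key after each episode.
EXPECTED = ["2026-07-15", "2026-07-15", "2026-06-30"]

ledger, failures = replay(EPISODES, EXPECTED, ACME)
point_in_time = as_of(ledger, ACME, "2026-06-02T00:00:00Z")
other_tenant = current(ledger, GLOBEX)
```

The second tenant is there on purpose: a write for `globex` must not change what `acme` sees, and the middle expectation catches it if it does.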
| Layer | Failure you are looking for | Measures |
|---|---|---|
| Write extraction | The agent missed a fact, invented one, or produced invalid shape | Schema-valid write rate, extraction precision/recall, source episode coverage |
| Conflict handling | A stale fact stayed current or a valid old fact was overwritten | Supersession correctness, duplicate rate, stale-fact invalidation correctness |
| Isolation and policy | Memory leaked across users or survived past its policy window | Tenant isolation failures, deletion correctness, retention compliance |
| Read retrieval | The right record exists but the reader did not fetch it | Current-state accuracy, point-in-time accuracy, recall@k over memory records |
| Answer grounding | The answer used memory without support or cited the wrong source | Claim support against source episodes, citation accuracy, conflict-resolution correctness |
| Operations | The memory path is too slow, too stale, or too expensive | p95 write latency, freshness lag, read latency, cost per query |
Benchmarks such as LoCoMo, LongMemEval, and ATM-Bench give useful external signals. They are not replacements for a domain test suite. Memory is workload-shaped: a coding assistant, a customer support bot, and a compliance copilot need different schemas, filters, retention rules, and failure tests.
Caveats
SGAM is my label for a pattern, not a standard. The industry already uses overlapping names for pieces of the same design space: LangGraph memory and LangMem talk about short-term and long-term stores, profiles, collections, hot-path writes, and background memory managers. Zep Graphiti calls the graph-shaped version a temporal Context Graph. Letta frames the system as stateful agents with persisted memory blocks. Mem0 calls itself a managed memory layer. Microsoft GraphRAG, LlamaIndex property graphs, and Cognee use knowledge-graph language for related retrieval and ontology problems.
Those names are not interchangeable. Some systems manage user profiles. Some manage episodes. Some build graph context over documents. Some give the agent tools to edit its own memory. SGAM is the stricter version of the claim: when durable memory represents current application state, it needs schema, validity, provenance, conflict handling, retention, and migration.
Typed memory can still be wrong. A schema makes bad writes easier to inspect; it does not make them trustworthy. You still need source trust, user confirmation for sensitive facts, conflict policy, deletion, and monitoring.
Schema migration is work. Once memory becomes state, you own versioning, backfills, old records, and deletion behavior. That is the price of getting reliable current state instead of a pile of plausible notes.
Key takeaways
- Context is a working set. Durable agent memory needs a separate state contract.
- SGR and SGAM are complementary: validate the write, then preserve typed memory lifecycle at rest.
- Vector recall is useful, but current facts need scope, validity, supersession, and provenance.
- Start with a simple relational or JSON-backed memory ledger before reaching for a graph.
- Evaluate memory by write correctness, temporal retrieval, grounding, tenant isolation, and deletion behavior.
References
- According to Me: Long-Term Personalized Referential Memory QA - Mei et al. paper introducing ATM-Bench and Schema-Guided Memory.
- Towards Scalable Multi-Domain Conversational Agents: The Schema-Guided Dialogue Dataset - Rastogi et al. paper on the SGD dataset.
- LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory - Wu et al. benchmark for long-term memory abilities.
- Graphiti: Build Temporal Context Graphs for AI Agents
- Zep: A Temporal Knowledge Graph Architecture for Agent Memory
- Mem0 Platform Overview
- Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
- LangGraph Memory Overview
- LangGraph Persistence
- LangMem documentation
- Letta: Introduction to Stateful Agents
- Microsoft GraphRAG documentation
- LlamaIndex: Using a Property Graph Index
- Cognee Documentation
- Pydantic model validation docs
- Python sqlite3 documentation