The Runtime: running agents for hours, not seconds
Part 5 of the Engineering the Agentic Stack series
The previous four posts covered the inside of the agent: reasoning, memory, tools, security. This one is about the outside.
Production agents in 2026 stopped being request handlers and started being background workers. A run can last a few hours. The worker hosting it might not. The interesting engineering moved out of the agent into the runtime around it. The model picks the next move. The runtime decides what happens to that move when the worker dies mid-call.
TL;DR: Long-running agents need a runtime outside the model: session logs, harness logic, sandbox isolation, checkpoints, traces, policy checks, secrets, and cost limits. Queue + worker + checkpoint DB is the default when you own the stack; hosted harnesses fit when their vendor constraints are acceptable. Pick by run length first, then recovery semantics, replay needs, sandbox isolation, and operational ownership.
What changes when sessions outlast processes
A year ago, "deploying an agent" meant wrapping a chat endpoint in a container and pointing a load balancer at it. The work was indistinguishable from any other Python web service. That stopped being true the moment agents started running for hours instead of seconds.
The OpenAI Codex team put a number on the new baseline in their harness engineering write-up:
"We regularly see single Codex runs work on a single task for upwards of six hours (often while the humans are sleeping)."
Anthropic's engineering team put the structural problem just as clearly in Effective harnesses for long-running agents:
"The core challenge of long-running agents is that they must work in discrete sessions, and each new session begins with no memory of what came before."
Take that sentence seriously and the whole runtime stack changes shape. Three consequences follow from "discrete sessions with no shared memory," and each one forces a specific piece of infrastructure into the design.
If sessions are discrete, the session has to live outside the process. In-memory state is gone the moment the worker exits, so the session has to be written to a durable store the next worker can read. If a session can resume after a crash, the durable record has to contain everything that happened so far, not just the final answer. That lets the next worker pick up at the right point instead of replaying the run from scratch. If the model fills its context window before the work is done, something has to checkpoint the progress, tear the current session down, and start a fresh one that loads the checkpoint instead of replaying the full history. In Anthropic's phrasing, the harness becomes "cattle": disposable, restartable, identical instances. The state lives somewhere else.
The model decides what to do next. The runtime decides whether the move is allowed, where it executes, how it is recorded, and how the run resumes after a crash. The rest of this post is about that runtime.
The five primitives every production agent needs
Anthropic's Scaling Managed Agents write-up gave the industry a working vocabulary, and most of the field has converged on it. Five components do most of the runtime work. The harness drives the agent forward step by step; the session records what it did; the sandbox is where commands execute; the checkpoint is what the next worker reads on resume; the trace is what you read days later when you need to understand what went wrong. Each component is swappable as long as the responsibility stays intact.
Session. An append-only log of everything that happened: model calls, tool calls, results, errors, approvals. Recovery is wake(sessionId) → getSession(id) → resume from last event. In LangGraph this is a thread_id plus a Postgres checkpointer (see LangGraph persistence). The OpenAI Agents SDK ships ten built-in session backends including SQLiteSession, RedisSession, SQLAlchemySession, MongoDBSession, and EncryptedSession (see the Sessions docs).
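To make the wake → getSession → resume flow concrete, here is a minimal sketch of an append-only session log on SQLite. The table schema, the `SessionLog` class, and the `wake` helper are all hypothetical names chosen for illustration; a production system would more likely use Postgres or one of the SDK-provided backends.

```python
import json
import sqlite3
import time


class SessionLog:
    """Append-only event log keyed by session ID (hypothetical schema)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            "session_id TEXT NOT NULL, seq INTEGER NOT NULL, ts REAL NOT NULL, "
            "kind TEXT NOT NULL, payload TEXT NOT NULL, "
            "PRIMARY KEY (session_id, seq))"
        )

    def append(self, session_id, kind, payload):
        """Write one event; existing rows are never updated or deleted."""
        with self.db:  # commits on success, rolls back on error
            (last,) = self.db.execute(
                "SELECT COALESCE(MAX(seq), 0) FROM events WHERE session_id = ?",
                (session_id,),
            ).fetchone()
            self.db.execute(
                "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
                (session_id, last + 1, time.time(), kind, json.dumps(payload)),
            )
        return last + 1

    def get_session(self, session_id):
        """Return every event for the session, oldest first."""
        rows = self.db.execute(
            "SELECT seq, kind, payload FROM events WHERE session_id = ? ORDER BY seq",
            (session_id,),
        ).fetchall()
        return [{"seq": s, "kind": k, "payload": json.loads(p)} for s, k, p in rows]


def wake(log, session_id):
    """Find the resume point: the sequence number after the last recorded event."""
    events = log.get_session(session_id)
    next_seq = events[-1]["seq"] + 1 if events else 1
    return events, next_seq


log = SessionLog()
log.append("s-1", "model_call", {"prompt": "plan the task"})
log.append("s-1", "tool_call", {"name": "read_file", "args": {"path": "README.md"}})
events, next_seq = wake(log, "s-1")  # what a replacement worker would call after a crash
```

The point of the sketch is that `wake` needs nothing from the dead worker: the log alone is enough to decide where to continue.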
Harness. The orchestration loop. It calls the model, parses tool calls, executes them, writes results back into the session, and applies retry rules. Anthropic puts it bluntly:
"every component in a harness encodes an assumption about what the model can't do on its own."
OpenAI's Codex team calls the discipline harness engineering: writing software still takes discipline, but more of it now goes into the scaffolding than into the code itself. LangGraph's CompiledStateGraph, Deep Agents' create_deep_agent, and Claude Code itself are all harnesses in this sense.
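Stripped of any framework, the loop described above might look like the sketch below. It is not how LangGraph, Deep Agents, or Claude Code implement their harnesses; the `Harness` class, the dict-shaped "moves," and the scripted toy model are assumptions made so the example runs without an API key.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Harness:
    model: Callable            # hypothetical: takes the event list, returns a dict "move"
    tools: dict                # tool name -> Python callable
    max_steps: int = 20        # iteration cap so a confused model cannot loop forever
    max_retries: int = 2       # extra attempts per failing tool call
    session: list = field(default_factory=list)  # stand-in for a durable session log

    def run(self, task):
        self.session.append({"kind": "task", "content": task})
        for _ in range(self.max_steps):
            move = self.model(self.session)
            if move["type"] == "final":
                self.session.append({"kind": "final", "content": move["content"]})
                return move["content"]
            self.session.append(
                {"kind": "tool_call", "name": move["name"], "args": move["args"]}
            )
            self.session.append(self._execute(move))
        raise RuntimeError("step budget exhausted before the model finished")

    def _execute(self, move):
        error = None
        for _ in range(1 + self.max_retries):
            try:
                result = self.tools[move["name"]](**move["args"])
                return {"kind": "tool_result", "name": move["name"], "result": result}
            except Exception as exc:  # recorded so the model can react on the next step
                error = repr(exc)
        return {"kind": "tool_error", "name": move["name"], "error": error}


def scripted_model(session):
    """Toy model: asks for one addition, then reports the result."""
    results = [e for e in session if e["kind"] == "tool_result"]
    if not results:
        return {"type": "tool", "name": "add", "args": {"a": 2, "b": 3}}
    return {"type": "final", "content": f"sum is {results[-1]['result']}"}


harness = Harness(model=scripted_model, tools={"add": lambda a, b: a + b})
answer = harness.run("add two numbers")
```

Every default in that class (`max_steps`, `max_retries`) is one of the "assumptions about what the model can't do" from the Anthropic quote, which is why they are worth revisiting when the model changes.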
Sandbox. The isolated execution environment where commands actually run. The OpenAI Agents SDK sandbox concepts page draws the line cleanly:
"The outer runtime still owns approvals, tracing, handoffs, and resume bookkeeping. The sandbox session owns commands, file changes, and environment isolation."
Sandboxes differ in how long they live and what they remember between runs. The simplest shape is fresh ephemeral: spin one up for a single task, destroy it when the task ends, and pay the cold-start cost on every run. Persistent paused sandboxes stay alive between runs in a paused state; the filesystem and a memory snapshot stick around, so the next resume is sub-second instead of a full boot. Snapshot or fork goes a step further: each task branches a copy-on-write image from a parent that already has dependencies installed and caches warmed, so the expensive setup work happens once and N tasks share the base image. Per-worktree sandboxes give each task its own workspace and its own observability stack: separate logs, metrics, and traces. That lets you debug one agent's run without it bleeding into another's. Concrete cold-start and persistence numbers per provider are in the table further down this section.
Checkpoint. Resumable state. LangGraph's PostgresSaver writes a StateSnapshot at every super-step boundary, with per-task writes to checkpoint_writes so successful node outputs are not recomputed when a sibling fails. The snapshot is a JSON-serializable dict (v, ts, id, channel_values, channel_versions, versions_seen, pending_sends) documented on the langgraph-checkpoint-postgres PyPI page and the LangGraph checkpoints reference.
Trace. The replay and debug surface. Every model call, tool call, and sub-agent step becomes a span with timing, inputs, outputs, token counts, and cost. When a six-hour run fails, the trace is what you read to figure out what went wrong. The terminal output from the run is long gone by then. OpenTelemetry's GenAI semantic conventions standardize the attribute names (which model, which provider, how many tokens, which conversation, which workflow), so the same trace renders cleanly in Tempo, Jaeger, Honeycomb, or LangSmith without re-instrumenting.
Policy and secrets are separate runtime boundaries
Two boundaries cut across all five primitives and are easier to think about as separate concerns. They're the runtime version of the security argument from Part 4.
Policy engine
A permission check runs before every tool call and decides whether it goes through. Two patterns are common in production. Deep Agents lets each subagent declare which file paths it can read or write, and the middleware blocks anything outside that declaration. Anthropic Managed Agents routes every tool call through an MCP proxy, so the proxy enforces permissions instead of the agent code. When a sensitive call needs human approval, LangGraph's interrupt() and Deep Agents' approval hook pause the graph until a person says yes.
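As a rough illustration of a path-scoped permission check, here is a framework-free sketch. It is not the Deep Agents middleware or the Managed Agents proxy; the `Policy` dataclass, the tool names `read_file` and `write_file`, and the three-valued result are assumptions for the example.

```python
import posixpath
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Policy:
    read_roots: tuple = ()
    write_roots: tuple = ()
    approval_tools: frozenset = frozenset()  # tools that always pause for a human


def _inside(path, roots):
    # Normalise first so "/workspace/out/../../etc/passwd" cannot pass a prefix check.
    clean = PurePosixPath(posixpath.normpath(path))
    return any(
        clean == PurePosixPath(root) or PurePosixPath(root) in clean.parents
        for root in roots
    )


def check(policy, tool, args):
    """Return 'allow', 'deny', or 'needs_approval' for one proposed tool call."""
    if tool in policy.approval_tools:
        return "needs_approval"
    if tool == "read_file":
        return "allow" if _inside(args["path"], policy.read_roots) else "deny"
    if tool == "write_file":
        return "allow" if _inside(args["path"], policy.write_roots) else "deny"
    return "deny"  # default-deny: tools the policy does not know are blocked


policy = Policy(
    read_roots=("/workspace",),
    write_roots=("/workspace/out",),
    approval_tools=frozenset({"send_email"}),
)
decision = check(policy, "write_file", {"path": "/workspace/out/report.md"})
```

The sketch only normalises the path string; it does not resolve symlinks, so a real sandbox would still need filesystem-level enforcement behind it.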
Secret broker
The model should not see long-lived secrets, and the sandbox usually should not either. The Managed Agents pattern is the one to copy:
"For Git, we use each repository's access token to clone the repo during sandbox initialization and wire it into the local git remote. Git push and pull work from inside the sandbox without the agent ever handling the token itself. For custom tools, we support MCP and store OAuth tokens in a secure vault. Claude calls MCP tools via a dedicated proxy; this proxy takes in a token associated with the session. … The harness is never made aware of any credentials."
In the market-analyst-agent reference stack, the MCP sidecar reads OAuth tokens from a Docker secret (in production, HashiCorp Vault) and exposes only the tool surface to the LangGraph worker. The worker never sees the token. git push works. cat ~/.ssh/id_rsa does not.
A practical sanity check on your own stack: write down every component you run and which of the five primitives it implements. Postgres might cover session and checkpoint. The worker container is the harness. A hosted-sandbox service such as Daytona, Modal, or E2B is the sandbox. Tempo or LangSmith is the trace. If you find that two primitives live in the same process, one crash takes both down. If two share a single credential, one leak takes both down. Both are easy to introduce by accident when you're moving fast: a worker that also writes traces, a sidecar token that also unlocks the checkpoint DB.
Failure modes the runtime has to handle
The runtime exists for the things the model cannot do on its own: manage retries, remember what it did three hours ago, isolate one run's filesystem from another's, stop itself before spending the budget. By now we've named seven pieces of it: harness, session log, sandbox, checkpointer, tracer, policy engine, secret broker. Once a single agent run crosses about 30 minutes of wall time, a recognizable set of failures starts showing up. Most of them have nothing to do with reasoning quality; they are problems with state, retries, sandboxes, and budgets.
The failures fall into four loose groups:
- Quality-of-output failures: the agent declares victory before the work is actually done, forgets what it did across a context-window reset, or trusts its own self-evaluation and ships broken output.
- Cost-control failures: the agent gets stuck in a retry loop, or spends through a token or tool-call budget without producing anything useful.
- State and crash failures: workspaces drift because one run touches files another run owns, tool calls fire more than once because retries replay them, or work is lost when a worker dies between events.
- Context-window failures: the model summarises and quits early because it thinks it is running out of room, even when the window still has headroom.
The table below maps each failure to the mitigation pattern, the runtime hook where the mitigation lives, and whether the mitigation is still needed on the current model generation. Some are no longer needed. The older your harness is, the more obsolete mitigations it likely contains.

| Failure mode | Mitigation | Still needed? | Runtime hook |
|---|---|---|---|
| Premature completion: agent declares victory early | Generator/evaluator split: a fresh-context evaluator reads files (not chat) and votes "done" or "not done." Default-FAIL on every acceptance check. | Yes; Anthropic's own cwc-long-running-agents quick-start still ships an evaluator subagent. | Sub-agent without Write/Edit tools and its own context window |
| Feature amnesia across context windows | Initializer agent writes claude-progress.txt, feature-list.json, init.sh. Coding agent reads them on every cold boot. | Yes; compaction alone does not close the gap. | Boot hook before the first model call of each session |
| Duplicated work after session reset | Append-only event log plus a structured handoff file. Each new session starts with pwd → read PROGRESS.md → review tests. | Yes | LangGraph PostgresSaver checkpoint plus progress.md artifact |
| Context anxiety: model summarises and quits early | (a) Enable the 1M-token beta but cap effective usage at 200k (Cognition's Sonnet 4.5 fix). (b) Tear the session down and rebuild from a handoff. | No on Opus 4.5; Anthropic reports "the behavior was gone" and the resets "had become dead weight." Yes on Sonnet 4.5 and GPT-5/Codex. | Outer driver caps session length, starts the next, resumes from checkpoint |
| Self-evaluation optimism: model marks its work passing | Separate evaluator plus Playwright/MCP grounding in the real DOM, not screenshots. Anthropic's harness design frontend rubric penalizes "AI-style" defaults. | Yes | Evaluator runs in a separate sandbox session with no write tools |
| Stuck loops and retry storms | Iteration cap per turn, exponential backoff, circuit breaker on tool error rate. Hard budget on tool calls. | Yes | Decorator on the tool-execution node; RetryPolicy on Temporal Activities (see Temporal OpenAI Agents SDK contrib) |
| Workspace drift: agent edits unrelated files | Git commits as checkpoints, file-permission middleware, per-session workspace mount. Deep Agents middleware lets you declare path read/write. | Yes | LangGraph file-permission middleware or Daytona/Runloop per-task fork |
| Runaway token or tool cost | Per-run token budget, per-tool budget, kill switch tied to a Prometheus counter. | Yes; a 24-hour Opus session with bad budgeting can spend a week's API budget in an afternoon (Addy Osmani on long-running agents). | Cost-attribution span attributes plus Alertmanager rule |
| Non-idempotent tool calls | Idempotency key per tool call. In durable workflows, retries can fire the same tool call more than once, so a deduplication key blocks the duplicate. | Yes | Temporal Activity with start_to_close_timeout and idempotency key |
| Lost work after process or sandbox crash | Durable session log outside the process; checkpoint after every super-step. wake(sessionId) → getSession(id) → resume. | Yes | PostgresSaver at every super-step, or wrap as a Temporal Workflow |
Two ideas show up across every row. Anthropic, on harness staleness in Harness design for long-running application development:
"Every component in a harness encodes an assumption about what the model can't do on its own, and those assumptions are worth stress testing, both because they may be incorrect, and because they can quickly go stale as models improve."
Vercel, on the related problem of too many tools encoding too many assumptions, in We removed 80% of our agent's tools:
"We deleted most of it and stripped the agent down to a single tool: execute arbitrary bash commands. We call this a file system agent."
Vercel's reported result on a representative query: success rate went from 80% to 100%, and the worst case dropped from 724 s / 100 steps / 145,463 tokens (failed) to 141 s / 19 steps / 67,483 tokens (succeeded). The lesson is not "delete your tools." It's that every primitive in your runtime, including the tool surface, has a half-life. Re-test the assumption when the model changes.
Cognition saw the same moving target on session length with Sonnet 4.5. In Rebuilding Devin for Claude Sonnet 4.5 they describe a model that proactively writes SUMMARY.md / CHANGELOG.md as it senses context exhaustion but underestimates how many tokens it has left. Their fix is to enable the 1M-token beta and cap usage at 200k so the model still believes it has headroom. That mitigation will, eventually, become dead weight too.
OpenAI's harness team has the one-line version of the discipline: "Humans steer. Agents execute." When something fails, the question is not "try harder." It is "what capability is missing, and how do we make it both legible and enforceable for the agent?"
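The "stuck loops" and "runaway cost" rows in the table above both come down to counting attempts and refusing to continue past a cap. A minimal sketch, assuming a hypothetical `RunBudget` object and a retry wrapper that charges every attempt against it (the `sleep` parameter is injectable so the example does not actually wait):

```python
import time


class BudgetExceeded(RuntimeError):
    pass


class RunBudget:
    """Hard caps for one run; charge() raises once a cap is crossed."""

    def __init__(self, max_tokens, max_tool_calls):
        self.max_tokens = max_tokens
        self.max_tool_calls = max_tool_calls
        self.tokens = 0
        self.tool_calls = 0

    def charge(self, tokens=0, tool_calls=0):
        self.tokens += tokens
        self.tool_calls += tool_calls
        if self.tokens > self.max_tokens or self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded(f"tokens={self.tokens} tool_calls={self.tool_calls}")


def call_with_backoff(fn, budget, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry fn with exponential backoff; every attempt counts against the budget."""
    for attempt in range(max_attempts):
        budget.charge(tool_calls=1)  # charged before the call, so retries are never free
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempt cap reached: surface the error instead of looping
            sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...


delays = []
attempts = {"n": 0}


def flaky():
    """Toy tool that times out twice, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("upstream timed out")
    return "ok"


budget = RunBudget(max_tokens=200_000, max_tool_calls=10)
result = call_with_backoff(flaky, budget, sleep=delays.append)  # record delays, do not sleep
```

In a real stack the counters would also be exported as metrics so an alert rule can act as the kill switch; the in-process exception here is only the last line of defense.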
The healthy run lifecycle
A well-behaved run is boring. It is a chain of small recoverable steps, each one writing its result to durable storage before the next one starts, rather than one giant request that has to succeed end-to-end. Splitting a run into steps that way is what lets it survive crashes: when something fails, only the step that was in flight is lost, and the next worker resumes from the last completed step instead of starting over.
- Boot from either a fresh session or a resumed one. On resume, mount the workspace from its last known state, read any progress files the previous attempt left behind (PROGRESS.md, feature-list.json), and load the last checkpoint from the database. This is where the harness hands the agent everything the previous worker had in memory before it died.
- Plan before any tool calls fire. Write down what "done" looks like, how much the run is allowed to spend, which tools the agent can invoke, and what should stop the run early. These plan values become runtime checks; without them, execution has nothing to push back against.
- Execute one tool call at a time. The policy layer decides whether to allow the call. The harness runs it, captures the result, and writes one event into the session log. One step, one event. A crash between events is recoverable because the log, not the worker's memory, is the source of truth.
- Checkpoint at super-step boundaries, or after every event in a simpler harness. Persist the graph state, the workspace diff, and references to any artifacts produced. This checkpoint is what the boot step reads on the next resume. If the checkpoint is missing or stale, recovery degrades into replaying the full session log from scratch, which is much slower.
- Evaluate against the artifacts when the agent thinks it is done: tests, a fresh-context reviewer, schema validation, browser checks. If the check passes, the run exits successfully. If it fails, the run resumes from the last clean checkpoint with the failure message added to context and tries again.
There is no step in that list that requires the agent to remember anything between runs. The state lives in the session and the checkpoint, and the agent reads it back on each resume.
Side-effecting tools need idempotency. Any tool with side effects needs an idempotency key derived from the session ID and tool-call ID, stored before the side effect fires. send_email(session_id, tool_call_id, message_hash). create_pr(session_id, tool_call_id, branch_name). charge_customer(session_id, tool_call_id, invoice_id). At-least-once execution is the default in queues and workflow engines. If repeating a tool call can cause real harm, the tool is not ready for agents.
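One way to sketch the "key stored before the side effect fires" rule is shown below. The key derivation, the `IdempotentExecutor` class, and the in-memory dict (standing in for a durable table with a unique key column) are all hypothetical; the part worth noticing is what happens when a retry finds a key that was stored but never completed.

```python
import hashlib


def idempotency_key(session_id, tool_call_id, *parts):
    """Derive a stable key from the session ID, tool-call ID, and call-specific parts."""
    raw = "\x1f".join([session_id, tool_call_id, *map(str, parts)])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class IdempotentExecutor:
    def __init__(self):
        self.records = {}  # stand-in for a durable table with a UNIQUE key column

    def run(self, key, side_effect):
        record = self.records.get(key)
        if record is not None:
            if record["status"] == "done":
                return record["result"]  # replayed call: reuse the stored result
            # Key exists but never completed: the side effect may or may not have fired.
            raise RuntimeError(f"call {key[:12]} needs reconciliation before retrying")
        self.records[key] = {"status": "pending", "result": None}  # stored first
        result = side_effect()
        self.records[key] = {"status": "done", "result": result}
        return result


sent = []


def send_email():
    """Toy side effect: records a send and returns a message ID."""
    sent.append("hello")
    return "msg-1"


executor = IdempotentExecutor()
key = idempotency_key("s-1", "call-7", "message-hash")
first = executor.run(key, send_email)
second = executor.run(key, send_email)  # simulated retry from the workflow engine
```

The "pending" branch refuses to guess. If a worker died between storing the key and finishing the call, firing again could double-send, so the sketch surfaces the ambiguity instead of hiding it.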
Evaluation has to run outside the producing context. The same context that produced the answer cannot reliably judge whether the answer is correct. A fresh evaluator reads files and artifacts, runs tests, lints, browser checks, or schema validation, and returns pass, fail, or needs_human. For code agents, this is another model session with read-only tools. For data and report agents, it is a deterministic validator plus a reviewer model.
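For the deterministic-validator half of that split, a sketch might look like this. The artifact name `report.json`, the required keys, and the rule that an empty report goes to a human are assumptions made for the example; the property that matters is that the function receives a workspace path and nothing from the producing conversation.

```python
import json
import tempfile
from pathlib import Path


def evaluate(workspace, required_keys=("title", "rows")):
    """Judge a run from its artifacts only; the producing chat history is never read."""
    report = Path(workspace) / "report.json"  # hypothetical artifact name
    if not report.exists():
        return "fail", "report.json is missing"
    try:
        data = json.loads(report.read_text())
    except json.JSONDecodeError as exc:
        return "fail", f"report.json is not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return "fail", "report.json must contain a JSON object"
    missing = [key for key in required_keys if key not in data]
    if missing:
        return "fail", f"missing keys: {missing}"
    if not data["rows"]:
        return "needs_human", "report is well-formed but empty; a person should confirm"
    return "pass", "all checks passed"


with tempfile.TemporaryDirectory() as ws:
    before = evaluate(ws)  # nothing written yet, so the default is a failure
    (Path(ws) / "report.json").write_text(json.dumps({"title": "Q3", "rows": [1, 2]}))
    after = evaluate(ws)
```

Note the default: every path that is not an explicit pass returns fail or needs_human, which matches the default-FAIL stance in the failure table.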
Eleven deployment shapes and what decides between them
Once the five primitives are named, the question is which deployment shape runs them. By "shape" I mean an arrangement of those primitives: where the harness lives, where state persists, and what kind of sandbox runs the work. It is not just a single vendor pick. The chart below shows where each shape is comfortable on the run-length axis. The text after it walks through what decides between them.
If you only read one of the eleven, read shape 2: queue + worker + checkpoint DB. It is the default I recommend for most teams, the shape used by the reference repo, and the skeleton most other shapes vary on: queue β worker β durable state, with the sandbox source, harness owner, or state engine swapped out. Reading shape 2 first makes the rest faster to scan.
The chart compares shapes by run length. The matrix below compares them by ownership: where each of the five primitives physically lives. Green cells are where the shape provides the primitive; grey is where you wire it in yourself.
1. SDK inside an app server (synchronous, request scoped)
The original shape. The agent SDK runs inside a request handler. Good for sub-30-second tasks, demos, and internal tools. Bad for anything an HTTP client might disconnect from. Cloud Run's HTTP timeout maxes out at 60 minutes, and any web-tier panic kills the run. The SDK is the harness, the web process also acts as the sandbox, and state usually lives in process memory unless you explicitly push it elsewhere. Do not use this for multi-hour work.
2. Queue + worker + checkpoint DB
The default I recommend for most teams, and the production shape used in market-analyst-agent: a Python worker with a PostgreSQL checkpointer, Redis Streams (or RabbitMQ) for the inbound queue, and an MCP sidecar for tools. Good for 10-minute to multi-hour runs with idempotent steps. The local runner can bypass the queue for synchronous development, but the queue is part of the production shape once you need async submission and backpressure.
The app accepts a request, creates a session row, pushes a job, and returns a run ID. The worker pulls the job, runs the harness, writes checkpoints, streams status, and stores artifacts as it goes. Postgres survives, workers are cattle, and queue depth gives you backpressure. Spot/Preemptible compute works as long as the checkpointer finishes writing to disk before it reports success.
In this shape, the worker is the harness. The worker container plus a per-thread workspace mount is the sandbox. Postgres owns session and checkpoint state. Traces go through OpenTelemetry to whatever observability stack you run.
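The submit, pull, checkpoint, resume cycle can be sketched in memory with the standard library. This is not the market-analyst-agent code: `queue.Queue` stands in for Redis Streams or RabbitMQ, a dict stands in for the Postgres checkpoint table, and the three-step plan and the `crash_after` switch are invented to simulate a worker dying mid-run.

```python
import queue

jobs = queue.Queue()   # stand-in for Redis Streams or RabbitMQ
checkpoints = {}       # stand-in for the Postgres checkpoint table
executed = []          # records which steps actually ran, for inspection
STEPS = ["fetch", "analyze", "write_report"]  # hypothetical plan for one run


class WorkerCrashed(Exception):
    pass


def submit(run_id):
    """App side: create the session row, enqueue the job, return the run ID."""
    checkpoints.setdefault(run_id, {"done": []})
    jobs.put(run_id)
    return run_id


def work_one(crash_after=None):
    """Worker side: pull one job and run the remaining steps, checkpointing each."""
    run_id = jobs.get_nowait()
    done = list(checkpoints[run_id]["done"])  # load the last checkpoint
    for step in STEPS[len(done):]:            # skip steps a previous worker finished
        if crash_after is not None and len(done) == crash_after:
            jobs.put(run_id)                  # real queues redeliver unacknowledged jobs
            raise WorkerCrashed(run_id)
        executed.append((run_id, step))       # the actual work would happen here
        done.append(step)
        checkpoints[run_id] = {"done": list(done)}  # persist before the next step
    return run_id


submit("run-1")
try:
    work_one(crash_after=1)  # first worker dies after completing one step
except WorkerCrashed:
    pass
work_one()                   # a replacement worker resumes from the checkpoint
```

Each step runs exactly once across the two workers because the second one starts from the stored checkpoint rather than from the top of the plan.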
3. Durable workflow engine (Temporal-style)
Agent orchestration code runs inside a Temporal Workflow; model calls and tool calls run as Activities. Workflow state lives in an event-history log backed by Cassandra, MySQL, or Postgres, so state replays cleanly across deploys. The public-preview Temporal × OpenAI Agents SDK integration ships an OpenAIAgentsPlugin and an activity_as_tool helper, and the agentic sandboxes write-up describes forking a running agent onto a different sandbox provider mid-conversation. Idle workflows consume zero compute. The caveats are real: streaming and voice agents are unsupported in the current integration, and LocalShellTool and ComputerTool are disabled because they don't fit a distributed model.
Use this shape when the run has real waiting points: human approvals, external callbacks, long sleeps, retries with business rules, deploy windows. A human approval becomes a durable sleep that consumes no compute, not a polling loop that does.
The Workflow code is the harness. The sandbox usually lives outside Temporal and is called from Activities. Session and checkpoint state collapse into Temporal's event-history log, while trace visibility comes from Temporal UI plus OpenTelemetry spans on each Activity.
4. Sandbox provider per session
A newer shape. Every agent run gets its own microVM or container from a sandbox-as-a-service provider. The harness lives somewhere durable; the sandbox is the disposable execution environment.
| Provider | Isolation | Max session | Concurrency | Persistence | Cold start |
|---|---|---|---|---|---|
| E2B | Firecracker microVM | 1 h Hobby / 24 h Pro | 20 / 100 (up to 1,100 add-on) | Pause/resume, ~4 s/GiB pause, ~1 s resume (public beta) | ~150 ms p50 |
| Vercel Sandbox | Firecracker microVM | 45 min Hobby / 5 h Pro/Ent | 10 / 2,000 | Disposable | n/a |
| Daytona | Docker (optional Kata) | configurable auto-stop/archive | tier-based | Stop → Archive → Delete; fork supported | ~90 ms (some configs 27 ms) |
| Modal Sandboxes | gVisor | typical 1–15 min lifecycle | high | Volumes for persistence; memory snapshot in preview | "about one second" per Modal docs |
| Runloop Devboxes | microVM (custom hypervisor) | suspend/resume; snapshot+branch | "more than 30,000 concurrent instances" per the AWS Marketplace listing | Snapshot + branch from disk state | sub-1 s |
Sources: the E2B vs Daytona comparison, Daytona's own sandboxes documentation and fork/snapshot changelog, Modal's sandboxes guide and cold-start guide, the Runloop AWS Marketplace listing, and Vercel Sandbox pricing. Daytona's docs add a useful note on fork semantics: "the new sandbox is fully independent … Daytona tracks the parent-child relationship in a fork tree." Every forked sandbox keeps a recorded link back to the base it branched from, so you can query the lineage of any derived sandbox. OpenAI's Codex harness uses the per-worktree variant: "Codex works on a fully isolated version of that app, including its logs and metrics, which get torn down once that task is complete."
Reach for this shape when the agent runs untrusted code, browser automation, tests, or package installs. The trade-off is cost and provider coupling, both higher than running shared workers.
The provider owns the sandbox and nothing else. Harness, session, checkpoint, and trace stay on your side, usually wired as the queue + worker shape from #2.
5. Anthropic Managed Agents (hosted harness)
Launched in public beta on April 8, 2026, behind the managed-agents-2026-04-01 beta header. Managed Agents is billed at standard token rates plus $0.08 per session-hour. Billing is millisecond-granular and applies only while the session status is "running." Idle time is free; session runtime "replaces the Code Execution container-hour billing model when using Claude Managed Agents." You get a hosted session, harness, sandbox, and vault-backed MCP proxy. The brain/hands split is what wake(sessionId) is buying: the harness can be reinitialised on a new worker without losing state. Watch the cost shape carefully; a runaway retry loop on a session-hour bill adds up faster than a per-token bill.
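The runtime part of that bill is simple arithmetic, sketched below using only the $0.08 per session-hour figure and the millisecond granularity quoted above. Token charges are deliberately left out because they depend on the model and usage; the function name and the example durations are mine.

```python
from decimal import Decimal

SESSION_HOUR_RATE = Decimal("0.08")  # USD per session-hour, as quoted above
MS_PER_HOUR = Decimal(3_600_000)


def session_runtime_cost(running_ms):
    """Runtime charge only; token charges are billed separately at standard rates."""
    return (Decimal(running_ms) / MS_PER_HOUR) * SESSION_HOUR_RATE


# A six-hour wall-clock run that sat idle for two hours is billed for four running hours.
four_running_hours = session_runtime_cost(4 * 3_600_000)

# Ninety seconds of running time, to show the millisecond granularity.
ninety_seconds = session_runtime_cost(90_000)
```

Under these assumptions the runtime charge comes to $0.32 for the first case and $0.002 for the second.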
Read the caveats. The Batch API discount does not apply ("Sessions are stateful and interactive. There is no batch mode."). Managed Agents is not available through AWS Bedrock or Google Vertex AI. Multi-agent coordination and self-evaluation remain in research preview. Lock-in is high: you trade harness freedom for not running the loop yourself.
Anthropic hosts all five primitives: session, harness, sandbox, checkpoint, and trace. You hand over the runtime and get the outputs.
6. LangChain Deep Agents Deploy (managed open harness)
deepagents deploy packages a deepagents.toml into a LangSmith Deployment with durable execution, memory, multi-tenancy, human-in-the-loop, observability, sandboxed code execution, and scheduled runs. Cloud, hybrid, and self-hosted deployment modes are supported. Sandbox providers (LangSmith Sandboxes, Daytona, Modal, Runloop, or custom) are switchable via a single config value. State lives in a virtual filesystem with pluggable backends; memory is scoped to user, assistant, or both. Lock-in is lower than Managed Agents: the harness is MIT-licensed, instructions use the open AGENTS.md standard, and agents are exposed via MCP, A2A, and Agent Protocol. See LangChain's runtime-behind-production-deep-agents write-up.
All five primitives are hosted by default, but each is config-swappable. The sandbox sits behind one config value. Session and checkpoint live on a virtual filesystem with pluggable backends. Trace goes to LangSmith.
7. Google Cloud Run service or job
Cloud Run has two different runtime modes, and which one fits depends on how the agent is invoked. Services are HTTP-bound and scale to zero between requests; the harness runs as a request handler that returns when the run is done. Jobs run to completion without an HTTP entrypoint; the harness runs as a one-shot worker that exits when the task finishes. Both can host the harness, but neither holds state across runs. Sessions and checkpoints have to live in Postgres, Spanner, or a similar external store.
The hard limits are very different between the two. Cloud Run service request timeout: default 300 s, max 3,600 s (60 min). WebSockets get the same timeout. Cloud Run jobs: default 10 min per task, max 168 h (7 days); for tasks using GPUs, max 1 hour. Services scale to zero unless you enable always-on CPU; jobs do not have HTTP and do not autoscale.
Use a service for synchronous runs up to 60 minutes. Use a job for longer one-shot or async work. Cloud Run Jobs can keep a task alive for days, but they do not give you durable replay across deploys, version changes, or worker replacement. Above 7 days, do not use Cloud Run.
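The rule of thumb above reduces to a small lookup. The sketch below encodes only the limits quoted in this section (3,600 s for services, 168 h for jobs, 1 h for GPU job tasks); the function name and return labels are hypothetical, and the limits themselves should be re-checked against current Cloud Run documentation before relying on them.

```python
SERVICE_MAX_S = 3_600        # Cloud Run service request timeout ceiling (60 min)
JOB_MAX_S = 168 * 3_600      # Cloud Run job task timeout ceiling (7 days)
JOB_GPU_MAX_S = 3_600        # ceiling for job tasks that use GPUs (1 hour)


def cloud_run_mode(expected_seconds, synchronous, uses_gpu=False):
    """Map an expected run length onto the Cloud Run limits quoted above."""
    if synchronous and expected_seconds <= SERVICE_MAX_S:
        return "service"
    job_cap = JOB_GPU_MAX_S if uses_gpu else JOB_MAX_S
    if expected_seconds <= job_cap:
        return "job"  # longer or async work: run to completion without HTTP
    return "not-cloud-run"


short_sync = cloud_run_mode(120, synchronous=True)
six_hours = cloud_run_mode(6 * 3_600, synchronous=True)
eight_days = cloud_run_mode(8 * 24 * 3_600, synchronous=False)
```

A "job" answer still says nothing about replay across deploys; that has to come from the external checkpoint store.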
Cloud Run hosts the harness. Session, checkpoint, sandbox, and trace are external services you wire in, typically Postgres or Spanner for session/checkpoint, the container itself as the sandbox, and Cloud Logging plus OpenTelemetry for trace.
8. AWS Lambda (why it is the wrong tool)
Lambda's maximum function timeout is 900 s (15 minutes), hard. API Gateway adds a separate 29 s cap on top of that. An agent harness whose minimum useful run is "minutes to hours" cannot survive on Lambda without an external state store and a re-invocation strategy that essentially rebuilds the queue + worker shape from scratch. Use Lambda for individual tool calls, such as file fetches or S3 uploads, invoked by a longer-running orchestrator. Do not put the orchestrator there.
At most, Lambda holds one tool call inside its 15-minute cap. The harness, session, checkpoint, sandbox, and trace all have to live somewhere else.
9. AWS ECS / Fargate task per run
No documented hard cap on task runtime (unlike Lambda). Per Fargate throttling quotas, launches are rate-limited: a burst of 100, refilling at 20 per second, with on-demand and spot tracked on separate 20/sec budgets. ECS service quotas: services with AWS Cloud Map service discovery cap at 1,000 tasks per service; the theoretical ceiling is 5,000 EC2 instances per cluster. Fargate requires awsvpc mode, so every task gets its own network interface and a private IP, which is the right shape when you need VPC-internal data access. Fargate Spot is available; budget for interruption. Durability is on you: there is no Temporal-style replay built in.
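Read as a token bucket, the launch quota above gives a quick lower bound on how long a fan-out takes. The sketch assumes a full bucket at the start and uses the burst of 100 and refill of 20 per second quoted above; it ignores image pulls, ENI attachment, and any other quota, so real launches will be slower.

```python
def seconds_to_launch(n_tasks, burst=100, refill_per_s=20):
    """Lower bound on wall time to start n_tasks from a full token bucket."""
    return max(0, n_tasks - burst) / refill_per_s


small_batch = seconds_to_launch(80)    # fits inside the burst allowance
large_batch = seconds_to_launch(500)   # 400 tasks wait on the 20/s refill
```

So 80 tasks fit inside the burst, while 500 tasks take at least 20 seconds to get through the launch rate limit, which is worth knowing before designing a one-task-per-run fan-out.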
Fargate hosts the harness and gives each run its own sandbox: one task per run. Session, checkpoint, and trace go to external services. RDS or DynamoDB plus CloudWatch/X-Ray are the usual picks.
10. Kubernetes Job or namespace per session
Good when you already operate Kubernetes and want sandbox-per-session with cluster-wide controls. Bad when you need sub-second startup, because pulling the container image and initializing the pod takes too long on a cold start. The pattern is one Job per agent run, with activeDeadlineSeconds, a PersistentVolumeClaim for the workspace, and a sidecar for the MCP server. Crash recovery is yours to build. Adopting Kubernetes just to host agents is expensive in configuration overhead and operational burden. Only worth it if you already run K8s for other reasons.
K8s hosts the harness and the sandbox, usually as one Job per run and sometimes with a dedicated namespace for stronger isolation. Session and checkpoint live in an external DB or on a PersistentVolumeClaim. Trace flows into whatever in-cluster observability stack you already run.
11. Local Docker Compose (dev only)
The reference for the next section. The point of this shape is that it mirrors the production topology one-for-one (same primitives, same network shape) while running on a single box. Read the "not production-safe" list at the end of the next section before you ship anything that looks like this.
Compose mirrors shape #2 on a single host. Postgres holds session and checkpoint. The worker container is the harness. The workspace mount is the shared sandbox. The optional OpenTelemetry stack is the trace.
Reference stack: Docker Compose
The reference topology, used in slavadubrov/market-analyst-agent, is a LangGraph worker, a Postgres checkpointer, Qdrant for retrieval, an MCP sidecar, a Redis queue for async production-like runs, and an optional Prometheus / Grafana / Loki / Tempo / OTel observability stack. In local compose, Redis is optional only because the synchronous runner can call the worker directly. docker compose up brings the whole topology up locally.
The one piece worth showing inline is the canonical LangGraph wiring. It is the smallest concrete example of the checkpoint primitive:
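```python
import os

from langgraph.checkpoint.postgres import PostgresSaver

# Python does not expand ${VAR} inside string literals, so read the password explicitly.
DB_URI = f"postgresql://agent:{os.environ['POSTGRES_PASSWORD']}@postgres:5432/agent"

with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    checkpointer.setup()  # creates tables on first run
    # `builder` is the StateGraph builder defined elsewhere in the repo.
    graph = builder.compile(checkpointer=checkpointer)
```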
Observability that survives the run
Short request handlers are easy to debug: when something fails, you read the response and the live log. Long-running agents do not have that luxury. By the time a six-hour run fails, the interesting event happened five hours ago, the live terminal output is gone, and the worker that produced it has been replaced. Nobody is going to reconstruct the run from memory. So you debug from durable artifacts that were written while the run was still alive.
Production stacks tend to cover four kinds of artifact, in two groups. Two of them you read after the run is over, for postmortems and replay: a queryable event log of every step, and OpenTelemetry traces of where time and tokens went. Two of them you read during the run, to watch it as it happens: a live tail of what the agent is producing in the workspace, and a per-worktree observability stack that the agent itself can query while it is still running.
Structured event log (read after the run)
Every model call, tool call, result, error, and approval written to durable storage, keyed by session ID and timestamp. Once the run ends, you query it like a normal database table. Addy Osmani sets the bar plainly in Long-running Agents: "If you can't reconstruct what the agent did in the last 24 hours from durable storage, what you have is a long-running shell script that happens to call an LLM, not a long-running agent."
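A minimal sketch of such a log, using SQLite from the standard library as a stand-in for whatever durable store you actually run. The table name, column names, and helper functions are assumptions for illustration, not taken from the reference repo.

```python
import json
import sqlite3
import time


def open_event_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) an append-only event log table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS events (
               id         INTEGER PRIMARY KEY AUTOINCREMENT,
               session_id TEXT NOT NULL,
               ts         REAL NOT NULL,
               kind       TEXT NOT NULL,  -- model_call, tool_call, result, error, approval
               payload    TEXT NOT NULL   -- JSON-encoded event body
           )"""
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS events_by_session ON events (session_id, ts)"
    )
    return conn


def append_event(conn, session_id, kind, payload, ts=None):
    """Write one event and commit immediately so it survives a worker crash."""
    conn.execute(
        "INSERT INTO events (session_id, ts, kind, payload) VALUES (?, ?, ?, ?)",
        (session_id, time.time() if ts is None else ts, kind, json.dumps(payload)),
    )
    conn.commit()


def events_for(conn, session_id, since_ts=0.0):
    """Return (ts, kind, payload) tuples for one session, oldest first."""
    rows = conn.execute(
        "SELECT ts, kind, payload FROM events "
        "WHERE session_id = ? AND ts >= ? ORDER BY ts, id",
        (session_id, since_ts),
    )
    return [(ts, kind, json.loads(payload)) for ts, kind, payload in rows]
```

Committing after every event costs some throughput, but it means the log is complete up to the last step the worker finished, which is the property a postmortem depends on.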
OpenTelemetry GenAI traces (read after the run)
The same kind of step-by-step data, but emitted as spans using the standard attributes from the gen_ai.* semantic conventions: model name, provider, input and output token counts, conversation ID, workflow name. The conventions are in Development status as of v1.36.0, so attribute names may still change. Provider-specific fields live in subnamespaces (anthropic.*, openai.*) keyed off gen_ai.provider.name. The reason to use the standard is portability: the same trace renders cleanly in Tempo, Jaeger, Honeycomb, or LangSmith without re-instrumenting the code each time you switch backends.
Tool-call timeline plus workspace diffs (read during the run)
The fastest way to know what an agent is doing right now is to tail what it is producing in the workspace, not to grep through a session log. Anthropic's Harness Primitives for Long-Running Claude Agents quick-start ships two hooks for this: watch -n 5 'git log --oneline -8' shows the latest commits the agent has made, and watch -n 5 'find screenshots -name "*.png" | tail -5' shows the latest screenshots it has taken. Two terminal panes refreshing every five seconds is enough to tell whether a run is making real progress or spinning.
Ephemeral stack per worktree (read by the agent itself, during the run)
Per OpenAI's harness post: "Logs, metrics, and traces are exposed to Codex via a local observability stack that's ephemeral for any given worktree." Each agent worktree gets its own short-lived Loki + Prometheus + Tempo, scoped to that run alone. The agent queries it while it works. That is what lets a prompt like "no span in these four user journeys exceeds two seconds" become something the agent can verify directly, rather than something it has to guess at.
(The fresh-context evaluator from the failure-modes table reads these artifacts to decide "done." It belongs to evaluation, not observability; see § healthy run lifecycle. It depends on every surface above.)
A minimal self-hosted observability stack
For something like market-analyst-agent:
- OpenTelemetry Collector with the GenAI processor and an attribute filter on gen_ai.*.
- Tempo (or Jaeger) for traces, keyed by gen_ai.conversation.id / thread_id.
- Loki for structured event-log entries.
- Prometheus for gen_ai.client.token.usage, gen_ai.client.operation.duration, gen_ai.server.time_to_first_token (see the GenAI metrics conventions).
- Grafana dashboards keyed on gen_ai.agent.name and gen_ai.request.model.
Hosted alternatives (pick one, not three):
- LangSmith: native LangGraph integration; also the deployment target for Deep Agents Deploy.
- Braintrust: strongest fit if eval-first regression suites are the priority.
- Arize Phoenix: OSS, OTLP-native, pairs with OpenInference instrumentation.
- OpenAI's tracing dashboard: automatic when you use the OpenAI Agents SDK or its Temporal integration.
- Anthropic's Claude tracing: for sessions running inside Managed Agents.
Instrument the LangGraph node
from opentelemetry import trace

tracer = trace.get_tracer("market-analyst-agent")

# In the LangGraph node, around the model call.
# `thread_id` comes from the node's config; `usage` comes from the model response.
with tracer.start_as_current_span("chat claude-sonnet-4-5") as span:
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.provider.name", "anthropic")
    span.set_attribute("gen_ai.request.model", "claude-sonnet-4-5")
    span.set_attribute("gen_ai.conversation.id", thread_id)
    span.set_attribute("gen_ai.agent.name", "market-analyst")
    span.set_attribute("gen_ai.workflow.name", "research_then_write")
    # ... make the model call here, then record what came back ...
    span.set_attribute("gen_ai.response.model", "claude-sonnet-4-5")
    span.set_attribute("gen_ai.usage.input_tokens", usage.input_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", usage.output_tokens)
Attribute names taken verbatim from the OpenTelemetry GenAI semantic conventions registry.
Three queries you actually need
# LogQL (Loki): output-token rate per agent (tokens/s), averaged over 1h
sum by (gen_ai_agent_name) (
  rate({service_name="market-analyst-agent"} | json | unwrap gen_ai_usage_output_tokens [1h])
)
# PromQL: p95 model latency per model
histogram_quantile(0.95,
  sum by (le, gen_ai_request_model) (
    rate(gen_ai_client_operation_duration_bucket[5m])
  )
)
# TraceQL: tool calls that ran longer than 30 s
{ span.gen_ai.operation.name = "execute_tool" && duration > 30s }
The debug bundle pattern
When a run fails, the worker should drop /workspaces/${THREAD_ID}/_debug/ containing the artifacts you would ask for in a postmortem:
- session.jsonl: full event-log dump from the PostgresSaver (checkpointer.list({"configurable": {"thread_id": ...}})).
- last_state.json: StateSnapshot.values from the last successful super-step.
- trace.json: OTLP-exported spans for the run.
- tool_calls.csv: (ts, tool, input_hash, latency_ms, status, error).
- workspace.tar.zst: the workspace directory plus git diff against the initializer commit.
- screenshots/*.png: what the agent saw.
- PROGRESS.md, feature-list.json, and any other agent-authored progress files.
- env.txt: image tags, model version, harness commit SHA.
This bundle is what a human (or another agent) needs to figure out what happened. Without it, every failed run is a guess. "The agent got stuck" is vague. A useful failure report looks more like this: session s_123 spent 71 percent of tokens in a three-command retry loop after npm install failed.
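The text-based parts of the bundle are cheap to produce. The sketch below writes three of them under `<root>/<thread_id>/_debug/`. The `write_debug_bundle` function and its arguments are hypothetical; it assumes the worker already holds the events, tool-call records, and environment metadata in memory, and it leaves out the trace export, workspace archive, and screenshots.

```python
import csv
import json
from pathlib import Path

TOOL_CALL_FIELDS = ["ts", "tool", "input_hash", "latency_ms", "status", "error"]


def write_debug_bundle(workspaces_root, thread_id, events, tool_calls, env):
    """Write the small, text-based bundle files; return the bundle directory."""
    bundle = Path(workspaces_root) / thread_id / "_debug"
    bundle.mkdir(parents=True, exist_ok=True)

    # session.jsonl: one JSON object per line, in the order events happened.
    with open(bundle / "session.jsonl", "w", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")

    # tool_calls.csv: fixed column order so postmortem scripts can rely on it.
    with open(bundle / "tool_calls.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=TOOL_CALL_FIELDS)
        writer.writeheader()
        writer.writerows(tool_calls)

    # env.txt: image tags, model version, harness commit SHA as key=value lines.
    lines = [f"{key}={value}" for key, value in sorted(env.items())]
    (bundle / "env.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return bundle
```

Calling this from the worker's failure handler, before the sandbox is torn down, keeps the bundle next to the workspace it describes.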
Picking the right shape: a decision guide
Most of the comparison above collapses into a handful of decisions.
Start with run length
Use run length as the first filter:
- Under 30 seconds, idempotent: request-lifecycle SDK in an app server.
- 30 s to 60 min, no crash recovery needed: queue + worker + checkpoint DB.
- 60 min to 24 h: same queue + worker, or a Cloud Run Job for one-shot work. Use a durable workflow engine if you also need versioning and replay.
- More than 24 h, must survive deploys: durable workflow engine (Temporal-style). Cloud Run Jobs can hold long work up to their task limit, but they do not provide replay semantics.
- Multi-day RL training loops: K8s Job + volume + Temporal.
Once run length is decided, the rest is platform choice.
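The run-length filter can be written down as a small function, which is handy for routing jobs or for documenting the rule in code. This is a simplified sketch: `pick_runtime_shape` is a hypothetical name, the thresholds come from the list above, and the idempotency qualifier and the multi-day RL training case are deliberately left out.

```python
def pick_runtime_shape(run_seconds: float, needs_replay: bool = False) -> str:
    """Map an expected run length to the first-filter recommendation."""
    minute, hour = 60, 3600
    if run_seconds < 30:
        return "request-lifecycle SDK in an app server"
    if run_seconds <= 60 * minute:
        return "queue + worker + checkpoint DB"
    if run_seconds <= 24 * hour:
        # Versioning and replay needs push this band to a workflow engine.
        if needs_replay:
            return "durable workflow engine"
        return "queue + worker + checkpoint DB, or a Cloud Run Job for one-shot work"
    # Beyond a day the run has to survive deploys.
    return "durable workflow engine"


if __name__ == "__main__":
    for seconds in (20, 45 * 60, 6 * 3600, 48 * 3600):
        print(seconds, "->", pick_runtime_shape(seconds))
```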
Platform fit by use case
The matrix is dense on purpose: eleven shapes across many workload types in one view. Two patterns shape almost every reading of it.
Deep Agents Deploy is the only column with green on every row. That means it is the only shape in the lineup that fits every workload type the matrix tracks: short runs, multi-hour runs, code agents, research agents, scheduled jobs. That breadth is the strongest argument for picking it. The trade-off is maturity. The harness shipped recently, and there is less production track record around it than around older alternatives like a queue + worker + Postgres stack that teams have been running for years. If "fits every workload type without re-platforming" is the constraint that matters most to you, accept the lower maturity and pick Deep Agents Deploy. Otherwise prefer the shape you already operate.
Anthropic Managed Agents either fits your workload entirely or not at all. The product has three hard constraints: it is hosted-only, Claude-only, and sub-24-hour per session. If your workload satisfies all three, for example an internal coding agent that runs in 2-6 hour bursts and you would rather not operate a harness yourself, Managed Agents is a strong fit and removes a large chunk of platform work from your team. If one constraint fails because you need a non-Claude model, self-hosted compliance, or 48-hour runs, Managed Agents does not fit. No amount of configuration changes that.
The pricing is worth modeling before you commit, not after. The session-hour line is $0.08/hour on top of standard token costs. A single session running continuously for a 30-day month costs about $58 (720 hours × $0.08). At 100 sessions running continuously, it is about $5,800/month before tokens. Multiply $0.08 by your expected concurrent-session hours, add it to your token bill, and compare against what a queue + worker stack on your own infra would cost. Migrating off Managed Agents later is a re-platforming exercise, not a config change.
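The arithmetic is simple enough to keep in a script next to your capacity plan. The sketch below uses the $0.08 session-hour rate quoted above and assumes a 30-day month; the function name and parameters are illustrative.

```python
SESSION_HOUR_RATE_USD = 0.08  # session-hour rate quoted above
DAYS_PER_MONTH = 30           # assumption: 30-day billing month


def monthly_session_cost(concurrent_sessions: float, hours_per_day: float = 24.0) -> float:
    """Session-hour cost per month in USD, before token costs."""
    session_hours = concurrent_sessions * hours_per_day * DAYS_PER_MONTH
    return round(session_hours * SESSION_HOUR_RATE_USD, 2)


if __name__ == "__main__":
    print(monthly_session_cost(1))       # one always-on session
    print(monthly_session_cost(100))     # one hundred always-on sessions
    print(monthly_session_cost(100, 6))  # same fleet, six-hour daily bursts
```

Under these assumptions one continuously running session comes to $57.60 a month and 100 come to $5,760, which matches the rounded figures above. Bursty workloads cost proportionally less, so the `hours_per_day` estimate matters more than the session count.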
Hosted harness vs owned harness
The distinction here is operational, not who wrote the harness code. Hosted means the vendor runs the harness loop in their infrastructure and you call an API. Owned means you run the loop on your own infrastructure, even if the harness code itself came from a vendor.
LangChain shows up on both sides of this line, which trips people up. They ship LangGraph, an MIT-licensed library you self-host (owned), and Deep Agents Deploy, a managed product that runs a Deep Agents harness on LangSmith Deployment in its default cloud mode (hosted). Same company, two different operational models. You pick the model, not the vendor. (Deep Agents Deploy also has a self-hosted mode for teams that want the harness ergonomics without the cloud component; that mode lands in the owned bucket.)
Choose a hosted harness when you do not have platform capacity and the vendor constraints fit. Managed Agents means Claude-only. Deep Agents Deploy in cloud mode means LangSmith in production. In return, the vendor owns caching and compaction.
An owned harness (LangGraph, Deep Agents Deploy in self-hosted mode, or a custom harness on top of an SDK) is the right call when you have platform engineers, when you need to iterate on harness shape faster than a vendor will release updates, when compliance pushes data residency under your control, or when multi-model routing across providers is non-negotiable. You pay for it in pages and operational surface area.
Most teams should start hosted, measure what they need to change, and migrate to owned only when the hosted constraints start to hurt.
Hosted sandbox vs Docker / Fargate sandbox
Pick a hosted sandbox when sandbox creation time matters (sub-200 ms), you need session pause/resume with memory state, or you need fork/branch semantics. Pick Docker or Fargate when you already pay for that compute, you need VPC-internal access to sensitive data sources, or you have hard data-residency constraints.
State stores: Git, DB, and object storage side by side
Long-running agents usually have three state stores running at the same time, not one, and each holds a different kind of state. Git stores the workspace state: the code, documents, and progress files the agent is changing. Each commit gives the harness a stable recovery point and gives the next session a compact history to inspect. The checkpoint DB holds the harness's graph state: what was decided, which nodes ran, which results came back, and what should run next. That is what lets the next worker resume mid-run. The artifact store holds the big final outputs, such as PDFs, parquet files, and screenshots. Those do not belong in git or the checkpoint DB.
When to use git as state
Use git when the workload is code-shaped (multi-file edits, refactors, app generation) or document-shaped enough that file history matters. The pattern is simple: create a run branch, make an initializer commit, then commit at meaningful boundaries: after setup, after each feature, after tests pass, after the final cleanup. Store the latest workspace commit SHA next to the checkpoint row. On resume, the next worker checks out the branch, reads git log --oneline -8, inspects git status and the latest diff, then reads PROGRESS.md or whatever handoff file the previous session wrote.
That makes git a recovery surface for the artifact under edit, not a replacement for the checkpoint DB. Git can answer two questions: what changed, and which version passed tests. It cannot tell the harness which graph node should run next, which tool call is waiting for approval, or which retry already used its idempotency key. Anthropic's harness uses initializer commits plus per-feature commits as the source of truth for workspace recovery; the model reads git log --oneline -8 to recover state. Skip git when the work product is a single conversational answer. The overhead does not pay off.
When to use DB checkpointing
Use PostgresSaver-style checkpointing when the agent has a graph structure with multiple nodes whose intermediate state matters (planner → researcher → writer → verifier). The reference repo uses this for exactly that reason. Do not put terabyte-scale workspace artifacts in the checkpoint; those go to object storage.
When to use an artifact store (S3 / GCS)
Reach for object storage in three situations. First, the output is bigger than a few hundred KB. Checkpoint DBs are not built to hold large blobs, and trying to keep them there makes both the DB and the checkpoint format painful. Second, downstream consumers such as BI tools, customers, or other services need URL-addressable artifacts they can fetch on their own without going through the agent. Third, the retention window for the run state and the retention window for the deliverable diverge. You might drop the session log after 30 days but keep the final report for years. Key the layout by (thread_id, checkpoint_id, artifact_name) so you can always reconstruct which run produced which artifact.
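One way to encode that layout is a key builder with a matching parser. This is a sketch under assumptions: the `runs` prefix, the function names, and the choice to percent-encode each component are illustrative, not something the reference repo prescribes.

```python
from urllib.parse import quote, unquote


def artifact_key(thread_id: str, checkpoint_id: str, artifact_name: str,
                 prefix: str = "runs") -> str:
    """Build an object key of the form prefix/thread_id/checkpoint_id/artifact_name."""
    parts = [thread_id, checkpoint_id, artifact_name]
    if not all(parts):
        raise ValueError("thread_id, checkpoint_id, and artifact_name are all required")
    # Percent-encode each component so a "/" inside a name cannot add a path level.
    return "/".join([prefix] + [quote(part, safe="") for part in parts])


def parse_artifact_key(key: str) -> tuple:
    """Recover (thread_id, checkpoint_id, artifact_name); assumes a slash-free prefix."""
    _prefix, thread_id, checkpoint_id, artifact_name = key.split("/")
    return unquote(thread_id), unquote(checkpoint_id), unquote(artifact_name)
```

Because every key carries the thread and checkpoint IDs, a stray object found in the bucket years later can still be traced back to the run that produced it, even after the session log has been dropped.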
When to add human approval gates
Add gates when the tool call is destructive and irreversible (DB writes, money movement, sending external comms), when the tool call exits the agent's blast radius (production deploys, customer-facing publishes), or when regulators require review. LangGraph's interrupt() and Deep Agents' approval middleware both have built-in support for these gates. Part 4 covered why these gates are a permission concern, not a prompt concern.
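The gate itself is ordinary harness code that runs before the tool does. Here is a framework-free sketch; the tool names, the `ToolCall` type, and the `approve` callback are hypothetical. In a LangGraph harness the callback is roughly where an `interrupt()` would pause the graph and wait for a human decision.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical tool names, grouped by the reasons listed above.
IRREVERSIBLE = {"db_write", "transfer_funds", "send_email"}
EXITS_BLAST_RADIUS = {"deploy_production", "publish_post"}


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


def approval_reason(call: ToolCall, regulated_tools=frozenset()) -> Optional[str]:
    """Return why this call needs human approval, or None if it may run directly."""
    if call.name in IRREVERSIBLE:
        return "destructive and irreversible"
    if call.name in EXITS_BLAST_RADIUS:
        return "exits the agent's blast radius"
    if call.name in regulated_tools:
        return "regulatory review required"
    return None


def run_tool(call: ToolCall, execute: Callable, approve: Callable) -> dict:
    """Execute a tool call, asking `approve` first when a gate applies."""
    reason = approval_reason(call)
    if reason is not None and not approve(call, reason):
        # The tool never runs; the model only sees the rejection.
        return {"status": "rejected", "reason": reason}
    return {"status": "ok", "result": execute(call)}
```

The decision depends only on the tool name and a policy table, never on what the model says about the call, which is what makes it a permission check instead of a prompt instruction.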
A practical production checklist
Before a long-running agent ships, answer these questions in concrete infrastructure terms.
- Which store owns session events and checkpoints?
- What happens if the worker dies halfway through a tool call?
- Can one run corrupt another run's workspace?
- Which actions require approval?
- Can the model or sandbox read raw credentials?
- Which tool calls can safely retry?
- Where is the per-run cost cap enforced?
- What fresh-context check decides "done"?
- Where do final outputs live after the sandbox is gone?
- Can we explain a failed run tomorrow without re-running it?
If the answer to any of these is "the prompt tells the agent to be careful," the system is not deployed yet. It's still a demo.
Key takeaways
- Long-running agents need a runtime, not just a bigger HTTP timeout. The runtime keeps the session, workspace, tool results, checkpoints, traces, budgets, approvals, and credentials outside the model's memory.
- The core components are session, harness, sandbox, checkpoint, and trace. Policy checks and secret brokering cut across all of them.
- The first deployment question is run length. A 20-second helper can live in an app server. A six-hour coding or research run needs a queue, worker, durable state, and a sandbox strategy.
- Queue + worker + checkpoint DB is the practical default when you want to own the runtime. Hosted harnesses such as Anthropic Managed Agents or Deep Agents Deploy are better when their vendor constraints fit and you do not want to operate the loop yourself.
- Side-effecting tools need idempotency keys. In queues and workflow engines, retries are normal. Without deduplication, a retry can send the same email, create the same PR, or charge the same customer twice.
- Observability has to survive the worker. For a failed six-hour run, the useful artifacts are the event log, last checkpoint, trace, tool-call timeline, workspace diff, screenshots, and environment metadata.
- Runtime assumptions age as models change. Re-test context resets, evaluator patterns, tool surfaces, and budget rules when you switch models or harnesses.
- Secrets should stay outside the harness and outside the model context. The agent gets tool capabilities, not raw credentials.
References
Engineering write-ups
- OpenAI, Harness engineering: leveraging Codex in an agent-first world.
- Anthropic Engineering, Effective harnesses for long-running agents.
- Anthropic Engineering, Harness design for long-running application development.
- Anthropic Engineering, Scaling Managed Agents: Decoupling the brain from the hands, April 8, 2026.
- Cognition AI, Rebuilding Devin for Claude Sonnet 4.5: Lessons and Challenges.
- Vercel, We removed 80% of our agent's tools.
- Addy Osmani, Long-running Agents.
LangGraph and Deep Agents
- LangGraph docs, Persistence.
- LangGraph reference, Checkpoints.
- langgraph-checkpoint-postgres on PyPI.
- LangChain docs, Deep Agents overview.
- LangChain blog, The runtime behind production Deep Agents.
OpenAI Agents SDK
- OpenAI Agents SDK, Sessions.
- OpenAI Agents SDK, Sandbox concepts.
Temporal
- Temporal blog, Introducing Temporal and agentic sandboxes: the OpenAI Agents SDK.
- Temporal blog, Production-ready agents with the OpenAI Agents SDK + Temporal.
- Temporal × OpenAI Agents SDK contrib README (temporalio/sdk-python).
Anthropic platform
- Anthropic, Claude platform pricing: Managed Agents session-hour rates.
- anthropics/cwc-long-running-agents: Code with Claude 2026 take-home with evaluator subagent and progress-file patterns.
Sandbox providers
- ZenML, E2B vs Daytona: sandbox comparison for platform engineers.
- Robert Mill, E2B vs Fly Machines (Medium, 2025): head-to-head cold-start benchmark (80 ms same-region, 410 ms p50 cross-region).
- Daytona docs, Sandboxes.
- Daytona changelog, Sandbox fork and snapshot endpoints.
- Modal docs, Sandboxes.
- Modal docs, Cold start guide.
- Runloop on AWS Marketplace.
- Vercel Sandbox pricing and limits.
Cloud platform timeouts and quotas
- Google Cloud, Configure request timeout for services.
- Google Cloud, Using WebSockets.
- Google Cloud, Set task timeout for jobs.
- AWS, Configure Lambda function timeout.
- AWS, Lambda quotas.
- AWS, Fargate throttling quotas.
- AWS, ECS service quotas and API throttling limits.
Observability
- OpenTelemetry, Semantic conventions for generative AI systems.
- OpenTelemetry, Gen AI attributes registry.
- OpenTelemetry, Semantic conventions for GenAI agent and framework spans.
- OpenTelemetry, Semantic conventions for generative AI metrics.
Series
- Part 1: The Cognitive Engine: reasoning loops.
- Part 2: The Cortex: memory tiers.
- Part 3: The Hands: tool ergonomics and the ACI.
- Part 4: The Guardians: policy, sandboxes, permissions, agent threat models.
- Part 5: The Runtime (this post): sessions, sandboxes, checkpoints, deployment shapes.
The Market Analyst Agent code (LangGraph worker, Postgres checkpointer, Qdrant memory, MCP sidecar, and the Docker Compose topology described above) is on GitHub.