
State Stores

This guide covers the SessionStateStore interface and the v7 contracts every implementation must satisfy. For end-user custom-state patterns (defining schemas, reading/writing state inside tools), see State Management.

Overview

The state store is the single durable surface for an agent's session data. v7's stateless-suspension redesign elevated this to a load-bearing role: every HITL pause writes its full continuation context into SessionState.suspensionContext, so the next process to wake up on the session can pick up where the previous one left off, even across restarts and machine moves.

In v7, every in-tree state store implements the full SessionStateStore interface atomically. There is no non-atomic fallback for third-party stores: the delegating helper from earlier versions was removed, so custom implementations must satisfy the same atomic contract (see Implementing in a custom store below).

What changed in v7

If you maintain a custom SessionStateStore implementation or operate sessions at the SQL level, the v7 changes that matter:

  1. New required method: saveStateAndPromoteStaging. Atomic write-and-promote that replaces the v6 two-call dance.
  2. Forward-only schema migrations added to all in-tree stores — Postgres V5, D1 V8, DO SQLite V4. They add a suspension_context column and indexes on pendingClientToolCalls for efficient expiration queries. See Storage migrations.
  3. compareAndSetStatus return shape. Old: Promise<boolean>. New: a discriminated union, { ok: true; newVersion } | { ok: false; currentStatus; currentVersion }. This is the most commonly tripped v7 breaking change.
  4. New SessionState fields: suspendedAwaitingChildren, suspendedStepId, tracingContext, expiresAt. Custom stores must persist all of them, even if the columns store JSON.
  5. expiredSessionCleanup — operator-driven helper for reaping sessions whose expiresAt is in the past.

If you are upgrading from v6, read the v6 to v7 migration guide end-to-end before deploying. The rest of this page describes the v7 model.

saveStateAndPromoteStaging

SessionStateStore.saveStateAndPromoteStaging(sessionId, state, opts) atomically:

  1. Persists the full SessionState (messages, custom state, suspensionContext, all the v7 fields).
  2. Promotes any staged Immer patches into the canonical state.
  3. Bumps the session version.

In v7 this is the canonical write path used by the run loop after every step. The previous v6 flow — saveStaging followed by a separate commitStaging call — has a small window where a crash leaves staging written but unpromoted. The atomic primitive closes that window.
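The contract can be illustrated with a minimal in-memory sketch; the SessionState shape, staging representation, and method signature below are simplified stand-ins for the real @helix-agents/core types:

```ts
// Simplified stand-ins for the real @helix-agents/core types.
interface SessionState {
  messages: string[];
  custom: Record<string, unknown>;
}

type Patch = (state: SessionState) => SessionState;

interface StoredSession {
  state: SessionState;
  stagedPatches: Patch[]; // staged writes awaiting promotion
  version: number;
}

class InMemoryStateStore {
  private sessions = new Map<string, StoredSession>();

  stagePatch(sessionId: string, patch: Patch): void {
    const s = this.sessions.get(sessionId) ?? {
      state: { messages: [], custom: {} },
      stagedPatches: [],
      version: 0,
    };
    s.stagedPatches.push(patch);
    this.sessions.set(sessionId, s);
  }

  // Atomic here because the whole update is one synchronous mutation:
  // persist the state, promote staged patches, bump the version. There
  // is no interleaving point where staging is written but unpromoted.
  async saveStateAndPromoteStaging(
    sessionId: string,
    state: SessionState,
  ): Promise<{ newVersion: number }> {
    const prev = this.sessions.get(sessionId);
    const staged = prev?.stagedPatches ?? [];
    const promoted = staged.reduce((s, patch) => patch(s), state);
    const newVersion = (prev?.version ?? 0) + 1;
    this.sessions.set(sessionId, {
      state: promoted,
      stagedPatches: [], // staging is now canonical
      version: newVersion,
    });
    return { newVersion };
  }

  load(sessionId: string): StoredSession | undefined {
    return this.sessions.get(sessionId);
  }
}
```

In a real store the same three effects must land inside one transaction or compare-and-swap, not merely one JavaScript tick.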

Implementing in a custom store

If you maintain a third-party SessionStateStore, you MUST implement saveStateAndPromoteStaging atomically — run both writes inside a single transaction (Postgres) or compare-and-swap (Redis/DO).

Earlier versions exported a defaultSaveStateAndPromoteStaging() helper from @helix-agents/core that performed the legacy two-call flow (non-atomic). That helper was removed in P3.R3-BC-FALLBACK: a sequential appendMessages → saveState → promoteStaging opens a small window where a crash between calls leaves staging written but unpromoted, which is exactly the corruption the atomic primitive was added to prevent. All five in-tree stores (memory, redis, postgres, D1, DO) implement the atomic version; custom stores must do the same.
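For stores without multi-statement transactions (the Redis and Durable Object paths mentioned above), the same guarantee can be approximated with optimistic versioning. The backend API in this sketch is illustrative, not the real store's:

```ts
interface VersionedDoc<T> {
  value: T;
  version: number;
}

// A toy backend exposing only get plus a conditional put, i.e. a KV
// store with compare-and-swap semantics.
class CasBackend<T> {
  private doc: VersionedDoc<T> | undefined;

  get(): VersionedDoc<T> | undefined {
    return this.doc;
  }

  // Succeeds only if the caller saw the current version.
  putIfVersion(value: T, expectedVersion: number): boolean {
    const current = this.doc?.version ?? 0;
    if (current !== expectedVersion) return false;
    this.doc = { value, version: current + 1 };
    return true;
  }
}

interface SessionRecord {
  state: unknown;
  staged: unknown[];
}

// Write-and-promote as a single CAS: read the record, build the fully
// promoted successor, and publish it conditionally. On contention,
// retry from a fresh read; no intermediate state is ever visible.
async function saveStateAndPromoteStaging(
  backend: CasBackend<SessionRecord>,
  state: unknown,
  maxRetries = 5,
): Promise<number> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const current = backend.get();
    const version = current?.version ?? 0;
    const next: SessionRecord = { state, staged: [] }; // promoted + cleared in one value
    if (backend.putIfVersion(next, version)) return version + 1;
  }
  throw new Error('saveStateAndPromoteStaging: CAS contention exceeded retries');
}
```

Bounding the retries matters: an unbounded loop under sustained contention would stall the run loop rather than surface an error.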

compareAndSetStatus returns an object

The status-CAS API changed in v7 to surface what the store saw, not just whether the swap succeeded:

```ts
// v6
const ok = await store.compareAndSetStatus(sessionId, ['active'], 'paused');
if (ok) { /* ... */ }

// v7
const result = await store.compareAndSetStatus(
  sessionId,
  ['active'],
  'paused',
);
if (result.ok) {
  console.log('promoted to version', result.newVersion);
} else {
  console.log(
    'lost CAS — store is at',
    result.currentStatus,
    'version',
    result.currentVersion,
  );
}
```

Every call site in your codebase must update. The lossy boolean form is gone.

New SessionState fields

Counting suspensionContext, v7 adds five fields to SessionState. Custom stores that serialize state must round-trip all of them.

| Field | Type | Purpose |
| --- | --- | --- |
| suspensionContext | SuspensionContext \| undefined | Continuation context for HITL pauses. Read on resume to restore the loop. |
| suspendedAwaitingChildren | SuspendedChildWait[] \| undefined | Per-child waits for cascading sub-agent suspensions. |
| suspendedStepId | string \| undefined | The step ID a suspended_step_partial outcome is anchored to. |
| tracingContext | TracingContext \| undefined | Persisted Langfuse / OTel trace IDs so resume continues the same trace. |
| expiresAt | number \| undefined | Epoch ms TTL; read by expiredSessionCleanup. |

In Postgres, the four fields beyond suspensionContext serialize into the state JSONB column; no schema migration is needed beyond V5's suspension_context column, which exists for indexing. In D1 / DO SQLite they live in the state TEXT column.
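To check a custom serializer, a JSON round-trip with simplified field types is a useful smoke test. The inner shapes of SuspensionContext and TracingContext below are illustrative, not the real types:

```ts
// SessionState fragment with the v7 additions; inner shapes simplified
// to JSON-friendly placeholders.
interface SessionStateV7 {
  messages: unknown[];
  suspensionContext?: { reason: string; payload: unknown };
  suspendedAwaitingChildren?: Array<{ childSessionId: string }>;
  suspendedStepId?: string;
  tracingContext?: { traceId: string; spanId: string };
  expiresAt?: number; // epoch ms
}

// A JSON round-trip must preserve every v7 field, whether the store
// backs it with JSONB (Postgres) or TEXT (D1 / DO SQLite).
function roundTrip(state: SessionStateV7): SessionStateV7 {
  return JSON.parse(JSON.stringify(state));
}
```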

Storage migrations

Every in-tree store ships a forward migration in v7. Apply migrations before rolling new code; new code reading old data is fine, but old code reading new data is undefined behavior.

| Package | Migration | Notes |
| --- | --- | --- |
| @helix-agents/store-postgres | V5 | Adds suspension_context JSONB, GIN index. |
| @helix-agents/store-cloudflare (D1) | V8 | Adds suspension_context TEXT, JSON-path index. |
| @helix-agents/store-cloudflare (DO) | V4 | Adds suspension_context TEXT to the DO SQLite. |
| @helix-agents/store-redis | (none) | RedisJSON path-set; version bump only. |
| @helix-agents/store-memory | (none) | In-memory; no migration needed. |

Verify the active migration version with:

```sql
SELECT version FROM __agents_migrations
ORDER BY version DESC
LIMIT 1;
```

Postgres should show 5 or higher; D1 should show 8 or higher; the DO SQLite tier should show 4 or higher.
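A startup guard built on that query keeps the "migrate before rolling new code" ordering honest. A sketch, assuming you fetch the active version through your own driver; the required minimums mirror the numbers above:

```ts
// Minimum migration version per in-tree SQL store tier, per the
// numbers stated above (Postgres V5, D1 V8, DO SQLite V4).
const REQUIRED_MIGRATION_VERSION = {
  postgres: 5,
  d1: 8,
  doSqlite: 4,
} as const;

type SqlTier = keyof typeof REQUIRED_MIGRATION_VERSION;

// Throws before the app serves traffic if migrations were not applied,
// since old code reading new data is undefined behavior.
function assertMigrated(tier: SqlTier, activeVersion: number): void {
  const required = REQUIRED_MIGRATION_VERSION[tier];
  if (activeVersion < required) {
    throw new Error(
      `${tier} store is at migration V${activeVersion}, needs V${required}+`,
    );
  }
}
```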

Forward-only

Rolling back from v7 to v6 after applying these migrations is unsafe by default. Sessions paused under v7 carry suspension context that v6 does not know how to read; resuming them under v6 silently loses the context. See the migration guide's rollback semantics for the recovery procedure.

Operator-driven session cleanup

v7 adds expiredSessionCleanup to @helix-agents/agent-server — a helper for reaping sessions whose expiresAt is in the past. The helper:

  1. Pages through stateStore.listSessions() (configurable page size, default 200).
  2. Loads each session and checks expiresAt.
  3. For each expired non-terminal session: enumerates owned workspace snapshots via the matching WorkspaceProvider's snapshot.list / snapshot.delete capability and deletes them (closes the R2 cost-amplification gap).
  4. CAS's the session status to 'failed' with reason 'session_expired'. Per-session failures (load errors, snapshot errors, CAS conflicts) are logged but do not abort the loop.

The framework does not run this automatically. Wire it into a scheduled job (cron, Cloudflare Alarm, k8s CronJob).

```ts
import { expiredSessionCleanup } from '@helix-agents/agent-server';

// Cloudflare Alarm handler
export default {
  async scheduled(_event, env, _ctx) {
    const summary = await expiredSessionCleanup({
      stateStore,
      workspaceProviders, // Map<providerId, WorkspaceProvider>
      logger: consoleLogger,
    });
    console.log('cleanup summary', summary);
  },
};
```

The returned summary ({ detected, marked, alreadyTerminal, snapshotsDeleted, errors }) gives operators a per-run observability handle.
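A scheduled job can turn that summary into an alerting signal. The field types below, including the per-error element shape, are assumptions beyond the field names given above:

```ts
// Assumed summary shape; only the five field names are documented,
// the types (and the error element shape) are illustrative.
interface CleanupSummary {
  detected: number;
  marked: number;
  alreadyTerminal: number;
  snapshotsDeleted: number;
  errors: Array<{ sessionId: string; message: string }>;
}

// Per-session failures are logged and never abort the loop, so the
// accumulated error count is the signal worth paging on.
function shouldAlert(summary: CleanupSummary, maxErrors = 0): boolean {
  return summary.errors.length > maxErrors;
}
```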

