# v6 to v7 Migration Guide — Stateless Suspension Redesign
This guide covers the v7 release of @helix-agents/*, the largest single-version change since the project began. v7 reshapes how the runtime suspends and resumes for human-in-the-loop (HITL) interactions — client-executed tools and approval-gated tools — by removing every in-memory bridge between an HTTP request and the agent loop and making the state store the single source of truth across requests.
If you have been running v6 in production with chat-style HITL agents, read this guide end-to-end before upgrading. Some changes are forward-compatible-only at the storage layer (rolling back to v6 after applying migrations is unsafe; see Rollback Semantics).
## Table of contents
- Overview / motivation
- Breaking changes per package
- Operational guidance
- Code migration examples
- Rollback semantics
- Validation checklist
## Overview / motivation

### What changed and why
In v6, when a Helix agent reached a HITL boundary (a client-executed tool or an approval-gated tool), the runtime parked the in-flight JavaScript loop in memory and waited for the consumer to land a `submitToolResult()` call on the same process. In a Cloudflare Durable Object, that meant a hibernation guard kept the DO awake. In the JS runtime, that meant a `setTimeout`-driven promise dangling on the heap. In Temporal, it meant a workflow blocked on a condition — the only place this design composed cleanly.
Three problems compounded on top of that model:
- Determinism timeouts at the editor seam. The flagship example was the `editContent` client tool: long edits would deterministically blow past the v6 client-tool deadline because the suspended waiter billed wall-clock time end-to-end.
- Stream resumption was a gut-feel system. AI SDK v6 introduced first-class `streamId` HITL primitives. Plumbing them through to v6 Helix required ad-hoc transports (`HelixChatTransport`) that short-circuited the SDK's own stream-close-and-reopen lifecycle.
- Cost ratchet on long pauses. Anything that paused longer than a few seconds amplified DO/Temporal wall-time billing. Internal data showed long-pause sessions could spend ~80% of their wall-time waiting on the human.
### v7 in one sentence
The agent loop is allowed to die at every HTTP request boundary; the state store is the only durable thing across requests; and resumption is driven by `executor.resume()` reading durable suspension context, not by signaling an in-memory waiter.
This unlocks:
- Deterministic suspension semantics — the client-tool deadline now measures wall-clock from suspension write to submission write, not from in-memory dispatch to in-memory resolve.
- Native AI SDK v6 stream-close-and-reopen integration via `prepareHelixChatRequest` and `useResumeClientTools`.
- Approval-gated tools as a first-class primitive (`requireApproval` on `defineTool`) rather than something every consumer had to build by hand.
- Cost reduction on long pauses, particularly on Cloudflare DO where the v6 hibernation guard kept the object loaded.
### Scope

v7.0 ships stateless HITL support across every currently-supported runtime path:

- `runtime-js` (full stateless)
- `runtime-cloudflare` (Durable Object path) (full stateless)
- `runtime-temporal` (full stateless)
- `runtime-cloudflare` (Workflows path) (full stateless)
- `runtime-dbos` (consumer-equivalent; see the DBOS divergence note below)
The first four paths use durable-state-only suspension at every HITL boundary — the workflow exits, durable state captures the suspension, and a subsequent `executor.resume()` re-enters from durable state. No runtime blocks on in-memory or in-step waits during HITL.
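In consumer terms, a HITL turn becomes two (or more) independent requests against the public executor surface. A minimal sketch, assuming a JS-runtime executor; the `execute()` payload shape here is illustrative, while the status strings and the submit/resume calls are the documented v7 surface:

```ts
import { JSAgentExecutor } from '@helix-agents/sdk';

// Sketch only: execute()'s exact argument shape is a stand-in.
async function runOneHitlTurn(
  executor: JSAgentExecutor,
  agent: unknown,
  sessionId: string,
  pendingToolCallId: string
) {
  // Request 1: run until completion or a HITL boundary. After this request
  // returns, nothing is parked in memory — the process is free to die.
  const handle = await executor.execute(agent, { sessionId });
  const outcome = await handle.result();

  if (outcome.status === 'suspended_client_tool') {
    // Request 2 (any process): the submission is a durable write...
    await executor.submitToolResult({
      kind: 'client-tool-result',
      sessionId,
      toolCallId: pendingToolCallId,
      result: { ok: true },
    });
    // ...and resume() re-enters the loop from durable suspension context.
    await executor.resume(sessionId);
  }
}
```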
### DBOS divergence (per second-round review #5/#8 finding P0.5)
The DBOS runtime is NOT durable-state-only at the suspension boundary. DBOS implements HITL by blocking inline on `DBOS.recv()` inside the workflow — the workflow itself stays alive across the suspension, and the durable state captures only the input/output of the recv call rather than the full workflow exit/resume cycle.
This is consumer-equivalent for the public surface area: `executor.execute()` / `submitToolResult()` / `executor.resume()` behave identically from the user's perspective. The lifecycle hooks (`onAgentSuspended` / `onAgentResumed`) fire at the same logical points, and the persisted `SessionState` shape matches the other runtimes.
It's NOT identical for cost & operational characteristics:
- DBOS workflows accumulate uptime for the duration of a HITL pause (because the workflow process stays alive). Long-running pauses on DBOS cost more compute than the same pause on JS/Temporal/CF.
- DBOS hook firing order around `Promise.all`-driven sub-agent dispatch differs from the other runtimes (hooks fire BEFORE the `Promise.all` resolves, vs. after on the others). Tracing pipelines that assume the post-`Promise.all` timing observe DBOS as an outlier.
- Lifecycle hooks for the `awaiting_children` discriminator have a different code path on DBOS (the `phase1ClientToolIds.length > 0` guard previously caused them to skip entirely; fixed in v7.0-final).
The DBOS hook-firing-order parity table in `docs/internals/concepts.md` records these differences precisely.
## Breaking changes per package
Each package follows independent semver. The branch `omnara/stateless-suspension-redesign` is the v7 release train; consult each package's `CHANGELOG.md` for its specific v7 release version (e.g. `core@0.29.x`, `runtime-temporal@2.2.x`, `ai-sdk@17.0.0`, `runtime-js@8.0.0`, `runtime-cloudflare@5.1.0`, `runtime-dbos@3.0.0`). The list below highlights only consumer-observable changes — internal refactors are documented in each package's CHANGELOG.
### @helix-agents/core
New types:

- `RunOutcome<TOutput>` — runtime-internal discriminated union returned by every `runLoop` implementation. Five variants: `completed`, `failed`, `suspended_client_tool`, `suspended_awaiting_children`, `suspended_step_partial`. Marked `@internal` but visible in stack traces and structured-logger payloads, so worth knowing.
- `StepOutcome` — per-step counterpart of `RunOutcome`, also `@internal`.
- `SuspendedChildWait` — describes a sub-agent the parent is waiting on; one entry per pending child in `suspended_awaiting_children`.
New `requireApproval` flag on `defineTool`:
```ts
defineTool({
  name: 'sendEmail',
  parameters: z.object({ ... }),
  requireApproval: true, // or (input, ctx) => boolean
  execute: async (input) => { ... }
});
```

`requireApproval` is mutually exclusive with `execute: 'client'` and `finishWith: true`. The function form fails closed: an exception inside the evaluator is treated as `requireApproval = true`, matching the Mastra precedent.
New `SubmitToolResult` union:

The v6 single shape (`{ toolCallId, result | error }`) is now one variant of a discriminated union; the second variant is the approval-response shape:

```ts
type SubmitToolResult =
  | { kind: 'approval-response'; toolCallId: string; approved: boolean; reason?: string }
  | { kind: 'client-tool-result'; toolCallId: string; result: unknown }
  | { kind: 'client-tool-result'; toolCallId: string; error: string };
```

Note on `kind`: v6 callers that did not pass `kind` are rejected at `SubmitToolResultSchema.parse()` time. The schema accepts both variants; routing to the owning session happens off `toolCallId`, not `kind`.
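Because the union is discriminated on `kind`, v7 call sites that handle both variants can narrow with an ordinary type guard — a small sketch against the type above:

```ts
type ApprovalResponse = Extract<SubmitToolResult, { kind: 'approval-response' }>;

function isApprovalResponse(s: SubmitToolResult): s is ApprovalResponse {
  return s.kind === 'approval-response';
}

function describeSubmission(s: SubmitToolResult): string {
  if (isApprovalResponse(s)) {
    return `approval ${s.approved ? 'granted' : 'denied'} for ${s.toolCallId}`;
  }
  // Within the client-tool-result variants, `in` narrows result vs. error.
  return 'error' in s
    ? `tool ${s.toolCallId} failed: ${s.error}`
    : `tool ${s.toolCallId} returned a result`;
}
```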
`compareAndSetStatus` return shape:

Old: `Promise<boolean>`. New: `Promise<{ ok: true; newVersion: number } | { ok: false; currentStatus: SessionStatus; currentVersion: number }>`.

This is the single most commonly tripped breaking change in v7. Update every call site:

```ts
// v6
const ok = await store.compareAndSetStatus(sessionId, ['active'], 'paused');
if (ok) { ... }

// v7
const result = await store.compareAndSetStatus(sessionId, ['active'], 'paused');
if (result.ok) {
  console.log('promoted to version', result.newVersion);
} else {
  console.log('lost CAS — store is at', result.currentStatus, 'version', result.currentVersion);
}
```

`saveStateAndPromoteStaging` is now a required interface method on `SessionStateStore`, and it MUST be atomic. All in-tree stores ship atomic implementations. A previously-exported `defaultSaveStateAndPromoteStaging()` helper (non-atomic, sequential `appendMessages` → `saveState` → `promoteStaging`) was removed in P3.R3-BC-FALLBACK — the crash-between-calls window it created is exactly the corruption the atomic primitive was added to prevent. Custom stores should implement it with a transaction (Postgres) or compare-and-swap (Redis/DO), as sketched below.
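For custom store authors, the transactional shape matters more than the exact SQL. A minimal sketch for a Postgres-backed store, assuming illustrative table and column names (the real interface types live in `@helix-agents/core`):

```ts
import { Pool } from 'pg';

// Sketch only: table/column names and the signature are assumptions. The
// point is that all the writes commit or roll back together — the
// crash-between-calls window of the old sequential helper never opens.
async function saveStateAndPromoteStaging(
  pool: Pool,
  sessionId: string,
  stagedMessages: unknown[],
  state: unknown
): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    // Append staged messages.
    for (const message of stagedMessages) {
      await client.query(
        'INSERT INTO agent_messages (session_id, payload) VALUES ($1, $2)',
        [sessionId, JSON.stringify(message)]
      );
    }
    // Persist the new state (including suspension_context) and clear
    // staging in the same transaction.
    await client.query(
      'UPDATE agent_states SET state = $2, staging = NULL WHERE session_id = $1',
      [sessionId, JSON.stringify(state)]
    );
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}
```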
New `SessionState.failureReason` field. When a session enters `status: 'failed'`, this discriminator distinguishes a child marked failed because its parent suspended (`'parent_suspended'`) from a genuine child execution failure. The γ-cascade in `applyResultsAndReload` uses it to decide re-spawn vs. observe — children with `failureReason === 'parent_suspended'` are re-spawned (the parent's resume gives them another chance), while genuinely failed children are drained to the parent as failure results.
Custom store authors: persist `failureReason` round-trip. Stores that silently drop the field will break the γ-cascade — the parent's resume will treat suspended children as genuinely failed and skip re-spawn, producing dangling sub-agent state. A quick contract test is sketched below.
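A round-trip contract test catches this early. Sketch (vitest-style; the store factory, the accessor names, and the trimmed-down state object are placeholders for your own store's API):

```ts
import { describe, expect, it } from 'vitest';

describe('custom store: failureReason round-trip', () => {
  it('persists failureReason across save/load', async () => {
    const store = makeYourCustomStore(); // placeholder factory
    const sessionId = 'child-session-1';

    await store.saveState(sessionId, {
      status: 'failed',
      failureReason: 'parent_suspended',
      // ...remaining SessionState fields elided...
    });

    const loaded = await store.loadState(sessionId); // placeholder accessor
    // If this comes back undefined, the γ-cascade treats the child as
    // genuinely failed and skips re-spawn on the parent's resume.
    expect(loaded.failureReason).toBe('parent_suspended');
  });
});
```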
### @helix-agents/runtime-js
- The legacy `JSAgentExecutor.runLoop` is gone (~1725 LOC removed). The new loop is built around `runStepIterator()` from core and uses the new `RunOutcome` discriminated union.
- `handle.result()` now resolves to an `AgentResult.status` of `'suspended_client_tool' | 'suspended_awaiting_children' | 'suspended_step_partial'` for HITL agents that did not complete in-process. Existing `switch` statements that only handled `'completed' | 'failed' | 'interrupted'` will fall through silently for HITL agents.
- The client-tool wait moved from an in-memory promise + `setTimeout` to durable state (`SessionState.suspensionContext`). Submission resumes the loop via `executor.resume()`, not by waking an in-process promise.
- `JsClientToolResolver` is no longer exported from the public surface; it remains in internal CF DO use only.
### @helix-agents/runtime-cloudflare (DO path)
- The hibernation guard is gone (~365 LOC removed). DOs are now free to evict during HITL waits; the alarm scheduler retains only heartbeat and interrupt-poll subscribers (the client-tool-deadline alarm subscriber is gone — deadlines are enforced in `findExpiredPending` at request time).
- `DOAlarmTimerStrategy` is removed; the runtime no longer needs a timer strategy for client-tool waits.
- `persistentAgents` are now supported on the DO path (commit `fb3180f6b`). Each persistent child runs in its own DO instance addressed by stable sessionId (`{parent}-agent-{name}`); the parent's six auto-injected `companion__*` tools dispatch via the `subAgentNamespace` DO stub against the existing sub-agent endpoints. The earlier v7.0 fail-fast guard has been lifted.
### @helix-agents/runtime-cloudflare (Workflows path)
Full v7 stateless HITL support. The workflow body returns early from `runAgentWorkflow` when the run hits a HITL boundary (client tool, approval gate, or sub-agent suspension cascade). The exit point writes durable suspension state via the `commitSuspendedStep` activity and returns an `AgentWorkflowResult` with status `'suspended_client_tool'`, `'suspended_awaiting_children'`, or `'suspended_step_partial'`. The workflow instance terminates — there is no `step.waitForEvent` keeping it alive across the HITL pause.
`executor.resume(sessionId)` starts a new workflow instance with `mode: 'resume'`. The new instance begins with `applyResultsAndReload`, which drains any queued submissions (`pendingClientToolCalls` whose results have landed) into the conversation and resumes the agent loop from the durable state snapshot. Sub-agent suspension cascades up — a child workflow that suspends propagates `suspended_awaiting_children` to its parent's `AgentWorkflowResult`, which terminates the parent workflow and writes parent suspension state for later resume.
`agent.workspaces` is still rejected at run-start on CFW Workflows. That gate remains a v7.1 deferral. Workspaces require the JS or CF DO runtimes.
### @helix-agents/runtime-temporal
Full v7 HITL support. The `TemporalAgentExecutor` API surface is unchanged — `execute()`, `resume()`, `submitToolResult()`, `interruptAgent()`, `abortAgent()` — but several behaviors changed:
- `handle.result()` for HITL agents now resolves with `'suspended_client_tool' | 'suspended_awaiting_children' | 'suspended_step_partial'`. Consumers must handle the new statuses; there is no backwards-compat shim. Exhaustive `switch` statements that only handled `'completed' | 'failed' | 'interrupted'` will fall through silently for HITL agents.
- `executor.resume(sessionId)` starts a NEW workflow instance with workflow ID `${prefix}__${agentType}__${sessionId}__resume-${N}` (single-dash suffix; `WorkflowIdReusePolicy.ALLOW_DUPLICATE`). The prior workflow has already exited at the HITL boundary; resume's `mode='resume'` branch calls the `applyResultsAndReload` activity, which drains submitted client-tool results into messages, fires `onMessage` + `afterTool` hooks, synthesizes timeouts for expired deadlines, and drains completed sub-agent children.
- `submitToolResult` accepts the `SubmitToolResult` discriminated union with `kind: 'client-tool-result'` or `kind: 'approval-response'` (same as runtime-js). The submission is a durable write only — no Temporal signal is sent (the workflow has already exited at the HITL boundary).
- Sub-agents run as Temporal child workflows. On parent suspension, in-flight children are marked failed with `failureReason: 'parent_suspended'` (mitigation #3) and re-spawned via the `__resume-N` workflow ID convention on the parent's resume (γ-cascade). Completed children are drained on parent resume via `recordSubSessionResult`.
Deleted exports — clean break, no aliases. Code that imported these must remove the imports:
- `runAgentWorkflow` (function) → use the new `agentWorkflow` from `@helix-agents/runtime-temporal/workflow`. The post-A.2 rewrite collapsed `runAgentWorkflow` + the `WorkflowActivities`/`WorkflowDeps` injection shim into a single self-contained `agentWorkflow` function.
- `TemporalClientToolResolver` → no replacement. Client-tool suspension is owned by durable state writes; there is nothing for consumers to wire.
- `executeClientToolInWorkflow` → no replacement.
- `AgentWorkflowActivities`, `AgentWorkflowOptions` → no replacement. Activities are now provided directly by `GenericActivities` from `@helix-agents/runtime-temporal`; the workflow body proxies them internally without consumer wiring.
- `registerToolResultHandler`, `getSubmittedToolResult`, `clearSubmittedToolResult` → no replacement. Submit is durable-only post-A.2; the workflow has already exited at the HITL boundary by the time `submitToolResult` is called, so there is no in-workflow signal handler to register. Consumers drive the next loop iteration via `executor.resume()`.
- Five Temporal signal definitions are removed: `submitToolResultSignal`, `childSuspendedSignal`, `childWokeSignal`, `runResumedSignal`, plus the `RUN_RESUMED_SIGNAL_NAME` constant. `INTERRUPT_SIGNAL_NAME` is retained for cross-process interrupt backward compat.
Worker setup. Register `agentWorkflow` directly, or wrap it in a thin delegate so the worker bundles a workflow under the conventional name your `TemporalAgentExecutor.workflowName` expects:
```ts
// workflows.ts (bundled by the worker)
import { agentWorkflow as v7AgentWorkflow } from '@helix-agents/runtime-temporal/workflow';
import type {
  AgentWorkflowInput,
  AgentWorkflowResult,
} from '@helix-agents/runtime-temporal';

// Re-export under whatever name your TemporalAgentExecutor.workflowName
// expects. Most deployments use the convention 'agentWorkflow' directly,
// in which case you can re-export by name without aliasing.
export async function agentWorkflow(
  input: AgentWorkflowInput
): Promise<AgentWorkflowResult> {
  return v7AgentWorkflow(input);
}
```

The imported `agentWorkflow` sets up its own `proxyActivities`, the `INTERRUPT_SIGNAL_NAME` handler, and child-workflow dispatch internally. The wrapper exists only to register under a stable name.
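For completeness, a typical worker bootstrap around that file, using the standard Temporal TypeScript SDK. How you construct the activities object depends on your wiring, so `createGenericActivities()` below is a stand-in name:

```ts
import { Worker } from '@temporalio/worker';

async function main() {
  const worker = await Worker.create({
    // Bundles the workflows.ts above, which re-exports agentWorkflow.
    workflowsPath: require.resolve('./workflows'),
    // Stand-in: provide the v7 GenericActivities implementations here.
    activities: createGenericActivities(),
    taskQueue: 'helix-agents',
  });
  await worker.run();
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```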
### AgentRegistry.replace() (Cloudflare + Temporal)
Both runtime-cloudflare and runtime-temporal add a public `AgentRegistry.replace(config)` method that returns `boolean` (`true` if a prior agent under the same name was replaced; `false` if it was a fresh registration). Use it to install a per-call hook variant during tests without unregistering and re-registering:
```ts
// Test-clone-with-hooks pattern
const cloned = { ...AgentDef, hooks: { afterTool: spy } };
registry.replace(cloned);
// ... run test ...
registry.replace(AgentDef); // restore
```

`replace()` throws if the agent was originally registered as a factory (`registerFactory()`); use `unregister()` + `registerFactory()` for that case. Verified at `packages/runtime-cloudflare/src/registry.ts:188-200` and `packages/runtime-temporal/src/registry.ts:231-243`.
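For the factory-registered case, the swap has to go through explicit unregistration. A sketch, assuming `unregister()` / `registerFactory()` take the agent name (check your registry's actual signatures):

```ts
// replace() throws for factory registrations, so swap explicitly:
registry.unregister('myAgent');
registry.registerFactory('myAgent', () => ({
  ...baseAgentDef,
  hooks: { afterTool: spy },
}));
// ... run test ...
registry.unregister('myAgent');
registry.registerFactory('myAgent', originalFactory); // restore
```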
### @helix-agents/runtime-dbos
Full v7 HITL support mirroring runtime-js + runtime-temporal. The F7 `resume()` contract fix lands in `packages/runtime-dbos/src/lifecycle/resume.ts` — a single multi-status CAS replaces the previous two-step-and-sometimes-skip pattern. Resumes now move atomically from `paused_awaiting_client | paused_awaiting_children | suspended_step_partial | interrupted` to `running`.
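In terms of the v7 `compareAndSetStatus` surface described under `@helix-agents/core`, the fix reduces to a single call — a sketch of the shape:

```ts
// One atomic CAS from any resumable status to 'running' replaces the old
// two-step-and-sometimes-skip pattern.
const cas = await store.compareAndSetStatus(
  sessionId,
  [
    'paused_awaiting_client',
    'paused_awaiting_children',
    'suspended_step_partial',
    'interrupted',
  ],
  'running'
);
if (!cas.ok) {
  // Another caller won the race; surface what the store actually holds.
  throw new Error(
    `resume lost CAS: session is ${cas.currentStatus} (v${cas.currentVersion})`
  );
}
```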
Hooks (`onAgentSuspended` / `onAgentResumed` / `beforeTool` / `afterTool` / `onMessage` / `onStateChange`) fire from runtime-dbos workflows identically to the other runtimes — wired through the lifecycle pipeline at `workflows/shared.ts` + `workflows/standard-workflow.ts`.
Workspaces are unsupported on runtime-dbos in this release: agents declaring workspaces silently lose those tools, with no fail-fast guard. Tracked as future work — see `docs/dev/future-work.md`. Workaround: use runtime-js or CF DO until that gap closes.
### @helix-agents/agent-server
Five new chat handler routes layered on top of the existing seven `AgentExecutor` routes:

- `POST /chat` — start or continue a chat turn. Always streams.
- `GET /chat/{id}/stream` — reattach to an in-progress stream after a refresh.
- `POST /chat/{id}/submit-tool-result` — submit a client-tool result or approval response.
- `POST /chat/{id}/interrupt` — durable interrupt; replaces the v6 in-memory `interrupt(handle)`.
- `POST /chat/{id}/abort` — abort the current run.
Wire them via `AgentServer.toExpressMiddleware('/chat')` (Express) or the equivalent generic adapter.
`INTERRUPT_NOT_LOCAL` 503 is removed. v6 returned a 503 from agent-server when an interrupt was issued against a session whose handle lived on a different process. v7 writes a durable interrupt request to the state store; the running loop picks it up at its next checkpoint.
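From a client, the durable interrupt is just a request against the new route — no affinity to the process running the loop. Sketch (the base path depends on where you mounted the middleware):

```ts
// Writes a durable interrupt flag; the running loop observes it at its
// next checkpoint, regardless of which process holds the loop.
await fetch(`/api/chat/${sessionId}/interrupt`, { method: 'POST' });
```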
### @helix-agents/ai-sdk
- `HelixChatTransport` is deleted. It existed solely to short-circuit AI SDK v5's transport in service of v6's in-memory HITL bridge. v7 uses the SDK's native stream-close-and-reopen lifecycle.
- New helper: `prepareHelixChatRequest({ api, resumeFromSequence, existingMessageId })` — pass to `DefaultChatTransport` to drive resume.
- New helper: `prepareHelixReconnectRequest({ api })` — pass to `DefaultChatTransport.prepareReconnectToStreamRequest` so the AI SDK's built-in `reconnectToStream` after a page refresh hits `GET /chat/{id}/stream` with the right `X-Resume-From-Sequence` / `X-Existing-Message-Id` headers. Without this helper, a page refresh during a stream silently hangs tool calls in `pending` forever, because `reconnectToStream` defaults to a 404 HTML page rather than the SSE endpoint. This is the single most frequently missed migration step.
- New React hook: `useResumeClientTools({ chat, toolHandlers })` — intercepts `tool_start` chunks for client-executed tools, runs the consumer-supplied handler, and submits the result back to the server. Replaces the manual `submitToolResult` plumbing each consumer wrote in v6.
- New helpers: `extractResumeIntent` (server-side; parses the client's resume cookie/header into `{ resumeFromSequence, existingMessageId }`) and `findExpiredPending` (server-side; inspects `pendingClientToolCalls` for entries past their deadline, used by the chat handler to emit synthetic `tool_error` chunks rather than letting clients spin forever).
- New orchestrator: `handleChatStream({ executor, stateStore, streamManager, agent }, params)` — the canonical chat handler. Pass it to `AgentServer({ chatHandler })`.
### Storage adapters
The SQL-backed stores apply a forward migration that adds a `suspension_context` JSONB (TEXT on D1) column to the session-state row, plus indexes on `pendingClientToolCalls` for efficient expiration queries.
| Package | Migration |
|---|---|
| `@helix-agents/store-postgres` | V7 |
| `@helix-agents/store-redis` | none (no schema, but consumes new RedisJSON paths — version bump only) |
| `@helix-agents/store-cloudflare` (D1) | V9 |
| `@helix-agents/store-cloudflare` (DO SQLite) | V5 |
| `@helix-agents/store-memory` | none (in-memory; no migration needed) |
All four store packages implement atomic `saveStateAndPromoteStaging`. There is no non-atomic fallback in core (the helper was removed; see the `@helix-agents/core` section above) — third-party stores must ship their own atomic implementation before upgrading.
### @helix-agents/tracing-langfuse
- Trace ID seed changed from `runId` to `sessionId`. This is a deliberate behavior change: with v7's stateless-suspension model, a single conversational session can span many runs (each resume after a HITL boundary creates a new run). Seeding from `runId` would produce one trace per run, fragmenting the chat conversation across many traces in the Langfuse UI. Seeding from `sessionId` keeps the entire conversation in one trace.
- New hook handlers: `onAgentResumed` and `onAgentSuspended` produce matching `event` spans inside the session-scoped trace, so you can see exactly where a run paused and where it resumed.
The legacy `core/tracing/tracing-hooks.ts` adapter is HITL-incompatible in v7: it relies on an in-memory `tracingStateMap` that the stateless-suspension redesign cannot populate across process restarts. v7 fails fast when `requireApproval` or client-tool agents are run with the legacy adapter. Migrate to `@helix-agents/tracing-langfuse` (or implement the v7 Logger-style interface in your own adapter) before upgrading.
## Operational guidance

### Pre-deploy checklist
- Drain agent traffic to a < 60s window. v7's storage migrations are forward-compatible-only: new code reading old data is fine; old code reading new data is undefined behavior.
- Apply storage migrations before deploying new code. Each store ships a CLI migration runner; verify with:

  ```sql
  SELECT version FROM __agents_migrations;
  ```

  Postgres should show V7; D1 should show V9; the DO SQLite tier should show V5.
- Verify no Temporal / CFW Workflows agents declare `agent.workspaces`. Workspaces remain unsupported on those runtimes in v7.0 (still a v7.1 follow-up). HITL primitives (`requireApproval`, client-executed tools) and persistent sub-agents (CF DO) all ship in v7.0 — only `agent.workspaces` is gated outside the JS and CF DO runtimes.
### Post-deploy monitoring
- `client_tool_timeout` count should drop to 0 (or near it). Pre-v7, long-running client tools exceeded the in-memory deadline routinely. Post-v7, the deadline measures durable wall-clock time from suspension write to submission write, so genuine timeouts should only occur for clients that actually never submit.
- Wall-time billed on long-pause sessions should drop ~80%. The CF DO hibernation guard was the dominant cost driver. With it gone, a 5-minute HITL pause now bills approximately the cost of two HTTP requests instead of 5 minutes of DO uptime.
- Watch the new structured logger events: `client_tool.suspended`, `client_tool.submitted`, `client_tool.timeout`, `agent.resumed`, `agent.suspended`. These replace the v6 in-memory bookkeeping and are now your only visibility into HITL state.
### Recovery from stuck sessions
If a session ends up stuck in `pendingClientToolCalls` with no submission landing (e.g., a client crashed mid-flow), the operator runbook is as follows (the steps are stitched into a sketch after the list):
- Inspect `SessionState.pendingClientToolCalls` to see which tool call IDs are pending.
- Use `findExpiredPending(state)` from `@helix-agents/ai-sdk` to identify expired entries.
- Force-fail by calling `executor.submitToolResult({ kind: 'client-tool-result', toolCallId, error: 'operator_force_failed' })`. This emits a synthetic tool error and resumes the run.
See the v6 client-executed-tools guide ("Operating in production" section) for additional context — most of it carries forward unchanged.
### Temporal cutover
The v7 cutover on Temporal is a clean break. There is no worker versioning, no v6/v7 coexistence, no v6 drain path:
- Deploy v7 workers. They register `agentWorkflow` and the v7 `GenericActivities` surface only — the v6 `runAgentWorkflow` body, the v6 activity-injection shim, and `TemporalClientToolResolver` are gone.
- Terminate any v6 workflows still running. Their durable state in store-postgres / store-redis is unaffected — affected sessions can be re-executed under v7 against the same `sessionId`.
- If you skip step 2, v6 workflows fail at the next activity call once they hit the v7 worker pool, because the v6 activity names are no longer registered.
There is no rollback path on the Temporal side either: the v6 code is deleted from the package. Pin to `@helix-agents/runtime-temporal@6.x` in your dependency manager if you genuinely need to revert; do not attempt to patch v6 behavior back into v7.
### CFW Workflows v7 stateless cutover
CFW Workflows now exits the workflow instance on HITL boundaries (client tools, approval gates, sub-agent suspensions). v6 deployments that have in-flight workflows blocking on `step.waitForEvent` are not upgrade-compatible — those workflows will not resume under v7.
Cutover procedure:
- Deploy v7 workers alongside v6 workers (Cloudflare Workflows version routing).
- Drain v6 traffic: route new agent runs to v7; allow existing v6 workflows to complete or terminate.
- After all v6 workflows have terminated (typical: minutes to hours), decommission v6 workers.
Forced cut-over: terminate any in-flight v6 workflow instances via `wrangler workflows instance terminate`. Their durable session state is preserved; the next consumer call against the session triggers a fresh v7 workflow instance.
Billable wall-time reduction: v6 billed for the entire HITL wait duration (workflow instance running). v7 billing is bound to the active LLM + tool work plus a few seconds for resume bootstrap. Multi-minute approvals see ~80% wall-time reduction.
## Code migration examples

### Server-side route handler
Before (v6):
```ts
import express from 'express';
import { createFrontendHandler } from '@helix-agents/ai-sdk';
import { JSAgentExecutor } from '@helix-agents/sdk';

const executor = new JSAgentExecutor({ /* ... */ });
const handler = createFrontendHandler({ executor, agent });

const app = express();
app.post('/api/chat', handler);
```

After (v7):
```ts
import express from 'express';
import { AgentServer } from '@helix-agents/agent-server';
import { handleChatStream } from '@helix-agents/ai-sdk';
import { JSAgentExecutor } from '@helix-agents/sdk';

const executor = new JSAgentExecutor({ stateStore, streamManager, /* ... */ });

const server = new AgentServer({
  executor,
  // No auth shown — for production, configure `authenticate` or pass
  // `allowUnauthenticated: true` (v7 fail-fasts in the constructor if
  // neither is set).
  allowUnauthenticated: true,
  chatHandler: (params) =>
    handleChatStream({ executor, stateStore, streamManager, agent }, params),
});

const app = express();

// One call wires every route under /chat:
//   POST /chat
//   GET  /chat/:id/stream
//   POST /chat/:id/submit-tool-result
//   POST /chat/:id/interrupt
//   POST /chat/:id/abort
app.use('/api', server.toExpressMiddleware('/chat'));
```

### Client-side useChat configuration
Before (v6):
```ts
import { useChat } from 'ai/react';
import { HelixChatTransport } from '@helix-agents/ai-sdk';

const transport = new HelixChatTransport({
  api: '/api/chat',
  sessionId,
  resumeFromSequence,
});

const { messages, sendMessage } = useChat({ transport });
```

After (v7):
```ts
import { useChat } from 'ai/react';
import { DefaultChatTransport } from 'ai';
import {
  prepareHelixChatRequest,
  useResumeClientTools,
} from '@helix-agents/ai-sdk/react';

const transport = new DefaultChatTransport({
  api: '/api/chat',
  prepareSendMessagesRequest: prepareHelixChatRequest({
    api: '/api/chat',
    resumeFromSequence: snapshot?.streamSequence,
    existingMessageId: snapshot?.existingMessageId,
  }),
});

const chat = useChat({ transport });

useResumeClientTools({
  chat,
  toolHandlers: {
    editContent: async (input, { toolCallId }) =>
      runEditContentClientSide(input, toolCallId),
  },
});
```

### Tool definition with built-in approval
Before (v6) — roll your own:
```ts
defineTool({
  name: 'sendEmail',
  parameters: z.object({
    to: z.string(),
    subject: z.string(),
    body: z.string(),
  }),
  execute: async (input, ctx) => {
    // v6 had no built-in approval; consumers wired up their own state
    // machine, often by emitting a custom chunk and then waiting on a
    // user-supplied "confirmation" tool.
    await ctx.emit({ type: 'awaiting_approval', input });
    const approved = await waitForApprovalSomehow(ctx);
    if (!approved) throw new Error('rejected');
    await actuallySendEmail(input);
    return { ok: true };
  },
});
```

After (v7) — first-class flag:
```ts
defineTool({
  name: 'sendEmail',
  parameters: z.object({
    to: z.string(),
    subject: z.string(),
    body: z.string(),
  }),
  // Static form — every call requires approval:
  requireApproval: true,
  // Or function form — only require approval over a threshold:
  // requireApproval: (input) => input.body.length > 1000,
  execute: async (input) => {
    // Only runs after the approval response submits with approved=true.
    await actuallySendEmail(input);
    return { ok: true };
  },
});
```

When the LLM calls a `requireApproval` tool, the runtime:
- Emits a `tool_approval_request` stream chunk with the tool name and parsed input.
- Suspends the run with status `suspended_client_tool` (the same primitive carries both client-tool and approval flows; routing happens off the `kind` field of the submission).
- On submission with `{ kind: 'approval-response', approved: true }`, resumes and runs `execute()`.
- On submission with `{ kind: 'approval-response', approved: false, reason }`, emits a `tool_error` chunk (`'Tool call X was not approved by the user'`) and skips `execute()`. A client-side sketch of this round trip follows below.
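On the client, the round trip reduces to watching for the `tool_approval_request` chunk and posting the decision to the documented submit route. A sketch — the request body mirrors the approval-response variant of `SubmitToolResult`; how your UI surfaces the prompt is up to you:

```ts
// POST the user's decision back to the chat handler.
async function submitApproval(
  sessionId: string,
  toolCallId: string,
  approved: boolean,
  reason?: string
) {
  await fetch(`/api/chat/${sessionId}/submit-tool-result`, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ kind: 'approval-response', toolCallId, approved, reason }),
  });
}

// approved === true  → the run resumes and execute() runs.
// approved === false → a tool_error chunk is emitted and execute() is skipped.
```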
### Consuming handle.result()

Before (v6):

```ts
const result = await handle.result();
switch (result.status) {
  case 'completed': /* output ready */ break;
  case 'failed': /* error */ break;
  case 'interrupted': /* resumable */ break;
}
```

After (v7):
```ts
const result = await handle.result();
switch (result.status) {
  case 'completed': /* output ready */ break;
  case 'failed': /* error */ break;
  case 'interrupted': /* resumable */ break;
  case 'suspended_client_tool': /* client must submit results */ break;
  case 'suspended_awaiting_children': /* sub-agents pending */ break;
  case 'suspended_step_partial': /* partial-step suspend */ break;
}
```

The three new `suspended_*` statuses carry a `result.suspended` payload with the routing info (toolCallIds, children, stepId) that the chat handler needs to drive resume. Most consumers will never read this directly — `handleChatStream` does it on their behalf — but it is part of the public type surface, and exhaustive switches need to handle it.
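To keep future status additions from falling through silently, it is worth making the switch provably exhaustive — the standard TypeScript `never` trick:

```ts
function assertNever(x: never): never {
  throw new Error(`unhandled AgentResult.status: ${String(x)}`);
}

switch (result.status) {
  case 'completed':
  case 'failed':
  case 'interrupted':
  case 'suspended_client_tool':
  case 'suspended_awaiting_children':
  case 'suspended_step_partial':
    break;
  default:
    // Compile-time error here the next time a status joins the union.
    assertNever(result.status);
}
```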
### Submit-tool-result for client-executed tools

Before (v6):

```ts
await executor.submitToolResult({
  sessionId,
  toolCallId,
  result: { url: 'https://...' },
});
```

After (v7):
```ts
// client-tool-result variant
await executor.submitToolResult({
  kind: 'client-tool-result',
  sessionId: rootSessionId, // still required — routes to the owning sub-agent
  toolCallId,
  result: { url: 'https://...' },
});

// approval-response variant
await executor.submitToolResult({
  kind: 'approval-response',
  sessionId: rootSessionId,
  toolCallId,
  approved: true,
  reason: 'optional reason string',
});
```

`sessionId` is still required in v7 — it is the root sessionId, used by the executor to route the submission to the owning sub-agent via `SessionState.clientToolCallOwnership`. Per the routing invariant, submissions always go against the root sessionId, even when the pending tool call was emitted by a sub-agent.

The change from v6 is structural: `SubmitToolResult` is now a discriminated union (`kind: 'client-tool-result' | 'approval-response'`) where `sessionId` lives inside the submission object alongside `toolCallId`. v6 callers that omit `kind` are rejected at schema-parse time; the schema lives at `packages/core/src/types/client-tool-submit.ts` and is the source of truth for the field set.
## Rollback semantics
Rolling back from v7 to v6 is unsafe by default. The Postgres V7, D1 V9, and DO V5 migrations add a `suspension_context` column that v6 does not know about. v6 readers will simply ignore the column, but any session that paused under v7 and is then resumed under v6 will silently lose its suspension context — the client tool will appear to never resume, and the consumer will see a hung session.
If a rollback is genuinely necessary:
- Drain HITL agent traffic to zero. Confirm `pending_client_tool_calls` is empty for every active session. Note: this is a top-level TEXT column added in V3 (NOT a JSONB field on a `state` column — earlier versions of this guide had the wrong path; corrected per round-3 review #6 finding P1.M1):

  ```sql
  SELECT count(*) FROM __agents_states
  WHERE pending_client_tool_calls IS NOT NULL
    AND pending_client_tool_calls != '{}'
    AND pending_client_tool_calls != '';
  ```

- Optionally drop the `suspension_context` column to fully revert. This is destructive — any v7-era state in the column is lost, and sessions paused under v7 cannot be revived even by re-upgrading:

  ```sql
  -- Postgres
  ALTER TABLE __agents_states DROP COLUMN suspension_context;
  ```

  (D1 / DO SQLite have equivalents.)

- For sessions stuck in `pending_client_tool_calls` state during rollback, clean up by force-failing each one:

  ```sql
  UPDATE __agents_states
  SET pending_client_tool_calls = '{}', status = 'failed'
  WHERE pending_client_tool_calls IS NOT NULL
    AND pending_client_tool_calls != '{}'
    AND pending_client_tool_calls != '';
  ```

  Then notify affected clients out-of-band.
If you anticipate any chance of rollback, leave the column in place during the rollback (option 1 only). v6 ignores it, and re-upgrading to v7 lets the session pick back up where it left off.
## Validation checklist
Run this checklist after deploying v7 to a non-production environment.
### Pre-flight (storage)
- [ ] Postgres migration V7 applied: `SELECT version FROM __agents_migrations ORDER BY version DESC LIMIT 1` returns `7` or higher.
- [ ] D1 migration V9 applied (Cloudflare deployments): the same query against the D1 binding returns `9` or higher.
- [ ] DO SQLite migration V5 applied (Cloudflare DO deployments): the same query against the DO storage returns `5` or higher.
### Pre-flight (versions)
- [ ] Every `@helix-agents/*` package upgraded to its v7 release version per its `CHANGELOG.md`. Each package follows independent semver — there is no single `^7.0.0` constraint to grep for. Consult per-package changelogs in `packages/*/CHANGELOG.md` for the exact v7 versions.
### Pre-flight (code)
- [ ] No imports of the deleted `HelixChatTransport`: `grep -r "HelixChatTransport" src/` returns no hits.
- [ ] No imports of `JsClientToolResolver` from app code: `grep -r "JsClientToolResolver" src/` returns no hits (the symbol is internal-only in v7).
- [ ] No agents declare `agent.workspaces` on the Temporal or CFW Workflows paths (if applicable). HITL primitives and persistent sub-agents are now supported on every HITL-capable runtime; only `agent.workspaces` outside JS / CF DO remains a v7.1 deferral.
- [ ] No imports of the deleted Temporal v6 surface: `grep -r "runAgentWorkflow\|TemporalClientToolResolver\|executeClientToolInWorkflow\|AgentWorkflowActivities\|AgentWorkflowOptions\|registerToolResultHandler" src/` returns no hits.
### Functional (HITL)
- [ ] Client tool round-trip: start a chat that calls a client-executed tool, verify the run suspends with status `suspended_client_tool`, submit the result, verify the run resumes and produces output. Refresh mid-suspension and verify the stream reattaches.
- [ ] Approval flow: call a `requireApproval: true` tool, verify a `tool_approval_request` chunk emits, submit `{ kind: 'approval-response', approved: true }`, verify execute runs. Repeat with `approved: false` and verify a `tool_error` chunk emits without running execute.
- [ ] Stream resumption: start a long stream (10+ seconds of tokens), refresh the page mid-stream, verify the client reattaches and continues receiving tokens.
- [ ] Long-pause cost: start a chat, pause at a HITL boundary for five minutes, then submit. Inspect billing/wall-time metrics and confirm < 1 minute of runtime cost was billed (compared to ~5 minutes under v6 on the CF DO path).
### Functional (Temporal)
- [ ] HITL on Temporal: repeat the client-tool round-trip and approval-flow tests above against `TemporalAgentExecutor`. Verify `handle.result()` resolves with the v7 `'suspended_*'` statuses and `executor.resume()` starts a NEW workflow instance with a `__resume-N` workflow ID suffix that drains submitted results via the `applyResultsAndReload` activity.
- [ ] Worker registers `agentWorkflow` (or a thin delegate) and no v6 activity names remain in the worker bundle.
### Functional (CFW Workflows + sub-agents)
- [ ] CFW Workflows HITL (smoke test): run the client-tool round-trip and approval-flow tests against a CFW Workflows-deployed agent; verify the workflow instance exits on suspension and that `executor.resume()` starts a fresh instance.
- [ ] Sub-agents (ephemeral and persistent) work on JS, CF DO, and Temporal. CF DO persistent sub-agents dispatch via the `subAgentNamespace` DO stub (commit `fb3180f6b`); ephemeral sub-agents remain unchanged.
### Observability
- [ ] `client_tool_timeout` counter: trending toward 0 over 24h.
- [ ] Langfuse traces: each chat session shows up as a single trace, not one trace per run.
- [ ] `client_tool.suspended` and `client_tool.submitted` events visible in your structured log aggregator.
## Getting help
- Per-runtime client-tool guide: `docs/guide/client-executed-tools.md`
- Approval-gated tools guide: `docs/guide/approval-gates.md`
- Sub-agent docs (ephemeral + persistent): `docs/guide/sub-agents.md`
- State store contracts (`saveStateAndPromoteStaging`, `compareAndSetStatus`, migrations): `docs/guide/state-stores.md`
- Issues: file at the project tracker with the `v7-migration` label.
The v7 release was a large rewrite. If something behaves differently from this document, file an issue — the document is the canonical contract for what v7 should do.