Interrupt and Resume
Helix Agents supports interrupting and resuming agent execution. This enables user-controlled pauses, crash recovery, time-travel debugging, and human-in-the-loop (HITL) workflows.
Overview
When an agent is running, you can:
- Interrupt — soft stop that saves state for later resumption.
- Resume — continue execution from where it stopped.
This is different from abort, which is a hard stop that fails the agent permanently.
v7 changes at a glance
Three things changed in v7 that touch this guide directly:
- Durable interrupt protocol. v6's interrupt(handle) walked an in-memory activeHandles map and returned 503 INTERRUPT_NOT_LOCAL when the handle wasn't on the same process. v7 writes a durable interrupt request to the state store; the running loop picks it up at its next checkpoint regardless of which process owns the handle. The 503 response is gone.
- Three new suspended statuses. RunOutcome.kind (and therefore handle.result().status) gains 'suspended_client_tool', 'suspended_awaiting_children', and 'suspended_step_partial'. Existing switch statements that only handle 'completed' | 'failed' | 'interrupted' will fall through silently for HITL agents.
- Resume is explicit for HITL. When a run suspends at a HITL boundary, submitToolResult no longer wakes an in-memory waiter. The chat handler (or your code) must call executor.resume({ sessionId }) after the submission lands.
If you are upgrading from v6, read the v6 to v7 migration guide end-to-end before deploying. The rest of this page describes the v7 model.
execute() vs resume() Semantics
Understanding the distinction between execute() and resume() is critical for correct agent lifecycle management.
execute() - Fresh Start
execute() always starts a fresh execution. Using the same sessionId continues the conversation within that session:
// New session with a generated ID
const handle = await executor.execute(agent, 'Hello');
// New session with a specific ID
const namedHandle = await executor.execute(agent, 'Hello', { sessionId: 'my-session' });
// Continue the conversation in the existing session
const followUpHandle = await executor.execute(agent, 'Follow up', { sessionId: 'my-session' });
When you call execute():
- Stream is reset - All previous stream chunks are cleared
- New run begins - A new run starts within the session
- Session state continues - Messages and custom state are preserved within the session
This means each execute() call starts a fresh run. If you call execute() with the same sessionId, the new run continues the conversation with access to previous messages. Clients streaming from a previous run will receive a stream_resync event notifying them of the discontinuity.
resume() - Continue Execution
resume() continues from where the agent stopped:
// Continue from the last checkpoint
const newHandle = await handle.resume();
// Or continue with additional context
const redirectedHandle = await handle.resume({
  mode: 'with_message',
  message: 'Please continue',
});
When you call resume():
- Stream is preserved - Existing stream chunks remain intact
- State is loaded - Agent state is restored from the checkpoint
- Execution continues - The agent resumes from the checkpoint step
When to Use Each
| Scenario | Method | Reason |
|---|---|---|
| Starting a new conversation | execute() | Fresh start needed |
| User sends new message to completed agent | execute() | New run, new stream |
| Crash recovery | resume() | Continue where left off |
| User interrupted, wants to continue | resume() | Preserve progress |
| Time-travel to earlier state | resume({ mode: 'from_checkpoint' }) | Branch from history |
| Tool confirmation received | resume({ mode: 'with_confirmation' }) | Continue paused run |
| Retry after failure | retry() | Restore checkpoint, re-attempt |
Stream Behavior Difference
The key technical difference is stream handling:
// execute() resets the stream
// - Clients see: stream_resync (if reconnecting) → new chunks from step 0
// - Old chunks are deleted
// resume() preserves the stream
// - Clients see: existing chunks → new chunks from checkpoint step
// - Stream continuity maintained
This distinction matters for frontend integration. If you're building a chat UI:
- New conversation: Use execute() - the UI should clear and show fresh messages
- Reconnecting after disconnect: Use resume() - the UI should show existing messages and continue
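A minimal sketch of how a chat UI might treat a discontinuity, assuming chunks arrive as objects with a type field. Only the stream_resync name comes from this guide; the Chunk shape, the 'text' chunk type, and the applyChunk helper are hypothetical and not part of Helix Agents.

```typescript
// Hypothetical chunk shapes: only the 'stream_resync' type name is from this guide.
type Chunk =
  | { type: 'stream_resync' }
  | { type: 'text'; text: string };

// Append chunks to the rendered buffer. A resync means earlier chunks no
// longer describe the stream, so the UI starts over with an empty buffer.
function applyChunk(buffer: string[], chunk: Chunk): string[] {
  if (chunk.type === 'stream_resync') {
    return [];
  }
  return [...buffer, chunk.text];
}

const chunks: Chunk[] = [
  { type: 'text', text: 'old answer' },
  { type: 'stream_resync' },
  { type: 'text', text: 'new answer' },
];

// Fold the chunk sequence into what the UI would currently display.
const rendered = chunks.reduce(applyChunk, [] as string[]);
console.log(rendered);
```

Keeping the handler a pure function of (buffer, chunk) makes it easy to replay the same chunk sequence after a reconnect.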
Retrying Failed Agents
Failed agents cannot be resumed with resume(). Use the retry() method for failure recovery.
Why retry() is Separate from resume()
| Aspect | resume() | retry() |
|---|---|---|
| Purpose | Continue interrupted/paused execution | Recover from failure |
| Valid statuses | interrupted, paused | failed |
| Stream behavior | Preserves chunks | Resets to checkpoint |
| State handling | Continues from current | Restores from checkpoint |
retry() Method
const result = await handle.result();
if (result.status === 'failed') {
  // Retry from the last checkpoint, re-providing the triggering message
  const retryHandle = await handle.retry({
    message: 'Research quantum computing',
  });
  // Or retry from a specific checkpoint
  const retryFromCheckpoint = await handle.retry({
    message: 'Research quantum computing',
    checkpointId: 'cpv1-...',
  });
}
Message Requirement
The original user message that triggered the failure is part of the checkpoint state. You typically need to provide a message option to specify what to retry.
Without an explicit message, the retry will attempt to extract the last user message from the session history, but this may not work as expected in all cases. It's safer to always provide the message explicitly.
If no checkpoint exists (the run failed before its first checkpoint), retry() restarts fresh from the triggering message instead of throwing.
RetryOptions
| Option | Type | Default | Description |
|---|---|---|---|
| checkpointId | string | Latest | Which checkpoint to restore from. If provided but unresolvable, retry() throws instead of falling back to a fresh restart. |
| message | string | (extracted) | Message to retry with; providing it explicitly is recommended |
| abortSignal | AbortSignal | - | Abort signal for cancellation |
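The checkpoint rules above (latest by default, fresh restart when none exists, an error for an explicit but unresolvable ID) can be sketched as a small decision function. This is an illustration under assumed shapes: Checkpoint, RetryPlan, and planRetry are hypothetical names, not the framework's API.

```typescript
// Hypothetical shapes for illustration; not the framework's types.
interface Checkpoint {
  id: string;
  stepCount: number;
}

type RetryPlan =
  | { kind: 'restore'; checkpoint: Checkpoint }
  | { kind: 'fresh_restart' };

// Decide how a retry should start, given the known checkpoints (oldest first)
// and an optional explicit checkpoint ID.
function planRetry(checkpoints: Checkpoint[], checkpointId?: string): RetryPlan {
  if (checkpointId !== undefined) {
    const found = checkpoints.find((c) => c.id === checkpointId);
    if (!found) {
      // An explicit but unresolvable ID is an error, never a silent fallback.
      throw new Error(`Checkpoint not found: ${checkpointId}`);
    }
    return { kind: 'restore', checkpoint: found };
  }
  const latest = checkpoints[checkpoints.length - 1];
  // No checkpoint at all means the run failed before its first checkpoint.
  return latest ? { kind: 'restore', checkpoint: latest } : { kind: 'fresh_restart' };
}
```

Throwing on an unresolvable ID keeps a typo in checkpointId from silently discarding progress.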
Concurrency Safety
execute() Safety Check
execute() validates that the session is not already running:
import { AgentAlreadyRunningError } from '@helix-agents/core';
try {
  await executor.execute(agent, message, { sessionId });
} catch (error) {
  if (error instanceof AgentAlreadyRunningError) {
    console.log(`Session ${error.sessionId} is already running`);
    // Wait for the existing execution or use a different session
  }
}
This prevents state corruption from concurrent executions.
Atomic Status Transitions
All methods use appropriate concurrency control:
| Method | Protection | Mechanism |
|---|---|---|
| execute() | Status check + StaleStateError | Rejects if running |
| resume() | CAS | Atomic interrupted/paused → active |
| retry() | CAS | Atomic failed → active |
When concurrent calls race past status checks, optimistic locking via version numbers ensures only one succeeds. The losing call receives a StaleStateError.
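To illustrate the version-number idea, here is a minimal in-memory sketch of a compare-and-set transition. It is not the framework's state store: InMemoryStore, SessionRecord, and RunState are hypothetical, and StaleStateError is a local stand-in for the error named above.

```typescript
type RunState = 'interrupted' | 'paused' | 'failed' | 'active';

interface SessionRecord {
  status: RunState;
  version: number;
}

// Local stand-in for the framework's StaleStateError.
class StaleStateError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'StaleStateError';
    // Keep instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

class InMemoryStore {
  private records = new Map<string, SessionRecord>();

  put(id: string, record: SessionRecord): void {
    this.records.set(id, { ...record });
  }

  get(id: string): SessionRecord | undefined {
    const record = this.records.get(id);
    return record ? { ...record } : undefined;
  }

  // Write only if the stored version still matches what the caller read.
  compareAndSet(id: string, expectedVersion: number, next: RunState): void {
    const current = this.records.get(id);
    if (!current || current.version !== expectedVersion) {
      throw new StaleStateError(`Stale write for session ${id}`);
    }
    this.records.set(id, { status: next, version: current.version + 1 });
  }
}

const store = new InMemoryStore();
store.put('s1', { status: 'interrupted', version: 3 });

// Two callers read the same snapshot, then both try to claim the run.
const a = store.get('s1')!;
const b = store.get('s1')!;
store.compareAndSet('s1', a.version, 'active'); // first writer wins

let loserRejected = false;
try {
  store.compareAndSet('s1', b.version, 'active'); // version has moved on
} catch (error) {
  loserRejected = error instanceof StaleStateError;
}
console.log(loserRejected);
```

A real store would perform the version check and the write atomically (for example in a single transaction or script); the sketch only shows the contract the losing caller observes.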
Basic Usage
Interrupting an Agent
Use handle.interrupt() to pause execution:
const handle = await executor.execute(agent, 'Research quantum computing');
// Later, interrupt the agent
await handle.interrupt('user_requested');
// Agent status is now 'interrupted'
const state = await handle.getState();
console.log(state.status); // 'interrupted'
The agent will stop at the next safe point (between steps). Any in-progress step is rolled back to the last checkpoint.
Durable interrupt protocol (v7)
Interrupts are now durable across processes. v6 looked the handle up in an in-memory activeHandles map and returned HTTP 503 INTERRUPT_NOT_LOCAL if the handle lived on a different process. v7 writes a durable interrupt request to the state store; whichever process owns the run loop picks it up at the next checkpoint and stops gracefully.
This means POST /chat/{id}/interrupt (or POST /interrupt on the executor route) succeeds from any process, regardless of where the run started. No more 503s, no sticky-routing requirements for the interrupt path.
Deadline semantics. Both interruptAgent and abortAgent have a configurable deadline (default ~10 seconds). If the running loop does not acknowledge the interrupt within the deadline, the agent-server HTTP routes return 504 Gateway Timeout. The interrupt request remains durable — the loop will still process it once it reaches a safe point — but the HTTP caller is unblocked. Tune the deadline via the AgentServerConfig.interruptDeadlineMs option (or the equivalent parameter on the chat handler).
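The deadline behavior can be pictured as a race between the loop's acknowledgment and a timer. The sketch below is an assumption-laden illustration, not the agent-server implementation: waitForAck and statusFor are hypothetical helpers, and the 200 returned on acknowledgment is assumed (the guide only specifies the 504 on timeout).

```typescript
// Resolves to 'acknowledged' if the loop confirms in time, otherwise 'timeout'.
function waitForAck(
  ack: Promise<void>,
  deadlineMs: number,
): Promise<'acknowledged' | 'timeout'> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<'timeout'>((resolve) => {
    timer = setTimeout(() => resolve('timeout'), deadlineMs);
  });
  return Promise.race([ack.then(() => 'acknowledged' as const), deadline])
    // Always clear the timer so an early acknowledgment does not leave it pending.
    .finally(() => clearTimeout(timer));
}

// Map the outcome to an HTTP status. 504 on timeout is from this guide;
// 200 on success is an assumption made for the example.
function statusFor(outcome: 'acknowledged' | 'timeout'): number {
  return outcome === 'acknowledged' ? 200 : 504;
}
```

Note that a timeout here only unblocks the caller; the durable interrupt request stays in the state store and is still honored when the loop reaches a safe point.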
Resuming an Agent
Use handle.resume() to continue execution:
// Check whether the session can be resumed
const { canResume, reason } = await handle.canResume();
if (canResume) {
  // Resume execution
  const newHandle = await handle.resume();
  // Stream events from the resumed execution
  for await (const chunk of (await newHandle.stream()) ?? []) {
    console.log(chunk.type);
  }
  // Get the final result
  const result = await newHandle.result();
  console.log(result.output);
} else {
  console.log(`Cannot resume: ${reason}`);
}
Resume Modes
The resume() method supports four modes:
continue (default)
Resume from where the agent stopped:
const newHandle = await handle.resume(); // Same as { mode: 'continue' }
const explicitHandle = await handle.resume({ mode: 'continue' });
with_message
Resume and add a new user message to the conversation:
const newHandle = await handle.resume({
  mode: 'with_message',
  message: 'Actually, focus on quantum entanglement specifically',
});
The message field accepts a string or an array of UserInputMessage objects for multi-message input:
const newHandle = await handle.resume({
  mode: 'with_message',
  message: [
    {
      role: 'user',
      content: 'Here is additional context from the system',
      metadata: { source: 'system' },
    },
    { role: 'user', content: 'Focus on quantum entanglement specifically' },
  ],
});
This is useful when the user wants to redirect the agent's focus.
with_confirmation
Resume a paused tool call with confirmation data:
// Agent paused waiting for confirmation on a tool
const newHandle = await handle.resume({
  mode: 'with_confirmation',
  data: { approved: true, notes: 'Proceed with the action' },
});
This mode is used when a tool requires human-in-the-loop approval.
from_checkpoint
Resume from a specific historical checkpoint (time-travel):
// Get available checkpoints
const state = await handle.getState();
const checkpoints = await stateStore.listCheckpoints(handle.sessionId);
// Resume from an earlier checkpoint
const newHandle = await handle.resume({
  mode: 'from_checkpoint',
  checkpointId: checkpoints.items[0].id,
});
Resuming after HITL suspension (v7)
When a run reaches a client-tool boundary or an approval gate, the run loop suspends durably and handle.result() resolves with one of the suspended_* statuses. To continue:
const result = await handle.result();
switch (result.status) {
  case 'suspended_client_tool': {
    // The client tool / approval gate is now pending submission.
    // Submit results via executor.submitToolResult({ kind, ... })
    // (or via chat.addToolOutput / chat.addToolApprovalResponse on the
    // AI SDK side), then resume:
    const resumed = await executor.resume({ sessionId: handle.sessionId });
    await resumed.result();
    break;
  }
  case 'suspended_awaiting_children':
    // Parent is waiting for sub-agent(s) to finish their own HITL
    // cycles. Once each child resumes and completes, the framework
    // cascades the resume up to the parent automatically.
    break;
  case 'suspended_step_partial':
    // Mid-step suspend: some tools in the step ran, at least one is
    // pending submission. Submit + resume same as suspended_client_tool.
    break;
  case 'completed':
    /* handle output */
    break;
  case 'failed':
    /* handle error */
    break;
  case 'interrupted':
    /* user-requested pause */
    break;
}
For chat-driven UIs, handleChatStream (the canonical chat handler in @helix-agents/ai-sdk) drives this lifecycle for you: submission and resume happen automatically on the server side, and AI SDK v6's stream-close-and-reopen lifecycle reattaches the client.
Crash Recovery
If your process crashes while an agent is running, the state is preserved in the state store. On restart, you can resume:
// After restart, reconnect to the session
const handle = await executor.getHandle(agent, savedSessionId);
if (handle) {
  const { canResume, reason } = await handle.canResume();
  if (canResume) {
    // Resume from the last checkpoint
    const resumed = await handle.resume();
    const result = await resumed.result();
  }
}
For this to work, use a persistent state store like Redis:
import { RedisStateStore, RedisStreamManager } from '@helix-agents/store-redis';
const stateStore = new RedisStateStore({ host: 'localhost' });
const streamManager = new RedisStreamManager({ host: 'localhost' });
const executor = new JSAgentExecutor(stateStore, streamManager, llmAdapter);
What Happens During Resume
When you call resume() on a crashed or interrupted agent, the framework performs cleanup before continuing:
- Load checkpoint - Get the last successfully completed step
- Clean up orphaned staging - Remove any uncommitted staged changes
- Truncate messages - Remove messages added after the checkpoint (prevents duplicates)
- Clean up stream chunks - Remove chunks after the checkpoint step
- Emit stream_resync - Notify connected clients of the discontinuity
- Continue execution - Resume from the checkpoint step
graph LR
subgraph Before ["Before Crash"]
direction LR
S1["Step 1"] --> S2["Step 2"] --> S3["Step 3"] --> S4["Step 4 ✗"]
end
S3 -. "Checkpoint" .-> S3
subgraph After ["After Resume"]
direction LR
R1["Step 1"] --> R2["Step 2"] --> R3["Step 3"] --> R4["Step 4"] --> R5["Step 5"]
end
R3 -. "Resume from<br/>checkpoint" .-> R3
The truncation is critical for preventing duplicate messages. If step 4 partially completed (wrote messages but crashed before checkpoint), those messages are orphaned. truncateMessages() removes them so they're not duplicated when the step re-executes.
Message Count Protection
The framework only truncates messages when a checkpoint exists AND has messageCount > 0. This prevents data loss in edge cases like:
- First-step crash (no checkpoint yet)
- Old checkpoints from before the messageCount field was added
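The truncation rule and its guard can be sketched as a pure function. This is a simplified illustration, not the framework's truncateMessages(): ChatMessage, CheckpointInfo, and truncateToCheckpoint are hypothetical names, and only the messageCount field comes from this guide.

```typescript
interface ChatMessage {
  role: 'user' | 'assistant' | 'tool';
  content: string;
}

// messageCount is the field named in this guide; the rest is hypothetical.
interface CheckpointInfo {
  stepCount: number;
  messageCount?: number;
}

// Drop messages written after the checkpoint, but only when the checkpoint
// carries a usable count. Otherwise leave the history untouched.
function truncateToCheckpoint(
  messages: ChatMessage[],
  checkpoint?: CheckpointInfo,
): ChatMessage[] {
  if (!checkpoint || !checkpoint.messageCount || checkpoint.messageCount <= 0) {
    // First-step crash or a legacy checkpoint: truncating would lose data.
    return messages;
  }
  return messages.slice(0, checkpoint.messageCount);
}
```

For example, with four stored messages and a checkpoint whose messageCount is 3, the fourth (orphaned) message is dropped before the step re-executes.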
Resumable Statuses
An agent can be resumed when its status is:
| Status | Resumable | Description |
|---|---|---|
| running | Yes | Crash recovery: process died but state says running. |
| interrupted | Yes | User interrupted execution. |
| paused | Yes | Waiting for tool confirmation (legacy v6 path). |
| suspended_client_tool (v7) | Yes | One or more client-tool / approval calls are pending submission. |
| suspended_awaiting_children (v7) | Yes | One or more sub-agents are awaiting submissions; parent waits for cascade. |
| suspended_step_partial (v7) | Yes | A step finished some tools but at least one is suspended; mid-step retry. |
| completed | No* | Agent finished successfully. |
| failed | No* | Agent failed with error. |
*resume() does not apply to terminal states — but a completed run is not a dead end. You can continue it as a new turn with execute({ sessionId }) (see Continuing a completed agent below), or time-travel into either terminal state with resume({ mode: 'from_checkpoint' }).
The three new suspended_* statuses carry a result.suspended payload with the routing info (toolCallIds, children, stepId) the chat handler uses to drive resume. Most consumers do not read the payload directly — handleChatStream (from @helix-agents/ai-sdk) does it on their behalf — but it is part of the public type surface and exhaustive switches need to handle the variants.
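One way to keep status handling exhaustive is a switch with a never-typed default, so the compiler flags any status that is added later. The sketch below mirrors the Resumable Statuses table; SessionStatus, NextAction, and nextAction are local, hypothetical names rather than types exported by the framework.

```typescript
// Local union mirroring the statuses listed in the table above.
type SessionStatus =
  | 'running'
  | 'interrupted'
  | 'paused'
  | 'suspended_client_tool'
  | 'suspended_awaiting_children'
  | 'suspended_step_partial'
  | 'completed'
  | 'failed';

type NextAction =
  | 'resume'
  | 'submit_then_resume'
  | 'wait_for_children'
  | 'execute_new_turn'
  | 'retry';

// Compile-time exhaustiveness guard: reaching this means a status was missed.
function assertNever(value: never): never {
  throw new Error(`Unhandled status: ${String(value)}`);
}

function nextAction(status: SessionStatus): NextAction {
  switch (status) {
    case 'running': // crash recovery
    case 'interrupted':
    case 'paused':
      return 'resume';
    case 'suspended_client_tool':
    case 'suspended_step_partial':
      return 'submit_then_resume';
    case 'suspended_awaiting_children':
      return 'wait_for_children';
    case 'completed':
      return 'execute_new_turn';
    case 'failed':
      return 'retry';
    default:
      return assertNever(status);
  }
}
```

If a new member is added to SessionStatus without a matching case, the call to assertNever stops type-checking, which avoids the silent fall-through described earlier.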
Continuing a completed agent
resume() is for suspended runs (interrupted, paused, or a suspended_* HITL boundary). A completed run is continued differently: call execute({ sessionId }) against the completed session to start a new turn — it preserves the session's memory and produces a fresh result, with a fresh per-turn maxSteps budget.
// First turn runs to completion
const h1 = await executor.execute(agent, 'Draft a release note', { sessionId: 'rel-1' });
await h1.result(); // status: 'completed'
// Continue the SAME session as a new turn — memory is retained.
const h2 = await executor.execute(agent, 'Tighten the wording', { sessionId: 'rel-1' });
const r2 = await h2.result(); // fresh typed result, full prior history in context
This now works for structured-output agents too. Previously a completed structured-output agent could not be continued on a real LLM provider: its transcript ended with a dangling __finish__ tool_use that had no matching tool_result, which Anthropic/OpenAI reject. The finish tool call is now paired with a synthetic { acknowledged: true } tool result, so the history is valid for continuation. See Finishing Agents - Continuing a completed structured-output agent.
The cross-runtime contract that execute(sessionId) is valid from completed is documented in the agent lifecycle table — see also the Agent Lifecycle Methods table in Core Concepts. The same continuation mechanism powers persistent companions: re-consulting a completed critic continues it on its preserved session. See the critic-loop recipe.
Stream Events
The framework emits stream chunks for interrupt/resume events:
run_interrupted
Emitted when an agent is interrupted:
{
  type: 'run_interrupted',
  runId: 'run-123',
  checkpointId: 'cpv1-run-123-s5-t1234567890-abc123',
  reason: 'user_requested'
}
run_resumed
Emitted when an agent resumes:
{
  type: 'run_resumed',
  runId: 'run-123',
  fromCheckpointId: 'cpv1-run-123-s5-t1234567890-abc123',
  fromStepCount: 5,
  mode: 'continue'
}
run_paused
Emitted when an agent pauses for confirmation:
{
  type: 'run_paused',
  runId: 'run-123',
  reason: 'tool_confirmation_required',
  pendingToolName: 'delete_file',
  pendingToolCallId: 'call-456'
}
checkpoint_created
Emitted when a checkpoint is saved:
{
  type: 'checkpoint_created',
  runId: 'run-123',
  checkpointId: 'cpv1-run-123-s5-t1234567890-abc123',
  stepCount: 5
}
Error Handling
AgentAlreadyRunningError
Thrown when trying to resume an agent that's already running:
import { AgentAlreadyRunningError } from '@helix-agents/core';
try {
  await handle.resume();
} catch (error) {
  if (error instanceof AgentAlreadyRunningError) {
    console.log(`Agent session ${error.sessionId} is already running`);
  }
}
AgentNotResumableError
Thrown when trying to resume an agent in a terminal state:
import { AgentNotResumableError } from '@helix-agents/core';
try {
  await handle.resume();
} catch (error) {
  if (error instanceof AgentNotResumableError) {
    console.log(`Cannot resume: ${error.currentStatus}`);
  }
}
Runtime Considerations
JavaScript Runtime
The JS runtime handles interrupts in-process. When you call interrupt():
- The current step is marked for rollback
- Staged changes are discarded
- Status is set to 'interrupted'
- The run loop exits cleanly
Temporal Runtime
Temporal uses workflow signals for interrupts:
- interrupt() sends a signal to the workflow
- The workflow handles the signal at the next safe point
- State is persisted via Temporal's durability guarantees
- Resume creates a new workflow execution that continues from the checkpoint
Cloudflare Runtime
Cloudflare Workflows handle interrupts via instance coordination:
- interrupt() updates the state and marks for stop
- The workflow step completes and checks the interrupt flag
- Resume creates a new workflow instance with the same run ID
Interrupt Behavior During Sub-Agent Execution
When an agent spawns sub-agents (child workflows), interrupt signals propagate through the entire agent hierarchy. This enables responsive cancellation even during complex multi-agent operations.
How Interrupt Propagation Works
When you interrupt a parent agent that has running sub-agents:
- Parent receives interrupt - The interrupt flag is set on the parent agent
- Children are signaled - The parent propagates the interrupt to all running children
- Children stop gracefully - Each child detects the interrupt and stops at its next safe point
- Parent completes - Once all children have stopped, the parent returns with status interrupted
Target Response Time: < 1 second from interrupt request to stopped execution, even with deeply nested sub-agents.
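The propagation steps above can be approximated in plain TypeScript with the standard AbortController API. This is a sketch of the general linking pattern only, not the framework's batchController or any runtime's actual mechanism; linkChild is a hypothetical helper.

```typescript
// Create a child controller that aborts whenever the parent signal aborts.
function linkChild(parentSignal: AbortSignal): AbortController {
  const childController = new AbortController();
  if (parentSignal.aborted) {
    // Parent already stopped: the child starts out aborted.
    childController.abort(parentSignal.reason);
  } else {
    parentSignal.addEventListener(
      'abort',
      () => childController.abort(parentSignal.reason),
      { once: true },
    );
  }
  return childController;
}

// Three levels: root agent, sub-agent, and a nested sub-agent.
const rootController = new AbortController();
const subAgent = linkChild(rootController.signal);
const nestedSubAgent = linkChild(subAgent.signal);

// Interrupting the root reaches every descendant through the linked signals.
rootController.abort('user_requested');
console.log(subAgent.signal.aborted, nestedSubAgent.signal.aborted);
```

Each level still has to observe its own signal at a safe point, which is why long-running tools should check the signal inside their loops.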
Per-Runtime Implementation
| Runtime | Mechanism | Latency |
|---|---|---|
| JS | AbortSignal propagation via batchController | Immediate (< 100ms) |
| Temporal | INTERRUPT_SIGNAL_NAME handler + durable interrupt flag | < 1 second |
| Cloudflare DO | Durable interrupt flag observed at next checkpoint | < 1 step boundary |
| Cloudflare Workflows | Durable interrupt flag observed at next step.do | < 1 step boundary |
| DBOS | Durable interrupt flag observed at next workflow checkpoint | < 1 step boundary |
JavaScript Runtime
The JS runtime uses AbortSignal for cooperative cancellation:
// Parent's AbortController is linked to child agents
// When parent.abort() is called, children receive the abort signal immediately
// In custom tools, check the signal:
const myTool = defineTool({
  execute: async (input, context) => {
    // Assumes for illustration that the tool input carries an items array
    const results = [];
    for (const item of input.items) {
      if (context.abortSignal.aborted) {
        return { partial: true, processed: results };
      }
      // ... process item and push its output onto results
    }
    return { partial: false, processed: results };
  },
});
Temporal Runtime
Temporal uses workflow signals with a Trigger primitive for instant response:
// Platform adapter pattern for sub-second interrupt response:
import { Trigger, defineSignal, getExternalWorkflowHandle, setHandler } from '@temporalio/workflow';
// Signal name shown as a literal for illustration; the framework uses INTERRUPT_SIGNAL_NAME
const interruptSignal = defineSignal<[string]>('interrupt');
let interruptReason: string | undefined;
const interruptTrigger = new Trigger<string>();
setHandler(interruptSignal, (reason) => {
  interruptReason = reason;
  interruptTrigger.resolve(reason); // Wake up immediately
});
// The workflow races child completion against the interrupt trigger
// If the interrupt wins, all running children are signaled to stop
Cloudflare Runtime
Cloudflare uses an event-based approach for immediate interrupt response:
When interrupt() is called:
- Interrupt flag is set via stateStore.setInterruptFlag() (persisted)
- Interrupt event is sent via instance.sendEvent() for immediate wake-up
- The workflow races completion events against interrupt events using Promise.race
- Whichever event arrives first wins the race
The workflow checks for interrupts at two points:
- Before each step: In the main execution loop
- Before spawning sub-agents: A pre-spawn check prevents spawning if already interrupted
This event-based approach provides immediate interrupt response (< 100ms) without polling overhead.
Best Practices for Sub-Agent Interrupts
- Keep sub-agent operations bounded - Long-running sub-agents delay interrupt response
- Use interruptible tools - Check context.abortSignal.aborted in loops
- Configure poll intervals appropriately - Shorter intervals = faster response, more overhead
- Design for partial results - Sub-agents should return meaningful partial output when interrupted
Best Practices
1. Always Check canResume()
Before resuming, verify the agent can be resumed:
const { canResume, reason } = await handle.canResume();
if (!canResume) {
  console.log(`Cannot resume: ${reason}`);
  return;
}
2. Handle Resume Errors
Wrap resume calls in try-catch:
try {
  const resumed = await handle.resume();
  await resumed.result();
} catch (error) {
  if (error instanceof AgentAlreadyRunningError) {
    // Another process resumed first
  } else if (error instanceof AgentNotResumableError) {
    // Agent is in a terminal state
  }
}
3. Use Persistent Storage for Production
For crash recovery to work, use Redis or another persistent store:
const stateStore = new RedisStateStore({
  host: process.env.REDIS_HOST,
  ttl: 86400 * 7, // 7 days retention
});
4. Store Session IDs for Later Resumption
Save session IDs so you can reconnect after crashes:
const handle = await executor.execute(agent, input);
// Persist the session ID
await database.save({ sessionId: handle.sessionId, userId });
// Later, after restart
const savedSession = await database.get(userId);
const restoredHandle = await executor.getHandle(agent, savedSession.sessionId);
Next Steps
- Client-executed tools - HITL primitive that triggers suspended_client_tool
- Approval gates - first-class requireApproval flag
- v6 to v7 migration guide - full v7 changes
- Checkpoints - Understand the checkpoint system
- Distributed Coordination - Multi-process coordination
- Streaming - Handle interrupt/resume stream events