Interrupt and Resume

Helix Agents supports interrupting and resuming agent execution. This enables user-controlled pauses, crash recovery, and time-travel debugging.

Overview

When an agent is running, you can:

Interrupt - Soft stop that saves state for later resumption
Resume - Continue execution from where it stopped

This is different from abort, which is a hard stop that fails the agent permanently.

execute() vs resume() Semantics

Understanding the distinction between execute() and resume() is critical for correct agent lifecycle management.

execute() - Fresh Start

execute() always starts a fresh execution. Using the same sessionId continues the conversation within that session:

typescript

// New session with generated ID
const handle = await executor.execute(agent, 'Hello');

// New session with specific ID
const namedHandle = await executor.execute(agent, 'Hello', { sessionId: 'my-session' });

// Continue conversation in existing session
const handle2 = await executor.execute(agent, 'Follow up', { sessionId: 'my-session' });

When you call execute():

Stream is reset - All previous stream chunks are cleared
New run begins - A new run starts within the session
Session state continues - Messages and custom state are preserved within the session

This means each execute() call starts a fresh run. If you call execute() with the same sessionId, the new run continues the conversation with access to previous messages. Clients streaming from a previous run will receive a stream_resync event notifying them of the discontinuity.

resume() - Continue Execution

resume() continues from where the agent stopped:

typescript

// Continue from last checkpoint
const newHandle = await handle.resume();

// Continue with additional context
const handleWithMessage = await handle.resume({
  mode: 'with_message',
  message: 'Please continue',
});

When you call resume():

Stream is preserved - Existing stream chunks up to the checkpoint remain intact
State is loaded - Agent state is restored from the checkpoint
Execution continues - The agent resumes from the checkpoint step

When to Use Each

Scenario	Method	Reason
Starting a new conversation	`execute()`	Fresh start needed
User sends new message to completed agent	`execute()`	New run, new stream
Crash recovery	`resume()`	Continue where left off
User interrupted, wants to continue	`resume()`	Preserve progress
Time-travel to earlier state	`resume({ mode: 'from_checkpoint' })`	Branch from history
Tool confirmation received	`resume({ mode: 'with_confirmation' })`	Continue paused run
Retry after failure	`retry()`	Restore checkpoint, re-attempt

Stream Behavior Difference

The key technical difference is stream handling:

typescript

// execute() resets the stream
// - Clients see: stream_resync (if reconnecting) → new chunks from step 0
// - Old chunks are deleted

// resume() preserves the stream
// - Clients see: existing chunks → new chunks from checkpoint step
// - Stream continuity maintained

This distinction matters for frontend integration. If you're building a chat UI:

New conversation: Use execute() - the UI should clear and show fresh messages
Reconnecting after disconnect: Use resume() - the UI should show existing messages and continue
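As a rough illustration of the client side, the sketch below folds incoming chunks into a chat view and reacts to a discontinuity. Only the stream_resync type name comes from this page. The text chunk type, the step and resumeStep fields, and the applyChunk helper are hypothetical names chosen for the example, so treat the shapes as assumptions rather than the framework's schema.

```typescript
// Hypothetical chunk shapes for illustration; the real chunk schema may differ.
type UiChunk =
  | { type: 'text'; step: number; text: string }
  | { type: 'stream_resync'; resumeStep: number };

interface ChatView {
  lines: { step: number; text: string }[];
}

// Pure reducer: returns the next view state for one incoming chunk.
function applyChunk(view: ChatView, chunk: UiChunk): ChatView {
  if (chunk.type === 'stream_resync') {
    // Discontinuity: drop everything rendered at or after the step that
    // will be streamed again. A resumeStep of 0 clears the whole view.
    return { lines: view.lines.filter((line) => line.step < chunk.resumeStep) };
  }
  return { lines: [...view.lines, { step: chunk.step, text: chunk.text }] };
}

// Steps 1 and 2 were rendered, then the client learns step 2 will be replayed.
const replayedChunks: UiChunk[] = [
  { type: 'text', step: 1, text: 'Searching...' },
  { type: 'text', step: 2, text: 'partial answer' },
  { type: 'stream_resync', resumeStep: 2 },
  { type: 'text', step: 2, text: 'full answer' },
];

const finalView = replayedChunks.reduce<ChatView>(applyChunk, { lines: [] });
// Two lines survive: 'Searching...' and 'full answer'
console.log(finalView.lines.map((line) => line.text));
```

Keeping the reducer pure means the same function can serve both cases: a full reset after execute() and a partial rewind after resume().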

Retrying Failed Agents

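A failed agent cannot be continued with a plain resume() call. Use the retry() method for failure recovery; resume() accepts a failed agent only in from_checkpoint mode, for time-travel to an earlier state.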

Why retry() is Separate from resume()

Aspect	resume()	retry()
Purpose	Continue interrupted/paused execution	Recover from failure
Valid statuses	interrupted, paused	failed
Stream behavior	Preserves chunks	Resets to checkpoint
State handling	Continues from current	Restores from checkpoint

retry() Method

typescript

const result = await handle.result();
if (result.status === 'failed') {
  // The three calls below are alternatives; use only one per failed run.

  // Retry from last checkpoint (default) - typically needs message
  const retryHandle = await handle.retry({
    message: 'Research quantum computing', // Re-provide the triggering message
  });

  // Or from specific checkpoint
  const retryFromCheckpoint = await handle.retry({
    message: 'Research quantum computing',
    checkpointId: 'cpv1-...',
  });

  // Or start completely fresh
  const retryFromStart = await handle.retry({
    mode: 'from_start',
    message: 'Research quantum computing',
  });
}

Message Requirement

When using from_checkpoint mode (default), the original user message that triggered the failure is part of the checkpoint state. You typically need to provide a message option to specify what to retry.

Without an explicit message, the retry will attempt to extract the last user message from the session history, but this may not work as expected in all cases. It's safer to always provide the message explicitly.
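To make the fallback concrete, here is a minimal sketch of what "extract the last user message" could look like, and why an explicit message is the safer input. The HistoryEntry shape and both helper names are assumptions made for this example; the framework's actual extraction logic is not shown on this page.

```typescript
// Hypothetical message shape; the session history format is an assumption.
interface HistoryEntry {
  role: 'system' | 'user' | 'assistant' | 'tool';
  content: string;
}

// Walk the history backwards and return the most recent user message, if any.
function lastUserMessage(entries: HistoryEntry[]): string | undefined {
  for (let i = entries.length - 1; i >= 0; i--) {
    const entry = entries[i];
    if (entry !== undefined && entry.role === 'user') return entry.content;
  }
  return undefined;
}

// Prefer the explicit message; fall back to extraction only when none is given.
function resolveRetryMessage(explicit: string | undefined, entries: HistoryEntry[]): string {
  const message = explicit ?? lastUserMessage(entries);
  if (message === undefined) {
    throw new Error('No message to retry with: pass one explicitly');
  }
  return message;
}

const sampleEntries: HistoryEntry[] = [
  { role: 'user', content: 'Research quantum computing' },
  { role: 'assistant', content: 'Starting research...' },
];

console.log(resolveRetryMessage(undefined, sampleEntries)); // Research quantum computing
console.log(resolveRetryMessage('Research qubits only', sampleEntries)); // Research qubits only
```

The extraction path depends on what happens to be in the history, which is exactly the situation where it "may not work as expected".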

RetryOptions

Option	Type	Default	Description
`mode`	`'from_checkpoint' \| 'from_start'`	`'from_checkpoint'`	How to retry
`checkpointId`	`string`	Latest	Which checkpoint to restore from
`message`	`string`	(extracted)	Message to retry with - recommended to always provide
`abortSignal`	`AbortSignal`	-	Abort signal for cancellation

Concurrency Safety

execute() Safety Check

execute() validates that the session is not already running:

typescript

import { AgentAlreadyRunningError } from '@helix-agents/core';

try {
  await executor.execute(agent, message, { sessionId });
} catch (error) {
  if (error instanceof AgentAlreadyRunningError) {
    console.log(`Session ${error.sessionId} is already running`);
    // Wait for existing execution or use different session
  }
}

This prevents state corruption from concurrent executions.

Atomic Status Transitions

All methods use appropriate concurrency control:

Method	Protection	Mechanism
`execute()`	Status check + StaleStateError	Rejects if running
`resume()`	CAS	Atomic interrupted/paused → active
`retry()`	CAS	Atomic failed → active

When concurrent calls race past status checks, optimistic locking via version numbers ensures only one succeeds. The losing call receives a StaleStateError.
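The version-number idea can be sketched with an in-memory store. This is an illustration of optimistic locking in general, not the framework's state store: VersionedStore, compareAndSet, and the record shape are hypothetical, and the sketch returns a boolean where the framework reports a StaleStateError to the losing caller.

```typescript
// In-memory sketch of optimistic locking; all names here are hypothetical.
type LockStatus = 'active' | 'interrupted' | 'paused' | 'failed';

interface VersionedSession {
  status: LockStatus;
  version: number;
}

class VersionedStore {
  private sessions = new Map<string, VersionedSession>();

  put(id: string, status: LockStatus): void {
    this.sessions.set(id, { status, version: 1 });
  }

  read(id: string): VersionedSession {
    const session = this.sessions.get(id);
    if (session === undefined) throw new Error(`Unknown session ${id}`);
    return { ...session }; // callers get a snapshot, not the live record
  }

  // Compare-and-swap: write only if nobody has written since the snapshot.
  compareAndSet(id: string, expectedVersion: number, next: LockStatus): boolean {
    const current = this.sessions.get(id);
    if (current === undefined || current.version !== expectedVersion) return false;
    this.sessions.set(id, { status: next, version: current.version + 1 });
    return true;
  }
}

const lockStore = new VersionedStore();
lockStore.put('my-session', 'interrupted');

// Two callers race past the status check holding the same snapshot...
const snapshotA = lockStore.read('my-session');
const snapshotB = lockStore.read('my-session');

// ...but only the first write matches the stored version.
const firstWins = lockStore.compareAndSet('my-session', snapshotA.version, 'active');
const secondWins = lockStore.compareAndSet('my-session', snapshotB.version, 'active');
console.log(firstWins, secondWins); // true false
```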

Basic Usage

Interrupting an Agent

Use handle.interrupt() to pause execution:

typescript

const handle = await executor.execute(agent, 'Research quantum computing');

// Later, interrupt the agent
await handle.interrupt('user_requested');

// Agent status is now 'interrupted'
const state = await handle.getState();
console.log(state.status); // 'interrupted'

The agent will stop at the next safe point (between steps). Any in-progress step is rolled back to the last checkpoint.
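The safe-point and rollback behavior can be pictured with a toy run loop. Everything in this sketch (ToyRun, ToyStep, runSteps) is hypothetical and greatly simplified; it only models the two rules stated above: interrupts are honored between steps, and the result of a step that was in progress is discarded instead of committed.

```typescript
// Toy run loop; names and shapes are hypothetical, not the framework's internals.
interface ToyRun {
  committed: string[]; // results of checkpointed steps
  interruptRequested: boolean;
  status: 'running' | 'interrupted' | 'completed';
}

type ToyStep = (run: ToyRun) => string; // returns a staged (uncommitted) result

function shouldStop(run: ToyRun): boolean {
  return run.interruptRequested;
}

function runSteps(run: ToyRun, steps: ToyStep[]): ToyRun {
  // Skip steps whose results are already committed (this is what makes resume work).
  for (const step of steps.slice(run.committed.length)) {
    if (shouldStop(run)) break; // safe point between steps
    const staged = step(run);
    if (shouldStop(run)) break; // interrupted mid-step: staged result is discarded
    run.committed.push(staged); // the commit acts as the checkpoint
  }
  run.status = shouldStop(run) ? 'interrupted' : 'completed';
  return run;
}

const toySteps: ToyStep[] = [
  () => 'step 1 done',
  (run) => {
    run.interruptRequested = true; // simulate an interrupt arriving mid-step
    return 'step 2 done';
  },
  () => 'step 3 done',
];

const toyRun = runSteps({ committed: [], interruptRequested: false, status: 'running' }, toySteps);
console.log(toyRun.status, toyRun.committed.length); // interrupted 1

// Resuming clears the flag and re-executes step 2 from the last checkpoint.
toyRun.interruptRequested = false;
toySteps[1] = () => 'step 2 done';
runSteps(toyRun, toySteps);
console.log(toyRun.status, toyRun.committed.length); // completed 3
```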

Resuming an Agent

Use handle.resume() to continue execution:

typescript

// Check if we can resume
const { canResume, reason } = await handle.canResume();

if (canResume) {
  // Resume execution
  const newHandle = await handle.resume();

  // Stream events from the resumed execution
  for await (const chunk of (await newHandle.stream()) ?? []) {
    console.log(chunk.type);
  }

  // Get final result
  const result = await newHandle.result();
  console.log(result.output);
}

Resume Modes

The resume() method supports four modes:

continue (default)

Resume from where the agent stopped:

typescript

// These two calls are equivalent; use one or the other
const newHandle = await handle.resume();
// const newHandle = await handle.resume({ mode: 'continue' });

with_message

Resume and add a new user message to the conversation:

typescript

const newHandle = await handle.resume({
  mode: 'with_message',
  message: 'Actually, focus on quantum entanglement specifically',
});

This is useful when the user wants to redirect the agent's focus.

with_confirmation

Resume a paused tool call with confirmation data:

typescript

// Agent paused waiting for confirmation on a tool
const newHandle = await handle.resume({
  mode: 'with_confirmation',
  data: { approved: true, notes: 'Proceed with the action' },
});

This mode is used when a tool requires human-in-the-loop approval.

from_checkpoint

Resume from a specific historical checkpoint (time-travel):

typescript

// Get available checkpoints (stateStore is the store instance passed to the executor)
const checkpoints = await stateStore.listCheckpoints(handle.sessionId);

// Resume from an earlier checkpoint
const newHandle = await handle.resume({
  mode: 'from_checkpoint',
  checkpointId: checkpoints.items[0].id,
});

Crash Recovery

If your process crashes while an agent is running, the state is preserved in the state store. On restart, you can resume:

typescript

// After restart, reconnect to the session
const handle = await executor.getHandle(agent, savedSessionId);

if (handle) {
  const { canResume, reason } = await handle.canResume();

  if (canResume) {
    // Resume from last checkpoint
    const resumed = await handle.resume();
    const result = await resumed.result();
  }
}

For this to work, use a persistent state store like Redis:

typescript

import { RedisStateStore, RedisStreamManager } from '@helix-agents/store-redis';
// JSAgentExecutor and llmAdapter come from your runtime and LLM adapter packages (imports omitted here)

const stateStore = new RedisStateStore({ host: 'localhost' });
const streamManager = new RedisStreamManager({ host: 'localhost' });
const executor = new JSAgentExecutor(stateStore, streamManager, llmAdapter);

What Happens During Resume

When you call resume() on a crashed or interrupted agent, the framework performs cleanup before continuing:

Load checkpoint - Get the last successfully completed step
Clean up orphaned staging - Remove any uncommitted staged changes
Truncate messages - Remove messages added after the checkpoint (prevents duplicates)
Clean up stream chunks - Remove chunks after the checkpoint step
Emit stream_resync - Notify connected clients of the discontinuity
Continue execution - Resume from the checkpoint step

mermaid

graph LR
    subgraph Before ["Before Crash"]
        direction LR
        S1["Step 1"] --> S2["Step 2"] --> S3["Step 3"] --> S4["Step 4 ✗"]
    end

    S3 -. "Checkpoint" .-> S3

    subgraph After ["After Resume"]
        direction LR
        R1["Step 1"] --> R2["Step 2"] --> R3["Step 3"] --> R4["Step 4"] --> R5["Step 5"]
    end

    R3 -. "Resume from<br/>checkpoint" .-> R3

The truncation is critical for preventing duplicate messages. If step 4 partially completed (wrote messages but crashed before checkpoint), those messages are orphaned. truncateMessages() removes them so they're not duplicated when the step re-executes.

Message Count Protection

The framework only truncates messages when a checkpoint exists AND has messageCount > 0. This prevents data loss in edge cases like:

First-step crash (no checkpoint yet)
Old checkpoints from before messageCount field was added
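The truncation and cleanup rules above can be sketched as a pure function. The shapes below are assumptions for illustration; only the messageCount field, the "truncate when messageCount > 0" rule, and the step-based chunk cleanup come from this page. Leaving chunks untouched when no checkpoint exists is also an assumption of the sketch.

```typescript
// Hypothetical shapes; the framework's checkpoint and session records may differ.
interface CheckpointInfo {
  stepCount: number;
  messageCount?: number; // absent on checkpoints written before the field existed
}

interface StoredChunk {
  step: number;
  data: string;
}

interface StoredSession {
  messages: string[];
  chunks: StoredChunk[];
}

function cleanUpForResume(session: StoredSession, checkpoint?: CheckpointInfo): StoredSession {
  // No checkpoint (first-step crash): nothing trustworthy to truncate to.
  if (checkpoint === undefined) return session;

  const count = checkpoint.messageCount ?? 0;
  const lastStep = checkpoint.stepCount;
  return {
    // Truncate only when the checkpoint recorded a positive message count.
    messages: count > 0 ? session.messages.slice(0, count) : session.messages,
    // Drop stream chunks produced after the checkpointed step.
    chunks: session.chunks.filter((chunk) => chunk.step <= lastStep),
  };
}

const crashedSession: StoredSession = {
  messages: ['user: hi', 'assistant: step 3 reply', 'assistant: orphaned step 4 text'],
  chunks: [
    { step: 3, data: 'a' },
    { step: 4, data: 'b' },
  ],
};

const cleanedSession = cleanUpForResume(crashedSession, { stepCount: 3, messageCount: 2 });
console.log(cleanedSession.messages.length, cleanedSession.chunks.length); // 2 1
```

Without the messageCount guard, an old checkpoint with no recorded count would be read as "keep zero messages" and wipe the conversation.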

Resumable Statuses

An agent can be resumed when its status is:

Status	Resumable	Description
`running`	Yes	Crash recovery - process died but state says running
`interrupted`	Yes	User interrupted execution
`paused`	Yes	Waiting for tool confirmation
`completed`	No*	Agent finished successfully
`failed`	No*	Agent failed with error

*Terminal states can only be resumed with mode: 'from_checkpoint' to time-travel to a previous state.
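The table and its footnote reduce to a small decision function. This is a sketch of the rule only: checkResumable and its mode parameter are hypothetical, and the real handle.canResume() shown earlier takes no arguments.

```typescript
// Sketch of the resumability rule from the table above; names are hypothetical.
type RunStatus = 'running' | 'interrupted' | 'paused' | 'completed' | 'failed';
type ResumeModeName = 'continue' | 'with_message' | 'with_confirmation' | 'from_checkpoint';

interface ResumeCheck {
  canResume: boolean;
  reason?: string;
}

function checkResumable(status: RunStatus, mode: ResumeModeName = 'continue'): ResumeCheck {
  const terminal = status === 'completed' || status === 'failed';
  // Non-terminal statuses are always resumable; terminal ones only via time-travel.
  if (!terminal || mode === 'from_checkpoint') return { canResume: true };
  return {
    canResume: false,
    reason: `Status '${status}' is terminal; only 'from_checkpoint' can resume it`,
  };
}

console.log(checkResumable('interrupted').canResume); // true
console.log(checkResumable('failed').canResume); // false
console.log(checkResumable('failed', 'from_checkpoint').canResume); // true
```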

Stream Events

The framework emits stream chunks for interrupt/resume events:

run_interrupted

Emitted when an agent is interrupted:

typescript

{
  type: 'run_interrupted',
  runId: 'run-123',
  checkpointId: 'cpv1-run-123-s5-t1234567890-abc123',
  reason: 'user_requested'
}

run_resumed

Emitted when an agent resumes:

typescript

{
  type: 'run_resumed',
  runId: 'run-123',
  fromCheckpointId: 'cpv1-run-123-s5-t1234567890-abc123',
  fromStepCount: 5,
  mode: 'continue'
}

run_paused

Emitted when an agent pauses for confirmation:

typescript

{
  type: 'run_paused',
  runId: 'run-123',
  reason: 'tool_confirmation_required',
  pendingToolName: 'delete_file',
  pendingToolCallId: 'call-456'
}

checkpoint_created

Emitted when a checkpoint is saved:

typescript

{
  type: 'checkpoint_created',
  runId: 'run-123',
  checkpointId: 'cpv1-run-123-s5-t1234567890-abc123',
  stepCount: 5
}
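One way to consume these four chunks on the client is a discriminated union with an exhaustive switch. The field names below are copied from the examples above; the TypeScript types themselves (for instance, treating mode and reason as plain strings) and the describeLifecycleChunk helper are assumptions made for this sketch.

```typescript
// Types inferred from the example payloads above; the framework's own
// type definitions may be stricter or carry additional fields.
type LifecycleChunk =
  | { type: 'run_interrupted'; runId: string; checkpointId: string; reason: string }
  | { type: 'run_resumed'; runId: string; fromCheckpointId: string; fromStepCount: number; mode: string }
  | { type: 'run_paused'; runId: string; reason: string; pendingToolName: string; pendingToolCallId: string }
  | { type: 'checkpoint_created'; runId: string; checkpointId: string; stepCount: number };

// Exhaustive switch: the compiler flags any chunk type left unhandled.
function describeLifecycleChunk(chunk: LifecycleChunk): string {
  switch (chunk.type) {
    case 'run_interrupted':
      return `Run ${chunk.runId} interrupted (${chunk.reason})`;
    case 'run_resumed':
      return `Run ${chunk.runId} resumed from step ${chunk.fromStepCount} (${chunk.mode})`;
    case 'run_paused':
      return `Run ${chunk.runId} waiting on ${chunk.pendingToolName}`;
    case 'checkpoint_created':
      return `Run ${chunk.runId} checkpointed at step ${chunk.stepCount}`;
  }
}

console.log(
  describeLifecycleChunk({
    type: 'run_paused',
    runId: 'run-123',
    reason: 'tool_confirmation_required',
    pendingToolName: 'delete_file',
    pendingToolCallId: 'call-456',
  }),
); // Run run-123 waiting on delete_file
```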

Error Handling

AgentAlreadyRunningError

Thrown when trying to resume an agent that's already running:

typescript

import { AgentAlreadyRunningError } from '@helix-agents/core';

try {
  await handle.resume();
} catch (error) {
  if (error instanceof AgentAlreadyRunningError) {
    console.log(`Agent session ${error.sessionId} is already running`);
  }
}

AgentNotResumableError

Thrown when trying to resume an agent in a terminal state:

typescript

import { AgentNotResumableError } from '@helix-agents/core';

try {
  await handle.resume();
} catch (error) {
  if (error instanceof AgentNotResumableError) {
    console.log(`Cannot resume: ${error.currentStatus}`);
  }
}

Runtime Considerations

JavaScript Runtime

The JS runtime handles interrupts in-process. When you call interrupt():

The current step is marked for rollback
Staged changes are discarded
Status is set to 'interrupted'
The run loop exits cleanly

Temporal Runtime

Temporal uses workflow signals for interrupts:

interrupt() sends a signal to the workflow
The workflow handles the signal at the next safe point
State is persisted via Temporal's durability guarantees
Resume creates a new workflow execution that continues from the checkpoint

Cloudflare Runtime

Cloudflare Workflows handle interrupts via instance coordination:

interrupt() updates the state and marks for stop
The workflow step completes and checks the interrupt flag
Resume creates a new workflow instance with the same run ID

Interrupt Behavior During Sub-Agent Execution

When an agent spawns sub-agents (child workflows), interrupt signals propagate through the entire agent hierarchy. This enables responsive cancellation even during complex multi-agent operations.

How Interrupt Propagation Works

When you interrupt a parent agent that has running sub-agents:

Parent receives interrupt - The interrupt flag is set on the parent agent
Children are signaled - The parent propagates the interrupt to all running children
Children stop gracefully - Each child detects the interrupt and stops at its next safe point
Parent completes - Once all children have stopped, the parent returns with status interrupted

Target Response Time: < 1 second from interrupt request to stopped execution, even with deeply nested sub-agents.

Per-Runtime Implementation

Runtime	Mechanism	Latency
JS	AbortSignal propagation via `batchController`	Immediate (< 100ms)
Temporal	Signal handlers + `Trigger` primitive for instant wake	< 1 second
Cloudflare	Event-based: Promise.race with interrupt events	Immediate (< 100ms)

JavaScript Runtime

The JS runtime uses AbortSignal for cooperative cancellation:

typescript

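// Parent's AbortController is linked to child agents
// When parent.abort() is called, children receive abort signal immediately

// In custom tools, check the signal between units of work:
const myTool = defineTool({
  // name, description, and input schema omitted for brevity
  execute: async (input, context) => {
    const results = [];
    for (const item of input.items) { // assumes the tool input carries an `items` array
      if (context.abortSignal.aborted) {
        // Return what has been processed so far instead of throwing
        return { partial: true, processed: results };
      }
      results.push(await processItem(item)); // processItem: placeholder for your per-item logic
    }
    return { partial: false, processed: results };
  },
});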
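The "linked controller" idea from the comments above can be shown with the standard AbortController API alone. This is a generic sketch, not the framework's batchController implementation; linkChildController is a hypothetical helper name.

```typescript
// Creates a child controller that aborts as soon as the parent signal does.
function linkChildController(parentSignal: AbortSignal): AbortController {
  const child = new AbortController();
  if (parentSignal.aborted) {
    child.abort(); // parent already stopped: the child starts out aborted
  } else {
    parentSignal.addEventListener('abort', () => child.abort(), { once: true });
  }
  return child;
}

// Three levels: root agent -> sub-agent -> nested sub-agent.
const rootController = new AbortController();
const childController = linkChildController(rootController.signal);
const grandchildController = linkChildController(childController.signal);

// Aborting the root reaches every descendant in the same tick.
rootController.abort();
console.log(childController.signal.aborted, grandchildController.signal.aborted); // true true
```

Because abort listeners run synchronously, the whole chain is aborted by the time abort() returns, which is consistent with the "immediate" latency listed for the JS runtime.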

Temporal Runtime

Temporal uses workflow signals with a Trigger primitive for instant response:

typescript

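// Platform adapter pattern for sub-second interrupt response:
import { Trigger, defineSignal, getExternalWorkflowHandle, setHandler } from '@temporalio/workflow';

// Signal definition (the signal name here is illustrative)
const interruptSignal = defineSignal<[string]>('interrupt');

let interruptReason: string | undefined;
const interruptTrigger = new Trigger<string>();

// Register the handler inside the workflow function
setHandler(interruptSignal, (reason) => {
  interruptReason = reason;
  interruptTrigger.resolve(reason); // Wake up immediately
});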

// The workflow races child completion against interrupt trigger
// If interrupt wins, all running children are signaled to stop

Cloudflare Runtime

Cloudflare uses an event-based approach for immediate interrupt response:

When interrupt() is called:

Interrupt flag is set via stateStore.setInterruptFlag() (persisted)
Interrupt event is sent via instance.sendEvent() for immediate wake-up
The workflow races completion events against interrupt events using Promise.race
Whichever event arrives first wins the race

The workflow checks for interrupts at two points:

Before each step: In the main execution loop
Before spawning sub-agents: A pre-spawn check prevents spawning if already interrupted

This event-based approach provides immediate interrupt response (< 100ms) without polling overhead.
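The racing step can be sketched with plain promises. This is not Cloudflare Workflows code: the two promises stand in for "child completed" and "interrupt event received", and raceWithInterrupt and RaceOutcome are hypothetical names used only for this example.

```typescript
// Outcome of racing a unit of work against an interrupt event.
type RaceOutcome =
  | { kind: 'completed'; value: string }
  | { kind: 'interrupted'; reason: string };

// Whichever promise settles first decides the outcome.
function raceWithInterrupt(work: Promise<string>, interrupt: Promise<string>): Promise<RaceOutcome> {
  const completed: Promise<RaceOutcome> = work.then(
    (value): RaceOutcome => ({ kind: 'completed', value }),
  );
  const interrupted: Promise<RaceOutcome> = interrupt.then(
    (reason): RaceOutcome => ({ kind: 'interrupted', reason }),
  );
  return Promise.race([completed, interrupted]);
}

async function demoRace(): Promise<string[]> {
  // Case 1: the work finishes and no interrupt event ever arrives.
  const neverInterrupted = new Promise<string>(() => {});
  const first = await raceWithInterrupt(Promise.resolve('child output'), neverInterrupted);

  // Case 2: the interrupt event arrives while the work is still running.
  const slowWork = new Promise<string>((resolve) => setTimeout(() => resolve('late'), 50));
  const second = await raceWithInterrupt(slowWork, Promise.resolve('user_requested'));

  return [first.kind, second.kind];
}

demoRace().then((kinds) => console.log(kinds.join(', '))); // completed, interrupted
```

Note that losing the race does not cancel the slower promise; in the workflow described above, stopping the children is a separate step that follows the interrupt.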

Best Practices for Sub-Agent Interrupts

Keep sub-agent operations bounded - Long-running sub-agents delay interrupt response
Use interruptible tools - Check context.abortSignal.aborted in loops
Configure poll intervals appropriately - Where interrupt detection relies on polling, shorter intervals mean faster response but more overhead
Design for partial results - Sub-agents should return meaningful partial output when interrupted

Best Practices

1. Always Check canResume()

Before resuming, verify the agent can be resumed:

typescript

const { canResume, reason } = await handle.canResume();
if (!canResume) {
  console.log(`Cannot resume: ${reason}`);
  return;
}

2. Handle Resume Errors

Wrap resume calls in try-catch:

typescript

import { AgentAlreadyRunningError, AgentNotResumableError } from '@helix-agents/core';

try {
  const resumed = await handle.resume();
  await resumed.result();
} catch (error) {
  if (error instanceof AgentAlreadyRunningError) {
    // Another process resumed first
  } else if (error instanceof AgentNotResumableError) {
    // Agent is in terminal state
  }
}

3. Use Persistent Storage for Production

For crash recovery to work, use Redis or another persistent store:

typescript

const stateStore = new RedisStateStore({
  host: process.env.REDIS_HOST,
  ttl: 86400 * 7, // 7 days retention
});

4. Store Session IDs for Later Resumption

Save session IDs so you can reconnect after crashes:

typescript

const handle = await executor.execute(agent, input);

// Persist the session ID
await database.save({ sessionId: handle.sessionId, userId });

// Later, after restart
const savedSession = await database.get(userId);
const restoredHandle = await executor.getHandle(agent, savedSession.sessionId);

Next Steps

Checkpoints - Understand the checkpoint system
Distributed Coordination - Multi-process coordination
Streaming - Handle interrupt/resume stream events
