
Interrupt and Resume

Helix Agents supports interrupting and resuming agent execution. This enables user-controlled pauses, crash recovery, and time-travel debugging.

Overview

When an agent is running, you can:

  • Interrupt - Soft stop that saves state for later resumption
  • Resume - Continue execution from where it stopped

This is different from abort, which is a hard stop that fails the agent permanently.

execute() vs resume() Semantics

Understanding the distinction between execute() and resume() is critical for correct agent lifecycle management.

execute() - Fresh Start

execute() always starts a fresh run. Passing an existing sessionId continues the conversation within that session:

typescript
// New session with generated ID
const handle = await executor.execute(agent, 'Hello');

// New session with specific ID
const namedHandle = await executor.execute(agent, 'Hello', { sessionId: 'my-session' });

// Continue conversation in existing session (once the previous run has finished)
const handle2 = await executor.execute(agent, 'Follow up', { sessionId: 'my-session' });

When you call execute():

  1. Stream is reset - All previous stream chunks are cleared
  2. New run begins - A new run starts within the session
  3. Session state continues - Messages and custom state are preserved within the session

If you call execute() with an existing sessionId, the new run continues the conversation with access to previous messages. Clients still streaming from the previous run receive a stream_resync event notifying them of the discontinuity.

resume() - Continue Execution

resume() continues from where the agent stopped:

typescript
// Continue from last checkpoint
const newHandle = await handle.resume();

// Or continue with additional context
const resumedWithMessage = await handle.resume({
  mode: 'with_message',
  message: 'Please continue',
});

When you call resume():

  1. Stream is preserved - Existing stream chunks remain intact
  2. State is loaded - Agent state is restored from the checkpoint
  3. Execution continues - The agent resumes from the checkpoint step

When to Use Each

| Scenario | Method | Reason |
| --- | --- | --- |
| Starting a new conversation | execute() | Fresh start needed |
| User sends new message to completed agent | execute() | New run, new stream |
| Crash recovery | resume() | Continue where left off |
| User interrupted, wants to continue | resume() | Preserve progress |
| Time-travel to earlier state | resume({ mode: 'from_checkpoint' }) | Branch from history |
| Tool confirmation received | resume({ mode: 'with_confirmation' }) | Continue paused run |
| Retry after failure | retry() | Restore checkpoint, re-attempt |

Stream Behavior Difference

The key technical difference is stream handling:

typescript
// execute() resets the stream
// - Clients see: stream_resync (if reconnecting) → new chunks from step 0
// - Old chunks are deleted

// resume() preserves the stream
// - Clients see: existing chunks → new chunks from checkpoint step
// - Stream continuity maintained

This distinction matters for frontend integration. If you're building a chat UI:

  • New conversation: Use execute() - the UI should clear and show fresh messages
  • Reconnecting after disconnect: Use resume() - the UI should show existing messages and continue
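
On the client side, one simple policy is to discard what has been rendered whenever a stream_resync chunk arrives and rebuild from the chunks that follow. The sketch below illustrates that policy only. The `Chunk` shape, the `text` chunk type, and `applyChunk` are hypothetical; only the `stream_resync` type name comes from this page.

```typescript
// Hypothetical chunk shapes: only the 'stream_resync' type name comes from
// this page; 'text' chunks and their fields are illustrative.
type Chunk =
  | { type: 'stream_resync' }
  | { type: 'text'; step: number; content: string };

// Fold one incoming chunk into the list of strings the UI renders.
function applyChunk(rendered: string[], chunk: Chunk): string[] {
  if (chunk.type === 'stream_resync') {
    // The stream was reset (for example by a new execute() call):
    // drop what was rendered and rebuild from the chunks that follow.
    return [];
  }
  return [...rendered, chunk.content];
}

// A client that was still connected when a new run started.
const received: Chunk[] = [
  { type: 'text', step: 0, content: 'old answer' },
  { type: 'stream_resync' },
  { type: 'text', step: 0, content: 'new answer' },
];

const view = received.reduce<string[]>(applyChunk, []);
console.log(view);
```

A real UI might prefer to keep earlier messages and only replace the in-progress one; the point is that stream_resync marks the place where that decision has to be made.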

Retrying Failed Agents

Failed agents cannot be resumed with resume(). Use the retry() method for failure recovery.

Why retry() is Separate from resume()

| Aspect | resume() | retry() |
| --- | --- | --- |
| Purpose | Continue interrupted/paused execution | Recover from failure |
| Valid statuses | interrupted, paused | failed |
| Stream behavior | Preserves chunks | Resets to checkpoint |
| State handling | Continues from current | Restores from checkpoint |

retry() Method

typescript
const result = await handle.result();
if (result.status === 'failed') {
  // The three calls below are alternatives; use only one of them per failure.

  // Retry from last checkpoint (default) - typically needs message
  const retryHandle = await handle.retry({
    message: 'Research quantum computing', // Re-provide the triggering message
  });

  // Or from specific checkpoint
  const retryFromCheckpoint = await handle.retry({
    message: 'Research quantum computing',
    checkpointId: 'cpv1-...',
  });

  // Or start completely fresh
  const retryFromStart = await handle.retry({
    mode: 'from_start',
    message: 'Research quantum computing',
  });
}

Message Requirement

When using from_checkpoint mode (default), the original user message that triggered the failure is part of the checkpoint state. You typically need to provide a message option to specify what to retry.

Without an explicit message, the retry will attempt to extract the last user message from the session history, but this may not work as expected in all cases. It's safer to always provide the message explicitly.
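
To make the fallback concrete, here is a minimal sketch of "prefer the explicit message, otherwise take the last user message from history". The `Message` shape and both helper names are assumptions made for illustration; the framework's actual extraction logic is not shown on this page and may differ.

```typescript
// Hypothetical message shape; the real session history type may differ.
interface Message {
  role: 'user' | 'assistant' | 'tool';
  content: string;
}

// Return the most recent user message, or undefined if there is none.
function lastUserMessage(history: Message[]): string | undefined {
  for (let i = history.length - 1; i >= 0; i--) {
    const entry = history[i];
    if (entry && entry.role === 'user') return entry.content;
  }
  return undefined;
}

// Prefer the explicit message; fall back to history only as a last resort.
function resolveRetryMessage(explicit: string | undefined, history: Message[]): string {
  const message = explicit ?? lastUserMessage(history);
  if (message === undefined) {
    throw new Error('No message available to retry with');
  }
  return message;
}
```

The fallback returns whatever user message happens to be last, which may not be the one that triggered the failure. That is why passing the message explicitly is the safer choice.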

RetryOptions

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| mode | 'from_checkpoint' \| 'from_start' | 'from_checkpoint' | How to retry |
| checkpointId | string | Latest | Which checkpoint to restore from |
| message | string | (extracted) | Message to retry with - recommended to always provide |
| abortSignal | AbortSignal | - | Abort signal for cancellation |

Concurrency Safety

execute() Safety Check

execute() validates that the session is not already running:

typescript
import { AgentAlreadyRunningError } from '@helix-agents/core';

try {
  await executor.execute(agent, message, { sessionId });
} catch (error) {
  if (error instanceof AgentAlreadyRunningError) {
    console.log(`Session ${error.sessionId} is already running`);
    // Wait for existing execution or use different session
  }
}

This prevents state corruption from concurrent executions.

Atomic Status Transitions

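Each method guards its status transition against concurrent callers. resume() and retry() use an atomic compare-and-set (CAS) on the session status:

| Method | Protection | Mechanism |
| --- | --- | --- |
| execute() | Status check + StaleStateError | Rejects if running |
| resume() | CAS | Atomic interrupted/paused → active |
| retry() | CAS | Atomic failed → active |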

When concurrent calls race past status checks, optimistic locking via version numbers ensures only one succeeds. The losing call receives a StaleStateError.
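
The version-number check can be illustrated with a small in-memory model. This is a sketch of optimistic locking in general, not the framework's state store: `InMemoryStore`, `SessionRecord`, and the local `StaleStateError` class are stand-ins defined here for the example.

```typescript
// Local stand-in for the error named above; not imported from the library.
class StaleStateError extends Error {}

type Status = 'interrupted' | 'paused' | 'failed' | 'active';

interface SessionRecord {
  status: Status;
  version: number;
}

// In-memory store used only for illustration.
class InMemoryStore {
  private records = new Map<string, SessionRecord>();

  put(id: string, record: SessionRecord): void {
    this.records.set(id, { ...record });
  }

  get(id: string): SessionRecord {
    const record = this.records.get(id);
    if (!record) throw new Error(`Unknown session ${id}`);
    return { ...record };
  }

  // The write succeeds only if nobody bumped the version since it was read.
  compareAndSet(id: string, expectedVersion: number, next: Status): void {
    const current = this.records.get(id);
    if (!current || current.version !== expectedVersion) {
      throw new StaleStateError(`Session ${id} changed concurrently`);
    }
    this.records.set(id, { status: next, version: expectedVersion + 1 });
  }
}

const store = new InMemoryStore();
store.put('s1', { status: 'interrupted', version: 1 });

// Two callers read the same version, then both try to transition.
const readA = store.get('s1');
const readB = store.get('s1');
store.compareAndSet('s1', readA.version, 'active'); // first writer wins

let loserRejected = false;
try {
  store.compareAndSet('s1', readB.version, 'active'); // stale version
} catch (error) {
  loserRejected = error instanceof StaleStateError;
}
console.log(loserRejected);
```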

Basic Usage

Interrupting an Agent

Use handle.interrupt() to pause execution:

typescript
const handle = await executor.execute(agent, 'Research quantum computing');

// Later, interrupt the agent
await handle.interrupt('user_requested');

// Agent status is now 'interrupted'
const state = await handle.getState();
console.log(state.status); // 'interrupted'

The agent will stop at the next safe point (between steps). Any in-progress step is rolled back to the last checkpoint.

Resuming an Agent

Use handle.resume() to continue execution:

typescript
// Check if we can resume
const { canResume, reason } = await handle.canResume();

if (canResume) {
  // Resume execution
  const newHandle = await handle.resume();

  // Stream events from the resumed execution
  for await (const chunk of (await newHandle.stream()) ?? []) {
    console.log(chunk.type);
  }

  // Get final result
  const result = await newHandle.result();
  console.log(result.output);
}

Resume Modes

The resume() method supports four modes:

continue (default)

Resume from where the agent stopped:

typescript
const newHandle = await handle.resume();

// Equivalent explicit form:
// const newHandle = await handle.resume({ mode: 'continue' });

with_message

Resume and add a new user message to the conversation:

typescript
const newHandle = await handle.resume({
  mode: 'with_message',
  message: 'Actually, focus on quantum entanglement specifically',
});

This is useful when the user wants to redirect the agent's focus.

with_confirmation

Resume a paused tool call with confirmation data:

typescript
// Agent paused waiting for confirmation on a tool
const newHandle = await handle.resume({
  mode: 'with_confirmation',
  data: { approved: true, notes: 'Proceed with the action' },
});

This mode is used when a tool requires human-in-the-loop approval.

from_checkpoint

Resume from a specific historical checkpoint (time-travel):

typescript
// Get available checkpoints
const state = await handle.getState();
const checkpoints = await stateStore.listCheckpoints(handle.sessionId);

// Resume from an earlier checkpoint
const newHandle = await handle.resume({
  mode: 'from_checkpoint',
  checkpointId: checkpoints.items[0].id,
});
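
The example above takes the first listed checkpoint without saying how the list is ordered. If you want a checkpoint at a known point in the run, a small helper can select by step instead. This is a sketch under assumptions: `CheckpointInfo` and `checkpointAtOrBefore` are hypothetical, and the `stepCount` field is borrowed from the checkpoint_created event rather than from a documented checkpoint type.

```typescript
// Hypothetical checkpoint summary; field names are illustrative.
interface CheckpointInfo {
  id: string;
  stepCount: number;
}

// Pick the latest checkpoint taken at or before `maxStep`, if any.
function checkpointAtOrBefore(
  checkpoints: CheckpointInfo[],
  maxStep: number,
): CheckpointInfo | undefined {
  let best: CheckpointInfo | undefined;
  for (const cp of checkpoints) {
    if (cp.stepCount <= maxStep && (!best || cp.stepCount > best.stepCount)) {
      best = cp;
    }
  }
  return best;
}

// Example IDs are made up for the demonstration.
const available: CheckpointInfo[] = [
  { id: 'cp-step-5', stepCount: 5 },
  { id: 'cp-step-1', stepCount: 1 },
  { id: 'cp-step-3', stepCount: 3 },
];
const target = checkpointAtOrBefore(available, 4);
console.log(target?.id);
```

The selected `id` would then be passed as `checkpointId` to resume() in from_checkpoint mode.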

Crash Recovery

If your process crashes while an agent is running, the state is preserved in the state store. On restart, you can resume:

typescript
// After restart, reconnect to the session
const handle = await executor.getHandle(agent, savedSessionId);

if (handle) {
  const { canResume, reason } = await handle.canResume();

  if (canResume) {
    // Resume from last checkpoint
    const resumed = await handle.resume();
    const result = await resumed.result();
  }
}

For this to work, use a persistent state store like Redis:

typescript
import { RedisStateStore, RedisStreamManager } from '@helix-agents/store-redis';

// JSAgentExecutor and llmAdapter are assumed to be imported and configured elsewhere
const stateStore = new RedisStateStore({ host: 'localhost' });
const streamManager = new RedisStreamManager({ host: 'localhost' });
const executor = new JSAgentExecutor(stateStore, streamManager, llmAdapter);

What Happens During Resume

When you call resume() on a crashed or interrupted agent, the framework performs cleanup before continuing:

  1. Load checkpoint - Get the last successfully completed step
  2. Clean up orphaned staging - Remove any uncommitted staged changes
  3. Truncate messages - Remove messages added after the checkpoint (prevents duplicates)
  4. Clean up stream chunks - Remove chunks after the checkpoint step
  5. Emit stream_resync - Notify connected clients of the discontinuity
  6. Continue execution - Resume from the checkpoint step

mermaid
graph LR
    subgraph Before ["Before Crash"]
        direction LR
        S1["Step 1"] --> S2["Step 2"] --> S3["Step 3"] --> S4["Step 4 ✗"]
    end

    S3 -. "Checkpoint" .-> S3

    subgraph After ["After Resume"]
        direction LR
        R1["Step 1"] --> R2["Step 2"] --> R3["Step 3"] --> R4["Step 4"] --> R5["Step 5"]
    end

    R3 -. "Resume from<br/>checkpoint" .-> R3

The truncation is critical for preventing duplicate messages. If step 4 partially completed (wrote messages but crashed before checkpoint), those messages are orphaned. truncateMessages() removes them so they're not duplicated when the step re-executes.

Message Count Protection

The framework only truncates messages when a checkpoint exists AND has messageCount > 0. This prevents data loss in edge cases like:

  • First-step crash (no checkpoint yet)
  • Old checkpoints from before messageCount field was added
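
The rollback and its guard can be sketched as a pure function. Everything here is illustrative: the `Checkpoint`, `StreamChunk`, and `SessionData` shapes are assumptions, and the sketch also assumes that a chunk's `step` is comparable with the checkpoint's `stepCount`. It is not the framework's truncateMessages() implementation.

```typescript
// Hypothetical shapes for illustration only.
interface Checkpoint {
  stepCount: number;
  messageCount?: number; // may be absent on older checkpoints
}

interface StreamChunk {
  step: number;
  content: string;
}

interface SessionData {
  messages: string[];
  chunks: StreamChunk[];
}

// Roll session data back to what the checkpoint recorded.
function rollBackToCheckpoint(
  session: SessionData,
  checkpoint: Checkpoint | undefined,
): SessionData {
  if (!checkpoint) {
    // First-step crash: there is nothing to roll back to, so keep everything.
    return session;
  }
  const count = checkpoint.messageCount ?? 0;
  const lastStep = checkpoint.stepCount;
  // Only truncate when the checkpoint recorded a positive message count.
  const messages = count > 0 ? session.messages.slice(0, count) : session.messages;
  // Drop chunks produced by steps after the checkpointed step.
  const chunks = session.chunks.filter((chunk) => chunk.step <= lastStep);
  return { messages, chunks };
}
```

With a checkpoint of `{ stepCount: 1, messageCount: 2 }`, a third message written by a crashed step 2 is dropped. With no checkpoint, or with one that lacks messageCount, the messages are left untouched.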

Resumable Statuses

An agent can be resumed when its status is:

| Status | Resumable | Description |
| --- | --- | --- |
| running | Yes | Crash recovery - process died but state says running |
| interrupted | Yes | User interrupted execution |
| paused | Yes | Waiting for tool confirmation |
| completed | No* | Agent finished successfully |
| failed | No* | Agent failed with error |

*Terminal states can only be resumed with mode: 'from_checkpoint' to time-travel to a previous state.
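
For reference, the rules in this table reduce to a short predicate. This is a local mirror of the table for reasoning and tests, with hypothetical names (`AgentStatus`, `ResumeMode`, `isResumable`). In application code, handle.canResume() remains the call to rely on.

```typescript
type AgentStatus = 'running' | 'interrupted' | 'paused' | 'completed' | 'failed';
type ResumeMode = 'continue' | 'with_message' | 'with_confirmation' | 'from_checkpoint';

// Mirrors the table: terminal states only allow time-travel via from_checkpoint.
function isResumable(status: AgentStatus, mode: ResumeMode = 'continue'): boolean {
  const terminal = status === 'completed' || status === 'failed';
  return terminal ? mode === 'from_checkpoint' : true;
}

console.log(isResumable('interrupted'));
console.log(isResumable('completed'));
console.log(isResumable('completed', 'from_checkpoint'));
```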

Stream Events

The framework emits stream chunks for interrupt/resume events:

run_interrupted

Emitted when an agent is interrupted:

typescript
{
  type: 'run_interrupted',
  runId: 'run-123',
  checkpointId: 'cpv1-run-123-s5-t1234567890-abc123',
  reason: 'user_requested'
}

run_resumed

Emitted when an agent resumes:

typescript
{
  type: 'run_resumed',
  runId: 'run-123',
  fromCheckpointId: 'cpv1-run-123-s5-t1234567890-abc123',
  fromStepCount: 5,
  mode: 'continue'
}

run_paused

Emitted when an agent pauses for confirmation:

typescript
{
  type: 'run_paused',
  runId: 'run-123',
  reason: 'tool_confirmation_required',
  pendingToolName: 'delete_file',
  pendingToolCallId: 'call-456'
}

checkpoint_created

Emitted when a checkpoint is saved:

typescript
{
  type: 'checkpoint_created',
  runId: 'run-123',
  checkpointId: 'cpv1-run-123-s5-t1234567890-abc123',
  stepCount: 5
}
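
When consuming these chunks, a discriminated union keeps the handling type-safe. The types below are inferred from the example payloads on this page and are an assumption. If the library exports its own chunk types, prefer those. `LifecycleChunk` and `describeChunk` are names made up for this sketch.

```typescript
// Types inferred from the example payloads above.
type LifecycleChunk =
  | { type: 'run_interrupted'; runId: string; checkpointId: string; reason: string }
  | { type: 'run_resumed'; runId: string; fromCheckpointId: string; fromStepCount: number; mode: string }
  | { type: 'run_paused'; runId: string; reason: string; pendingToolName: string; pendingToolCallId: string }
  | { type: 'checkpoint_created'; runId: string; checkpointId: string; stepCount: number };

// Produce a short status line for each lifecycle chunk.
function describeChunk(chunk: LifecycleChunk): string {
  switch (chunk.type) {
    case 'run_interrupted':
      return `Run ${chunk.runId} interrupted (${chunk.reason})`;
    case 'run_resumed':
      return `Run ${chunk.runId} resumed from step ${chunk.fromStepCount}`;
    case 'run_paused':
      return `Run ${chunk.runId} waiting on ${chunk.pendingToolName}`;
    case 'checkpoint_created':
      return `Checkpoint saved at step ${chunk.stepCount}`;
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a case is missing.
      const unreachable: never = chunk;
      return unreachable;
    }
  }
}
```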

Error Handling

AgentAlreadyRunningError

Thrown when trying to resume an agent that's already running:

typescript
import { AgentAlreadyRunningError } from '@helix-agents/core';

try {
  await handle.resume();
} catch (error) {
  if (error instanceof AgentAlreadyRunningError) {
    console.log(`Agent session ${error.sessionId} is already running`);
  }
}

AgentNotResumableError

Thrown when trying to resume an agent in a terminal state:

typescript
import { AgentNotResumableError } from '@helix-agents/core';

try {
  await handle.resume();
} catch (error) {
  if (error instanceof AgentNotResumableError) {
    console.log(`Cannot resume: ${error.currentStatus}`);
  }
}

Runtime Considerations

JavaScript Runtime

The JS runtime handles interrupts in-process. When you call interrupt():

  1. The current step is marked for rollback
  2. Staged changes are discarded
  3. Status is set to 'interrupted'
  4. The run loop exits cleanly
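
The idea of stopping at a safe point and never committing staged work can be shown with a simplified, synchronous loop. This is a sketch only, not the JS runtime's run loop: `Step`, `RunState`, and `runLoop` are hypothetical, and the interrupt is modeled as a callback checked between steps.

```typescript
// A step writes its changes into a staging buffer.
type Step = (staged: string[]) => void;

interface RunState {
  committed: string[]; // changes saved at the last checkpoint
  status: 'running' | 'interrupted' | 'completed';
}

// `shouldInterrupt` is consulted only between steps, which is the safe point.
function runLoop(steps: Step[], shouldInterrupt: (completedSteps: number) => boolean): RunState {
  const state: RunState = { committed: [], status: 'running' };
  let completedSteps = 0;
  for (const step of steps) {
    if (shouldInterrupt(completedSteps)) {
      // Exit cleanly; only committed work survives.
      state.status = 'interrupted';
      return state;
    }
    const staged: string[] = [];
    step(staged);
    state.committed.push(...staged); // checkpoint: promote staged changes
    completedSteps++;
  }
  state.status = 'completed';
  return state;
}

const demoSteps: Step[] = [
  (staged) => { staged.push('step 1'); },
  (staged) => { staged.push('step 2'); },
  (staged) => { staged.push('step 3'); },
];

// Interrupt once two steps have completed.
const interruptedRun = runLoop(demoSteps, (completedSteps) => completedSteps === 2);
console.log(interruptedRun.status, interruptedRun.committed);
```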

Temporal Runtime

Temporal uses workflow signals for interrupts:

  1. interrupt() sends a signal to the workflow
  2. The workflow handles the signal at the next safe point
  3. State is persisted via Temporal's durability guarantees
  4. Resume creates a new workflow execution that continues from the checkpoint

Cloudflare Runtime

Cloudflare Workflows handle interrupts via instance coordination:

  1. interrupt() updates the state and marks for stop
  2. The workflow step completes and checks the interrupt flag
  3. Resume creates a new workflow instance with the same run ID

Interrupt Behavior During Sub-Agent Execution

When an agent spawns sub-agents (child workflows), interrupt signals propagate through the entire agent hierarchy. This enables responsive cancellation even during complex multi-agent operations.

How Interrupt Propagation Works

When you interrupt a parent agent that has running sub-agents:

  1. Parent receives interrupt - The interrupt flag is set on the parent agent
  2. Children are signaled - The parent propagates the interrupt to all running children
  3. Children stop gracefully - Each child detects the interrupt and stops at its next safe point
  4. Parent completes - Once all children have stopped, the parent returns with status interrupted

Target Response Time: < 1 second from interrupt request to stopped execution, even with deeply nested sub-agents.
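
The propagation order described above can be modeled with a small tree. This is a conceptual sketch and not how any of the runtimes implement it: `AgentNode` is a hypothetical class, and a real child would notice the flag at its next safe point rather than instantly.

```typescript
// Minimal model of an agent hierarchy.
class AgentNode {
  readonly name: string;
  readonly children: AgentNode[] = [];
  interrupted = false;

  constructor(name: string) {
    this.name = name;
  }

  // Register a running sub-agent under this agent.
  spawn(name: string): AgentNode {
    const child = new AgentNode(name);
    this.children.push(child);
    return child;
  }

  // Set the flag locally, then signal every running descendant.
  interrupt(): void {
    this.interrupted = true;
    for (const child of this.children) {
      child.interrupt();
    }
  }
}

const parent = new AgentNode('parent');
const researcher = parent.spawn('researcher');
const summarizer = researcher.spawn('summarizer');
const sibling = new AgentNode('unrelated');

parent.interrupt();
console.log(researcher.interrupted, summarizer.interrupted, sibling.interrupted);
```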

Per-Runtime Implementation

| Runtime | Mechanism | Latency |
| --- | --- | --- |
| JS | AbortSignal propagation via batchController | Immediate (< 100ms) |
| Temporal | Signal handlers + Trigger primitive for instant wake | < 1 second |
| Cloudflare | Event-based: Promise.race with interrupt events | Immediate (< 100ms) |

JavaScript Runtime

The JS runtime uses AbortSignal for cooperative cancellation:

typescript
// Parent's AbortController is linked to child agents
// When parent.abort() is called, children receive abort signal immediately

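// In custom tools, check the signal between units of work.
// input.items and processItem are placeholders for your own input and per-item work.
const myTool = defineTool({
  // name, description, and input schema omitted for brevity
  execute: async (input, context) => {
    const results = [];
    for (const item of input.items) {
      if (context.abortSignal.aborted) {
        // Stop early and report what was finished so far
        return { partial: true, processed: results };
      }
      results.push(await processItem(item));
    }
    return { partial: false, processed: results };
  },
});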

Temporal Runtime

Temporal uses workflow signals with a Trigger primitive for instant response:

typescript
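// Platform adapter pattern for sub-second interrupt response:
import { Trigger, defineSignal, getExternalWorkflowHandle, setHandler } from '@temporalio/workflow';

// Signal definition; the signal name 'interrupt' is illustrative
const interruptSignal = defineSignal<[string]>('interrupt');

let interruptReason: string | undefined;
const interruptTrigger = new Trigger<string>();

// Register the handler inside the workflow function
setHandler(interruptSignal, (reason) => {
  interruptReason = reason;
  interruptTrigger.resolve(reason); // Wake up immediately
});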

// The workflow races child completion against interrupt trigger
// If interrupt wins, all running children are signaled to stop

Cloudflare Runtime

Cloudflare uses an event-based approach for immediate interrupt response:

When interrupt() is called:

  1. Interrupt flag is set via stateStore.setInterruptFlag() (persisted)
  2. Interrupt event is sent via instance.sendEvent() for immediate wake-up
  3. The workflow races completion events against interrupt events using Promise.race
  4. Whichever event arrives first wins the race

The workflow checks for interrupts at two points:

  • Before each step: In the main execution loop
  • Before spawning sub-agents: A pre-spawn check prevents spawning if already interrupted

This event-based approach provides immediate interrupt response (< 100ms) without polling overhead.
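
The racing step can be illustrated with plain promises. This sketch shows the generic Promise.race pattern only. It does not use the Cloudflare Workflows API, and `Outcome` and `raceWithInterrupt` are hypothetical names.

```typescript
type Outcome =
  | { kind: 'completed'; output: string }
  | { kind: 'interrupted'; reason: string };

// Resolve with whichever of the two events arrives first.
function raceWithInterrupt(
  completion: Promise<string>,
  interrupt: Promise<string>,
): Promise<Outcome> {
  return Promise.race([
    completion.then((output): Outcome => ({ kind: 'completed', output })),
    interrupt.then((reason): Outcome => ({ kind: 'interrupted', reason })),
  ]);
}

// Demo: the interrupt fires while the work is still pending.
const stillWorking = new Promise<string>(() => {});
const interruptNow = Promise.resolve('user_requested');
const interruptedOutcome = raceWithInterrupt(stillWorking, interruptNow);

// Demo: the work finishes and no interrupt ever arrives.
const noInterrupt = new Promise<string>(() => {});
const completedOutcome = raceWithInterrupt(Promise.resolve('done'), noInterrupt);

interruptedOutcome.then((outcome) => console.log(outcome.kind));
completedOutcome.then((outcome) => console.log(outcome.kind));
```

Because the loser of the race is simply ignored, the persisted interrupt flag still matters: it lets the pre-step and pre-spawn checks catch an interrupt that arrived when nothing was racing.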

Best Practices for Sub-Agent Interrupts

  1. Keep sub-agent operations bounded - Long-running sub-agents delay interrupt response
  2. Use interruptible tools - Check context.abortSignal.aborted in loops
  3. Configure poll intervals appropriately - Where a tool or runtime polls for interrupts, shorter intervals respond faster but add overhead
  4. Design for partial results - Sub-agents should return meaningful partial output when interrupted

Best Practices

1. Always Check canResume()

Before resuming, verify the agent can be resumed:

typescript
const { canResume, reason } = await handle.canResume();
if (!canResume) {
  console.log(`Cannot resume: ${reason}`);
  return;
}

2. Handle Resume Errors

Wrap resume calls in try-catch:

typescript
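import { AgentAlreadyRunningError, AgentNotResumableError } from '@helix-agents/core';

try {
  const resumed = await handle.resume();
  await resumed.result();
} catch (error) {
  if (error instanceof AgentAlreadyRunningError) {
    // Another process resumed first
  } else if (error instanceof AgentNotResumableError) {
    // Agent is in terminal state
  } else {
    throw error; // Unexpected error: do not swallow it
  }
}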

3. Use Persistent Storage for Production

For crash recovery to work, use Redis or another persistent store:

typescript
import { RedisStateStore } from '@helix-agents/store-redis';

const stateStore = new RedisStateStore({
  host: process.env.REDIS_HOST,
  ttl: 86400 * 7, // 7 days retention
});

4. Store Session IDs for Later Resumption

Save session IDs so you can reconnect after crashes:

typescript
const handle = await executor.execute(agent, input);

// Persist the session ID
await database.save({ sessionId: handle.sessionId, userId });

// Later, after restart
const savedSession = await database.get(userId);
const restoredHandle = await executor.getHandle(agent, savedSession.sessionId);

Released under the MIT License.