
Sub-Agent Execution

This document explains how Helix Agents handles sub-agent (child agent) execution and the patterns involved.

Overview

Sub-agents enable:

  1. Task Delegation - Parent agents delegate specialized tasks
  2. Composition - Build complex agents from simpler ones
  3. Isolation - Sub-agents have their own state
  4. Reusability - Define once, use from multiple parents

Creating Sub-Agent Tools

Using createSubAgentTool

typescript
import { createSubAgentTool, defineAgent } from '@helix-agents/core';
import { z } from 'zod';

// First define the sub-agent with an outputSchema
const SummarizerAgent = defineAgent({
  name: 'summarizer',
  outputSchema: z.object({
    summary: z.string(),
    keyPoints: z.array(z.string()),
  }),
  // ... other config
});

// Then create a tool that invokes it
const summarizeTool = createSubAgentTool(
  SummarizerAgent, // Full agent config (must have outputSchema)
  z.object({
    texts: z.array(z.string()),
    maxLength: z.number().optional(),
  }),
  { description: 'Summarize a list of texts' } // Optional
);

Tool Name Convention

Sub-agent tools use the subagent__ prefix internally:

typescript
// When LLM calls the tool, it uses 'summarize'
// Internally stored as 'subagent__summarizer'

const SUBAGENT_TOOL_PREFIX = 'subagent__';

Detecting Sub-Agent Tools

typescript
import { isSubAgentTool, SUBAGENT_TOOL_PREFIX } from '@helix-agents/core';

// Check if a tool call is for a sub-agent
if (isSubAgentTool(toolName)) {
  // Extract agent type
  const agentType = toolName.slice(SUBAGENT_TOOL_PREFIX.length);
}

Execution Flow

1. Parent Agent Makes Tool Call

Parent Agent

    ├── LLM decides to use 'summarize' tool


{ type: 'tool_calls', toolCalls: [
  { id: 's1', name: 'summarize', arguments: { texts: [...] } }
]}

2. Tool Call is Recognized as Sub-Agent

typescript
// In planStepProcessing
const subAgentCalls = toolCalls
  .filter((tc) => isSubAgentTool(tc.name))
  .map((tc) => ({
    id: tc.id,
    agentType: tc.name.slice(SUBAGENT_TOOL_PREFIX.length),
    input: tc.arguments,
  }));

3. Sub-Agent is Executed

mermaid
graph TB
    Parent["Parent Agent (paused)"]

    Parent --> SubCreate["Sub-Agent Created"]

    subgraph SubConfig [" "]
        direction LR
        C1["sessionId: unique ID"]
        C2["streamId: linked to parent"]
        C3["parentSessionId: parent's sessionId"]
        C4["input: from tool arguments"]
    end

    SubCreate --> SubConfig

    SubConfig --> ExecLoop["Sub-Agent Execution Loop"]

    subgraph ExecSteps [" "]
        direction TB
        E1["Initialize state"]
        E2["Call LLM"]
        E3["Execute tools"]
        E4["Complete with output"]
        E1 --> E2 --> E3 --> E4
    end

    ExecLoop --> ExecSteps

4. Result Returned to Parent

Sub-Agent Complete

    ├── Output returned to parent


Parent Agent (resumed)

    └── Receives tool result with sub-agent output

State Isolation

Parent State

typescript
interface ParentState {
  notes: Note[];
  searchCount: number;
}

Sub-Agent State

typescript
interface SummarizerState {
  texts: string[];
  processedCount: number;
}

Sub-agents cannot directly modify parent state. Communication happens through:

  • Input - Data passed when invoking sub-agent
  • Output - Structured result returned on completion

Stream Event Flow

subagent_start

Emitted when sub-agent begins. Schema (canonical reference: ./stream-protocol.md):

typescript
{
  type: 'subagent_start',
  subAgentType: 'summarizer',                // Sub-agent type
  subSessionId: 'session-child-abc123',      // Sub-agent's session ID; correlate sub-agent chunks via this
  callId: 'call_xyz789',                     // Tool call ID that spawned this sub-agent
  agentId: 'session-parent-xyz789',          // BaseChunk: parent's session ID
  agentType: 'parent-agent-type',
  step: 3,
  timestamp: 1702329600000
}

Sub-Agent Events

The sub-agent emits its own events (text_delta, tool_start, etc.), each tagged with its own agentId, which is its sessionId:

typescript
{
  type: 'text_delta',
  delta: 'Summarizing...',
  agentId: 'session-child-abc123',  // Sub-agent's sessionId
  agentType: 'summarizer',
  timestamp: 1702329600100
}

subagent_end

Emitted when sub-agent completes. Schema (canonical reference: ./stream-protocol.md):

typescript
{
  type: 'subagent_end',
  subAgentType: 'summarizer',
  subSessionId: 'session-child-abc123',
  callId: 'call_xyz789',
  result: { summary: 'The texts discuss...' }, // Failures surface via the result payload itself or via emitted error chunks; there is no `success` field on this chunk
  agentId: 'session-parent-xyz789',
  agentType: 'parent-agent-type',
  step: 3,
  timestamp: 1702329601000
}
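
Because sub-agent chunks carry the child's session ID in agentId, a stream consumer can attribute them to the spawning tool call by remembering the subSessionId to callId mapping announced in subagent_start. The following is a minimal sketch of that correlation. The Chunk type models only the fields shown above, and groupByCall is a hypothetical helper, not part of the library.

```typescript
// Simplified chunk shapes covering only the fields shown above (assumption).
type Chunk =
  | { type: 'subagent_start'; subSessionId: string; callId: string; agentId: string }
  | { type: 'subagent_end'; subSessionId: string; callId: string; agentId: string; result: unknown }
  | { type: 'text_delta'; delta: string; agentId: string };

// Hypothetical helper: collect the text emitted by each sub-agent, keyed by
// the tool call ID that spawned it.
function groupByCall(chunks: Chunk[]): Map<string, string> {
  const callBySession = new Map<string, string>(); // subSessionId -> callId
  const textByCall = new Map<string, string>();

  for (const chunk of chunks) {
    if (chunk.type === 'subagent_start') {
      callBySession.set(chunk.subSessionId, chunk.callId);
      textByCall.set(chunk.callId, '');
    } else if (chunk.type === 'text_delta') {
      // A sub-agent's agentId is its own session ID; parent chunks find no match.
      const callId = callBySession.get(chunk.agentId);
      if (callId !== undefined) {
        textByCall.set(callId, (textByCall.get(callId) ?? '') + chunk.delta);
      }
    }
  }
  return textByCall;
}
```

Text emitted by the parent itself is ignored here because its agentId never appears as a subSessionId.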

Runtime Implementations

JS Runtime

Sub-agents execute recursively in the same process:

typescript
// In JSAgentExecutor
for (const subAgentCall of plan.pendingSubAgentCalls) {
  const subAgent = registry.get(subAgentCall.agentType);

  // Execute sub-agent (recursive call)
  const handle = await this.execute(
    subAgent,
    {
      message: JSON.stringify(subAgentCall.input),
      state: subAgentCall.input,
    },
    {
      parentSessionId: state.sessionId, // Parent's session ID (primary key)
    }
  );

  const result = await handle.result();

  // Add result to parent's messages
  state.messages.push(
    createSubAgentResultMessage({
      toolCallId: subAgentCall.id,
      agentType: subAgentCall.agentType,
      result: result.output,
      success: result.status === 'completed',
    })
  );
}

Temporal Runtime

Sub-agents run as child workflows:

typescript
// In workflow
for (const subAgentCall of plan.pendingSubAgentCalls) {
  const subSessionId = generateSubSessionId();
  const childResult = await executeChild(agentWorkflow, {
    workflowId: subSessionId,
    args: [
      {
        agentType: subAgentCall.agentType,
        sessionId: subSessionId,
        streamId: parentStreamId, // Share stream
        message: JSON.stringify(subAgentCall.input),
        parentSessionId: input.sessionId,
      },
    ],
    parentClosePolicy: 'ABANDON',
  });

  // Record result
  await activities.recordSubAgentResult({
    parentSessionId: input.sessionId,
    subAgentCall,
    result: childResult,
  });
}

Cloudflare Runtime

Sub-agents spawn as separate workflow instances:

typescript
// In workflow step
for (const subAgentCall of plan.pendingSubAgentCalls) {
  const instance = await workflowBinding.create({
    id: generateSubSessionId(),
    params: {
      agentType: subAgentCall.agentType,
      message: JSON.stringify(subAgentCall.input),
      parentSessionId: input.sessionId, // Parent's session ID (primary key)
    },
  });

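  // instance.status() returns a point-in-time snapshot rather than blocking,
  // so a real implementation must poll it (or be resumed) until the child is terminal
  const result = await instance.status();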
}

Message Recording

Assistant Message

Tool calls including sub-agent calls are recorded:

typescript
{
  role: 'assistant',
  content: 'I will summarize these texts.',
  toolCalls: [
    { id: 's1', name: 'subagent__summarizer', arguments: { texts: [...] } }
  ]
}

Tool Result Message

Sub-agent result is recorded as a tool result:

typescript
{
  role: 'tool',
  toolCallId: 's1',
  toolName: 'subagent__summarizer',
  content: '{"summary":"The texts discuss..."}'
}

Parallel Sub-Agents

Multiple sub-agents can run in parallel:

typescript
// LLM requests multiple sub-agents
{
  type: 'tool_calls',
  toolCalls: [],
  subAgentCalls: [
    { id: 's1', agentType: 'summarizer', input: { texts: batch1 } },
    { id: 's2', agentType: 'summarizer', input: { texts: batch2 } },
    { id: 's3', agentType: 'analyzer', input: { data: {...} } },
  ]
}

The runtime executes them in parallel:

typescript
// JS Runtime
const results = await Promise.all(
  subAgentCalls.map(call => executeSubAgent(call))
);

// Temporal Runtime
await Promise.all(
  subAgentCalls.map(call => executeChild(agentWorkflow, { ... }))
);

Interrupt Propagation (v7 durable model)

When a parent agent is interrupted while sub-agents are running, v7 propagates the interrupt through the entire hierarchy via the durable interrupt protocol, not in-memory promise races.

Durable Interrupt Flag

Every interrupt request, local or cross-process, is written to the state store via stateStore.setInterruptFlag(sessionId, reason). The runLoop polls stateStore.checkInterruptFlag(sessionId) (an atomic check-and-clear) at the top of each step iteration. This is the canonical mechanism for interrupt detection on the JS, Cloudflare DO, Cloudflare Workflows, and Temporal runtimes.

typescript
// In any process — possibly different from the one that owns the in-memory handle:
await stateStore.setInterruptFlag(sessionId, 'user clicked Stop');

// runLoop, on its next iteration:
const flag = await stateStore.checkInterruptFlag(sessionId); // atomic read+clear
if (flag) {
  // Persist 'interrupted' run status; return RunOutcome.kind = 'interrupted'.
}
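
To make the check-and-clear semantics concrete, here is a minimal in-memory sketch. It is an illustration under stated assumptions, not the library's store: InMemoryInterruptStore and this runLoop are hypothetical, the methods are synchronous for brevity (the real store methods are awaited), and the return value of checkInterruptFlag is assumed to be the reason string or null.

```typescript
// Hypothetical in-memory store. Real stores persist the flag durably and
// expose these methods asynchronously.
class InMemoryInterruptStore {
  private flags = new Map<string, string>();

  setInterruptFlag(sessionId: string, reason: string): void {
    this.flags.set(sessionId, reason);
  }

  // Check-and-clear: the reason is returned once, then the flag is gone.
  // Assumption: returns the reason string, or null when no flag is set.
  checkInterruptFlag(sessionId: string): string | null {
    const reason = this.flags.get(sessionId) ?? null;
    this.flags.delete(sessionId);
    return reason;
  }
}

// 'interrupted' comes from the text above; the 'completed' variant is assumed.
type RunOutcome =
  | { kind: 'interrupted'; reason: string }
  | { kind: 'completed'; steps: number };

// Hypothetical run loop: polls the flag at the top of each step iteration.
function runLoop(
  store: InMemoryInterruptStore,
  sessionId: string,
  maxSteps: number,
  onStep: (step: number) => void
): RunOutcome {
  for (let step = 0; step < maxSteps; step++) {
    const reason = store.checkInterruptFlag(sessionId);
    if (reason !== null) return { kind: 'interrupted', reason };
    onStep(step);
  }
  return { kind: 'completed', steps: maxSteps };
}
```

A flag written during step N is observed at the top of step N+1, which is why interrupt latency is bounded by the length of one step.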

Why Not Promise.race?

In v7's stateless suspension model the parent runLoop EXITS at every HITL boundary; there is no long-lived in-memory Promise.all to race against. A sub-agent's completion in another session context fires a separate runLoop invocation against the parent's session via executor.resume(), which observes the durable child status and drains the result via applyResultsAndReload (Temporal/CFW) or the equivalent JS/DO bootstrap.

This means cross-process interrupts (the case the v6 Promise.race machinery existed to mask) are now first-class — any process can write the flag and the next runLoop tick observes it.

Cascade to In-Flight Children

When a parent suspends with running children, the in-flight children are marked failed:'parent_suspended' (see γ-cascade discriminator below). On parent's resume, the cascade either re-spawns the child (genuine parent-suspension drop) or drains the existing failure result (genuine child failure).

Interrupt-Specific Handling

Per-runtime entry points for an interrupt-during-children scenario:

| Runtime | Detection | Child propagation |
| --- | --- | --- |
| runtime-js | pollDurableInterruptFlag at every step boundary | Linked AbortSignal for in-process children; durable flag write for sub-process / sub-DO children |
| runtime-cloudflare (DO) | Per-step checkInterruptFlag | Durable flag write to each child's session |
| runtime-cloudflare (Workflow) | Per-step checkInterruptFlag | Durable flag write; new resume workflow observes the flag |
| runtime-temporal | Per-step checkInterruptFlag; original signal handler still wires to Trigger for in-flight wake-up of long polls | Durable flag write per child sessionId; child workflows observe on next iteration boundary |

Latency is bounded by the per-step poll cadence (typically sub-second for tool-heavy agents, but can be longer for agents in long LLM calls). The durable model trades worst-case latency for cross-process correctness — important since v6's INTERRUPT_NOT_LOCAL 503 is now removed.

Sub-Agent Suspension Cascade (γ-cascade)

This is the central cross-runtime invariant for what happens when a parent suspends while children are running. Required reading for any sub-project touching parent/child suspension propagation.

The Problem

When a parent agent reaches a HITL boundary (e.g., a client-tool suspension or an approval-gate match) WHILE sub-agents are still executing, the runtime must persist enough information to:

  1. Cleanly exit the parent run.
  2. Cleanly exit (or abandon) in-flight children.
  3. On parent's resume, decide which children to re-spawn (the parent-suspended drop wasn't a real failure) and which to drain as failure results (genuine execution failures).

The Discriminator: SessionState.failureReason

When a child is forcibly marked 'failed' due to parent suspension, the runtime sets SessionState.failureReason = 'parent_suspended'. This field is persisted by every state store and carries the γ-cascade signal across processes. A child failed for any other reason (thrown error, timeout, stop-when condition) has failureReason unset (or set to a different string).

The Cascade Logic

On parent's resume, applyResultsAndReload (Temporal/CFW) or the equivalent JS/DO bootstrap iterates each SubSessionRef for the parent and decides:

| Child status | failureReason | Cascade action |
| --- | --- | --- |
| 'completed' | n/a | Drain result to parent as tool_result message. |
| 'failed' | 'parent_suspended' | Re-spawn: the parent's resume gives the child another chance. New child workflow / instance / loop runs from scratch. |
| 'failed' | other / unset | Drain the failure as a tool_error to parent. The LLM decides how to handle it. |
| 'running' | n/a | Continue waiting (re-suspend if needed). |
| 'terminated' | n/a | Drain a "terminated" tool result; do NOT re-spawn. |
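
The table above can be read as a pure decision function. The sketch below is one way to express it. The action names and decideCascadeAction itself are hypothetical labels chosen for illustration, and only the four statuses listed in the table are modelled.

```typescript
// Only the statuses that appear in the cascade table are modelled here.
type ChildStatus = 'completed' | 'failed' | 'running' | 'terminated';

// Hypothetical action labels, one per row of the table.
type CascadeAction =
  | 'drain_result'
  | 'respawn'
  | 'drain_error'
  | 'wait'
  | 'drain_terminated';

function decideCascadeAction(status: ChildStatus, failureReason?: string): CascadeAction {
  if (status === 'completed') return 'drain_result';
  if (status === 'running') return 'wait';
  if (status === 'terminated') return 'drain_terminated';
  // status === 'failed': the discriminator separates a parent-suspension drop
  // from a genuine child failure.
  return failureReason === 'parent_suspended' ? 'respawn' : 'drain_error';
}
```

The sketch also shows why a store that drops failureReason is harmful: a child that was only dropped by parent suspension would then be reported to the LLM as a genuine failure instead of being re-spawned.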

Cross-Store Invariant

ALL five in-tree state stores (memory, redis, postgres, D1, DO) MUST persist SessionState.failureReason correctly. The exhaustive-fixes audit found three separate P0 bugs where stores dropped this field on persist — the cascade silently devolved into "always re-spawn" or "always drain". Tests in packages/e2e/src/__tests__/gamma-cascade-parity.integ.test.ts exercise the matrix.

Companion Tools and the Cascade

Persistent sub-agents managed via companion tools follow the same cascade. If a parent suspends while a companion__waitForResult blocking child is mid-flight, the child is marked 'paused_awaiting_client' (or 'failed' with 'parent_suspended' if the persistence model requires a terminal status), and the parent's resume drains via applyResultsAndReload. See Persistent Sub-Agent Execution below.

Lifecycle Hook Guarantees

Sub-agents fire their own lifecycle hooks (onAgentStart, onAgentComplete, onAgentFail) independently from the parent. This is critical for tracing integrations (e.g., Langfuse) that need to emit root spans when a sub-agent completes.

The Problem

Sub-agents share the parent's stream. Without special handling, the sub-agent's onAgentComplete hook would either not fire (if stream finalization is skipped) or would close the parent's stream prematurely.

The Solution: skipStreamClose

All runtimes call endAgentStream for sub-agents with skipStreamClose: true. This fires onAgentComplete without closing the parent's stream:

typescript
// Root agent: fire hooks and close stream
await endAgentStream({ sessionId });

// Sub-agent: fire hooks but keep parent's stream open
await endAgentStream({ sessionId: subAgentSessionId, skipStreamClose: true });

This applies to both completion paths:

  • Normal completion (__finish__ tool) — sub-agent hooks fire with skipStreamClose: true
  • finishWith completion — sub-agent hooks fire with both finishWithOutput and skipStreamClose: true
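
The effect of skipStreamClose can be sketched with a toy stream and hook. This is a simplified illustration, assuming a synchronous endAgentStream built by a hypothetical makeEndAgentStream factory; the real function is awaited and takes more options.

```typescript
interface Stream {
  closed: boolean;
}

interface EndOptions {
  sessionId: string;
  skipStreamClose?: boolean;
}

// Hypothetical factory: binds a shared stream and an onAgentComplete hook.
function makeEndAgentStream(stream: Stream, onAgentComplete: (sessionId: string) => void) {
  return function endAgentStream({ sessionId, skipStreamClose = false }: EndOptions): void {
    // The completion hook fires for root agents and sub-agents alike.
    onAgentComplete(sessionId);
    // Only a root agent closes the stream it shares with its sub-agents.
    if (!skipStreamClose) stream.closed = true;
  };
}

const sharedStream: Stream = { closed: false };
const completed: string[] = [];
const endAgentStream = makeEndAgentStream(sharedStream, (id) => completed.push(id));

endAgentStream({ sessionId: 'child-1', skipStreamClose: true }); // hook fires, stream stays open
const openAfterChild = !sharedStream.closed;
endAgentStream({ sessionId: 'parent-1' }); // hook fires, stream closes
```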

Per-Runtime Behavior

| Runtime | Root Agent | Sub-Agent |
| --- | --- | --- |
| JS | endAgentStream({ sessionId }) | endAgentStream({ sessionId, skipStreamClose: true }) |
| Temporal | endAgentStream({ sessionId }) | endAgentStream({ sessionId, skipStreamClose: true }) |
| Cloudflare | endAgentStream({ sessionId }) | endAgentStream({ sessionId, skipStreamClose: true }) |

All three runtimes follow the same pattern, ensuring hook behavior is consistent regardless of where the agent runs.

Error Handling

Sub-Agent Failure

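If a sub-agent fails, the subagent_end chunk keeps the same shape as on success. As noted under Stream Event Flow, there is no success field on this chunk; the failure surfaces through the result payload or through emitted error chunks:

typescript
{
  type: 'subagent_end',
  subAgentType: 'summarizer',
  subSessionId: 'session-child-abc123',
  callId: 'call_xyz789',
  result: { error: 'Max steps exceeded' }, // Illustrative failure payload; the exact shape may vary
  agentId: 'session-parent-xyz789',
  agentType: 'parent-agent-type',
  step: 3,
  timestamp: 1702329601000
}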

Parent Handling

Parent receives error as tool result:

typescript
{
  role: 'tool',
  toolCallId: 's1',
  toolName: 'subagent__summarizer',
  content: '{"error":"Max steps exceeded"}'
}

The LLM can then decide how to handle the failure.
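
As a rough illustration of how a sub-agent outcome could be turned into the tool result message shown above, consider the sketch below. toToolResultMessage and the SubAgentOutcome type are hypothetical; the library's own helper (createSubAgentResultMessage, shown earlier) may behave differently.

```typescript
const SUBAGENT_TOOL_PREFIX = 'subagent__';

interface ToolResultMessage {
  role: 'tool';
  toolCallId: string;
  toolName: string;
  content: string;
}

// Assumed outcome shape for illustration only.
type SubAgentOutcome =
  | { status: 'completed'; output: unknown }
  | { status: 'failed'; error: string };

// Hypothetical helper: success and failure both become an ordinary tool
// result, so the parent's LLM sees the error as data it can react to.
function toToolResultMessage(
  toolCallId: string,
  agentType: string,
  outcome: SubAgentOutcome
): ToolResultMessage {
  const payload = outcome.status === 'completed' ? outcome.output : { error: outcome.error };
  return {
    role: 'tool',
    toolCallId,
    toolName: `${SUBAGENT_TOOL_PREFIX}${agentType}`,
    content: JSON.stringify(payload),
  };
}
```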

State Reference Tracking

Parents track sub-agent references via the state store's addSubSessionRefs / updateSubSessionRef / getSubSessionRefs methods. Refs are persisted separately from the parent's custom state.

The canonical v7 SubSessionRef shape (from packages/core/src/types/session.ts):

typescript
interface SubSessionRef {
  /** Sub-agent's session ID */
  subSessionId: string;

  /** Agent type (e.g. 'researcher') */
  agentType: string;

  /** Tool call ID that spawned this sub-agent */
  parentToolCallId: string;

  /**
   * Workflow lifecycle status:
   * - 'running'
   * - 'completed'
   * - 'failed'
   * - 'interrupted'
   * - 'terminated'
   * - 'paused_awaiting_client' (sub-agent waiting on a client-tool result)
   */
  status:
    | 'running'
    | 'completed'
    | 'failed'
    | 'interrupted'
    | 'terminated'
    | 'paused_awaiting_client';

  startedAt: number;
  completedAt?: number;

  /**
   * Remote-specific metadata (null/undefined for in-process sub-agents).
   *
   * Contains stream and sequence info for reconnecting to a remote agent
   * after parent interruption or crash. URL/transport config is NOT stored
   * here — it's reconstructed from the tool definition at resume time.
   */
  remote?: {
    streamId: string;
    lastSequence: number;
  };

  /**
   * Sub-agent lifecycle mode:
   * - 'ephemeral': One-shot execution, session discarded after completion (default)
   * - 'persistent': Long-lived session, can be resumed and reused across parent runs
   */
  mode: 'ephemeral' | 'persistent';

  /**
   * Stable name for persistent sub-agents (used to generate deterministic
   * session IDs). Required when `mode === 'persistent'`; absent for
   * ephemeral sub-agents. Use `assertPersistentSubSessionRef()` to narrow
   * the type when `name` must be present.
   */
  name?: string;
}

This allows:

  • Querying sub-agent status
  • Retrieving sub-agent results (load each subSessionId for terminal output / error / failureReason)
  • Cleanup on parent completion
  • Crash-recovery reconnect to remote sub-agent streams via remote.streamId + remote.lastSequence
  • γ-cascade re-spawn decisions on parent resume (combined with each child's SessionState.failureReason)
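
The interface comment mentions assertPersistentSubSessionRef() for narrowing refs whose name must be present. The library's implementation is not shown here, so the following is only a sketch of what such a narrowing assertion could look like, using a trimmed-down ref type and hypothetical names.

```typescript
// Trimmed-down ref with only the fields relevant to narrowing (assumption).
interface SubSessionRefLike {
  subSessionId: string;
  mode: 'ephemeral' | 'persistent';
  name?: string;
}

type PersistentRefLike = SubSessionRefLike & { mode: 'persistent'; name: string };

// Hypothetical assertion function: throws unless the ref is a named persistent ref.
function assertPersistentRef(ref: SubSessionRefLike): asserts ref is PersistentRefLike {
  if (ref.mode !== 'persistent' || typeof ref.name !== 'string' || ref.name.length === 0) {
    throw new Error(`SubSessionRef ${ref.subSessionId} is not a named persistent ref`);
  }
}

function persistentName(ref: SubSessionRefLike): string {
  assertPersistentRef(ref);
  // After the assertion, TypeScript treats ref.name as string, not string | undefined.
  return ref.name;
}
```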

Best Practices

1. Clear Input/Output Contracts

typescript
// Define clear schemas for sub-agent communication
const SummarizerInputSchema = z.object({
  texts: z.array(z.string()),
  maxLength: z.number().optional().default(500),
});

const SummarizerOutputSchema = z.object({
  summary: z.string(),
  keyPoints: z.array(z.string()),
});

2. Meaningful Agent Types

typescript
// Good: descriptive type names
agentType: 'code-reviewer';
agentType: 'data-analyzer';
agentType: 'email-composer';

// Bad: generic names
agentType: 'helper';
agentType: 'agent1';

3. Limit Nesting Depth

Avoid deep nesting of sub-agents:

Parent
  └── Sub-Agent
        └── Sub-Sub-Agent  // Avoid this level
              └── ...      // Definitely avoid this

4. Handle Failures Gracefully

typescript
// In parent's system prompt
'If a sub-agent fails, try to complete the task yourself or report the failure.';

Testing

typescript
import { MockLLMAdapter } from '@helix-agents/core';

describe('SubAgent', () => {
  it('executes sub-agent and returns result', async () => {
    const mock = new MockLLMAdapter([
      // Parent calls sub-agent
      {
        type: 'tool_calls',
        toolCalls: [],
        subAgentCalls: [
          {
            id: 's1',
            agentType: 'summarizer',
            input: { texts: ['text1'] },
          },
        ],
      },
      // Parent finishes with sub-agent result
      {
        type: 'structured_output',
        output: { result: 'Used summary: ...' },
      },
    ]);

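    // Register both agents. `registry` and `executor` come from the surrounding
    // test setup (not shown); the executor is assumed to be built with `mock`
    // as its LLM adapter.
    registry.register(ParentAgent);
    registry.register(SummarizerAgent);

    // execute() resolves to a handle; await handle.result() for the final result
    const handle = await executor.execute(ParentAgent, 'Summarize texts');
    const result = await handle.result();
    expect(result.status).toBe('completed');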
  });
});

Remote Sub-Agent Execution

Remote sub-agents (createRemoteSubAgentTool()) follow a different execution path from local sub-agents. Instead of spawning an in-process or child workflow execution, remote sub-agents delegate to agents reachable via a RemoteAgentTransport — typically HttpRemoteAgentTransport for cross-service calls, or DOStubTransport for sibling Durable Object routing.

DO Runtime Transparent Rewriting

In the Cloudflare DO runtime, createSubAgentTool() tools are transparently rewritten to createRemoteSubAgentTool() at execution time when subAgentNamespace is configured. This means the three-way routing below still applies — by the time the executor sees the tools, local sub-agent tools have already been converted to remote sub-agent tools backed by DOStubTransport. See Sub-Agents in the DO Runtime.

Three-Way Tool Call Routing

When the LLM returns tool calls, all three runtimes partition them into three groups:

  1. Regular tools — Executed via the standard tool execution path
  2. Local sub-agent calls — Routed to child workflow / in-process execution
  3. Remote sub-agent calls — Routed to a dedicated executeRemoteSubAgentCall activity/step

Detection uses isRemoteSubAgentTool() which checks the _isRemoteSubAgent marker on the tool.
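
The three-way partition can be sketched as follows. This is an illustration under assumptions: the ToolDef and ToolCall shapes are simplified, partitionToolCalls is a hypothetical name, local sub-agent tools are recognised by the subagent__ prefix described earlier, and the remote marker is checked first (the real ordering is not specified in this document).

```typescript
const SUBAGENT_TOOL_PREFIX = 'subagent__';

interface ToolDef {
  name: string;
  _isRemoteSubAgent?: boolean; // marker checked by isRemoteSubAgentTool()
}

interface ToolCall {
  id: string;
  name: string;
  arguments: unknown;
}

interface PartitionedCalls {
  regular: ToolCall[];
  local: ToolCall[];
  remote: ToolCall[];
}

// Hypothetical partition step: every tool call lands in exactly one group.
function partitionToolCalls(calls: ToolCall[], tools: Map<string, ToolDef>): PartitionedCalls {
  const result: PartitionedCalls = { regular: [], local: [], remote: [] };
  for (const call of calls) {
    const tool = tools.get(call.name);
    if (tool?._isRemoteSubAgent === true) {
      result.remote.push(call);
    } else if (call.name.startsWith(SUBAGENT_TOOL_PREFIX)) {
      result.local.push(call);
    } else {
      result.regular.push(call);
    }
  }
  return result;
}
```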

Execution Flow

  1. Generate deterministic session ID: {parentSessionId}-remote-{toolCallId} ensures idempotent restarts
  2. Register SubSessionRef — Tracks the remote session with remote: { streamId, lastSequence } metadata
  3. Call transport.start() — Sends POST /start to the remote agent server
  4. Consume transport.stream() — Reads SSE events, proxies chunks to the parent stream
  5. Handle completion — Updates SubSessionRef status, returns output as tool result

Crash Recovery (Temporal and Cloudflare)

If the runtime crashes mid-execution:

  1. Check transport.getStatus() — Determine if the remote agent is still running, completed, or failed
  2. If completed — Return the output directly without re-executing
  3. If still running — Reconnect to transport.stream() with fromSequence to avoid duplicate events
  4. If failed — Return the error
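
The recovery decision above can be sketched as a pure function over the remote status and the stored remote metadata. The status and action shapes below are assumptions made for illustration, since the real RemoteAgentTransport types are not shown in this document, and planRecovery is a hypothetical name.

```typescript
// Assumed result shape of transport.getStatus().
type RemoteStatus =
  | { state: 'running' }
  | { state: 'completed'; output: unknown }
  | { state: 'failed'; error: string };

// Hypothetical description of what the runtime should do next.
type RecoveryAction =
  | { kind: 'return_output'; output: unknown }
  | { kind: 'reconnect'; streamId: string; fromSequence: number }
  | { kind: 'return_error'; error: string };

// Matches the `remote` field of SubSessionRef.
interface RemoteMeta {
  streamId: string;
  lastSequence: number;
}

function planRecovery(status: RemoteStatus, remote: RemoteMeta): RecoveryAction {
  if (status.state === 'completed') {
    // Finished while the runtime was down: reuse the output, do not re-execute.
    return { kind: 'return_output', output: status.output };
  }
  if (status.state === 'failed') {
    return { kind: 'return_error', error: status.error };
  }
  // Still running: resume the stream from the last recorded sequence.
  // Assumption: the transport treats fromSequence as "resume after this sequence".
  return { kind: 'reconnect', streamId: remote.streamId, fromSequence: remote.lastSequence };
}
```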

Resume Reconnection (JS Runtime)

The JS runtime's reconcileRemoteSubAgents() handles resume after interrupts:

  1. Loads SubSessionRefs with remote metadata from the state store
  2. For each running remote sub-agent, checks transport.getStatus()
  3. Reconnects to streams or records completions/failures
  4. Updates tool result messages in the conversation history

Persistent Sub-Agent Execution

Persistent sub-agents use a different execution model from ephemeral sub-agents. Instead of being invoked as tool calls that run to completion and return results, persistent children are long-lived agents managed through companion tools.

Companion Tool Architecture

When buildEffectiveTools() processes an agent with persistentAgents, it dynamically generates companion tools:

typescript
// In buildEffectiveTools (packages/core/src/orchestration/state-operations.ts)
if (config.persistentAgents && config.persistentAgents.length > 0) {
  // Generate companion tools based on configured persistent agents
  tools.push(createSpawnAgentTool(config.persistentAgents));
  tools.push(createSendMessageTool());
  tools.push(createListChildrenTool());
  tools.push(createGetChildStatusTool());
  tools.push(createTerminateChildTool());

  // waitForResult only available if at least one blocking agent exists
  if (config.persistentAgents.some((pa) => pa.mode === 'blocking')) {
    tools.push(createWaitForResultTool());
  }
}

Companion Tool Call Routing

Companion tool calls are handled separately from regular tool calls in all runtimes. The three-way tool routing becomes four-way:

  1. Regular tools -- Standard tool execution
  2. Local sub-agent calls -- Ephemeral child agent execution
  3. Remote sub-agent calls -- HTTP-based delegation
  4. Companion tool calls -- Persistent child management

Detection uses isCompanionTool() which checks the _isCompanionTool marker:

typescript
if (isCompanionTool(tool)) {
  // Route to the shared core dispatcher (each runtime supplies its own deps).
  return executeCompanionToolDispatch(input, deps);
}

Execution Flow: Blocking Spawn (v7)

mermaid
graph TB
    Parent["Parent Agent"]
    Parent --> Spawn["companion__spawnAgent called"]
    Spawn --> CreateRef["Create SubSessionRef (mode: 'persistent')"]
    CreateRef --> CreateChild["Initialize child agent session"]
    CreateChild --> RunChild["Execute child agent loop"]
    RunChild --> ChildComplete["Child calls __finish__"]
    ChildComplete --> UpdateRef["Update SubSessionRef (status: 'completed')"]
    UpdateRef --> ReturnResult["Return result to parent"]
    ReturnResult --> ParentContinues["Parent continues execution"]

v7 stateless suspension note: If the parent calls companion__waitForResult and the blocking child has not finished yet, the parent's runLoop EXITS with RunOutcome.kind = 'suspended_awaiting_children' (AgentResult.status = 'suspended_awaiting_children'). The parent's suspendedAwaitingChildren map is persisted; the runtime returns.

Sub-agent completion in another session context fires the parent's resume via the same __resume-N workflow-id convention used for client-tool resumes (single-dash on Temporal/CFW). On resume, applyResultsAndReload (Temporal/CFW) or the equivalent JS/DO bootstrap drains the result via recordSubSessionResult and the parent continues from the durable checkpoint.

If the parent suspends while ANY children are still running (e.g. via a separate client-tool boundary), the γ-cascade applies — children are marked failed:'parent_suspended' and re-spawned on resume. See ./concepts.md §Client-Executed Tools for per-runtime resume mechanics.

Execution Flow: Non-Blocking Spawn

mermaid
graph TB
    Parent["Parent Agent"]
    Parent --> Spawn["companion__spawnAgent called"]
    Spawn --> CreateRef["Create SubSessionRef (mode: 'persistent')"]
    CreateRef --> StartChild["Start child agent (fire-and-forget)"]
    StartChild --> ReturnImmediate["Return immediately to parent"]
    ReturnImmediate --> ParentContinues["Parent continues execution"]
    StartChild --> ChildRuns["Child runs concurrently"]
    ChildRuns --> ChildComplete["Child completes later"]
    ChildComplete --> UpdateRef["Update SubSessionRef"]

SubSessionRef with Persistent Mode

The SubSessionRef interface (full schema in State Reference Tracking above) carries mode and name fields specifically for persistent children:

  • mode: 'persistent' — Long-lived session, resumable across parent runs.
  • name: string — Required when mode === 'persistent'; used to generate deterministic session IDs ({parentSessionId}-agent-{name}).

For ephemeral children: mode: 'ephemeral' and name is omitted; session ID is derived from the spawning tool call ID instead.
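
The value of deterministic session IDs is that a retried spawn resolves to the same session instead of creating a duplicate. The sketch below illustrates this with the two ID formats given in this document ({parentSessionId}-agent-{name} for persistent children and {parentSessionId}-remote-{toolCallId} for remote sub-agents). The helper names and the SessionTable class are hypothetical.

```typescript
// ID formats taken from this document; helper names are hypothetical.
function persistentSessionId(parentSessionId: string, name: string): string {
  return `${parentSessionId}-agent-${name}`;
}

function remoteSessionId(parentSessionId: string, toolCallId: string): string {
  return `${parentSessionId}-remote-${toolCallId}`;
}

// Hypothetical session table: creating the same ID twice is a no-op.
class SessionTable {
  private ids = new Set<string>();

  getOrCreate(id: string): { id: string; created: boolean } {
    if (this.ids.has(id)) return { id, created: false };
    this.ids.add(id);
    return { id, created: true };
  }
}

const table = new SessionTable();
const firstSpawn = table.getOrCreate(persistentSessionId('parent-1', 'researcher'));
// A replayed or retried spawn computes the same ID and finds the existing session.
const retriedSpawn = table.getOrCreate(persistentSessionId('parent-1', 'researcher'));
```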

Continuing a completed child

A persistent companion declares an outputSchema, so it always completes via the auto-injected __finish__ tool. Re-consulting a completed child, whether through companion__sendMessage or through companion__spawnAgent re-using its name, now continues the child on its preserved session (memory retained, fresh typed output). Previously, sendMessage threw and spawnAgent deleted and re-spawned the child. The continuation reopens the finished session (completed → active), heals any dangling __finish__, appends the consult, and runs a new turn with a fresh per-turn maxSteps budget. failed / terminated children still re-spawn fresh (the old session is cleaned up). The continuation primitive differs per runtime (see each subsection below and the cross-runtime parity table in ./concepts.md §Persistent-companion continuation + the __finish__ heal), but the heal and the per-turn reset are uniform.

What a new (6th) runtime must implement

A runtime that reimplements the step loop / companion dispatch does not inherit companion continuation for free. To reach parity it must wire all four:

  1. Eager __finish__ heal at its own step site — append the synthetic tool_result via the shared synthesizeFinishToolResult(toolCallId) (message-builder.ts:233) immediately after the terminal __finish__ assistant message. (Only runtime-js inherits this from core runStepIteration; Temporal/DBOS/CFW replicate it — see ./step-processing.md §The __finish__ history invariant.)
  2. Legacy reopen heal — before appending the consult on a continuation, scan the preserved history with findUnpairedFinishCallId(messages) (message-builder.ts:251) and synthesize the missing result if a dangling __finish__ is present (covers sessions completed under a pre-heal release).
  3. The continuation primitive — CAS completed → active, heal-before-append, append the consult exactly once, reset stepCount → 0 so the continued turn gets a fresh per-turn maxSteps budget. customState (memory), workspaceRef, and subSessionRefs are preserved.
  4. Replay-idempotency if the runtime re-executes its dispatch on recovery — use deterministic continuation ids and a durable marker short-circuit. DBOS is the reference pattern: a deterministic ${childSessionId}-continue-${toolCallId} restart workflow id (so DBOS.startWorkflow dedupes) plus a metadata.dbosWorkflowId marker check that re-enters the idempotent continuation branch instead of re-sending / re-launching on workflow-body replay (applies to both spawn-continue and sendMessage-continue).
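
The heal-before-append requirement in items 1 to 3 can be sketched over a plain message array. This is an illustration only: findDanglingFinish and synthesizeFinishResult are hypothetical stand-ins for the shared findUnpairedFinishCallId and synthesizeFinishToolResult helpers, the synthetic result content is a placeholder, and the consult is assumed to be appended as a user message.

```typescript
interface ToolCall {
  id: string;
  name: string;
}

type Message =
  | { role: 'user'; content: string }
  | { role: 'assistant'; content: string; toolCalls?: ToolCall[] }
  | { role: 'tool'; toolCallId: string; toolName: string; content: string };

const FINISH_TOOL = '__finish__';

// Stand-in for findUnpairedFinishCallId: locate a __finish__ call with no tool result.
function findDanglingFinish(messages: Message[]): { index: number; callId: string } | undefined {
  const answered = new Set<string>();
  for (const m of messages) {
    if (m.role === 'tool') answered.add(m.toolCallId);
  }
  let index = 0;
  for (const m of messages) {
    if (m.role === 'assistant') {
      for (const call of m.toolCalls ?? []) {
        if (call.name === FINISH_TOOL && !answered.has(call.id)) {
          return { index, callId: call.id };
        }
      }
    }
    index++;
  }
  return undefined;
}

// Stand-in for synthesizeFinishToolResult; the content is a placeholder.
function synthesizeFinishResult(toolCallId: string): Message {
  return { role: 'tool', toolCallId, toolName: FINISH_TOOL, content: '{"status":"finished"}' };
}

// Heal first, then append the consult exactly once. The input is not mutated.
function healAndAppendConsult(history: Message[], consult: string): Message[] {
  const messages = [...history];
  const dangling = findDanglingFinish(messages);
  if (dangling !== undefined) {
    // Place the synthetic result immediately after the __finish__ assistant message.
    messages.splice(dangling.index + 1, 0, synthesizeFinishResult(dangling.callId));
  }
  messages.push({ role: 'user', content: consult });
  return messages;
}
```

Running the function on an already-healed history adds no second synthetic result, which is the property a replaying runtime relies on.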

Per-Runtime Implementation

runtime-js

Companion tool calls route through the shared core executeCompanionToolDispatch (the JS-specific executeCompanionToolCall handler was retired during the convergence). The JS run loop supplies a waitForChildTerminal park primitive, so a blocking spawn runs the child's loop in-process and parks inline until the child is terminal, returning the result inline — it does NOT suspend the parent with 'suspended_awaiting_children' (that stateless-suspension model is used by the Temporal / Cloudflare-Workflows runtimes, whose orchestration can't park inline). Non-blocking spawns start the child detached (fire-and-forget); the parent observes completion via the deliver-on-next-turn notifier, getSubSessionRefs, or companion__waitForResult.

Continuing a completed child uses the in-process continuePersistentChild dependency: it reopens the preserved session, resets stepCount → 0, and runs a new turn in the same run loop. A failed/aborted continued turn surfaces as failed (it routes through runSubAgentLoop, which throws on a non-completed outcome).

runtime-temporal

Companion tool calls are executed via an executeCompanionToolCall activity. Blocking spawns use wf.startChild to run a child workflow. In v7, the parent does NOT block awaiting children in-workflow — if the parent reaches a HITL boundary (or its own deadline), it exits with 'suspended_awaiting_children'. After a blocking child completes, markPersistentChildStatus updates the SubSessionRef; the parent's __resume-N workflow drains the result via applyResultsAndReload.

The Temporal runtime does NOT store companion tool results as separate ToolResultMessages in the state store. Results flow through the workflow execution context (or through the resume drain on suspension).

Continuing a completed child runs the store-side reopen (CAS completed → active, append the consult, reset the parent ref) inside the executeCompanionToolCall activity — so it is checkpointed and replay-idempotent — then starts a fresh child workflow via wf.startChild with a unique __continue__<stepCount> workflow id (never colliding with the completed child's existing workflow). The continued turn resets stepCount → 0 via the shared continued-turn init path.

runtime-cloudflare (Workflow path)

Companion tool calls run inside runAgentWorkflow steps. Blocking spawns start child workflow instances; non-blocking spawns fire-and-forget. Same v7 stateless model as Temporal — parent exits with 'suspended_awaiting_children' if it reaches a HITL boundary; resume drains via applyResultsAndReload.

Continuing a completed child runs the child as a fresh workflow instance with a unique-but-deterministic id agent__<type>__<childSession>__continue__<stepCount>__<toolCallId> (a Cloudflare Workflows instance id is write-once globally, so the completed child's base id can never be recreated). Unlike Temporal/DBOS, the CFW companion step deliberately does not CAS the child completed → active and does not append the consult there: the child stays completed, the consult is carried as the new instance's newMessages, and the instance's workflow body takes the !isResumable continuation branch (the same path as root multi-turn continuation) — which resets stepCount, appends the consult exactly once, and re-runs over the full preserved history.

runtime-cloudflare (DO path)

Companion tool calls execute inline in the DO's run-loop iteration. The DO can be evicted during the wait — on the next request that touches the parent session, a fresh DO observes the durable suspension state and resumes. D1StateStore was updated with a migration to add mode and name columns to the __agents_sub_session_refs table.

Continuing a completed child POSTs /start to the child session, which drives the JS executor's existing-session continuation path (reopen + heal + per-turn stepCount → 0); a non-blocking spawn-continue also re-resets the parent SubSessionRef. So the CF-DO path inherits the JS continuation semantics (including the legacy reopen heal on JSAgentExecutor's existing-session continuation).

runtime-dbos

DBOS supports persistent sub-agents. Companion tool calls are dispatched by the runExecuteCompanionTool step. listChildren / getChildStatus / terminateChild route through the shared core dispatcher (executeCompanionToolDispatch); spawnAgent, sendMessage, and waitForResult are handled by DBOS-local logic so they can use DBOS primitives — a child is started via startPersistentWorkflow, sendMessage appends to the child's durable inbox via DBOS.send, and waitForResult polls with the durable DBOS.sleep (survives crashes). Blocking spawns poll until the child reaches a terminal status. A completion notifier (a @DBOS.step) delivers a completed non-blocking child's outcome into the parent's next turn, deduplicated by the durable completionDelivered flag on the SubSessionRef.
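The completion dedup described above can be sketched as a filter over the parent's child refs: deliver only completed children whose completionDelivered flag is not yet set, then set it. The types and the `deliverPendingCompletions` function below are hypothetical, and the sketch is in-memory only; in the real runtime the flag is durable and the notifier runs as a @DBOS.step.

```typescript
// Hypothetical, simplified view of a SubSessionRef (illustration only).
interface SubSessionRefSketch {
  childSessionId: string;
  status: 'active' | 'completed' | 'failed';
  completionDelivered?: boolean | null;
}

// Delivers each completed, not-yet-delivered child exactly once and
// returns how many outcomes were delivered on this pass.
function deliverPendingCompletions(
  refs: SubSessionRefSketch[],
  deliver: (ref: SubSessionRefSketch) => void,
): number {
  let delivered = 0;
  for (const ref of refs) {
    if (ref.status !== 'completed') continue; // not finished yet
    if (ref.completionDelivered === true) continue; // already delivered
    deliver(ref);
    ref.completionDelivered = true; // mark so later passes skip it
    delivered += 1;
  }
  return delivered;
}
```

Calling the function a second time over the same refs delivers nothing, which is the behavior the dedup flag is meant to guarantee.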

Continuing a completed child cannot deliver via DBOS.send(..., 'inbox'): a completed DBOS child has no live recv loop (its workflow exited via finalizeLoop). Instead, startPersistentContinuation starts a fresh persistent restart workflow on the preserved session with a deterministic id (${childSessionId}-continue-${toolCall.id}). A workflow-body replay computes the same id and DBOS.startWorkflow dedupes, so a replay never double-starts the continuation.

The store-side reopen (CAS completed → active plus the legacy defensive __finish__ heal) is CAS-gated single-winner: the first execution heals; a replay loses the CAS and skips it. The consult is carried as the restart workflow's initialMessage and appended exactly once by the workflow body's checkpointed append step.

Because handleSpawnAgent / handleSendMessage run in the workflow body (not a @DBOS.step) and re-run on recovery, a continuation replay is additionally detected from the durable metadata.dbosWorkflowId marker matching the deterministic continue id. That check short-circuits before the live-child paths, so neither spawn-continue nor sendMessage-continue re-sends, re-interrupts, or re-launches on replay. The continued turn starts at stepNumber = 0 (per-turn budget).
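The deterministic continuation id and the replay check can be sketched as two pure functions. The function and type names here are hypothetical; only the id layout (${childSessionId}-continue-${toolCall.id}) and the metadata.dbosWorkflowId marker come from the text.

```typescript
// Hypothetical helpers (illustration only).
interface ChildMetadataSketch {
  dbosWorkflowId?: string;
}

// Same inputs always produce the same id, so a replay recomputes it identically.
function continuationWorkflowId(childSessionId: string, toolCallId: string): string {
  return `${childSessionId}-continue-${toolCallId}`;
}

// A replay is detected when the durable marker already equals the id
// this tool call would use for its continuation.
function isContinuationReplay(
  metadata: ChildMetadataSketch,
  childSessionId: string,
  toolCallId: string,
): boolean {
  return metadata.dbosWorkflowId === continuationWorkflowId(childSessionId, toolCallId);
}
```

Under these assumptions, a handler would check isContinuationReplay first and return early when it is true, before touching any live-child path.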

Known limitations: the blocking-spawn path currently blocks-until-idle and can mis-report a failed child as completed (tracked in docs/dev/follow-ups.md as FU-DBOS-BLOCKING-SPAWN-SEMANTICS). Workspaces on persistent children are not supported on DBOS — a persistent child that declares a workspace fails fast at spawn (C8 workspace fail-fast; see the Sub-Agents guide); declare workspaces only on JS / Cloudflare DO persistent children.

State Store Requirements

The persistent sub-agent feature requires state stores to support the mode, name, and completionDelivered fields on SubSessionRef (completionDelivered is the durable dedup flag that prevents a child's completion from being delivered to the parent twice):

| Store | Support | Notes |
| --- | --- | --- |
| InMemoryStateStore | Yes | Fields stored in-memory |
| RedisStateStore | Yes | Fields serialized as hash fields (completionDelivered as '1'/'0') |
| PostgresStateStore | Yes | Schema includes mode + name; V11 migration adds completion_delivered BOOLEAN |
| D1StateStore | Yes | Migration adds mode/name; D1 V14 adds completion_delivered INTEGER |
| DOStateStore | Yes | SQLite schema includes fields; DO V8 adds completion_delivered |

All stores read an absent/null completion_delivered as "undelivered", so a session created before the migration delivers a completed child's outcome exactly once on its first post-upgrade parent turn. The cross-store contract test (packages/core/src/testing/sub-session-operations.ts) pins this round-trip on every store.
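The "absent means undelivered" rule can be sketched as a single decoding function over the raw representations listed above. `isCompletionDelivered` is a hypothetical name, and treating the INTEGER columns as 1/0 is an assumption of this sketch; the text only states the column types and the Redis '1'/'0' encoding.

```typescript
// Raw values a store might hand back for completion_delivered (illustration only).
type RawCompletionDelivered = boolean | number | string | null | undefined;

// Hypothetical decoder: absent or null reads as "undelivered".
function isCompletionDelivered(raw: RawCompletionDelivered): boolean {
  if (raw === null || raw === undefined) return false; // pre-migration rows
  if (typeof raw === 'boolean') return raw; // e.g. a BOOLEAN column
  if (typeof raw === 'number') return raw === 1; // assumed 1/0 encoding for INTEGER columns
  return raw === '1'; // Redis hash field stored as '1'/'0'
}
```

With this decoding, a ref written before the migration reads as undelivered, so its completed child is still eligible for one delivery.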
