Sub-Agent Execution
This document explains how Helix Agents handles sub-agent (child agent) execution and the patterns involved.
Overview
Sub-agents enable:
- Task Delegation - Parent agents delegate specialized tasks
- Composition - Build complex agents from simpler ones
- Isolation - Sub-agents have their own state
- Reusability - Define once, use from multiple parents
Creating Sub-Agent Tools
Using createSubAgentTool
import { createSubAgentTool, defineAgent } from '@helix-agents/core';
import { z } from 'zod';
// First define the sub-agent with an outputSchema
const SummarizerAgent = defineAgent({
name: 'summarizer',
outputSchema: z.object({
summary: z.string(),
keyPoints: z.array(z.string()),
}),
// ... other config
});
// Then create a tool that invokes it
const summarizeTool = createSubAgentTool(
SummarizerAgent, // Full agent config (must have outputSchema)
z.object({
texts: z.array(z.string()),
maxLength: z.number().optional(),
}),
{ description: 'Summarize a list of texts' } // Optional
);
Tool Name Convention
Sub-agent tools use the subagent__ prefix internally:
// When LLM calls the tool, it uses 'summarize'
// Internally stored as 'subagent__summarizer'
const SUBAGENT_TOOL_PREFIX = 'subagent__';
Detecting Sub-Agent Tools
import { isSubAgentTool, SUBAGENT_TOOL_PREFIX } from '@helix-agents/core';
// Check if a tool call is for a sub-agent
if (isSubAgentTool(toolName)) {
// Extract agent type
const agentType = toolName.slice(SUBAGENT_TOOL_PREFIX.length);
}
Execution Flow
1. Parent Agent Makes Tool Call
Parent Agent
│
├── LLM decides to use 'summarize' tool
│
▼
{ type: 'tool_calls', toolCalls: [
{ id: 's1', name: 'summarize', arguments: { texts: [...] } }
]}
2. Tool Call is Recognized as Sub-Agent
// In planStepProcessing
const subAgentCalls = toolCalls
.filter((tc) => isSubAgentTool(tc.name))
.map((tc) => ({
id: tc.id,
agentType: tc.name.slice(SUBAGENT_TOOL_PREFIX.length),
input: tc.arguments,
}));
3. Sub-Agent is Executed
graph TB
Parent["Parent Agent (paused)"]
Parent --> SubCreate["Sub-Agent Created"]
subgraph SubConfig [" "]
direction LR
C1["sessionId: unique ID"]
C2["streamId: linked to parent"]
C3["parentSessionId: parent's sessionId"]
C4["input: from tool arguments"]
end
SubCreate --> SubConfig
SubConfig --> ExecLoop["Sub-Agent Execution Loop"]
subgraph ExecSteps [" "]
direction TB
E1["Initialize state"]
E2["Call LLM"]
E3["Execute tools"]
E4["Complete with output"]
E1 --> E2 --> E3 --> E4
end
ExecLoop --> ExecSteps
4. Result Returned to Parent
Sub-Agent Complete
│
├── Output returned to parent
│
▼
Parent Agent (resumed)
│
└── Receives tool result with sub-agent output
State Isolation
Parent State
interface ParentState {
notes: Note[];
searchCount: number;
}
Sub-Agent State
interface SummarizerState {
texts: string[];
processedCount: number;
}
Sub-agents cannot directly modify parent state. Communication happens through:
- Input - Data passed when invoking sub-agent
- Output - Structured result returned on completion
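To make the isolation boundary concrete, here is a minimal sketch under stated assumptions. `runSummarizer` and `delegate` are hypothetical names, not Helix Agents APIs, and `notes` is simplified to `string[]` so the block stays self-contained. The child builds its own state from the input and hands back only its output.

```typescript
// Hypothetical sketch: these names are illustrative, not Helix Agents APIs.
interface ParentState {
  notes: string[]; // simplified from Note[] to keep the sketch self-contained
  searchCount: number;
}

interface SummarizerOutput {
  summary: string;
  keyPoints: string[];
}

// The sub-agent builds its own state from the input and returns only its output.
function runSummarizer(input: { texts: string[] }): SummarizerOutput {
  const state = { texts: [...input.texts], processedCount: 0 };
  const keyPoints: string[] = [];
  for (const text of state.texts) {
    keyPoints.push(text.slice(0, 20));
    state.processedCount += 1;
  }
  return { summary: `${state.processedCount} texts summarized`, keyPoints };
}

// The parent alone decides how the output is folded into its own state.
function delegate(parent: ParentState, texts: string[]): ParentState {
  const output = runSummarizer({ texts });
  return { ...parent, notes: [...parent.notes, output.summary] };
}

const parentBefore: ParentState = { notes: [], searchCount: 2 };
const parentAfter = delegate(parentBefore, ['alpha', 'beta']);
```

The child never receives a reference to `ParentState`, so the only way information crosses the boundary is through the input argument and the returned value.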
Stream Event Flow
subagent_start
Emitted when sub-agent begins. Schema (canonical reference: ./stream-protocol.md):
{
type: 'subagent_start',
subAgentType: 'summarizer', // Sub-agent type
subSessionId: 'session-child-abc123', // Sub-agent's session ID; correlate sub-agent chunks via this
callId: 'call_xyz789', // Tool call ID that spawned this sub-agent
agentId: 'session-parent-xyz789', // BaseChunk: parent's session ID
agentType: 'parent-agent-type',
step: 3,
timestamp: 1702329600000
}
Sub-Agent Events
The sub-agent emits its own events (text_delta, tool_start, etc.), each tagged with the sub-agent's own agentId (which is its sessionId):
{
type: 'text_delta',
delta: 'Summarizing...',
agentId: 'session-child-abc123', // Sub-agent's sessionId
agentType: 'summarizer',
timestamp: 1702329600100
}
subagent_end
Emitted when sub-agent completes. Schema (canonical reference: ./stream-protocol.md):
{
type: 'subagent_end',
subAgentType: 'summarizer',
subSessionId: 'session-child-abc123',
callId: 'call_xyz789',
result: { summary: 'The texts discuss...' }, // Failures surface via the result payload itself or via emitted error chunks; there is no `success` field on this chunk
agentId: 'session-parent-xyz789',
agentType: 'parent-agent-type',
step: 3,
timestamp: 1702329601000
}
Runtime Implementations
JS Runtime
Sub-agents execute recursively in the same process:
// In JSAgentExecutor
for (const subAgentCall of plan.pendingSubAgentCalls) {
const subAgent = registry.get(subAgentCall.agentType);
// Execute sub-agent (recursive call)
const handle = await this.execute(
subAgent,
{
message: JSON.stringify(subAgentCall.input),
state: subAgentCall.input,
},
{
parentSessionId: state.sessionId, // Parent's session ID (primary key)
}
);
const result = await handle.result();
// Add result to parent's messages
state.messages.push(
createSubAgentResultMessage({
toolCallId: subAgentCall.id,
agentType: subAgentCall.agentType,
result: result.output,
success: result.status === 'completed',
})
);
}
Temporal Runtime
Sub-agents run as child workflows:
// In workflow
for (const subAgentCall of plan.pendingSubAgentCalls) {
const subSessionId = generateSubSessionId();
const childResult = await executeChild(agentWorkflow, {
workflowId: subSessionId,
args: [
{
agentType: subAgentCall.agentType,
sessionId: subSessionId,
streamId: parentStreamId, // Share stream
message: JSON.stringify(subAgentCall.input),
parentSessionId: input.sessionId,
},
],
parentClosePolicy: 'ABANDON',
});
// Record result
await activities.recordSubAgentResult({
parentSessionId: input.sessionId,
subAgentCall,
result: childResult,
});
}
Cloudflare Runtime
Sub-agents spawn as separate workflow instances:
// In workflow step
for (const subAgentCall of plan.pendingSubAgentCalls) {
const instance = await workflowBinding.create({
id: generateSubSessionId(),
params: {
agentType: subAgentCall.agentType,
message: JSON.stringify(subAgentCall.input),
parentSessionId: input.sessionId, // Parent's session ID (primary key)
},
});
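// status() returns the instance's current status without blocking; re-check it until the instance reaches a terminal state
const result = await instance.status();
}
Message Recording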
Assistant Message
Tool calls including sub-agent calls are recorded:
{
role: 'assistant',
content: 'I will summarize these texts.',
toolCalls: [
{ id: 's1', name: 'subagent__summarizer', arguments: { texts: [...] } }
]
}
Tool Result Message
Sub-agent result is recorded as a tool result:
{
role: 'tool',
toolCallId: 's1',
toolName: 'subagent__summarizer',
content: '{"summary":"The texts discuss..."}'
}
Parallel Sub-Agents
Multiple sub-agents can run in parallel:
// LLM requests multiple sub-agents
{
type: 'tool_calls',
toolCalls: [],
subAgentCalls: [
{ id: 's1', agentType: 'summarizer', input: { texts: batch1 } },
{ id: 's2', agentType: 'summarizer', input: { texts: batch2 } },
{ id: 's3', agentType: 'analyzer', input: { data: {...} } },
]
}
The runtime executes them in parallel:
// JS Runtime
const results = await Promise.all(
subAgentCalls.map(call => executeSubAgent(call))
);
// Temporal Runtime
await Promise.all(
subAgentCalls.map(call => executeChild(agentWorkflow, { ... }))
);
Interrupt Propagation (v7 durable model)
When a parent agent is interrupted while sub-agents are running, v7 propagates the interrupt through the entire hierarchy via the durable interrupt protocol, not in-memory promise races.
Durable Interrupt Flag
Every interrupt request — local or cross-process — is written to the state store via stateStore.setInterruptFlag(sessionId, reason). The runLoop polls stateStore.checkInterruptFlag(sessionId) (atomic check-and-clear) at the top of each step iteration. This is the canonical mechanism for interrupt detection on JS, CF DO, CFW Workflows, and Temporal.
// In any process — possibly different from the one that owns the in-memory handle:
await stateStore.setInterruptFlag(sessionId, 'user clicked Stop');
// runLoop, on its next iteration:
const flag = await stateStore.checkInterruptFlag(sessionId); // atomic read+clear
if (flag) {
// Persist 'interrupted' run status; return RunOutcome.kind = 'interrupted'.
}
Why Not Promise.race?
In v7's stateless suspension model the parent runLoop EXITS at every HITL boundary; there is no long-lived in-memory Promise.all to race against. A sub-agent's completion in another session context fires a separate runLoop invocation against the parent's session via executor.resume(), which observes the durable child status and drains the result via applyResultsAndReload (Temporal/CFW) or the equivalent JS/DO bootstrap.
This means cross-process interrupts (the case the v6 Promise.race machinery existed to mask) are now first-class — any process can write the flag and the next runLoop tick observes it.
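The durable flag protocol described above can be sketched with an in-memory stand-in. The method names mirror `setInterruptFlag` / `checkInterruptFlag`, but the class, the `Outcome` type, and this `runLoop` are assumptions made for illustration, not the real state store or run loop.

```typescript
// In-memory stand-in for the durable flag (illustrative assumption, not the real store).
class InMemoryInterruptFlags {
  private flags = new Map<string, string>();

  async setInterruptFlag(sessionId: string, reason: string): Promise<void> {
    this.flags.set(sessionId, reason);
  }

  // Read and clear together so a flag is observed at most once.
  async checkInterruptFlag(sessionId: string): Promise<string | undefined> {
    const reason = this.flags.get(sessionId);
    this.flags.delete(sessionId);
    return reason;
  }
}

type Outcome =
  | { kind: 'completed'; steps: number }
  | { kind: 'interrupted'; reason: string; steps: number };

// Hypothetical loop: polls the flag at the top of every step iteration.
async function runLoop(
  store: InMemoryInterruptFlags,
  sessionId: string,
  maxSteps: number,
  onStep: (step: number) => Promise<void>,
): Promise<Outcome> {
  for (let step = 0; step < maxSteps; step++) {
    const reason = await store.checkInterruptFlag(sessionId);
    if (reason !== undefined) return { kind: 'interrupted', reason, steps: step };
    await onStep(step);
  }
  return { kind: 'completed', steps: maxSteps };
}
```

The get-then-delete pair is only safe here because nothing awaits between the two calls. A store shared across processes would need a genuinely atomic read-and-clear operation.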
Cascade to In-Flight Children
When a parent suspends with running children, the in-flight children are marked failed:'parent_suspended' (see γ-cascade discriminator below). On parent's resume, the cascade either re-spawns the child (genuine parent-suspension drop) or drains the existing failure result (genuine child failure).
Interrupt-Specific Handling
Per-runtime entry points for an interrupt-during-children scenario:
| Runtime | Detection | Child propagation |
|---|---|---|
| runtime-js | pollDurableInterruptFlag at every step boundary | Linked AbortSignal for in-process children; durable flag write for sub-process / sub-DO children |
| runtime-cloudflare (DO) | Per-step checkInterruptFlag | Durable flag write to each child's session |
| runtime-cloudflare (Workflow) | Per-step checkInterruptFlag | Durable flag write; new resume workflow observes the flag |
| runtime-temporal | Per-step checkInterruptFlag; original signal handler still wires to Trigger for in-flight wake-up of long polls | Durable flag write per child sessionId; child workflows observe on next iteration boundary |
Latency is bounded by the per-step poll cadence (typically sub-second for tool-heavy agents, but can be longer for agents in long LLM calls). The durable model trades worst-case latency for cross-process correctness — important since v6's INTERRUPT_NOT_LOCAL 503 is now removed.
Sub-Agent Suspension Cascade (γ-cascade)
This is the central cross-runtime invariant for what happens when a parent suspends while children are running. Required reading for any sub-project touching parent/child suspension propagation.
The Problem
When a parent agent reaches a HITL boundary (e.g., a client-tool suspension or an approval-gate match) WHILE sub-agents are still executing, the runtime must persist enough information to:
- Cleanly exit the parent run.
- Cleanly exit (or abandon) in-flight children.
- On parent's resume, decide which children to re-spawn (the parent-suspended drop wasn't a real failure) and which to drain as failure results (genuine execution failures).
The Discriminator: SessionState.failureReason
When a child is forcibly marked 'failed' due to parent suspension, the runtime sets SessionState.failureReason = 'parent_suspended'. This field is persisted by every state store and carries the γ-cascade signal across processes. A child failed for any other reason (thrown error, timeout, stop-when condition) has failureReason unset (or set to a different string).
The Cascade Logic
On parent's resume, applyResultsAndReload (Temporal/CFW) or the equivalent JS/DO bootstrap iterates each SubSessionRef for the parent and decides:
| Child status | failureReason | Cascade action |
|---|---|---|
| 'completed' | n/a | Drain result to parent as tool_result message. |
| 'failed' | 'parent_suspended' | Re-spawn: the parent's resume gives the child another chance. New child workflow / instance / loop runs from scratch. |
| 'failed' | other / unset | Drain the failure as a tool_error to parent. The LLM decides how to handle it. |
| 'running' | n/a | Continue waiting (re-suspend if needed). |
| 'terminated' | n/a | Drain a "terminated" tool result; do NOT re-spawn. |
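The table above reduces to a small pure function. This is a sketch only: `decideCascadeAction` and the action labels are hypothetical names and do not reflect the internals of `applyResultsAndReload`.

```typescript
type ChildStatus = 'completed' | 'failed' | 'running' | 'terminated';
type CascadeAction =
  | 'drain_result'
  | 'respawn'
  | 'drain_error'
  | 'keep_waiting'
  | 'drain_terminated';

// Hypothetical helper mirroring the cascade table row by row.
function decideCascadeAction(status: ChildStatus, failureReason?: string): CascadeAction {
  switch (status) {
    case 'completed':
      return 'drain_result';
    case 'failed':
      // Only the parent-suspension marker triggers a re-spawn.
      return failureReason === 'parent_suspended' ? 'respawn' : 'drain_error';
    case 'running':
      return 'keep_waiting';
    case 'terminated':
      return 'drain_terminated';
  }
}

const actions = [
  decideCascadeAction('failed', 'parent_suspended'),
  decideCascadeAction('failed', 'Max steps exceeded'),
  decideCascadeAction('failed'), // failureReason lost or never set
];
```

The third call shows why the discriminator has to survive persistence: once `failureReason` is dropped, a parent-suspended child is indistinguishable from a genuine failure.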
Cross-Store Invariant
ALL five in-tree state stores (memory, redis, postgres, D1, DO) MUST persist SessionState.failureReason correctly. The exhaustive-fixes audit found three separate P0 bugs where stores dropped this field on persist — the cascade silently devolved into "always re-spawn" or "always drain". Tests in packages/e2e/src/__tests__/gamma-cascade-parity.integ.test.ts exercise the matrix.
Companion Tools and the Cascade
Persistent sub-agents managed via companion tools follow the same cascade. If a parent suspends while a companion__waitForResult blocking child is mid-flight, the child is marked 'paused_awaiting_client' (or 'failed' with 'parent_suspended' if the persistence model requires a terminal status), and the parent's resume drains via applyResultsAndReload. See Persistent Sub-Agent Execution below.
Lifecycle Hook Guarantees
Sub-agents fire their own lifecycle hooks (onAgentStart, onAgentComplete, onAgentFail) independently from the parent. This is critical for tracing integrations (e.g., Langfuse) that need to emit root spans when a sub-agent completes.
The Problem
Sub-agents share the parent's stream. Without special handling, the sub-agent's onAgentComplete hook would either not fire (if stream finalization is skipped) or would close the parent's stream prematurely.
The Solution: skipStreamClose
All runtimes call endAgentStream for sub-agents with skipStreamClose: true. This fires onAgentComplete without closing the parent's stream:
// Root agent: fire hooks and close stream
await endAgentStream({ sessionId });
// Sub-agent: fire hooks but keep parent's stream open
await endAgentStream({ sessionId: subAgentSessionId, skipStreamClose: true });
This applies to both completion paths:
- Normal completion (__finish__ tool) - sub-agent hooks fire with skipStreamClose: true
- finishWith completion - sub-agent hooks fire with both finishWithOutput and skipStreamClose: true
Per-Runtime Behavior
| Runtime | Root Agent | Sub-Agent |
|---|---|---|
| JS | endAgentStream({ sessionId }) | endAgentStream({ sessionId, skipStreamClose: true }) |
| Temporal | endAgentStream({ sessionId }) | endAgentStream({ sessionId, skipStreamClose: true }) |
| Cloudflare | endAgentStream({ sessionId }) | endAgentStream({ sessionId, skipStreamClose: true }) |
All three runtimes follow the same pattern, ensuring hook behavior is consistent regardless of where the agent runs.
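As a rough illustration of the pattern, the sketch below models the hooks and the shared stream with a hypothetical `SharedStream` class. Unlike the real `endAgentStream`, this version takes the stream as an explicit first argument so the block is self-contained.

```typescript
interface EndAgentStreamOptions {
  sessionId: string;
  skipStreamClose?: boolean;
}

// Hypothetical stand-in for the stream shared by a parent and its sub-agents.
class SharedStream {
  open = true;
  readonly log: string[] = [];
}

// Hooks always fire; the stream is closed only when skipStreamClose is not set.
function endAgentStream(stream: SharedStream, opts: EndAgentStreamOptions): void {
  stream.log.push(`onAgentComplete:${opts.sessionId}`);
  if (!opts.skipStreamClose) {
    stream.open = false;
    stream.log.push('stream_closed');
  }
}

const stream = new SharedStream();
endAgentStream(stream, { sessionId: 'child-1', skipStreamClose: true });
const openAfterSubAgent = stream.open; // the parent can keep writing
endAgentStream(stream, { sessionId: 'root' });
```

Both calls record a completion hook, but only the root call closes the stream.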
Error Handling
Sub-Agent Failure
If a sub-agent fails, the result indicates failure:
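{
type: 'subagent_end',
subAgentType: 'summarizer',
subSessionId: 'session-child-abc123',
callId: 'call_xyz789',
result: { error: 'Max steps exceeded' }, // Failure carried in the result payload (illustrative shape); the chunk has no `success` field
agentId: 'session-parent-xyz789',
agentType: 'parent-agent-type',
step: 3,
timestamp: 1702329601000
}
Parent Handling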
The parent receives the error as a tool result:
{
role: 'tool',
toolCallId: 's1',
toolName: 'subagent__summarizer',
content: '{"error":"Max steps exceeded"}'
}
The LLM can then decide how to handle the failure.
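The mapping from a sub-agent outcome to the tool result message shown above might look like the following. `toToolResultMessage` and `SubAgentOutcome` are hypothetical names used for illustration; the real helper in this document is `createSubAgentResultMessage`, whose internals are not shown here.

```typescript
interface ToolResultMessage {
  role: 'tool';
  toolCallId: string;
  toolName: string;
  content: string;
}

type SubAgentOutcome =
  | { status: 'completed'; output: unknown }
  | { status: 'failed'; error: string };

const SUBAGENT_TOOL_PREFIX = 'subagent__';

// Hypothetical mapper: success and failure both become an ordinary tool result.
function toToolResultMessage(
  toolCallId: string,
  agentType: string,
  outcome: SubAgentOutcome,
): ToolResultMessage {
  const payload = outcome.status === 'completed' ? outcome.output : { error: outcome.error };
  return {
    role: 'tool',
    toolCallId,
    toolName: `${SUBAGENT_TOOL_PREFIX}${agentType}`,
    content: JSON.stringify(payload),
  };
}

const failedMessage = toToolResultMessage('s1', 'summarizer', {
  status: 'failed',
  error: 'Max steps exceeded',
});
```

Because the failure arrives as regular tool output, the parent's LLM can retry, fall back, or report the error without any special control flow.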
State Reference Tracking
Parents track sub-agent references via the state store's addSubSessionRefs / updateSubSessionRef / getSubSessionRefs methods. Refs are persisted separately from the parent's custom state.
The canonical v7 SubSessionRef shape (from packages/core/src/types/session.ts):
interface SubSessionRef {
/** Sub-agent's session ID */
subSessionId: string;
/** Agent type (e.g. 'researcher') */
agentType: string;
/** Tool call ID that spawned this sub-agent */
parentToolCallId: string;
/**
* Workflow lifecycle status:
* - 'running'
* - 'completed'
* - 'failed'
* - 'interrupted'
* - 'terminated'
* - 'paused_awaiting_client' (sub-agent waiting on a client-tool result)
*/
status:
| 'running'
| 'completed'
| 'failed'
| 'interrupted'
| 'terminated'
| 'paused_awaiting_client';
startedAt: number;
completedAt?: number;
/**
* Remote-specific metadata (null/undefined for in-process sub-agents).
*
* Contains stream and sequence info for reconnecting to a remote agent
* after parent interruption or crash. URL/transport config is NOT stored
* here — it's reconstructed from the tool definition at resume time.
*/
remote?: {
streamId: string;
lastSequence: number;
};
/**
* Sub-agent lifecycle mode:
* - 'ephemeral': One-shot execution, session discarded after completion (default)
* - 'persistent': Long-lived session, can be resumed and reused across parent runs
*/
mode: 'ephemeral' | 'persistent';
/**
* Stable name for persistent sub-agents (used to generate deterministic
* session IDs). Required when `mode === 'persistent'`; absent for
* ephemeral sub-agents. Use `assertPersistentSubSessionRef()` to narrow
* the type when `name` must be present.
*/
name?: string;
}
This allows:
- Querying sub-agent status
- Retrieving sub-agent results (load each subSessionId for terminal output / error / failureReason)
- Cleanup on parent completion
- Crash-recovery reconnect to remote sub-agent streams via remote.streamId + remote.lastSequence
- γ-cascade re-spawn decisions on parent resume (combined with each child's SessionState.failureReason)
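A couple of these queries can be sketched against a trimmed-down ref type. `SubSessionRefLite`, `isPersistentRef`, and `unfinishedChildren` are hypothetical and only approximate the role of helpers such as `assertPersistentSubSessionRef()`. The ephemeral session ID in the sample data is made up.

```typescript
// Trimmed-down, hypothetical version of SubSessionRef with only the fields used here.
interface SubSessionRefLite {
  subSessionId: string;
  agentType: string;
  status: 'running' | 'completed' | 'failed' | 'interrupted' | 'terminated' | 'paused_awaiting_client';
  mode: 'ephemeral' | 'persistent';
  name?: string;
}

type PersistentRef = SubSessionRefLite & { mode: 'persistent'; name: string };

// Type guard: persistent refs must carry a name.
function isPersistentRef(ref: SubSessionRefLite): ref is PersistentRef {
  return ref.mode === 'persistent' && typeof ref.name === 'string';
}

// Children that still need attention before the parent can clean up.
function unfinishedChildren(refs: SubSessionRefLite[]): SubSessionRefLite[] {
  return refs.filter((r) => r.status === 'running' || r.status === 'paused_awaiting_client');
}

const refs: SubSessionRefLite[] = [
  { subSessionId: 'p-1-agent-researcher', agentType: 'researcher', status: 'running', mode: 'persistent', name: 'researcher' },
  { subSessionId: 'p-1-call-7', agentType: 'summarizer', status: 'completed', mode: 'ephemeral' },
];
const persistentNames = refs.filter(isPersistentRef).map((r) => r.name);
```

After the type guard, `r.name` is a plain `string`, so downstream code does not need to re-check for `undefined`.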
Best Practices
1. Clear Input/Output Contracts
// Define clear schemas for sub-agent communication
const SummarizerInputSchema = z.object({
texts: z.array(z.string()),
maxLength: z.number().optional().default(500),
});
const SummarizerOutputSchema = z.object({
summary: z.string(),
keyPoints: z.array(z.string()),
});
2. Meaningful Agent Types
// Good: descriptive type names
agentType: 'code-reviewer';
agentType: 'data-analyzer';
agentType: 'email-composer';
// Bad: generic names
agentType: 'helper';
agentType: 'agent1';
3. Limit Nesting Depth
Avoid deep nesting of sub-agents:
Parent
└── Sub-Agent
└── Sub-Sub-Agent // Avoid this level
└── ... // Definitely avoid this
4. Handle Failures Gracefully
// In parent's system prompt
'If a sub-agent fails, try to complete the task yourself or report the failure.';
Testing
import { MockLLMAdapter } from '@helix-agents/core';
describe('SubAgent', () => {
it('executes sub-agent and returns result', async () => {
const mock = new MockLLMAdapter([
// Parent calls sub-agent
{
type: 'tool_calls',
toolCalls: [],
subAgentCalls: [
{
id: 's1',
agentType: 'summarizer',
input: { texts: ['text1'] },
},
],
},
// Parent finishes with sub-agent result
{
type: 'structured_output',
output: { result: 'Used summary: ...' },
},
]);
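// Register both agents (registry and executor are assumed to be set up with the mock adapter above)
registry.register(ParentAgent);
registry.register(SummarizerAgent);
// execute() returns a handle; await its result before asserting
const handle = await executor.execute(ParentAgent, 'Summarize texts');
const result = await handle.result();
expect(result.status).toBe('completed');
});
});
Remote Sub-Agent Execution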
Remote sub-agents (createRemoteSubAgentTool()) follow a different execution path from local sub-agents. Instead of spawning an in-process or child workflow execution, remote sub-agents delegate to agents reachable via a RemoteAgentTransport — typically HttpRemoteAgentTransport for cross-service calls, or DOStubTransport for sibling Durable Object routing.
DO Runtime Transparent Rewriting
In the Cloudflare DO runtime, createSubAgentTool() tools are transparently rewritten to createRemoteSubAgentTool() at execution time when subAgentNamespace is configured. This means the three-way routing below still applies — by the time the executor sees the tools, local sub-agent tools have already been converted to remote sub-agent tools backed by DOStubTransport. See Sub-Agents in the DO Runtime.
Three-Way Tool Call Routing
When the LLM returns tool calls, all three runtimes partition them into three groups:
- Regular tools - Executed via the standard tool execution path
- Local sub-agent calls - Routed to child workflow / in-process execution
- Remote sub-agent calls - Routed to a dedicated executeRemoteSubAgentCall activity/step
Detection uses isRemoteSubAgentTool() which checks the _isRemoteSubAgent marker on the tool.
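A minimal sketch of the partitioning step follows. It assumes the remote marker is checked before the `subagent__` prefix, and that tool definitions are looked up by name. `partitionToolCalls`, `ToolDef`, and the sample tool names are hypothetical.

```typescript
const SUBAGENT_TOOL_PREFIX = 'subagent__';

interface ToolCall { id: string; name: string }
interface ToolDef { name: string; _isRemoteSubAgent?: boolean }
interface Partition { regular: ToolCall[]; local: ToolCall[]; remote: ToolCall[] }

// Hypothetical partitioner: remote marker first, then the local prefix, else a regular tool.
function partitionToolCalls(calls: ToolCall[], tools: Map<string, ToolDef>): Partition {
  const out: Partition = { regular: [], local: [], remote: [] };
  for (const call of calls) {
    const tool = tools.get(call.name);
    if (tool !== undefined && tool._isRemoteSubAgent === true) out.remote.push(call);
    else if (call.name.startsWith(SUBAGENT_TOOL_PREFIX)) out.local.push(call);
    else out.regular.push(call);
  }
  return out;
}

const tools = new Map<string, ToolDef>([
  ['search', { name: 'search' }],
  ['subagent__summarizer', { name: 'subagent__summarizer' }],
  ['translate_remote', { name: 'translate_remote', _isRemoteSubAgent: true }],
]);

const partition = partitionToolCalls(
  [
    { id: 't1', name: 'search' },
    { id: 't2', name: 'subagent__summarizer' },
    { id: 't3', name: 'translate_remote' },
  ],
  tools,
);
```

Checking the marker first keeps the routing correct even if a remote tool's name happens to carry the local prefix.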
Execution Flow
- Generate deterministic session ID - {parentSessionId}-remote-{toolCallId} ensures idempotent restarts
- Register SubSessionRef - Tracks the remote session with remote: { streamId, lastSequence } metadata
- Call transport.start() - Sends POST /start to the remote agent server
- Consume transport.stream() - Reads SSE events, proxies chunks to the parent stream
- Handle completion - Updates SubSessionRef status, returns output as tool result
Crash Recovery (Temporal and Cloudflare)
If the runtime crashes mid-execution:
- Check transport.getStatus() - Determine if the remote agent is still running, completed, or failed
- If completed - Return the output directly without re-executing
- If still running - Reconnect to transport.stream() with fromSequence to avoid duplicate events
- If failed - Return the error
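The recovery decision above can be written as a pure function over the reported status. `RemoteStatus`, `planRecovery`, and `unseenEvents` are hypothetical shapes. Whether the real transport expects `lastSequence` or `lastSequence + 1` as `fromSequence` is not specified here, so the sketch passes the stored value through and filters duplicates explicitly.

```typescript
type RemoteStatus =
  | { state: 'running' }
  | { state: 'completed'; output: unknown }
  | { state: 'failed'; error: string };

type RecoveryAction =
  | { kind: 'return_output'; output: unknown }
  | { kind: 'reconnect'; fromSequence: number }
  | { kind: 'return_error'; error: string };

// Hypothetical planner: one branch per recovery rule.
function planRecovery(status: RemoteStatus, lastSequence: number): RecoveryAction {
  switch (status.state) {
    case 'completed':
      return { kind: 'return_output', output: status.output };
    case 'running':
      return { kind: 'reconnect', fromSequence: lastSequence };
    case 'failed':
      return { kind: 'return_error', error: status.error };
  }
}

// Drop events that were already proxied to the parent stream before the crash.
function unseenEvents<T extends { sequence: number }>(events: T[], lastSequence: number): T[] {
  return events.filter((e) => e.sequence > lastSequence);
}

const replayed = [{ sequence: 40 }, { sequence: 41 }, { sequence: 42 }];
const fresh = unseenEvents(replayed, 41);
```

Keeping the decision separate from the I/O makes each branch easy to test without a live remote agent.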
Resume Reconnection (JS Runtime)
The JS runtime's reconcileRemoteSubAgents() handles resume after interrupts:
- Loads SubSessionRefs with remote metadata from the state store
- For each running remote sub-agent, checks transport.getStatus()
- Reconnects to streams or records completions/failures
- Updates tool result messages in the conversation history
See Also
- Remote Agents Guide — Full guide with setup and patterns
- API Reference — AgentServer and transport API
Persistent Sub-Agent Execution
Persistent sub-agents use a different execution model from ephemeral sub-agents. Instead of being invoked as tool calls that run to completion and return results, persistent children are long-lived agents managed through companion tools.
Companion Tool Architecture
When buildEffectiveTools() processes an agent with persistentAgents, it dynamically generates companion tools:
// In buildEffectiveTools (packages/core/src/orchestration/state-operations.ts)
if (config.persistentAgents && config.persistentAgents.length > 0) {
// Generate companion tools based on configured persistent agents
tools.push(createSpawnAgentTool(config.persistentAgents));
tools.push(createSendMessageTool());
tools.push(createListChildrenTool());
tools.push(createGetChildStatusTool());
tools.push(createTerminateChildTool());
// waitForResult only available if at least one blocking agent exists
if (config.persistentAgents.some((pa) => pa.mode === 'blocking')) {
tools.push(createWaitForResultTool());
}
}
Companion Tool Call Routing
Companion tool calls are handled separately from regular tool calls in all runtimes. The three-way tool routing becomes four-way:
- Regular tools -- Standard tool execution
- Local sub-agent calls -- Ephemeral child agent execution
- Remote sub-agent calls -- HTTP-based delegation
- Companion tool calls -- Persistent child management
Detection uses isCompanionTool() which checks the _isCompanionTool marker:
if (isCompanionTool(tool)) {
// Route to the shared core dispatcher (each runtime supplies its own deps).
return executeCompanionToolDispatch(input, deps);
}
Execution Flow: Blocking Spawn (v7)
graph TB
Parent["Parent Agent"]
Parent --> Spawn["companion__spawnAgent called"]
Spawn --> CreateRef["Create SubSessionRef (mode: 'persistent')"]
CreateRef --> CreateChild["Initialize child agent session"]
CreateChild --> RunChild["Execute child agent loop"]
RunChild --> ChildComplete["Child calls __finish__"]
ChildComplete --> UpdateRef["Update SubSessionRef (status: 'completed')"]
UpdateRef --> ReturnResult["Return result to parent"]
ReturnResult --> ParentContinues["Parent continues execution"]
v7 stateless suspension note: If the parent calls companion__waitForResult and the blocking child has not finished yet, the parent's runLoop EXITS with RunOutcome.kind = 'suspended_awaiting_children' (AgentResult.status = 'suspended_awaiting_children'). The parent's suspendedAwaitingChildren map is persisted; the runtime returns.
Sub-agent completion in another session context fires the parent's resume via the same __resume-N workflow-id convention used for client-tool resumes (single-dash on Temporal/CFW). On resume, applyResultsAndReload (Temporal/CFW) or the equivalent JS/DO bootstrap drains the result via recordSubSessionResult and the parent continues from the durable checkpoint.
If the parent suspends while ANY children are still running (e.g. via a separate client-tool boundary), the γ-cascade applies: children are marked failed:'parent_suspended' and re-spawned on resume. See ./concepts.md §Client-Executed Tools for per-runtime resume mechanics.
Execution Flow: Non-Blocking Spawn
graph TB
Parent["Parent Agent"]
Parent --> Spawn["companion__spawnAgent called"]
Spawn --> CreateRef["Create SubSessionRef (mode: 'persistent')"]
CreateRef --> StartChild["Start child agent (fire-and-forget)"]
StartChild --> ReturnImmediate["Return immediately to parent"]
ReturnImmediate --> ParentContinues["Parent continues execution"]
StartChild --> ChildRuns["Child runs concurrently"]
ChildRuns --> ChildComplete["Child completes later"]
ChildComplete --> UpdateRef["Update SubSessionRef"]
SubSessionRef with Persistent Mode
The SubSessionRef interface (full schema in State Reference Tracking above) carries mode and name fields specifically for persistent children:
- mode: 'persistent' - Long-lived session, resumable across parent runs.
- name: string - Required when mode === 'persistent'; used to generate deterministic session IDs ({parentSessionId}-agent-{name}).
For ephemeral children: mode: 'ephemeral' and name is omitted; session ID is derived from the spawning tool call ID instead.
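The two ID formats this document spells out can be written as small pure functions. The function names are hypothetical; the formats are the persistent one given here and the remote one from the Remote Sub-Agent Execution section. The ephemeral local format is not specified in this document, so it is left out.

```typescript
// Persistent child: {parentSessionId}-agent-{name}
function persistentSessionId(parentSessionId: string, name: string): string {
  return `${parentSessionId}-agent-${name}`;
}

// Remote sub-agent: {parentSessionId}-remote-{toolCallId}
function remoteSessionId(parentSessionId: string, toolCallId: string): string {
  return `${parentSessionId}-remote-${toolCallId}`;
}

// Same inputs always yield the same ID, so a restart finds the existing session.
const firstAttempt = persistentSessionId('session-parent-xyz789', 'researcher');
const afterRestart = persistentSessionId('session-parent-xyz789', 'researcher');
const remoteId = remoteSessionId('session-parent-xyz789', 'call_xyz789');
```

Deriving the ID from stable inputs rather than a random generator is what lets a re-run address the existing child instead of creating a duplicate.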
Continuing a completed child
A persistent companion declares an outputSchema, so it always completes via the auto-injected __finish__ tool. Re-consulting a completed child — companion__sendMessage, or companion__spawnAgent re-using its name — is now continued on its preserved session (memory retained, fresh typed output) rather than thrown (sendMessage) or delete-and-respawned (spawnAgent). The continuation reopens the finished session (completed → active), heals any dangling __finish__, appends the consult, and runs a new turn with a fresh per-turn maxSteps budget. failed / terminated children still re-spawn fresh (the old session is cleaned up). The continuation primitive diverges per runtime (see each subsection below and the cross-runtime parity table in ./concepts.md §Persistent-companion continuation + the __finish__ heal), but the heal and the per-turn reset are uniform.
What a new (6th) runtime must implement
A runtime that reimplements the step loop / companion dispatch does not inherit companion continuation for free. To reach parity it must wire all four:
- Eager __finish__ heal at its own step site - append the synthetic tool_result via the shared synthesizeFinishToolResult(toolCallId) (message-builder.ts:233) immediately after the terminal __finish__ assistant message. (Only runtime-js inherits this from core runStepIteration; Temporal/DBOS/CFW replicate it; see ./step-processing.md §The __finish__ history invariant.)
- Legacy reopen heal - before appending the consult on a continuation, scan the preserved history with findUnpairedFinishCallId(messages) (message-builder.ts:251) and synthesize the missing result if a dangling __finish__ is present (covers sessions completed under a pre-heal release).
- The continuation primitive - CAS completed → active, heal-before-append, append the consult exactly once, reset stepCount → 0 so the continued turn gets a fresh per-turn maxSteps budget. customState (memory), workspaceRef, and subSessionRefs are preserved.
- Replay-idempotency if the runtime re-executes its dispatch on recovery - use deterministic continuation ids and a durable marker short-circuit. DBOS is the reference pattern: a deterministic ${childSessionId}-continue-${toolCallId} restart workflow id (so DBOS.startWorkflow dedupes) plus a metadata.dbosWorkflowId marker check that re-enters the idempotent continuation branch instead of re-sending / re-launching on workflow-body replay (applies to both spawn-continue and sendMessage-continue).
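The reopen heal and the continuation primitive from the list above can be sketched against an in-memory session. Everything here is an assumption made for illustration: the `Message` and `Session` shapes, `findDanglingFinish` (a local analog of `findUnpairedFinishCallId`), the synthetic result content, and `continueCompletedChild`. A real runtime would perform the status flip as a compare-and-swap in its state store.

```typescript
interface Message {
  role: 'user' | 'assistant' | 'tool';
  content: string;
  toolCalls?: { id: string; name: string }[];
  toolCallId?: string;
}

interface Session {
  status: 'active' | 'completed' | 'failed';
  stepCount: number;
  messages: Message[];
  customState: Record<string, unknown>;
}

// Find a __finish__ call that has no matching tool result in the history.
function findDanglingFinish(messages: Message[]): string | undefined {
  const answered = new Set(messages.filter((m) => m.role === 'tool').map((m) => m.toolCallId));
  for (const m of messages) {
    for (const tc of m.toolCalls ?? []) {
      if (tc.name === '__finish__' && !answered.has(tc.id)) return tc.id;
    }
  }
  return undefined;
}

// Returns false when the completed -> active transition is lost (e.g. on a replay).
function continueCompletedChild(session: Session, consult: string): boolean {
  if (session.status !== 'completed') return false;
  session.status = 'active';
  const dangling = findDanglingFinish(session.messages);
  if (dangling !== undefined) {
    // Heal before append; the content below is a placeholder, not the real synthetic result.
    session.messages.push({ role: 'tool', toolCallId: dangling, content: '{"finished":true}' });
  }
  session.messages.push({ role: 'user', content: consult }); // appended exactly once
  session.stepCount = 0; // fresh per-turn budget; customState is left untouched
  return true;
}

const child: Session = {
  status: 'completed',
  stepCount: 7,
  customState: { topic: 'pricing' },
  messages: [
    { role: 'user', content: 'first consult' },
    { role: 'assistant', content: '', toolCalls: [{ id: 'f1', name: '__finish__' }] },
  ],
};
const firstContinue = continueCompletedChild(child, 'second consult');
const replayedContinue = continueCompletedChild(child, 'second consult');
```

The second call loses the status check and does nothing, which is the single-winner behavior a replay relies on.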
Per-Runtime Implementation
runtime-js
Companion tool calls route through the shared core executeCompanionToolDispatch (the JS-specific executeCompanionToolCall handler was retired during the convergence). The JS run loop supplies a waitForChildTerminal park primitive, so a blocking spawn runs the child's loop in-process and parks inline until the child is terminal, returning the result inline — it does NOT suspend the parent with 'suspended_awaiting_children' (that stateless-suspension model is used by the Temporal / Cloudflare-Workflows runtimes, whose orchestration can't park inline). Non-blocking spawns start the child detached (fire-and-forget); the parent observes completion via the deliver-on-next-turn notifier, getSubSessionRefs, or companion__waitForResult.
Continuing a completed child uses the in-process continuePersistentChild dependency: it reopens the preserved session, resets stepCount → 0, and runs a new turn in the same run loop. A failed/aborted continued turn surfaces as failed (it routes through runSubAgentLoop, which throws on a non-completed outcome).
runtime-temporal
Companion tool calls are executed via an executeCompanionToolCall activity. Blocking spawns use wf.startChild to run a child workflow. In v7, the parent does NOT block awaiting children in-workflow — if the parent reaches a HITL boundary (or its own deadline), it exits with 'suspended_awaiting_children'. After a blocking child completes, markPersistentChildStatus updates the SubSessionRef; the parent's __resume-N workflow drains the result via applyResultsAndReload.
The Temporal runtime does NOT store companion tool results as separate ToolResultMessages in the state store. Results flow through the workflow execution context (or through the resume drain on suspension).
Continuing a completed child runs the store-side reopen (CAS completed → active, append the consult, reset the parent ref) inside the executeCompanionToolCall activity — so it is checkpointed and replay-idempotent — then starts a fresh child workflow via wf.startChild with a unique __continue__<stepCount> workflow id (never colliding with the completed child's existing workflow). The continued turn resets stepCount → 0 via the shared continued-turn init path.
runtime-cloudflare (Workflow path)
Companion tool calls run inside runAgentWorkflow steps. Blocking spawns start child workflow instances; non-blocking spawns fire-and-forget. Same v7 stateless model as Temporal — parent exits with 'suspended_awaiting_children' if it reaches a HITL boundary; resume drains via applyResultsAndReload.
Continuing a completed child runs the child as a fresh workflow instance with a unique-but-deterministic id agent__<type>__<childSession>__continue__<stepCount>__<toolCallId> (a Cloudflare Workflows instance id is write-once globally, so the completed child's base id can never be recreated). Unlike Temporal/DBOS, the CFW companion step deliberately does not CAS the child completed → active and does not append the consult there: the child stays completed, the consult is carried as the new instance's newMessages, and the instance's workflow body takes the !isResumable continuation branch (the same path as root multi-turn continuation) — which resets stepCount, appends the consult exactly once, and re-runs over the full preserved history.
runtime-cloudflare (DO path)
Companion tool calls execute inline in the DO's run-loop iteration. The DO can be evicted during the wait — on the next request that touches the parent session, a fresh DO observes the durable suspension state and resumes. D1StateStore was updated with a migration to add mode and name columns to the __agents_sub_session_refs table.
Continuing a completed child POSTs /start to the child session, which drives the JS executor's existing-session continuation path (reopen + heal + per-turn stepCount → 0); a non-blocking spawn-continue also re-resets the parent SubSessionRef. So the CF-DO path inherits the JS continuation semantics (including the legacy reopen heal on JSAgentExecutor's existing-session continuation).
runtime-dbos
DBOS supports persistent sub-agents. Companion tool calls are dispatched by the runExecuteCompanionTool step. listChildren / getChildStatus / terminateChild route through the shared core dispatcher (executeCompanionToolDispatch); spawnAgent, sendMessage, and waitForResult are handled by DBOS-local logic so they can use DBOS primitives — a child is started via startPersistentWorkflow, sendMessage appends to the child's durable inbox via DBOS.send, and waitForResult polls with the durable DBOS.sleep (survives crashes). Blocking spawns poll until the child reaches a terminal status. A completion notifier (a @DBOS.step) delivers a completed non-blocking child's outcome into the parent's next turn, deduplicated by the durable completionDelivered flag on the SubSessionRef.
Continuing a completed child cannot deliver via DBOS.send(..., 'inbox') — a completed DBOS child has no live recv loop (its workflow exited via finalizeLoop). Instead startPersistentContinuation starts a fresh persistent restart workflow on the preserved session with a deterministic id (${childSessionId}-continue-${toolCall.id}), so a workflow-body replay computes the same id and DBOS.startWorkflow dedupes — a replay never double-starts the continuation. The store-side reopen (CAS completed → active + the legacy defensive __finish__ heal) is CAS-gated single-winner (first execution heals; a replay loses the CAS and skips it), and the consult is carried as the restart workflow's initialMessage, appended exactly once by the workflow body's checkpointed append step. Because handleSpawnAgent / handleSendMessage run in the workflow body (not a @DBOS.step) and re-run on recovery, a continuation-replay is additionally detected from the durable metadata.dbosWorkflowId marker matching the deterministic continue id — short-circuiting before the live-child paths so neither spawn-continue nor sendMessage-continue re-sends, re-interrupts, or re-launches on replay. The continued turn starts at stepNumber = 0 (per-turn budget).
Known limitations: the blocking-spawn path currently blocks-until-idle and can mis-report a failed child as completed (tracked in docs/dev/follow-ups.md as FU-DBOS-BLOCKING-SPAWN-SEMANTICS). Workspaces on persistent children are not supported on DBOS — a persistent child that declares a workspace fails fast at spawn (C8 workspace fail-fast; see the Sub-Agents guide); declare workspaces only on JS / Cloudflare DO persistent children.
State Store Requirements
The persistent sub-agent feature requires state stores to support the mode, name, and completionDelivered fields on SubSessionRef (completionDelivered is the durable dedup flag that prevents a child's completion from being delivered to the parent twice):
| Store | Support | Notes |
|---|---|---|
| InMemoryStateStore | Yes | Fields stored in-memory |
| RedisStateStore | Yes | Fields serialized as hash fields (completionDelivered as '1'/'0') |
| PostgresStateStore | Yes | Schema includes mode + name; V11 migration adds completion_delivered BOOLEAN |
| D1StateStore | Yes | Migration adds mode/name; D1 V14 adds completion_delivered INTEGER |
| DOStateStore | Yes | SQLite schema includes fields; DO V8 adds completion_delivered |
All stores read an absent/null completion_delivered as "undelivered", so a session created before the migration delivers a completed child's outcome exactly once on its first post-upgrade parent turn. The cross-store contract test (packages/core/src/testing/sub-session-operations.ts) pins this round-trip on every store.
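The "absent means undelivered" rule can be illustrated with a small decoder. `StoredFlag`, `isCompletionDelivered`, and `deliverOnce` are hypothetical names. The accepted raw values follow the store notes in the table ('1'/'0' strings, booleans, integers), and a real store would need to make the mark-as-delivered write durable.

```typescript
type StoredFlag = string | number | boolean | null | undefined;

// Anything other than an explicit "true" encoding counts as undelivered.
function isCompletionDelivered(raw: StoredFlag): boolean {
  return raw === true || raw === 1 || raw === '1';
}

interface ChildRefRow {
  subSessionId: string;
  completionDelivered: StoredFlag;
}

// Returns true only for the call that actually performs the delivery.
function deliverOnce(row: ChildRefRow, deliver: (id: string) => void): boolean {
  if (isCompletionDelivered(row.completionDelivered)) return false;
  deliver(row.subSessionId);
  row.completionDelivered = true;
  return true;
}

const delivered: string[] = [];
// A row written before the migration has no value for the flag.
const legacyRow: ChildRefRow = { subSessionId: 'child-1', completionDelivered: null };
const firstTurn = deliverOnce(legacyRow, (id) => delivered.push(id));
const secondTurn = deliverOnce(legacyRow, (id) => delivered.push(id));
```

A pre-migration row is delivered on the first turn and skipped on every later one.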
See Also
- Sub-Agents Guide
- Remote Agents Guide
- Durable Objects Sub-Agents - Transparent sub-agent routing via DOStubTransport
- Step Processing
- Stream Protocol