DBOS Runtime

The DBOS runtime (@helix-agents/runtime-dbos) executes agents as durable DBOS Transact workflows backed by Postgres. Workflow state — which steps have completed, their cached results, in-flight recv messages — lives in ordinary Postgres tables managed by DBOS, so you get crash recovery, automatic step retries, and durable interrupt/abort without standing up a separate workflow server.

When to Use

Good fit:

You already run Postgres and don't want to operate separate Temporal infrastructure
Production workloads requiring durable execution and crash recovery
Long-running agents that must survive process restarts
Chat assistants or stateful sessions that benefit from a hibernating, always-addressable workflow (persistent mode)
Teams that want to leverage existing Postgres tooling for observability and querying

Not ideal for:

Quick development iteration where the JS runtime's zero-infra story is simpler
Agents that need workspaces — DBOS does not support them (see Known Limitations)
Deployments with no Postgres and no appetite to add one

Prerequisites

You need a Postgres database. DBOS uses it as its system database (workflow status, step cache, in-flight messages) and @helix-agents/store-postgres shares the same instance for session state.

bash

# Local development with Docker
docker run -d --name dbos-postgres \
  -e POSTGRES_PASSWORD=dbos \
  -p 5432:5432 \
  postgres:16

Token-by-token streaming requires Redis (RedisStreamManager) — Postgres write latency is too high for per-token chunk fan-out.

Installation

bash

npm install @helix-agents/runtime-dbos @helix-agents/store-postgres @helix-agents/store-redis @helix-agents/llm-vercel @dbos-inc/dbos-sdk

Architecture

mermaid

graph TB
    subgraph App ["Your Application Process"]
        Executor["<b>DBOSAgentExecutor</b><br/>execute / resume / retry / submitToolResult<br/>Binds step deps · registers workflows"]
        Workflow["<b>DBOS Workflow Body</b> (deterministic)<br/>planStepProcessing() · buildMessagesForLLM()<br/>shouldStopExecution()"]
        Steps["<b>@DBOS.step()s</b><br/>loadState · callLLM · executeTool<br/>checkpoint · client-tool · approval-gate"]
        Executor --> Workflow --> Steps
    end

    Steps --> PG["Postgres<br/>(session state + DBOS system tables)"]
    Steps --> LLM["LLM API"]
    Steps --> Redis["RedisStreamManager<br/>(token streaming)"]

The workflow body must be deterministic so DBOS can replay it on crash recovery. All I/O (LLM calls, Postgres reads/writes, tool execution, streaming) is wrapped in @DBOS.step() calls whose results are checkpointed in DBOS system tables. See Crash Recovery & Determinism below and the architecture deep-dive.

Setup Guide

The DBOS runtime registers its workflow bodies via module-level decorator evaluation. The critical ordering rule is: register workflows and bind your LLM adapter before DBOS.launch(), because DBOS.launch() asynchronously recovers any pending workflows, and each recovered workflow body reads the live LanguageModel registry at entry.

typescript

import { DBOS } from '@dbos-inc/dbos-sdk';
import { defineAgent } from '@helix-agents/core';
import { DBOSAgentExecutor, registerDBOSAgentWorkflows } from '@helix-agents/runtime-dbos';
import { PostgresStateStore } from '@helix-agents/store-postgres';
import { RedisStreamManager } from '@helix-agents/store-redis';
import { VercelAIAdapter } from '@helix-agents/llm-vercel';
import { openai } from '@ai-sdk/openai';

const agent = defineAgent({
  name: 'assistant',
  systemPrompt: 'You are a helpful assistant.',
  tools: [],
  llmConfig: { model: openai('gpt-4o') },
  maxSteps: 10,
});

// 1. Register workflow bodies (evaluates @DBOS.workflow() decorators). Idempotent.
registerDBOSAgentWorkflows();

// 2. Construct the executor BEFORE launch. This binds every @DBOS.step's static
//    deps (state store, stream manager, LLM adapter, hooks, etc.) and calls
//    registerDBOSAgentWorkflows() again internally (idempotent).
const executor = new DBOSAgentExecutor({
  stateStore: new PostgresStateStore({ connectionString: process.env.DATABASE_URL! }),
  streamManager: new RedisStreamManager({ url: process.env.REDIS_URL! }),
  llmAdapter: new VercelAIAdapter({ model: openai('gpt-4o') }),
});

// 3. Launch DBOS (begins async recovery of pending workflows, which read the
//    bindings made in step 2).
await DBOS.launch();

// 4. Execute. sessionId is REQUIRED on the DBOS runtime.
const handle = await executor.execute(agent, { message: 'Hello' }, { sessionId: 'session-1' });
const result = await handle.result();

Constructor config differs from { registry }

Unlike Temporal's AgentRegistry-based wiring, DBOSAgentExecutor takes { stateStore, streamManager, llmAdapter } directly. There is no registry argument — agents are registered lazily (per execute() / resume() / retry() call) via the internal registerAgent helper. The { registry } snippet in some older overview tables is not the actual constructor shape.

Registering agents for recovery

DBOSAgentExecutor.execute() / resume() / retry() register the agent internally before starting a workflow. You only need to call registerAgent explicitly when a process may host a recovered workflow on DBOS.launch() without first going through one of those calls — for example, a process whose only job is delivering executor.submitToolResult(...):

typescript

import { registerAgent } from '@helix-agents/runtime-dbos';

// Before DBOS.launch(), in a submit-only / recovery-only process:
registerAgent(agent);

registerAgent is idempotent. If the same agent.name is later re-registered with a different live LanguageModel, it logs a warning and overwrites (last writer wins) to surface config drift.
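The idempotent, last-writer-wins behaviour described above can be pictured with a small in-memory registry. This is only an illustrative sketch: `SketchRegistry`, `AgentLike`, and the warning text are hypothetical names invented here, not the internals of @helix-agents/runtime-dbos.

```typescript
// Hypothetical stand-ins; the real registry lives inside the runtime package.
interface AgentLike {
  name: string;
  model: object; // stands in for the live LanguageModel instance
}

class SketchRegistry {
  private readonly agents = new Map<string, AgentLike>();
  readonly warnings: string[] = [];

  register(agent: AgentLike): void {
    const existing = this.agents.get(agent.name);
    if (existing && existing.model !== agent.model) {
      // Same name, different live model: surface the drift, then overwrite.
      this.warnings.push(`agent "${agent.name}" re-registered with a different model`);
    }
    this.agents.set(agent.name, agent);
  }

  get(name: string): AgentLike | undefined {
    return this.agents.get(name);
  }
}

const registry = new SketchRegistry();
const modelA = { id: 'model-a' };
const modelB = { id: 'model-b' };

registry.register({ name: 'assistant', model: modelA });
registry.register({ name: 'assistant', model: modelA }); // idempotent: no warning
registry.register({ name: 'assistant', model: modelB }); // drift: warns, then overwrites
```

After the three calls the sketch holds one warning and resolves 'assistant' to the second model, which is the "last writer wins" outcome.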

`bindCallLLMStep`

bindCallLLMStep(llmAdapter, streamManager) wires the LLM adapter into the DBOS CallLLMStep static slot. Application code should not call this directly — the DBOSAgentExecutor constructor calls it for you from your config. It is exported only because e2e test infrastructure needs to swap the adapter between runs.

Execution Model

The DBOS runtime is architecturally closest to the Temporal runtime — both derive workflow IDs from sessionId, wrap I/O in durable step functions, and checkpoint intermediate results. The key architectural difference is the suspension primitive: where the v7 stateless-suspension runtimes (JS, Temporal, Cloudflare) use a durable-state suspensionContext and exit the workflow at every HITL boundary, DBOS uses its native DBOS.recv / DBOS.send primitives over Postgres-backed workflow replay. The workflow suspends in-place on await DBOS.recv(toolCallId) and wakes when a submission is delivered via DBOS.send. This is functionally equivalent for callers but architecturally separate from the unified suspensionContext model.

Standard vs Persistent Mode

DBOS supports two operational modes. The mode is fixed at the first execute() for a session and cannot change (mixing throws DBOSModeMismatchError).

| | Standard mode (default) | Persistent mode |
| --- | --- | --- |
| Workflow lifetime | One short-lived workflow per `execute()` / `resume()` / `retry()`, runs to completion then exits | One long-lived `DBOS.recv()`-loop workflow per session, forever |
| Workflow ID | `agent__{name}__{sessionId}__run__{runId}` (resume AND retry both use the `__resume__N` form) | `agent__{name}__{sessionId}` (no run suffix) |
| Multi-turn | Each turn starts a fresh workflow that loads accumulated state | Subsequent turns route into the existing workflow via `DBOS.send(workflowId, msg, 'inbox')` |
| Hibernation | N/A (workflow exits between turns) | Blocks on `DBOS.recv`: zero compute, instant wake via Postgres `LISTEN/NOTIFY` |
| Best for | Request-response, batch, long analyses | Chat assistants, agents holding resources/locks across turns |

Opt into persistent mode per-call ({ mode: 'persistent' }), per-agent (defaultMode: 'persistent'), or rely on the framework default ('standard'). Other runtimes ignore mode / defaultMode — they are no-ops there. See the persistent-mode deep-dive for the full lifecycle, idle-TTL semantics, and interrupt-vs-abort behavior.
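To make the ID shapes and the fixed-mode rule concrete, here is a minimal sketch. The helper names (`sketchWorkflowId`, `resolveMode`, `pinMode`, `SketchModeMismatchError`) are hypothetical, and the precedence of the per-call option over the agent default is an assumption; only the ID templates, the 'standard' default, and the fact that mixing modes throws come from the text above.

```typescript
type Mode = 'standard' | 'persistent';

// Hypothetical helper mirroring the documented workflow ID shapes.
function sketchWorkflowId(name: string, sessionId: string, mode: Mode, runId?: string): string {
  const base = `agent__${name}__${sessionId}`;
  if (mode === 'persistent') return base; // one workflow per session, no run suffix
  if (runId === undefined) throw new Error('standard mode needs a runId');
  return `${base}__run__${runId}`;
}

// Assumed resolution order: per-call option, then agent default, then 'standard'.
function resolveMode(callMode?: Mode, agentDefault?: Mode): Mode {
  return callMode ?? agentDefault ?? 'standard';
}

// Stand-in for DBOSModeMismatchError.
class SketchModeMismatchError extends Error {}

const sessionModes = new Map<string, Mode>();

// The first call for a session pins its mode; a later call with another mode throws.
function pinMode(sessionId: string, mode: Mode): void {
  const pinned = sessionModes.get(sessionId);
  if (pinned !== undefined && pinned !== mode) {
    throw new SketchModeMismatchError(`session ${sessionId} is pinned to ${pinned}`);
  }
  sessionModes.set(sessionId, mode);
}

const standardId = sketchWorkflowId('assistant', 's1', 'standard', 'r1');
const persistentId = sketchWorkflowId('assistant', 's1', 'persistent');
```

Because the persistent ID carries no run suffix, every turn for a session addresses the same workflow, which is what lets later messages be routed into it.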

Crash Recovery & Determinism

On process restart, DBOS replays the workflow body from the last completed step. For replay to be correct, the workflow body must produce the same sequence of step calls in the same order every time:

Inside the workflow body: use only pure functions (planStepProcessing, buildMessagesForLLM, shouldStopExecution from @helix-agents/core), DBOS.now(), DBOS.randomUUID(), and await @DBOS.step() calls. Avoid Date.now(), Math.random(), crypto.randomUUID(), fetch(), and process.env reads.
Inside a @DBOS.step(): non-determinism is fine. LLM calls, Postgres/Redis I/O, tool execution, Date.now(), and crypto.randomUUID() are all allowed because the step result is checkpointed and replayed from cache.
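The two rules above follow from how a step cache behaves on replay. The toy model below is not the DBOS implementation: `SketchStepCache` is a hypothetical, synchronous stand-in (real steps are async and checkpointed in Postgres), and `readClock` stands in for Date.now(). It shows that a non-deterministic read wrapped in a step yields the same value on replay, because the cached result is returned instead of re-running the function.

```typescript
// Toy step cache: results are recorded by call position on the first run
// and served from the cache on replay.
class SketchStepCache {
  private readonly results: unknown[] = [];
  private cursor = 0;

  step<T>(fn: () => T): T {
    if (this.cursor < this.results.length) {
      return this.results[this.cursor++] as T; // replay: skip fn, reuse result
    }
    const value = fn();
    this.results.push(value);
    this.cursor++;
    return value;
  }

  beginReplay(): void {
    this.cursor = 0;
  }
}

let clock = 1000;
const readClock = (): number => clock++; // non-deterministic stand-in for Date.now()

// The body itself only sequences steps; all non-determinism is inside them.
function workflowBody(cache: SketchStepCache): string {
  const startedAt = cache.step(() => readClock());
  return cache.step(() => `llm-reply@${startedAt}`);
}

const cache = new SketchStepCache();
const firstRun = workflowBody(cache);
cache.beginReplay(); // simulate a process restart
const replayed = workflowBody(cache);
```

Had `readClock()` been called directly in the body, the replay would have observed a different value and could have diverged from the recorded step sequence.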

Two independent checkpoint layers coexist: the DBOS step cache (replay optimization, in DBOS system tables) and Helix checkpoints (user-facing time-travel / branching / retry, in store-postgres). They do not conflict.

Multi-Turn Conversations

In standard mode, multi-turn works exactly as on the JS / Temporal runtimes: call execute() again with the same sessionId after the previous run completes. Each call starts a new workflow that loads accumulated state.

typescript

await (
  await executor.execute(agent, { message: 'My name is Alice.' }, { sessionId: 's1' })
).result();

// Continue the conversation — same sessionId, new run, accumulated state.
const h2 = await executor.execute(agent, { message: 'What is my name?' }, { sessionId: 's1' });

In persistent mode, subsequent execute() calls route the message into the already-running workflow rather than starting a new one.

Human-in-the-Loop (HITL)

DBOS supports the full HITL surface on its DBOS-native suspension primitive.

Client-Executed Tools

A tool declared with execute: 'client' suspends the run while the client computes the result. The DBOS flow:

The LLM emits a client-tool call; the workflow writes pendingClientToolCalls[toolCallId] to session state and emits a tool_start chunk.
The workflow body calls await DBOS.recv(toolCallId, deadlineSec) and suspends. The process can be restarted here — the recv re-arms on replay.
The client posts to POST /submit-tool-result; the server calls executor.submitToolResult(...), which routes a DBOS.send(workflowId, payload, toolCallId) to the owner workflow.
The recv returns; the workflow clears pending state, appends a tool-result message, and the loop continues.

Submission is exactly-once: concurrent or retried submits on the same toolCallId are serialized by an OCC-stamped submittedAt marker (first writer wins; losers return already_completed), and the DBOS.send idempotency key is scoped by workflow ID. A durable completedClientToolCalls marker makes already_completed survive executor restart and persistent-workflow exit. Deadline timeouts append a synthetic tool_error, emit a tool_end chunk, and fire the timeout hooks — matching the cross-runtime contract.
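The first-writer-wins rule can be sketched with an in-memory ledger. Everything here is a hypothetical illustration: `SketchSubmissionLedger`, the `not_found` outcome, and the single-threaded map are assumptions standing in for the OCC-guarded Postgres write, the DBOS.send call, and the durable completedClientToolCalls marker described above.

```typescript
type SubmitOutcome = 'accepted' | 'already_completed' | 'not_found';

interface PendingCall {
  toolCallId: string;
  submittedAt?: number; // set once, by the winning submitter
}

class SketchSubmissionLedger {
  private readonly pending = new Map<string, PendingCall>();
  private readonly completed = new Set<string>(); // stands in for completedClientToolCalls
  readonly delivered: Array<{ toolCallId: string; result: unknown }> = [];

  // The workflow suspends on a client tool call.
  open(toolCallId: string): void {
    this.pending.set(toolCallId, { toolCallId });
  }

  submit(toolCallId: string, result: unknown, now: number): SubmitOutcome {
    if (this.completed.has(toolCallId)) return 'already_completed';
    const call = this.pending.get(toolCallId);
    if (!call) return 'not_found';
    if (call.submittedAt !== undefined) return 'already_completed'; // lost the race
    call.submittedAt = now; // first writer wins
    this.delivered.push({ toolCallId, result }); // stands in for DBOS.send
    return 'accepted';
  }

  // The workflow consumed the result and cleared pending state.
  complete(toolCallId: string): void {
    this.pending.delete(toolCallId);
    this.completed.add(toolCallId);
  }
}

const ledger = new SketchSubmissionLedger();
ledger.open('call-1');
const first = ledger.submit('call-1', { ok: true }, 1);
const retry = ledger.submit('call-1', { ok: true }, 2);
ledger.complete('call-1');
const late = ledger.submit('call-1', { ok: true }, 3);
```

Only the first submit delivers a payload; the retry and the late submit both report already_completed, before and after the pending entry is cleared.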

See the client-tools deep-dive for sub-agent ownership routing and the full race-condition analysis.

Approval-Gated Tools

A tool with requireApproval: true (or a function-form predicate) suspends on a tool_approval_request chunk and shares the same pendingClientToolCalls + DBOS.recv primitive as client tools. Resume by calling executor.submitToolResult({ kind: 'approval-response', toolCallId, approved, reason? }):

Approve (approved: true) runs the tool's execute() with the original input.
Deny (approved: false) records a synthetic tool_error ('Tool call was not approved by the user') and skips execute().

Checkpointed requireApproval predicate: function-form predicates are evaluated inside a @DBOS.step (ApprovalGateStep.evaluateApprovalGatePredicateStep), so the boolean result is checkpointed in workflow history. This makes the suspend-vs-run decision replay-deterministic even for non-pure predicates (a predicate that reads the clock or a DB row is safe). Fail-closed semantics live inside the step — an exception is treated as requireApproval = true — so the checkpointed value is always a boolean (parity with Temporal's activity-wrapped predicate evaluation; GL #111 Batch C).
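The fail-closed rule is small enough to sketch directly. `evaluateApprovalGate` below is a hypothetical, synchronous illustration (the real evaluation runs inside a @DBOS.step, and the predicate signature is an assumption); it shows only that the outcome is always a boolean and that a throwing predicate resolves to "approval required".

```typescript
type ApprovalPredicate<I> = (input: I) => boolean;

// Hypothetical sketch of fail-closed evaluation: the result is always a boolean.
function evaluateApprovalGate<I>(
  requireApproval: boolean | ApprovalPredicate<I>,
  input: I,
): boolean {
  if (typeof requireApproval === 'boolean') return requireApproval;
  try {
    return requireApproval(input) === true;
  } catch {
    return true; // fail closed: an exception means "ask for approval"
  }
}

interface Transfer {
  amount: number;
}

const overLimit: ApprovalPredicate<Transfer> = (t) => t.amount > 100;
const broken: ApprovalPredicate<Transfer> = () => {
  throw new Error('lookup failed');
};

const small = evaluateApprovalGate(overLimit, { amount: 20 });
const large = evaluateApprovalGate(overLimit, { amount: 500 });
const failed = evaluateApprovalGate(broken, { amount: 20 });
```

Because the returned boolean is what gets checkpointed, a replay never needs to re-run the predicate, so a clock read or database lookup inside it cannot change the suspend-vs-run decision.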

Driving the loop after `submitToolResult`

As on all v7 runtimes, submitToolResult is a durable write only — it does not auto-resume. After submitting, either use the framework's chat plumbing (handleChatStream + useChat / useResumeClientTools) or call executor.resume(agent, sessionId) explicitly. (DBOS's recv-driven path can also auto-continue the in-place workflow; see Hooks for the resulting runId === previousRunId self-loop.)

See the Client-Executed Tools and Approval Gates guides for cross-runtime concepts.

Hooks

Per-call hooks fire on the DBOS runtime with full parity. agent.hooks (the AgentConfig field), plus ExecuteOptions.hooks and ExecuteOptions.hookManager, fire on every execute() / resume() / retry(). The executor computes the merged HookManager (constructor → agent.hooks → options.hooks order) before DBOS.startWorkflow and registers it in a process-local HookManagerRegistry indexed by DBOS.workflowID. The hook and tool steps resolve the manager per-invocation, falling back to the constructor-bound static on cross-worker recovery (a documented split-brain risk mirroring Temporal/CF replaceAgent semantics).
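One way to picture the merge order is a chain that appends hooks source by source. This is a sketch under assumptions: `mergeHooks` and `HookSet` are hypothetical, only `beforeTool` is modelled, and it assumes that merging means every source's hook fires, in constructor, agent.hooks, options.hooks order, rather than later sources replacing earlier ones.

```typescript
type ToolHook = (toolName: string) => void;

interface HookSet {
  beforeTool?: ToolHook;
}

// Hypothetical merge: sources are chained in the order they are passed in.
function mergeHooks(...sources: Array<HookSet | undefined>): Required<HookSet> {
  const chain = sources
    .map((source) => source?.beforeTool)
    .filter((hook): hook is ToolHook => hook !== undefined);
  return {
    beforeTool: (toolName) => chain.forEach((hook) => hook(toolName)),
  };
}

const fired: string[] = [];
const constructorHooks: HookSet = { beforeTool: () => fired.push('constructor') };
const agentHooks: HookSet = { beforeTool: () => fired.push('agent') };
const optionHooks: HookSet = { beforeTool: () => fired.push('options') };

const merged = mergeHooks(constructorHooks, agentHooks, undefined, optionHooks);
merged.beforeTool('search');
```

The merged object is what a process-local registry would hold per workflow ID; a worker that recovers the workflow without that entry would only have the constructor-level hooks.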

The canonical firing order is identical on all five runtimes (JS, Temporal, Cloudflare Workflows, Cloudflare DO, and DBOS). The first four use the v7 stateless-suspension model; DBOS uses its native DBOS.recv / DBOS.send primitives but converges on the same observable hook order:

beforeTool → execute → onStateChange → onMessage → afterTool

beforeTool + onStateChange fire inside the @DBOS.step tool body (so they are checkpointed by the step boundary); onMessage + afterTool are deferred to the workflow body so they land after the phase-1 Promise.all settles and after each tool-result message is appended.

onMessage, onStateChange, onAgentSuspended, and onAgentResumed all fire — including for both client-tool and approval-gate suspend/resume cycles.

`onAgentResumed.previousRunId`

onAgentResumed fires with { runId, previousRunId, sessionId, resumedFromCheckpointId }. DBOS has two resume branches:

Branch 1 (recv-driven in-place continuation): runId === previousRunId. The client-tool submit path detects a still-PENDING workflow blocked on DBOS.recv, returns a handle wrapping the same workflow, and the body wakes in-place. No new workflow, no new run.
Branch 2 (fresh resume workflow): runId !== previousRunId. A new run is allocated and a fresh workflow starts (same as JS / Temporal / CF).

In both branches previousRunId is always populated with the suspended run's runId (Branch 1 sets previousRunId: runId, which equals the suspended runId because resume continues in-place). Hook consumers should use previousRunId as the linkage signal for span stitching — do not rely on runId !== previousRunId to detect a resume, because that test fails on DBOS Branch 1.
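A hook consumer that follows this advice might look like the sketch below. The event fields come from the text above, but their types, the `spansByRun` map, and both helper functions are hypothetical illustrations, not part of the hooks API.

```typescript
interface AgentResumedEvent {
  runId: string;
  previousRunId: string;
  sessionId: string;
  resumedFromCheckpointId?: string; // assumed optional for this sketch
}

// Hypothetical tracing bookkeeping: runId -> spanId of the suspended run.
const spansByRun = new Map<string, string>();

// Link through previousRunId; this works whether or not a new run was allocated.
function parentSpanFor(event: AgentResumedEvent): string | undefined {
  return spansByRun.get(event.previousRunId);
}

// True only for the recv-driven in-place continuation (Branch 1).
function isInPlaceContinuation(event: AgentResumedEvent): boolean {
  return event.runId === event.previousRunId;
}

spansByRun.set('run-1', 'span-a');

const inPlace: AgentResumedEvent = { runId: 'run-1', previousRunId: 'run-1', sessionId: 's1' };
const freshRun: AgentResumedEvent = { runId: 'run-2', previousRunId: 'run-1', sessionId: 's1' };
```

Both events resolve to the same parent span, whereas a `runId !== previousRunId` check would have treated the in-place continuation as no resume at all.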

See the Hooks (Observability) guide for the full hook reference.

State

stateSchema.default() values are seeded into customState at session start so tools that append to a key (e.g. customState.items via an Immer /items/- push) see the default-initialized container rather than undefined. Seeding is best-effort and validated; dropped keys log a warning.

Parallel-tool in-step state visibility

Each DBOS tool runs as an independent durable step that loads state.customState fresh from Postgres at the top of its execution. A parallel sibling tool that calls ctx.getState() to read a key written by another sibling within the same LLM step sees the pre-sibling value, not the cumulative state. This is by design — DBOS steps must be idempotent for replay determinism.

Tools that need to read sibling writes within a single step should either be sequenced through a single parent tool that orchestrates the work, or run on the JS / Cloudflare DO runtimes (which apply in-memory state merging between parallel tool calls before persisting).
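The visibility rule can be simulated in a few lines. This is a toy model only: `runParallelTools` is hypothetical, tools are synchronous functions returning a patch, and the way patches are merged afterwards is an assumption. The part taken from the text is that every sibling reads a snapshot loaded before any sibling in the same LLM step has written.

```typescript
type CustomState = Record<string, number>;
type SketchTool = (snapshot: Readonly<CustomState>) => CustomState;

// Each sketch "tool step" gets a fresh copy of the pre-step state and returns
// a patch; patches are applied only after all siblings have run.
function runParallelTools(
  stored: CustomState,
  tools: SketchTool[],
): { seen: CustomState[]; final: CustomState } {
  const seen: CustomState[] = [];
  const patches = tools.map((tool) => {
    const snapshot = { ...stored }; // fresh load of the pre-step state
    seen.push(snapshot);
    return tool(snapshot);
  });
  const final = patches.reduce<CustomState>((acc, patch) => ({ ...acc, ...patch }), { ...stored });
  return { seen, final };
}

// One sibling writes counter; the other records what it observed.
const writer: SketchTool = () => ({ counter: 1 });
const reader: SketchTool = (snapshot) => ({ observed: snapshot.counter });

const outcome = runParallelTools({ counter: 0 }, [writer, reader]);
```

The reader records the pre-sibling value 0 even though the final state holds counter 1, which is the behaviour to design around when two tools in one step share a key.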

Persistent Sub-Agent Handling

DBOS supports persistent sub-agents: a parent that declares persistentAgents gets the same companion__* tools as every other runtime, with identical LLM-facing semantics (terminate-truth, exactly-once completion delivery, deterministic child ids).

Re-spawning a completed persistent child continues it on its preserved session (memory retained) rather than recreating it — see Re-consulting a persistent companion (the critic loop).

Dispatch. companion__listChildren / getChildStatus / terminateChild route through the shared core dispatcher. spawnAgent, sendMessage, and waitForResult are handled by DBOS-local logic so they can use durable primitives: a child runs as its own durable workflow (startPersistentWorkflow), sendMessage appends to the child's durable inbox via DBOS.send, and waitForResult polls with durable DBOS.sleep (the wait survives a crash).
Completion delivery. A completion notifier (a @DBOS.step) injects a finished non-blocking child's outcome into the parent's next turn, deduplicated by the durable completionDelivered flag on the SubSessionRef.
Blocking spawn (caveat). Non-blocking spawn is fully supported. The blocking spawn path currently blocks until the workflow is idle and can mis-report a failed child as completed (tracked as FU-DBOS-BLOCKING-SPAWN-SEMANTICS). Prefer non-blocking spawn + companion__waitForResult on DBOS until that is resolved.
No workspaces on persistent children (C8). A persistent child that declares a workspace fails fast at spawn, because workspaces are unsupported on DBOS (see Known Limitations below). Use children that declare no workspace and do not set inheritWorkspace, or run workspace-bearing children on the JS / Cloudflare DO runtimes.
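The completion-delivery deduplication described in this section can be pictured as a flag check on each child reference. This is a sketch under assumptions: `SubSessionRefSketch` and `collectCompletions` are hypothetical, the ref fields other than completionDelivered are invented for illustration, and the real notifier runs as a @DBOS.step against durable state rather than mutating objects in memory.

```typescript
interface SubSessionRefSketch {
  childId: string;
  status: 'running' | 'completed' | 'failed';
  outcome?: string;
  completionDelivered: boolean; // durable flag in the real runtime
}

// Returns notes to inject into the parent's next turn and marks them delivered,
// so a second pass over the same refs injects nothing.
function collectCompletions(refs: SubSessionRefSketch[]): string[] {
  const notes: string[] = [];
  for (const ref of refs) {
    if (ref.status === 'running' || ref.completionDelivered) continue;
    notes.push(`${ref.childId}: ${ref.outcome ?? ref.status}`);
    ref.completionDelivered = true;
  }
  return notes;
}

const refs: SubSessionRefSketch[] = [
  { childId: 'critic', status: 'completed', outcome: 'looks good', completionDelivered: false },
  { childId: 'researcher', status: 'running', completionDelivered: false },
];

const firstTurn = collectCompletions(refs);
const secondTurn = collectCompletions(refs);
```

The finished child is reported once on the first pass and skipped on the second, while the still-running child is left for a later turn.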

Known Limitations

Workspaces are unsupported. Unlike Temporal and CF Workflows (which fail-fast at run-start), DBOS silently passes workspaces: undefined. describeCapabilities() returns an empty workspaceProviderKinds list, so AgentServer surfaces a RUNTIME_NO_WORKSPACE_SUPPORT 404 on the HTTP workspace route. Agents needing workspaces should run on the JS or Cloudflare DO runtimes.
The history: input prefix is not honored. executor.execute(agent, { message, history: [...] }, { sessionId }) silently drops the history: prefix on DBOS — the message log starts at the new user message. Tracked as the open follow-up FU-DBOS-HISTORY-PREFIX-NEVER-HONORED (low priority; parity gap with JS / CF / Temporal).
Doubly-nested HITL is unsupported. A top-level agent's client-executed or approval-gated tool suspends and resumes correctly. The doubly-nested case — an ephemeral child sub-agent itself calling an execute: 'client' tool and expected to suspend the parent — is not supported; the parent returns 'completed' rather than suspending. Hoist the client tool to the top-level agent, or use a persistent sub-agent instead. Tracked in GitLab #73 (documented-as-unsupported).
Submit-side outputSchema validation is skipped. The DBOS agent registry stores SerializedAgent records, which strip Zod schemas, so submitted client-tool results are accepted without server-side schema validation. Client-side validators should be the primary source of correctness.
In-step parallel-sibling state visibility — see State above.
Memory auto-injection/extraction is not supported. The memory: config field is accepted at the type level but is never invoked on DBOS. Agents requiring memory recall or storage should run on the JS or Cloudflare runtimes.

Prompt caching is supported. LLMConfig.cache strategies (anthropicCache, openaiCache, xaiCache, or a custom CacheStrategy) are applied on DBOS at parity with the JS, Temporal, and Cloudflare runtimes. Strategies are not JSON-serializable, so the executor registers the live strategy in a process-local registry (alongside the live model) and the workflow body applies it before each LLM call; a strategy that throws is logged and the step continues un-annotated.

Best Practices

Use store-postgres for state — it shares the same Postgres instance DBOS uses for its system database, keeping infrastructure to one service.
Use RedisStreamManager for streaming — Postgres write latency is too high for per-token chunk fan-out.
Register before launch — call registerDBOSAgentWorkflows() and construct the executor (or call registerAgent) before DBOS.launch() so recovered workflows find their live LanguageModel.
Keep workflow bodies deterministic — push all I/O into @DBOS.step() calls. See Determinism.
Choose persistent mode deliberately — only when subsequent messages need to arrive into a running workflow context (in-memory state across turns); otherwise standard mode is simpler.
