DBOS Runtime
The DBOS runtime (@helix-agents/runtime-dbos) executes agents as durable DBOS Transact workflows backed by Postgres. Workflow state — which steps have completed, their cached results, in-flight recv messages — lives in ordinary Postgres tables managed by DBOS, so you get crash recovery, automatic step retries, and durable interrupt/abort without standing up a separate workflow server.
When to Use
Good fit:
- You already run Postgres and don't want to operate separate Temporal infrastructure
- Production workloads requiring durable execution and crash recovery
- Long-running agents that must survive process restarts
- Chat assistants or stateful sessions that benefit from a hibernating, always-addressable workflow (persistent mode)
- Teams that want to leverage existing Postgres tooling for observability and querying
Not ideal for:
- Quick development iteration where the JS runtime's zero-infra story is simpler
- Agents that need workspaces — DBOS does not support them (see Known Limitations)
- Deployments with no Postgres and no appetite to add one
Prerequisites
You need a Postgres database. DBOS uses it as its system database (workflow status, step cache, in-flight messages) and @helix-agents/store-postgres shares the same instance for session state.
# Local development with Docker
docker run -d --name dbos-postgres \
  -e POSTGRES_PASSWORD=dbos \
  -p 5432:5432 \
  postgres:16
Token-by-token streaming requires Redis (RedisStreamManager): Postgres write latency is too high for per-token chunk fan-out.
Installation
npm install @helix-agents/runtime-dbos @helix-agents/store-postgres @helix-agents/store-redis @helix-agents/llm-vercel @dbos-inc/dbos-sdk
Architecture
graph TB
subgraph App ["Your Application Process"]
Executor["<b>DBOSAgentExecutor</b><br/>execute / resume / retry / submitToolResult<br/>Binds step deps · registers workflows"]
Workflow["<b>DBOS Workflow Body</b> (deterministic)<br/>planStepProcessing() · buildMessagesForLLM()<br/>shouldStopExecution()"]
Steps["<b>@DBOS.step()s</b><br/>loadState · callLLM · executeTool<br/>checkpoint · client-tool · approval-gate"]
Executor --> Workflow --> Steps
end
Steps --> PG["Postgres<br/>(session state + DBOS system tables)"]
Steps --> LLM["LLM API"]
Steps --> Redis["RedisStreamManager<br/>(token streaming)"]
The workflow body must be deterministic so DBOS can replay it on crash recovery. All I/O (LLM calls, Postgres reads/writes, tool execution, streaming) is wrapped in @DBOS.step() calls whose results are checkpointed in DBOS system tables. See Determinism Rules below and the architecture deep-dive.
Setup Guide
The DBOS runtime registers its workflow bodies via module-level decorator evaluation. The critical ordering rule is: register workflows and bind your LLM adapter before DBOS.launch(), because DBOS.launch() asynchronously recovers any pending workflows, and each recovered workflow body reads the live LanguageModel registry at entry.
import { DBOS } from '@dbos-inc/dbos-sdk';
import { defineAgent } from '@helix-agents/core';
import { DBOSAgentExecutor, registerDBOSAgentWorkflows } from '@helix-agents/runtime-dbos';
import { PostgresStateStore } from '@helix-agents/store-postgres';
import { RedisStreamManager } from '@helix-agents/store-redis';
import { VercelAIAdapter } from '@helix-agents/llm-vercel';
import { openai } from '@ai-sdk/openai';
const agent = defineAgent({
name: 'assistant',
systemPrompt: 'You are a helpful assistant.',
tools: [],
llmConfig: { model: openai('gpt-4o') },
maxSteps: 10,
});
// 1. Register workflow bodies (evaluates @DBOS.workflow() decorators). Idempotent.
registerDBOSAgentWorkflows();
// 2. Construct the executor before launching. This binds every @DBOS.step's static deps
// (state store, stream manager, LLM adapter, hooks, etc.) and calls
// registerDBOSAgentWorkflows() again internally (idempotent).
const executor = new DBOSAgentExecutor({
  stateStore: new PostgresStateStore({ connectionString: process.env.DATABASE_URL! }),
  streamManager: new RedisStreamManager({ url: process.env.REDIS_URL! }),
  llmAdapter: new VercelAIAdapter({ model: openai('gpt-4o') }),
});
// 3. Launch DBOS (begins async recovery of pending workflows, which rely on the bindings made in step 2).
await DBOS.launch();
// 4. Execute. sessionId is REQUIRED on the DBOS runtime.
const handle = await executor.execute(agent, { message: 'Hello' }, { sessionId: 'session-1' });
const result = await handle.result();
Constructor config differs from { registry }
Unlike Temporal's AgentRegistry-based wiring, DBOSAgentExecutor takes { stateStore, streamManager, llmAdapter } directly. There is no registry argument — agents are registered lazily (per execute() / resume() / retry() call) via the internal registerAgent helper. The { registry } snippet in some older overview tables is not the actual constructor shape.
Registering agents for recovery
DBOSAgentExecutor.execute() / resume() / retry() register the agent internally before starting a workflow. You only need to call registerAgent explicitly when a process may host a recovered workflow on DBOS.launch() without first going through one of those calls — for example, a process whose only job is delivering executor.submitToolResult(...):
import { registerAgent } from '@helix-agents/runtime-dbos';
// Before DBOS.launch(), in a submit-only / recovery-only process:
registerAgent(agent);
registerAgent is idempotent. If the same agent.name is later re-registered with a different live LanguageModel, it logs a warning and overwrites (last writer wins) to surface config drift.
bindCallLLMStep
bindCallLLMStep(llmAdapter, streamManager) wires the LLM adapter into the DBOS CallLLMStep static slot. Application code should not call this directly — the DBOSAgentExecutor constructor calls it for you from your config. It is exported only because e2e test infrastructure needs to swap the adapter between runs.
Execution Model
The DBOS runtime is architecturally closest to the Temporal runtime — both derive workflow IDs from sessionId, wrap I/O in durable step functions, and checkpoint intermediate results. The key architectural difference is the suspension primitive: where the v7 stateless-suspension runtimes (JS, Temporal, Cloudflare) use a durable-state suspensionContext and exit the workflow at every HITL boundary, DBOS uses its native DBOS.recv / DBOS.send primitives over Postgres-backed workflow replay. The workflow suspends in-place on await DBOS.recv(toolCallId) and wakes when a submission is delivered via DBOS.send. This is functionally equivalent for callers but architecturally separate from the unified suspensionContext model.
Standard vs Persistent Mode
DBOS supports two operational modes. The mode is fixed at the first execute() for a session and cannot change (mixing throws DBOSModeMismatchError).
| | Standard mode (default) | Persistent mode |
|---|---|---|
| Workflow lifetime | One short-lived workflow per execute() / resume() / retry(), runs to completion then exits | One long-lived DBOS.recv()-loop workflow per session, forever |
| Workflow ID | agent__{name}__{sessionId}__run__{runId} (resume AND retry both use the __resume__N form) | agent__{name}__{sessionId} (no run suffix) |
| Multi-turn | Each turn starts a fresh workflow that loads accumulated state | Subsequent turns route into the existing workflow via DBOS.send(workflowId, msg, 'inbox') |
| Hibernation | N/A (workflow exits between turns) | Blocks on DBOS.recv — zero compute, instant wake via Postgres LISTEN/NOTIFY |
| Best for | Request-response, batch, long analyses | Chat assistants, agents holding resources/locks across turns |
Opt into persistent mode per-call ({ mode: 'persistent' }), per-agent (defaultMode: 'persistent'), or rely on the framework default ('standard'). Other runtimes ignore mode / defaultMode — they are no-ops there. See the persistent-mode deep-dive for the full lifecycle, idle-TTL semantics, and interrupt-vs-abort behavior.
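As a rough illustration of the mode rules and the workflow ID shapes in the table above, the sketch below resolves a mode and formats an ID. The helper names are hypothetical (they are not exports of @helix-agents/runtime-dbos), the precedence of per-call over per-agent over framework default is an assumption, and the __resume__N form mentioned in the table is deliberately not modeled.

```typescript
// Sketch only: hypothetical helpers, not part of @helix-agents/runtime-dbos.
type ExecutionMode = 'standard' | 'persistent';

// Assumed precedence: per-call option, then agent defaultMode, then 'standard'.
function resolveMode(callMode?: ExecutionMode, agentDefaultMode?: ExecutionMode): ExecutionMode {
  return callMode ?? agentDefaultMode ?? 'standard';
}

// The mode is fixed by the first execute() for a session.
// The real runtime throws DBOSModeMismatchError; a plain Error stands in for it here.
function assertModeMatches(sessionMode: ExecutionMode, requested: ExecutionMode): void {
  if (sessionMode !== requested) {
    throw new Error(`mode mismatch: session is ${sessionMode}, call requested ${requested}`);
  }
}

// Workflow ID shapes as listed in the table: persistent IDs carry no run suffix.
function workflowIdFor(
  mode: ExecutionMode,
  agentName: string,
  sessionId: string,
  runId: string,
): string {
  const base = `agent__${agentName}__${sessionId}`;
  return mode === 'persistent' ? base : `${base}__run__${runId}`;
}
```

Because the persistent ID depends only on the agent name and sessionId, any process can address the hibernating workflow without knowing a run ID, which is what makes it always-addressable.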
Crash Recovery & Determinism
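On process restart, DBOS re-executes the workflow body from the beginning and returns the checkpointed result for every step that already completed, so execution effectively resumes at the first unfinished step. For replay to be correct, the workflow body must produce the same sequence of step calls in the same order every time: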
- Inside the workflow body: use only pure functions (planStepProcessing, buildMessagesForLLM, shouldStopExecution from @helix-agents/core), DBOS.now(), DBOS.randomUUID(), and await @DBOS.step() calls. Avoid Date.now(), Math.random(), crypto.randomUUID(), fetch(), and process.env reads.
- Inside a @DBOS.step(): non-determinism is fine. LLM calls, Postgres/Redis I/O, tool execution, Date.now(), and crypto.randomUUID() are all allowed because the step result is checkpointed and replayed from cache.
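To make the two rules above concrete, here is a toy model of step checkpointing. It is not the DBOS API: ToyStepLog, ToyRun, and toyWorkflowBody are hypothetical names, the log lives in memory rather than in Postgres, and steps are synchronous for brevity (real steps are awaited).

```typescript
// Toy model of step checkpointing. NOT the DBOS API; all names are hypothetical.
class ToyStepLog {
  private readonly results: unknown[] = [];
  get size(): number {
    return this.results.length;
  }
  has(index: number): boolean {
    return index < this.results.length;
  }
  get(index: number): unknown {
    return this.results[index];
  }
  append(value: unknown): void {
    this.results.push(value);
  }
}

class ToyRun {
  private cursor = 0;
  private readonly log: ToyStepLog;

  constructor(log: ToyStepLog) {
    this.log = log;
  }

  // First execution: run fn and record its result.
  // Replay: skip fn and return the result recorded at this position.
  step<T>(fn: () => T): T {
    const index = this.cursor++;
    if (this.log.has(index)) return this.log.get(index) as T;
    const result = fn();
    this.log.append(result);
    return result;
  }
}

// Deterministic body: the clock and RNG are only read inside steps,
// so the sequence of step calls is identical on every execution.
function toyWorkflowBody(run: ToyRun): string {
  const startedAt = run.step(() => Date.now());
  const nonce = run.step(() => Math.random().toString(36).slice(2));
  return `${startedAt}:${nonce}`;
}

const stepLog = new ToyStepLog();
const firstResult = toyWorkflowBody(new ToyRun(stepLog));
// Simulated crash recovery: a fresh run object over the same persisted log.
const replayedResult = toyWorkflowBody(new ToyRun(stepLog));
```

Both executions return the same string because the second one reads the recorded values instead of calling Date.now() and Math.random() again. If the body read the clock outside a step, the two executions could diverge, which is the failure mode the rules are meant to prevent.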
Two independent checkpoint layers coexist: the DBOS step cache (replay optimization, in DBOS system tables) and Helix checkpoints (user-facing time-travel / branching / retry, in store-postgres). They do not conflict.
Multi-Turn Conversations
In standard mode, multi-turn works exactly as on the JS / Temporal runtimes: call execute() again with the same sessionId after the previous run completes. Each call starts a new workflow that loads accumulated state.
await (
await executor.execute(agent, { message: 'My name is Alice.' }, { sessionId: 's1' })
).result();
// Continue the conversation — same sessionId, new run, accumulated state.
const h2 = await executor.execute(agent, { message: 'What is my name?' }, { sessionId: 's1' });
In persistent mode, subsequent execute() calls route the message into the already-running workflow rather than starting a new one.
Human-in-the-Loop (HITL)
DBOS supports the full HITL surface on its DBOS-native suspension primitive.
Client-Executed Tools
A tool declared with execute: 'client' suspends the run while the client computes the result. The DBOS flow:
- The LLM emits a client-tool call; the workflow writes pendingClientToolCalls[toolCallId] to session state and emits a tool_start chunk.
- The workflow body calls await DBOS.recv(toolCallId, deadlineSec) and suspends. The process can be restarted here; the recv re-arms on replay.
- The client posts to POST /submit-tool-result; the server calls executor.submitToolResult(...), which routes a DBOS.send(workflowId, payload, toolCallId) to the owner workflow.
- The recv returns; the workflow clears pending state, appends a tool-result message, and the loop continues.
Submission is exactly-once: concurrent or retried submits on the same toolCallId are serialized by an OCC-stamped submittedAt marker (first writer wins; losers return already_completed), and the DBOS.send idempotency key is scoped by workflow ID. A durable completedClientToolCalls marker makes already_completed survive executor restart and persistent-workflow exit. Deadline timeouts append a synthetic tool_error, emit a tool_end chunk, and fire the timeout hooks — matching the cross-runtime contract.
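The first-writer-wins behavior can be pictured with a small in-memory compare-and-set model. This is a sketch under assumptions, not the runtime's implementation: ToySessionState, trySubmit, the version field, and the 'accepted' outcome name are all hypothetical, and the real markers live in Postgres session state.

```typescript
// Toy model of first-writer-wins submission. All names are hypothetical.
type SubmitOutcome = 'accepted' | 'already_completed';

interface PendingToolCall {
  submittedAt: number | null; // null until a submit wins
  version: number; // optimistic-concurrency stamp
}

class ToySessionState {
  private readonly pending = new Map<string, PendingToolCall>();

  addPending(toolCallId: string): void {
    this.pending.set(toolCallId, { submittedAt: null, version: 0 });
  }

  // Returns a copy, like a row read by one submitter.
  read(toolCallId: string): PendingToolCall {
    const row = this.pending.get(toolCallId);
    if (!row) throw new Error(`unknown toolCallId: ${toolCallId}`);
    return { ...row };
  }

  // Writes submittedAt only if nobody has written since expectedVersion was read.
  compareAndSetSubmitted(toolCallId: string, expectedVersion: number, at: number): boolean {
    const row = this.pending.get(toolCallId);
    if (!row || row.version !== expectedVersion) return false;
    row.submittedAt = at;
    row.version += 1;
    return true;
  }
}

function trySubmit(
  state: ToySessionState,
  toolCallId: string,
  snapshot: PendingToolCall,
  at: number,
): SubmitOutcome {
  if (snapshot.submittedAt !== null) return 'already_completed';
  const won = state.compareAndSetSubmitted(toolCallId, snapshot.version, at);
  // Only the winner would go on to deliver the payload to the waiting workflow.
  return won ? 'accepted' : 'already_completed';
}

// Two submitters read the same snapshot before either one writes.
const toyState = new ToySessionState();
toyState.addPending('call-1');
const snapshotA = toyState.read('call-1');
const snapshotB = toyState.read('call-1');
const outcomeA = trySubmit(toyState, 'call-1', snapshotA, 1000);
const outcomeB = trySubmit(toyState, 'call-1', snapshotB, 1001);
// A later retry sees the marker immediately.
const outcomeRetry = trySubmit(toyState, 'call-1', toyState.read('call-1'), 2000);
```

Submitter B loses because the version it read is stale by the time it writes, and the retry loses because the marker is already set. In both cases the caller gets already_completed rather than delivering a second payload.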
See the client-tools deep-dive for sub-agent ownership routing and the full race-condition analysis.
Approval-Gated Tools
A tool with requireApproval: true (or a function-form predicate) suspends on a tool_approval_request chunk and shares the same pendingClientToolCalls + DBOS.recv primitive as client tools. Resume by calling executor.submitToolResult({ kind: 'approval-response', toolCallId, approved, reason? }):
- Approve (approved: true) runs the tool's execute() with the original input.
- Deny (approved: false) records a synthetic tool_error ('Tool call was not approved by the user') and skips execute().
Checkpointed requireApproval predicate: function-form predicates are evaluated inside a @DBOS.step (ApprovalGateStep.evaluateApprovalGatePredicateStep), so the boolean result is checkpointed in workflow history. This makes the suspend-vs-run decision replay-deterministic even for non-pure predicates (a predicate that reads the clock or a DB row is safe). Fail-closed semantics live inside the step — an exception is treated as requireApproval = true — so the checkpointed value is always a boolean (parity with Temporal's activity-wrapped predicate evaluation; GL #111 Batch C).
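The fail-closed rule is easy to state as a function. The sketch below is a standalone illustration with a hypothetical name (evaluateApprovalGate) and a synchronous predicate; it does not show the checkpointing that the real step adds, and real predicates may be async.

```typescript
// Sketch of fail-closed predicate evaluation. Hypothetical helper, not the runtime's step.
type ApprovalSetting<I> = boolean | ((input: I) => boolean);

function evaluateApprovalGate<I>(
  requireApproval: ApprovalSetting<I> | undefined,
  input: I,
): boolean {
  if (requireApproval === undefined) return false; // no gate configured
  if (typeof requireApproval === 'boolean') return requireApproval;
  try {
    return Boolean(requireApproval(input));
  } catch {
    // Fail closed: if the predicate throws, require approval.
    return true;
  }
}
```

Because the function always returns a boolean, there is a single value to checkpoint, and a predicate that fails (for example, a lookup that errors) results in a suspend for approval rather than an unapproved tool run.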
Driving the loop after submitToolResult
As on all v7 runtimes, submitToolResult is a durable write only — it does not auto-resume. After submitting, either use the framework's chat plumbing (handleChatStream + useChat / useResumeClientTools) or call executor.resume(agent, sessionId) explicitly. (DBOS's recv-driven path can also auto-continue the in-place workflow; see Hooks for the resulting runId === previousRunId self-loop.)
See the Client-Executed Tools and Approval Gates guides for cross-runtime concepts.
Hooks
Per-call hooks fire on the DBOS runtime with full parity. agent.hooks (the AgentConfig field), plus ExecuteOptions.hooks and ExecuteOptions.hookManager, fire on every execute() / resume() / retry(). The executor computes the merged HookManager (constructor → agent.hooks → options.hooks order) before DBOS.startWorkflow and registers it in a process-local HookManagerRegistry indexed by DBOS.workflowID. The hook and tool steps resolve the manager per-invocation, falling back to the constructor-bound static on cross-worker recovery (a documented split-brain risk mirroring Temporal/CF replaceAgent semantics).
Canonical firing order matches all five runtimes (JS, Temporal, Cloudflare Workflows, Cloudflare DO, and DBOS — the first four use the v7 stateless-suspension model; DBOS uses its own DBOS-native DBOS.recv/DBOS.send primitives but converges on the same observable hook order):
beforeTool → execute → onStateChange → onMessage → afterTool
beforeTool and onStateChange fire inside the @DBOS.step tool body (so they are checkpointed by the step boundary); onMessage and afterTool are deferred to the workflow body so they land after the phase-1 Promise.all settles and after each tool-result message is appended.
onMessage, onStateChange, onAgentSuspended, and onAgentResumed all fire — including for both client-tool and approval-gate suspend/resume cycles.
onAgentResumed.previousRunId
onAgentResumed fires with { runId, previousRunId, sessionId, resumedFromCheckpointId }. DBOS has two resume branches:
- Branch 2 (fresh resume workflow): runId !== previousRunId. A new run is allocated and a fresh workflow starts (same as JS / Temporal / CF).
- Branch 1 (recv-driven in-place continuation): runId === previousRunId. The client-tool submit path detects a still-PENDING workflow blocked on DBOS.recv, returns a handle wrapping the same workflow, and the body wakes in-place. No new workflow, no new run.
In both branches previousRunId is always populated with the suspended run's runId (Branch 1 sets previousRunId: runId, which equals the suspended runId because resume continues in-place). Hook consumers should use previousRunId as the linkage signal for span stitching — do not rely on runId !== previousRunId to detect a resume, because that test fails on DBOS Branch 1.
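A tracing hook that follows this advice might look like the sketch below. The payload fields come from the description above, but their types are assumptions, and linkResumedRun and ResumeLink are hypothetical names rather than part of the hooks API.

```typescript
// Payload shape as described above; the field types are assumptions.
interface AgentResumedEvent {
  runId: string;
  previousRunId: string;
  sessionId: string;
  resumedFromCheckpointId?: string;
}

interface ResumeLink {
  parentRunId: string; // run whose span the resumed span should attach to
  inPlace: boolean; // true for the recv-driven continuation (Branch 1)
}

// Hypothetical helper for a tracing hook: always link via previousRunId.
function linkResumedRun(event: AgentResumedEvent): ResumeLink {
  return {
    parentRunId: event.previousRunId,
    inPlace: event.runId === event.previousRunId,
  };
}
```

The equality check is used only to label the in-place case. The link itself comes from previousRunId in both branches, so the same consumer works across runtimes.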
See the Hooks (Observability) guide for the full hook reference.
State
stateSchema.default() values are seeded into customState at session start so tools that append to a key (e.g. customState.items via an Immer /items/- push) see the default-initialized container rather than undefined. Seeding is best-effort and validated; dropped keys log a warning.
Parallel-tool in-step state visibility
Each DBOS tool runs as an independent durable step that loads state.customState fresh from Postgres at the top of its execution. A parallel sibling tool that calls ctx.getState() to read a key written by another sibling within the same LLM step sees the pre-sibling value, not the cumulative state. This is by design — DBOS steps must be idempotent for replay determinism.
Tools that need to read sibling writes within a single step should either be sequenced through a single parent tool that orchestrates the work, or run on the JS / Cloudflare DO runtimes (which apply in-memory state merging between parallel tool calls before persisting).
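The visibility rule and the single-parent-tool workaround can be illustrated with plain values. This is a toy: loadSnapshot stands in for the fresh per-step load from Postgres, and the tool functions are hypothetical, not Helix tool definitions.

```typescript
// Toy illustration of per-step snapshots. All names are hypothetical.
type ToyCustomState = { total: number };

// Stand-in for loading customState fresh at the top of a tool step.
function loadSnapshot(persistedState: ToyCustomState): ToyCustomState {
  return { ...persistedState };
}

// A tool that writes: returns the state it would persist.
function addTool(snapshot: ToyCustomState, amount: number): ToyCustomState {
  return { total: snapshot.total + amount };
}

// A tool that reads.
function readTool(snapshot: ToyCustomState): number {
  return snapshot.total;
}

const persistedBeforeStep: ToyCustomState = { total: 10 };

// Parallel siblings: each loads the pre-step state, so the reader
// does not observe the sibling's write.
addTool(loadSnapshot(persistedBeforeStep), 5);
const parallelRead = readTool(loadSnapshot(persistedBeforeStep)); // 10

// Sequenced inside one parent tool: the read sees the write.
function addThenReadTool(snapshot: ToyCustomState, amount: number): number {
  const afterAdd = addTool(snapshot, amount);
  return readTool(afterAdd);
}
const sequencedRead = addThenReadTool(loadSnapshot(persistedBeforeStep), 5); // 15
```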
Known Limitations
- Workspaces are unsupported. Unlike Temporal and CF Workflows (which fail-fast at run-start), DBOS silently passes workspaces: undefined. describeCapabilities() returns an empty workspaceProviderKinds list, so AgentServer surfaces a RUNTIME_NO_WORKSPACE_SUPPORT 404 on the HTTP workspace route. Agents needing workspaces should run on the JS or Cloudflare DO runtimes.
- The history: input prefix is not honored. executor.execute(agent, { message, history: [...] }, { sessionId }) silently drops the history: prefix on DBOS; the message log starts at the new user message. Tracked as the open follow-up FU-DBOS-HISTORY-PREFIX-NEVER-HONORED (low priority; parity gap with JS / CF / Temporal).
- Doubly-nested HITL is unsupported. A top-level agent's client-executed or approval-gated tool suspends and resumes correctly. The doubly-nested case (an ephemeral child sub-agent itself calling an execute: 'client' tool and expected to suspend the parent) is not supported; the parent returns 'completed' rather than suspending. Hoist the client tool to the top-level agent, or use a persistent sub-agent instead. Tracked in GitLab #73 (documented-as-unsupported).
- Submit-side outputSchema validation is skipped. The DBOS agent registry stores SerializedAgent records, which strip Zod schemas, so submitted client-tool results are accepted without server-side schema validation. Client-side validators should be the primary source of correctness.
- In-step parallel-sibling state visibility: see State above.
- LLMConfig.cache strategies are not applied. Cache strategies are a no-op on DBOS because they are not serializable across the durable step boundary. Setting cache on LLMConfig has no effect; prompts are sent to the LLM without cache annotations.
- Memory auto-injection/extraction is not supported. The memory: config field is accepted at the type level but is never invoked on DBOS. Agents requiring memory recall or storage should run on the JS or Cloudflare runtimes.
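Since submitted client-tool results are not schema-checked on the server, one option is to validate them on the client before posting. The sketch below uses a hand-written type guard for an invented tool result; LocationResult, isLocationResult, prepareSubmission, and the { toolCallId, result } payload shape are all hypothetical and would need to match your own tool's contract.

```typescript
// Hypothetical client tool result; the shape is illustrative only.
interface LocationResult {
  latitude: number;
  longitude: number;
}

// Hand-written guard standing in for whatever validator the client already uses.
function isLocationResult(value: unknown): value is LocationResult {
  if (typeof value !== 'object' || value === null) return false;
  const { latitude, longitude } = value as { latitude?: unknown; longitude?: unknown };
  return (
    typeof latitude === 'number' &&
    Number.isFinite(latitude) &&
    Math.abs(latitude) <= 90 &&
    typeof longitude === 'number' &&
    Number.isFinite(longitude) &&
    Math.abs(longitude) <= 180
  );
}

// Validate before the result is posted to the submit endpoint.
// The returned payload shape is an assumption, not the documented wire format.
function prepareSubmission(
  toolCallId: string,
  result: unknown,
): { toolCallId: string; result: LocationResult } {
  if (!isLocationResult(result)) {
    throw new Error(`invalid result for tool call ${toolCallId}`);
  }
  return { toolCallId, result };
}
```

Rejecting a malformed result on the client keeps it out of the message log, where it would otherwise be appended as a tool-result message and shown to the LLM.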
Best Practices
- Use store-postgres for state: it shares the same Postgres instance DBOS uses for its system database, keeping infrastructure to one service.
- Use RedisStreamManager for streaming: Postgres write latency is too high for per-token chunk fan-out.
- Register before launch: call registerDBOSAgentWorkflows() and construct the executor (or call registerAgent) before DBOS.launch() so recovered workflows find their live LanguageModel.
- Keep workflow bodies deterministic: push all I/O into @DBOS.step() calls. See Determinism.
- Choose persistent mode deliberately: use it only when subsequent messages need to arrive into a running workflow context (in-memory state across turns); otherwise standard mode is simpler.
See Also
These deep-dive docs live alongside the package source:
- Architecture: why DBOS, the two execution modes, determinism boundary, runtime comparison
- Client-Executed Tools: DBOS-specific client-tool flow, ownership routing, durable already_completed
- Persistent Mode: recv-loop lifecycle, hibernation, idle TTL, interrupt vs abort
- Race Conditions: exhaustive race analysis with test pointers
- Determinism Rules: the full determinism contract and pitfall catalog
- Error Reference: DBOSModeMismatchError, DBOSWorkflowNotFoundError, DBOSPersistentSessionTerminatedError
- Migrating from Other Runtimes: constructor-swap checklist
- Versioning: DBOS workflow versioning policy
Doc-site cross-links:
- @helix-agents/runtime-dbos API Reference
- Runtime Overview: comparison tables across all runtimes
- Temporal Runtime: the closest architectural peer
- Storage: Postgres, the recommended state store