Workspace Runbook
Operator-facing runbook for the most likely production incidents on the workspace stack. Each section: symptoms, what to check, mitigations, references.
Looking for upgrade procedures? See Upgrading & Migration. Looking for the integration-side error model? See Workspaces overview — Errors integrators should know about.
1. Workspace tool keeps failing with WorkspaceFailedError
Symptoms. The agent's stream surfaces WorkspaceFailedError errors on every workspace tool call. The session never makes forward progress on workspace ops.
What to check.
- registry.describe(): does the entry's state show 'failed'? If so, lastError carries the reason.
- The capability-invariant assertion (round-2 cluster C). Common causes:
  - capabilities.snapshot: true declared on a sandbox without backupR2Binding configured.
  - A custom provider whose open() returns a workspace missing one of the declared modules.
- Provider configuration:
  - Cloudflare Sandbox: is the SANDBOX DO binding correctly wired? Is max_instances reached?
  - Cloudflare Filestore: is ctx.storage.sql available? Did the migration to new_sqlite_classes run?
  - Local Bash: is the host POSIX? Was tmpdirRoot set to a path the process can write?
  - DBOS runtime + agent declared a workspace: the run is rejected at run-start. DBOSAgentExecutor.execute() / resume() / retry() throw synchronously via assertRuntimeSupportsWorkspaces (the same guard Temporal and Cloudflare Workflows use) before any DBOS workflow starts, so the symptom is an immediate "DBOS runtime does not support workspaces" error naming the agent, not a late WorkspaceFailedError and not a silently dropped tool set. Full provider support on DBOS is tracked as future work in docs/dev/future-work.md. Workaround: switch to runtime-js or the Cloudflare DO runtime.
Mitigations.
- Fix the configuration mismatch and redeploy.
- If transient (provider outage, network blip), set transientRetryAttempts higher on the registry deps (default 3, ~10s total backoff) and ensure the provider is throwing with transient: true for the root cause.
- For permanent provider failures, call registry.reset() from operator code (NOT exposed to the LLM) once the underlying cause is fixed.
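To judge how much wall-clock time a given transientRetryAttempts value buys, it can help to compute the backoff budget by hand. The sketch below assumes plain exponential backoff; the base delay (1500 ms) and multiplier (2) are assumptions chosen so that three retries sum to roughly 10 s, and the helper names are hypothetical. Check the actual schedule in your version before relying on these numbers.

```typescript
// Hypothetical helpers: not part of the framework API.
// Assumes exponential backoff with a fixed base delay and multiplier.
function backoffScheduleMs(attempts: number, baseMs = 1500, factor = 2): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(baseMs * Math.pow(factor, i));
  }
  return delays;
}

function totalBackoffMs(attempts: number, baseMs = 1500, factor = 2): number {
  return backoffScheduleMs(attempts, baseMs, factor).reduce((sum, d) => sum + d, 0);
}

// With the assumed defaults, 3 retries wait 1500 + 3000 + 6000 ms.
const defaultBudgetMs = totalBackoffMs(3);
// Raising the attempt count to 5 extends the budget under the same assumptions.
const extendedBudgetMs = totalBackoffMs(5);
```

Under these assumptions each extra attempt roughly doubles the time a session waits before the error surfaces, so raise the count only when the provider really is flagging the root cause as transient.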
References. Errors integrators should know about, Transient vs permanent errors, Operator-driven recovery.
2. User reports their files vanished
Symptoms. A previously-saved file is no longer reachable; read_file throws or ls returns an empty listing where files used to live.
What to check.
- Lifecycle. Did the workspace close() between sessions? Most providers tear down state on close: destroyOnClose: true on the sandbox, tmpdir removal on local-bash. The in-memory provider drops everything on process restart.
- Branch-from-checkpoint behavior. Round-4 cluster A8 fixed a silent corruption: branched sessions previously shared workspace refs with the source session. Post-fix, branched sessions start with FRESH workspaces, so the source's files are not visible. This is the intended (and safer) behavior. See Workspace refs are scoped to the source session.
- Per-provider durability.
  - In-memory: state is lost on process restart.
  - Local-bash: tmpdirs survive process restart but a close() removes them; tmpfs clears on host reboot.
  - Cloudflare Filestore: data persists in DO SQLite for the DO's lifetime; SQLite is durable across hibernation but tied to the DO instance.
  - Cloudflare Sandbox: the container persists across hibernation via the Sandbox DO's storage; destroyOnClose: true permanently removes it.
- Was the session branched? If so, the branch starts FRESH; surface that to the user.
Mitigations.
- For accidental closure: restore from a checkpoint within the SAME session (no branch), which re-attaches the same refs. See Checkpoints + workspaces.
- For branch-fresh-workspace surprise: use Snapshotter.snapshot() + restore() to seed the branch from the source. See Pitfall 7 in the upgrading guide.
- For data lost to provider lifecycle: this is expected behavior. Consider switching to a more durable provider (filestore for CF; local-bash for POSIX dev) if persistence matters.
References. Lifecycle, Workspace refs are scoped to the source session, per-provider durability sections.
3. DO is OOMing
Symptoms. Cloudflare DO crashes with out-of-memory; logs show large memory consumption growing during agent execution.
What to check.
- writeFile size guard. Clusters D round-2, A round-3, and A round-4 added size guards to filestore and sandbox writeFile to bound the worst-case allocation. Ensure your capabilities.fs.maxFileSizeMb is reasonable for your workload. The default is ~10 MB; agents writing 100 MB blobs need either a higher cap (and operator awareness) or a chunked write strategy.
- Large grep operations. grep reads the file into memory before scanning. The maxGrepFileSizeMb knob (default 10 MB) skips files larger than the limit; skippedPaths is returned so the LLM knows. If the LLM keeps raising the cap to scan 500 MB log files, push back at the agent design layer.
- Eviction storms. A flood of WorkspaceEvictedError events causes withEvictionRetry to repeatedly re-resolve workspaces, churning open state. Check the workspace tool: eviction retry exhausted log line for repeat occurrences.
- maxGlobalConcurrentOpens. When many sessions share a single provider instance, simultaneous opens can fire concurrently against the same upstream binding. Each open allocates state; an unbounded count can exhaust the heap or the binding's max_instances before any open completes. Set the provider-level maxGlobalConcurrentOpens to bound this. (With one workspace per agent, the per-session maxConcurrentOpens is effectively a single-open guard.)
Mitigations.
- Lower capabilities.fs.maxFileSizeMb until the LLM can no longer write the offending blob.
- Cap maxGrepFileSizeMb at the capability layer if the LLM is grepping huge files.
- Set workspaceMaxConcurrentOpens on the executor to bound concurrent opens (also matches the Sandbox DO max_instances).
- If eviction storms are the cause, treat as incident #7 below.
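The size-guard and chunked-write ideas above can be sketched as two small pure functions. This is an illustration only: the helper names are hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption about how maxFileSizeMb is interpreted.

```typescript
// Hypothetical helpers; names are illustrative and not framework API.
const BYTES_PER_MB = 1024 * 1024; // assumption: binary megabytes

// Mirrors the idea of a writeFile size guard: reject before allocating.
function exceedsCap(sizeBytes: number, maxFileSizeMb: number): boolean {
  return sizeBytes > maxFileSizeMb * BYTES_PER_MB;
}

// Plans a chunked write: returns [start, end) byte ranges, each within the cap.
function planChunks(totalBytes: number, chunkMb: number): Array<[number, number]> {
  if (chunkMb <= 0) throw new RangeError("chunkMb must be positive");
  const chunkBytes = chunkMb * BYTES_PER_MB;
  const ranges: Array<[number, number]> = [];
  for (let start = 0; start < totalBytes; start += chunkBytes) {
    ranges.push([start, Math.min(start + chunkBytes, totalBytes)]);
  }
  return ranges;
}

// A 100 MB blob against a 10 MB cap: rejected as a single write,
// but expressible as ten cap-sized chunks.
const blobBytes = 100 * BYTES_PER_MB;
const rejected = exceedsCap(blobBytes, 10);
const chunks = planChunks(blobBytes, 10);
```

Chunking keeps the peak allocation per write at the cap, which is the property the guard exists to protect; it does not reduce the total bytes stored.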
References. Tunable knobs, FileSystem module, grep skipped-paths envelope.
4. Sandbox container won't terminate
Symptoms. destroyOnClose: true is set; sessions complete; sandbox containers persist (visible in Cloudflare dashboard or via DO RPC) and accrue cost.
What to check.
- closeAll timeout. Registry close() enforces closeTimeoutMs (default 30000 ms). If the sandbox's destroy() exceeds this, the close is logged as timeout and the framework moves on, but the underlying container may continue.
- sleepAfter. When destroyOnClose: false (default), the container suspends after sleepAfter (by default ~10 minutes). Containers don't terminate; they hibernate. Cost on hibernated containers is much lower than running ones; check whether you actually need destroy semantics.
- Sandbox SDK behavior. @cloudflare/sandbox is pinned at an exact version because its API has been moving. Check the SDK's release notes for known destroy() issues.
Mitigations.
- Set destroyOnClose: true for one-shot agent runs.
- Lower sleepAfter to reduce idle cost (trade-off: higher cold-start latency on resume).
- Tighten closeTimeoutMs if the framework's wait is masking a real problem (better: investigate why destroy() exceeds 30s).
References. Cloudflare Sandbox — Cost notes, Lifecycle.
5. Tmpdirs are filling /tmp on a host
Symptoms. Local-bash deployment shows /tmp/helix-ws-* directories accumulating; disk-full alerts fire on the host.
What to check.
- Close failures. Round-3 cluster C added tmpdir-cause logging on local-bash close failures. Check error logs for tmpdir close failed; the common cause is a long-running subprocess holding a file open in the tmpdir, blocking rm -rf.
- closeTimeoutMs. If close exceeds the deadline, the framework moves on but the tmpdir survives. Default 30000 ms; tighter is OK on local-bash.
- Crash-leaked tmpdirs. Process crash leaves tmpdirs orphaned; they live until the OS clears tmpfs (boot, manual clean).
Mitigations.
- Add a periodic cron / systemd-timer job: find /tmp/helix-ws-* -maxdepth 0 -mtime +1 -exec rm -rf {} + (adjust -mtime per your session lifetime).
- Investigate the root cause of close failures via the tmpdir-cause logs and fix the subprocess hang.
- Consider tmpdirRoot pointing at a tmpfs that auto-clears on reboot if your sessions never need to outlive a reboot.
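If the cleanup runs from a Node process rather than cron, the selection step might look like the sketch below. It is similar in spirit to the find command above but not an exact equivalent (find rounds ages down to whole days). The DirEntry type and function name are hypothetical, and a real job would populate the entries from a directory listing and stat calls before deleting anything.

```typescript
// Hypothetical types and helper; a real job would fill `entries`
// from a directory listing plus stat calls.
interface DirEntry {
  path: string;
  mtimeMs: number; // last-modified time in epoch milliseconds
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns helix-ws-* directories not modified within maxAgeMs.
function selectStaleTmpdirs(entries: DirEntry[], nowMs: number, maxAgeMs: number): string[] {
  return entries
    .filter((e) => /\/helix-ws-[^/]*$/.test(e.path)) // only workspace tmpdirs
    .filter((e) => nowMs - e.mtimeMs > maxAgeMs) // only those older than the cutoff
    .map((e) => e.path);
}

// Example: one stale workspace dir, one fresh one, one unrelated dir.
const nowForCleanup = 10 * DAY_MS;
const staleDirs = selectStaleTmpdirs(
  [
    { path: "/tmp/helix-ws-a", mtimeMs: nowForCleanup - 3 * DAY_MS },
    { path: "/tmp/helix-ws-b", mtimeMs: nowForCleanup - 60 * 1000 },
    { path: "/tmp/other", mtimeMs: nowForCleanup - 3 * DAY_MS },
  ],
  nowForCleanup,
  DAY_MS,
);
```

Keeping selection separate from deletion makes it easy to log the candidate list first and confirm that no live session's tmpdir falls under the cutoff.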
References. Local Bash — Lifecycle.
6. Workspace refs in state store don't match container state
Symptoms. provider.resolve(ref) throws WorkspaceFailedError with a schemaVersion message OR resolves to a sandbox/namespace that's empty.
What to check.
- Schema-version mismatch. Round-4 cluster D introduced explicit schemaVersion on every ref. Providers support N±1; beyond that range, resolution throws with a clear message. See Pitfall 8 in the upgrading guide.
- Rollback hazard. If you rolled back across multiple schema versions, persisted refs may carry a version the OLD code doesn't know.
- Provider-side state divergence. Did the underlying container/namespace get manually deleted? registry.describe() shows 'failed'; lastError may have provider-specific detail.
Mitigations.
- Roll forward to a version that understands the persisted ref schemaVersion.
- For sessions stuck in 'failed': identify them via registry.describe() filtered to state: 'failed' + lastError matching schemaVersion, then either re-create or roll forward.
- For container-state divergence (the underlying sandbox or namespace was deleted out-of-band): no automatic recovery. The session is unrecoverable; surface to the user and start a fresh session.
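The N±1 window is easy to reason about with a small check. The sketch below is an illustration of the rule as stated, not the provider's actual code; the function and result names are hypothetical.

```typescript
// Hypothetical helper; the real provider check may differ in detail.
type RefCompat = "ok" | "ref-too-new" | "ref-too-old";

// N±1 window: a provider at version N resolves refs at N-1, N, or N+1.
function checkRefCompat(refSchemaVersion: number, providerSchemaVersion: number): RefCompat {
  const delta = refSchemaVersion - providerSchemaVersion;
  if (delta > 1) return "ref-too-new"; // typical after a multi-version rollback
  if (delta < -1) return "ref-too-old"; // typical after skipping versions on upgrade
  return "ok";
}

// Rolling back from a deploy that wrote v4 refs to code at v2 leaves refs unresolvable.
const afterRollback = checkRefCompat(4, 2);
// A single-version rollback stays inside the window.
const afterSmallRollback = checkRefCompat(3, 2);
```

Running a check like this over the persisted refs before a rollback tells you how many sessions would land outside the window.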
References. Pitfall 8 — Schema version drift, Rollback procedure.
7. Eviction storms
Symptoms. Logs are full of workspace tool: eviction retry exhausted at error level; agents make slow forward progress; metrics show a high incEviction rate without recovery via incEvictionRetry.
What to check.
- Provider stability. Repeat eviction-retry-exhausted is a strong signal that the underlying provider isn't recoverable in this moment. Common causes:
  - Cloudflare Sandbox: container quota exhaustion on the binding (max_instances reached, retry burst on top).
  - Cloudflare Filestore: R2 bucket reachability (if you use R2 spill).
  - Local Bash: tmpfs full, host OOM, parent process churning.
- Transient vs permanent classification. Round-4 cluster C added explicit transient: true opt-in on WorkspaceFailedError. If the provider is misclassifying a permanent error as transient, the registry retries with backoff (default 3 retries, ~10s total) before surfacing, which wastes cycles. Check provider code.
- Restart-storm dynamics. During a CF deployment rollout, every DO's first agent operation triggers provider.resolve() for every workspace ref it had. Without maxConcurrentOpens, this is N parallel getSandbox RPCs per DO, multiplied by the DOs being recycled. See Restart behavior.
Mitigations.
- Set workspaceMaxConcurrentOpens to bound concurrent opens per executor (matches the Sandbox DO max_instances). This is per-session; for process-wide bounds across all sessions sharing a provider, ALSO set maxGlobalConcurrentOpens on the provider options (round-5 B2). The two are layered: registry-level for fairness across workspaces in one session, provider-level for back-pressure to the upstream binding.
- Lower transientRetryAttempts (default 3) for paths that should fail fast.
- Set resetAfterMs on the registry to auto-recover from 'failed' state once the cooldown elapses (round-5 B4). Recommended: 5 * 60 * 1000 (5 min) so a transient outage doesn't permanently brick sessions until an operator manually resets them.
- Wire WorkspaceMetrics to surface eviction rates to your monitoring system; alert on incEviction > threshold so you catch storms early.
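The layering of a per-session bound under a process-wide bound can be pictured with two counters. The sketch below is a simplified, synchronous model with hypothetical class and function names; the framework's own limiters are presumably asynchronous and queue waiters rather than rejecting them.

```typescript
// Hypothetical sketch; class and method names are illustrative, not framework API.
class SlotLimiter {
  private inUse = 0;
  private readonly max: number;
  constructor(max: number) {
    this.max = max;
  }
  tryAcquire(): boolean {
    if (this.inUse >= this.max) return false;
    this.inUse++;
    return true;
  }
  release(): void {
    if (this.inUse > 0) this.inUse--;
  }
}

// An open proceeds only when BOTH the per-session and the global slot are free.
function tryBeginOpen(session: SlotLimiter, global: SlotLimiter): boolean {
  if (!session.tryAcquire()) return false;
  if (!global.tryAcquire()) {
    session.release(); // do not hold the session slot while blocked upstream
    return false;
  }
  return true;
}

function endOpen(session: SlotLimiter, global: SlotLimiter): void {
  global.release();
  session.release();
}

// Three sessions (one open each) share a provider that allows two concurrent opens.
const globalSlots = new SlotLimiter(2);
const s1 = new SlotLimiter(1);
const s2 = new SlotLimiter(1);
const s3 = new SlotLimiter(1);
const firstOk = tryBeginOpen(s1, globalSlots);
const secondOk = tryBeginOpen(s2, globalSlots);
const thirdBlocked = !tryBeginOpen(s3, globalSlots);
endOpen(s1, globalSlots);
const thirdOkAfterRelease = tryBeginOpen(s3, globalSlots);
```

The per-session limiter alone would have let all three opens through, which is why the provider-level bound is the one that protects the upstream binding during a restart storm.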
References. Eviction recovery semantics, Transient vs permanent errors, Restart behavior — Sandbox.
8. Cross-session data sharing suspected
Symptoms. One session's data is visible in another session's workspace. Audit logs show writes from one session appearing in reads from another.
What to check.
- Sandbox id sharing. Round-4 cluster A6 added the explicit-opt-in pattern: setting config.id to a non-default value requires shareAcrossSessions: true. Pre-fix, this was silent: multiple sessions opening the same workspace name with the same id attached to the SAME container. Post-fix, the misconfig throws at open().
  - If you're on pre-fix code, this is the most likely cause. Upgrade and respond to the assertion.
  - If you're on post-fix code with shareAcrossSessions: true set intentionally, the sharing is by design; surface to the user and re-evaluate whether that's what you want.
- Filestore namespace sharing. Same pattern applies to CloudflareFileStoreWorkspaceConfig.namespace. An explicit shared namespace shares data; a default namespace (derived from sessionId) does not.
- Sub-agent inheritance. Round-4 cluster B reviewed sub-agent workspace inheritance: by DEFAULT, sub-agents are workspace-isolated. If inheritWorkspace: true is set, the child sees the parent's registry, but this is opt-in and visible at the call site.
  - Check whether createSubAgentTool(..., { inheritWorkspace: true }) was set unintentionally.
Mitigations.
- Audit ALL workspace configs for non-default id (sandbox) and namespace (filestore). Either remove them or pair with the explicit-opt-in flag.
- Audit inheritWorkspace: true usage; restrict to scoped admin tools.
- For confirmed sharing-where-isolation-was-expected: identify the affected window, follow your incident-response plan.
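A config audit along these lines can be scripted. The sketch below uses a simplified, hypothetical config shape: the id, namespace, and shareAcrossSessions field names come from the text above, but applying shareAcrossSessions to filestore configs is an assumption, as are the kind and name fields.

```typescript
// Hypothetical config shape; the filestore opt-in flag name is an assumption.
interface WorkspaceConfigLike {
  name: string;
  kind: "sandbox" | "filestore";
  id?: string; // sandbox: explicit container id
  namespace?: string; // filestore: explicit namespace
  shareAcrossSessions?: boolean;
}

// Returns names of configs that pin a shared identity without the explicit opt-in.
function findImplicitSharing(configs: WorkspaceConfigLike[]): string[] {
  return configs
    .filter((c) => {
      const pinned = c.kind === "sandbox" ? c.id !== undefined : c.namespace !== undefined;
      return pinned && c.shareAcrossSessions !== true;
    })
    .map((c) => c.name);
}

const auditedConfigs: WorkspaceConfigLike[] = [
  { name: "scratch", kind: "sandbox" }, // default id: isolated per session
  { name: "shared-box", kind: "sandbox", id: "team-1" }, // pinned without opt-in
  { name: "shared-ok", kind: "sandbox", id: "team-2", shareAcrossSessions: true },
  { name: "files", kind: "filestore", namespace: "global" }, // pinned without opt-in
];
const flagged = findImplicitSharing(auditedConfigs);
```

Configs that carry the opt-in are not flagged, since that sharing is deliberate; they still deserve a manual review of whether sharing is what you want.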
References. Cross-session sharing — Sandbox, Workspaces in sub-agents, Prompt-injection threat surface.
9. Audit log saturation
Symptoms. Log sink ingestion volume spikes; operator alerting fires on log-line rate or byte budget; per-session log volume reaches MB/sec sustained. Closer inspection shows thousands of identical workspace tool: shell metacharacter rejected (or sibling glob/brace expansion rejected, command not in allowlist) lines from a single session.
What to check.
- Tight LLM-loop misuse. A confused or adversarial LLM can iterate the same rejected command (workspace_run({ command: 'ls; ' + i })) thousands of times per second. Each rejection used to fire one structured warn; at default verbosity that's ~1MB/sec per session, multiplied by N sessions.
- Round-5 B5 mitigation. As of round-5 B5, security warns in tool-injection.ts, CloudflareSandboxShell, and SubprocessShell are wrapped with RateLimitedLogger.warnRateLimited. The first occurrence of each distinct rejection ALWAYS emits at full fidelity; subsequent identical events within a 60s window are deduped; a rollup line (with suppressedCount) emits every 5s so volume stays visible. The logger NEVER suppresses entirely.
- Distinct-event keying. The dedup key incorporates workspace name + first-token (+ matched metachar where applicable) so different commands rejected from different workspaces emit independently. If you observe rate-limited rollup lines with identical content, the LLM is genuinely repeating the same exact rejection; the alerting amplitude is now O(rollups/sec) rather than O(rejections/sec).
Mitigations.
- Already mitigated by default in round-5+ — verify your version. Earlier versions emitted one warn per rejection.
- For higher dedup density, tighten securityWindowMs on the registry's RateLimitedLogger (constructor option), but check that you still see novel attack signals first.
- Investigate the upstream LLM behavior: tight-loop rejections often indicate prompt mis-engineering or adversarial inputs. The audit log shows the first rejection plus rollup count; that's enough to identify the offending session.
- Filter your operator alerting on [rate-limited] in the message text to surface ROLLUP volume rather than individual events.
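The dedup-plus-rollup behavior described above can be modeled in a few lines. This sketch is NOT the framework's RateLimitedLogger: the class name, key format, and message text are assumptions, and it only checks for a pending rollup when another event arrives, where a real implementation would more likely use a timer.

```typescript
// Hypothetical model of first-occurrence emit, windowed dedup, and periodic rollup.
interface KeyState {
  windowStart: number;
  lastRollup: number;
  suppressed: number;
}

class DedupLog {
  readonly lines: string[] = [];
  private readonly state = new Map<string, KeyState>();
  private readonly windowMs: number;
  private readonly rollupMs: number;

  constructor(windowMs = 60000, rollupMs = 5000) {
    this.windowMs = windowMs;
    this.rollupMs = rollupMs;
  }

  warn(workspace: string, command: string, nowMs: number): void {
    // Assumed key: workspace name plus the command's first token.
    const firstToken = command.trim().split(/\s+/)[0];
    const key = workspace + ":" + firstToken;
    const s = this.state.get(key);
    if (!s || nowMs - s.windowStart >= this.windowMs) {
      if (s && s.suppressed > 0) {
        // Flush any uncounted events before starting a new window.
        this.lines.push("[rate-limited] " + key + " suppressedCount=" + s.suppressed);
      }
      // First occurrence (or window expired): emit at full fidelity.
      this.lines.push("rejected " + key);
      this.state.set(key, { windowStart: nowMs, lastRollup: nowMs, suppressed: 0 });
      return;
    }
    s.suppressed++;
    if (nowMs - s.lastRollup >= this.rollupMs) {
      // Periodic rollup keeps volume visible without one line per event.
      this.lines.push("[rate-limited] " + key + " suppressedCount=" + s.suppressed);
      s.suppressed = 0;
      s.lastRollup = nowMs;
    }
  }
}

// 1000 near-identical rejections, 10 ms apart, from one workspace.
const dedupLog = new DedupLog();
for (let i = 0; i < 1000; i++) {
  dedupLog.warn("ws-a", "ls; " + i, i * 10);
}
const linesFromLoop = dedupLog.lines.length;
// A rejection from a different workspace has a distinct key and emits immediately.
dedupLog.warn("ws-b", "ls; 1", 10000);
```

In this model the thousand rejections collapse to one full-fidelity line plus one rollup, which is the O(rollups/sec) amplitude the alerting should now be tuned against.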
References. RateLimitedLogger in the workspace utils export.
10. Snapshot R2 cost amplification
Symptoms. R2 storage usage on the backupR2Binding bucket grows unbounded over time. Cloudflare bill shows R2 storage costs disproportionate to active session count. R2 list shows tens of thousands of backups/<id>/... keys with creation timestamps reaching back weeks/months.
What to check.
- LLM snapshot frequency. A confused or adversarial LLM can call workspace_snapshot 10×/sec. Each call writes a multi-MB squashfs archive to R2. Without pruning, the archives accumulate forever.
- Round-7 mitigation. As of round-7 the framework provides Snapshotter.list() and Snapshotter.delete() and corresponding auto-injected tools. Verify your version supports them. The CloudflareSandboxSnapshotter implements both; agents and operators can self-prune.
- R2 lifecycle policy. The SDK (@cloudflare/sandbox) recommends configuring an R2 lifecycle rule on the backups/ prefix as a backstop: the framework's pruning is reactive, the lifecycle policy is the time-bound floor. Together they bound storage on every dimension.
Mitigations.
- Agent-side: prompt the LLM to call workspace_list_snapshots periodically and workspace_delete_snapshot on snapshots older than the retention window. The system prompt fragment auto-injected for capabilities.snapshot references both tools.
- Operator-side: a Cron Trigger that opens each long-running session's workspace, calls list_snapshots({ allowCrossSession: true }), filters by retention rule, and calls delete_snapshot on each match. Cross-session opt-ins audit-log at warn.
- Cap the LLM's list_snapshots response: configure capabilities.snapshot.maxListResults (default 100). The auto-injected tool clamps the LLM-supplied limit to this ceiling so a misbehaving call cannot dump 10k refs into the LLM's context window.
- Audit-log alerting: alert on snapshot-rate > N/min per session (workspace.snapshot usage entries from recordUsage). The cost-amplification pattern surfaces as a sustained high rate; pair with an automatic interrupt for adversarial sessions.
- Backstop with R2 lifecycle: the SDK's meta.json carries a ttl field. Configure a max-age R2 lifecycle rule on the backups/ prefix matching your retention floor.
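The retention filter and the list-limit clamp from the mitigations above might be sketched as follows. The SnapshotInfo shape and both function names are hypothetical (the real Snapshotter.list() result may carry different fields), and the keepLatest safeguard is an added assumption rather than documented framework behavior.

```typescript
// Hypothetical shapes; the real list result may carry different fields.
interface SnapshotInfo {
  id: string;
  createdAtMs: number;
}

// Clamp an LLM-supplied limit to the configured ceiling (default 100 per the text above).
function clampListLimit(requested: number | undefined, maxListResults = 100): number {
  if (requested === undefined || !Number.isFinite(requested)) return maxListResults;
  return Math.max(1, Math.min(Math.floor(requested), maxListResults));
}

// Select snapshots older than the retention window, always keeping the newest `keepLatest`.
function selectForPruning(
  snaps: SnapshotInfo[],
  nowMs: number,
  retentionMs: number,
  keepLatest = 1,
): string[] {
  const newestFirst = [...snaps].sort((a, b) => b.createdAtMs - a.createdAtMs);
  return newestFirst
    .slice(keepLatest) // never prune the most recent snapshot(s)
    .filter((s) => nowMs - s.createdAtMs > retentionMs)
    .map((s) => s.id);
}

const SNAP_DAY_MS = 24 * 60 * 60 * 1000;
const snapNow = 100 * SNAP_DAY_MS;
const toPrune = selectForPruning(
  [
    { id: "a", createdAtMs: snapNow - 40 * SNAP_DAY_MS },
    { id: "b", createdAtMs: snapNow - 35 * SNAP_DAY_MS },
    { id: "c", createdAtMs: snapNow - 1 * SNAP_DAY_MS },
  ],
  snapNow,
  30 * SNAP_DAY_MS,
);
const clampedLimit = clampListLimit(10000);
```

An operator-side Cron Trigger would feed the ids returned here into delete_snapshot; the R2 lifecycle rule remains the backstop if the job itself stops running.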
References. Snapshotter — Pruning section, Cloudflare Sandbox — Snapshot semantics.
11. workspace_script fails or behaves unexpectedly
Symptoms. The agent's workspace_script calls fail, or the dynamic-worker / dual-tier sandbox session never opens. Common surface errors: WorkspaceFailedError: capability 'script' declared but no Worker Loader binding configured; workspace_script returns exitCode: 1 with an abort/timeout error; the LLM's script can't reach an external service; or the LLM expects variables/files to persist across workspace_script calls and they don't.
What to check.
- Worker Loader binding. The script capability runs LLM JavaScript in a Cloudflare Worker-Loader isolate. The binding MUST be wired: for the standalone cloudflare-dynamic-worker provider, pass { loader: env.LOADER } to the constructor; for the sandbox dual-tier path, pass the provider's loader option. A script capability declared without a loader fails fast at open() (the sandbox provider throws the WorkspaceFailedError above, unlike snapshot, which defers to first use). Confirm [[worker_loaders]] is in your wrangler.toml and env.LOADER resolves.
- Worker Loaders enablement. Worker Loaders / Dynamic Workers are a gated/beta Cloudflare feature. If the binding is declared but the account isn't enrolled, the binding is unavailable at runtime. Same caveat tier as the sandbox's Containers/Docker requirement.
- Network is OFF by default. The isolate ships egress-denied (network: 'off' → globalOutbound: null). If the LLM's script does fetch(...) to an external host, it fails by design. Opt in with capabilities: { script: { network: 'allow' } } (honored on both the dynamic-worker and sandbox providers). There is no allowlist in v1; it is all-or-nothing.
- Timeout. maxDurationMs (capability config, or the provider default) caps each run; the LLM-supplied timeoutMs on a single call overrides it. A runaway script surfaces as exitCode: 1 with an abort/timeout error, not a hang.
- Statelessness. Every workspace_script call loads a FRESH isolate (JS-only, isStateful: false). No variables, files, or /tmp scratch carry across calls. An LLM expecting persistence wants the sandbox code interpreter (codeStateful: true) or a durable filesystem instead.
- Config location. network / maxDurationMs belong on capabilities.script (preferred; takes precedence on both providers); compatibilityDate is a provider/kind-level option only (no ScriptCapConfig field). Config written on the wrong object that isn't honored usually means a version predating the dynamic-worker capability-config consistency fix.
Mitigations.
- Wire the LOADER binding (and confirm beta enablement) before declaring capabilities.script; until then, drop the script capability so the rest of the workspace opens.
- For network-dependent scripts, set script: { network: 'allow' } and treat it as a deliberate egress decision (the isolate is the security boundary).
- For compute that needs a full toolchain, persistence, or a non-JS language, route the LLM to the sandbox code interpreter (the dual-tier pattern) or Cloudflare Filestore instead of script.
References. Script module, Cloudflare Dynamic Worker, Cloudflare Sandbox — script tier / dual-tier, Workspaces Security — script isolate.
12. local-sandbox fails closed or extra paths silently ignored
Symptoms. Either:
- Every open() / resolve() on a local-sandbox workspace throws WorkspaceFailedError: LocalSandboxWorkspaceProvider: requires seatbelt (macOS) or bwrap (Linux); none available (<reason>), so the session never gets a workspace. This is the provider failing closed by design: it refuses to run commands unconfined rather than silently degrading to local-bash semantics.
- The sandbox opens fine, but a path you granted via readWritePaths / readOnlyPaths is still denied (writes rejected, or reads invisible on bwrap) even though the agent's uid can access it on the host.
What to check.
- No isolation backend on the host. The fail-closed message names the reason. Common cases: running on Windows (no seatbelt/bwrap equivalent); a Linux host or CI runner where bubblewrap isn't installed (bwrap not found); a locked-down container that strips sandbox-exec / bwrap or the syscalls they need; macOS without sandbox-exec on PATH. The provider auto-detects at construction (isolation: 'auto'); a pinned isolation: 'seatbelt' on Linux (or 'bwrap' on macOS) also resolves to no backend and fails closed.
- Non-canonical extra paths. readWritePaths / readOnlyPaths are matched against the path the kernel resolves, NOT the string you passed; the provider does NOT canonicalize them. On macOS /tmp and /var are symlinks (/tmp → /private/tmp), so a /tmp/... entry silently fails to match and the grant is a no-op. The workspace tmpdir itself is already realpath-canonicalized by the provider; only the operator-supplied extra paths are affected.
Mitigations.
- Linux / CI: install bubblewrap with apt-get install bubblewrap (Debian/Ubuntu) or dnf install bubblewrap (Fedora/RHEL), then re-create the provider so detection re-runs.
- Where OS isolation genuinely isn't available (Windows, a host that can't run bwrap): switch to local-bash for trusted input, run the agent inside WSL (Windows → Linux host where bwrap installs), or use a container/VM provider (Cloudflare Sandbox) for untrusted input. Do NOT try to "downgrade" the sandbox provider to unconfined execution; the fail-closed refusal is intentional.
- Extra-path mismatch: pass realpath-canonical paths, for example fs.realpathSync('/tmp/shared') → /private/tmp/shared on macOS, so the grant matches what the kernel resolves.
References. Local Sandbox — Fail-closed behavior, Local Sandbox — Extra paths, Workspaces Security — local-sandbox vs local-bash.
13. docker fails closed or in-container shell errors
Symptoms. Either:
- Every open() / resolve() on a docker workspace throws WorkspaceFailedError: DockerWorkspaceProvider: docker daemon not available (...), so the session never gets a workspace. This is the provider failing closed by design: it refuses to run commands unconfined rather than silently degrading.
- open() throws DockerWorkspaceProvider: failed to ensure image '<image>' (...): the image couldn't be pulled or, under pullPolicy: 'never', isn't present locally.
- The workspace opens, but in-container shell commands fail with permission errors against bind-mounted files, or a long-running command never returns.
What to check.
- Daemon down / unreachable. The fail-closed message carries the reason from engine.ping(). Common causes: the Docker daemon (or Docker Desktop) isn't running; the process can't reach the socket (/var/run/docker.sock permissions, or the user not in the docker group); a remote DOCKER_HOST is misconfigured. Only a positive ping result is cached, so a daemon that came up AFTER a failed probe is re-probed on the next attempt (the error is flagged transient: true, so the registry retries with backoff).
- pullPolicy: 'never' + missing image. The air-gap-safe path makes NO network attempt and fails closed when the image is absent. If you intend offline operation, pre-pull the image (docker pull <image>) on the host so it's present in the local image cache; otherwise drop pullPolicy: 'never' to allow the pull.
- Bind-mount uid/gid mismatch (esp. macOS Docker Desktop). The container runs as the host's uid:gid so the bind mount shares ownership. On Docker Desktop for macOS (VirtioFS / gRPC-FUSE) the uid is remapped at the VM boundary; permission errors against /workspace files are usually this. Use the containerUser provider option for images that require a fixed user.
- Hung / timed-out command. Killing a docker exec does NOT kill the in-container child (PID namespaces). The engine bounds each call with an in-container timeout wrapper plus a host-side stream-destroy backstop, so a runaway command should surface as a timed-out RunResult, not a hang. If commands DO hang, check that the image actually ships a timeout binary (BusyBox/coreutils) and that maxDurationMs (or the per-call timeoutMs) is set.
Mitigations.
- Daemon down: start Docker / Docker Desktop, fix socket permissions (add the user to the docker group, or correct DOCKER_HOST), then re-attempt; detection re-probes.
- Missing image: pre-pull on the host (docker pull <image>), or drop pullPolicy: 'never'.
- Where a daemon genuinely isn't available: switch to local-sandbox (host-kernel isolation, no daemon) for local isolated POSIX exec, or Cloudflare Sandbox for untrusted input on Cloudflare. Do NOT try to downgrade to unconfined execution; the fail-closed refusal is intentional.
- uid mismatch: set containerUser to match the image's expected user, or use an image whose default user owns /workspace.
References. Docker — Fail-closed behavior, Docker — The three gotchas, Docker — Resume, Workspaces Security — docker: container boundary.
Building a /healthz endpoint (round-5 D6)
registry.describe() returns a frozen point-in-time snapshot of the workspace entry (or undefined when no workspace is configured) — the cheapest path to surfacing workspace state to your monitoring system. Wiring a /healthz route looks different per runtime:
JS runtime
JSAgentExecutor.getWorkspaceRegistry(sessionId) (round-5 D6) returns the live registry for an active session. There are two paths:
Option 1: in-process (Express, Fastify, and Hono share the same shape):

// healthz-handler.ts
app.get('/healthz/:sessionId', async (req, res) => {
  // Returns undefined when no run is currently live for this session.
  const registry = executor.getWorkspaceRegistry(req.params.sessionId);
  if (!registry) {
    return res.status(404).json({ error: 'no active session' });
  }
  // describe() is undefined when no workspace is configured; normalize to null.
  return res.json({ workspace: registry.describe() ?? null });
});

Option 2: HTTP introspection via @helix-agents/agent-server:
If you're hosting the executor behind AgentServer (round-7), the package exposes the same data over GET /workspace?sessionId=X. Operators query it through their normal monitoring stack rather than reaching into in-process executor state:
curl "https://your-agent-server.example.com/workspace?sessionId=abc-123" \
  -H "Authorization: Bearer $OPERATOR_TOKEN"

Example response:

{
  "workspace": { "state": "open", "providerId": "in-memory", "openedAt": 1700000000000 }
}

The route is gated by your authenticate hook (operation tag: 'workspace'). 404 disambiguation:
- RUNTIME_NO_WORKSPACE_SUPPORT: executor doesn't implement getWorkspaceRegistry (Temporal / CF Workflows) or has zero workspaceProviderKinds configured.
- WORKSPACES_NOT_FOUND: executor supports workspaces but the session isn't currently live on this replica.
See the @helix-agents/agent-server README on GitLab for the full status-code matrix.
The accessor returns undefined when:
- No runLoop is currently active for that sessionId (the registry has been closed).
- The agent has no workspace declared.
- The session is paused/interrupted (the runLoop has exited; the registry was closed in its finally).
For paused/interrupted sessions, the operator path is "read the persisted workspaceRef from the state store" — the registry has been closed and describe() would only show a stale snapshot anyway.
Version-drift trap (post-stateless-suspension). If getWorkspaceRegistry(sessionId) returns undefined for a session you KNOW is active (the run is mid-step, you can see chunks streaming, the LLM is calling tools), verify your runtime-js version includes the publishWorkspaceRegistry callback wiring. After the stateless-suspension redesign deleted the legacy JSAgentExecutor.runLoop, the executor's activeWorkspaceRegistries map became dead code in the window between the deletion and the fix: getWorkspaceRegistry() returned undefined for every active session and GET /workspace always 404'd. The legacy runLoop populated the map directly; the stateless runLoop requires the executor to thread a publishWorkspaceRegistry?: (registry | undefined) => void callback through RunLoopInput. The fix lives in packages/runtime-js/src/run-loop.ts:374-388,475-492,1108-1118 and the executor wiring at packages/runtime-js/src/js-agent-executor.ts:3259-3267. Operators reading this doc on a deploy that pre-dates the fix would see the documented undefined causes above and conclude the session is dead, when in fact the runtime regression silently broke the introspection path for every session. See the v6 → v7 stateless-suspension upgrade guide for the version-bump call to action.
Cloudflare DO runtime
The DO base class wraps JSAgentExecutor per Durable Object instance. Inside your DO subclass, call getWorkspaceRegistry(sessionId) on the inner executor and route a /healthz fetch path through it:
// my-agent-server.ts (inside your createAgentServer subclass)
async fetch(req: Request): Promise<Response> {
  const url = new URL(req.url);
  if (url.pathname === '/healthz') {
    // The session id is taken from the room header; empty string if absent.
    const sessionId = req.headers.get('x-partykit-room') ?? '';
    const registry = this.executor.getWorkspaceRegistry(sessionId);
    if (!registry) {
      return new Response(JSON.stringify({ error: 'no active session' }), {
        status: 404,
        headers: { 'Content-Type': 'application/json' },
      });
    }
    return Response.json({ workspace: registry.describe() ?? null });
  }
  // ... rest of your fetch routing ...
  return super.fetch(req);
}

The shape of describe() is documented on the Workspaces overview — Health endpoint.
Temporal
Workspaces are not supported on the Temporal runtime — there is no registry to introspect. Health for Temporal-hosted agents lives in the Temporal cluster's UI and metrics.
What to alert on
- state === 'failed' for >N minutes: provider permanently broken; lastError carries the cause. Pair with resetAfterMs for auto-recovery (round-5 B4).
- lastAttemptAt > lastSuccessAt + threshold: workspace is being hammered AND failing (round-5 B6; the inverted-alerting case the legacy lastOpAt couldn't surface).
- state === 'evicted' for sustained periods: provider unstable; check eviction-storm runbook entry.
- High lastError cardinality across sessions: coordinated provider issue (R2 outage, Sandbox quota).
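The per-snapshot part of these rules can be evaluated directly on a describe() result. The sketch below uses a hypothetical subset of the snapshot shape: only fields named above appear, and their exact types (epoch-millisecond numbers, string lastError) are assumptions. The "for >N minutes" and cross-session cardinality conditions need history, so they are left to the monitoring system.

```typescript
// Hypothetical subset of the describe() snapshot; field types are assumptions.
interface WorkspaceSnapshotLike {
  state: string;
  lastError?: string;
  lastAttemptAt?: number; // epoch ms
  lastSuccessAt?: number; // epoch ms
}

// Returns the point-in-time alert conditions that hold for one snapshot.
function evaluateAlerts(snap: WorkspaceSnapshotLike, thresholdMs: number): string[] {
  const alerts: string[] = [];
  if (snap.state === "failed") alerts.push("failed");
  if (snap.state === "evicted") alerts.push("evicted");
  // Attempts keep arriving but nothing has succeeded within the threshold.
  if (
    snap.lastAttemptAt !== undefined &&
    snap.lastAttemptAt > (snap.lastSuccessAt ?? 0) + thresholdMs
  ) {
    alerts.push("attempts-without-success");
  }
  return alerts;
}

const failedAlerts = evaluateAlerts({ state: "failed", lastError: "quota" }, 60000);
const hammeredAlerts = evaluateAlerts(
  { state: "open", lastAttemptAt: 200000, lastSuccessAt: 100000 },
  60000,
);
const healthyAlerts = evaluateAlerts(
  { state: "open", lastAttemptAt: 120000, lastSuccessAt: 100000 },
  60000,
);
```

Exporting the returned labels as gauge metrics lets the monitoring system apply the sustained-duration windows without the health endpoint having to keep state.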
See also
- Workspaces overview — Operations for metrics, hooks, healthz, and operator-driven recovery surfaces.
- Upgrading & Migration for per-version compat and rollback.
- Building a Provider for the provider-side error model.