Workspace Runbook
Operator-facing runbook for the eight most likely production incidents on the workspace stack. Each section: symptoms, what to check, mitigations, references.
Looking for upgrade procedures? See Upgrading & Migration. Looking for the integration-side error model? See Workspaces overview — Errors integrators should know about.
1. Workspace tool keeps failing with WorkspaceFailedError
Symptoms. The agent's stream surfaces `WorkspaceFailedError` on every workspace tool call. The session never makes forward progress on workspace ops.
What to check.
- `registry.describe()` — does the entry's `state` show `'failed'`? If so, `lastError` carries the reason.
- The capability-invariant assertion (round-2 cluster C). Common causes:
  - `capabilities.snapshot: true` declared on a sandbox without `backupR2Binding` configured.
  - A custom provider whose `open()` returns a workspace missing one of the declared modules.
- Provider configuration:
  - Cloudflare Sandbox: is the `SANDBOX` DO binding correctly wired? Is `max_instances` reached?
  - Cloudflare Filestore: is `ctx.storage.sql` available? Did the migration to `new_sqlite_classes` run?
  - Local Bash: is the host POSIX? Was `tmpdirRoot` set to a path the process can write?
Mitigations.
- Fix the configuration mismatch and redeploy.
- If transient (provider outage, network blip), set `transientRetryAttempts` higher on the registry deps (default 3, ~10s total backoff) and ensure the provider is throwing with `transient: true` for the root cause.
- For permanent provider failures, call `registry.reset(name)` from operator code (NOT exposed to the LLM) once the underlying cause is fixed.
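The transient-retry behavior described above (default 3 retries, ~10s of total backoff, retry only when the provider opts in with `transient: true`) can be sketched as follows. This is a minimal illustration, not the registry's actual implementation — `withTransientRetry` and its parameters are hypothetical names:

```ts
// Sketch only: a WorkspaceFailedError carrying the transient flag, and a
// retry loop with exponential backoff (1s, 2s, 4s ≈ ~10s total at defaults).
class WorkspaceFailedError extends Error {
  constructor(message: string, readonly transient = false) {
    super(message);
  }
}

async function withTransientRetry<T>(
  op: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 1000,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await op();
    } catch (err) {
      // Only errors explicitly opted in as transient are retried;
      // permanent failures surface immediately.
      const retryable = err instanceof WorkspaceFailedError && err.transient;
      if (!retryable || i >= attempts) throw err;
      await sleep(baseDelayMs * 2 ** i);
    }
  }
}
```

Note the asymmetry this implies for provider authors: forgetting `transient: true` turns a recoverable blip into an immediate failure, while setting it on a permanent error burns ~10s per call before surfacing.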
References. Errors integrators should know about, Transient vs permanent errors, Operator-driven recovery.
2. User reports their files vanished
Symptoms. A previously-saved file is no longer reachable; `read_file` throws or `ls` returns an empty listing where files used to live.
What to check.
- Lifecycle. Did the workspace `close()` between sessions? Most providers tear down state on close — `destroyOnClose: true` on the sandbox, tmpdir removal on local-bash. The in-memory provider drops everything on process restart.
- Branch-from-checkpoint behavior. Round-4 cluster A8 fixed a silent corruption: branched sessions previously shared workspace refs with the source session. Post-fix, branched sessions start with FRESH workspaces — the source's files are not visible. This is the intended (and safer) behavior. See Workspace refs are scoped to the source session.
- Per-provider durability.
  - In-memory: state is lost on process restart.
  - Local-bash: tmpdirs survive process restart but a `close()` removes them; tmpfs clears on host reboot.
  - Cloudflare Filestore: data persists in DO SQLite for the DO's lifetime; SQLite is durable across hibernation but tied to the DO instance.
  - Cloudflare Sandbox: the container persists across hibernation via the Sandbox DO's storage; `destroyOnClose: true` permanently removes it.
- Was the session branched? If so, the branch starts FRESH — surface that to the user.
Mitigations.
- For accidental closure: restore from a checkpoint within the SAME session (no branch) — that re-attaches the same refs. See Checkpoints + workspaces.
- For branch-fresh-workspace surprise: use `Snapshotter.snapshot()` + `restore()` to seed the branch from the source. See Pitfall 7 in the upgrading guide.
- For data lost to provider lifecycle: this is expected behavior. Consider switching to a more durable provider (filestore for CF; local-bash for POSIX dev) if persistence matters.
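Seeding a branch from its source can be sketched with an in-memory stand-in for the workspace file tree. The snapshot/restore flow mirrors the `Snapshotter` usage described above; everything else here (the `FileTree` type, `seedBranch`) is illustrative, not the framework's API:

```ts
// Illustrative only: a Map stands in for a workspace's file tree.
type FileTree = Map<string, string>;

function snapshot(source: FileTree): FileTree {
  // A snapshot is a point-in-time copy, detached from the source.
  return new Map(source);
}

function restore(target: FileTree, snap: FileTree): void {
  target.clear();
  for (const [path, contents] of snap) target.set(path, contents);
}

// Branch flow: branched sessions start FRESH (post round-4 A8), so the
// operator explicitly seeds the branch workspace from a source snapshot.
function seedBranch(source: FileTree): FileTree {
  const branch: FileTree = new Map(); // fresh workspace
  restore(branch, snapshot(source));
  return branch;
}
```

The key property is detachment: later writes to the source are not visible in the branch, which is exactly what distinguishes this from the pre-fix shared-ref corruption.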
References. Lifecycle, Workspace refs are scoped to the source session, per-provider durability sections.
3. DO is OOMing
Symptoms. Cloudflare DO crashes with out-of-memory; logs show large memory consumption growing during agent execution.
What to check.
- `writeFile` size guard. Clusters D round-2, A round-3, and A round-4 added size guards to the filestore and sandbox `writeFile` to bound the worst-case allocation. Ensure your `capabilities.fs.maxFileSizeMb` is reasonable for your workload — the default is ~10 MB; agents writing 100 MB blobs need either a higher cap (and operator awareness) or a chunked write strategy.
- Large `grep` operations. `grep` reads the file into memory before scanning. The `maxGrepFileSizeMb` knob (default 10 MB) skips files larger than the limit; `skippedPaths` is returned so the LLM knows. If the LLM keeps raising the cap to scan 500 MB log files, push back at the agent design layer.
- Eviction storms. A flood of `WorkspaceEvictedError` events causes `withEvictionRetry` to repeatedly re-resolve workspaces, churning open state. Check the `workspace tool: eviction retry exhausted` log line for repeat occurrences.
- `maxConcurrentOpens`. If an agent declares many workspaces and runs `workspaceOpenStrategy: 'eager'`, the registry's `openAll()` fires N concurrent opens simultaneously. Each open allocates state; N too high can exhaust the heap before any open completes.
Mitigations.
- Lower `capabilities.fs.maxFileSizeMb` until the LLM can no longer write the offending blob.
- Cap `maxGrepFileSizeMb` at the capability layer if the LLM is grepping huge files.
- Set `workspaceMaxConcurrentOpens` on the executor to bound concurrent opens (also matches the Sandbox DO `max_instances`).
- If eviction storms are the cause, treat as incident #7 below.
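The size-guard behavior can be sketched as a pre-write check — a minimal illustration of the `maxFileSizeMb` cap, assuming the guard rejects before any provider-side allocation happens; the `guardWriteSize` and `FileTooLargeError` names are hypothetical:

```ts
// Sketch only: reject oversized writes up front so a 100 MB blob never
// lands in the DO heap in the first place.
class FileTooLargeError extends Error {}

function guardWriteSize(content: Uint8Array, maxFileSizeMb = 10): void {
  const limitBytes = maxFileSizeMb * 1024 * 1024;
  if (content.byteLength > limitBytes) {
    throw new FileTooLargeError(
      `write of ${content.byteLength} bytes exceeds maxFileSizeMb=${maxFileSizeMb}`,
    );
  }
}
```

The guard bounds the worst case at the capability layer; the error message tells the LLM which knob it hit, so a chunked-write strategy is the natural fallback.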
References. Tunable knobs, FileSystem module, grep skipped-paths envelope.
4. Sandbox container won't terminate
Symptoms. `destroyOnClose: true` is set; sessions complete; sandbox containers persist (visible in the Cloudflare dashboard or via DO RPC) and accrue cost.
What to check.
- `closeAll` timeout. Registry `close()` enforces `closeTimeoutMs` (default 30000 ms). If the sandbox's `destroy()` exceeds this, the close is logged as `timeout` and the framework moves on — but the underlying container may continue.
- `sleepAfter`. When `destroyOnClose: false` (the default), the container suspends after `sleepAfter` — by default ~10 minutes. Containers don't terminate; they hibernate. Cost on hibernated containers is much lower than on running ones; check whether you actually need destroy semantics.
- Sandbox SDK behavior. `@cloudflare/sandbox` is pinned at an exact version because its API has been moving. Check the SDK's release notes for known `destroy()` issues.
Mitigations.
- Set `destroyOnClose: true` for one-shot agent runs.
- Lower `sleepAfter` to reduce idle cost (trade-off: higher cold-start latency on resume).
- Tighten `closeTimeoutMs` if the framework's wait is masking a real problem (better: investigate why `destroy()` exceeds 30s).
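The bounded-close behavior can be sketched as a race against the deadline — an illustration of the `closeTimeoutMs` semantics described above ("log and move on"), not the framework's actual close path; `closeWithTimeout` is a hypothetical name:

```ts
// Sketch only: if destroy() exceeds the deadline, log and move on — the
// underlying container may keep running, which is exactly incident #4.
async function closeWithTimeout(
  destroy: () => Promise<void>,
  closeTimeoutMs = 30_000,
  log: (line: string) => void = console.warn,
): Promise<'closed' | 'timeout'> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<'timeout'>((resolve) => {
    timer = setTimeout(() => resolve('timeout'), closeTimeoutMs);
  });
  try {
    const outcome = await Promise.race([destroy().then(() => 'closed' as const), deadline]);
    if (outcome === 'timeout') log(`close exceeded ${closeTimeoutMs}ms; moving on`);
    return outcome;
  } finally {
    clearTimeout(timer);
  }
}
```

This is why tightening `closeTimeoutMs` alone doesn't fix the incident: the race resolves, but the losing `destroy()` promise (and its container) is simply abandoned.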
References. Cloudflare Sandbox — Cost notes, Lifecycle.
5. Tmpdirs are filling /tmp on a host
Symptoms. Local-bash deployment shows `/tmp/helix-ws-*` directories accumulating; disk-full alerts fire on the host.
What to check.
- Close failures. Round-3 cluster C added tmpdir-cause logging on local-bash close failures. Check error logs for `tmpdir close failed` — the common cause is a long-running subprocess holding a file open in the tmpdir, blocking `rm -rf`.
- `closeTimeoutMs`. If close exceeds the deadline, the framework moves on but the tmpdir survives. Default 30000 ms; tighter is OK on local-bash.
- Crash-leaked tmpdirs. Process crash leaves tmpdirs orphaned — they live until the OS clears tmpfs (boot, manual clean).
Mitigations.
- Add a periodic cron / systemd-timer job: `find /tmp/helix-ws-* -maxdepth 0 -mtime +1 -exec rm -rf {} +` (adjust `-mtime` per your session lifetime).
- Investigate the root cause of close failures via the tmpdir-cause logs and fix the subprocess hang.
- Consider pointing `tmpdirRoot` at a tmpfs that auto-clears on reboot if your sessions never need to outlive a reboot.
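If you prefer a Node-side sweeper over cron, the same cleanup can be sketched in TypeScript — an illustration only, assuming tmpdirs named with the `helix-ws-` prefix shown above; `sweepStaleTmpdirs` is a hypothetical helper, not a framework export:

```ts
// Sketch of a stale-tmpdir sweeper, equivalent to the `find ... -mtime +1`
// one-liner above: remove prefix-matched dirs older than maxAgeMs.
import { readdir, stat, rm } from 'node:fs/promises';
import { join } from 'node:path';

async function sweepStaleTmpdirs(
  root: string,
  prefix = 'helix-ws-',
  maxAgeMs = 24 * 60 * 60 * 1000,
  now = Date.now(),
): Promise<string[]> {
  const removed: string[] = [];
  for (const name of await readdir(root)) {
    if (!name.startsWith(prefix)) continue;
    const dir = join(root, name);
    const { mtimeMs } = await stat(dir);
    if (now - mtimeMs > maxAgeMs) {
      await rm(dir, { recursive: true, force: true });
      removed.push(dir);
    }
  }
  return removed;
}
```

Run it from a timer inside a long-lived supervisor process; like the cron job, it races against `close()` only on mtime, so keep `maxAgeMs` comfortably above your longest session lifetime.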
References. Local Bash — Lifecycle.
6. Workspace refs in state store don't match container state
Symptoms. `provider.resolve(ref)` throws `WorkspaceFailedError` with a `schemaVersion` message, OR resolves to a sandbox/namespace that's empty.
What to check.
- Schema-version mismatch. Round-4 cluster D introduced explicit `schemaVersion` on every ref. Providers support N±1 — beyond that range, resolution throws with a clear message. See Pitfall 8 in the upgrading guide.
- Rollback hazard. If you rolled back across multiple schema versions, persisted refs may carry a version the OLD code doesn't know.
- Provider-side state divergence. Did the underlying container/namespace get manually deleted? `registry.describe()` shows `'failed'`; `lastError` may have provider-specific detail.
Mitigations.
- Roll forward to a version that understands the persisted ref `schemaVersion`.
- For sessions stuck in `'failed'`: identify them via `registry.describe()` filtered to `state: 'failed'` + `lastError` matching `schemaVersion`, then either re-create or roll forward.
- For container-state divergence (the underlying sandbox or namespace was deleted out-of-band): no automatic recovery. The session is unrecoverable; surface to the user and start a fresh session.
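The N±1 gate can be sketched as a small assertion — illustrative names only (`assertResolvableSchemaVersion`, `SchemaVersionError`), not the provider's actual code:

```ts
// Sketch of the N±1 schema-version gate on ref resolution (round-4
// cluster D as described above): resolution only proceeds when the
// persisted ref's version is within one of the running provider's.
class SchemaVersionError extends Error {}

function assertResolvableSchemaVersion(refVersion: number, providerVersion: number): void {
  if (Math.abs(refVersion - providerVersion) > 1) {
    throw new SchemaVersionError(
      `ref schemaVersion=${refVersion} is outside the supported range ` +
        `[${providerVersion - 1}, ${providerVersion + 1}]; roll forward before resolving`,
    );
  }
}
```

The N±1 window is what makes single-step rollbacks safe and multi-step rollbacks the hazard called out above: a ref written at version N+2 is unreadable by version N code.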
References. Pitfall 8 — Schema version drift, Rollback procedure.
7. Eviction storms
Symptoms. Logs are full of `workspace tool: eviction retry exhausted` at error level; agents make slow forward progress; metrics show a high `incEviction` rate without recovery via `incEvictionRetry`.
What to check.
- Provider stability. Repeated eviction-retry-exhausted errors are a strong signal that the underlying provider isn't recoverable in this moment. Common causes:
  - Cloudflare Sandbox: container quota exhaustion on the binding (`max_instances` reached, retry burst on top).
  - Cloudflare Filestore: R2 bucket reachability (if you use R2 spill).
  - Local Bash: tmpfs full, host OOM, parent process churning.
- Transient vs permanent classification. Round-4 cluster C added the explicit `transient: true` opt-in on `WorkspaceFailedError`. If the provider is misclassifying a permanent error as transient, the registry retries with backoff (default 3 retries, ~10s total) before surfacing — wasted cycles. Check provider code.
- Restart-storm dynamics. During a CF deployment rollout, every DO's first agent operation triggers `provider.resolve()` for every workspace ref it had. Without `maxConcurrentOpens`, this is N parallel `getSandbox` RPCs per DO, multiplied by the DOs being recycled. See Restart behavior.
Mitigations.
- Set `workspaceMaxConcurrentOpens` to bound concurrent opens per executor (matches the Sandbox DO `max_instances`). This is per-session — for tenant-wide bounds across all sessions sharing a provider, ALSO set `maxGlobalConcurrentOpens` on the provider options (round-5 B2). The two are layered: registry-level for fairness across workspaces in one session, provider-level for back-pressure to the upstream binding.
- Lower `transientRetryAttempts` (default 3) for paths that should fail fast.
- Set `resetAfterMs` on the registry to auto-recover from `'failed'` state once the cooldown elapses (round-5 B4). Recommended: `5 * 60 * 1000` (5 min) so a transient outage doesn't permanently brick sessions until an operator manually resets them.
- Wire `WorkspaceMetrics` to surface eviction rates to your monitoring system; alert on `incEviction` > threshold so you catch storms early.
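The effect of bounding concurrent opens can be sketched with a small semaphore — an illustration of what `workspaceMaxConcurrentOpens` buys you during a restart storm, not the registry's implementation:

```ts
// Sketch only: a permit-handoff semaphore bounding in-flight opens.
class Semaphore {
  private queue: Array<() => void> = [];
  constructor(private permits: number) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.permits === 0) await new Promise<void>((r) => this.queue.push(r));
    else this.permits--;
    try {
      return await task();
    } finally {
      const next = this.queue.shift();
      if (next) next(); // hand the permit straight to the next waiter
      else this.permits++;
    }
  }
}

// openAll with a bound: at most `limit` provider opens in flight at once,
// instead of N parallel resolve() calls hammering the binding.
async function openAllBounded<T>(opens: Array<() => Promise<T>>, limit: number): Promise<T[]> {
  const sem = new Semaphore(limit);
  return Promise.all(opens.map((open) => sem.run(open)));
}
```

Layering this per-session bound with a provider-level `maxGlobalConcurrentOpens` is what turns a rollout's N×M resolve burst into a steady, quota-respecting trickle.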
References. Eviction recovery semantics, Transient vs permanent errors, Restart behavior — Sandbox.
8. Cross-tenant data leak suspected
Symptoms. Tenant A reports seeing Tenant B's data in a workspace. Audit logs show writes from one session appearing in reads from another.
What to check.
- Sandbox `id` sharing. Round-4 cluster A6 added the explicit-opt-in pattern: setting `config.id` requires `shareAcrossSessions: true`. Pre-fix, this was silent — multiple sessions opening the same workspace name with the same `id` were attached to the SAME container. Post-fix, the misconfig throws at `open()`.
  - If you're on pre-fix code, this is the most likely cause. Upgrade and respond to the assertion.
  - If you're on post-fix code with `shareAcrossSessions: true` set intentionally, the sharing is by design — surface to the user and re-evaluate the threat model.
- Filestore namespace sharing. The same pattern applies to `CloudflareFileStoreWorkspaceConfig.namespace`. An explicit shared namespace shares data; a default namespace (derived from `sessionId`) does not.
- Sub-agent inheritance. Round-4 cluster B reviewed sub-agent workspace inheritance: by DEFAULT, sub-agents are workspace-isolated. If `inheritWorkspaces: true` is set, the child sees the parent's registry — but this is opt-in and visible at the call site.
  - Check whether `createSubAgentTool(..., { inheritWorkspaces: true })` was set unintentionally.
- Workspace name collisions across sub-agents. When a child declares a workspace name that exists on the parent and inheritance is NOT opted in, the framework emits a `logger.warn` audit log (`sub-agent declares workspace name that exists on parent`). Check for this log; the child got an isolated workspace, not the parent's.
Mitigations.
- Audit ALL workspace configs for non-default `id` (sandbox) and `namespace` (filestore). Either remove them or pair with the explicit-opt-in flag.
- Audit `inheritWorkspaces: true` usage; restrict to scoped admin tools.
- For confirmed leaks: surface to compliance, identify the affected window, follow your incident-response plan.
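The explicit-opt-in assertion can be sketched as a config check — the `SandboxWorkspaceConfig` shape here only mirrors the two fields this runbook mentions (`id`, `shareAcrossSessions`); the function name is hypothetical:

```ts
// Sketch of the round-4 A6 assertion as described above: a pinned
// cross-session container id must be paired with an explicit opt-in.
interface SandboxWorkspaceConfig {
  id?: string;
  shareAcrossSessions?: boolean;
}

function assertSharingOptIn(config: SandboxWorkspaceConfig): void {
  if (config.id !== undefined && config.shareAcrossSessions !== true) {
    // Pre-fix this was silent cross-session sharing; post-fix, open() throws.
    throw new Error(
      `config.id='${config.id}' pins a cross-session container; ` +
        `set shareAcrossSessions: true to confirm this is intentional`,
    );
  }
}
```

An audit over your workspace configs is then a loop over this check: anything that throws is either a bug to remove or a sharing decision to make explicit.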
References. Cross-session sharing — Sandbox, Workspaces in sub-agents, Prompt-injection threat surface.
9. Audit log saturation
Symptoms. Log sink ingestion volume spikes; operator alerting fires on log-line rate or byte budget; per-session log volume reaches MB/sec sustained. Closer inspection shows thousands of identical `workspace tool: shell metacharacter rejected` (or sibling `glob/brace expansion rejected`, `command not in allowlist`) lines from a single session.
What to check.
- Tight LLM-loop misuse. A confused or adversarial LLM can iterate the same rejected command (`workspace__ws__run({ command: 'ls; ' + i })`) thousands of times per second. Each rejection used to fire one structured warn — at default verbosity that's ~1 MB/sec per session, multiplied by N sessions.
- Round-5 B5 mitigation. As of round-5 B5, security warns in `tool-injection.ts`, `CloudflareSandboxShell`, and `SubprocessShell` are wrapped with `RateLimitedLogger.warnRateLimited`. The first occurrence of each distinct rejection ALWAYS emits at full fidelity; subsequent identical events within a 60s window are deduped; a rollup line (with `suppressedCount`) emits every 5s so volume stays visible. It NEVER suppresses entirely.
- Distinct-event keying. The dedup key incorporates workspace name + first token (+ matched metachar where applicable), so different commands rejected from different workspaces emit independently. If you observe rate-limited rollup lines with identical content, the LLM is genuinely repeating the same exact rejection — the alerting amplitude is now O(rollups/sec) rather than O(rejections/sec).
Mitigations.
- Already mitigated by default in round-5+ — verify your version. Earlier versions emitted one warn per rejection.
- For higher dedup density, tighten `securityWindowMs` on the registry's `RateLimitedLogger` (constructor option) — but check that you still see novel attack signals first.
- Investigate the upstream LLM behavior: tight-loop rejections often indicate prompt mis-engineering or adversarial inputs. The audit log shows the first rejection plus rollup count; that's enough to identify the offending session.
- Filter your operator alerting on `[rate-limited]` in the message text to surface ROLLUP volume rather than individual events.
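The dedup-with-rollup behavior can be sketched as follows — an illustration of the semantics described above (first occurrence always emits, repeats within the window collapse into periodic rollups carrying `suppressedCount`), with an injected clock for testing; the class internals here are illustrative, not the real `RateLimitedLogger`:

```ts
// Sketch only: per-key dedup window with periodic rollups. Never fully
// suppresses — volume stays visible via the rollup lines.
class RateLimitedWarn {
  private seen = new Map<string, { windowStart: number; suppressed: number; lastRollup: number }>();

  constructor(
    private emit: (line: string) => void,
    private windowMs = 60_000,
    private rollupEveryMs = 5_000,
    private now: () => number = Date.now,
  ) {}

  warn(key: string, line: string): void {
    const t = this.now();
    const entry = this.seen.get(key);
    if (!entry || t - entry.windowStart >= this.windowMs) {
      // First occurrence in the window: full fidelity, always.
      this.seen.set(key, { windowStart: t, suppressed: 0, lastRollup: t });
      this.emit(line);
      return;
    }
    entry.suppressed++;
    if (t - entry.lastRollup >= this.rollupEveryMs) {
      this.emit(`[rate-limited] ${line} (suppressedCount=${entry.suppressed})`);
      entry.lastRollup = t;
      entry.suppressed = 0;
    }
  }
}
```

The key insight for alerting: per-key output is bounded at one line per rollup interval regardless of rejection rate, which is why filtering on `[rate-limited]` measures rollup volume rather than raw event volume.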
References. RateLimitedLogger in the workspace utils export.
Building a /healthz endpoint (round-5 D6)
`registry.describe()` returns a frozen point-in-time snapshot of every workspace entry — the cheapest path to surfacing workspace state to your monitoring system. Wiring a `/healthz` route looks different per runtime:
JS runtime
`JSAgentExecutor.getWorkspaceRegistry(sessionId)` (round-5 D6) returns the live registry for an active session. Wire it into your HTTP handler:
```ts
// healthz-handler.ts (Express, Fastify, Hono — same shape)
app.get('/healthz/:sessionId', async (req, res) => {
  const registry = executor.getWorkspaceRegistry(req.params.sessionId);
  if (!registry) {
    return res.status(404).json({ error: 'no active session' });
  }
  return res.json({ workspaces: registry.describe() });
});
```

The accessor returns `undefined` when:
- No runLoop is currently active for that `sessionId` (the registry has been closed).
- The agent has no `workspaces` declared.
- The session is paused/interrupted (the runLoop has exited; the registry was closed in its `finally`).
For paused/interrupted sessions, the operator path is "read the persisted `workspaceRefs` from the state store" — the registry has been closed and `describe()` would only show stale snapshots anyway.
Cloudflare DO runtime
The DO base class wraps `JSAgentExecutor` per Durable Object instance. Inside your DO subclass, call `getWorkspaceRegistry(sessionId)` on the inner executor and route a `/healthz` fetch path through it:
```ts
// my-agent-server.ts (inside your createAgentServer subclass)
async fetch(req: Request): Promise<Response> {
  const url = new URL(req.url);
  if (url.pathname === '/healthz') {
    const sessionId = req.headers.get('x-partykit-room') ?? '';
    const registry = this.executor.getWorkspaceRegistry(sessionId);
    if (!registry) {
      return new Response(JSON.stringify({ error: 'no active session' }), {
        status: 404,
        headers: { 'Content-Type': 'application/json' },
      });
    }
    return Response.json({ workspaces: registry.describe() });
  }
  // ... rest of your fetch routing ...
  return super.fetch(req);
}
```

The shape of `describe()` is documented on the Workspaces overview — Health endpoint.
Temporal
Workspaces are not supported on the Temporal runtime — there is no registry to introspect. Health for Temporal-hosted agents lives in the Temporal cluster's UI and metrics.
What to alert on
- `state === 'failed'` for >N minutes — provider permanently broken; `lastError` carries the cause. Pair with `resetAfterMs` for auto-recovery (round-5 B4).
- `lastAttemptAt > lastSuccessAt + threshold` — workspace is being hammered AND failing (round-5 B6 — the inverted-alerting case the legacy `lastOpAt` couldn't surface).
- `state === 'evicted'` for sustained periods — provider unstable; check the eviction-storm runbook entry.
- High `lastError` cardinality across sessions — coordinated provider issue (R2 outage, Sandbox quota).
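The alert conditions above can be sketched as a function over a `describe()` snapshot — the `WorkspaceEntry` shape here is an assumption based on the fields this runbook references (`state`, `lastError`, `lastAttemptAt`, `lastSuccessAt`), not the documented return type:

```ts
// Sketch only: evaluate the runbook's alert conditions against a snapshot.
interface WorkspaceEntry {
  name: string;
  state: 'ready' | 'failed' | 'evicted';
  lastError?: string;
  lastAttemptAt?: number;
  lastSuccessAt?: number;
}

function workspaceAlerts(
  entries: WorkspaceEntry[],
  now = Date.now(),
  failingGapMs = 5 * 60 * 1000,
): string[] {
  const alerts: string[] = [];
  for (const e of entries) {
    if (e.state === 'failed') alerts.push(`${e.name}: failed (${e.lastError ?? 'no lastError'})`);
    if (e.state === 'evicted') alerts.push(`${e.name}: evicted`);
    // Hammered-and-failing: attempts keep landing but the last success is stale.
    if (
      e.lastAttemptAt !== undefined &&
      e.lastSuccessAt !== undefined &&
      e.lastAttemptAt - e.lastSuccessAt > failingGapMs
    ) {
      alerts.push(`${e.name}: attempts outpacing successes`);
    }
  }
  return alerts;
}
```

Duration-based conditions ("failed for >N minutes") still need state in your monitoring layer; this per-snapshot check is the inner evaluation your poller or `/healthz` scraper would run on each tick.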
See also
- Workspaces overview — Operations for metrics, hooks, healthz, and operator-driven recovery surfaces.
- Upgrading & Migration for per-version compat and rollback.
- Building a Provider for the provider-side error model.