
Workspace Runbook

Operator-facing runbook for the nine most likely production incidents on the workspace stack. Each section: symptoms, what to check, mitigations, references.

Looking for upgrade procedures? See Upgrading & Migration. Looking for the integration-side error model? See Workspaces overview — Errors integrators should know about.

1. Workspace tool keeps failing with WorkspaceFailedError

Symptoms. The agent's stream surfaces a WorkspaceFailedError on every workspace tool call. The session never makes forward progress on workspace ops.

What to check.

  • registry.describe() — does the entry's state show 'failed'? If so, lastError carries the reason.
  • The capability-invariant assertion (round-2 cluster C). Common causes:
    • capabilities.snapshot: true declared on a sandbox without backupR2Binding configured.
    • A custom provider whose open() returns a workspace missing one of the declared modules.
  • Provider configuration:
    • Cloudflare Sandbox: is the SANDBOX DO binding correctly wired? Is max_instances reached?
    • Cloudflare Filestore: is ctx.storage.sql available? Did the migration to new_sqlite_classes run?
    • Local Bash: is the host POSIX? Was tmpdirRoot set to a path the process can write?

Mitigations.

  • Fix the configuration mismatch and redeploy.
  • If transient (provider outage, network blip), set transientRetryAttempts higher on the registry deps (default 3, ~10s total backoff) and ensure the provider is throwing with transient: true for the root cause.
  • For permanent provider failures, call registry.reset(name) from operator code (NOT exposed to the LLM) once the underlying cause is fixed; a sketch follows this list.
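
A minimal triage sketch for the operator path, assuming a describe() entry shape of name, state, and lastError (the shape is an assumption; verify field names against your version's types):

```typescript
// operator-triage.ts: hedged sketch; the entry shape is an assumption.
// describe() and reset() are the operator-facing registry APIs named above.
interface WorkspaceEntry {
  name: string;
  state: string; // 'ready' | 'failed' | 'evicted' | ... (assumed union)
  lastError?: { message: string; transient?: boolean };
}

async function resetFailedWorkspaces(registry: {
  describe(): WorkspaceEntry[];
  reset(name: string): Promise<void>;
}): Promise<void> {
  for (const entry of registry.describe()) {
    if (entry.state !== 'failed') continue;
    console.warn(`workspace ${entry.name} failed: ${entry.lastError?.message}`);
    // Reset only after the root cause is fixed; keep reset() on the
    // operator path. It must never be exposed to the LLM.
    await registry.reset(entry.name);
  }
}
```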

References. Errors integrators should know about, Transient vs permanent errors, Operator-driven recovery.

2. User reports their files vanished

Symptoms. A previously-saved file is no longer reachable; read_file throws or ls returns an empty listing where files used to live.

What to check.

  • Lifecycle. Did the workspace close() between sessions? Most providers tear down state on close — destroyOnClose: true on the sandbox, tmpdir removal on local-bash. The in-memory provider drops everything on process restart.
  • Branch-from-checkpoint behavior. Round-4 cluster A8 fixed a silent corruption: branched sessions previously shared workspace refs with the source session. Post-fix, branched sessions start with FRESH workspaces — the source's files are not visible. This is the intended (and safer) behavior. See Workspace refs are scoped to the source session.
  • Provider durability per provider.
    • In-memory: state is lost on process restart.
    • Local-bash: tmpdirs survive process restart but a close() removes them; tmpfs clears on host reboot.
    • Cloudflare Filestore: data persists in DO SQLite for the DO's lifetime; SQLite is durable across hibernation but tied to the DO instance.
    • Cloudflare Sandbox: the container persists across hibernation via the Sandbox DO's storage; destroyOnClose: true permanently removes it.
  • Was the session branched? If so, the branch starts FRESH — surface that to the user.

Mitigations.

  • For accidental closure: restore from a checkpoint within the SAME session (no branch) — that re-attaches the same refs. See Checkpoints + workspaces.
  • For branch-fresh-workspace surprise: use Snapshotter.snapshot() + restore() to seed the branch from the source (sketched after this list). See Pitfall 7 in the upgrading guide.
  • For data lost to provider lifecycle: this is expected behavior. Consider switching to a more durable provider (filestore for CF; local-bash for POSIX dev) if persistence matters.
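
If the branch is supposed to see the source's files, the seeding flow is snapshot-then-restore. A hedged sketch, assuming Snapshotter exposes snapshot()/restore() roughly as named above; check the checkpoint docs for the exact signatures:

```typescript
// seed-branch.ts: hedged sketch; method signatures are assumptions.
// Requires capabilities.snapshot: true (and, on the sandbox provider,
// a configured backupR2Binding; see incident #1).
async function seedBranchFromSource(
  snapshotter: {
    snapshot(workspaceName: string): Promise<{ snapshotId: string }>;
    restore(workspaceName: string, snapshotId: string): Promise<void>;
  },
  workspaceName: string,
): Promise<void> {
  // 1. Snapshot the source session's workspace at (or before) the branch point.
  const { snapshotId } = await snapshotter.snapshot(workspaceName);
  // 2. After the branch opens its fresh workspace, restore the snapshot into it.
  await snapshotter.restore(workspaceName, snapshotId);
}
```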

References. Lifecycle, Workspace refs are scoped to the source session, per-provider durability sections.

3. DO is OOMing

Symptoms. The Cloudflare DO crashes with out-of-memory errors; logs show memory consumption climbing during agent execution.

What to check.

  • writeFile size guard. Clusters D round-2, A round-3, and A round-4 added size guards to filestore and sandbox writeFile to bound the worst-case allocation. Ensure your capabilities.fs.maxFileSizeMb is reasonable for your workload — the default is ~10 MB; agents writing 100 MB blobs need either a higher cap (and operator awareness) or a chunked write strategy.
  • Large grep operations. grep reads the file into memory before scanning. The maxGrepFileSizeMb knob (default 10 MB) skips files larger than the limit; skippedPaths is returned so the LLM knows. If the LLM keeps raising the cap to scan 500 MB log files, push back at the agent design layer.
  • Eviction storms. A flood of WorkspaceEvictedError events causes withEvictionRetry to repeatedly re-resolve workspaces, churning open state. Check the workspace tool: eviction retry exhausted log line for repeat occurrences.
  • maxConcurrentOpens. If an agent declares many workspaces and runs workspaceOpenStrategy: 'eager', the registry's openAll() fires N concurrent opens. Each open allocates state; N too high can exhaust the heap before any open completes.

Mitigations.

  • Lower capabilities.fs.maxFileSizeMb until the LLM can no longer write the offending blob.
  • Cap maxGrepFileSizeMb at the capability layer if the LLM is grepping huge files.
  • Set workspaceMaxConcurrentOpens on the executor to bound concurrent opens (also matches the Sandbox DO max_instances).
  • If eviction storms are the cause, treat as incident #7 below.
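
The knobs above are plain configuration. A hedged sketch with illustrative values (the option names come from this entry; the exact nesting is an assumption and may differ by version):

```typescript
// oom-knobs.ts: illustrative values only; nesting is an assumption.
const capabilities = {
  fs: {
    maxFileSizeMb: 10,     // default ~10 MB; raise only with operator awareness
    maxGrepFileSizeMb: 10, // larger files are skipped and reported via skippedPaths
  },
};

const executorOptions = {
  // Bounds eager openAll() bursts; align with the Sandbox DO max_instances.
  workspaceMaxConcurrentOpens: 4,
};
```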

References. Tunable knobs, FileSystem module, grep skipped-paths envelope.

4. Sandbox container won't terminate

Symptoms. destroyOnClose: true is set; sessions complete; sandbox containers persist (visible in Cloudflare dashboard or via DO RPC) and accrue cost.

What to check.

  • closeAll timeout. Registry close() enforces closeTimeoutMs (default 30000 ms). If the sandbox's destroy() exceeds this, the close is logged as timeout and the framework moves on — but the underlying container may continue.
  • sleepAfter. When destroyOnClose: false (default), the container suspends after sleepAfter — by default ~10 minutes. Containers don't terminate; they hibernate. Cost on hibernated containers is much lower than running ones; check whether you actually need destroy semantics.
  • Sandbox SDK behavior. @cloudflare/sandbox is pinned at an exact version because its API has been moving. Check the SDK's release notes for known destroy() issues.

Mitigations.

  • Set destroyOnClose: true for one-shot agent runs.
  • Lower sleepAfter to reduce idle cost (trade-off: higher cold-start latency on resume).
  • Tighten closeTimeoutMs if the framework's wait is masking a real problem (better: investigate why destroy() exceeds 30s).
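
The lifecycle settings above, sketched with illustrative values (option names from this entry; the config shapes and the sleepAfter format are assumptions):

```typescript
// sandbox-lifecycle.ts: illustrative values, not defaults to copy blindly.
const sandboxConfig = {
  destroyOnClose: true, // one-shot runs: terminate instead of hibernating
  sleepAfter: '5m',     // shorter idle window; trades cold-start latency for cost
};

const registryDeps = {
  closeTimeoutMs: 30_000, // the default; tighten only after ruling out a slow destroy()
};
```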

References. Cloudflare Sandbox — Cost notes, Lifecycle.

5. Tmpdirs are filling /tmp on a host

Symptoms. Local-bash deployment shows /tmp/helix-ws-* directories accumulating; disk-full alerts fire on the host.

What to check.

  • Close failures. Round-3 cluster C added tmpdir-cause logging on local-bash close failures. Check error logs for tmpdir close failed — the common cause is a long-running subprocess holding a file open in the tmpdir, blocking rm -rf.
  • closeTimeoutMs. If close exceeds the deadline, the framework moves on but the tmpdir survives. Default 30000 ms; tighter is OK on local-bash.
  • Crash-leaked tmpdirs. Process crash leaves tmpdirs orphaned — they live until the OS clears tmpfs (boot, manual clean).

Mitigations.

  • Add a periodic cron / systemd-timer job: find /tmp/helix-ws-* -maxdepth 0 -mtime +1 -exec rm -rf {} + (adjust -mtime per your session lifetime).
  • Investigate the root cause of close failures via the tmpdir-cause logs and fix the subprocess hang.
  • Consider tmpdirRoot pointing at a tmpfs that auto-clears on reboot if your sessions never need to outlive a reboot.
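
For completeness, the local-bash options mentioned above (names from this entry; their placement is an assumption):

```typescript
// local-bash-options.ts: hedged sketch.
const localBashConfig = {
  // Point tmpdirRoot at a tmpfs mount if sessions never need to outlive a reboot.
  tmpdirRoot: '/tmp/helix-ws',
};

const registryDeps = {
  closeTimeoutMs: 10_000, // tighter than the 30s default is OK on local-bash
};
```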

References. Local Bash — Lifecycle.

6. Workspace refs in state store don't match container state

Symptoms. provider.resolve(ref) throws WorkspaceFailedError with a schemaVersion message OR resolves to a sandbox/namespace that's empty.

What to check.

  • Schema-version mismatch. Round-4 cluster D introduced explicit schemaVersion on every ref. Providers support N±1 — beyond that range, resolution throws with a clear message. See Pitfall 8 in the upgrading guide.
  • Rollback hazard. If you rolled back across multiple schema versions, persisted refs may carry a version the OLD code doesn't know.
  • Provider-side state divergence. Did the underlying container/namespace get manually deleted? registry.describe() shows 'failed'; lastError may have provider-specific detail.

Mitigations.

  • Roll forward to a version that understands the persisted ref schemaVersion.
  • For sessions stuck in 'failed': identify them via registry.describe() filtered to state: 'failed' + lastError matching schemaVersion (a filtering sketch follows this list), then either re-create or roll forward.
  • For container-state divergence (the underlying sandbox or namespace was deleted out-of-band): no automatic recovery. The session is unrecoverable; surface to the user and start a fresh session.
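
A hedged filtering sketch for finding the stuck sessions, reusing the describe() entry shape assumed in incident #1:

```typescript
// find-schema-stuck.ts: hedged sketch; the lastError match is string-based
// because the exact error shape is an assumption.
function findSchemaStuckWorkspaces(registry: {
  describe(): Array<{ name: string; state: string; lastError?: { message: string } }>;
}): string[] {
  return registry
    .describe()
    .filter((e) => e.state === 'failed' && /schemaVersion/i.test(e.lastError?.message ?? ''))
    .map((e) => e.name);
}
```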

References. Pitfall 8 — Schema version drift, Rollback procedure.

7. Eviction storms

Symptoms. Logs are full of workspace tool: eviction retry exhausted at error level; agents make slow forward progress; metrics show a high incEviction rate without recovery via incEvictionRetry.

What to check.

  • Provider stability. Repeat eviction-retry-exhausted is a strong signal that the underlying provider isn't recoverable in this moment. Common causes:
    • Cloudflare Sandbox: container quota exhaustion on the binding (max_instances reached, retry burst on top).
    • Cloudflare Filestore: R2 bucket reachability (if you use R2 spill).
    • Local Bash: tmpfs full, host OOM, parent process churning.
  • Transient vs permanent classification. Round-4 cluster C added explicit transient: true opt-in on WorkspaceFailedError. If the provider is misclassifying a permanent error as transient, the registry retries with backoff (default 3 retries, ~10s total) before surfacing — wasted cycles. Check provider code.
  • Restart-storm dynamics. During a CF deployment rollout, every DO's first agent operation triggers provider.resolve() for every workspace ref it had. Without maxConcurrentOpens, this is N parallel getSandbox RPCs per DO, multiplied by the DOs being recycled. See Restart behavior.

Mitigations.

  • Set workspaceMaxConcurrentOpens to bound concurrent opens per executor (matches the Sandbox DO max_instances). This is per-session — for tenant-wide bounds across all sessions sharing a provider, ALSO set maxGlobalConcurrentOpens on the provider options (round-5 B2). The two are layered: registry-level for fairness across workspaces in one session, provider-level for back-pressure to the upstream binding.
  • Lower transientRetryAttempts (default 3) for paths that should fail fast.
  • Set resetAfterMs on the registry to auto-recover from 'failed' state once the cooldown elapses (round-5 B4). Recommended: 5 * 60 * 1000 (5 min) so a transient outage doesn't permanently brick sessions until an operator manually resets them.
  • Wire WorkspaceMetrics to surface eviction rates to your monitoring system; alert on incEviction > threshold so you catch storms early.
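
Layering the bounds above, as a hedged configuration sketch (option names from this entry; the nesting is an assumption):

```typescript
// eviction-bounds.ts: illustrative layering of the bounds described above.
const executorOptions = {
  workspaceMaxConcurrentOpens: 4, // registry level: fairness within one session
};

const providerOptions = {
  maxGlobalConcurrentOpens: 16,   // provider level: back-pressure to the upstream binding
};

const registryDeps = {
  transientRetryAttempts: 1,      // fail fast where retries only waste cycles
  resetAfterMs: 5 * 60 * 1000,    // auto-recover 'failed' entries after a 5 min cooldown
};
```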

References. Eviction recovery semantics, Transient vs permanent errors, Restart behavior — Sandbox.

8. Cross-tenant data leak suspected

Symptoms. Tenant A reports seeing Tenant B's data in a workspace. Audit logs show writes from one session appearing in reads from another.

What to check.

  • Sandbox id sharing. Round-4 cluster A6 added the explicit-opt-in pattern: setting config.id requires shareAcrossSessions: true. Pre-fix, this was silent — multiple sessions opening the same workspace name with the same id attached to the SAME container. Post-fix, the misconfig throws at open().
    • If you're on pre-fix code, this is the most likely cause. Upgrade and respond to the assertion.
    • If you're on post-fix code with shareAcrossSessions: true set intentionally, the sharing is by design — surface to the user and re-evaluate the threat model.
  • Filestore namespace sharing. Same pattern applies to CloudflareFileStoreWorkspaceConfig.namespace. An explicit shared namespace shares data; a default namespace (derived from sessionId) does not.
  • Sub-agent inheritance. Round-4 cluster B reviewed sub-agent workspace inheritance: by DEFAULT, sub-agents are workspace-isolated. If inheritWorkspaces: true is set, the child sees the parent's registry — but this is opt-in and visible at the call site.
    • Check whether createSubAgentTool(..., { inheritWorkspaces: true }) was set unintentionally.
  • Workspace name collisions across sub-agents. When a child declares a workspace name that exists on the parent and inheritance is NOT opted in, the framework emits a logger.warn audit log (sub-agent declares workspace name that exists on parent). Check for this log; the child got an isolated workspace, not the parent's.
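
For reference, the sharing shapes described above, as a hedged sketch (property names from this entry); these are the patterns to look for when auditing configs:

```typescript
// sharing-audit.ts: hedged sketch; property names are from this entry.
// Intentional cross-session sharing must carry the explicit flag post-fix:
const sharedSandboxConfig = {
  id: 'tenant-a-shared',     // fixed id means every session attaches to the SAME container
  shareAcrossSessions: true, // required post round-4 A6; open() throws without it
};

// Default (safe) shape: no explicit id or namespace; isolation derives from sessionId.
const isolatedSandboxConfig = {};

// Sub-agent inheritance is opt-in and visible at the call site; audit every use:
// createSubAgentTool(..., { inheritWorkspaces: true })
```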

Mitigations.

  • Audit ALL workspace configs for non-default id (sandbox) and namespace (filestore). Either remove them or pair with the explicit-opt-in flag.
  • Audit inheritWorkspaces: true usage; restrict to scoped admin tools.
  • For confirmed leaks: surface to compliance, identify the affected window, follow your incident-response plan.

References. Cross-session sharing — Sandbox, Workspaces in sub-agents, Prompt-injection threat surface.

9. Audit log saturation

Symptoms. Log sink ingestion volume spikes; operator alerting fires on log-line rate or byte budget; per-session log volume sustains MB/sec rates. Closer inspection shows thousands of identical workspace tool: shell metacharacter rejected (or sibling glob/brace expansion rejected, command not in allowlist) lines from a single session.

What to check.

  • Tight LLM-loop misuse. A confused or adversarial LLM can iterate the same rejected command (workspace__ws__run({ command: 'ls; ' + i })) thousands of times per second. Each rejection used to fire one structured warn — at default verbosity that's ~1 MB/sec per session, multiplied by N sessions.
  • Round-5 B5 mitigation. Security warns in tool-injection.ts, CloudflareSandboxShell, and SubprocessShell are wrapped with RateLimitedLogger.warnRateLimited. The first occurrence of each distinct rejection ALWAYS emits at full fidelity; subsequent identical events within a 60s window are deduped; a rollup line (with suppressedCount) emits every 5s so volume stays visible. The limiter NEVER suppresses entirely.
  • Distinct-event keying. The dedup key incorporates workspace name + first-token (+ matched metachar where applicable) so different commands rejected from different workspaces emit independently. If you observe rate-limited rollup lines with identical content, the LLM is genuinely repeating the same exact rejection — the alerting amplitude is now O(rollups/sec) rather than O(rejections/sec).

Mitigations.

  • Already mitigated by default in round-5+ — verify your version. Earlier versions emitted one warn per rejection.
  • For higher dedup density, tighten securityWindowMs on the registry's RateLimitedLogger (constructor option; see the sketch after this list) — but check that you still see novel attack signals first.
  • Investigate the upstream LLM behavior: tight-loop rejections often indicate prompt mis-engineering or adversarial inputs. The audit log shows the first rejection plus rollup count; that's enough to identify the offending session.
  • Filter your operator alerting on [rate-limited] in the message text to surface ROLLUP volume rather than individual events.
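
If you do tighten the window, a hedged sketch around the constructor option named above (only securityWindowMs comes from this page; the import path, the constructor shape, and baseLogger are assumptions):

```typescript
// rate-limit-tuning.ts: hedged sketch; verify the import path against the
// workspace utils export for your version.
import { RateLimitedLogger } from '@your-org/workspaces/utils'; // path is an assumption

declare const baseLogger: Console; // stand-in for your structured logger

const limited = new RateLimitedLogger(baseLogger, {
  securityWindowMs: 120_000, // widen dedup beyond the 60s default; the first
                             // occurrence of each distinct rejection still emits
});
```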

References. RateLimitedLogger in the workspace utils export.

Building a /healthz endpoint (round-5 D6)

registry.describe() returns a frozen point-in-time snapshot of every workspace entry — the cheapest path to surfacing workspace state to your monitoring system. Wiring a /healthz route looks different per runtime:

JS runtime

JSAgentExecutor.getWorkspaceRegistry(sessionId) (round-5 D6) returns the live registry for an active session. Wire it into your HTTP handler:

```typescript
// healthz-handler.ts (Express-style handler; adapt for Fastify or Hono)
app.get('/healthz/:sessionId', async (req, res) => {
  const registry = executor.getWorkspaceRegistry(req.params.sessionId);
  if (!registry) {
    return res.status(404).json({ error: 'no active session' });
  }
  return res.json({ workspaces: registry.describe() });
});
```

The accessor returns undefined when:

  • No runLoop is currently active for that sessionId (the registry has been closed).
  • The agent has no workspaces declared.
  • The session is paused/interrupted (the runLoop has exited; the registry was closed in its finally).

For paused/interrupted sessions, the operator path is "read the persisted workspaceRefs from the state store" — the registry has been closed and describe() would only show stale snapshots anyway.

Cloudflare DO runtime

The DO base class wraps JSAgentExecutor per Durable Object instance. Inside your DO subclass, call getWorkspaceRegistry(sessionId) on the inner executor and route a /healthz fetch path through it:

```typescript
// my-agent-server.ts (inside your createAgentServer subclass)
async fetch(req: Request): Promise<Response> {
  const url = new URL(req.url);
  if (url.pathname === '/healthz') {
    const sessionId = req.headers.get('x-partykit-room') ?? '';
    const registry = this.executor.getWorkspaceRegistry(sessionId);
    if (!registry) {
      return new Response(JSON.stringify({ error: 'no active session' }), {
        status: 404,
        headers: { 'Content-Type': 'application/json' },
      });
    }
    return Response.json({ workspaces: registry.describe() });
  }
  // ... rest of your fetch routing ...
  return super.fetch(req);
}
```

The shape of describe() is documented on the Workspaces overview — Health endpoint.

Temporal

Workspaces are not supported on the Temporal runtime — there is no registry to introspect. Health for Temporal-hosted agents lives in the Temporal cluster's UI and metrics.

What to alert on

  • state === 'failed' for >N minutes — provider permanently broken; lastError carries the cause. Pair with resetAfterMs for auto-recovery (round-5 B4).
  • lastAttemptAt > lastSuccessAt + threshold — workspace is being hammered AND failing (round-5 B6 — the inverted-alerting case the legacy lastOpAt couldn't surface).
  • state === 'evicted' for sustained periods — provider unstable; check eviction-storm runbook entry.
  • High lastError cardinality across sessions — coordinated provider issue (R2 outage, Sandbox quota).
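
A hedged sketch turning a describe() snapshot into the first two alert conditions (the field names lastAttemptAt/lastSuccessAt come from round-5 B6; their types, and the failed-since bookkeeping, are assumptions):

```typescript
// workspace-alerts.ts: hedged sketch. Callers must track when each entry
// first entered 'failed' (failedSinceMs) because describe() is point-in-time.
interface HealthEntry {
  name: string;
  state: string;
  lastError?: { message: string };
  lastAttemptAt?: number; // epoch ms (assumed)
  lastSuccessAt?: number; // epoch ms (assumed)
}

const FAILED_FOR_MS = 5 * 60 * 1000; // "failed for >N minutes"
const HAMMER_GAP_MS = 60 * 1000;     // attempts outpacing successes

function alertsFor(entries: HealthEntry[], failedSinceMs: Map<string, number>): string[] {
  const now = Date.now();
  const alerts: string[] = [];
  for (const e of entries) {
    if (e.state === 'failed') {
      const since = failedSinceMs.get(e.name) ?? now;
      if (now - since > FAILED_FOR_MS) {
        alerts.push(`${e.name}: failed >5m (${e.lastError?.message ?? 'unknown'})`);
      }
    }
    if (
      e.lastAttemptAt !== undefined &&
      e.lastSuccessAt !== undefined &&
      e.lastAttemptAt - e.lastSuccessAt > HAMMER_GAP_MS
    ) {
      alerts.push(`${e.name}: being hammered while failing; check provider stability`);
    }
  }
  return alerts;
}
```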
