Autopsy of an Agent Cascade: When One Missing Mount Breaks Everything
Some days you ship features. Today I dissected failures. I spent the afternoon pulling session transcripts from all eight agents, tracing a single misconfiguration through four agents and five beads, and filing the issues to make sure it doesn't happen again.
Agent Forensics
It started with a simple question: why aren't the agents getting things done? I pulled the full session logs for all eight agents — main, atlas, forge, helm, priya, rio, indago, and glue — and discovered that only three of them had actually hit blockers. The rest were idle, waiting for work.
The real story was in the cascade. Indago tried to commit a research report but couldn't write to /workspace/repos/research — the directory doesn't exist in its container. So main delegated the commit to forge. Forge cloned the repo, committed locally, then hit a 403 on git push — the GITHUB_TOKEN only covers the tabula repo, not ludus. Main escalated to helm. Helm investigated but couldn't diagnose from inside its sandbox (no access to host config). Helm created GitHub issue #56 and reported it was assigned to durandom — but it wasn't actually assigned. Main then reported helm's stale status without re-checking. I had to correct the system twice.
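Both root causes are cheap to detect before any work starts. A minimal shell sketch of the pre-flight checks an agent could run, under assumptions: the mount path comes from this post, and the b4arena owner for ludus is my guess, not confirmed config.

```shell
check_mount() {
  # Fail fast if the expected checkout is not mounted in this container
  [ -d "$1" ] || { echo "missing mount: $1"; return 1; }
}

check_push_access() {
  # A 403 on push usually means the token cannot see the repo at all;
  # `gh api` surfaces that before any clone/commit work is wasted.
  gh api "repos/$1" --silent || { echo "token cannot access $1"; return 1; }
}

# Usage (inside the agent container):
#   check_mount /workspace/repos/research   # indago's failure mode
#   check_push_access b4arena/ludus         # forge's failure mode (owner assumed)
```

Had indago run the first check, the commit would never have been delegated; had forge run the second, the push would never have been attempted.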
One missing volume mount → four agents blocked → a false status report → operator trust erosion. That's the cascade I documented in an 8-problem dependency graph.
From Diagnosis to Issues
The forensics produced three concrete GitHub issues:
- #59 — Forge's GITHUB_TOKEN scope is too narrow (5-minute PAT fix, unblocks all forge work)
- #60 — Per-agent extra bind mounts in sandbox configuration (architectural fix for indago's missing research repo)
- #56 — Enriched with detailed root cause analysis from both indago and helm session logs, including verbatim error messages and recommended fixes
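For #60, the fix is configuration rather than code. A hypothetical shape for per-agent extra bind mounts; the key names and host path are illustrative, not the actual ludus schema:

```yaml
# Illustrative only: ludus's real sandbox config schema may differ.
agents:
  indago:
    extra_mounts:
      - source: /srv/repos/research        # path on the host (assumed)
        target: /workspace/repos/research  # the path indago expected
        readonly: false                    # indago needs to commit here
```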
Each issue has the full session transcript attached as a public Gist. No more "the agent said it worked" — now there's a paper trail.
Operational Hygiene
Beyond the forensics, I rolled out an agent-pickup label across all 17 b4arena repositories. It's a blue label that signals "an agent should pick this issue up" — a small step toward letting agents self-select work from the GitHub issue backlog instead of waiting for beads. Marcel also landed per-agent home directory isolation in ludus (#58), which gives each agent its own $HOME inside the container — no more shared state leaking between agents.
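Rolling a label out across 17 repositories is scriptable with the gh CLI. A sketch under assumptions: the hex color and description text are mine (the post only says "blue"), and I list the org's repos rather than hardcoding 17 names:

```shell
rollout_label() {
  # Create (or update) the agent-pickup label in every repo under an org
  local org="$1"
  gh repo list "$org" --limit 100 --json name --jq '.[].name' |
  while read -r repo; do
    gh label create agent-pickup \
      --repo "$org/$repo" \
      --color 1d76db \
      --description "An agent should pick this issue up" \
      --force    # idempotent: updates the label if it already exists
  done
}

# rollout_label b4arena
```

The `--force` flag makes the rollout safe to re-run when new repos are added.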
What I Learned
The meta-problem isn't the missing mount or the narrow token — those are trivial fixes. The real gap is post-action verification. Helm said it assigned the issue. Main said helm was still working on it. Neither checked. If agents verified their own actions (gh issue view after gh issue create), the cascade would have stopped at helm. That's the SOUL.md rule I'll be adding next: always verify, never assume.
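The verify-after-act rule is two gh commands instead of one. A sketch, assuming the assignee login from this post; the function name is mine:

```shell
verify_assignee() {
  # $1 = issue number or URL, $2 = expected assignee login.
  # Read the issue back instead of trusting that the write succeeded.
  local got
  got=$(gh issue view "$1" --json assignees --jq '.assignees[0].login')
  if [ "$got" != "$2" ]; then
    echo "verification failed: expected assignee '$2', got '${got:-none}'"
    return 1
  fi
}

# gh issue create prints the new issue's URL, so create/verify is a pair:
#   url=$(gh issue create --title "Sandbox missing mount" \
#         --assignee durandom --body "details")
#   verify_assignee "$url" durandom
```

Had helm run this after creating #56, the failed assignment would have surfaced immediately instead of two corrections later.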
By the Numbers
| Metric | Value |
|---|---|
| Commits | 3 |
| Active repos | 2 (ludus, tabula) |
| GitHub issues created/enriched | 3 (#56, #59, #60) |
| Agent sessions analyzed | 8 |
| Problems identified | 8 |
| Claude Code spend | $5.87 |
| Period | 2026-03-16 |
Written with help from Dispatch.
