When the Agents Go Silent: Peeling Back a Three-Layer Failure on rpi5

Christoph Görn
hacker, #B4mad Industries

Some mornings you run a healthcheck and get a green board. This morning I ran a healthcheck and discovered that rpi5 — our Raspberry Pi 5 running the full agent fleet — had been silently broken for days. No agent had processed a bead. No cron job had talked to GitHub. The watcher was dead. And the root cause wasn't one thing — it was three things stacked on top of each other, each hiding the next.

Layer 1: The Phantom Import

The first sign was ludus info status crashing with a TimeoutExpired on SSH. Oddly, SSH worked fine moments later when I listed beads manually. The real problem showed up in the watcher log:

ModuleNotFoundError: No module named 'ludus_cli.agents'

Someone (probably a previous agent session) had left uncommitted experimental code on rpi5's working tree — 40 modified files and 30 untracked ones, including a commands/agents.py that imported a module that never existed. Since uv run ludus rebuilds from source, every CLI invocation picked up this broken import and crashed. The sync cron (every 5 minutes) crashed. The watcher crashed. Every agent cron that depended on ludus crashed. All silently, into log files nobody was reading.

Fix: just deploy-agents rpi5 — a clean rsync of the current main. One command, CLI boots again. → deploy fix
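A pre-flight check along these lines could surface a dirty working tree before a cron job runs from it. This is a minimal sketch and not part of ludus: the function names are hypothetical, and refusing to run on a dirty tree is an assumption about policy. It relies only on git status --porcelain, whose two-character status code is ?? for untracked files and !! for ignored ones.

```python
import subprocess


def summarize_porcelain(output: str) -> dict:
    """Count tracked changes and untracked files in `git status --porcelain` output."""
    counts = {"changed": 0, "untracked": 0}
    for line in output.splitlines():
        if not line.strip():
            continue
        status = line[:2]  # two-character XY status code
        if status == "??":
            counts["untracked"] += 1
        elif status == "!!":
            continue  # ignored files only appear with --ignored; skip them
        else:
            counts["changed"] += 1  # modified, added, deleted, renamed, ...
    return counts


def working_tree_is_clean(repo_dir: str) -> bool:
    """Hypothetical helper: True when the checkout has no local changes."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "status", "--porcelain"],
        capture_output=True,
        text=True,
        check=True,
    )
    counts = summarize_porcelain(result.stdout)
    return counts["changed"] == 0 and counts["untracked"] == 0
```

A cron wrapper could call working_tree_is_clean() first and raise a visible alert instead of running the CLI from a modified checkout.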

Layer 2: GitHub Can't Hear You

With the CLI alive again, the sync cron produced a new error:

/usr/local/sbin/gh: line 74: /usr/bin/gh: No such file or directory

Two problems stacked here. First, the gh CLI binary was never installed on rpi5: our package provisioning only included it for Red Hat hosts, not Debian. Second, even if gh were installed, rpi5 had gh_app_enabled: false, so there were no GitHub App credentials to authenticate with.

Rather than setting up a full GitHub App (overkill for one host), I went with a Personal Access Token for the brenner-axiom user. The beautiful thing: our architecture was already wired for this. The env template had a GH_TOKEN fallback path, and the gh-wrapper script skips its App token logic when GH_TOKEN is already set. I just had to teach github_credentials.py not to bail out when there are no App secrets — and to point the credential helper at /usr/bin/gh directly instead of the wrapper.
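The precedence described above, where a preset GH_TOKEN wins over the App token logic, can be sketched as a small decision function. This is an illustration, not the actual gh-wrapper or github_credentials.py code: the function name and the returned mode strings are hypothetical, and only GH_TOKEN and gh_app_enabled come from the setup described here.

```python
from typing import Mapping


def choose_auth_mode(env: Mapping[str, str], gh_app_enabled: bool) -> str:
    """Pick an authentication mode (hypothetical names: 'pat', 'app', 'none')."""
    # A non-empty GH_TOKEN takes priority, so the App token logic is skipped.
    if env.get("GH_TOKEN"):
        return "pat"
    # Without a token, fall back to GitHub App credentials if the host has them.
    if gh_app_enabled:
        return "app"
    # Neither is available: the caller should report this loudly, not bail silently.
    return "none"


print(choose_auth_mode({"GH_TOKEN": "example-token"}, gh_app_enabled=False))
```

Returning an explicit "none" rather than raising leaves it to the provisioning step to decide whether a host without credentials is an error or just a warning.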

Three changes: add gh to the Debian package list, make the provisioning step PAT-aware, add gh_token to rpi5's manifest. Deploy, verify:

$ ssh rpi5 'gh auth status'
✓ Logged in to github.com account brenner-axiom (GH_TOKEN)

Sync immediately started working — pulling issues, skipping bot PRs, routing beads. → PAT provisioning

Layer 3: The Invisible Agent Map

Seven of eight beads processed instantly once the pipeline was unblocked. But one bead — ic-5vm, a Hausmeister project analysis assigned to Indago — stayed stuck:

⚠ No agent mapping for label 'indago'

Indago was registered. Had a workspace. Was in agent-map.json. So why couldn't the watcher find it?

Turns out there were two copies of agent-map.json on rpi5. The workspace-level copy at ~/b4arena/agents/agent-map.json (17 agents) got synced correctly. But the CLI reads from ~/b4arena/ludus/agents/agent-map.json (only 6 agents) — and that file was excluded from rsync by --exclude=agents. The exclude was there to protect agent workspace data (SOUL.md, repos/, etc.) from being overwritten, but it also blocked the config file living in the same directory.

I filed this as #94 — the fix is an --include=agents/agent-map.json rule before the exclude. For now, a manual scp unblocked Indago.
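To see why rule order and the shape of the exclude both matter, here is a simplified model of rsync's filter behavior. It is a sketch under stated assumptions, not rsync's real matcher: patterns are compared against the full relative path with Python's fnmatch, and the example paths other than agents/agent-map.json are made up. It captures two documented behaviors: the first matching rule wins, and rsync does not descend into a directory that is itself excluded. Under this model, an include placed before --exclude=agents is never reached, so the exclude is written as agents/* here; whether the actual fix for #94 takes this form is not something this post settles.

```python
from fnmatch import fnmatch
from typing import List, Tuple

Rule = Tuple[str, str]  # ("include" or "exclude", pattern)


def rule_allows(path: str, rules: List[Rule]) -> bool:
    """First matching rule wins; a path that matches no rule is allowed."""
    for action, pattern in rules:
        # Simplification: fnmatch's * also crosses '/', unlike rsync's *.
        if fnmatch(path, pattern):
            return action == "include"
    return True


def is_transferred(path: str, rules: List[Rule]) -> bool:
    """A file is reached only if every parent directory is allowed as well."""
    parts = path.split("/")
    prefixes = ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return all(rule_allows(prefix, rules) for prefix in prefixes)


# Include first, but the directory itself is still excluded.
include_then_exclude_dir = [
    ("include", "agents/agent-map.json"),
    ("exclude", "agents"),
]
# Include first, and only the directory's contents are excluded.
include_then_exclude_contents = [
    ("include", "agents/agent-map.json"),
    ("exclude", "agents/*"),
]

for rules in (include_then_exclude_dir, include_then_exclude_contents):
    print(
        is_transferred("agents/agent-map.json", rules),
        is_transferred("agents/indago/SOUL.md", rules),
    )
```

In the second rule set the config file goes through while the workspace data stays protected, which is the behavior the original exclude was meant to have.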

Bonus: Silencing the False Alarms

With the pipeline healthy, ludus info doctor still screamed about 8 "errors" — agents like vite, muse, hertz, and saga that are in the catalog but never provisioned on rpi5. Useful information once; noise every day.

I added a deployed: false flag to agent-map.json. Doctor and the watcher both skip these entries. The routing intent stays documented for when we're ready to spin them up, but the health dashboard is clean:

Summary: 0 errors, 1 warnings

Down from 8 errors. The one remaining warning (PinchChat service) is a known systemd naming issue.
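The effect of the deployed flag can be sketched as a filter over the agent map. The JSON shape below is an assumption for illustration (a mapping from label to an entry with a workspace field); the real agent-map.json may look different. Treating a missing flag as deployed keeps existing entries working without edits.

```python
import json


def deployed_agents(agent_map: dict) -> dict:
    """Return only the entries that are not marked deployed: false."""
    return {
        label: entry
        for label, entry in agent_map.items()
        if entry.get("deployed", True)  # a missing flag counts as deployed
    }


# Hypothetical excerpt; the field names other than "deployed" are made up.
example = json.loads(
    """
    {
      "indago": {"workspace": "agents/indago"},
      "vite": {"workspace": "agents/vite", "deployed": false}
    }
    """
)

print(sorted(deployed_agents(example)))  # ['indago']
```

Both the doctor and the watcher can iterate over the filtered map, while the full file still documents where the undeployed agents are meant to route.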

What I Learned

Cascading failures love silence. Each broken layer masked the one below it. The import error hid the missing gh binary. The missing binary hid the auth gap. The auth gap hid the agent-map mismatch. If any one of these had produced a visible alert instead of silently logging to a file, we'd have caught it days earlier.

The architecture paid off. PAT support required minimal code changes because the env template and gh-wrapper already had a GH_TOKEN path. When you design for extension points you don't need yet, sometimes they save you months later.

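Rsync excludes are sharp tools. --exclude=agents sounds like "don't touch agent workspaces." It actually means "don't touch anything in the agents/ directory", including config files. Rsync applies the first rule that matches, so an include must come before the exclude it overrides, and rsync never descends into a directory that is itself excluded, so the parent directory has to stay reachable for the include to take effect.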

By the Numbers

Metric          Value
Commits         1 (10 files, +207/-64 lines)
Active repos    ludus
Issues filed    1 (#94)
PRs opened      1 (#95)
New tests       10 (PAT mode + deployed flag)
Doctor errors   8 → 0
Period          2026-03-30

Written with help from Dispatch.