Verification Loop Gap

Summary

b4arena's agent system currently relies on self-reported completion: agents close their own beads and report what they did. There is no independent verification that work was done correctly. This document describes the gap, its consequences, and a path to closing it.

The Problem

The current bead lifecycle is:

Apex creates bead → Rio triages → Forge claims → Forge works → Forge closes bead

At the final step, Forge decides it is done and writes a close reason like "Fixed in PR #17". The system trusts this claim. Nobody independently verifies:

  • Did PR #17 actually get merged?
  • Did the PR pass CI?
  • Does the PR address what the bead description asked for?
  • Is the close reason substantive, or does it say "Done" with no evidence?

This is the phantom completion failure mode from multi-agent research: a bead is marked as complete, but the underlying work is absent, incomplete, or incorrect.

Why Self-Reported Completion Fails

Three findings converge on the same conclusion:

  1. ICLR 2024 research demonstrates that models cannot reliably self-correct their own reasoning. A model that produces incorrect output will, when asked to review its own output, frequently confirm it is correct.

  2. Willison's verification agent pattern: a separate agent reviews the primary agent's output with no investment in defending it. Its only job is finding problems. External verification consistently outperforms self-correction because the verifier has no sunk-cost attachment to the original output.

  3. Arize's production failure analysis: "phantom pipeline" — agents report meetings booked / tasks completed / builds passing, but the downstream evidence doesn't support the claim. The fix is always the same: independent verification of the artifact, not the agent's report about the artifact.

Current State in b4arena

  • Bead completion — Status: self-reported by the working agent. Risk: phantom completions go undetected.
  • PR verification — Status: not checked after bead closure. Risk: the PR may fail CI, be rejected, or never merge.
  • Close reason quality — Status: no standards enforced. Risk: generic "Done" close reasons provide no audit trail.
  • Cross-agent handoff — Status: parent closes when children close. Risk: the parent trusts the children's self-reports (a trust cascade).
  • Acceptance criteria — Status: embedded in bead descriptions. Risk: not verified against output after completion.

Proposed Architecture

Layer 1: Deterministic Verification Script (No LLM)

A lightweight script that runs periodically (or on demand by the Glue agent) and checks recently closed beads:

For each bead closed in the last 24 hours:
1. Does the close reason contain a PR link or commit reference?
2. If PR link: is the PR merged? Did CI pass?
3. If parent bead: are all children actually closed?
4. Is the close reason longer than 10 characters? (catches "Done", "Fixed")
5. Was the bead open for more than 5 minutes? (catches instant closures)

This catches the most common phantom completions with zero token cost.
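
A minimal sketch of these checks in Python (field names such as close_reason, opened_at, and children are assumptions about the bead record shape; check 2, the PR merge/CI lookup, would call out to the forge's API and is elided here):

```python
import re
from datetime import datetime, timedelta

# Matches a PR reference ("#17" or a /pull/ URL) or a bare commit hash.
PR_PATTERN = re.compile(r"(#\d+|https?://\S+/pull/\d+|\b[0-9a-f]{7,40}\b)")

def check_bead(bead):
    """Return a list of flags for one closed bead (empty list = passes)."""
    flags = []
    reason = bead.get("close_reason", "")
    # Check 1: close reason should reference a PR or commit.
    if not PR_PATTERN.search(reason):
        flags.append("no-pr-or-commit-reference")
    # Check 4: close reason should be substantive (catches "Done", "Fixed").
    if len(reason.strip()) <= 10:
        flags.append("close-reason-too-short")
    # Check 5: bead should have been open for more than 5 minutes.
    opened = datetime.fromisoformat(bead["opened_at"])
    closed = datetime.fromisoformat(bead["closed_at"])
    if closed - opened < timedelta(minutes=5):
        flags.append("instant-closure")
    # Check 3: a parent bead should not close with open children.
    if any(not c["closed"] for c in bead.get("children", [])):
        flags.append("open-children")
    return flags
```

Any bead that accumulates flags goes into the verification log for review; a clean bead returns an empty list.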

Layer 2: Sampled LLM Verification (Glue Agent)

For a random sample (10-20%) of closed beads, the Glue agent performs a deeper check:

  1. Read the bead description (what was requested)
  2. Read the close reason and any linked PR diff
  3. Assess: does the output match the request?
  4. Report findings to Apex

This catches semantic mismatches that scripts cannot detect (e.g., "bead asked for error handling, PR added logging instead").
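
The sampling step can itself be deterministic even though the review is not. A sketch, assuming closed beads arrive as plain records (the LLM review call that Glue would run on each sampled bead is elided):

```python
import random

SAMPLE_RATE = 0.15  # within the proposed 10-20% band

def sample_for_llm_review(closed_beads, rate=SAMPLE_RATE, seed=None):
    """Pick a random sample of closed beads for Glue's deeper check.

    Always samples at least one bead when any exist, so low-volume
    periods still get some semantic verification.
    """
    rng = random.Random(seed)
    k = max(1, round(len(closed_beads) * rate)) if closed_beads else 0
    return rng.sample(closed_beads, k)
```

Seeding makes a given day's sample reproducible, which helps when Apex wants to re-run or audit Glue's findings.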

Layer 3: Periodic Holdout Testing

Monthly, run a set of known-answer test beads through the system end-to-end:

  1. Create beads with specific, verifiable acceptance criteria
  2. Let agents process them normally
  3. Verify the output matches expected results
  4. Compare against baseline performance

This catches gradual behavioral drift that per-bead checks miss.
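
Scoring a holdout run reduces to a pass rate compared against baseline; a sketch (function names and the 10-point drift tolerance are illustrative, not part of the protocol above):

```python
def holdout_pass_rate(results):
    """results: list of (bead_id, passed) tuples from one calibration run."""
    if not results:
        return 0.0
    return sum(1 for _, passed in results if passed) / len(results)

def drifted(current_rate, baseline_rate, tolerance=0.1):
    """Flag a run whose pass rate fell more than `tolerance` below baseline."""
    return current_rate < baseline_rate - tolerance
```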

Implementation Path

Phase 1: Script (immediate, zero cost)

Build a verify-completions.sh or verify_completions.py script that checks the five deterministic criteria above. Run it as a cron job or a just recipe. Output goes to a log file that Glue (or a human) can review.

This is the highest-value, lowest-effort intervention. It catches the obvious failures without requiring any LLM tokens.
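
The wiring for this phase could be a single crontab entry (the repository path and log location below are illustrative):

```shell
# Run the deterministic checks daily at 06:00 and append findings
# to a log that Glue (or a human) reviews later.
0 6 * * * cd /opt/b4arena && python verify_completions.py >> logs/verification.log 2>&1
```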

Phase 2: Glue Agent Integration (after Glue deploys)

Wire the script output into Glue's health monitoring workflow. Glue reads the verification log and includes findings in the weekly health digest. Anomalies become P1 or P2 alerts per the alert tiering protocol.

Phase 3: Holdout Tests (quarterly)

Design a small set (5-10) of calibration beads with known-correct outcomes. Run them through the system quarterly. Track pass rates over time.

Design Principles

  1. Code before inference. Deterministic scripts catch 80% of issues at 0% of the token cost. Use LLM-based verification only for the remaining 20% that requires semantic judgment.

  2. Verify the artifact, not the report. Check whether the PR merged, not whether the agent said it merged. Check whether CI passed, not whether the agent said it passed.

  3. Batch findings. A daily verification log is better than per-bead interruptions. The goal is a reliable signal, not real-time alerting.

  4. The Amdahl constraint applies. If verification requires human judgment for 40% of cases, max speedup is 2.5x. Focus automation on the clearly-deterministic checks first.
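
The 2.5x figure is just Amdahl's law with the human-judgment fraction treated as the serial part:

```python
def max_speedup(serial_fraction):
    """Amdahl's law bound: with the automatable fraction made arbitrarily
    fast, overall speedup is capped at 1 / serial_fraction."""
    if not 0 < serial_fraction <= 1:
        raise ValueError("serial fraction must be in (0, 1]")
    return 1.0 / serial_fraction
```

With 40% of cases needing human judgment, max_speedup(0.4) is 2.5; halving the human-judgment fraction to 20% would raise the ceiling to 5x, which is why shrinking that fraction matters more than speeding up the automated checks.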

Relationship to Glue Agent

The Glue agent (introduced in the same session as this document) is the natural owner of verification loop operations. The Glue SOUL already includes handoff verification as a core function. This document provides the detailed architecture for how that function should be implemented — deterministic script first, LLM-based sampling second, holdout testing third.

References