Research: Agentic Engineering Patterns

TL;DR

Agentic engineering is a maturing discipline with strong consensus on fundamentals (tool loop, context engineering, verification layers) but active debate on multi-agent orchestration, subagent tradeoffs, and production readiness. b4arena already implements most recommended patterns — the main gaps are in observability tooling and structured error recovery. The field's biggest lesson: system design matters more than model intelligence.

Problem Statement

What are the established and emerging patterns for building effective coding agent systems? How does b4arena's architecture align with industry consensus, and where are the gaps?


Key Findings

1. Official Documentation & Authoritative Sources

Major guides identified beyond Willison/Zechner:

  • Anthropic — Building Effective Agents: Defines five workflow patterns (prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer) and the core principle: start with simple LLM APIs, add frameworks only when justified.

  • Google Cloud — Agentic AI Design Patterns: Catalogues 12+ multi-agent patterns including sequential, parallel, loop, review-and-critique, coordinator, hierarchical decomposition, swarm, and ReAct. Each has distinct latency/cost/capability tradeoffs.

  • OpenAI — Practical Guide to Building AI Agents: Describes two key orchestration patterns: agents-as-tools (a specialist handles a bounded subtask and returns) and handoffs (control of the next interaction segment transfers to the receiving agent).

  • Google ADK — Context-Aware Multi-Agent Framework: Introduces separation of durable state (Sessions) from per-call views (working context), explicit transformation pipelines, and scope-by-default (agents reach for info rather than receiving all context).

  • LangChain — Plan-and-Execute vs ReAct: Plan-and-Execute decouples planning from execution — faster/cheaper for multi-step problems because the planner isn't consulted after each action.
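The common core beneath all of these guides is the tool loop. A minimal sketch, with a stubbed model standing in for a real LLM API (`fake_llm`, `read_file`, and the message shapes are illustrative assumptions, not any vendor's SDK):

```python
# Minimal tool-loop sketch. fake_llm stands in for a real LLM API call;
# in practice this would be an SDK request returning tool-use blocks.
def fake_llm(messages):
    # Ask for one tool call, then finish once a tool result is present.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "read_file", "args": {"path": "README.md"}}
    return {"done": True, "answer": "summary based on README.md"}

TOOLS = {
    "read_file": lambda path: f"<contents of {path}>",  # stub tool
}

def run_agent(prompt, llm=fake_llm, max_steps=10):
    """Prompt the model, execute requested tools, feed results back, repeat."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_steps):  # hard stop guards against runaway loops
        reply = llm(messages)
        if reply.get("done"):
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("agent exceeded step budget")

print(run_agent("Summarize the README"))
```

Everything else in these guides (routing, subagents, evaluator loops) composes on top of this loop.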

Emerging consensus on design principles:

Principle | Source
Simplicity over sophistication | Anthropic, Zechner, community
Invest in tool documentation like HCI | Anthropic
Separate storage from presentation | Google ADK
Scope by default — agents pull, not push | Google ADK
Interleave reasoning with acting (ReAct) | Academic, all guides
Measure continuously — metrics not vibes | LangChain, community
Graceful degradation over total failure | Multiple
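Two of these principles (separate storage from presentation; scope by default) can be sketched together: durable state lives in a store, and each call gets only a bounded view the agent pulls by name. Class and method names below are invented for illustration, not Google ADK's API:

```python
class SessionStore:
    """Durable state (storage), kept separate from per-call views (presentation).
    Hypothetical names; this is not the Google ADK API."""
    def __init__(self):
        self._fragments = {}          # key -> full text, persisted in real systems

    def put(self, key, text):
        self._fragments[key] = text

    def pull(self, key, max_chars=500):
        # Scope by default: the agent reaches for a named fragment and gets
        # a bounded view, never the whole store.
        return self._fragments.get(key, "")[:max_chars]

store = SessionStore()
store.put("build-log", "error: missing symbol foo\n" + "x" * 10_000)

# The agent pulls only what it asks for, keeping token cost bounded.
view = store.pull("build-log")
print(len(view))  # bounded, regardless of stored size
```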

2. Community Insights & Real-World Experience

Production reality is sobering:

  • 88% of agent projects fail before production (community consensus from case studies)
  • AI generates 1.7x more bugs than humans, 75% more logic errors, 3x readability issues (CodeRabbit study, 470 repos)
  • Token economics are brutal: 28M tokens for 149 lines of code, 170M tokens in 2 days from a single agent (VentureBeat)
  • Unstructured multi-agent networks amplify errors 17.2x vs. single-agent baselines (VoltAgent)

Top anti-patterns identified by practitioners:

  1. Submitting unreviewed agent code — transfers review burden, erodes trust
  2. Monolithic agent design — single large agent doing everything becomes unmaintainable
  3. Agents with production write access — hallucinated tool calls delete tables, send mass emails
  4. Same agent reviewing its own work — reinforces rather than catches mistakes
  5. Vibe coding acceptance loop — prompt → accept → run → loop on errors, no human review
  6. Semantic duplication in instructions — similar conditions confuse agents into random choices

What actually works:

  • Agents as velocity multipliers for senior engineers with strong fundamentals
  • Modular agent design with focused responsibilities
  • Context persistence to disk — write large outputs to files, agents pull fragments
  • Explicit handoff contracts between agent boundaries
  • Memory + context editing together yield 39% higher task success and an 84% token reduction
  • 60–200 line instruction files cover 80% of needs (simplicity wins)
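An explicit handoff contract from the list above can be as small as a validated dataclass. The field names here are hypothetical, chosen to mirror the inline-content rule rather than any real schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Handoff:
    """Explicit contract at an agent boundary (hypothetical schema)."""
    task_id: str
    summary: str            # what was done and why, self-contained
    artifacts: tuple = ()   # inline content, never file paths

def validate(h: Handoff) -> list:
    """Reject handoffs that would lose information across the boundary."""
    errors = []
    if not h.summary.strip():
        errors.append("summary must be non-empty")
    for a in h.artifacts:
        if a.startswith("/") or a.startswith("./"):
            errors.append(f"artifact looks like a path, inline it: {a}")
    return errors

good = Handoff("b4-123", "Refactored retry logic", ("diff --git a/x b/x ...",))
bad = Handoff("b4-124", "", ("/workspace/out.txt",))
print(validate(good))  # []
print(validate(bad))   # two errors
```

Validation at the boundary is what makes the contract "explicit": a malformed handoff fails loudly instead of silently losing context.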

Active debates:

Debate | Position A | Position B
Subagents vs. multi-agent | Lightweight coordination (can bottleneck) | Distributed expertise (7x token cost)
Autonomous vs. supervised | Full autonomy fragments systems | Full structure gets bypassed
Simple vs. framework | Minimal CLAUDE.md covers 80% | Complex tasks need orchestration
Model intelligence vs. system design | Better models fix everything | Failures are architectural, not model-level

3. Codebase Patterns (b4arena/meta)

b4arena implements a remarkably complete set of the recommended patterns:

External Pattern | b4arena Implementation
Harness + System Prompt | SOUL.md per agent (identity, principles, workflows, escalation)
Tool Loop | Agent wake-up → claim → work → close cycle
Context Engineering | Four-Tier Execution Framework (Tier 1 = shell/0 tokens → Tier 4 = full reasoning)
Subagents | ca-leash (clean-context implementation subagent for Forge/Rio)
Multi-Agent Orchestration | Beads DAG + label-based routing + Watcher dispatch
Review & Critique | Four-Eyes Protocol (Rio triage → Forge implement → Atlas review/merge)
Design Decision Routing | Rio → Atlas → Forge pipeline (architecture decided before implementation)
Escalation | Four-dimensional assessment (reversibility, blast radius, commitment, visibility)
Sandboxing | Container isolation per agent, shared only via /workspace/intercom/.beads
Token-Efficient Routing | "Intern Test" — if a checklist suffices, it's Tier 1 (zero tokens)

Unique b4arena patterns not found in external literature:

  1. SOUL.md as modular system prompt — standardized structure (Identity → Principles → Wake-Up → Workflows → Escalation → Rules) across all agents. More structured than typical CLAUDE.md approaches.

  2. Four-Tier Execution Framework — explicit token-cost routing. No external guide formalizes this level of token budget awareness.

  3. Design Decision Gate — mandatory Rio check before any implementation bead: "does this task contain a design decision?" If yes, Atlas decides first, Forge waits. Prevents architecture-by-accident.

  4. Inline Content in Beads — never reference files by path (other agents can't access them). Forces self-contained communication. This is a practical solution to the handoff information-loss problem the community identifies as a top failure mode.
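The Design Decision Gate and the Intern Test both reduce to cheap routing checks made before any model is invoked. A sketch under assumed bead fields (the dictionary keys and return labels are invented for illustration, not b4arena's actual schema):

```python
def route(bead: dict) -> str:
    """Route a bead before spending tokens. Field names are hypothetical."""
    # Design Decision Gate: architecture is decided before implementation.
    if bead.get("contains_design_decision"):
        return "atlas"            # architect decides first; Forge waits

    # Intern Test: if a checklist suffices, run it as Tier 1 shell (0 tokens).
    if bead.get("checklist_sufficient"):
        return "tier-1-shell"

    return "forge"                # full implementation agent

print(route({"contains_design_decision": True}))   # atlas
print(route({"checklist_sufficient": True}))       # tier-1-shell
print(route({}))                                   # forge
```

The point is ordering: the design check runs before the tier check, so no checklist shortcut can smuggle a design decision past the architect.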


What b4arena should keep doing

  1. SOUL.md structure — more rigorous than industry norm; aligns with "invest in tool documentation like HCI"
  2. Four-Tier Execution — unique and valuable; directly addresses the token economics problem
  3. Design Decision Gate — prevents the "architecture-by-accident" anti-pattern
  4. ca-leash subagents — clean separation of orchestration from implementation context

Gaps to consider

  1. Observability/Tracing — community and Google ADK emphasize structured traces (runs → traces → threads). b4arena's current visibility into agent reasoning during ca-leash sessions is limited. Consider: structured logging of agent decisions, not just beads state changes.

  2. Error Recovery Patterns — Anthropic and community emphasize progressive failure (self-correct → fallback → graceful degradation → escalation). b4arena has escalation rules but could formalize retry/fallback behavior within agents.

  3. Context Compilation Pipeline — Google ADK's explicit transformation pipeline (named processors, not ad-hoc string concatenation) could improve how SOUL.md + TOOLS.md + task context are assembled before each agent invocation.

  4. Metrics/Evaluation — "Measure continuously, not vibes." Token spend per task tier, bead completion rates, escalation frequency, and ca-leash success rates would provide operational visibility.
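The progressive-failure ladder in gap 2 (self-correct → fallback → graceful degradation → escalation) can be sketched as one small wrapper; the exception names are invented for illustration:

```python
class TransientError(Exception):
    """A failure worth retrying (hypothetical classification)."""

class EscalateToHuman(Exception):
    """All automated recovery layers exhausted."""

def run_with_recovery(action, fallback=None, degraded=None, retries=2):
    """Self-correct (retry) -> fallback -> graceful degradation -> escalation."""
    for _ in range(retries):
        try:
            return action()           # self-correction: simply try again
        except TransientError:
            continue
    if fallback is not None:
        return fallback()             # alternative strategy
    if degraded is not None:
        return degraded               # partial result beats total failure
    raise EscalateToHuman("all recovery layers exhausted")

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 2:
        raise TransientError()
    return "ok"

print(run_with_recovery(flaky))  # ok (succeeded on retry)
```

Formalizing this inside agents would complement b4arena's existing escalation rules: escalation becomes the last rung of a ladder rather than the only move.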


Consolidated Concept Definitions

Building on agentic-engineering-patterns.md with definitions surfaced by this research:

  • Agentic Engineering: The discipline of developing production software with coding agents — encompassing prompt design, context management, verification, and iterative refinement. Distinguished from vibe coding by review and quality expectations.
  • Tool Loop: Core agent architecture: prompt LLM → execute requested tools → feed results back → repeat until done.
  • Harness: The orchestration layer around an LLM: system prompt, tool definitions, conversation replay, token management.
  • Context Engineering: Controlling what enters the model's context to maximize output quality. The highest-leverage skill in agentic engineering.
  • Context Window: Maximum tokens an LLM processes at once (~1M max, quality degrades above ~200K).
  • Subagent: Fresh agent instance with clean context dispatched by a parent for a scoped subtask.
  • System Prompt / SOUL: Hidden instructions defining agent identity, capabilities, and constraints. In b4arena: SOUL.md.
  • ReAct: Interleaved Reasoning (Thought) → Action → Observation cycle. Reduces hallucination by grounding reasoning in tool results.
  • Plan-and-Execute: Decoupled planning then execution. Faster/cheaper than ReAct for multi-step problems because the planner isn't consulted after each action.
  • Prompt Chaining: Sequential step-by-step execution with validation gates between steps.
  • Orchestrator-Workers: Central LLM dynamically breaks tasks and delegates to worker agents.
  • Evaluator-Optimizer: Generator + evaluator in a feedback loop for iterative improvement.
  • Lethal Trifecta: Dangerous combination: (1) private data access, (2) exposure to attacker content, (3) exfiltration channel.
  • Vibe Coding: Prompting LLMs to generate code without reviewing or understanding it. Prototype-only.
  • Red/Green TDD: Agent writes failing test first, then implements to pass. "Tests are effectively free now."
  • Conformance-Driven Dev: Test suite derived from multiple existing implementations; implement against those tests.
  • Four-Tier Execution: (b4arena) Token-cost routing: Tier 1 (shell, 0 tokens) → Tier 4 (full reasoning). "Intern Test" decides tier.
  • SOUL.md: (b4arena) Modular system prompt: Identity → Principles → Wake-Up → Workflows → Escalation → Rules.
  • Design Decision Gate: (b4arena) Mandatory check: does the task contain a design decision? If yes, architect decides before developer implements.
  • Handoff Contract: Explicit input/output schema between agent boundaries. Absence is the #1 multi-agent failure mode.
  • Scope-by-Default: Agents pull information on demand rather than receiving all context upfront. Reduces token waste.
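The Plan-and-Execute definition above is worth making concrete: the planner runs once, and execution proceeds without consulting it again, which is where the cost saving over ReAct comes from. Planner and executor here are stubs, not LangChain's API:

```python
def planner(goal):
    # One up-front planning call (in practice, an LLM request).
    return [f"step {i}: {goal}" for i in range(1, 4)]

def executor(step):
    # Executes a single step; in practice a cheaper model or a tool call.
    return f"done({step})"

def plan_and_execute(goal):
    plan = planner(goal)                  # planner consulted exactly once
    return [executor(s) for s in plan]    # no re-planning between steps

results = plan_and_execute("migrate config")
print(results)
```

The trade-off is rigidity: because the plan is frozen before execution, an unexpected observation mid-run cannot reshape it, which is exactly the ReAct counterargument in the table below.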

Trade-offs

Approach | Pros | Cons
Minimal agent (Zechner) | Full observability, low complexity, competitive benchmarks | No built-in coordination for multi-agent teams
Feature-rich agent (Claude Code) | Subagents, MCP, specialist roles, IDE integration | MCP consumes 7-9% context; subagents are opaque
Multi-agent team (b4arena) | Separation of concerns, architecture coherence, review layers | Coordination overhead, handoff risk, token multiplication
Single-agent + tools | Simple, cheap, easy to debug | Context exhaustion on complex tasks
Plan-and-Execute | Cheaper for multi-step; forces foresight | Rigid plans break on unexpected discoveries
ReAct (interleaved) | Adaptive; adjusts to observations | More expensive; planner consulted after every action

Open Questions

  • How should b4arena measure agent effectiveness? (token spend per tier, bead cycle time, escalation rate?)
  • Should ca-leash sessions emit structured traces for post-hoc analysis?
  • Is the Four-Tier Framework's "Intern Test" heuristic sufficient, or does it need formalization?
  • How to handle context window exhaustion mid-task in ca-leash? (checkpoint and spawn new session?)
  • Should b4arena adopt explicit error classification (transient/rate-limit/auth) in agent SOULs?

Sources

Official Documentation

Community & Practitioner

Prior Research

  • meta/agentic-engineering-patterns.md — Initial synthesis from Willison + Zechner (in b4arena/meta repo root)

Codebase

  • ludus/docs/architecture.md — Four-Tier Framework, Watcher, Agent Identity
  • ludus/agents/*/SOUL.md — Agent system prompts (main, forge, atlas, rio)
  • ludus/agents/forge/TOOLS.md — ca-leash subagent pattern