Research: Agentic Engineering Patterns

TL;DR

Agentic engineering is a maturing discipline with strong consensus on fundamentals (tool loop, context engineering, verification layers) but active debate on multi-agent orchestration, subagent tradeoffs, and production readiness. b4arena already implements most recommended patterns — the main gaps are in observability tooling and structured error recovery. The field's biggest lesson: system design matters more than model intelligence.

Problem Statement

What are the established and emerging patterns for building effective coding agent systems? How does b4arena's architecture align with industry consensus, and where are the gaps?


Key Findings

1. Official Documentation & Authoritative Sources

Major guides identified beyond Willison/Zechner:

  • Anthropic — Building Effective Agents: Defines five workflow patterns (prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer) and the core principle: start with simple LLM APIs, add frameworks only when justified.

  • Google Cloud — Agentic AI Design Patterns: Catalogues 12+ multi-agent patterns including sequential, parallel, loop, review-and-critique, coordinator, hierarchical decomposition, swarm, and ReAct. Each has distinct latency/cost/capability tradeoffs.

  • OpenAI — Practical Guide to Building AI Agents: Describes two key orchestration patterns: agents-as-tools (a specialist handles a bounded subtask and returns) and handoffs (control of the next interaction segment transfers to the receiving agent).

  • Google ADK — Context-Aware Multi-Agent Framework: Introduces separation of durable state (Sessions) from per-call views (working context), explicit transformation pipelines, and scope-by-default (agents reach for info rather than receiving all context).

  • LangChain — Plan-and-Execute vs ReAct: Plan-and-Execute decouples planning from execution — faster/cheaper for multi-step problems because the planner isn't consulted after each action.
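The common core beneath all of these guides is the tool loop. A minimal sketch, with a stubbed model standing in for a real LLM API (`fake_llm`, `read_file`, and the message shapes are illustrative assumptions, not any vendor's SDK):

```python
# Minimal tool-loop sketch. fake_llm stands in for a real LLM API call;
# in practice this would be an SDK request returning tool-use blocks.
def fake_llm(messages):
    # Ask for one tool call, then finish once a tool result is present.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "read_file", "args": {"path": "README.md"}}
    return {"done": True, "answer": "summary based on README.md"}

TOOLS = {
    "read_file": lambda path: f"<contents of {path}>",  # stub tool
}

def run_agent(prompt, llm=fake_llm, max_steps=10):
    """Prompt the model, execute requested tools, feed results back, repeat."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_steps):  # hard stop guards against runaway loops
        reply = llm(messages)
        if reply.get("done"):
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("agent exceeded step budget")

print(run_agent("Summarize the README"))
```

Everything else in these guides (routing, subagents, evaluator loops) composes on top of this loop.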

Emerging consensus on design principles:

Principle | Source
Simplicity over sophistication | Anthropic, Zechner, community
Invest in tool documentation like HCI | Anthropic
Separate storage from presentation | Google ADK
Scope by default — agents pull, not push | Google ADK
Interleave reasoning with acting (ReAct) | Academic, all guides
Measure continuously — metrics not vibes | LangChain, community
Graceful degradation over total failure | Multiple
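Two of these principles (separate storage from presentation; scope by default) can be sketched together: durable state lives in a store, and each call gets only a bounded view the agent pulls by name. Class and method names below are invented for illustration, not Google ADK's API:

```python
class SessionStore:
    """Durable state (storage), kept separate from per-call views (presentation).
    Hypothetical names; this is not the Google ADK API."""
    def __init__(self):
        self._fragments = {}          # key -> full text, persisted in real systems

    def put(self, key, text):
        self._fragments[key] = text

    def pull(self, key, max_chars=500):
        # Scope by default: the agent reaches for a named fragment and gets
        # a bounded view, never the whole store.
        return self._fragments.get(key, "")[:max_chars]

store = SessionStore()
store.put("build-log", "error: missing symbol foo\n" + "x" * 10_000)

# The agent pulls only what it asks for, keeping token cost bounded.
view = store.pull("build-log")
print(len(view))  # bounded, regardless of stored size
```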

2. Community Insights & Real-World Experience

Production reality is sobering:

  • 88% of agent projects fail before production (community consensus from case studies)
  • AI generates 1.7x more bugs than humans, 75% more logic errors, 3x readability issues (CodeRabbit study, 470 repos)
  • Token economics are brutal: 28M tokens for 149 lines of code, 170M tokens in 2 days from a single agent (VentureBeat)
  • Unstructured multi-agent networks amplify errors 17.2x vs. single-agent baselines (VoltAgent)

Top anti-patterns identified by practitioners:

  1. Submitting unreviewed agent code — transfers review burden, erodes trust
  2. Monolithic agent design — single large agent doing everything becomes unmaintainable
  3. Agents with production write access — hallucinated tool calls delete tables, send mass emails
  4. Same agent reviewing its own work — reinforces rather than catches mistakes
  5. Vibe coding acceptance loop — prompt → accept → run → loop on errors, no human review
  6. Semantic duplication in instructions — similar conditions confuse agents into random choices

What actually works:

  • Agents as velocity multipliers for senior engineers with strong fundamentals
  • Modular agent design with focused responsibilities
  • Context persistence to disk — write large outputs to files, agents pull fragments
  • Explicit handoff contracts between agent boundaries
  • Memory + context editing together yield 39% higher task success and an 84% token reduction
  • 60–200 line instruction files cover 80% of needs (simplicity wins)
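An explicit handoff contract from the list above can be as small as a validated dataclass. The field names here are hypothetical, chosen to mirror the inline-content rule rather than any real schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Handoff:
    """Explicit contract at an agent boundary (hypothetical schema)."""
    task_id: str
    summary: str            # what was done and why, self-contained
    artifacts: tuple = ()   # inline content, never file paths

def validate(h: Handoff) -> list:
    """Reject handoffs that would lose information across the boundary."""
    errors = []
    if not h.summary.strip():
        errors.append("summary must be non-empty")
    for a in h.artifacts:
        if a.startswith("/") or a.startswith("./"):
            errors.append(f"artifact looks like a path, inline it: {a}")
    return errors

good = Handoff("b4-123", "Refactored retry logic", ("diff --git a/x b/x ...",))
bad = Handoff("b4-124", "", ("/workspace/out.txt",))
print(validate(good))  # []
print(validate(bad))   # two errors
```

Validation at the boundary is what makes the contract "explicit": a malformed handoff fails loudly instead of silently losing context.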

Active debates:

Debate | Position A | Position B
Subagents vs. multi-agent | Lightweight coordination (can bottleneck) | Distributed expertise (7x token cost)
Autonomous vs. supervised | Full autonomy fragments systems | Full structure gets bypassed
Simple vs. framework | Minimal CLAUDE.md covers 80% | Complex tasks need orchestration
Model intelligence vs. system design | Better models fix everything | Failures are architectural, not model-level

3. Codebase Patterns (b4arena/meta)

b4arena implements a remarkably complete set of the recommended patterns:

External Pattern | b4arena Implementation
Harness + System Prompt | SOUL.md per agent (identity, principles, workflows, escalation)
Tool Loop | Agent wake-up → claim → work → close cycle
Context Engineering | Four-Tier Execution Framework (Tier 1 = shell/0 tokens → Tier 4 = full reasoning)
Subagents | ca-leash (clean-context implementation subagent for Forge/Rio)
Multi-Agent Orchestration | Beads DAG + label-based routing + Watcher dispatch
Review & Critique | Four-Eyes Protocol (Rio triage → Forge implement → Atlas review/merge)
Design Decision Routing | Rio → Atlas → Forge pipeline (architecture decided before implementation)
Escalation | Four-dimensional assessment (reversibility, blast radius, commitment, visibility)
Sandboxing | Container isolation per agent, shared only via /workspace/intercom/.beads
Token-Efficient Routing | "Intern Test" — if a checklist suffices, it's Tier 1 (zero tokens)

Unique b4arena patterns not found in external literature:

  1. SOUL.md as modular system prompt — standardized structure (Identity → Principles → Wake-Up → Workflows → Escalation → Rules) across all agents. More structured than typical CLAUDE.md approaches.

  2. Four-Tier Execution Framework — explicit token-cost routing. No external guide formalizes this level of token budget awareness.

  3. Design Decision Gate — mandatory Rio check before any implementation bead: "does this task contain a design decision?" If yes, Atlas decides first, Forge waits. Prevents architecture-by-accident.

  4. Inline Content in Beads — never reference files by path (other agents can't access them). Forces self-contained communication. This is a practical solution to the handoff information-loss problem the community identifies as a top failure mode.
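The Design Decision Gate and the Intern Test both reduce to cheap routing checks made before any model is invoked. A sketch under assumed bead fields (the dictionary keys and return labels are invented for illustration, not b4arena's actual schema):

```python
def route(bead: dict) -> str:
    """Route a bead before spending tokens. Field names are hypothetical."""
    # Design Decision Gate: architecture is decided before implementation.
    if bead.get("contains_design_decision"):
        return "atlas"            # architect decides first; Forge waits

    # Intern Test: if a checklist suffices, run it as Tier 1 shell (0 tokens).
    if bead.get("checklist_sufficient"):
        return "tier-1-shell"

    return "forge"                # full implementation agent

print(route({"contains_design_decision": True}))   # atlas
print(route({"checklist_sufficient": True}))       # tier-1-shell
print(route({}))                                   # forge
```

The point is ordering: the design check runs before the tier check, so no checklist shortcut can smuggle a design decision past the architect.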


What b4arena should keep doing

  1. SOUL.md structure — more rigorous than industry norm; aligns with "invest in tool documentation like HCI"
  2. Four-Tier Execution — unique and valuable; directly addresses the token economics problem
  3. Design Decision Gate — prevents the "architecture-by-accident" anti-pattern
  4. ca-leash subagents — clean separation of orchestration from implementation context

Gaps to consider

  1. Observability/Tracing — community and Google ADK emphasize structured traces (runs → traces → threads). b4arena's current visibility into agent reasoning during ca-leash sessions is limited. Consider: structured logging of agent decisions, not just beads state changes.

  2. Error Recovery Patterns — Anthropic and community emphasize progressive failure (self-correct → fallback → graceful degradation → escalation). b4arena has escalation rules but could formalize retry/fallback behavior within agents.

  3. Context Compilation Pipeline — Google ADK's explicit transformation pipeline (named processors, not ad-hoc string concatenation) could improve how SOUL.md + TOOLS.md + task context are assembled before each agent invocation.

  4. Metrics/Evaluation — "Measure continuously, not vibes." Token spend per task tier, bead completion rates, escalation frequency, and ca-leash success rates would provide operational visibility.
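The progressive-failure ladder in gap 2 (self-correct → fallback → graceful degradation → escalation) can be sketched as one small wrapper; the exception names are invented for illustration:

```python
class TransientError(Exception):
    """A failure worth retrying (hypothetical classification)."""

class EscalateToHuman(Exception):
    """All automated recovery layers exhausted."""

def run_with_recovery(action, fallback=None, degraded=None, retries=2):
    """Self-correct (retry) -> fallback -> graceful degradation -> escalation."""
    for _ in range(retries):
        try:
            return action()           # self-correction: simply try again
        except TransientError:
            continue
    if fallback is not None:
        return fallback()             # alternative strategy
    if degraded is not None:
        return degraded               # partial result beats total failure
    raise EscalateToHuman("all recovery layers exhausted")

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 2:
        raise TransientError()
    return "ok"

print(run_with_recovery(flaky))  # ok (succeeded on retry)
```

Formalizing this inside agents would complement b4arena's existing escalation rules: escalation becomes the last rung of a ladder rather than the only move.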


Consolidated Concept Definitions

Building on agentic-engineering-patterns.md with definitions surfaced by this research:

  • Agentic Engineering: The discipline of developing production software with coding agents — encompassing prompt design, context management, verification, and iterative refinement. Distinguished from vibe coding by review and quality expectations.
  • Tool Loop: Core agent architecture: prompt LLM → execute requested tools → feed results back → repeat until done.
  • Harness: The orchestration layer around an LLM: system prompt, tool definitions, conversation replay, token management.
  • Context Engineering: Controlling what enters the model's context to maximize output quality. The highest-leverage skill in agentic engineering.
  • Context Window: Maximum tokens an LLM processes at once (~1M max, quality degrades above ~200K).
  • Subagent: Fresh agent instance with clean context dispatched by a parent for a scoped subtask.
  • System Prompt / SOUL: Hidden instructions defining agent identity, capabilities, and constraints. In b4arena: SOUL.md.
  • ReAct: Interleaved Reasoning (Thought) → Action → Observation cycle. Reduces hallucination by grounding reasoning in tool results.
  • Plan-and-Execute: Decoupled planning then execution. Faster/cheaper than ReAct for multi-step problems because the planner isn't consulted after each action.
  • Prompt Chaining: Sequential step-by-step execution with validation gates between steps.
  • Orchestrator-Workers: Central LLM dynamically breaks tasks and delegates to worker agents.
  • Evaluator-Optimizer: Generator + evaluator in a feedback loop for iterative improvement.
  • Lethal Trifecta: Dangerous combination: (1) private data access, (2) exposure to attacker content, (3) exfiltration channel.
  • Vibe Coding: Prompting LLMs to generate code without reviewing or understanding it. Prototype-only.
  • Red/Green TDD: Agent writes failing test first, then implements to pass. "Tests are effectively free now."
  • Conformance-Driven Dev: Test suite derived from multiple existing implementations; implement against those tests.
  • Four-Tier Execution: (b4arena) Token-cost routing: Tier 1 (shell, 0 tokens) → Tier 4 (full reasoning). "Intern Test" decides tier.
  • SOUL.md: (b4arena) Modular system prompt: Identity → Principles → Wake-Up → Workflows → Escalation → Rules.
  • Design Decision Gate: (b4arena) Mandatory check: does the task contain a design decision? If yes, architect decides before developer implements.
  • Handoff Contract: Explicit input/output schema between agent boundaries. Absence is the #1 multi-agent failure mode.
  • Scope-by-Default: Agents pull information on demand rather than receiving all context upfront. Reduces token waste.
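The Plan-and-Execute definition above is worth making concrete: the planner runs once, and execution proceeds without consulting it again, which is where the cost saving over ReAct comes from. Planner and executor here are stubs, not LangChain's API:

```python
def planner(goal):
    # One up-front planning call (in practice, an LLM request).
    return [f"step {i}: {goal}" for i in range(1, 4)]

def executor(step):
    # Executes a single step; in practice a cheaper model or a tool call.
    return f"done({step})"

def plan_and_execute(goal):
    plan = planner(goal)                  # planner consulted exactly once
    return [executor(s) for s in plan]    # no re-planning between steps

results = plan_and_execute("migrate config")
print(results)
```

The trade-off is rigidity: because the plan is frozen before execution, an unexpected observation mid-run cannot reshape it, which is exactly the ReAct counterargument in the table below.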

Trade-offs

Approach | Pros | Cons
Minimal agent (Zechner) | Full observability, low complexity, competitive benchmarks | No built-in coordination for multi-agent teams
Feature-rich agent (Claude Code) | Subagents, MCP, specialist roles, IDE integration | MCP consumes 7-9% context; subagents are opaque
Multi-agent team (b4arena) | Separation of concerns, architecture coherence, review layers | Coordination overhead, handoff risk, token multiplication
Single-agent + tools | Simple, cheap, easy to debug | Context exhaustion on complex tasks
Plan-and-Execute | Cheaper for multi-step; forces foresight | Rigid plans break on unexpected discoveries
ReAct (interleaved) | Adaptive; adjusts to observations | More expensive; planner consulted after every action

Open Questions

  • How should b4arena measure agent effectiveness? (token spend per tier, bead cycle time, escalation rate?)
  • Should ca-leash sessions emit structured traces for post-hoc analysis?
  • Is the Four-Tier Framework's "Intern Test" heuristic sufficient, or does it need formalization?
  • How to handle context window exhaustion mid-task in ca-leash? (checkpoint and spawn new session?)
  • Should b4arena adopt explicit error classification (transient/rate-limit/auth) in agent SOULs?

Sources

Official Documentation

Community & Practitioner

Prior Research

  • meta/agentic-engineering-patterns.md — Initial synthesis from Willison + Zechner (in b4arena/meta repo root)

Codebase

  • ludus/docs/architecture.md — Four-Tier Framework, Watcher, Agent Identity
  • ludus/agents/*/SOUL.md — Agent system prompts (main, forge, atlas, rio)
  • ludus/agents/forge/TOOLS.md — ca-leash subagent pattern