Research: Agentic Engineering Patterns
TL;DR
Agentic engineering is a maturing discipline with strong consensus on fundamentals (tool loop, context engineering, verification layers) but active debate on multi-agent orchestration, subagent tradeoffs, and production readiness. b4arena already implements most recommended patterns — the main gaps are in observability tooling and structured error recovery. The field's biggest lesson: system design matters more than model intelligence.
Problem Statement
What are the established and emerging patterns for building effective coding agent systems? How does b4arena's architecture align with industry consensus, and where are the gaps?
Key Findings
1. Official Documentation & Authoritative Sources
Major guides identified beyond Willison/Zechner:
- Anthropic — Building Effective Agents: Defines five workflow patterns (prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer) and the core principle: start with simple LLM APIs, add frameworks only when justified.
- Google Cloud — Agentic AI Design Patterns: Catalogues 12+ multi-agent patterns including sequential, parallel, loop, review-and-critique, coordinator, hierarchical decomposition, swarm, and ReAct. Each has distinct latency/cost/capability tradeoffs.
- OpenAI — Practical Guide to Building AI Agents: Two key orchestration patterns: agents-as-tools (bounded specialist subtasks) and handoffs (the receiving agent owns the next segment of the interaction).
- Google ADK — Context-Aware Multi-Agent Framework: Introduces separation of durable state (Sessions) from per-call views (working context), explicit transformation pipelines, and scope-by-default (agents pull information on demand rather than receiving all context upfront).
- LangChain — Plan-and-Execute vs ReAct: Plan-and-Execute decouples planning from execution — faster and cheaper for multi-step problems because the planner isn't consulted after each action.
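The control-flow difference between the two loop styles is easy to see in code. A toy sketch with stubbed model and tool calls — `plan`, `act`, and the reasoning strings are hypothetical stand-ins, not any framework's API:

```python
# Toy contrast of Plan-and-Execute vs ReAct with stubbed calls, so the
# control flow is visible without a live API. All helpers are illustrative.

def plan(task):
    # Plan-and-Execute: one planner call up front produces a fixed step list.
    return [f"step {i} of {task}" for i in range(3)]

def act(step):
    return f"did {step}"  # stand-in for a tool execution

def plan_and_execute(task):
    # Planner is never consulted again after the initial plan.
    return [act(step) for step in plan(task)]

def react(task, max_turns=3):
    # ReAct: reason -> act -> observe on every turn; adaptive, but one model
    # call per action instead of one per task.
    observations = []
    for _ in range(max_turns):
        thought = f"{task}: reasoning over {len(observations)} observations"
        observations.append(act(thought))
    return observations

print(plan_and_execute("demo")[0])  # → did step 0 of demo
```

The tradeoff appears directly: `plan_and_execute` makes one planning decision total, while `react` makes one per turn, which is why the latter adapts better and costs more.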
Emerging consensus on design principles:
| Principle | Source |
|---|---|
| Simplicity over sophistication | Anthropic, Zechner, community |
| Invest in tool documentation like HCI | Anthropic |
| Separate storage from presentation | Google ADK |
| Scope by default — agents pull, not push | Google ADK |
| Interleave reasoning with acting (ReAct) | Academic, all guides |
| Measure continuously — metrics not vibes | LangChain, community |
| Graceful degradation over total failure | Multiple |
2. Community Insights & Real-World Experience
Production reality is sobering:
- 88% of agent projects fail before production (community consensus from case studies)
- AI generates 1.7x more bugs than humans, 75% more logic errors, 3x readability issues (CodeRabbit study, 470 repos)
- Token economics are brutal: 28M tokens for 149 lines of code, 170M tokens in 2 days from a single agent (VentureBeat)
- Unstructured multi-agent networks amplify errors 17.2x vs. single-agent baselines (VoltAgent)
Top anti-patterns identified by practitioners:
- Submitting unreviewed agent code — transfers review burden, erodes trust
- Monolithic agent design — single large agent doing everything becomes unmaintainable
- Agents with production write access — hallucinated tool calls delete tables, send mass emails
- Same agent reviewing its own work — reinforces rather than catches mistakes
- Vibe coding acceptance loop — prompt → accept → run → loop on errors, no human review
- Semantic duplication in instructions — similar conditions confuse agents into random choices
What actually works:
- Agents as velocity multipliers for senior engineers with strong fundamentals
- Modular agent design with focused responsibilities
- Context persistence to disk — write large outputs to files, agents pull fragments
- Explicit handoff contracts between agent boundaries
- Memory + context editing together yields 39% higher task success, 84% token reduction
- 60–200 line instruction files cover 80% of needs (simplicity wins)
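An explicit handoff contract can be as small as a validated schema at the agent boundary. A minimal sketch — the class name and fields are invented for illustration and assume nothing about b4arena's actual bead format:

```python
# Sketch of an explicit handoff contract: the producing agent emits exactly
# this schema, and the receiving agent validates it before starting work.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Handoff:
    task_id: str
    summary: str                # self-contained; no "see my notes" references
    acceptance_criteria: list   # how the receiver knows the work is done
    artifacts: dict = field(default_factory=dict)  # content inlined, not paths

    def validate(self) -> bool:
        if not self.summary or not self.acceptance_criteria:
            raise ValueError("handoff rejected: incomplete contract")
        return True

h = Handoff("bead-42", "Add retry to the fetcher", ["tests pass", "no new deps"])
print(h.validate())  # → True
```

Rejecting an incomplete handoff at the boundary is what turns "information loss between agents" from a silent failure into a loud one.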
Active debates:
| Debate | Position A | Position B |
|---|---|---|
| Subagents vs. multi-agent | Lightweight coordination (can bottleneck) | Distributed expertise (7x token cost) |
| Autonomous vs. supervised | Full autonomy fragments systems | Full structure gets bypassed |
| Simple vs. framework | Minimal CLAUDE.md covers 80% | Complex tasks need orchestration |
| Model intelligence vs. system design | Better models fix everything | Failures are architectural, not model-level |
3. Codebase Patterns (b4arena/meta)
b4arena implements a remarkably complete set of the recommended patterns:
| External Pattern | b4arena Implementation |
|---|---|
| Harness + System Prompt | SOUL.md per agent (identity, principles, workflows, escalation) |
| Tool Loop | Agent wake-up → claim → work → close cycle |
| Context Engineering | Four-Tier Execution Framework (Tier 1 = shell/0 tokens → Tier 4 = full reasoning) |
| Subagents | ca-leash (clean-context implementation subagent for Forge/Rio) |
| Multi-Agent Orchestration | Beads DAG + label-based routing + Watcher dispatch |
| Review & Critique | Four-Eyes Protocol (Rio triage → Forge implement → Atlas review/merge) |
| Design Decision Routing | Rio → Atlas → Forge pipeline (architecture decided before implementation) |
| Escalation | Four-dimensional assessment (reversibility, blast radius, commitment, visibility) |
| Sandboxing | Container isolation per agent, shared only via /workspace/intercom/.beads |
| Token-Efficient Routing | "Intern Test" — if a checklist suffices, it's Tier 1 (zero tokens) |
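As an illustration only (the real routing rules live in b4arena's framework and are not reproduced here), the Intern Test can be read as the first branch of a tier classifier:

```python
# Hypothetical tier classifier: if a deterministic checklist suffices
# (the "Intern Test"), the task runs as Tier 1 shell work at zero model
# tokens; otherwise it escalates through increasingly expensive tiers.
# The branch conditions are invented for illustration.

def classify_tier(task: dict) -> int:
    if task.get("has_checklist"):          # Intern Test: checklist suffices
        return 1                           # shell script, 0 tokens
    if task.get("single_file_edit"):
        return 2                           # small, templated model call
    if task.get("needs_design_decision"):
        return 4                           # full reasoning context
    return 3                               # default: scoped reasoning

tasks = [
    {"name": "rotate logs", "has_checklist": True},
    {"name": "rename field", "single_file_edit": True},
    {"name": "choose storage engine", "needs_design_decision": True},
]
print([classify_tier(t) for t in tasks])  # → [1, 2, 4]
```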
Unique b4arena patterns not found in external literature:
- SOUL.md as modular system prompt — standardized structure (Identity → Principles → Wake-Up → Workflows → Escalation → Rules) across all agents. More structured than typical CLAUDE.md approaches.
- Four-Tier Execution Framework — explicit token-cost routing. No external guide formalizes this level of token budget awareness.
- Design Decision Gate — mandatory Rio check before any implementation bead: "does this task contain a design decision?" If yes, Atlas decides first, Forge waits. Prevents architecture-by-accident.
- Inline Content in Beads — never reference files by path (other agents can't access them). Forces self-contained communication. This is a practical solution to the handoff information-loss problem the community identifies as a top failure mode.
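The inline-content rule can be enforced mechanically at bead-creation time. A hypothetical sketch — `make_bead` and its fields are not b4arena's actual API:

```python
# Illustrative guard for the inline-content rule: a bead carries the relevant
# excerpts itself, and any file-path reference is rejected outright, because
# the receiving agent's container cannot read the author's filesystem.

def make_bead(title: str, *, inline: dict, paths: tuple = ()) -> dict:
    if paths:
        raise ValueError("beads must inline content; other agents can't read these paths")
    return {"title": title, "content": inline}

bead = make_bead(
    "Fix timeout in fetcher",
    inline={"failing_test": "test_fetch_retries", "snippet": "timeout=0.1  # too low"},
)
print(bead["title"])  # → Fix timeout in fetcher
```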
Recommended Approach
What b4arena should keep doing
- SOUL.md structure — more rigorous than industry norm; aligns with "invest in tool documentation like HCI"
- Four-Tier Execution — unique and valuable; directly addresses the token economics problem
- Design Decision Gate — prevents the "architecture-by-accident" anti-pattern
- ca-leash subagents — clean separation of orchestration from implementation context
Gaps to consider
- Observability/Tracing — community and Google ADK emphasize structured traces (runs → traces → threads). b4arena's current visibility into agent reasoning during ca-leash sessions is limited. Consider: structured logging of agent decisions, not just beads state changes.
- Error Recovery Patterns — Anthropic and community emphasize progressive failure (self-correct → fallback → graceful degradation → escalation). b4arena has escalation rules but could formalize retry/fallback behavior within agents.
- Context Compilation Pipeline — Google ADK's explicit transformation pipeline (named processors, not ad-hoc string concatenation) could improve how SOUL.md + TOOLS.md + task context are assembled before each agent invocation.
- Metrics/Evaluation — "Measure continuously, not vibes." Token spend per task tier, bead completion rates, escalation frequency, and ca-leash success rates would provide operational visibility.
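One way to formalize the progressive-failure ladder is a single recovery wrapper. Everything below is a sketch with placeholder strategies, not a prescription for how b4arena's agents should implement it:

```python
# Progressive failure: self-correct (bounded retries) -> fallback strategy ->
# graceful degradation (partial result) -> escalation to a human. Each stage
# is a caller-supplied callable; the ladder only moves down on failure.

def run_with_recovery(task, attempt, fallback, degrade, escalate, retries=2):
    for _ in range(retries):            # 1) self-correct
        try:
            return attempt(task)
        except Exception:
            continue
    try:
        return fallback(task)           # 2) cheaper or simpler strategy
    except Exception:
        pass
    partial = degrade(task)             # 3) partial result beats total failure
    if partial is not None:
        return partial
    return escalate(task)               # 4) hand off with context

def failing(msg):
    def strategy(_task):
        raise RuntimeError(msg)
    return strategy

result = run_with_recovery(
    "demo",
    attempt=failing("flaky tool"),
    fallback=failing("fallback down"),
    degrade=lambda t: None,
    escalate=lambda t: f"escalated: {t}",
)
print(result)  # → escalated: demo
```

The point of the ladder is ordering: each stage is strictly cheaper in human attention than the next, so escalation only fires when the agent has genuinely exhausted its options.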
Consolidated Concept Definitions
Building on agentic-engineering-patterns.md with definitions surfaced by this research:
| Concept | Definition |
|---|---|
| Agentic Engineering | The discipline of developing production software with coding agents — encompassing prompt design, context management, verification, and iterative refinement. Distinguished from vibe coding by review and quality expectations. |
| Tool Loop | Core agent architecture: prompt LLM → execute requested tools → feed results back → repeat until done. |
| Harness | The orchestration layer around an LLM: system prompt, tool definitions, conversation replay, token management. |
| Context Engineering | Controlling what enters the model's context to maximize output quality. The highest-leverage skill in agentic engineering. |
| Context Window | Maximum tokens an LLM processes at once. Current frontier models reach ~1M, but output quality typically degrades above ~200K. |
| Subagent | Fresh agent instance with clean context dispatched by a parent for a scoped subtask. |
| System Prompt / SOUL | Hidden instructions defining agent identity, capabilities, and constraints. In b4arena: SOUL.md. |
| ReAct | Interleaved Reasoning (Thought) → Action → Observation cycle. Reduces hallucination by grounding reasoning in tool results. |
| Plan-and-Execute | Decoupled planning then execution. Faster/cheaper than ReAct for multi-step problems because the planner isn't consulted after each action. |
| Prompt Chaining | Sequential step-by-step execution with validation gates between steps. |
| Orchestrator-Workers | Central LLM dynamically breaks tasks and delegates to worker agents. |
| Evaluator-Optimizer | Generator + evaluator in a feedback loop for iterative improvement. |
| Lethal Trifecta | Dangerous combination: (1) private data access, (2) exposure to attacker content, (3) exfiltration channel. |
| Vibe Coding | Prompting LLMs to generate code without reviewing or understanding it. Prototype-only. |
| Red/Green TDD | Agent writes failing test first, then implements to pass. "Tests are effectively free now." |
| Conformance-Driven Dev | Test suite derived from multiple existing implementations; implement against those tests. |
| Four-Tier Execution | (b4arena) Token-cost routing: Tier 1 (shell, 0 tokens) → Tier 4 (full reasoning). "Intern Test" decides tier. |
| SOUL.md | (b4arena) Modular system prompt: Identity → Principles → Wake-Up → Workflows → Escalation → Rules. |
| Design Decision Gate | (b4arena) Mandatory check: does the task contain a design decision? If yes, architect decides before developer implements. |
| Handoff Contract | Explicit input/output schema between agent boundaries. Absence is the #1 multi-agent failure mode. |
| Scope-by-Default | Agents pull information on demand rather than receiving all context upfront. Reduces token waste. |
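The tool loop defined above fits in a dozen lines. A minimal sketch with a scripted stand-in for the model, so no provider API is assumed:

```python
# Core tool loop: prompt the model, execute any tool it requests, feed the
# result back, repeat until the model answers directly. `model` is any
# callable taking the message list and returning either a tool request or a
# final answer; the stub below scripts one tool call, then a reply.

def tool_loop(model, tools, prompt, max_turns=5):
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):
        reply = model(messages)
        if "tool" not in reply:                        # final answer
            return reply["content"]
        result = tools[reply["tool"]](reply["args"])   # run requested tool
        messages.append({"role": "tool", "content": result})
    return "stopped: max turns exceeded"

scripted = iter([
    {"tool": "grep", "args": "TODO"},   # turn 1: model asks for a tool
    {"content": "found 1 TODO"},        # turn 2: model answers
])
answer = tool_loop(lambda msgs: next(scripted),
                   {"grep": lambda query: f"1 match for {query}"},
                   "how many TODOs are in the repo?")
print(answer)  # → found 1 TODO
```

Note the `max_turns` guard: bounding the loop is the simplest defense against the runaway token spend described under community findings.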
Trade-offs
| Approach | Pros | Cons |
|---|---|---|
| Minimal agent (Zechner) | Full observability, low complexity, competitive benchmarks | No built-in coordination for multi-agent teams |
| Feature-rich agent (Claude Code) | Subagents, MCP, specialist roles, IDE integration | MCP consumes 7–9% of the context window; subagents are opaque |
| Multi-agent team (b4arena) | Separation of concerns, architecture coherence, review layers | Coordination overhead, handoff risk, token multiplication |
| Single-agent + tools | Simple, cheap, easy to debug | Context exhaustion on complex tasks |
| Plan-and-Execute | Cheaper for multi-step; forces foresight | Rigid plans break on unexpected discoveries |
| ReAct (interleaved) | Adaptive; adjusts to observations | More expensive; planner consulted after every action |
Open Questions
- How should b4arena measure agent effectiveness? (token spend per tier, bead cycle time, escalation rate?)
- Should ca-leash sessions emit structured traces for post-hoc analysis?
- Is the Four-Tier Framework's "Intern Test" heuristic sufficient, or does it need formalization?
- How to handle context window exhaustion mid-task in ca-leash? (checkpoint and spawn new session?)
- Should b4arena adopt explicit error classification (transient/rate-limit/auth) in agent SOULs?
Sources
Official Documentation
- Anthropic — Building Effective Agents
- Claude Code — How It Works
- Google Cloud — Agentic AI Design Patterns
- OpenAI — Practical Guide to Building AI Agents
- OpenAI Agents SDK — Multi-Agent Orchestration
- Google ADK — Context-Aware Multi-Agent Framework
- LangChain — Plan-and-Execute Agents
- Prompt Engineering Guide — ReAct
- Prompt Engineering Guide — Function Calling
Community & Practitioner
- Stack Overflow — Are Bugs Inevitable with AI Coding Agents?
- VentureBeat — Why AI Coding Agents Aren't Production-Ready
- Real World Data Science — Deploying Agentic AI
- DevOps.com — Lessons from 2025
- VoltAgent — Multi-Agent LLM Systems
- HackerNews — The Current Hype Around Autonomous Agents
- arXiv — Building AI Coding Agents for the Terminal
Prior Research
- meta/agentic-engineering-patterns.md — Initial synthesis from Willison + Zechner (in b4arena/meta repo root)
Codebase
- ludus/docs/architecture.md — Four-Tier Framework, Watcher, Agent Identity
- ludus/agents/*/SOUL.md — Agent system prompts (main, forge, atlas, rio)
- ludus/agents/forge/TOOLS.md — ca-leash subagent pattern