Tool, skill, or subagent? Decomposing an agent that outgrew its prompt

When does logic belong in a tool, a skill, or a subagent? You'll learn the decision framework by doing: inherit a 402-line inventory agent, decompose it live on Claude Managed Agents, and run evals after every change to see what flips.

May 23, 2026 · 45 min

CHAPTERS

0:19 – 2:20
The “agent that outgrew its prompt” problem: regressions from added capabilities
Will sets up a common real-world scenario: an agent that started small but accumulated requirements, tools, and subagents until performance began to regress. The workshop will focus on restoring performance by choosing the right primitives—tools, skills, and subagents—at the right time.
- Agents often degrade as prompts and integrations grow without architectural redesign
- Symptoms include regressions in previously strong tasks and increased brittleness
- Session focus: decomposing an agent using tools vs skills vs subagents
- Hands-on format with a simulated “overgrown” agent
2:20 – 3:21
Meet Stockpilot: an inventory agent with many capabilities (and growing pains)
He introduces Stockpilot, an inventory management agent for a midsize retailer, and lists its end-to-end responsibilities. None are individually hard, but bolting them on over time created complexity that now harms reliability.
- Stockpilot flags low stock, forecasts demand, selects suppliers, files POs, writes reports
- Complexity comes from additive requirements rather than any single feature
- The workshop simulates a typical customer trajectory of incremental expansion
- Goal is to regain reliability while keeping capabilities
3:21 – 4:21
Current architecture: one orchestrator, huge system prompt, too many tools, tool-wrapped subagents
Will describes the “before” architecture: a single orchestrator with a ~400-line system prompt and 12 tools, including several that are wrappers around isolated-context subagents. The team can inspect this implementation in the repo’s Before folder.
- Single orchestrator at the top controlling everything
- System prompt ballooned to ~400 lines
- 12 tools total; 3 are wrappers around subagents with isolated context
- This design is available in the repo under a “Before” folder
4:21 – 4:51
Why evals are dipping: added subagents and prompt conflicts over time
He walks through how the agent got into this state—new requirements led to new subagents and prompt additions, but architecture wasn’t modernized. Over time, the accumulation reduced eval performance and introduced failure modes.
- Forecasting and report-writing were added as subagents as requirements grew
- Capabilities were “bolted on” rather than re-architected
- More moving parts increased brittleness and regression risk
- Evals provide the signal that performance is slipping
4:51 – 6:54
Eval suite overview: regression vs failure-mode tasks and grading methods
Will outlines the evaluation setup: 12 tasks using five graders, including both deterministic and LLM-judge components. He distinguishes single-turn regression tests from multi-turn failure-mode tests and explains what metrics they capture.
- 12 eval tasks across 5 grader types
- Regression (R*) evals: realistic single-turn tasks
- Failure-mode (F*) evals: more complex multi-turn tasks
- Deterministic graders track tokens, latency, turns; LLM-as-judge grades quality/tone/style
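
A deterministic grader of the kind listed above can be sketched as a set of budget checks on tokens, latency, and turns. This is a minimal illustration and not the workshop's grader: `RunMetrics`, `Budget`, `grade_efficiency`, and every threshold below are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    """Measurements recorded for one eval run (hypothetical shape)."""
    total_tokens: int
    latency_s: float
    turns: int


@dataclass
class Budget:
    """Upper limits a run must stay within to pass (hypothetical values)."""
    max_tokens: int
    max_latency_s: float
    max_turns: int


def grade_efficiency(run: RunMetrics, budget: Budget) -> dict:
    """Pass only if every metric is within its budget; report each check."""
    checks = {
        "tokens": run.total_tokens <= budget.max_tokens,
        "latency": run.latency_s <= budget.max_latency_s,
        "turns": run.turns <= budget.max_turns,
    }
    return {"passed": all(checks.values()), "checks": checks}


# A run that reaches the right answer but takes too many turns still fails.
run = RunMetrics(total_tokens=48_000, latency_s=41.0, turns=14)
budget = Budget(max_tokens=60_000, max_latency_s=60.0, max_turns=8)
result = grade_efficiency(run, budget)
print(result["passed"])  # False: the turn budget is exceeded
```

Checks like these need no model call, so they are cheap to rerun after every architectural change; quality, tone, and style are left to the LLM judge.
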
6:54 – 8:55
Three concrete failures to fix: inefficiency, orchestrator–subagent comms, and conflicting policies
He previews key failing evals and their underlying causes: the agent takes inefficient paths, subagent communication breaks down, and prompt policies contradict each other. These represent common failure patterns in complex agent systems.
- F1 fails due to a correct but inefficient/winding execution path
- F2 fails due to breakdown between orchestrator and subagent despite correct subagent work
- R8 fails due to contradictory policies in different parts of the system prompt
- These map to common real-world issues: efficiency, coordination, and prompt drift
8:55 – 10:26
Deep dive on R8: correct retrieval, wrong calculation from context confusion
Will drills into R8 to show the pattern: the agent fetches the correct baseline and promo multiplier but then uses an incorrect multiplier in the final computation. He attributes this to context/prompt issues (not model capability), driven by a long, conflicting system prompt.
- Agent retrieves correct baseline (12/day) and promo multiplier (3.1x)
- Final calculation uses an incorrect value (1.35), indicating confusion/hallucination
- Root cause: context problems and conflicts in the oversized system prompt
- Motivation for restructuring knowledge and policies
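
The size of the R8 error can be worked through in a few lines. The sketch assumes the promo forecast is simply baseline × multiplier, which the summary implies but does not state; `promo_forecast`, `answer_matches`, and the 1% tolerance are hypothetical.

```python
import math

BASELINE_UNITS_PER_DAY = 12  # retrieved correctly by the agent
PROMO_MULTIPLIER = 3.1       # retrieved correctly by the agent
MULTIPLIER_USED = 1.35       # value that appeared in the final calculation


def promo_forecast(baseline: float, multiplier: float) -> float:
    """Assumed formula: daily demand during the promo."""
    return baseline * multiplier


def answer_matches(answer: float, baseline: float, multiplier: float,
                   rel_tol: float = 0.01) -> bool:
    """Deterministic check: is the reported figure within tolerance?"""
    expected = promo_forecast(baseline, multiplier)
    return math.isclose(answer, expected, rel_tol=rel_tol)


expected = promo_forecast(BASELINE_UNITS_PER_DAY, PROMO_MULTIPLIER)
reported = promo_forecast(BASELINE_UNITS_PER_DAY, MULTIPLIER_USED)

print(round(expected, 1))  # 37.2
print(round(reported, 1))  # 16.2
print(answer_matches(reported, BASELINE_UNITS_PER_DAY, PROMO_MULTIPLIER))  # False
```

Under that assumption the agent under-forecasts by about 21 units per day even though both inputs were retrieved correctly, which is why the failure points to context rather than retrieval.
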
10:26 – 10:56
Workshop plan: baseline → triage → architectural changes → hill-climb on evals
He lays out the hands-on approach: run evals, diagnose failures, change the architecture, and rerun evals iteratively to improve. The objective is to raise pass rate meaningfully and reduce costly failure rates in production contexts.
- Start by running the full eval suite to establish a baseline
- Triage failures, then update agent design (tools/skills/subagents)
- Iteratively rerun evals to “hill climb” improvements
- Target is higher reliability than the initial pass rate
10:56 – 12:58
Migrating from a custom Messages API harness to Claude Managed Agents (CMA)
Will explains why CMA matters: it offloads operational complexity (hosting, scaling, security, session handling) so builders can focus on agent design. CMA separates the agent logic from the execution environment where tools run.
- Messages API version exists but requires maintaining an agent loop/harness
- CMA handles scaling to many concurrent users, sandboxing, and safety/security concerns
- CMA separates agent logic from session/environment details
- This enables faster iteration on architecture choices
12:58 – 17:00
Hands-on setup: repo structure, running evals, deploying the starter CMA agent
He walks through practical steps: clone repo, install dependencies via uv, set API key, run evals against the “before” agent, and optionally deploy the “starter” CMA agent. Participants can compare the Messages API agent vs CMA-based version.
- Clone repo; run `uv sync` to install dependencies
- Create API key and populate `.env` from the example
- Run evals: `uv run evals --agent before`
- Repo has `before/` (Messages API harness) and `starter/` (CMA); deploy via `uv run deploy starter`
17:00 – 20:33
Using Claude Code to run evals and diagnose: baseline drops to 62% with failure themes
Will demonstrates running evals inside Claude Code (Opus 4.7) to get automated triage. The observed pass rate is 62% (7/12), and Claude summarizes likely failure themes like missing tool support, output-structure mismatches, and policy confusion.
- Claude Code runs `uv run evals --agent before` and summarizes results
- Observed baseline in demo: 62% pass (7/12), lower than expected going into the session
- Diagnosis themes: model doing work that should be tool-supported
- Other themes: output structure enforcement issues and system-prompt policy conflicts
20:33 – 25:07
Fix #1 — Replace bloated system prompt with skills for progressive disclosure
He introduces skills as composable, on-demand context that Claude can pull when needed, instead of keeping everything in the system prompt. The workshop refactors the ~400-line prompt down to a short prompt and shifts procedures/policies into skills to reduce context pollution.
- Skills package information for “sometimes-needed” procedures/policies
- Keep system prompt to always-needed guidance only
- Progressive disclosure reduces context pollution and confusion
- Refactor: activate skills and shrink system prompt dramatically
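
Progressive disclosure can be illustrated with a small registry: only each skill's name and one-line description sit in the system prompt, and the full body is fetched when a task calls for it. This is a sketch of the idea only, not the CMA or skills API; `Skill`, `SKILLS`, `build_system_prompt`, `load_skill`, and the skill text are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    description: str  # short, always visible to the model
    body: str         # full procedure, loaded only on demand


# Hypothetical skills standing in for procedures moved out of the prompt.
SKILLS = {
    "promo-forecasting": Skill(
        name="promo-forecasting",
        description="How to adjust demand forecasts during promotions.",
        body="Multiply baseline daily demand by the promo multiplier "
             "retrieved for the SKU. Do not reuse multipliers from other SKUs.",
    ),
    "po-filing": Skill(
        name="po-filing",
        description="Steps and approvals for filing a purchase order.",
        body="Confirm supplier, quantity, and approval before filing.",
    ),
}


def build_system_prompt(core: str, skills: dict[str, Skill]) -> str:
    """Always-needed guidance plus an index of skills, never their bodies."""
    index = "\n".join(f"- {s.name}: {s.description}" for s in skills.values())
    return f"{core}\n\nAvailable skills (load on demand):\n{index}"


def load_skill(name: str, skills: dict[str, Skill]) -> str:
    """Return the full body of one skill when the task requires it."""
    return skills[name].body


prompt = build_system_prompt("You are an inventory agent.", SKILLS)
print(prompt)
print(load_skill("promo-forecasting", SKILLS))
```

The design choice is that two policies which might contradict each other never share the context window unless both are relevant to the current task.
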
25:07 – 31:11
Fix #2 — Tool simplification: prefer human-like primitives (code execution, filesystem) over many bespoke tools
Will explains Anthropic’s tool philosophy: start with general, human-like primitives (like Claude Code has), then add custom tools only where necessary. In Stockpilot, many specialized tools are removed and replaced with built-in primitives, reducing token usage and improving efficiency.
- Start with primitives: code execution, filesystem navigation, web search, to-do lists
- Better to run Python over CSVs than to stuff large data into context
- CMA includes many primitives by default, reducing custom tool burden
- Refactor removes most bespoke tools and relies on a smaller, more powerful core set
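
The point about running Python over CSVs can be shown with a low-stock check: the script reads the file and returns only the few rows that matter, so the full table never enters the model's context. The column names, sample data, and `low_stock_skus` are hypothetical; the workshop's actual data layout is not shown in this summary.

```python
import csv
import io

# Hypothetical inventory extract; in practice this would be a file
# in the agent's environment rather than an inline string.
INVENTORY_CSV = """sku,on_hand,reorder_point
A-100,4,10
B-200,25,10
C-300,0,5
"""


def low_stock_skus(csv_text: str) -> list[str]:
    """Return SKUs whose on-hand quantity is below the reorder point."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["sku"]
        for row in reader
        if int(row["on_hand"]) < int(row["reorder_point"])
    ]


print(low_stock_skus(INVENTORY_CSV))  # ['A-100', 'C-300']
```

With a real inventory of thousands of rows, the model would see only the short result list, which is where the token savings come from.
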
31:11 – 33:41
Tooling strategy guidance: when to use MCP vs local tools vs code-based tool execution
He addresses common questions about MCP: many teams jump to MCP too early and end up with overlapping/chaotic tool ecosystems. Recommended progression: primitives first, then custom local tools, then MCP only for shared, governed toolsets; sometimes code execution via CLIs/APIs can replace MCP and reduce context overhead.
- Avoid “MCP first” when it creates overlapping servers and context clutter
- Sequence: Claude Code primitives → custom tools → MCP only for shared standardized tools
- Code execution can invoke CLIs/APIs as a flexible alternative to MCP
- MCP can increase context usage; use when governance/reuse across clients is needed
33:41 – 40:47
Results and subagents: efficiency gains, when to keep subagents, and CMA callable agents
Will shows improvements (especially token, cost, and time reductions) and then reframes subagents: use them either to parallelize work or to get a fresh, independent reviewer. The workshop keeps a forecasting subagent, migrates away from tool-wrapped subagents, and uses CMA’s native “callable agents” for better observability and coordination.
- Measured gains: large token reduction, lower cost, faster execution for tasks
- Subagents are best for parallelization or independent “fresh mind” review
- Keep forecasting as a separate subagent to avoid context contamination
- Use CMA callable agents for native subagent support, logging, and observability; avoid tool-wrapped subagent hacks
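
The two reasons for keeping a subagent, parallel work and a fresh context, can be sketched with a thread pool. This does not use CMA callable agents or any model API: `run_subagent` is a hypothetical stand-in that makes no model call, and the briefs are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(brief: str) -> str:
    """Stand-in for a subagent call.

    The subagent receives only its brief, not the orchestrator's history,
    which is what gives it an uncontaminated context.
    """
    context = [brief]  # fresh context: nothing else is carried over
    return f"done: {context[0]}"


def fan_out(briefs: list[str]) -> list[str]:
    """Run independent briefs concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(run_subagent, briefs))


results = fan_out([
    "forecast demand for SKU A-100",
    "forecast demand for SKU C-300",
])
print(results)
# ['done: forecast demand for SKU A-100', 'done: forecast demand for SKU C-300']
```

If the briefs depend on each other or need the orchestrator's full history, neither benefit applies and the work is better kept in the main agent.
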
40:47 – 45:05
Final architecture and takeaways: simpler stack, higher eval score, and hill-climbing discipline
He summarizes the end state: orchestrator on CMA, minimal tools (Bash/Read/Write), a very short system prompt, and business logic moved into skills, with only essential subagents retained. He closes with key principles—start simple, use skills for progressive disclosure, and continuously hill-climb using evals that evolve with product scope.
- End state: CMA orchestrator; tools reduced to a small core; data synced into environment
- System prompt reduced to ~15 lines; business logic moved to skills
- Eval score improves to ~92% with reduced token usage and acceptable latency tradeoffs
- Takeaways: start with primitives, use skills for on-demand knowledge, maintain and evolve evals to guide iteration
