
Tool, skill, or subagent? Decomposing an agent that outgrew its prompt

When does logic belong in a tool, a skill, or a subagent? You'll learn the decision framework by doing: inherit a 402-line inventory agent, decompose it live on Claude Managed Agents, and run evals after every change to see what flips.

May 23, 2026 · 45m

CHAPTERS

  1. Why agents degrade as prompts and toolsets grow

    Will sets up a common failure mode: an agent that starts simple, accumulates requirements, and turns into a brittle system with a huge system prompt, many tools, and multiple subagents. As complexity increases, performance regresses—especially in areas the agent originally handled well.

  2. Meet Stockpilot: the inventory agent that outgrew its architecture

    The workshop centers on Stockpilot, an inventory management agent for a mid-sized retailer. It supports stock alerts, demand forecasting, supplier selection, PO filing, and weekly reporting. These capabilities become a problem once they are all packed into one agent without rethinking its architecture.

  3. Current “before” architecture: single orchestrator + long prompt + tool/subagent sprawl

    Will walks through the starting architecture: one orchestrator with a ~400-line system prompt, 12 tools, and 3 tools that wrap subagents, each with its own isolated context window. This design makes it hard to keep behavior consistent and adds failure points.

  4. Eval suite overview: regression vs failure-mode tasks and grader types

    The agent is measured with 12 eval tasks across 5 grader types. Will distinguishes single-turn regression tasks (R*) from multi-turn failure-mode tasks (F*), and explains deterministic metrics (latency/turns/tokens) versus LLM-judge scoring (tone/quality/style).
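
    The split between task kinds and grader types can be sketched as a small data model. This is an illustrative sketch only, not the workshop's harness: `RunResult`, `EvalTask`, the grader helpers, and every field name are hypothetical, and the LLM judge is passed in as a caller-supplied function rather than a real model call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunResult:
    """Outcome of one agent run (all field names are hypothetical)."""
    output: str
    turns: int
    tokens: int
    latency_s: float


Grader = Callable[[RunResult], bool]


def max_turns(limit: int) -> Grader:
    """Deterministic grader: pass if the run used at most `limit` turns."""
    return lambda run: run.turns <= limit


def max_tokens(limit: int) -> Grader:
    """Deterministic grader: pass if the run used at most `limit` tokens."""
    return lambda run: run.tokens <= limit


def llm_judge(rubric: str, judge: Callable[[str, str], bool]) -> Grader:
    """Model-graded check. `judge(rubric, output)` is supplied by the caller,
    for example a function that asks a model to score tone or style."""
    return lambda run: judge(rubric, run.output)


@dataclass
class EvalTask:
    task_id: str
    prompt: str
    graders: list[Grader] = field(default_factory=list)

    @property
    def kind(self) -> str:
        # R* = single-turn regression, F* = multi-turn failure mode.
        return "regression" if self.task_id.startswith("R") else "failure-mode"

    def passed(self, run: RunResult) -> bool:
        # A task passes only if every attached grader passes.
        return all(grader(run) for grader in self.graders)
```

    Keeping deterministic graders and judge-based graders behind the same callable type lets one task mix both, for example a turn budget alongside a tone rubric.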

  5. Three concrete failures: inefficiency, subagent communication breakdown, and conflicting prompt policies

    Will highlights specific failing evals to illustrate root causes. F1 fails due to an inefficient, winding tool path; F2 fails due to orchestrator↔subagent miscommunication; R8 fails because contradictory policies in a long prompt confuse the agent.

  6. Deep dive on R8: the “right inputs, wrong calculation” context failure

    R8 shows the agent retrieving the correct baseline and promo multiplier, then hallucinating during the calculation (using the wrong multiplier). Will frames this as a context and prompt-structure problem, not a model capability problem.
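
    One way to picture the fix for this class of failure is to move the arithmetic out of the model's head and into code that receives the retrieved values directly. The sketch below is an assumption-laden illustration: the source does not give Stockpilot's formula, so `promo_forecast` simply assumes forecast = baseline × multiplier, and the numbers are made up.

```python
def promo_forecast(baseline_units: float, promo_multiplier: float) -> float:
    """Forecast demand during a promotion.

    Assumes forecast = baseline * multiplier; the real Stockpilot formula
    is not given in the source.
    """
    if baseline_units < 0 or promo_multiplier <= 0:
        raise ValueError("baseline must be >= 0 and multiplier must be > 0")
    return baseline_units * promo_multiplier


# Hypothetical values, standing in for what the agent retrieved.
retrieved = {"baseline_units": 120.0, "promo_multiplier": 1.5}

# The retrieved values flow straight into the calculation, so the
# multiplier cannot silently change between retrieval and use.
forecast = promo_forecast(**retrieved)
print(forecast)  # 180.0
```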

  7. Workshop plan: run evals, triage, then “hill climb” improvements

    The process is: establish a baseline, diagnose failures, change the architecture, and rerun evals, repeating the loop to climb performance. Will also introduces the migration from a DIY harness built on the Messages API to Claude Managed Agents (CMA), which offloads infrastructure concerns.
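
    The "rerun and compare" step of this loop can be sketched as a diff between two eval runs. This is a minimal illustration under assumptions: results are modeled as a plain mapping from task ID to pass/fail, the helper names are hypothetical, and the sample results are invented rather than taken from the workshop.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of tasks that passed (0.0 for an empty run)."""
    return sum(results.values()) / len(results) if results else 0.0


def diff_runs(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """Report which tasks flipped between two runs of the same suite."""
    shared = sorted(before.keys() & after.keys())
    return {
        "fixed": [t for t in shared if not before[t] and after[t]],
        "regressed": [t for t in shared if before[t] and not after[t]],
    }


# Invented results for illustration only.
baseline = {"R1": True, "R8": False, "F1": False, "F2": False}
candidate = {"R1": True, "R8": True, "F1": True, "F2": False}

flips = diff_runs(baseline, candidate)
print(flips)  # {'fixed': ['F1', 'R8'], 'regressed': []}
print(pass_rate(baseline), pass_rate(candidate))  # 0.25 0.75
```

    Tracking regressions separately from the headline pass rate matters here, because a change that fixes a failure-mode task can quietly break a regression task the agent used to handle well.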

  8. Hands-on setup: repo structure, commands, and deploying to Claude Managed Agents

    Will walks through practical steps: clone the repo, install dependencies with uv, add an API key, run evals on the “before” agent, and optionally deploy the “starter” CMA version. The repo contains both the original harness and the CMA-based agent for comparison.

  9. Using Claude Code to triage eval failures and spot themes

    Will demonstrates running evals inside Claude Code to accelerate diagnosis. The example run scores 62% (7/12), and Claude identifies themes like missing tools for work the model is doing “in-head,” output-structure mismatches, and policy confusion from the long prompt.

  10. Refactor #1: replace a bloated system prompt with progressive disclosure via skills

    Will introduces skills as composable instruction bundles that Claude can pull into context only when needed. The refactor shrinks the system prompt dramatically (from hundreds of lines to a short core prompt) while moving procedures/policies into skills to reduce context pollution and conflicts.
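
    The idea of progressive disclosure can be sketched in a few lines: the system prompt carries only a short index of skills, and a skill's full body enters context only when it is requested. This is an in-memory illustration, not the actual skills mechanism; `Skill`, `build_system_prompt`, `load_skill`, the skill names, and the placeholder bodies are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    description: str  # short summary, always visible to the model
    body: str         # full procedure, loaded only on demand


CORE_PROMPT = "You are Stockpilot, an inventory agent for a retailer."

# Placeholder bodies stand in for long procedures and policies.
SKILLS = {
    skill.name: skill
    for skill in [
        Skill("promo-forecasting",
              "How to adjust demand forecasts for promotions.",
              "placeholder procedure line\n" * 40),
        Skill("po-filing",
              "How to file a purchase order with a supplier.",
              "placeholder procedure line\n" * 60),
    ]
}


def build_system_prompt(core: str, skills: dict[str, Skill]) -> str:
    """Core prompt plus a one-line index entry per skill (no bodies)."""
    index = "\n".join(f"- {s.name}: {s.description}" for s in skills.values())
    return f"{core}\n\nAvailable skills (load one when relevant):\n{index}"


def load_skill(name: str, skills: dict[str, Skill]) -> str:
    """Return the full body of one skill, to be added to context on demand."""
    return skills[name].body


prompt = build_system_prompt(CORE_PROMPT, SKILLS)
```

    Because only one skill body is in context at a time, policies that would contradict each other in a single long prompt no longer sit side by side.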

  11. Refactor #2: simplify tools by leaning on human-like primitives (bash/read/write/code execution)

    Will advises starting with foundational “computer primitives” (file system, code execution, web search, to-do lists) before building many bespoke tools. For Stockpilot, many custom tools are removed and replaced by built-in primitives in CMA, improving token usage, cost, and time.
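
    As a rough illustration of what replacing a bespoke tool with primitives can look like, the script below is the kind of short program an agent could write and run through code execution instead of calling a dedicated low-stock tool. Everything in it is assumed for the example: the CSV columns, the SKUs, and the `low_stock` helper are hypothetical and do not come from the Stockpilot repo.

```python
import csv
import io

# Hypothetical inventory export; in practice this would be a file the
# agent reads with its file-system primitive.
INVENTORY_CSV = """sku,on_hand,reorder_point
WIDGET-1,4,10
WIDGET-2,25,10
GADGET-7,10,12
"""


def low_stock(csv_text: str) -> list[str]:
    """Return SKUs whose on-hand quantity is below the reorder point."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [r["sku"] for r in rows if int(r["on_hand"]) < int(r["reorder_point"])]


print(low_stock(INVENTORY_CSV))  # ['WIDGET-1', 'GADGET-7']
```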

  12. Tooling strategy and where MCP fits: avoid chaos, standardize only when shared

    Will addresses MCP adoption patterns: many teams reach for MCP too early and end up with overlapping, ungoverned servers that add context overhead. Recommended progression: primitives first, then local custom tools, then MCP only when multiple clients/agents need a governed shared toolset; sometimes CLIs/API calls via code execution can replace MCP.
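
    The recommended progression can be written down as a small decision rule. This is a simplified sketch of the heuristic as summarized above, not an official guideline; the function name, its parameters, and the returned labels are hypothetical.

```python
def choose_tool_layer(primitives_suffice: bool, consumers: int, needs_governance: bool) -> str:
    """Pick the lightest layer that covers the need.

    primitives_suffice: built-in primitives (files, code execution, CLI or
        API calls from code) already cover the task.
    consumers: how many clients or agents need this capability.
    needs_governance: whether the toolset must be shared under central control.
    """
    if primitives_suffice:
        return "primitives"
    if consumers > 1 and needs_governance:
        return "mcp"
    return "local custom tool"


print(choose_tool_layer(True, 3, True))    # primitives
print(choose_tool_layer(False, 1, False))  # local custom tool
print(choose_tool_layer(False, 4, True))   # mcp
```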

  13. Subagents: when to use them, when to remove them, and CMA-native callable agents

    Will frames two strong subagent use cases: parallelizing work (“throw more Claude at it”) and getting a fresh, isolated perspective (e.g., code review, or keeping forecasting separated). The workshop removes unnecessary subagents, keeps a forecasting subagent, and switches from “subagent-as-tool wrapper” to CMA’s native callable agents for better observability and communication reliability.
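
    The parallelization use case can be sketched with `asyncio`. This does not show CMA's callable-agents API, which the summary does not describe; `run_subagent` is a hypothetical stand-in that returns a canned string, and the point is only the shape: each subagent receives its own brief and nothing else, and the orchestrator collects the results in order.

```python
import asyncio


async def run_subagent(brief: str) -> str:
    """Stand-in for a real subagent call; it sees only `brief`,
    which models the isolated context window."""
    await asyncio.sleep(0)  # placeholder for model and network latency
    return f"result for: {brief}"


async def fan_out(briefs: list[str]) -> list[str]:
    """Run one subagent per brief concurrently.

    asyncio.gather returns results in the same order as the inputs.
    """
    return await asyncio.gather(*(run_subagent(b) for b in briefs))


if __name__ == "__main__":
    results = asyncio.run(fan_out(["forecast SKU A", "forecast SKU B"]))
    print(results)
```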

  14. Final architecture and results: simpler prompt, fewer tools, one focused subagent, higher eval score

    Will summarizes the “after” state: orchestrator on CMA, tools simplified to a small set of primitives, business logic moved into skills, and only a targeted forecasting subagent retained. This design supports eval-driven iteration and reaches a reported pass rate of ~92%, with fewer tokens, lower cost, and often faster execution.

  15. Closing takeaways: start simple, disclose progressively, and hill-climb with evals

    Will closes with guiding principles for robust agent design: begin with a single loop and simple primitives, use skills to avoid stuffing prompts, and iterate with eval baselines and architectural tweaks. He emphasizes keeping evals current as product capabilities expand to ensure you’re measuring what matters.
