Tool, skill, or subagent? Decomposing an agent that outgrew its prompt

When does logic belong in a tool, a skill, or a subagent? You'll learn the decision framework by doing: inherit a 402-line inventory agent, decompose it live on Claude Managed Agents, and run evals after every change to see what flips.

May 23, 202645mWatch on YouTube ↗

WHAT IT’S REALLY ABOUT

Fixing bloated AI agents using skills, tools, and subagents wisely

The talk simulates a real-world agent (Stockpilot) that accumulated requirements over time, resulting in a ~400-line system prompt, many tools, multiple subagents, and performance regressions.
The team uses a suite of 12 evals (regressions and multi-turn failure modes) with both deterministic and LLM-judge grading to diagnose inefficiency, communication breakdowns, and prompt-policy contradictions.
They modernize the architecture by moving “sometimes-needed” business logic out of the system prompt into skills for progressive disclosure, reducing context pollution and conflicting instructions.
They simplify tooling by favoring “human-like primitives” (code execution, filesystem operations) over a proliferation of bespoke tools, dramatically reducing token usage, latency, and cost on key tasks.
They retain subagents only where they provide clear benefits (parallelization or “fresh mind” separation), and migrate to Claude Managed Agents to offload infrastructure concerns and improve observability around subagent behavior.

IDEAS WORTH REMEMBERING

5 ideas

Treat system prompts as “always-needed,” and push the rest into skills.

A long prompt eventually contains redundant and contradictory policies; skills let the model pull in domain procedures only when relevant, reducing confusion and context waste.

Use evals to find architectural failures, not just model failures.

Examples included inefficient paths that miss efficiency thresholds (F1), orchestrator–subagent communication mismatches (F2), and conflicting prompt policies causing incorrect math (R8).

Start with general-purpose primitives before adding many custom tools.

Giving the agent code execution and basic filesystem read/write often outperforms tool-per-task designs, especially for data work (CSVs/forecasting inputs), and reduces token load dramatically.

Subagents are best for parallel work or for intentional context separation.

They’re useful when you want “more Claude” on a problem (research/exploration) or a clean reviewer/specialist that isn’t biased by the main conversation—e.g., keeping forecasting isolated.

Prefer native managed subagent patterns over “subagent-as-a-tool” wrappers.

Claude Managed Agents’ callable agents improve observability and reduce the common failure mode where orchestration instructions get lost across agent boundaries.

WORDS WORTH SAVING

5 quotes

This pattern continued and continued until, before you know it, your system prompt had grown to become several hundred lines long.

— Will

This isn't a model problem, it's an issue with our-- the information that we're surrounding the model with.

— Will

Leave the system prompt only for the information that Claude needs in its mind, regardless of the task that you give it.

— Will

Whenever we build agents, we lean into the same primitives, um, that we as humans have access to.

— Will

You have evals, you establish a baseline, you then tweak your architecture, and you rerun evals, and you get better over time.

— Will

Agent sprawl and prompt bloatTools vs skills vs subagents (selection criteria)Eval suites: regression vs failure-mode tasksDeterministic metrics vs LLM-as-judge gradingProgressive disclosure via skillsReplacing bespoke tools with primitive capabilities (bash/files)Claude Managed Agents deployment, scaling, observability, and callable agents

High quality AI-generated summary created from speaker-labeled transcript.

Get more out of YouTube videos.

High quality summaries for YouTube videos. Accurate transcripts to search & find moments. Powered by ChatGPT & Claude AI.