Tool, skill, or subagent? Decomposing an agent that outgrew its prompt

When does logic belong in a tool, a skill, or a subagent? You'll learn the decision framework by doing: inherit a 402-line inventory agent, decompose it live on Claude Managed Agents, and run evals after every change to see what flips.

May 23, 2026 · 45m

EVERY SPOKEN WORD

40 min read · 7,673 words

0:00 – 0:19
Intro
1. SPSpeaker
  [on-hold music]
0:19 – 2:20
The “agent that outgrew its prompt” problem: regressions from added capabilities
1. SPSpeaker
  All right. Fantastic. Can everyone hear me? Thumbs up, all good? All right. Everyone, I hope that you have had a fantastic day at Code with Claude London so far today. My name is Will. I'm on our engineering team at Anthropic. I sit on a team called Applied AI. What that means is I essentially split my time between internal engineering work and time spent building agents with customers. So folks, imagine that you built and shipped an agent to solve a problem. I'm sure that's something that a lot of the folks in this room have actually done. And imagine this, that this agent worked fantastic, right? But it worked so well that a few weeks after shipping, you were asked to add some additional capability to the agent. A few weeks after that, you received more business requirements, and you added additional capability. This pattern continued and continued until, before you know it, your system prompt had grown to become several hundred lines long. You have dozens of tools and subagents that exist for your agent. And because of the complexity, you've started to see regressions in the areas that your agent was previously accelerating in. So if this is you, you're not alone. We see this type of scenario happen pretty commonly with customers and actually with ourselves included in that. So within this workshop, we are gonna simulate an agent that has essentially grown to a complexity where we start to see degradation in its performance. We're then gonna walk through some of the decisions that we as engineers and architects make, um, in order to improve the design of our agent to restore the performance that we expect with the additional capabilities. Specifically, we're gonna make some decisions around tools and skills and subagents. As we modernize the stack of our agent, we wanna
2:20 – 3:21
Meet Stockpilot: an inventory agent with many capabilities (and growing pains)
1. SPSpeaker
  make sure that we're using the right agentic primitives at the right time. So when do you use a tool, when do you use a skill, and when do you use a subagent? We're gonna talk through all of that in this session. As I mentioned, folks, this session will be hands-on. So let's go ahead and get started. I first want to walk you through our problem statement in our agent. So for the purposes of this session, we're gonna be focusing on an agent called Stockpilot. This is an inventory management agent that was designed by and for a midsize re-- uh, retailer. The agent that you see on the screen can do several things. It can flag low levels of stock, it can forecast demand, it can pick suppliers, it can file POs, and ultimately, it can write weekly reports for the employees of this retailer. Now, none of these capabilities are particularly complex on their own. But again, the issue is that we've essentially bolted capabilities onto
3:21 – 4:21
Current architecture: one orchestrator, huge system prompt, too many tools, tool-wrapped subagents
1. SPSpeaker
  our agent over time without modernizing our architecture. This complexity has started to cause some problems. Let's take a look at the actual architecture today of the agent. Folks, today the agent is facilitated by a single orchestrator, so you see the Stockpilot orchestrator sitting at the top of the screen. The agent has a system prompt, as I mentioned, that's grown to be about four hundred lines long. It has twelve different tools. Three of those tools happen to be wrappers around subagents with completely isolated context windows. So if you have the repo pulled up, which we'll go into more detail in just a bit, there's an agent that's under a folder called Before, which essentially walks through this, this agent exactly. So again, orchestrator, long system prompt, a lot of tools. We have a lot of subagents. The result of this is that our evals have started to dip. So let's imagine how we got here for a moment. Again,
4:21 – 4:51
Why evals are dipping: added subagents and prompt conflicts over time
1. SPSpeaker
  like, we built that agent up front to solve a really specific problem. We received business requirements to, say, add maybe some forecasting capability to our inventory management agent. So what we decided to do was essentially just spin up a forecaster as a subagent. Again, later on, we, we received more requirements to add report writing capability to our agent, so we decided to add another subagent for that report writing capability. Again, our eval
4:51 – 6:54
Eval suite overview: regression vs failure-mode tasks and grading methods
1. SPSpeaker
  started to dip over time because we added more and more complexity while just bolting this capability on. So let's take just a little bit of time and talk about evals specifically. For this agent, folks, we have twelve different eval tasks across five different types of graders. So my colleague gave a talk on evals shortly before this. Evals will be a component of this workshop, but they won't be the main focus. I'll give you a quick summary of the tactical evals that we're using for this agent. On the left side of the screen, you see some IDs. You see several evals that start with the letter R. This stands for regression. These are more realistic single-turn tasks that we grade the model's capability on. So imagine I give the model a task, the model comprehends that task in the for-- within the agent, uh, calls some tools, and then provides a response back to me. We're essentially evaluating that response. We also have some more complex tasks that we're grading the model on. So you see those F, uh, IDs, the IDs that start with F on the left side of the screen, that stands for failure mode. In this case, we're evaluating the model over a more complicated multi-turn task that we're grading. Now, again, I won't go into evals too specifically. We have a number of different types of graders that are both deterministic and non-deterministic. Right? When I talk about not-- when I talk about, uh, deterministic evals, we're grading things like turn count and, like, latency and, like, the number of tokens that are used as our agent is completing a particular task, and we're tracking those deterministic metrics over time. We're also using the idea of LLM as a judge to evaluate the non-deterministic characteristics of our agent. So personality and tone and style and output quality. We're using a non-deterministic grader as a part of our eval to evaluate our agent's,
6:54 – 8:55
Three concrete failures to fix: inefficiency, orchestrator–subagent comms, and conflicting policies
1. SPSpeaker
  uh, non-deterministic characteristics. Now, we're gonna run the evals for our agent in just a bit. But when you do, you'll find that the agent is struggling a bit. I'll talk about some of these evals in just a little bit more depth. So F one on the screen, third from the bottom, this is essentially simulating a daily low stock sweep. So again, this is an inventory agent. We're simulating our ability to look through all of our inventory and pull the low levels of stock. This eval you'll find will actually fail because the agent is gonna do the right thing, but it's gonna take a very winding path to do so. So instead of taking the straightest line from point A to point B, the agent is gonna take a very inefficient path. It's gonna get to the right end, but it's gonna fail the eval because it's not at the efficiency that we'd like. F two on the screen is another eval that you'll see fail. This eval actually evaluates the ordering process under a particular promotion package. This is going to fail because we are using a subagent for this particular task. The subagent is actually getting the task right, but there's a communication breakdown between our subagent and our orchestrator. This is a really common point of failure that we see when customers have, have really complicated systems with a lot of subagents. It's important to get the communication between your subagents and your orchestrator just right. In the case of F two, like you see on the screen, this is an eval that's gonna fail because we have a breakdown in that communication. The last one that I'll highlight that you'll see fails is R eight on the screen. R eight will essentially check the forecasting during a particular promotion month. This eval is also going to fail because we have two different policies that live in very different parts of our system prompt and actually end up contradicting each other. So I mentioned over time, our system prompt has grown. 
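A deterministic efficiency check like the one that fails F1 could look roughly like this. It is a minimal sketch, not the workshop repo's grader: the RunTrace and EfficiencyBudget names and every threshold are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunTrace:
    """What one eval run recorded (hypothetical shape, not the repo's)."""
    answer_correct: bool
    turns: int
    total_tokens: int
    latency_s: float


@dataclass
class EfficiencyBudget:
    """Upper bounds a run must stay within to pass (illustrative numbers)."""
    max_turns: int = 8
    max_tokens: int = 50_000
    max_latency_s: float = 60.0


def grade(trace: RunTrace, budget: EfficiencyBudget) -> tuple[bool, list[str]]:
    """Pass only if the answer is right AND every metric is within budget."""
    reasons = []
    if not trace.answer_correct:
        reasons.append("wrong answer")
    if trace.turns > budget.max_turns:
        reasons.append(f"turns {trace.turns} > {budget.max_turns}")
    if trace.total_tokens > budget.max_tokens:
        reasons.append(f"tokens {trace.total_tokens} > {budget.max_tokens}")
    if trace.latency_s > budget.max_latency_s:
        reasons.append(f"latency {trace.latency_s:.1f}s > {budget.max_latency_s:.1f}s")
    return (not reasons, reasons)


# A run that reaches the right answer by a winding path still fails.
winding = RunTrace(answer_correct=True, turns=21, total_tokens=210_000, latency_s=95.0)
passed, reasons = grade(winding, EfficiencyBudget())
print(passed, reasons)
```

The point of the design is that correctness and efficiency are graded together, so an agent that gets to the right end by an inefficient path is still reported as a failure, with the reasons listed.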
We start to have some conflicts, and the model gets
8:55 – 10:26
Deep dive on R8: correct retrieval, wrong calculation from context confusion
1. SPSpeaker
  confused, leading towards a failure for this particular eval. Now, in the repo, you'll see it in the README, when we run these evals, you'll see that they're gonna pass upfront at about eighty-three percent, which is okay. But if you work in the world of manufacturing, that is not okay. Seventeen percent failure is a really expensive failure percentage. Now, let's double-click on R eight again just so that we can understand a little bit about what's happening behind the scenes. Again, R eight is where we're essentially calculating the forecast during a particular month with a promotion. And so in my s-- uh, on my screen here, on the right side where you see kind of the simulated terminal window, uh, within the first block under the commented text, we can see that the agent pulled the right forecasting baseline and also pulled the right promotion multiplier. So forecasting baseline, twelve units a day, promotion multiplier, three point one x. This is all correct. But in the calculation part below that, we can see that there was actually some kind of hallucination that happened. Instead of using that three point one x promo multiplier, the agent actually ended up using one point three five. So something happened along the way. A hint here is that the reason for this is that we have context problems. So this isn't a model problem, it's an issue with our-- the information that we're surrounding the model with. Our system prompt has grown to be really long and is very confusing for the model
10:26 – 10:56
Workshop plan: baseline → triage → architectural changes → hill-climb on evals
1. SPSpeaker
  and has some conflicts in it which lead to the issue that shows up within this eval. So folks, our objective in this workshop will first be to run our suite of evals. We're gonna triage the issues, and we're gonna update the design of our agent accordingly. And then we're gonna do something that we call internally hill climbing towards eval improvement, right? So we run our evals, we get a baseline, it's gonna be about eighty-three percent. We're then gonna optimize the architecture
10:56 – 12:58
Migrating from a custom Messages API harness to Claude Managed Agents (CMA)
1. SPSpeaker
  of our agent, and we're gonna continue then running our eval so that we climb on them, hopefully seeing the success percentage, uh, improve over time. In this lab, we're also gonna start with an agent that is self-created on our Messages API. Um, again, if you have the repo and you click on the before folder, I'll show you this in just a bit. This is an agent that is built from scratch on our Messages API. We're gonna actually migrate that a- that agent to Claude Managed Agents. Claude Managed Agents essentially allows us to offload the messiness that comes with maintaining an agentic harness and scaling agents safely and securely to thousands and tens of thousands of users, right? Like, if I wanna build my agent locally and run it locally, I can do that pretty quickly and pretty easily. But the moment that I need to take that agent, I need to host it remotely, and I need to allow hundreds and, and, and thousands of users to, at the same time, engage with that agent, there's an infrastructure problem, there's a scaling problem, there's memory, there's security, there's so much that I have to account for. So in order to offload that, so I can just worry about the architecture of my agent itself and make decisions around tools, skills, and subagents, I'm gonna offload everything else to Claude Managed Agents. So again, to break that down just a bit, um, there's been a few talks on CMA so far today, but this is really where we're able to separate the agent from the session details from the sandboxed environment where tool calls are actually happening. Um, again, this allows us to offload particular parts of the stack to then worry about the-- to then only worry about the design of our agent itself. All
12:58 – 17:00
Hands-on setup: repo structure, running evals, deploying the starter CMA agent
1. SPSpeaker
  right. I mentioned that we're gonna get hands-on in this workshop. We are gonna go ahead and do that right now. Now, what you see on the screen here is the workshop URL as well. Um, if you haven't had a chance to grab it, feel free to go ahead and do so. Um, this is where we're keeping all of the different workshops throughout Code with Claude within London, so you can go back and revisit them, if helpful. Within this workshop, we're gonna be working on agent decomposition, so that's gonna be the name of the folder that we're actually gonna be working within. Great. Let me jump forward here. Perfect. So the first thing that we're gonna do as a part of this workshop is we're first gonna get a baseline. So when you open up that link, you'll first, uh, clone the repo. So we're gonna clone the repo locally. We have a uv project that's set up, so we're gonna run uv sync in order to make sure that we have all of our packages and our dependencies to be able to invoke the Anthropic SDK and then eventually deploy our agent to Claude Managed Agents. So we can run uv sync to do that. I mentioned previously that we're gonna need an API key for this workshop as well. So using those credits that you got at the start of this session, uh, you can go to your Claude console account and create an API key. If you copy the .env example, you'll just have to manually copy your API key into the .env file that's created for you. Now, all the twelve evals that I previously walked you through, we have all of those set up already. So in order to get a baseline and run those evals, you have to run uv run evals --agent before. This is all in the README. But if you just run that command, you will be able to, um, actually go about running your evals. Now, in terms of our building here, we're gonna take a number of steps to actually go about running our evals, using Claude Code to triage the results of them, and then climbing accordingly on our agent. 
Um, so we're first gonna take some-- we're gonna take a look at our s- the system prompt that we have for our agent itself. Um, so I mentioned earlier that our system prompt is currently sitting at about four hundred lines long. We've been stacking information on our system prompt over and over again as we've continued to get more business requirements. So our system prompt is very long. We'll take a look at that. We are then gonna take some time to evaluate the tools that we're using. Right now, as I mentioned, we have twelve different tools. Three of them are actually kind of wrapped subagents, so we'll take a look to see what we can do to make that more efficient. And then lastly, if there are any subagents that we really need to make our agent effective, we're gonna take a look at the best way, um, to actually construct subagents with Claude Managed Agents. I'm gonna jump back just for a moment. There's one thing that I forgot to mention for you as you get started. Um, within the repo folder, there's two different, uh, folders that you'll see. There's a before folder, and then there's a starter folder. Those contain two separate agents. So if you want to view the Messages API version of the agent, again, this is just me building my own agent loop and my own agent harness around the Anthropic Messages API to invoke Claude. You'll see that within the before folder. If you want to view what that agent looks like when deployed on Claude Managed Agents, um, you can look in the starter folder, which exists right below that. If you want to deploy your agent on Claude Managed Agents, you can run uv run deploy starter. So again, run your evals using the Messages API version, uh, --agent before. You can then deploy your agent on Claude Managed Agents. We already had it built for you, and it's really easy to use Claude Code to kind
17:00 – 20:33
Using Claude Code to run evals and diagnose: baseline drops to 62% with failure themes
1. SPSpeaker
  of compare the two, um, and understand exactly what's going on and what some of the differences are, uh, with Claude Managed Agents. Okay, so I'm gonna jump over here, and we're just gonna open up Claude Code, and we are gonna build together. I'm gonna zoom in very far so that you can see everything and so that I can see everything, and we'll just talk through exactly what happens when I run some of these evals, and we'll talk through the process that we usually go through to do what I just called hill climbing on the evals themselves. Okay, so if you're looking at Claude Code here, again, I just used Claude Code to actually run my evals because I want Claude's help in triaging what's going on. Um, so this is me. I'm using Claude Code. I have Opus four point seven running, as you can see on the screen. Um, my effort level is set to extra high. I usually set effort as extra high with Opus four seven, and I forget about it. That's the effort level that I usually stay on. We find that it gets great performance, um, with extra high effort altogether. Now, you can see on the screen the first thing that I did was I ran my eval. So I used the bash capability in Claude Code, and I ran uv run evals --agent before. Claude actually went ahead and ran my eval. So I'm gonna scroll down, and we're gonna look at what Claude found while actually running those. So you can see the response that we got, the results that we got from this eval run was actually lower than what I told you before. So we ran them, and we got sixty-two percent, which is worse than the eighty-three percent that we started with. So we passed seven out of twelve of them. And it looks like Claude has provided us with a diagnosis for the different evals that we actually failed. Let's scroll down just a bit more. And we are gonna use Claude to understand a little bit more about why this actually happened. 
So you can see here, I am using Claude to provide me some of the themes around why we actually failed some of these evals. Again, this is a great technique if you have evals for your agent. Um, again, as Giri showed before this session, you can use Claude to actually go about triaging these. So it looks like there's a few different themes that Claude is figuring out based on this agent. So the first thing, Claude is seeing that our model is taking on a lot of work that it should have tools in order to do. So our model is doing a lot of reasoning across information that it just doesn't have the tools to be able to complete. It looks like there is some issues that we have with the enforcement of output structure. So our model and our subagents are producing information in a particular output structure that doesn't align, um, exactly with, uh, what we're looking for with, um, to, to pull the best performance from, uh, from our agent. If I continue to scroll down here, you can see there was some policy issues, et cetera. Um, as I mentioned before, we have a system prompt that's really long right now. Um, and so Claude is seeing some confusions based on the information that's found within the system prompt. So again, you can see Claude has found some root causes. Now, we're gonna do a few different things here. Again, we're gonna go one by one and address some of the areas, um, that we're seeing issues on within our agent. So I'm gonna scroll down here, and we are gonna use, um, Claude Code to triage
20:33 – 25:07
Fix #1 — Replace bloated system prompt with skills for progressive disclosure
1. SPSpeaker
  some things within our agent. Okay. So the first thing that I'm gonna ask Claude to do, we're gonna talk through this. Claude is making some changes, which is great. Um, system prompts tend to get very, very long when we accumulate agents over time. So the first prompt that I ran, if you're following along, feel free to go ahead and do this. I encouraged Claude to look at my agent.py file, which is where our main CMA, um, agent loop is located. Again, that's agent.py. And I essentially said, "Hey, Claude, do you have any thoughts on the system prompt? Maybe I can use skills instead of a long-running system prompt for progressive disclosure." So the first thing that we'll talk about is skills. There's been a few other sessions on skills. The short definition that I like to use is that skills are packaged and composable information that Claude has the ability to pull into context whenever Claude realizes that it needs that information to complete a particular task, right? Skills are really useful with Claude Code. Like, if you need to provide Claude information on your testing process, or if you want to package up your brand and your UI components and bundle them into a skill that Claude can pull into context whenever needed, skills are fantastic. Skills are also useful within the agents that you're building for your customers. So if you're building a product and you are going to give that product to customers, you're building an agent, skills are great within that. In the case of the agent that we have on the screen here, um, again, we have a lot of different policies and a lot of procedures that go into our inventory management system. As I accumulated requirements over time, instead of building skills, I decided to take all of that information and keep appending it to my system prompt, so my system prompt got longer and longer and longer over time. This is not something that we recommend you do based on the introduction of skills, right? 
Leave the system prompt only for the information that Claude needs in its mind, regardless of the task that you give it. Skills are fantastic for packaging information that Claude is going to need some of the time, not all of the time, right? So if I ask Claude to go build a forecast, Claude is going to, um, go ahead and do that. Let's see. I lost my computer just for a second. There we go. If I ask Claude to go ahead and build a forecast, right, Claude is not going to need forecasting information unless I specifically ask it, um, to go ahead and, and build that forecast, right? So in the case of that particular task, I want Claude to pull forecasting information into its context window. Skills are also fantastic for making sure that you are being efficient with context because if you stuff all of this information into the system prompt, you're polluting that context window with information that Claude does not need, um, in order to complete a particular task. So again, the first thing that I did-- I'll zoom in just a bit more so that you can see this, and I'll scroll up just a bit. I said, "Hey, Claude. Can you help me take a look through my system prompt? Um, can I use skills instead? Um, my system prompt is too long, and I need some help." And so Claude did an analysis of this and realized that I have some pre-built skills that I can use to supplement information in my system prompt. So the first correction or fix that we're gonna make to modernize our architecture here is we are actually going to, um, remove, uh, many of, uh, much of the system prompt, and we're gonna put that information into skills. And so you can see here, the first thing that we're doing with Claude is we are activating a number of different skills that previously were not there before, and we're actually swapping our system prompt to be a short prompt instead of a long one. 
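The progressive-disclosure idea described here can be sketched in a few lines. This is an illustration under stated assumptions, not how Claude Managed Agents implements skills: the Skill class, the skill names, the placeholder policy text, and the load_skill helper are all hypothetical, and in a real agent it is the model, not the caller, that decides when to pull a skill into context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    description: str  # short, always listed in the prompt
    body: str         # full procedure, loaded only on demand


# Hypothetical skills; the text stands in for real policy documents.
SKILLS = {
    "forecasting": Skill(
        "forecasting",
        "How to build a demand forecast, including promotion months.",
        "FORECASTING PROCEDURE: start from the baseline, then apply the promotion multiplier.",
    ),
    "purchase-orders": Skill(
        "purchase-orders",
        "How to pick a supplier and file a PO.",
        "PO PROCEDURE: compare supplier quotes, then file the order.",
    ),
}

BASE_PROMPT = "You are Stockpilot, an inventory management agent."


def initial_context(skills: dict[str, Skill]) -> str:
    """Short system prompt plus a one-line index entry per skill."""
    index = "\n".join(f"- {s.name}: {s.description}" for s in skills.values())
    return f"{BASE_PROMPT}\n\nAvailable skills:\n{index}"


def load_skill(context: str, name: str, skills: dict[str, Skill]) -> str:
    """Append one skill's full body to the context when a task needs it."""
    return f"{context}\n\n[skill: {name}]\n{skills[name].body}"


context = initial_context(SKILLS)
with_forecasting = load_skill(context, "forecasting", SKILLS)
print(len(context), len(with_forecasting))
```

Only the short descriptions sit in context all of the time. The forecasting procedure appears after a forecasting task triggers the load, and the purchase-order procedure never enters the context for that task.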
So if you're curious, if you feel like you have a long system prompt within the agents that you're building, feel free to take a look at this to see the differences between what was, like, a 400-line system prompt compared to about a 50-line system prompt. We've supplemented that, and we've switched a lot of that information to skills. Great. I am now going to continue working with Claude. You can see we made those changes here, which is fantastic.
25:07 – 31:11
Fix #2 — Tool simplification: prefer human-like primitives (code execution, filesystem) over many bespoke tools
1. SPSpeaker
  Um, there's some evals that I can go rerun. I'm gonna ask Claude to do one more thing, and then we're gonna, we're gonna rerun some of our evals to see where we've improved. So I mentioned before that we have 12 different tools. You saw those on the screen in the second slide that I shared. As a part of this inventory management agent, we have s- we have tools that we've created for everything. So whenever Claude needs to retrieve data, we have a tool. Whenever Claude needs to analyze data, we have a tool for that. We have tools for everything. So I'm gonna ask Claude to take a look at the tools that my agent has and help me think through how I can optimize here. So right now, Claude is running an analysis across the different tools that I have for my agent, and we're gonna get to see what some of the results were. Now, while this is working, um, I'll give you a, a tip, um, when it comes to building agents that we carry with us at Anthropic for our agents internally and the agents built with customers. Whenever we build agents, we lean into the same primitives, um, that we as humans have access to. So imagine yourself when you show up to work, right? You have a computer that's sitting in front of you. You have the ability to navigate files on a file system. You can type in the browser, and you can search the web. If you're an engineer, you have the ability to write and execute code. When you think about Claude Code as an agent, we've effectively given Claude access to all of the same primitives that you and I have access to when we show up to work every single day. Like, Claude Code is a great coding agent because Claude is really good at code. But essentially, what we've done with Claude Code is we've just given Claude access to a computer, right? And this is really powerful because this allows us to drop in better versions of Claude as we continue to release new models, and Claude just uses those primitives better than it did before, right? 
Like, imagine yourself after this conference compared to yourself when you walked in. You're gonna have the same tools at your fingertips, but you're... theoretically, your brain's gonna be a little bit bigger. You're gonna be smarter based on what you learned here, and you're gonna be more effective while using the same tools. Claude works the same exact way, right? And so whenever we build agents, we lean into human-like primitives first. These primitives are things like code execution and the navigation of a file system, the keeping of a to-do list, the ability to search the web. These are foundational tools that we always start with when we build agents, and we remove them as needed. An example that I like to give is with file, uh, like document analysis. If you're building an agent that requires document analysis, maybe you have a lot of ex-- uh, CSVs or Excel sheets that your agent is gonna be looking over, code execution, so the ability to write and run code, is one of the best ways of, uh, uh, doing data analysis and working across lots of documents, right? Like, if you need Claude to look across a CSV, giving Claude a bash tool so that Claude can write a quick Python script and reason across the results after running that Python script is much more effective than just uploading the entire CSV into Claude's context window, right? So again, we lean into these, uh, computer-like primitives first when building an agent. So if I scroll down here, that's exactly what we did here. You can see we took a lot of steps, and we actually removed most of the tools that exist within our agent, and we replaced them with, uh, some of the primitives that I talked through previously. This is an inventory management agent that leans really well to this. 
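As an illustration of the CSV point, this is the kind of short script an agent with code execution might write for a low-stock sweep. It is a sketch with made-up data: the column names (sku, on_hand, reorder_point) and the rows are assumptions, and the inline string stands in for a file in the agent's sandbox.

```python
import csv
import io

# Stand-in for an inventory file in the sandbox; columns are hypothetical.
INVENTORY_CSV = """sku,on_hand,reorder_point
A-100,4,10
B-200,55,20
C-300,0,5
D-400,18,18
"""


def low_stock(rows, limit=20):
    """Return (sku, on_hand, reorder_point) for items below their reorder point."""
    flagged = [
        (row["sku"], int(row["on_hand"]), int(row["reorder_point"]))
        for row in rows
        if int(row["on_hand"]) < int(row["reorder_point"])
    ]
    # Largest shortfall first, so the most urgent items lead the summary.
    flagged.sort(key=lambda item: item[1] - item[2])
    return flagged[:limit]


rows = csv.DictReader(io.StringIO(INVENTORY_CSV))
result = low_stock(rows)
for sku, on_hand, reorder_point in result:
    print(f"{sku}: {on_hand} on hand, reorder at {reorder_point}")
```

Only the printed summary needs to come back into the model's context. The full file stays on disk, which is why this approach tends to use far fewer tokens than reading the whole CSV into the context window.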
Um, I have the ability to consolidate and remove a lot of the tools that I'm using to reason across Excels and reason across forecasting data and just give Claude access to the same tools that Claude Code has in order to do that. What's cool about this is that when you build using, uh, Claude Managed Agents, these tools are actually included by default. So if you want to give Claude access to those same tools that Claude Code has and use them to, uh, drive powerful capability within your agent, you don't have to worry about writing a tool that gives Claude the ability to write and run code. Or you don't have to write a tool that gives Claude the ability to use the file system. You can just rely on those built-in tools, um, that we have built ourselves for Claude Code that we just make available through, uh, Claude Managed Agents. I'm gonna ask Claude to rerun the evals to see if we are getting better. Now, with your agent, there's always gonna be the need to add some custom tools as well. Like, you're not-- you're only gonna get so far by giving your agent the same tools that we give Claude Code. Um, so we always start with those, uh, primitives like code execution and web search and to-do lists, et cetera. Um, we always start there, and then, uh, we either remove those tools as we don't need them, right? There might be some agents where we just don't need web search, so we'll go ahead and remove that tool. Um, and then we'll add custom tools whenever we need them, right? So again, when you think about tools, I encourage you to start with those Claude Code primitives, those human-like primitives, and then add custom tools only as you need them. In the case of this specific inventory agent, um, we were able to remove most of the tools and replace them with Claude Code. So you can see right now, Claude is redeploying my agent to Claude
31:11 – 33:41
Tooling strategy guidance: when to use MCP vs local tools vs code-based tool execution
1. SPSpeaker
  Managed Agents. So again, I have my agent locally.I am redeploying it based on some of the changes that we've made, and now I can rerun some of my evals to see the results. So you can see in that last command, I'm rerunning, uh, the F1 eval, and we're gonna see what happens as a result. Now, we always get a lot of questions when it comes to MCP. So in the case of CMA here, you have a couple different options when it comes to tools. You can first lean on those Claude Code primitives, things like web search and code execution and file system. Again, that's what we start with. You can then create, uh, just custom tools, so standalone tools that only your agent has the ability to use. Then you can connect your agent to MCP. We see a lot of folks run towards MCP first, and a lot of our customers end up in this ecosystem where there's a lot of kind of chaotic MCP servers that exist. A lot of times they have overlap, um, which can create some problems. So when we build agents, again, we start with those Claude Code tools. We then create local tools only for our agent. We don't run to MCP. And then only in the case where we have a common collection of tools that multiple clients will benefit from accessing, do we go about the process of collecting those and publishing them as an MCP server. So only when we have multiple agents, maybe multiple Claude Code clients that need to access the same set of standardized and governed tools, we run towards MCP. Something else that's becoming increasingly common throughout the industry is leaning on Claude's ability to effectively use code execution as a means of executing tools. So we see a lot of capabilities coming out around just giving Claude access to, uh, use CLIs and invoke APIs using code and actually run tools using code instead of MCP. One of the drawbacks of MCP is that it does, um, cause some, uh... it can cause some context issues just in terms of polluting context and taking up a lot of space. 
So there may be some cases where you can just rely on code execution, either through CLIs or just by giving Claude the ability to invoke APIs using code as a means of, um, creating more flexibility for your agent where you do not have to use MCP.
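The ordering described above (built-in primitives first, then local tools, and MCP only when several clients share the same tools) can be sketched as a small decision function. This is only an illustration of the stated rule of thumb: the `Capability` type, its fields, and the returned labels are hypothetical names, not part of Claude Managed Agents or MCP.

```python
from dataclasses import dataclass


@dataclass
class Capability:
    """Hypothetical description of something the agent needs to do."""
    name: str
    covered_by_primitives: bool  # can file system, web search, or code execution do it?
    client_count: int            # how many agents or clients need this same tool


def choose_tool_home(cap: Capability) -> str:
    """Apply the talk's ordering: primitives, then local tools, then MCP."""
    if cap.covered_by_primitives:
        return "built-in primitive"
    if cap.client_count > 1:
        # Only a shared, standardized tool set justifies publishing an MCP server.
        return "shared MCP server"
    return "local custom tool"


# Illustrative inputs only; these capabilities are made up.
for cap in [
    Capability("read inventory file", True, 1),
    Capability("create purchase order", False, 1),
    Capability("look up supplier record", False, 4),
]:
    print(f"{cap.name}: {choose_tool_home(cap)}")
```

Checking primitives first means a capability that several clients need still stays a primitive when code execution or the file system already covers it.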
33:41 – 40:47
Results and subagents: efficiency gains, when to keep subagents, and CMA callable agents
1. SPSpeaker
  So just something to keep in mind as you're building. Great. Okay, so Claude just got done. Um, looks like we have the before and after from some of the changes that we've made, and I think that this is pretty compelling, right? The first thing that jumps off the screen to me is the token usage. So [chuckles] before, I was using over two hundred thousand tokens for a particular task. After leaning in on some of those file system primitives, you can see that that went down dramatically. This is a direct result of giving my agent code execution. So again, imagine instead of giving my agent a full CSV that needs to be read into context, I just give my agent the ability to write and run Python as a means of kinda navigating across all of that information. The agent uses a lot less tokens when it can write code and then run code and then read the results instead of having to consume all of that data in Claude's mind and then use all of that kind of collective brainpower to then make decisions based on the results. Um, a few other things. Um, we can see that our cost went down as well because we're just not using as many tokens, which makes sense. Um, our, our, our execution time went down as well. So this was a pretty good case where I think we got better overall, but this is not something that will happen all the time, right? We-- like, we might see some cases where we regress, but this was the case where using some of those primitives as opposed to some of our more stagnant tools was clearly the, uh, the right decision. Great. Okay, we're gonna jump back, and we're gonna talk about subagents for just a bit. I'm gonna copy another prompt to Claude, and we are going to, um, investigate subagents. Now, I mentioned before that we had twelve different tools. Three of them were effectively wrapping subagents. So if I'm Claude, I have the ability to call on a tool. That tool is a wrapper for a subagent. I can then go and invoke that subagent. 
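The token savings described in this turn come from the agent writing and running code over the data and reading back only the result, instead of reading the whole CSV into context. A minimal sketch of that idea follows. The inventory columns, the row count, and the reorder rule are invented for illustration, and character counts stand in for tokens only as a rough proxy.

```python
import csv
import io


def build_inventory_csv(rows: int) -> str:
    """Generate a hypothetical inventory file as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["sku", "on_hand", "reorder_point"])
    for i in range(rows):
        writer.writerow([f"SKU-{i:05d}", i % 50, 10])
    return buf.getvalue()


def low_stock_summary(csv_text: str) -> str:
    """The kind of short script an agent might write: scan the file, return a summary."""
    reader = csv.DictReader(io.StringIO(csv_text))
    low = [r["sku"] for r in reader if int(r["on_hand"]) < int(r["reorder_point"])]
    return f"{len(low)} SKUs below reorder point; first 3: {low[:3]}"


raw = build_inventory_csv(5000)
summary = low_stock_summary(raw)

# Characters are a crude stand-in for tokens, but the gap is the point.
print(len(raw), "characters if the whole file is read into context")
print(len(summary), "characters if only the script's output is read")
print(summary)
```

The summary stays about the same size however many rows the file has, which is why the approach scales better than loading the data itself.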
  I see Claude's doing a lot here. I'll scroll up just a bit, and then we'll talk through it. The two main use cases where-- or the two main instances where we see subagents initially as being really effective is first when you wanna throw a lot of Claude at a problem, right? So let's say that you're trying to do deep research or, like, web search. Um, let's say that you're trying to do, in the case of Claude Code, code-based exploration. That's a great case where, like, having many different minds running at the same problem makes sense. So subagents are a great way to parallelize and throw a lot of Claude at a problem to get it done faster and more effectively. The second case where it's really common to use subagents is when you need a fresh mind to look at a problem. So I'll use the Claude Code example first. If I'll, I'll use my example as a developer. If I am writing code, I do not wanna be the same person that is writing and also reviewing my code. I'm gonna have somebody else review my code. So in the case of Claude Code, it makes a lot of sense to have one instance of Claude doing the writing of the code and then another instance of Claude coming over the top and reviewing that. That does not have context about the initial, uh, instance of Claude. This is a great case for a subagent. Using just a code review subagent and layering it over the top is a great way to do this. We also have a, a subagent within, um, our, our agent here, our inventory management agent, that we've actually kept as the result of, um, of some of the changes that we've made, and that's for forecasting specifically. So again, I have a forecasting capability that's within my inventory management agent. I do wanna keep my forecasting separate from my main instance of Claude. I don't want anything in my initial context window to distort the forecasting process. 
I do have a skill that kind of walks through the step-by-step sequence and the guidelines that I prefer Claude use when writing and building forecasts. But again, I don't want the same Claude that I'm-- that, uh, my customer is talking with to also be the Claude that writes the forecast, right? So I wanna divide that. So I'm leaning on that second example of when to use a subagent, um, as the place where we'd like to go about doing this. So in this case, we've removed our other subagents, and we've just replaced them with primitive tools, um, but we are going to leave the forecasting subagent. Now, we're not going to expose our subagent as a tool. Using Claude Managed Agents, there's a native capability for subagents, um, that allows the logging and the observability of your subagents, um, to be really effective. One of the problems with subagents is that when you have multiple instances of Claude running, first off, it's difficult to make sure that the communication between your orchestrators and your subagents is accurate and is seamless, right? There's a lot that can get lost in translation. Just like when I'm talking to one of my colleagues, I might be thinking something, they might be interpreting it completely differently. The same thing happens with orchestrators and subagents. The, um, the other thing that can happen is logging is really difficult in some cases, right? Because then you have to worry about collecting the transcripts from multiple different agents. So within Claude Managed Agents, we've added this native subagent capability. I saw it on here. Let me scroll up just a bit. I think Claude found it. Yeah. So there's this callable agents capability that exists within Claude Managed Agents, which is essentially just like managed subagents, so that within your session information, you have observability and metrics about what exactly your subagents are doing that is as accurate as your initial orchestrator, right? 
Um, this is again meant to solve one of the common problems of just having, uh, a lot of information that is hard to track with subagents. We just did some building. Again, I'm gonna skip through these because we spent some time talking about them. We just talked about subagents. Again, there's a few different cases where you can use them. We t- just talked about callable agents. You can also just define your subagent as a tool, which is what we did previously. But we actually moved away from that, and we decided to use the CMA native capability. Um, there are a lot of cases where you can just now scrap the subagent entirely and just give more flexibility and capability to your main agent. So what we have a lot of customers doing is actually just consuming capability into their main, in this case, orchestrator, because frontier models have gotten intelligent enough to manage across more information where you just don't need as many subagents. So again, when you're thinking subagent, I had a lot of Cla-- or I have a big problem that I wanna throw a lot of Claude at, or I want a separate Claude to kind of look at, um, the work of either me or of a different instance of Claude, two great times to
40:47 – 45:05
Final architecture and takeaways: simpler stack, higher eval score, and hill-climbing discipline
1. SPSpeaker
  use subagents. Okay, so let's look at the architecture that we ended with. Again, refreshing us, we started with an orchestrator, system prompt of about four hundred lines long. We had twelve tools, three of them were subagents. What did we end with after this exercise? We still have an orchestrator, but we deployed that on Claude Managed Agents because I didn't wanna have to worry about infrastructure, scaling, security, et cetera. I just wanted to worry about my agent, right? Like in, in Will's simple terms, like that is when I reach for Claude Managed Agents because I just want to worry about building the best thing possible and not all the messiness that comes with it. We simplified our tools. We now have, uh, we have, uh, right now three different tools, so we actually simplified everything to just use Bash, Read, and Write. Now, when our agent starts executing, we sync some data into the Claude Managed Agents environment so that it can reason across that data. We actually simplified our system prompt to fifteen lines long, and we replaced all of our business logic with skills. So again, I was just stuffing requirement after requirement into my system prompt. I decided to take that, package it up as skills so that Claude could pull that information into its brain only when Claude realized that it needed it in order to solve a problem. As a result of this, we showed how we can then start hill climbing on evals to see improvements over time. So at the end of this, my eval score is about ninety-two percent. I've simplified my design. I'm leaning into some of the primitives, um, that make, uh, Claude great. Um, and I'm seeing the positive results after doing so. Again, some of the eval results, you see that here after running this, um, we're getting faster. We're using fewer tokens because we're leaning into code execution. 
Um, our turn count is remaining sort of the same, but again, because the token usage and the cost is going down, I'm actually okay with Claude taking more turns. There are some cases where we'll see the latency not drop maybe as much as you would expect. But for some of these more sophisticated high intelligence agents, where like forecasting is at play, I'm willing to take a little bit higher latency, um, at the expense of seeing my performance improve and my costs go down. All right, let's wrap with some, some takeaways here in our last minute. When we build agents, we start with a single agent loop that i-- that is equipped with very simple primitives that give Claude some of these human-like capabilities, like the ability to use the file system, like the one that you have on your computer, web search, code execution, um, sometimes a to-do list. Again, we start there, and then we build accordingly. The next thing that we did is we used progressive disclosure through skills. Instead of stuffing our system prompt with a lot of information, we made information accessible to Claude whenever Claude realized that it needed that information in order to solve a problem. This is great because we can run more efficiently, and, uh, we're not polluting our context window, um, and we're giving Claude more flexibility to make decisions. The last thing that I want you to walk away with, write evals in general. This idea of hill climbing is a concept that we lean on really, uh, heavily at Anthropic, right? You have evals, you establish a baseline, you then tweak your architecture, and you rerun evals, and you get better over time. Now, as a result, it's important to make sure that your evals are updated as your product capability expands. Always make sure that your evals are encompassing the things that you care about and that you're measuring within your agent so that you can actually make sure that your agent is accomplishing the thing that you set out to accomplish. 
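The hill-climbing loop described above (establish a baseline, change the architecture, rerun the evals, compare) can be sketched as a comparison between two eval runs. The task IDs and pass/fail values below are placeholders invented for illustration, not the workshop's actual results, and the rule for accepting a change is an assumption rather than something stated in the talk.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of eval tasks that passed."""
    return sum(results.values()) / len(results)


def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Report what flipped between a baseline run and a run after a change."""
    regressions = sorted(
        t for t in baseline if baseline[t] and not candidate.get(t, False)
    )
    fixes = sorted(
        t for t in baseline if not baseline[t] and candidate.get(t, False)
    )
    return {
        "baseline": pass_rate(baseline),
        "candidate": pass_rate(candidate),
        "regressions": regressions,
        "fixes": fixes,
        # Assumed acceptance rule: score must improve and nothing may regress.
        "keep_change": pass_rate(candidate) > pass_rate(baseline) and not regressions,
    }


# Placeholder results; real values would come from the eval harness.
baseline = {"R1": True, "R2": True, "R8": False, "F1": False, "F2": False}
candidate = {"R1": True, "R2": True, "R8": True, "F1": True, "F2": False}

report = compare_runs(baseline, candidate)
print(report)
```

Tracking regressions separately from the overall score matters because a change can raise the pass rate while breaking a task that used to work, which is the failure pattern the talk opens with.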
With that, folks, we're gonna go ahead and wrap. I really appreciate your time today. Um, I'll be in the back after the session, just outside of this room in case you have any questions at all. Um, thank you for spending your day at Code with Claude in London. I hope you have a great rest of your day. Appreciate it. [upbeat music]

Episode duration: 45:05


Transcript of episode mWvtOHlZM-I

