[upbeat music] Please welcome to the stage Product Management Lead, Claude Platform at Anthropic, Brad Abrams. [upbeat music] Thank you. Good afternoon. You've almost made it to the end of Code with Claude; thank you for hanging in there with us. But I have to tell you, you are in the most important session of the day, so you made a good choice being here. This is an important session because we're going to talk about putting agents into production, the real-world case of agents in production. Just to get us started: how many of you already have an agent in production? Not some demo proof of concept, but actually in production. Who's got it? Okay. Now keep your hands up if you're happy with the cost, reliability, and latency. Yes? Okay, there are still a few. You can put your hands down, but if you're sitting next to one of these folks, you should talk to them afterward and get the real tips. We're going to talk through a few techniques in this session that will help you manage cost, latency, and reliability. So let's drill in.

It turns out the most important technique to think about is prompt caching. I caught a couple of sessions, and a lot of people are mentioning it, but if you're not already doing prompt caching, you absolutely should be. And I can look at the analytics, so I know a few of you are not doing prompt caching; it's absolutely worth covering. With long-running agents, prompt caching is very important because the context continues to grow tool call after tool call. You have the tool call, the tool result, then another tool call and tool result, and all of that gets appended to the prompt. So there are a lot of common elements in that prompt that, unless you do prompt caching, we reprocess every single time. With prompt caching, if you mark which sections of your prompt are common, we're able to compute the KV values, essentially precompute part of the model's inputs as key-value entries, and save those. That saves a lot of latency and a lot of processing time; we skip the whole first part of inference, so it's a big cost savings. In fact, it's a 90 percent discount. If you're running a long-running agent and not doing prompt caching, you're missing out on a 90 percent discount on input tokens, which are the largest share of cost for most customers. You also get faster response times, especially time to first token. And a little-known fact: cached prompt tokens don't count against your API rate limits. So if you have a rate limit and want to manage it as well as you can, cache-read tokens don't count against it. We almost 10x'd those rate limits today, but if you still want to manage them, this helps. Some of our customers have done a really amazing job with prompt caching. Cursor, Replit, and Perplexity are all well into the nineties on cache hit rate, and they got there with significant engineering effort. I've sat with some of these engineering teams, and I can tell you they put in a lot of work to get prompt caching into the nineties.
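For reference, "marking which sections are common" is done with `cache_control` breakpoints in the Messages API. A minimal sketch using the Python SDK; the model name, prompt content, and conversation are illustrative placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # placeholder: the big, unchanging instructions
conversation = [{"role": "user", "content": "Review objective one."}]  # grows each turn

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative; use whatever model you run
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Cache breakpoint: everything up to and including this block
            # is written to the cache, and later calls that share this
            # prefix read it back at the discounted rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=conversation,
)
print(response.usage)  # includes cache_creation / cache_read token counts
```

Tool declarations sit ahead of the system prompt in the cached prefix, and breakpoints can also be placed on message content blocks, which is what keeps a long, growing agent transcript cheap.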
But let me give you a hint: we have a couple of tools available to you right now that make it way easier. The first is on the screen here. You can do some deep inspection of what's happening with prompt caching in your app. So feel free to pop open your laptop, go to the Claude Console, and check out the new prompt caching dashboard under Analytics; you can see exactly what your production agent is doing in terms of prompt caching. And if it's not in the nineties, you have some work to do. That's the first thing. The second thing is we've recently launched a skill for Claude Code that's an expert at prompt caching, and it's installed by default. So all you need to do is go to Claude Code and say, "Improve my cache hit rate," and Claude will walk you through the process: adding the cache-control markers, maybe reorganizing some of the prompt, until you get a very high cache hit rate. That's an absolute no-brainer; I think you should take a look at doing it.

So let's take a look at a demo of how prompt caching looks. I'm going to invite Ben on stage. Ben, come on out here, and we'll take a look at this demo. Ben and I worked on this together. It's an executive dashboard: let's say you're the CEO of a company, and you have all these objectives, and we're running a... Wait. Ben, is this the demo we agreed on? We're at Code with Claude. This looks like late-'90s SharePoint UI. Do you have Claude Code? Okay, pop open Claude Code; let's see what we can do. Okay, Ben's got Claude Code [laughs], and it's attached; he's got the source code for this, and we're going to see if we can improve this theme. How many people want a better theme? Okay, there we go.
[cheering]
Okay. Is that a better theme? A little more appropriate for the venue we're in? We are no longer some boring '90s CEO; we are the CEO of Hero Corp AI. What Hero Corp does is rent out superheroes: to battle villains, protect Metropolis, come to your child's birthday party, whatever. And we're seeing the objectives. This is objective one. I'm told that retention of superheroes is very important, so objective one is about retaining them. And Ben, maybe we're not paying our superheroes enough; it looks like they're a little low here. You can see some updates from each of the superheroes and then, as CEO, some tasks we can go do.

So, Ben, do you know what the cache hit rate on this is? No? Okay, you don't know. So first off, you've got to know what your cache hit rate is. What we've done is implement a dev console for this little demo. Slide open the dev console and let's take a look. In this dev console, you see our context usage, tool calls, and then this agentic transcript that's happening. Don't you wish all your apps came with this beautiful dev dashboard? But one thing stands out to me immediately: the cache hit rate is zero. So Ben, is there something we can do to improve that? He's going to open up Claude Code, go back to that, and just ask it to improve the cache hit rate. Now we'll rerun it. And notice that when we rerun, we're hitting all those same tool calls again, but this time in the agentic transcript you're seeing cache writes and cache hits. The first time the inference system sees a prompt segment, it writes it to the cache; we store those KV values for five minutes by default, and you can extend that with some options. Then the next time the loop comes around, that becomes a cache hit. So in a normal agentic loop you'll see some cache writes and some cache reads. We're doing a little better here, and I think over the course of the demo you'll watch that cache hit rate keep getting better. Okay, so that's prompt caching.
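A hit rate like the one in that dev console can be computed from the `usage` block the Messages API returns on every response. The field names below are real response fields; the running-total bookkeeping is an illustrative sketch:

```python
# Accumulate per-turn usage from each Messages API response in the agent loop.
totals = {"read": 0, "written": 0, "uncached": 0}

def record(usage) -> float:
    """Add one turn's usage and return the cumulative cache hit rate."""
    totals["read"] += usage.cache_read_input_tokens         # served from cache
    totals["written"] += usage.cache_creation_input_tokens  # written this turn
    totals["uncached"] += usage.input_tokens                # full-price input
    all_input = sum(totals.values())
    return totals["read"] / all_input if all_input else 0.0

# After each call in the loop:
#   hit_rate = record(response.usage)
# A healthy long-running agent should trend into the nineties.
```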
But if you scroll down a little, let's look at some of these other objectives. Ben, I gave you a million tokens of context in this Opus 4.7. A million tokens, and apparently that's not enough. With all the tool calls we're doing to get information from Slack, from Gong transcripts, from Salesforce... in fact, pop open one of those and show it: there's an enormous amount of data getting flooded into the context. Even at a million tokens, we're running out before we even get through objective one. So we should think about how we want to handle this, and as you might guess, I have a couple of techniques. Let's switch back to the slides, and let me describe some techniques for context engineering.

Context engineering is really a discipline: the discipline of deciding what belongs in Claude's context. One mistake I see developers make is using abstractions on top of the platform that obscure what's in the context, so as a developer you don't really know what Claude is seeing. That makes it difficult for you to optimize; it makes it difficult for you to be a context engineer. So I encourage you to pay close attention to the full transcript that Claude has access to, because it will be very insightful. This discipline is about making a proactive decision about what should be in there. I'm going to talk about three different tools we have available today, in production, to help you manage the context. The first reduces the tool declarations in your context. The second reduces the tool results that pollute the context. And finally, compaction reduces all the stale turns that are no longer needed. Let's drill into each of these and see how they look.

First, the tool search tool. We have many customers with tens or even hundreds of tools loaded. If you have a long-running, multi-use agent, it often does need many tools to get its job done; that's one of the beauties of LLMs, they're general purpose and can do lots of different things. So we want to encourage you to have a lot of tools. But in the "without" case, if you load all those tools up front in the system prompt, that leaves very little space to do your actual work. With the tool search tool, we defer-load all those tools: you still declare them up front, but we don't put them in the prompt yet. We load them just in time, just as the model needs them. If, in a particular agentic trajectory, the model never needed a tool, it never gets loaded. That optimizes the context pretty well; you can see the idea in the sketch below. Customers like Lovable have reduced their token usage by ten percent this way, and not only does it save money and latency, but as Lovable saw, and I think many will see, it actually increases the intelligence of the model to be careful about what goes into the context. So that's the tool search tool.
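The platform implements this server-side, but the defer-loading idea is easy to see in a hand-rolled sketch. Everything here — the names, the matching logic, the four-result cap — is illustrative, not the actual tool search tool:

```python
# Hand-rolled toy of the defer-loading idea. All declarations exist up
# front, but none of them enter the prompt until they're searched for.
FULL_TOOL_DEFINITIONS = [  # imagine hundreds of these
    {"name": "hero_retention_metrics", "description": "Retention stats per hero",
     "input_schema": {"type": "object", "properties": {}}},
    {"name": "gong_transcripts", "description": "Fetch Gong call transcripts",
     "input_schema": {"type": "object", "properties": {}}},
]

def search_tools(query: str) -> list[dict]:
    """Naive keyword match over names and descriptions. The real feature
    is smarter, but the contract is the same: a query goes in, a handful
    of matching tool declarations come out."""
    q = query.lower()
    hits = [t for t in FULL_TOOL_DEFINITIONS
            if q in t["name"].lower() or q in t["description"].lower()]
    return hits[:4]  # only a few declarations ever enter the prompt

# The agent loop starts with only the search tool declared; when the model
# calls it, the hits are appended to the tool list for subsequent API
# calls, so a declaration is only paid for in trajectories that need it.
```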
The next one to look at is programmatic tool calling. First off, don't you love these animations? Can you tell I have free tokens from Claude to build these? What programmatic tool calling solves is the problem of tools that return too much data. Many of the tools we just showed you return huge amounts of text that just gets stuffed into the prompt. And that works fine if you're building a little demo, but in this session we're talking about going to production, not building a demo, so you need to be a little more thoughtful. You can try to trim the tools so they return less, but often when you do that, you miss a case: there is some case where the model needed that data, you've removed it, and now the model doesn't have it. The insight we had here is that the model is actually very good at writing Python code; I don't know if you've noticed, but the models are very, very good at writing Python. So what we do is expose the tools to the model: we say, here is the set of tools available right now, and it writes Python code. You'll notice the first time it writes code to inspect the schema of what's returned, and then the second time, like you're seeing here, it knows the schema. The tool returns all of its data, which stays in memory, and the model writes code to pull out just the little bits it needs: one byte here, a few words there, and it uses just that in its context. So the model is deciding what goes into its own context. With this technique, Quora is saving a lot of money and seeing increased intelligence on tasks like the HTML parsing they do in Python.
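To make that concrete, here's an illustrative reconstruction of the kind of code the model emits under programmatic tool calling. The tool names and fields are invented stand-ins for the Hero Corp demo; the point is that full tool results stay in the sandbox, and only what gets printed re-enters the model's context:

```python
import asyncio

# Hypothetical stand-ins for the demo's tools; with the real feature these
# are the declared tools, exposed to the model's code as async functions.
async def hero_retention_metrics(quarter):
    return {"summary": {"retention_rate": 0.82}, "raw": "...thousands of rows..."}

async def gong_transcripts(account):
    return [{"text": "transcript text " * 500}]

async def salesforce_pipeline(stage):
    return {"Metropolis": {"amount": 1_200_000}}

async def analyze_objectives():
    # Call several tools concurrently; the full responses live in the
    # sandbox's memory instead of being appended to the prompt.
    metrics, transcripts, pipeline = await asyncio.gather(
        hero_retention_metrics("Q3"),
        gong_transcripts("Metropolis"),
        salesforce_pipeline("renewal"),
    )
    # Print only the few values actually needed, each labeled with
    # exactly where it came from; only this text reaches the model.
    print("retention_rate:", metrics["summary"]["retention_rate"])
    print("contact_pipeline_gong:", transcripts[0]["text"][2000:2500])
    print("renewal_amount:", pipeline["Metropolis"]["amount"])

asyncio.run(analyze_objectives())
```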
And finally, I call this one the sledgehammer technique: compaction. Even if you do a great job managing your tool declarations and your tool results, if your agent runs long enough, you will hit the context window limits. What compaction does is remove all those stale turns of the conversation that aren't needed anymore and compact them down to a short summary. That summary is important; there's a lot of intelligence that goes into creating it so the model can continue on without losing the thread. It knows what it needs to keep doing. And Hex is actually using this in production now.

Okay, I know you're dying to see it. Let's switch back to the demo and our Hero Corp agent. So we should add the tool search tool, compa... you know what? Let's just add them all: tool search tool, programmatic tool calling, and compaction in one go. Now when we reload the page, watch the context bar up there. Notice, first, it's moving way slower than it did before. Remember, before, the first objective went all the way to a million tokens. We're calling the exact same tools returning the exact same data, but we're being smart: we're using context engineering to decide what goes in. So with much less context, we're able to load the entire page. And I know some of you bean counters have already noticed that the cost went way down as well.

But let's walk through these one at a time, just to make sure we follow exactly what each of them does. Start with the tool search tool; here's what it returns. The model looked at the problem we gave it and said, "I'm going to need a tool that does hero retention metrics." Our system went through all the hundreds of tools that were declared, picked out three or four, and returned those. So rather than having a hundred tools in context, we have three or four. Then something like hero retention metrics gets called, and later we see it... do we see it? Oh yeah, there it is; sorry, I missed it. Hero retention metrics is right there. We dynamically added this tool just in time, the model turned around and called it, and we returned the full data. Okay, so that was the tool search tool.

The next one we talked about was programmatic tool calling. We flag it as code execution in this view. Just examine the code the model writes; this is not some special thing I have, we give you exactly the code the model writes so you can understand what's actually happening. You see it calling those methods. Look at `results = await asyncio.gather(...)`: it's calling each of the same tools we saw before, with the same parameters. But this time we're not loading all of that into context. The model is loading them into a JSON object and printing out, very well documented, "contact pipeline gong", exactly where it came from. And then you see that slice, something like `[2000:2500]`: the model knows exactly which piece of which object it needs. All that data we loaded in the first turn? It doesn't need it; it just needs this one little bit, so it pulls that out, prints it, and that is what gets loaded into the model's context. So that's programmatic tool calling.

And then finally, compaction. Let's look; it gets called a couple of times in here. Just for demo purposes, we set the compaction threshold pretty small, something like five hundred thousand tokens. That's actually a common technique: just because you have a million-token context doesn't mean you need all of it. It may save you cost, it may save you latency, and it may even buy you more intelligence to keep the model at a lower threshold. So that's what we did here. As the context grew to that threshold, we paused execution and sent the entire transcript to another model call, which summarized it and produced this summary. Go ahead and expand it. You can see this is the summary of everything that went on: it took hundreds of tool calls, with all their results, and compacted them down to "here are the few things you need to know to keep going." So the model could keep going.
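A minimal sketch of the compaction step as described, assuming a hand-rolled client-side loop (the platform can do this for you); the threshold, the summarizer prompt, and the model name are illustrative:

```python
import json
import anthropic

COMPACTION_THRESHOLD = 500_000  # tokens; the demo used a deliberately low cap

def maybe_compact(client: anthropic.Anthropic,
                  messages: list[dict], tokens_used: int) -> list[dict]:
    """When the transcript nears the threshold, replace the stale turns
    with a model-written summary that preserves the thread of work."""
    if tokens_used < COMPACTION_THRESHOLD:
        return messages
    summary = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": "Summarize this agent transcript so the work can "
                       "continue: the goal, decisions made, facts "
                       "discovered, and the immediate next step.\n\n"
                       + json.dumps(messages),
        }],
    ).content[0].text
    # The agent continues from the summary instead of the full history.
    return [{"role": "user", "content": f"Summary of work so far:\n{summary}"}]
```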
Okay, I think we hit all the things. But I notice it's still costing ten dollars to run this thing. I've got to tell you, selling tokens is a great business, and second to that is selling superheroes; that's also a really good business. But still, ten dollars per load? We're going to get in trouble for this. And I notice you're using Opus. Yeah, the model being used is Opus 4.7, an absolutely great model, but it's expensive, and it may be better to use a smaller model. It turns out a smaller model like Sonnet can do tool calling just as well and write code just as well; there are just a few things it doesn't do as well, and we'll talk about how to take advantage of that. So let's switch back to the slides and talk about the advisor strategy. The problem we're trying to solve with the advisor is that we want Opus-level intelligence at Haiku-level or Sonnet-level cost. That sounds great, right?

The insight here really came from engineering teams. If you've worked with engineering teams, you know you can pair a junior engineer with a senior engineer, and that junior engineer will get a lot better. The senior engineer won't be hands-on-keyboard doing the work for them, but they'll do code reviews, look at design documents, do coaching. And that's exactly true with models as well. What we see is that if we give Haiku a way to call Opus and ask for help, Opus can scan over the transcript, see what's happening, and give advice back to Haiku, which really helps a lot. You see it in, again, another beautiful animation: the executor running Haiku knows every shape, it knows exactly what to do with each shape, except this oddball shape. For the oddball, it has to ask the advisor; the advisor knows, and tells the executor what to do with it. And customers like Bolt are using this to help manage their costs as well.

Okay, let's switch back to the demo and add the advisor. Adding the advisor switches the model: now you see the model line says "Sonnet 4.6 plus Opus as advisor." Immediately, again, I saw the accountants' eyes light up, because the price went down significantly. All the tool calling and Python code is being done by Sonnet, which is way cheaper than Opus, so you get immediate savings. But the question is, are you getting the intelligence as well? There's one really important objective here: the Metropolis renewal. This is a contract I've been very worried about for Hero Corp; they have to win it. If we look at this advisor call, the way it worked is that Sonnet went and looked at all the Gong transcripts, all the data about the Metropolis renewal, and said, "Great, this looks good, this is on track." But then it said, "Maybe I'll just call the advisor to make sure," because this is a high-impact deal. So Opus reviewed that same transcript, and it said, "Sonnet, you missed a thing." Buried deep in the transcript, the mayor actually wants Cryothane. It's a really fine detail, and Sonnet just missed it. Sonnet said on track; Opus caught it. So now you see the advantage. Not only are you using Sonnet because it's cheap... oh wait, the marketing team told me not to say cheap. It's not cheap, it's inexpensive. So Sonnet is inexpensive, and you still get the intelligence of Opus, but on demand, exactly when it's needed. And it catches this Cryothane thing. So now it shows in red: it's not good, and we need to do something. I don't know about you, but I have some tension about this; I want to make sure we can win the Metropolis contract. If you scroll down a little, there's an actionable thing the CEO can do, and that is to lock in Cryothane.

I also love these superhero names. It turns out the legal team wouldn't let me use any real superhero names [laughs], so we have Cryothane. We can lock this in. Should we save the contract? Who's in favor of saving the contract? Yes? Okay, great, thank you, you're with me. Let's click that and lock in Cryothane. We have it. In the agentic world, this is all you do as a CEO: just click the button and you're good. Everything's saved.
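A minimal sketch of the advisor pattern: the executor runs on the smaller model and gets one extra tool that escalates to the bigger model. The model names, tool schema, and prompt wording here are all illustrative:

```python
import json
import anthropic

# The executor (e.g. Sonnet) declares this tool alongside its normal ones.
ADVISOR_TOOL = {
    "name": "ask_advisor",
    "description": "Ask a more capable model for a second opinion on a "
                   "high-stakes or ambiguous decision.",
    "input_schema": {
        "type": "object",
        "properties": {"question": {"type": "string"}},
        "required": ["question"],
    },
}

def ask_advisor(client: anthropic.Anthropic,
                transcript: list[dict], question: str) -> str:
    """The advisor (e.g. Opus) reviews the executor's full transcript
    and answers one question, flagging anything it appears to have missed."""
    response = client.messages.create(
        model="claude-opus-4-7",  # illustrative model name
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": "You are advising a smaller model running an agent. "
                       "Review its transcript and answer its question, "
                       "calling out anything it missed.\n\nTranscript:\n"
                       + json.dumps(transcript)
                       + "\n\nQuestion: " + question,
        }],
    )
    return response.content[0].text

# In the executor's loop, a tool_use of ask_advisor becomes one Opus call
# whose answer returns as a tool_result, so you pay Opus prices only on
# the turns that actually need Opus-level judgment.
```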
Okay, we've got just one minute. Thank you, Ben. Let me wrap us up; back to slides. What we saw here: first, it's very important to do prompt caching. If you're not doing that, ignore the rest of the talk [laughs] and go do that. If you are doing it, really pay attention to context engineering: make sure you're aware of what's in the context, and optimize it for what's actually needed. And finally, we talked about the advisor strategy, which gives you on-demand intelligence. But that's not all; there's amazing stuff we've shipped in the platform just in the last few months, and I want to call out my favorites. Workload identity federation: if your security team is concerned about losing API keys, you need worry no longer; WIF is a great answer for that. And I just love the Ant CLI command-line tool. Just about everything you can do in the console is available via the command line. And the best thing about a command-line tool is that Claude just loves command-line tools: if you're using Claude Code, just tell it about the Ant CLI, and it will go manage everything for you and do an amazing job at it. The thing to really take away here is that betting on the platform means the platform is going to keep getting better, and your agents are going to keep getting better as new things keep coming out. Okay, that's all. Thank you very much; I appreciate it. [audience applauding] [upbeat music]