Skip to content
ClaudeClaude

Getting more out of the Claude Platform

Cut cost, manage context, boost intelligence. In this session, we'll show you how to put our latest platform capabilities to work. Through live demos you'll see what great prompt caching looks like, learn to keep context lean for long-running agents with tool search, programmatic tool calling, and compaction, and use the advisor strategy for a cost-effective intelligence boost.

May 22, 202626mWatch on YouTube ↗

EVERY SPOKEN WORD

  1. SP

    [on-hold music]

  2. SP

    Please welcome to the stage member of technical staff at Anthropic, Puneet Shah.

  3. SP

    [upbeat music] [audience applauding]

  4. SP

    Welcome to the last session of Code with Claude here in London. Yeah, let's give it up for everyone who's come before. [audience applauding] Woo! So I'm a product manager on our team, on the platform team at Anthropic, and I've shipped some of the features like our one million context window, fast mode, and a number of our improvements on prompt caching amongst others. And that's what makes me especially excited about this session. You guys have made a very good decision to come here, uh, because this is about the Claude platform. And if you think about what we have, we have these great models, but there's this whole layer on top of it, the platform, that's not about just getting you the intelligence, but helping you build a real business, a real products that really deliver for your users on top of those models. And so, um, in the spirit of this being the last session, I wanna like-- let's get a little movement. So if you have built an agent of any sort, uh, s- big, small, I don't care what, uh, go ahead and stand up. Okay, good. Nice. A lot of agent builders here. This is nice. Now, if you have put one into production, go ahead and stay standing and everyone else just take a seat. Okay. If you're really happy about the quality, about the cost, about the speed of those agents, stay standing and everyone else take a seat. Okay, nice. Nice. Okay, look around. Remember these folks. These are, these are our true experts. Go find them at happy hour afterwards. Um, um, you can all have a seat now. Thank you. Thank you for humoring me. Uh, what I wanna do in this session is share some of what we've learned about what helps you get the most out of that Claude platform. And one of the first things that we have on the platform that I wanna talk through is prompt caching. If you remember nothing else from this session, think about prompt caching. And what caching is, is it's a way that where we-- if you're not familiar, where we take your input tokens, we process them, and then we cache that before we generate the output tokens. And that cache we then continue to reuse as the conversation moves on. And so when you have a new message, uh, in the conversation, we just process those additional new tokens, but the rest we just pull from the cache. And okay, why is this useful for you? Why, why should you care? Well, the first reason is you get a ninety percent discount because we're not reprocessing them. We pass those savings on to you. You get a ninety percent discount, so a huge cost savings to actually build your p- uh, your agent. The second is that you also get a rate limit boost effectively. If our, uh, rate limits are not, uh-- they don't count your cached tokens. And so if you have an eighty percent cache hit rate, meaning eighty percent of your tokens are cached, then you effectively have a five times larger rate limit in practice. So that's great. Um, and then the last benefit is latency. If you are starting to cache a lot and that conversation's getting a, a lot longer, because we're no longer processing all those tokens, what ends up happening is your time to first token goes down. Uh, and so these are great benefits. Um, and if you're kind of looking for a target and you're building agentic applications, something in the kind of eighty percent above range is a good place to try to target. Um, but if you look at some of these customers here, we have, you know, Replit, Cursor, Perplexity, Claude Code, they're hitting ninety plus percent. Like, these people have really-- I've talked to all these customers. They've put a ton of effort into making their prompt caching work because of all of those benefits I've described. And one of the things that they all start with is just understanding what is your prompt cache hit rate? What is the prompt cache hit rate you get? That's the first question you should understand. And thankfully because we've learned how much work that these people are putting in, we, we said, "Let-- Why don't we build it for you guys?" And so now today, if you go to the console, uh, on the Claude platform, console.anthropic.com, you can actually see analytics right next to your cost and usage pages. Uh, you can actually see analytics on prompt caching. And, and even just yesterday we continued to improve this. Yesterday we just launched ways to actually figure out why has your cache broken. Turns out the ordering of how you make those prompts matters a lot. If you-- uh, a common error I see is like people put a, a timestamp into the system prompt. You know, what day is it? Okay, that's useful, but then it breaks the system prompt because it changes, um, as you go and that then breaks your cache. The li- uh, uh, the, the tokens need to be exactly the same. And so w- you can actually see how did it break and, uh, you'll see that on the analytics page. And that's a great place to get started. If you're seeing it at zero percent, it's okay. That's why you're here. Um, you can start with a one-line code change with auto caching, implements kind of a basic prompt caching. Um, or better yet, go to the Claude API skill. Uh, go to Claude Code or many other coding agents and we have a, a skill built in prepackaged that lets you ask it to, uh, improve your cache hit rate and it will help with how you manage and order that prompt to get the best performance. So prompt caching, super important. Um, we're gonna talk a little bit more about it, but I wanna talk about something I've been working on, I've been really excited about. You guys are here for my startup pitch. Thank you so much. Um, I'm also in my free time, uh, the CEO of Hero Corp. Um, and I have my CTO actually here. Ben, do you wanna come out here? Um, he runs our technology team. Let's give it up for Ben. [audience applauding] Um, uh, and can we switch over to the demo laptop? So we're, we're-- WaitHey, Ben, what, what, what is-- Ben, this is Claude with code. This is twenty twenty-six. This looks like it's from the nineties. What, what's going on here? Guys, who, who wants a better theme with something a little bit more of the times? Yeah, we got, we got a few thumbs up. Okay, Ben, let's, let's improve the theme. Let's make it, let's make it ready for, for code with Claude. Let's let that cook. Nice. So we are Hero Corp. Um, we are a superheroes for hire company. Um, you know, we help fight crime in the city, keep the tube running on time, uh, you know, staff your three-year-old's birthday party and make sure there's sufficient number of balloons. Uh, we do it all. Uh, and some people have misconceptions, I think, about the, uh, the superhero world. They think that we're like fly by your seat of your pants. We're analytical. We use OKRs, we plan, and that's why we have a dashboard. And this dashboard is, uh, how we track our performance. You can see, uh, ninety-day retention, ninety-four percent. Mm, we could do better. Uh, we got one flight risk there probably because our compare-- compensation is a little bit low. I-- Okay, yeah, I probably should be paying the superheroes more. Um, you can see some of our different superheroes in the Anthropic Cinematic Universe. Um, our lawyers, uh, required us to use these, uh, names instead of others. Uh, uh, are listed here with their quotes about how things are going. So do you mind pulling open our-- We bu-we built a little developer console. Um, you'll hear me repeat this a number of times. You should always look at the transcript of your agents to really understand what's going on. Um, and so right here you can see we're pulling in a bunch of data, and if maybe you wanna pop open one of those. Um, you'll see here there's a bunch of data we're pulling in from web, from Slack, from Gong, from Jira, all the different sources to aggregate it to take a, a big holistic view on what's going on at the company. And, you know, right... What, what's our prompt cache hit rate? Do you wanna take a look? Zero? Ben, come on. Come on. Okay. Let-let's, let's implement prompt caching. So, um, uh, that's the starting point. We, we know what our prompt cache hit rate. It's okay, it's at zero. We, we can do better. We're gonna get there. And so we're right now at about thirty-one pounds. You saw it before it flashed away. And we've just implemented prompt caching. And, and what's happening on the right is the exact same output. Remember, prompt caching has no impact on intelligence. It is the exact same thing, just pre-processed and saving you money. And now, instead of that long block that he showed you earlier, we're just saving that in cache and then reusing that. And if you, if you look there, if you kinda zoom in a bit, you'll see that it says, "Cache write one seventy-two tokens," uh, right up there. Um, and then cache hit one seventy-two tokens. And what's going on there is that we are writing that to the cache and then as the conversation continues with the agent, great, we're just able to reuse those tokens and get that ninety percent savings. Um, and so we've already kind of about halved the cost of our, uh, agent because we've gotten that kind of about fifty-eight percent cache hit rate. So nice. Okay, this is a good first step, uh, to start out with. Let's, let's take a look at maybe the rest of this, uh, dashboard. Um, wait. Ben, there's five OKRs. Uh, this is-- I gave you a million tokens of context and this is where we get... Okay. Okay, so [clears throat] turns out we've filled up the context window, and now we got a little bit more work to do. So let's switch back to slides. What we need is context engineering. And what context engineering is, is the art and science of figuring out what context you expose to Claude to give your agent the best performance. And again, I said this earlier, I'm gonna repeat it many times. Look at your transcript, um, to see it. Uh, to see what the Claude, uh, the, the models are actually seeing. Um, and that will, I think, be incredibly illustrative to see, is there a lot of stuff that you really don't need to be passing to Claude, or is there a lot of relevant stuff that's keeping it on track? And we'll go through today. There's many techniques. We'll go through kind of three primary ones about how you can implement, uh, better context engineering on your models. And so to start, the first is around figuring out what tools we pass to the model and kind of narrowing that down. The second is about narrowing down what results from those tools we share to the model. And finally, as that conversation continues in the history, it's about keeping that conversation going for kind of almost unlimited feeling context. So let's go through these one by one. The first is tool search. Agents use a ton of tools. It's one of the magic of making great agents is tool calls. And, and Claude models are especially, uh, uh, capable of leveraging those tools. We see tens, even hundreds of tools used in agents, uh, today. Now, one of the problems, though, is when you define all those tools and you pass them to the model, what ends up happening is they fill up a lot of your context, and that leaves a lot less room for the actual work that needs to happen in the model. And so what you're looking at here is that, you know, that context fills up after just a couple turns. So instead, we have tool search tool. And what tool search tool does is it figures out-- You define all your tools up front, but we only pass the model a tool that tells it, "Okay, when you think you might need a tool, call this," and it gives then a list of the tools, and then only then does it put into context the actual definition of the tool. Um, and you can see here on the, uh, on the bottom row, on the middle row rather, uh, we're just passing in those orange tools right when needed, leaving a lot more space for the actual stuff you need. Um, and Lovable tried this. Um, you heard from Fabian earlier on this stage, and he was talking about how, um, he's told us in, um, with tool search that they actually reduced their overall token consumption by ten percent with just this, uh, this solution. And importantly, it actually improved the performance of their model as well. That they saw that because you're putting less gunk into context, you're putting more relevant stuff there onlyIt turns out the models actually perform better. Um, so they've rolled this out to all their users. Okay, so that's tool search. What happens about what you do with those results from the tools? That's the next one, is programmatic tool calling. For programmatic tool calling, what-- the idea here is: how do we curate that content that's coming back from tools just to what's most relevant? And the kind of the, the insight here is it turns out Claude's really good at writing code. Um, if you've used Claude Code or any other coding tool that leverages the models, you have been able to experience this, and we use that to your advantage here as well. With-- We have C-Claude just write a simple Python script that can actually call those same tools, get the same results, but instead do a little bit of work to curate that content and then send it to the model, just what's most relevant. And folks like Quora have used it. Um, they've used it with HTML content, where they've been able to strip away all the stuff that's just irrelevant, keep the part that's relevant, and seen the performance of their models improve. Okay, so the conversation continues. Um, you've probably heard from folks, uh, on this stage before about how from, like, Lisa and Jeremy, about how we're seeing the models able to do even hours of autonomous work today. And so inevitably, you're gonna hit that million context, uh, threshold. And what-- with compaction is a way to allow you to continue that conversation instead of getting halted to a screeching stop, screeching stop, uh, when you hit that, uh, full context. What happens is it summarizes the context with your prompt, shifts it down to lower context, removes the turns that are no longer relevant, and then the conversation continues. Rinse, repeat. And from there, you get this almost feeling of unlimited context, um, keeping the models on track through compaction. And Hex has used this, um, that since using this, they've been able to simplify down their code, um, and seen that their, uh, c- uh, model's able to continue to perform nicely. Nice. So, um, let's go back to the demo. Um, I think, Ben, we've, uh, we've got a couple solutions of context engineering. Take us, take us forward. Let's see what we can do. So right here, you're seeing immediately that, uh, we've now seen that-- See that, uh, context, uh, window bar kind of in the left-hand panel? You're seeing that's going up a lot slower because we're putting in less into context in each turn. And what you'll notice that when it hits around 400K, just right there, you notice how it went back down? What happened was that we hit a threshold. We've set it to 400K. I'll get into, uh, that in a bit. And then it compr-- uh, compressed it down with compaction. [coughs] So, okay, this is great. Um, uh, let's, let's take a look and see what we can do with, like, how is this working? Let, let's take a look at tool search, maybe. Can we find one in there? Yeah, there we go. So we wanna get our hero retention metrics. Remember, uh, we-- this is an analytical business. We need to, we need to understand retention. Um, and so it calls the models, and it asks them, "Okay, what are, um, the, the tools that, uh, can help me get my hero retention metrics?" Turns out there is a hero retention metrics tool. And now note in each of these, you can notice that, like, that one is fourteen thousand tokens, um, in the schema. The, uh, hero list is sixty-one hundred. Um, the next one's ninety-three hundred. Like, these are large tools, and that's, that's totally normal for a lot of agents. And because we haven't put them into context, there's a lot less, uh, room that's being taken up by these tools that we won't need. And instead we call... Okay, hero retention metrics, that's the correct one. Um, if we go down to the transcript to find the-- Uh, here we go. Great. So what you see here now is these are the exact definition of that tool. Just that tool, none of the rest. And great, now the model can use the tool it needs, and it leaves room for everything else. So, okay, we've got the tool we need. Now what happens next when we get the results? Let's take a look at... Yeah, right here. This is-- So Gong is, uh, if you're not familiar, it's a tool that kind of records your sales team's conversations, so you can, like, look at the data, analyze it, see how, uh, how the field is reacting, uh, to maybe the product you've launched. And, um, you know, these three-year-olds, I love them, but they really care about their green balloons a lot when we send superheroes to them. They go on and on for thirty minutes, even sixty minutes sometimes, and it's just not that relevant, frankly, for when we're putting this dashboard together. I don't need all that data. I just kinda need the general sentiment, the vibes of, uh, of how they feel about the conversation. So with, uh, programmatic tool calling, we've-- uh, the, the model has created a nice little script here, and it first looks at the first twenty-five hundred, uh, tokens, uh, characters. It's there, just understands what's the structure, and then it realizes all we want is the aggregate sentiment. And so it then writes a simple way to loop through, get the aggregate sentiment from the variety of calls, and stream that into the dashboard. And great. What previously was this massive result block that you just saw actually is now down to just the parts of it that we really need for what we need to do right now. Again, keeping that context narrow. And, and remember, this session's not about a cool demo. There's lots of cool demos in AI. Love them too. But, um, the, the kinda core thing here is about actually putting it into production. And in that case, you need to really make sure that the context really matches what's actually needed to make your mo-- uh, your product successful. So, um, okay, that's the second one. Let's talk about the third one, which is compaction. So we hit that 400K threshold, and now we need to take it down. We, we have chosen 400K. Now, I launched a million context. I'm a big fan of the million context window. Lots of scenarios where a million context is great. It might be for your scenario that the right combination of intelligence, cost, latency is not a million. Start with, you know, 500K. 400K is often a good starting point, but, you know, it changes by model. Um, and what we've done is we set that threshold, and then we create the summary. And you get to create a-- your own prompt, um, to help guide it. Um, and then it, it just puts together the key facts that's needed.To summarize where things are, keep the conversation on track without losing the, the wrong context, uh, and adds that in here. And then great, uh, moves on with the conversation. So super, we've implemented, uh, context engineering. For those of you with your accounting eyes, you can see the cost is down, um, to about eleven pounds. We've reduced it down about a third from where we were earlier. Um, but I have to, I have to confess one thing. Um, this has been a tough business. Uh, this one superhero you'll hear about, Cryothane, he's-- he does a lot of stuff that just increases my insurance premiums, and it is, like, a tough business. Like, margins are really, really thin and, um, yeah, we need to get the cost down. Every time I'm loading this right now, it's costing eleven pounds. That's pretty high. And, um, what model are we using on this? Okay, we're using four-- Opus four-seven. Good model, but it's high intelligence and therefore also higher cost. I wonder, like, could we maybe figure out how to do this with Sonnet and Haiku, but do it with closer to Opus intelligence? Let's switch back to slides and see what, see what we got. Okay, so we've got advisor strategy. Uh, of course, I had a solution to that problem. Uh, so the idea behind advisor strategy is that you have the model-- The agent is run with an executor that's Sonnet or Haiku. And the, the kinda insight here is that it-- when it-- it, it can kinda figure out what to do with all these different shapes, except when it kind of-- You'll see in a second, it, it encounters this sort of weird oddball shape, um, and it doesn't know what to do with this, and it just calls the advisor, um, asks what to do, and then it, it gives it advice on what to do. Uh, the insight here is if you've ever worked with development teams, you've probably noticed there are senior engineers paired with junior engineers that make that junior engineer so much more powerful. That junior engineer is still the person hands on keyboard getting work done. But with the coaching, the mentorship, the code reviews, the help on architecture from the senior engineer, they're able to actually achieve so much more than they otherwise would have, sometimes even approaching what someone with, uh, that senior engineer skills could have done solo. And that's the kind of same idea here. It turns out that same logic works with models. Um, that pairing the kind of Sonnet/Haiku gets you kind of the prices of those models approaching even sometimes Opus intelligence, a big, uh, intelligence boost. And folks like Bolt have used this. They've kinda gotten better architectural decisions where they can see on complex tasks, it improves performance. While on less complex tasks, um, uh, it has no extra overhead. It just doesn't call the tool. And so it's a pretty nice trade-off. Um, it's like a Pareto optimal way to improve your cost, uh, and intelligence, uh, performance. So let's, let's go back to the demo machine, and let's, let's take a look. Let's see. Um, let's go ahead and implement it. Okay, so we're starting at, like, just about eleven pounds, and you'll see, yeah, that, that, that bar is going up a lot less fast. Of course, we are using Sonnet. Now, as you can see, there's Sonnet four-six with Opus four-seven advisor. Okay. So it, it kinda seems obvious, of course, the cost is gonna be lower. We just switched to, uh, a more, uh, inexpensive, I guess, like, a higher value model. Um, a-and the question is not is it better quality. Uh, it's not is, is it lower cost. It's is it also better quality? Is it, is it, is it also losing intelligence on this dashboard? And so if we wanna scroll through, I think that there was a, uh-- So we have this, this one contract that we've been struggling with. Uh, we really need to land it. I, I, I don't know if we're gonna raise our next round if we don't get this. Um, it's the Metropolis renewal. Uh, Metropolis, uh, uh, has been our big customer from the start. They've been big supporters of Hero Corp. I need to keep them on board. Um, and, uh, so Sonnet looked through the transcripts, was like, "I think the renewal is on track. We're doing well." But this is important, so let's ask Opus. And this is one of those ways that the advisor tool can be used. You can pass it a transcript. It'll take a look. Anything wrong? And it'll report back of, "Hey, you might have missed a thing," or, "Looks good." In this case, Sonnet said it was green, but Opus looks through, and it looks through it deeply, more fine-eyed, and is able to say, "Actually, um, the mayor specifically wants Cryothane." That guy who's increasing my insurance premiums? Yeah. They-- He specifically wants him. He's too good to lose. I know it. Um, it's, it's... Anyway, um, he's great. So, but, uh, he specifically wants him, but he's just unavailable that day. And so, uh, actually, this is red. It-- The, the renewal will not go through if, uh, he's unavailable on the day of the big event, uh, that they want at City Hall. And so this is a watermelon, if you've heard that term. It's, uh, green on the outside, but deep, deep red on the inside. Uh, and so [clears throat] what happens is Opus overrides, uh, Sonnet and, uh, lets us know and, uh, catches this for us. So that's good. Okay. This is, this is great. We, we were able to kind of recover that intelligence and have good performance. So okay, to backtrack here, like, what happened in this dashboard? We started at over ten times the cost, and we brought that down through one prompt caching. We figured out what our prompt cache hit rate was, zero. We implemented it. And by the way, one great way to implement it again is the Claude API, uh, skill within Claude Code. It helps figure out a lot of that logic for you. We looked through the transcript then to see exactly what's going on, and we then implemented context engineering. First, reducing what tools are sent to the model. Two is then curating what data gets backs-- uh, from those results, gets sent back to the model. And then finally, helping create almost unlimited context with compaction. And then, uh, the last part was then helping reduce that cost further, uh, while preserving intelligence through advisor tool.So, um, we've done all this. Uh, the Opus has given us a, a, a nice little button here in, in, you know, the era of AI agents. As CEO, I just click buttons nowadays. So, um, should we save the Metropolis contract? Yeah, let's do it. Let's do it. Go ahead and click the button, and Metropolis is saved. We're gonna get our next round, I think. I hope. Let's see. Uh, the VC's in the house, we can, we can talk afterwards. Um, so let's switch back to slides. Okay, so key takeaways. Again, to review, what did we cover here today? First thing, prompt caching. If you do nothing else, prompt caching. Figure out what your prompt cache hit rate is and implement it. Cloud API skill is a great way to get started. Then we did the context engineering, uh, pieces. We curated what's getting sent to the model, both from tools, from, uh, results of those tools, and then compacting it as that conversation grows. And finally, advisor strategy, a Pareto optimal way in many cases to get better cost and intelligence. A great trade-off. Uh, now, I've talked about things that have launched literally in the last twenty-four hours here. So the cloud platform is evolving fast. I'm excited you've been here for this talk, but your learning journey is not over. Um, we're gonna continue to try to do our best to make sure you can build not just great demos, but real production agents that work for you and your customers, that let you build a business o- on this platform, and we're gonna do that continuously. That's our commitment to you, and that means keeping abreast of all the things that we're launching. Frankly, I didn't even have enough room on this slide to put everything that's launched in twenty twenty-six. Um, this is just a subset, uh, of what we've, uh, w- we've already launched this year. Um, I'm particularly excited about, uh, automatic prompt caching, a one-line way to implement proca- prompt caching if you've never implemented it, um, or the cloud platform on AWS. This one's very cool. I mean, it's this whole platform that we've talked about available, where I know a lot of folks in this room use our models on AWS. It's all right there. So yeah, this is a great way, uh, to get started. I'm excited to see what you guys build, and thank you for coming. [audience applauding] [upbeat music]

Episode duration: 26:40

Install uListen for AI-powered chat & search across the full episode — Get Full Transcript

Transcript of episode QIriO1-vHYw

Get more out of YouTube videos.

High quality summaries for YouTube videos. Accurate transcripts to search & find moments. Powered by ChatGPT & Claude AI.