
The thinking lever

Adaptive thinking and effort controls give developers a new decision: how much should Claude reason for a given task? This session covers thinking budgets, effort levels, and the cost, latency, and quality tradeoffs involved.

May 8, 2026 · 24m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. Matt Bleifer

    [singing] When it all falls down, it could all fall down like confetti. You just gotta let it... All right. Hello, everyone, and welcome. My name is Matt Bleifer. I'm a product manager on Anthropic's research team, and today I'll be sharing a little bit about how Claude leverages compute at inference time, otherwise known as test-time compute, to break down and solve some of your hardest software engineering challenges. Along the way, I'll share what levers you have at your disposal to influence how Claude spends tokens, along with some best practices to help you get the most out of them. One of the key developments in large language models over the last couple of years has been the scaling of test-time compute, creating what we've all come to know as reasoning models. Just as we can scale compute at training time by training bigger models over longer time horizons with more data, we can scale compute at test time by allowing those models to spend more time working on a problem. If you look at the graph on the left, you can see that as we move from Haiku to Sonnet to Opus and the model gets more intelligent, it scores better on our agentic coding evaluation. Similarly, in the graph on the right, as that same model, Opus, spends more time working on a problem, it gets correspondingly better scores. That's what we mean by scaling test-time compute. And this isn't true just of software engineering; it holds across a whole variety of knowledge-work domains, whether it's agentic search, computer use, or PhD-level academic reasoning. If we allow models to spend more time working on a problem, they can achieve better and better results.
Looking at charts and graphs is great for understanding the data and the correlations, but nothing beats seeing a tangible example of what this looks like in practice. So I ran Opus 4.7 at a few different effort levels, scaling the amount of time it works on a given prompt; in this case, I asked it to create a realistic simulation of cars going down a one-way street at a traffic light. The first result here is Opus 4.7 running on low effort. It took about 50 seconds to produce a result and worked for about 4,600 output tokens, and I'd say it accomplished something fairly reasonable. We do, in fact, have cars going down a one-way street, and they stop at the traffic light. But overall it's a pretty basic simulation: the traffic flow is simple, the graphics are limited, and for some reason Claude thought it would be a great idea to put the traffic light right in the middle of the road, which maybe wasn't the best design decision, but we'll still call it functionally passing. Next, I cranked the effort dial up a bit. When I moved effort up to high, Opus 4.7 took about twice the time working on our traffic simulation and double the output tokens, but as you can see, it achieved a better result. It has cars of different types, it smartly moved the traffic light over to the side of the road, and Opus told me it even implemented what it called an intelligent driver model, where every car more uniquely responds to the dynamics of the car around it, doing a better job simulating a realistic traffic pattern. So again: twice the amount of time, better results.
The last thing I did was crank the effort dial all the way up to max. In this setting, Opus 4.7 took 10x the time it spent executing the same prompt on low effort and 10x the tokens, but it achieved the best results yet: the best graphics, my favorite traffic light of all of them, and really realistic driving patterns. This is all an example of how, even on the same model, allowing Claude to spend more time working on a problem gets better results. As we continue to scale test-time compute further and further, Claude isn't just going to work for seconds or minutes or hours on a problem; it's going to work for days, weeks, months, even years, spending tokens to try to solve some of humanity's toughest challenges. When I talk about test-time compute, I really mean any form of Claude spending tokens at inference time to solve your problem. We can break these tokens down into three distinct buckets. The first bucket is thinking tokens. This is the classic form of tokens that underlies what we know as reasoning models. Thinking tokens represent Claude's internal monologue: its space to reason step by step, consider different potential options, do chain-of-thought reasoning, create a scratchpad where it can work through a problem, and ultimately spend time thinking through what it needs to do to take the best actions and deliver the best results. The second form of tokens Claude can spend when taking on a task is tool-calling tokens. Tool calling is Claude's way of interfacing with the rest of the world, whether that's executing a search (in this example, giving me more information about the Code with Claude conference) or reading and writing files to build out software engineering projects.
There are really millions of different tools Claude can call, but in all of these scenarios, tool-calling tokens are Claude's way of interfacing with its environment. The last type of tokens Claude can spend is text, and this is Claude's way of interacting with you. Whether it needs to give you updates as it works on a really tough problem, give you a summary at the end explaining everything it did in response to the task you gave it, or simply respond to a simple question, text tokens are Claude's way of communicating with an end user. So we have three different types of tokens: thinking, tool calling, and text, and all three are fundamental to the way Claude works and responds to problems. But all of these tokens have direct costs to users, in the form of both the practical token costs we pay for and waiting time: when Claude spends more tokens, we as users wait longer for a result. So we think it's really important to give users the ability to influence or constrain how Claude spends tokens. Users can express their preferences and constraints in a couple of ways. The first is the effort dial I talked about. Effort is a way to tell Claude how you want it to trade off time, cost, and quality when responding to your task. Should Claude spend more time to get a better answer? Should it spend less time to get a faster answer? These are preferences you can give Claude that it will take into account when it spends these tokens. Another form of constraint users can provide is budgets.
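The three token buckets described above can be illustrated with a small accounting helper. This is a hedged sketch: the block shapes below are a simplified stand-in invented for illustration, not the actual API response format, and `tally_tokens_by_type` is a hypothetical helper.

```python
from collections import Counter

def tally_tokens_by_type(blocks):
    """Tally token counts across the three buckets the talk describes:
    thinking, tool calling, and text. `blocks` is a list of dicts with
    a 'type' field ('thinking', 'tool_use', or 'text') and a 'tokens'
    count -- a simplified stand-in for real response content blocks."""
    buckets = Counter()
    for block in blocks:
        if block["type"] == "thinking":
            buckets["thinking"] += block["tokens"]
        elif block["type"] == "tool_use":
            buckets["tool_calling"] += block["tokens"]
        elif block["type"] == "text":
            buckets["text"] += block["tokens"]
    return dict(buckets)

# Example: a response that interleaves all three token types.
response = [
    {"type": "text", "tokens": 40},       # acknowledge the request
    {"type": "tool_use", "tokens": 120},  # call a search tool
    {"type": "thinking", "tokens": 300},  # reason about the result
    {"type": "text", "tokens": 80},       # final answer
]
print(tally_tokens_by_type(response))
```

A breakdown like this is also the raw material for the cost accounting discussed next: every bucket shows up on the same bill and in the same wait time.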
Recently, we launched a feature we call task budgets, which lets you give Claude an upper bound on the number of tokens it will spend working on a task. You might say, "Hey, I want you to build out this particular software engineering feature for me, but I don't want you to spend more than a hundred thousand tokens before you stop and check in with me." Budgets can come in the form of tokens, but they could also come in the form of time or cost, and I think this will get increasingly important as we continue to move up that exponential and Claude works for days, weeks, months, or more: you'll want guidelines for how long it should work on a problem before it stops to check in. Given all of these preferences and constraints, it's up to Claude to figure out the best way to spend those tokens to maximize the outcome. Given the user's effort setting and a potential budget, how does Claude allocate tokens across thinking, tool calling, and text to maximize performance and user experience? When reasoning models were first introduced, they followed a very specific pattern for spending these tokens: first they would think, spending thinking tokens to work through a problem; then they would move on to tool calling; and lastly they would produce text. We improved on this when we introduced interleaved thinking, which allowed Claude to use thinking and reasoning in between tool calls. In this mode, Claude could call a tool, get a result, think about that result, determine what to call next, and so on, all the way until it decides to give a final answer. Recently, we launched adaptive thinking, the next evolution on top of interleaved thinking. In this new paradigm, Claude is free to think whenever appropriate.
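The task-budget idea described above can be sketched as a simple wrapper around an agent loop. This is a hypothetical illustration of the concept, not the feature's actual implementation: `step_fn` is a stand-in for one model turn, and the function names and return shape are invented for the example.

```python
def run_with_budget(step_fn, max_tokens, max_steps=50):
    """Run an agent loop, stopping to check in once a token budget is hit.
    `step_fn` is a stand-in for one model turn and returns
    (tokens_spent, done). This mirrors the idea of task budgets: an
    upper bound on tokens before the agent pauses for the user."""
    spent = 0
    for _ in range(max_steps):
        tokens, done = step_fn()
        spent += tokens
        if done:
            return spent, "finished"
        if spent >= max_tokens:
            return spent, "budget reached -- checking in with user"
    return spent, "step limit reached"
```

The same wrapper shape would work for time or cost budgets: swap the token counter for a wall-clock timer or a running dollar total.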
There's no constraint on when Claude needs to think, how much it needs to think, or in what order it spends any of these tokens. It can leverage thinking, tool use, and text in whatever order best meets the requirements of your task. Claude could start with a text response to acknowledge the user's request, stop to call a tool, think about the result, respond to the user with an update, continue calling tools, and so on, all the way until it provides a final answer for the work it did. Claude could also choose not to think at all for simple queries that don't require it. In practice, Claude will typically think more often and for longer at higher effort levels, but everything is really prompt dependent. Imagine I surveyed someone in the crowd here and asked, "What is two plus two?" Whether I ask them to spend a little time or a lot of time on the problem, they're going to spend roughly the same amount of time. However, the story would change a lot if I asked them to conduct a really sophisticated research task; the difference between their thinking on low effort and high effort could be quite dramatic. Now, adaptive thinking is not a model router, and it's not an automated thinking toggle. It's not taking your query, classifying it by difficulty, and deciding whether to use a thinking or non-thinking version of the model. Rather, it's the difference between telling Claude, "You must spend at least one thinking token at the start of this response," and telling Claude, "You can spend thinking tokens whenever and however needed to solve this problem." It's really about Claude having the option to think at every single step of the process.
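The contrast between the original fixed ordering and adaptive thinking can be made concrete with a small checker. This is an illustrative sketch using simplified block-type labels invented for the example, not a real API constraint:

```python
def follows_fixed_order(block_types):
    """Check whether a response follows the original reasoning-model
    pattern: all thinking first, then tool calls, then text. Adaptive
    thinking removes this constraint entirely, so adaptive transcripts
    routinely fail this check."""
    rank = {"thinking": 0, "tool_use": 1, "text": 2}
    ranks = [rank[t] for t in block_types]
    return ranks == sorted(ranks)

# Classic pattern -- think, call tools, answer: passes.
print(follows_fixed_order(["thinking", "tool_use", "tool_use", "text"]))  # True
# Adaptive pattern -- text first, thinking between tool calls: fails the old check.
print(follows_fixed_order(["text", "tool_use", "thinking", "text"]))      # False
```

Interleaved thinking relaxed the constraint partway (thinking allowed between tool calls); adaptive thinking drops the ordering rule entirely.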
We've run all of our benchmarks on adaptive thinking since Opus 4.6, and it's really our intelligence-maximizing setting: it shows performance parity or better with interleaved thinking while delivering a better user experience. I want to dig a little more into effort and contrast it with the ways we've used thinking in the past. Historically, users have treated thinking toggles a lot like an effort dial: if you wanted Claude to spend more time on a problem, you might turn on thinking inside claude.ai or Claude Code and expect it to spend more time working to give you a better result. That's a pretty reasonable instinct. However, thinking toggles are a poor proxy for an effort dial. Rather than expressing how hard you want Claude to work, you're turning a core capability of the model on and off; you're constraining how it's allowed to work, not how hard you want it to work. An effort dial is a much better expression of the idea of spending more tokens to get a better answer: it moves thinking, tool use, and output text all together instead of toggling just one of them. As an analogy with tool use, we don't tell Claude to always search or never search; we tell Claude to figure out when it should search based on the problem at hand, and that's what allows Claude to be agentic in response to your query. Similarly, when we work with our teammates, we don't ask them to turn their inner monologue on and off in response to a question. We ask them how hard to try, and they decide how hard to think about the problem and what actions to take in response. So I want to dig in a little more on effort and give you some practical guidance on setting effort levels for the use case at hand.
First, whenever possible, it's best to run evals and chart performance, with something like total tokens, time, or cost on your x-axis and performance on your y-axis. This lets you create an effort curve and get a better idea of the trade-offs you'd make by selecting a given effort level. Higher effort will improve performance on most intelligence-bound tasks, but it may also show diminishing returns. For your use case, you might look at a graph like this and say, "I will spend whatever tokens I need to get the best intelligence." Or you might say, "The relative improvement between extra high and max isn't worth the extra tokens, so extra high is the better setting for my use case." Low effort can instead accomplish a task much faster and save you a lot of tokens, but it will also limit how thorough Claude is in accomplishing the task at hand. As a quick tip: on low effort, Claude is really trying to save tokens as much as possible, so you may sometimes catch it taking unexpected shortcuts. So in addition to looking at evals, we always think it's a best practice to spend time reading your transcripts to understand exactly how Claude responds at a given effort level for the thing you're asking it to do. On the flip side, low effort has also surprised us in some really interesting ways. One of my favorite evaluations we've created is called Claude Plays Pokemon, where Claude gets the opportunity to work its way through the original Pokemon Red game that many of us grew up knowing and loving. When we ran Claude Plays Pokemon on low effort, something really interesting happened: it treated the game much like a speedrun. It would skip trainer battles to save itself time.
It would use healing items it had stocked up on instead of wasting time going back to Pokemon healing centers, and it would spam an item called a Repel to limit disruptive encounters with other Pokemon, making it through caves much more quickly. What I find most interesting is that we often correlate low effort with lower intelligence, but for any of us who grew up playing this game, this is a super clever strategy. It takes a certain amount of intelligence to figure out how to minimize token spend to get through these levels as fast as possible. So it was interesting to see Claude's interpretation of low effort translate to "beat the game as fast as possible," employing very clever strategies along the way. So: evals are always ideal, and I'll champion them any time you have them. But I also want to give you some quick rules of thumb for selecting an effort setting in the absence of evals, or even alongside them. First, max effort, no surprise, can deliver gains on your hardest tasks. But as I mentioned before, it can sometimes show signs of diminishing returns. I recommend testing it for your most intelligence-demanding use cases, but don't assume it's either the ceiling on performance or the best bang for your buck; it could be that a level down gives you roughly equivalent performance at a fraction of the cost. Extra high effort is a new setting we introduced with Claude Opus 4.7, and we've found it to be the best setting for most coding and agentic use cases. It's currently our default in Claude Code and claude.ai for Opus 4.7, and it does a good job of maximizing intelligence without going overboard.
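The "is the next effort level worth it" judgment described above can be sketched as a marginal-gain calculation over an effort curve. The function, threshold, and numbers below are made-up illustration data, not real benchmark results or an official method:

```python
def pick_effort(curve, min_gain_per_1k_tokens=0.0015):
    """Given an effort curve -- (effort_level, total_tokens, score)
    tuples in increasing-effort order -- return the highest level whose
    marginal score gain per 1,000 extra tokens still clears a
    threshold. This mimics reading a plotted effort curve and stopping
    where the returns diminish."""
    best = curve[0][0]
    for (_, t0, s0), (level, t1, s1) in zip(curve, curve[1:]):
        gain_per_1k = (s1 - s0) / ((t1 - t0) / 1000)
        if gain_per_1k >= min_gain_per_1k_tokens:
            best = level
        else:
            break  # diminishing returns: stop climbing the dial
    return best

# Fabricated example curve: max barely improves on extra_high
# while tripling the tokens, so extra_high is selected.
curve = [
    ("low",        5_000,  0.55),
    ("medium",    10_000,  0.65),
    ("high",      20_000,  0.72),
    ("extra_high", 40_000, 0.76),
    ("max",      120_000,  0.77),
]
print(pick_effort(curve))  # extra_high
```

Raising the threshold expresses more cost sensitivity: with a stricter cutoff, the same curve would select a lower level.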
High effort is a great setting if you're trying to balance token usage and intelligence, and it's probably the value I'd recommend for any intelligence-sensitive use case: a good place to start and test up from. Medium is good for cost-sensitive use cases where you're willing to trade a little intelligence for a much faster result. And low is good to reserve for short-scope tasks and latency-sensitive workloads; as I mentioned before, though, it's always good to put it into practice and see what actually happens, because it might surprise you. I mentioned at the start of the talk that test-time compute is a second way of scaling intelligence alongside training-time compute. That raises the question: if both give similar trade-offs with respect to performance, speed, and cost, when should I use a smaller model, and when should I use a lower effort level on a bigger model? As quick guidelines, I'd say first that low effort on a bigger model is good for an intelligence-demanding use case where you're optimizing for speed. Going back to our traffic-light simulation, Opus 4.7 on low effort spent about the same number of output tokens and took only a little longer than Haiku 4.5 on max effort, but I'd say it achieved a much better result. So low effort on a larger, more intelligent model can often give you better bang for your buck when trading off speed versus intelligence on an intelligence-demanding use case. On the flip side, smaller models can be really good when you're optimizing for cost and your use case isn't too intelligence-demanding.
If you have simpler LLM tasks, especially in bulk, things like classification, information extraction, or basic summarization, that's where small models come in handy; they'll save you a lot of cost when you don't need peak intelligence. Another case where small models are really useful is when your application demands a very low time to first token. If you want Claude to respond as fast as possible to a user query, the nature of smaller models means they'll often start producing tokens much sooner, giving you a better time to first token. The way I think about it: use small models for a fast time to first token; use bigger models at lower effort for a fast time to last token. Wherever possible, as I said before, I recommend evaluating both. Build eval curves across a few model types and various effort levels, then look at what the trade-offs give you for the use case you're trying to optimize. All right, before closing the talk, I want to summarize three key actionable items I hope you take away. One: enable thinking whenever possible to give Claude that space to reason. Thinking is really core to how Claude works; it gives Claude that inner monologue to work through your problem as efficiently as possible. If you want to modulate how much time Claude spends thinking on your problem, use effort levels or budgets as your way of influencing Claude's behavior. Second, and I might sound like a broken record here: if you have evals, use them. Use them to find your ideal balance.
Chart your curves; test different effort settings, budgets, and models; look at the performance and decide what makes sense for your use case, without forgetting to dig in and read those transcripts. And lastly, if you're not going to do any of that and you just need to make a choice on anything coding or software engineering related, my advice is to go with extra high. It's a pretty good setting that delivers great intelligence for the cost. Our North Star for Claude overall is that it allocates compute incredibly well when asked: you set a quality bar and a budget, and Claude figures out the rest and gives you the best performance for your use case. Adaptive thinking, effort levels, and budgets are all steps in this direction, but they're really just the beginning, and there's a lot more to come. I'm excited to share more with you in the future, so stay tuned, and thanks so much for taking the time. If you want to chat more about this, I'll be around the conference in the audience; I'm always happy to nerd out about these things. Thank you. [applauding] [upbeat music]
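The small-model versus big-model-at-low-effort rule of thumb from the talk can be sketched as a tiny decision helper. The labels and logic are illustrative only, not official guidance, and "evaluate both" is the talk's own fallback:

```python
def model_for_latency_goal(goal, intelligence_demanding):
    """Rule of thumb from the talk: small models win on time to FIRST
    token (they start streaming sooner); a bigger model at low effort
    wins on time to LAST token for intelligence-demanding work.
    `goal` is 'first_token' or 'last_token'."""
    if goal == "first_token" and not intelligence_demanding:
        return "small model"
    if goal == "last_token" and intelligence_demanding:
        return "big model, low effort"
    # Conflicting or unclear needs: build eval curves and compare.
    return "evaluate both"
```

As with effort levels, the helper only encodes a prior; the talk's stronger recommendation is to chart eval curves across models and effort settings before committing.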

Episode duration: 24:01


Transcript of episode OXJO4LldSnc
