The thinking lever

Adaptive thinking and effort controls give developers a new decision: how much should Claude reason for a given task? This session covers thinking budgets, effort levels, and the cost, latency, and quality tradeoffs involved.

May 20, 2026 · 21m

WHAT IT’S REALLY ABOUT

How Claude uses inference-time compute to improve reasoning performance dramatically

Increasing test-time compute (more tokens/time at inference) reliably improves Claude’s performance across diverse benchmarks and domains, similar to scaling training compute.
Claude’s runtime compute is expressed through three channels—thinking, tool calling, and text output—each contributing to capability, latency, and token cost.
Effort levels (low→max) provide a practical dial to trade speed/cost for better results, but gains can show diminishing returns at the highest settings.
Adaptive thinking evolves earlier “think-then-act” patterns by letting Claude decide when to think, call tools, or answer, improving efficiency relative to fixed thinking schemes.
Best practice guidance emphasizes using evaluations to choose effort/model size, enabling thinking when possible, and defaulting to extra-high effort when uncertain.
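
The evaluation-driven guidance above can be sketched as a small selection routine. This is only an illustration, not an Anthropic API: `EvalResult`, `pick_effort`, and every number in `HYPOTHETICAL_RESULTS` are assumptions made up for the example. It picks the cheapest effort level that meets a quality target, and falls back to extra-high effort when there is no evaluation data at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    effort: str         # effort level name, e.g. "low" or "extra_high"
    pass_rate: float    # fraction of eval tasks solved, 0.0 to 1.0
    mean_tokens: float  # average tokens spent per task


def pick_effort(results, target_pass_rate, default="extra_high"):
    """Return the cheapest effort level whose pass rate meets the target.

    With no eval data, return `default` (the "when in doubt" setting).
    If no level reaches the target, return the level with the best pass rate.
    """
    if not results:
        return default
    qualifying = [r for r in results if r.pass_rate >= target_pass_rate]
    if qualifying:
        return min(qualifying, key=lambda r: r.mean_tokens).effort
    return max(results, key=lambda r: r.pass_rate).effort


# Hypothetical eval numbers, invented for illustration only.
HYPOTHETICAL_RESULTS = [
    EvalResult("low", 0.62, 1200),
    EvalResult("medium", 0.74, 2500),
    EvalResult("high", 0.83, 5200),
    EvalResult("extra_high", 0.88, 9000),
    EvalResult("max", 0.89, 21000),
]

chosen = pick_effort(HYPOTHETICAL_RESULTS, target_pass_rate=0.80)
```

With these made-up numbers, a target of 0.80 selects "high", because it is the cheapest level that clears the bar. Real choices should come from evaluations on your own tasks.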

IDEAS WORTH REMEMBERING

More inference-time tokens can directly buy better outcomes.

Across internal and public-style benchmarks (coding, QA, computer-use, PhD-level questions), performance improves as the model spends more tokens thinking/working before answering.

Effort is a speed–intelligence–cost trade-off, not a simple on/off switch.

Low effort can be fast and sufficient for simple tasks, while high/extra-high often yields better reasoning; max can help on the hardest tasks but may provide only marginal gains per added token.
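
One way to see diminishing returns is to compute the score gained per 1,000 additional tokens between adjacent effort levels. The sketch below assumes you have measured mean tokens and a score for each level. The function name `marginal_gains` and all data points are hypothetical and do not come from the session.

```python
def marginal_gains(points):
    """Score gained per 1,000 extra tokens between adjacent effort levels.

    `points` is a list of (effort, mean_tokens, score) tuples sorted by
    strictly increasing mean_tokens.
    """
    gains = []
    for (e0, t0, s0), (e1, t1, s1) in zip(points, points[1:]):
        if t1 <= t0:
            raise ValueError("points must be sorted by increasing mean_tokens")
        gains.append((e0, e1, (s1 - s0) / (t1 - t0) * 1000))
    return gains


# Hypothetical measurements, invented for illustration only.
HYPOTHETICAL_POINTS = [
    ("low", 1000, 0.60),
    ("high", 5000, 0.80),
    ("extra_high", 9000, 0.88),
    ("max", 21000, 0.90),
]

gains = marginal_gains(HYPOTHETICAL_POINTS)
```

In this invented data each step up buys less score per added token than the step before it, which is the pattern the session describes at the highest settings.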

Treat “thinking” as a core capability, not a feature to disable.

Turning off extended thinking isn’t equivalent to “think less”—it removes a problem-solving tool; Anthropic’s recommendation is to enable thinking whenever possible and tune its length via effort.

Adaptive thinking is closer to how humans solve problems and is more efficient.

Rather than forcing a single think-then-act phase, adaptive thinking lets Claude decide when to think, act (tool calls), or respond—often yielding Pareto improvements over interleaved thinking.
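
A Pareto improvement means one configuration is at least as good on every axis and strictly better on at least one. The check below is a minimal sketch of that definition over score, tokens, and latency. The `Run` type, the `dominates` helper, and the two sample runs are assumptions for illustration, not measurements from the session.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str
    score: float      # higher is better
    tokens: float     # lower is better
    latency_s: float  # lower is better


def dominates(a, b):
    """Return True if run `a` Pareto-dominates run `b`."""
    no_worse = (
        a.score >= b.score
        and a.tokens <= b.tokens
        and a.latency_s <= b.latency_s
    )
    strictly_better = (
        a.score > b.score
        or a.tokens < b.tokens
        or a.latency_s < b.latency_s
    )
    return no_worse and strictly_better


# Hypothetical runs, invented for illustration only.
interleaved = Run("interleaved", score=0.80, tokens=9000, latency_s=40.0)
adaptive = Run("adaptive", score=0.82, tokens=7000, latency_s=31.0)

adaptive_is_pareto_better = dominates(adaptive, interleaved)
```

If a new scheme only wins on some axes and loses on others, `dominates` returns False in both directions, and the choice becomes a trade-off rather than a Pareto improvement.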

Bigger models often beat smaller models even with lower effort when intelligence is required.

A comparison in the session showed Haiku producing a substantially worse simulation than a larger model despite similar token usage. Rule of thumb: if the task is meaningfully intelligence-bound, prefer a stronger model (e.g., Opus) even at low effort.
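
The rule of thumb above can be checked against your own evaluations by comparing model and effort pairs under a per-task token budget. The following is a sketch under stated assumptions: `best_under_budget` is a hypothetical helper, and the scores and token counts are invented, chosen only to mirror the "similar tokens, different quality" situation described.

```python
def best_under_budget(configs, token_budget):
    """Return the highest-scoring config whose token use fits the budget.

    `configs` is a list of dicts with keys "model", "effort", "score",
    and "tokens". Returns None if nothing fits the budget.
    """
    affordable = [c for c in configs if c["tokens"] <= token_budget]
    if not affordable:
        return None
    return max(affordable, key=lambda c: c["score"])


# Hypothetical eval results, invented for illustration only.
HYPOTHETICAL_CONFIGS = [
    {"model": "haiku", "effort": "high", "score": 0.55, "tokens": 4000},
    {"model": "opus", "effort": "low", "score": 0.78, "tokens": 4200},
    {"model": "opus", "effort": "extra_high", "score": 0.90, "tokens": 12000},
]

pick = best_under_budget(HYPOTHETICAL_CONFIGS, token_budget=5000)
```

With a 5,000-token budget and these made-up numbers, the larger model at low effort wins over the smaller model at high effort, even though both spend a similar number of tokens.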

WORDS WORTH SAVING

Test time compute has direct costs in the form of tokens, token count, and time that it takes.

— Alexander Brichen

Someone doesn't ask me a question, I like stand there processing it, and then suddenly I like go execute a bunch of steps and come back with the answer, right? Instead, which is how we resulted in developing interleaved thinking, y-you do something, you think about it, you do another thing, you think about it, and then you come back with a result.

— Alexander Brichen

A thinking toggle is actually a poor proxy for the amount of effort that a model should put in. You're not expressing how hard you want Claude to think when you turn a thinking toggle on or off. You're actually just turning off a core capability.

— Alexander Brichen

You should enable thinking really whenever possible. Give Claude that space to reason. Give it the scratch pad so it knows that it can use that thinking tool when it needs to.

— Alexander Brichen

And then finally, when in doubt, go with extra high. It's the default that we've set for our products, and I would argue that it's a great kind of Pareto efficient outcome between latency and n- number of tokens and intelligence.

— Alexander Brichen

Test-time compute vs train-time compute
Effort levels (low, medium, high, extra high, max)
Thinking/tool calling/text as compute channels
Interleaved thinking vs adaptive thinking
Diminishing marginal returns at high effort
Model size trade-offs (Haiku vs Sonnet vs Opus)
Evaluation-driven tuning and budgets (token/task constraints)

High quality AI-generated summary created from speaker-labeled transcript.
