Claude

The thinking lever

Adaptive thinking and effort controls give developers a new decision: how much should Claude reason for a given task? This session covers thinking budgets, effort levels, and the cost, latency, and quality tradeoffs involved.

May 7, 2026 · 24m · Watch on YouTube ↗

At a glance

WHAT IT’S REALLY ABOUT

How Claude scales inference compute with effort levels, thinking budgets, and adaptive thinking

  1. The talk defines test-time compute as spending more inference-time tokens to improve results, showing performance gains both from larger models and from higher effort on the same model.
  2. A concrete coding-style example (traffic simulation) demonstrates that increasing effort raises time and token usage while improving realism, graphics, and behavioral fidelity.
  3. Claude’s inference tokens are grouped into three buckets—thinking tokens, tool-calling tokens, and text tokens—each contributing differently to problem solving and user interaction.
  4. Newer inference behavior (adaptive thinking) removes rigid sequencing, letting Claude choose when and how much to think, call tools, or communicate to best satisfy task constraints.
  5. Practical guidance is provided for selecting effort levels, using eval curves to find diminishing returns, and choosing between small models and larger models run at low effort, depending on latency and cost needs.
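The eval-curve guidance in point 5 can be sketched as a simple diminishing-returns check: walk an effort-performance curve from low to high effort and stop at the first level where the quality gain falls below a threshold. The effort labels, scores, and threshold below are hypothetical illustration data, not measurements from the talk.

```python
# Pick the cheapest effort level beyond which quality gains fall under a
# threshold. Effort labels and eval scores are hypothetical sample data.
def pick_effort(curve, min_gain=0.02):
    """curve: list of (effort_label, eval_score), ordered low -> high effort."""
    best_label, best_score = curve[0]
    for label, score in curve[1:]:
        if score - best_score < min_gain:  # diminishing returns: stop climbing
            break
        best_label, best_score = label, score
    return best_label

# Hypothetical effort-performance curve from an offline eval run.
curve = [("low", 0.71), ("medium", 0.79), ("high", 0.83), ("max", 0.835)]
print(pick_effort(curve))  # "high": the jump to "max" is under min_gain
```

The same scan also answers the small-model-vs-big-model-at-low-effort question if each candidate configuration is treated as one point on a shared cost axis.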

IDEAS WORTH REMEMBERING

5 ideas

More inference-time compute can raise task performance without changing the model.

Across benchmarks and the traffic-simulation demo, the same Opus model performs better when allowed to spend more time/tokens, though the gains may taper off at the highest settings.

Claude’s “compute” is not just hidden reasoning—it’s a mix of thinking, tool calls, and user-facing text.

Thinking tokens support internal deliberation, tool-calling tokens connect to external actions like search/files, and text tokens handle status updates and final outputs; optimizing outcomes depends on balancing all three.
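The three buckets can be made concrete by tallying a response's content blocks. The block dicts below are hypothetical sample data, shaped after the thinking / tool-use / text block types the talk describes; real responses would carry fuller fields.

```python
from collections import Counter

# Map block types to the three buckets described above.
BUCKETS = {
    "thinking": "thinking tokens",
    "tool_use": "tool-calling tokens",
    "text": "text tokens",
}

def bucket_counts(blocks):
    """Count how many content blocks fall into each compute bucket."""
    return Counter(BUCKETS.get(b["type"], "other") for b in blocks)

# Hypothetical response: deliberate, search, report, deliberate again.
blocks = [
    {"type": "thinking", "thinking": "Plan: search first, then summarize."},
    {"type": "tool_use", "name": "web_search", "input": {"query": "traffic sim"}},
    {"type": "text", "text": "Here's what I found..."},
    {"type": "thinking", "thinking": "The second result looks most relevant."},
]
print(bucket_counts(blocks))
```

Optimizing an agent then becomes a question of where the token spend sits across these buckets, not just how large the total is.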

Adaptive thinking is a flexibility upgrade, not a router or simple on/off toggle.

Instead of forcing a fixed order (think → tools → answer) or even strict interleaving, adaptive thinking allows Claude to think or communicate at any point, and to skip thinking for trivial tasks.

Effort is a better control surface than “thinking toggles.”

Toggles can constrain a core capability, whereas effort more directly expresses the desired trade-off among time, cost, and quality across the whole workflow (thinking + tools + output).

Budgets matter more as tasks get longer and more agentic.

Task budgets set upper bounds (tokens/time/cost) and create natural “check-in” points, which becomes critical if models run for hours or longer on complex engineering or research tasks.
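A minimal sketch of the budget idea: cap token and wall-clock spend, and surface a check-in point before either cap is hit. The class, caps, and 80% check-in fraction are illustrative assumptions, not the talk's implementation.

```python
import time

class TaskBudget:
    """Track token and time spend against caps; flag natural check-in points.
    Illustrative sketch only -- caps and the check-in fraction are assumptions."""

    def __init__(self, max_tokens, max_seconds, checkin_frac=0.8):
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self.checkin_frac = checkin_frac
        self.tokens_used = 0
        self.start = time.monotonic()

    def record(self, tokens):
        self.tokens_used += tokens

    @property
    def exhausted(self):
        return (self.tokens_used >= self.max_tokens
                or time.monotonic() - self.start >= self.max_seconds)

    @property
    def needs_checkin(self):
        # Check in with the user before the cap is actually hit.
        return self.tokens_used >= self.checkin_frac * self.max_tokens

budget = TaskBudget(max_tokens=100_000, max_seconds=3600)
budget.record(85_000)
print(budget.needs_checkin, budget.exhausted)  # True False
```

For hour- or day-scale agentic runs, the check-in property is the useful part: it turns a hard cutoff into an opportunity to report progress and renegotiate the budget.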

WORDS WORTH SAVING

5 quotes

Similar to how we can scale compute at training time by training bigger models over longer time horizons using more data, we can also scale compute at test time by allowing those models to spend more time working on a problem.

Matt Bleifer

As we continue to scale test time compute further and further, Claude isn't just going to work for seconds or minutes or hours on a problem. It's gonna work for days, weeks, months, even years, spending tokens to try to solve some of humanity's toughest challenges.

Matt Bleifer

Thinking tokens represent Claude's internal monologue.

Matt Bleifer

An effort dial is a much better expression of the idea of spend more tokens in order to get a better answer.

Matt Bleifer

Our North Star for Claude overall is that it allocates compute incredibly well when asked for it and that you can set a quality bar and a budget and Claude will just go ahead and figure out the rest and give you the best performance for your use case.

Matt Bleifer

  - Test-time compute (inference-time scaling)
  - Effort levels and quality/cost/latency trade-offs
  - Token types: thinking vs tool-calling vs text
  - Interleaved thinking vs adaptive thinking
  - Task budgets (token/time/cost caps)
  - Evals and effort-performance curves
  - Model selection: small model vs big model, low effort vs high effort

High quality AI-generated summary created from speaker-labeled transcript.
