The thinking lever

Name: The thinking lever
Uploaded: 2026-05-08T00:00:00Z
Duration: 24 min 1 s
Description: The talk defines test-time compute as spending more inference-time tokens to improve results, showing performance gains both from larger models and from higher effort on the same model.

Adaptive thinking and effort controls give developers a new decision: how much should Claude reason for a given task? This session covers thinking budgets, effort levels, and the cost, latency, and quality tradeoffs involved.


CHAPTERS

Why “test-time compute” matters for solving hard engineering problems
Matt Bleifer (Anthropic Research PM) frames the talk around how Claude can spend more compute at inference time (“test-time compute”) to tackle difficult software engineering tasks. He positions this as a major recent driver behind “reasoning models,” analogous to scaling compute at training time.
Evidence of scaling: model size and effort both raise coding performance
Using benchmark-style charts, the talk shows two scaling axes: moving from smaller to larger models (Haiku → Sonnet → Opus) and increasing effort on the same model. Both trends improve agentic coding evaluation scores, illustrating that additional inference-time work can translate to measurable gains.
Concrete demo: traffic-light simulation improves from low → high → max effort
A practical example demonstrates what “effort” changes look like in output quality. The same prompt (simulate cars at a traffic light) yields increasingly realistic simulations as effort increases, at the expense of more time and tokens.
The long horizon: models working for days, weeks, or longer
Matt extrapolates the trend: as test-time compute scales, Claude could work on problems for dramatically longer horizons. This sets up why control mechanisms (preferences and budgets) will become increasingly important.
Three token “buckets”: thinking, tool calling, and text
The talk decomposes inference-time spend into three token types that together constitute test-time compute. Each serves a distinct role: internal reasoning, interacting with external tools, and communicating results to the user.
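As a minimal sketch of that decomposition, the three buckets can be modeled as one record whose total is the test-time compute spent on a task. The class name, field names, and token counts below are illustrative assumptions, not anything shown in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenSpend:
    """Hypothetical record of one task's tokens, split into the three buckets."""

    thinking: int    # internal reasoning tokens
    tool_calls: int  # tokens spent interacting with external tools
    text: int        # tokens in the response shown to the user

    def total(self) -> int:
        """Total test-time compute for the task, measured in tokens."""
        return self.thinking + self.tool_calls + self.text

    def shares(self) -> dict[str, float]:
        """Fraction of the total that went to each bucket."""
        total = self.total()
        if total == 0:
            return {"thinking": 0.0, "tool_calls": 0.0, "text": 0.0}
        return {
            "thinking": self.thinking / total,
            "tool_calls": self.tool_calls / total,
            "text": self.text / total,
        }


# Made-up counts, purely for illustration.
spend = TokenSpend(thinking=6_000, tool_calls=3_000, text=1_000)
print(spend.total())
print(spend.shares())
```

Tracking the split rather than only the total makes it easier to see which bucket grows when effort is raised.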
Costs and controls: effort preferences and task budgets
Because more tokens mean higher cost and longer wait times, users need levers to shape how Claude allocates compute. Matt introduces two control surfaces: an “effort dial” for quality vs latency trade-offs, and “task budgets” to cap spend before check-ins.
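One way to picture a task budget is a client-side counter that flags when cumulative spend reaches a cap, so the work pauses for a check-in. This is a sketch under that assumption only; `TaskBudget`, its methods, and the step sizes are hypothetical and do not describe how Claude or any API implements budgets.

```python
class TaskBudget:
    """Hypothetical helper: tracks token spend against a cap between check-ins."""

    def __init__(self, max_tokens: int) -> None:
        if max_tokens <= 0:
            raise ValueError("max_tokens must be positive")
        self.max_tokens = max_tokens
        self.spent = 0

    def record(self, tokens: int) -> bool:
        """Add one step's tokens; return True when a check-in is due."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        self.spent += tokens
        return self.spent >= self.max_tokens

    def approve_more(self) -> None:
        """Reset the counter after the user approves another stretch of work."""
        self.spent = 0


# Made-up per-step token counts for a four-step task.
budget = TaskBudget(max_tokens=10_000)
steps = [4_000, 3_500, 3_000, 2_000]
checkins = []
for index, step_tokens in enumerate(steps):
    if budget.record(step_tokens):
        checkins.append(index)   # pause here and ask the user
        budget.approve_more()    # assume the user says "continue"
print(checkins)
```

The effort dial and the budget are separate levers in this picture: effort shapes how much is spent per step, while the budget bounds how far the task runs before someone looks at it.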
How token allocation evolved: sequential → interleaved → adaptive thinking
Matt traces the progression of how reasoning models schedule thinking, tools, and text. The newest paradigm, adaptive thinking, removes rigid ordering and lets Claude decide when (or whether) to think throughout a task to maximize performance and UX.
What adaptive thinking is not: no router, no automatic “thinking mode” switch
Adaptive thinking isn’t a classifier that routes queries to different model variants or forces a fixed amount of reasoning. Instead, it grants flexibility: Claude may think at any step if useful, which makes behavior more task- and prompt-dependent.
Why thinking toggles are a poor proxy for effort
Matt argues that users historically used a thinking toggle as a surrogate for “try harder,” but that conflates capability with effort. Effort should control overall token spend (thinking, tools, and output), not just enable/disable one internal mechanism.
Operationalizing effort: build effort curves with evals (and read transcripts)
For selecting effort levels, Matt recommends running evaluations and charting performance against tokens/time/cost to find the best trade-off for a given use case. He also stresses inspecting transcripts, especially at low effort, to catch surprising shortcuts or behaviors.
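An effort curve of this kind can be sketched as a list of eval results per effort level, from which you pick the cheapest level that clears a quality target and check where returns start to flatten. The function names, the 0.65 target, and every number below are placeholder assumptions, not results from the talk.

```python
from typing import NamedTuple, Optional


class EvalPoint(NamedTuple):
    effort: str         # effort level label
    score: float        # eval pass rate, 0.0 to 1.0
    mean_tokens: float  # mean tokens spent per task


def cheapest_meeting_target(
    points: list[EvalPoint], target: float
) -> Optional[EvalPoint]:
    """Return the lowest-token point whose score meets the target, if any."""
    qualifying = [p for p in points if p.score >= target]
    return min(qualifying, key=lambda p: p.mean_tokens) if qualifying else None


def marginal_gains(points: list[EvalPoint]) -> list[tuple[str, str, float]]:
    """Score gained per 1,000 extra tokens between neighboring effort levels."""
    ordered = sorted(points, key=lambda p: p.mean_tokens)
    gains = []
    for prev, cur in zip(ordered, ordered[1:]):
        extra_k = (cur.mean_tokens - prev.mean_tokens) / 1000
        if extra_k > 0:  # skip levels that cost the same
            gains.append((prev.effort, cur.effort, (cur.score - prev.score) / extra_k))
    return gains


# Placeholder eval results; replace with measurements from your own evals.
curve = [
    EvalPoint("low", 0.50, 5_000),
    EvalPoint("high", 0.62, 10_000),
    EvalPoint("extra-high", 0.70, 20_000),
    EvalPoint("max", 0.72, 40_000),
]

choice = cheapest_meeting_target(curve, target=0.65)
print(choice.effort if choice else "no level meets the target")
for lower, higher, gain in marginal_gains(curve):
    print(f"{lower} -> {higher}: {gain:.4f} score per 1k tokens")
```

The same shape works with latency or dollar cost on the x-axis instead of tokens. The numbers only locate the trade-off; reading transcripts is still what reveals how each level reached its score.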
Surprising behavior at low effort: Claude speedruns Pokémon
A memorable evaluation shows that low effort doesn’t always mean “dumber”; it can mean “optimize for minimal tokens and time.” In Pokémon Red, Claude used speedrun-like tactics (skipping battles, using items strategically) to progress faster.
Rules of thumb for effort levels (max to low) and default guidance for coding
Matt provides practical recommendations for each effort tier, highlighting extra-high as a strong default for most agentic coding. He advises testing max for the hardest tasks but watching for diminishing returns, and reserving low for short, latency-sensitive work.
Choosing smaller models vs lowering effort on a bigger model
The talk closes the loop on trade-offs between model choice and effort. Bigger models at low effort can beat smaller models at high effort for intelligence-demanding tasks with speed constraints, while smaller models are better for cheap bulk tasks and fast time-to-first-token.
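To compare model size and effort on equal footing, one option is to treat each (model, effort) pair as a configuration and keep only those that are not beaten on both quality and latency. The sketch below assumes you have already measured both; the model names, scores, and latencies are invented placeholders chosen only to illustrate the pattern described above.

```python
from typing import NamedTuple


class Config(NamedTuple):
    model: str        # hypothetical model label
    effort: str       # effort level label
    score: float      # eval score, higher is better
    latency_s: float  # mean seconds per task, lower is better


def pareto_front(configs: list[Config]) -> list[Config]:
    """Keep configs that no other config matches or beats on both axes."""
    front = []
    for c in configs:
        dominated = any(
            o.score >= c.score
            and o.latency_s <= c.latency_s
            and (o.score > c.score or o.latency_s < c.latency_s)
            for o in configs
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.latency_s)


# Placeholder measurements, not benchmark data.
configs = [
    Config("small-model", "low", 0.40, 5.0),
    Config("small-model", "high", 0.55, 30.0),
    Config("large-model", "low", 0.60, 20.0),
    Config("large-model", "high", 0.80, 90.0),
]

front = pareto_front(configs)
for c in front:
    print(c.model, c.effort, c.score, c.latency_s)
```

In this made-up data the small model at high effort drops out because the large model at low effort is both faster and more accurate, while the small model at low effort stays on the front as the fastest option.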
Final takeaways: enable thinking, use evals, default to extra-high for coding
Matt summarizes three actionable recommendations: preserve reasoning capability, choose effort/budgets to control spend, and rely on evals to find the right trade-off. For teams that need a quick default for coding, he suggests extra-high effort as a strong baseline.
