The thinking lever

Name: The thinking lever
Uploaded: 2026-05-20T00:00:00Z
Duration: 21 min 17 s
Description: Increasing test-time compute (more tokens/time at inference) reliably improves Claude’s performance across diverse benchmarks and domains, similar to scaling training compute.

Adaptive thinking and effort controls give developers a new decision: how much should Claude reason for a given task? This session covers thinking budgets, effort levels, and the cost, latency, and quality tradeoffs involved.


CHAPTERS

0:18 – 1:18
Test-time compute as a new scaling lever for Claude
Alexander Brichen introduces the “thinking lever”: how Claude uses additional compute at inference time (test-time compute) to solve harder problems more effectively. He frames the talk around practical best practices for controlling token spend and runtime behavior.
- Defines test-time compute and why it matters at inference time
- Goal: better use of tokens to solve hard tasks
- Sets up “levers” users can control to trade off speed vs quality
1:18 – 2:21
Evidence: more tokens at runtime boosts benchmark performance
The talk contrasts two forms of scaling—bigger models (train-time compute) and more runtime thinking (test-time compute)—and shows both increase performance. He notes this pattern holds across domains like coding, QA, computer use, and PhD-level exams.
- Model size (Haiku → Sonnet → Opus) raises performance on internal benchmarks
- Increasing token spend at inference time also improves scores
- The effect generalizes across multiple benchmark categories
2:21 – 2:51
Live demo setup: traffic-light car simulation across effort levels
He introduces a tangible demonstration: generating a realistic one-way street traffic-light simulation at three effort levels (low, high, max). The intent is to make the quality/latency/token trade-off visible.
- Prompt: one-way street with cars and a traffic light
- Run the same task at low, high, and max effort
- Compare output quality vs time and token usage
2:51 – 3:52
Low effort result: functional but simplistic simulation
The low-effort run produces a workable simulation with basic car behavior and light changes. It’s competent but lacks realism and detail, illustrating what you get with fewer tokens and faster runtime.
- Produces a usable one-way road and traffic-light behavior
- Cars stop/go at a regular cadence
- Limited detail and realism; mostly baseline functionality
3:52 – 4:52
High effort result: more realism and better spatial reasoning
At high effort, Claude spends roughly double the time/tokens and generates a richer simulation with varied vehicles and improved traffic-light placement. The speaker highlights emergent improvements like more “intelligent” driver interactions, while noting imperfections remain.
- More vehicle variety (e.g., lorries) and scene complexity
- Improved traffic light placement compared to low effort
- Mentions smarter interaction between cars (reacting to each other)
4:52 – 5:23
Max effort result: highest fidelity and strongest overall behavior
Max effort uses ~10x the tokens/time and yields the most detailed, visually coherent simulation with improved traffic light realism and vehicle motion. The example reinforces the core message: more runtime compute can translate into better outputs—at a cost.
- Largest increase in time and token usage
- Best visual/detail quality and more coherent scene elements
- More convincing vehicle behavior compared to lower effort runs
5:23 – 6:23
From minutes to months: autonomy horizon and the METR framing
He extrapolates from the demo to long-horizon work, suggesting models may progress from minutes of work to days or longer. He references METR-style autonomy measurements and how both training-time and test-time scaling contribute to longer, more capable task execution.
- Runtime compute implies a time/cost trade-off for better outcomes
- Autonomy horizon concept (hours of human work completed at a given accuracy)
- Progress driven by both bigger models and better test-time compute
6:23 – 7:55
What counts as test-time compute: thinking, tools, and text
He breaks test-time compute into three components: internal thinking, tool calling, and the final text output. Each has direct costs in tokens and latency, motivating user controls over how much compute Claude uses.
- Thinking: scratchpad reasoning space
- Tool calling: interface to external systems (web, files, MCP, SaaS)
- Text: the user-visible response; all components incur token/time costs
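
The three-part breakdown above can be sketched as simple token accounting. This is an illustrative sketch only: the `TurnUsage` class, its field names, and the single flat price per million tokens are assumptions made for the example, not an API or pricing model described in the talk.

```python
from dataclasses import dataclass


@dataclass
class TurnUsage:
    """Token counts for one request, split into the three components named above.

    The class and field names are hypothetical, chosen for illustration.
    """
    thinking_tokens: int  # scratchpad reasoning
    tool_tokens: int      # tokens spent issuing tool calls
    text_tokens: int      # the user-visible response

    @property
    def total_tokens(self) -> int:
        # Every component counts toward the total token spend.
        return self.thinking_tokens + self.tool_tokens + self.text_tokens


def estimated_cost(usage: TurnUsage, price_per_million_tokens: float) -> float:
    """Estimate cost, assuming one flat (hypothetical) price for all tokens."""
    return usage.total_tokens * price_per_million_tokens / 1_000_000


# Made-up numbers, purely to show the arithmetic.
usage = TurnUsage(thinking_tokens=4_000, tool_tokens=1_500, text_tokens=500)
print(usage.total_tokens)
print(estimated_cost(usage, price_per_million_tokens=10.0))
```

Tracking the components separately makes it visible which part of a request dominates spend when the effort level changes.
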
7:55 – 8:55
User controls: effort dial vs strict budgets
Two user-facing levers are presented for controlling runtime behavior: an effort setting (low→max) and budgets (hard constraints like max tokens or task budgets). The remainder of the talk focuses on effort as the primary practical control.
- Effort trades off intelligence vs speed/latency
- Budgets enforce strict constraints (token limits, task budgets)
- Sets context for why effort is central in real deployments
8:55 – 9:57
Interleaved thinking: reasoning between tool calls like humans work
He explains the evolution from a single “think then act” phase to interleaved thinking—alternating tool use and reflection after each call. This better matches human workflows where actions and reconsideration happen iteratively.
- Old flow: think once → execute tool calls → answer
- Interleaved thinking: think after every tool call
- Closer to iterative human problem-solving patterns
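
The difference between the two flows can be sketched with stub functions. Everything here is a stand-in: `think` and the tool callables are hypothetical placeholders that return strings, not real model or tool calls.

```python
from typing import Callable, List

Think = Callable[[str], str]
Tool = Callable[[], str]


def run_think_once(think: Think, tools: List[Tool]) -> List[str]:
    """Old flow: a single thinking phase, then all tool calls, then the answer."""
    trace = [think("plan")]
    for tool in tools:
        trace.append(tool())
    trace.append("answer")
    return trace


def run_interleaved(think: Think, tools: List[Tool]) -> List[str]:
    """Interleaved flow: a thinking step follows every tool call."""
    trace = [think("plan")]
    for tool in tools:
        result = tool()
        trace.append(result)
        trace.append(think(f"reflect on {result}"))
    trace.append("answer")
    return trace


# Stub implementations so the sketch runs without any external service.
def stub_think(note: str) -> str:
    return f"think({note})"


stub_tools: List[Tool] = [lambda: "tool_a", lambda: "tool_b"]

print(run_think_once(stub_think, stub_tools))
print(run_interleaved(stub_think, stub_tools))
```

In the interleaved trace, each tool result is followed by a reflection step, so later tool calls can in principle be informed by earlier results.
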
9:57 – 11:58
Adaptive thinking: model-controlled allocation of thinking, tools, and text
Adaptive thinking extends interleaving by letting Claude decide when (or whether) to think, call tools, or ask clarifying questions. He emphasizes this isn’t request classification; it’s granting flexible control, and in testing it proves Pareto-efficient relative to prior serving approaches.
- Claude chooses dynamically among thinking, tool calls, and text
- Can skip thinking entirely for trivial questions
- Benchmarks show Pareto efficiency relative to fixed interleaving
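
For readers unfamiliar with the term, a configuration is Pareto-efficient when no other configuration is at least as cheap and at least as accurate while being strictly better on one of the two. The sketch below computes that frontier over (cost, score) pairs. The configuration names and numbers are invented purely to exercise the function and are not results from the talk.

```python
from typing import List, Tuple

# (name, token cost, benchmark score)
Point = Tuple[str, int, float]


def pareto_frontier(points: List[Point]) -> List[str]:
    """Return names of points that no other point dominates.

    A point is dominated if another point costs no more AND scores no less,
    and is strictly better on at least one of those two axes.
    """
    frontier = []
    for name, cost, score in points:
        dominated = any(
            (c <= cost and s >= score) and (c < cost or s > score)
            for _, c, s in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical data, labeled A-D to avoid implying any real measurement.
points: List[Point] = [
    ("A", 1000, 0.60),
    ("B", 4000, 0.70),
    ("C", 900, 0.62),
    ("D", 3000, 0.74),
]
print(pareto_frontier(points))
```

With these made-up numbers, C dominates A and D dominates B, so only C and D remain on the frontier.
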
11:58 – 13:29
Why a thinking toggle is the wrong mental model
He argues that turning “thinking” on/off is a poor proxy for effort because it disables a core capability rather than expressing desired work level. The better approach is to allow thinking and control resource usage via effort/budgets, analogous to giving tools without forcing their use.
- Thinking toggle removes capability instead of controlling intensity
- Analogy: you don’t force or forbid web search; you provide it as needed
- Preferred: set constraints and let Claude allocate compute appropriately
13:29 – 17:33
Best practices for choosing effort: evals, diminishing returns, and defaults
He recommends using task-specific evals to select an effort level and notes diminishing returns at the high end. Practical guidance: use low for latency-sensitive/simple tasks, high when intelligence is needed, extra high as the default sweet spot, and max only for the hardest problems.
- Use representative eval sets to pick the right effort level
- Expect diminishing marginal gains at max effort
- Rules of thumb: low for extraction/summarization; extra high as default; max for toughest tasks
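
One way to turn "use evals, and expect diminishing returns" into a selection rule is to pick the cheapest effort level whose eval score is within a small tolerance of the best score. This is a sketch under assumptions: the level identifiers, the tolerance, and the scores are hypothetical, and the talk does not prescribe this particular rule.

```python
from typing import Dict

# Effort levels ordered from cheapest to most expensive.
# The identifier strings are assumptions based on the levels named in the talk.
EFFORT_ORDER = ["low", "high", "extra_high", "max"]


def pick_effort(scores: Dict[str, float], tolerance: float = 0.01) -> str:
    """Return the cheapest level scoring within `tolerance` of the best level."""
    if not scores:
        raise ValueError("scores must not be empty")
    best = max(scores.values())
    for level in EFFORT_ORDER:
        if level in scores and best - scores[level] <= tolerance:
            return level
    raise ValueError("no known effort level found in scores")


# Made-up eval scores showing diminishing returns at the top end.
eval_scores = {"low": 0.71, "high": 0.80, "extra_high": 0.84, "max": 0.845}
print(pick_effort(eval_scores))
```

A tighter tolerance pushes the choice toward higher effort; a looser one favors cheaper, faster levels. The scores should come from an eval set that is representative of the actual task.
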
17:33 – 21:17
Model size vs effort: when to choose Opus vs Haiku (and final takeaways)
He compares a smaller model at higher effort to a larger model at lower effort, concluding that if a task needs meaningful intelligence, a larger model often wins even at low effort. He closes with actionable takeaways (enable thinking, use evals, default to extra high) and a vision where users set budgets and Claude self-allocates compute over long horizons.
- Comparison: Haiku output is weaker despite similar token spend
- Guideline: prefer larger models for intelligence-bound tasks
- Takeaways: enable thinking; use evals; default extra high; aim for budget-based long-horizon allocation