CHAPTERS
Why “test-time compute” matters: the thinking lever
Alexander Brichen introduces the core idea: Claude can spend more compute at inference time (more tokens, more time) to solve harder problems better. The talk frames this as a practical “lever” users can influence to trade latency and cost for higher-quality outcomes.
Reasoning models and scaling: train-time vs test-time compute
The speaker connects recent progress in “reasoning models” to the broader scaling story. Performance rises both when models get larger (train-time compute) and when they are allowed to think longer at inference (test-time compute).
Live demo setup: traffic-light car simulation prompt
To make the concept tangible, the talk uses a single coding/simulation prompt and runs it at three effort settings. This establishes a controlled comparison where only the effort level changes.
Low effort output: fast, workable, but simplistic
At low effort, Claude produces a functional simulation with basic dynamics. However, it is relatively simple and has design limitations that more deliberation could address.
High effort output: more realism and better scene reasoning
With higher effort, Claude spends roughly double the time and tokens and produces a more complex simulation. The result is more realistic and shows better “common sense” adjustments, though it is still imperfect.
Max effort output: 10× compute for highest fidelity
At max effort, Claude uses about an order of magnitude more time/tokens and delivers the most detailed and visually coherent simulation. The improvements illustrate how additional test-time compute can raise solution quality on complex tasks.
Long-horizon capability: from minutes to days of “work”
The talk broadens from the demo to a future-facing view: models may extend from seconds or minutes of work to days, weeks, or months. A benchmark narrative (“METR”/autonomy) is used to suggest a growing ability to complete longer tasks with acceptable accuracy.
The three components of test-time compute: thinking, tools, text
Test-time compute is decomposed into three token-consuming activities. This helps users reason about where compute goes and why different workloads may require different configurations.
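As a rough illustration of that decomposition, the sketch below tallies tokens across the three activities and reports each one's share of the total. The `TokenUsage` class and its field names are hypothetical, chosen for this example only; they are not taken from the talk or from any real API.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    """Hypothetical per-request token tally, split by activity."""
    thinking: int    # tokens spent on internal reasoning
    tool_calls: int  # tokens spent issuing tool calls and reading results
    text: int        # tokens in the user-visible response

    @property
    def total(self) -> int:
        return self.thinking + self.tool_calls + self.text

    def shares(self) -> dict:
        """Return each activity's fraction of the total (all zeros if empty)."""
        total = self.total
        if total == 0:
            return {"thinking": 0.0, "tool_calls": 0.0, "text": 0.0}
        return {
            "thinking": self.thinking / total,
            "tool_calls": self.tool_calls / total,
            "text": self.text / total,
        }
```

Looking at shares rather than raw totals makes it easier to see whether a workload is dominated by reasoning, by tool use, or by output text.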
User controls: effort dial vs budgets
The speaker explains the two primary mechanisms users have to shape runtime compute. Effort is a coarse “low→max” dial, while budgets impose stricter constraints such as max tokens or task budgets.
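The difference between the two controls can be sketched in a few lines. Everything here is an assumption made for illustration: `RunConfig`, `EFFORT_LEVELS`, and `budget_exhausted` are hypothetical names, and the three levels listed are simply the ones shown in the demo, not a real API's parameter set.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed for illustration: the three settings used in the talk's demo.
EFFORT_LEVELS = ("low", "high", "max")


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical run settings: a coarse effort dial plus an optional hard cap."""
    effort: str = "high"
    max_tokens: Optional[int] = None  # None means no hard token budget

    def __post_init__(self):
        if self.effort not in EFFORT_LEVELS:
            raise ValueError(f"unknown effort level: {self.effort!r}")
        if self.max_tokens is not None and self.max_tokens <= 0:
            raise ValueError("max_tokens must be positive")


def budget_exhausted(config: RunConfig, tokens_used: int) -> bool:
    """True once a hard budget is set and has been reached."""
    return config.max_tokens is not None and tokens_used >= config.max_tokens
```

In this sketch the effort value is only a hint about how hard to work, while `max_tokens` is a strict limit that stops the run regardless of effort.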
From sequential to interleaved to adaptive thinking
The serving approach evolved from a single “think then act” block to a more human-like loop of acting and reflecting. Adaptive thinking generalizes this by letting the model decide when to think, call tools, or respond with text.
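The three serving patterns can be compared as step schedules. This is a toy sketch, not a description of how Claude is actually served: the step labels and the `choose_next` policy (standing in for the model's own decision) are hypothetical.

```python
from typing import Callable, List


def sequential(n_tool_calls: int) -> List[str]:
    """One thinking block up front, then every action, then the answer."""
    return ["think"] + ["tool"] * n_tool_calls + ["text"]


def interleaved(n_tool_calls: int) -> List[str]:
    """Reflect before each tool call and once more before the final answer."""
    steps: List[str] = []
    for _ in range(n_tool_calls):
        steps += ["think", "tool"]
    return steps + ["think", "text"]


def adaptive(choose_next: Callable[[List[str]], str], max_steps: int = 20) -> List[str]:
    """A policy picks each step from the history so far; the run ends on "text"."""
    steps: List[str] = []
    while len(steps) < max_steps:
        step = choose_next(steps)
        steps.append(step)
        if step == "text":
            break
    return steps
```

For example, a policy that always returns "text" answers immediately with no thinking, which is the kind of shortcut an adaptive scheme allows on trivial requests.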
Why a “thinking toggle” is the wrong mental model
Turning thinking on/off is framed as disabling a core capability rather than expressing how hard the model should work. The recommended framing is to always provide the capability and control intensity via effort/budgets.
Effort best practices: evaluate performance and expect diminishing returns
Choosing an effort level should be driven by measurement on representative tasks. The talk emphasizes diminishing marginal returns at the high end and recommends using difficult evals to identify the sweet spot.
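One way to turn that advice into a procedure is to run the same eval at each effort level and stop raising effort once the accuracy gain falls below a chosen threshold. The sketch below assumes you already have such results; `EvalResult`, `pick_effort`, and the `min_gain` threshold are hypothetical, and the numbers in the usage example are placeholders, not measurements from the talk.

```python
from typing import List, NamedTuple


class EvalResult(NamedTuple):
    effort: str        # effort level the eval was run at
    accuracy: float    # fraction of eval tasks solved, in [0, 1]
    avg_tokens: float  # average tokens used per task


def pick_effort(results: List[EvalResult], min_gain: float = 0.02) -> str:
    """Pick the highest effort level that still buys at least `min_gain` accuracy.

    `results` must be ordered from lowest to highest effort.
    """
    if not results:
        raise ValueError("need at least one eval result")
    chosen = results[0]
    for prev, cur in zip(results, results[1:]):
        if cur.accuracy - prev.accuracy < min_gain:
            break  # diminishing returns: further effort is not paying off
        chosen = cur
    return chosen.effort


# Placeholder numbers for illustration only.
example = [
    EvalResult("low", 0.60, 2_000),
    EvalResult("high", 0.80, 4_000),
    EvalResult("max", 0.81, 20_000),
]
print(pick_effort(example))
```

With these placeholder inputs the function settles on the middle level, because the last step costs far more tokens for a gain below the threshold. A harder eval set tends to spread the levels apart and make the sweet spot easier to see.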
Rules of thumb by effort level + “Claude Plays Pokémon” insight
Practical guidance is given for when to use each effort setting, including a surprising example where low effort produces a clever, shortcut-seeking strategy. This highlights that constraints can change the model’s approach, not just quality.
Model size vs effort: when to use Haiku vs Opus
The talk compares a Haiku-generated simulation with the Opus result to illustrate that effort can’t fully substitute for base model capability. If the task needs real intelligence, a larger model at lower effort may outperform a smaller model at higher effort.
Closing takeaways: enable thinking, use evals, default to extra high, aim for constraint-based autonomy
The speaker summarizes actionable recommendations and the longer-term vision. The end state is to set goals and budgets while Claude automatically allocates compute appropriately for the task’s importance and horizon.
