The thinking lever

Adaptive thinking and effort controls give developers a new decision: how much should Claude reason for a given task? This session covers thinking budgets, effort levels, and the cost, latency, and quality tradeoffs involved.

May 20, 2026 · 21 min

CHAPTERS

  1. Why “test-time compute” matters: the thinking lever

    Alexander Brichen introduces the core idea: Claude can spend more compute at inference time (more tokens, more time) to solve harder problems better. The talk frames this as a practical “lever” users can influence to trade latency and cost for higher-quality outcomes.

  2. Reasoning models and scaling: train-time vs test-time compute

    The speaker connects recent progress in “reasoning models” to the broader scaling story. Performance rises both when models get larger (train-time compute) and when they are allowed to think longer at inference (test-time compute).

  3. Live demo setup: traffic-light car simulation prompt

    To make the concept tangible, the talk uses a single coding/simulation prompt and runs it at three effort settings. This establishes a controlled comparison where only the effort level changes.

  4. Low effort output: fast, workable, but simplistic

    At low effort, Claude produces a functional simulation with basic dynamics. The result is relatively simple, though, and has design limitations that more deliberation would likely have avoided.

  5. High effort output: more realism and better scene reasoning

    With higher effort, Claude spends roughly double the time/tokens and produces a more complex simulation. The result improves in realism and shows better “common sense” adjustments, though still imperfect.

  6. Max effort output: 10× compute for highest fidelity

    At max effort, Claude uses about an order of magnitude more time/tokens and delivers the most detailed and visually coherent simulation. The improvements illustrate how additional test-time compute can raise solution quality on complex tasks.

  7. Long-horizon capability: from minutes to days of “work”

    The talk broadens from the demo to a future-facing view: models may extend from seconds or minutes of work to days, weeks, or months. An autonomy benchmark narrative (METR) is used to suggest that models can complete increasingly long tasks with acceptable accuracy.

  8. The three components of test-time compute: thinking, tools, text

    Test-time compute is decomposed into three token-consuming activities: thinking, tool calls, and text output. This helps users reason about where compute goes and why different workloads may require different configurations.

  9. User controls: effort dial vs budgets

    The speaker explains the two primary mechanisms users have to shape runtime compute. Effort is a coarse “low→max” dial, while budgets impose stricter constraints such as max tokens or task budgets.

  10. From sequential to interleaved to adaptive thinking

    The serving approach evolved from a single “think then act” block to a more human-like loop of acting and reflecting. Adaptive thinking generalizes this by letting the model decide when to think, call tools, or respond with text.

  11. Why a “thinking toggle” is the wrong mental model

    Turning thinking on/off is framed as disabling a core capability rather than expressing how hard the model should work. The recommended framing is to always provide the capability and control intensity via effort/budgets.

  12. Effort best practices: evaluate performance and expect diminishing returns

    Choosing an effort level should be driven by measurement on representative tasks. The talk emphasizes diminishing marginal returns at the high end and recommends using difficult evals to identify the sweet spot.

  13. Rules of thumb by effort level + “Claude Plays Pokémon” insight

    Practical guidance is given for when to use each effort setting, including a surprising example where low effort produces a clever, shortcut-seeking strategy. This highlights that constraints can change the model’s approach, not just quality.

  14. Model size vs effort: when to use Haiku vs Opus

    The talk compares a Haiku-generated simulation with the Opus result to illustrate that effort can’t fully substitute for base model capability. If the task demands capability the smaller model lacks, a larger model at lower effort may outperform a smaller model at higher effort.

  15. Closing takeaways: enable thinking, use evals, default to extra high, aim for constraint-based autonomy

    The speaker summarizes actionable recommendations and the longer-term vision. The end state is to set goals and budgets while Claude automatically allocates compute appropriately for the task’s importance and horizon.
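
The advice in chapter 12 (measure on representative tasks, expect diminishing returns, find the sweet spot) can be sketched as a small selection routine. This is a minimal illustration, not something shown in the talk: the `EvalResult` record, the `pick_effort` function, and the gain-per-1k-tokens threshold are all hypothetical, and the scores in the usage example are placeholder numbers rather than measured results. Only the rough token ratios (about 2× for high, about 10× for max) echo the demo described above.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """One row of a (hypothetical) eval sweep over effort levels."""
    effort: str     # label such as "low", "high", "max"
    score: float    # mean eval score in [0, 1]
    tokens: float   # mean tokens spent per task


def pick_effort(results, min_gain_per_1k_tokens):
    """Return the cheapest result that no costlier level beats 'enough'.

    Levels are visited in order of token cost. A costlier level replaces
    the current choice only if its score gain per extra 1,000 tokens
    meets the threshold, which encodes the diminishing-returns cutoff.
    """
    ordered = sorted(results, key=lambda r: r.tokens)
    chosen = ordered[0]
    for candidate in ordered[1:]:
        extra_tokens = candidate.tokens - chosen.tokens
        if extra_tokens <= 0:
            continue  # same cost: no basis for a marginal comparison
        gain = candidate.score - chosen.score
        if gain / (extra_tokens / 1000) >= min_gain_per_1k_tokens:
            chosen = candidate
    return chosen


# Placeholder numbers for illustration only (not measured results).
sweep = [
    EvalResult("low", score=0.60, tokens=2_000),
    EvalResult("high", score=0.75, tokens=4_000),
    EvalResult("max", score=0.78, tokens=20_000),
]

best = pick_effort(sweep, min_gain_per_1k_tokens=0.01)
print(best.effort)
```

With these placeholder inputs the routine settles on the middle level, because the jump to the most expensive level buys too little score per extra token. Raising or lowering the threshold shifts the choice, which is the judgment call the talk suggests making with difficult evals.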
