CHAPTERS
- 0:01 – 1:01
Why test-time compute matters for solving hard engineering tasks
Matt Bleifer frames the talk around how Claude can spend more compute at inference time (“test-time compute”) to tackle difficult software engineering problems. He previews two practical levers users can control, effort settings and budgets, to balance quality, cost, and latency.
- •Test-time compute as a key recent LLM development
- •Goal: better outcomes on hard software engineering challenges
- •Users can influence token spend with effort and budgets
- •Talk will include best practices and practical guidance
- 1:01 – 2:32
Scaling intelligence: training compute vs. inference compute
The talk explains that intelligence can scale not only by training larger models, but also by allowing a model to spend more time reasoning at test time. Performance improves both when moving to stronger models and when granting the same model more inference-time compute.
- •Bigger models tend to score higher on agentic coding evals
- •Same model improves when allowed to spend more time per task
- •Test-time compute benefits many knowledge-work domains
- •Higher compute can translate into higher task performance
- 2:32 – 3:38
Traffic simulation demo: low effort baseline output
A concrete example shows Opus 4.7 building a traffic-light car simulation at low effort. The output is functional but simplistic, illustrating what “saving tokens” looks like in practice.
- •Low effort produced a result in ~50 seconds with ~4,600 tokens
- •Basic car and traffic-light behavior works
- •Limited graphics and realism
- •Design oddity: traffic light placed in the road
- 3:38 – 4:09
Traffic simulation demo: high effort improves realism
Turning effort up to high increases time and tokens but yields a noticeably better simulation. The model adds variety, fixes earlier layout issues, and introduces more realistic driving dynamics.
- •High effort roughly doubles time and output tokens
- •More vehicle variety and better presentation
- •Traffic light moved to the roadside
- •Introduces an “intelligent driver model” concept for dynamics
- 4:09 – 5:40
Traffic simulation demo: max effort and the case for long-horizon work
At max effort, Opus 4.7 spends around 10× the time and tokens compared to low effort, producing the best visual and behavioral result. This motivates the broader idea that future systems may work for days or longer on major problems.
- •Max effort: ~10× time and tokens vs. low effort
- •Best graphics and most realistic driving patterns
- •Same prompt, better results by allocating more compute
- •Vision: models working days/weeks/months for big challenges
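The rough multipliers quoted across the three demo runs can be turned into a back-of-the-envelope estimate. The sketch below is only an illustration: the baseline (about 50 seconds, about 4,600 tokens at low effort) and the multipliers (roughly 2× for high, roughly 10× for max) are the approximate figures from the chapters above, while the names `BASELINE`, `MULTIPLIERS`, and `estimate` are made up for this example.

```python
# Approximate figures from the demo; every value here is a rough estimate.
BASELINE = {"seconds": 50, "tokens": 4600}       # low-effort run
MULTIPLIERS = {"low": 1, "high": 2, "max": 10}   # relative to low effort


def estimate(effort: str) -> dict:
    """Scale the low-effort baseline by the quoted multiplier."""
    m = MULTIPLIERS[effort]
    return {key: value * m for key, value in BASELINE.items()}


for level in MULTIPLIERS:
    est = estimate(level)
    print(f"{level:>4}: ~{est['seconds']} s, ~{est['tokens']} tokens")
```

Under these assumptions, a max-effort run lands in the range of several minutes and tens of thousands of tokens for the same prompt.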
- 5:40 – 7:41
Three token buckets: thinking, tool use, and user-facing text
Matt breaks inference-time token spend into three categories. He explains what each token type represents and why they’re all important to task completion and user experience.
- •Thinking tokens: internal reasoning/scratchpad for step-by-step work
- •Tool-calling tokens: interacting with the outside world (search, files, etc.)
- •Text tokens: communicating progress, summaries, and answers to users
- •Test-time compute includes all three token categories
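One way to picture the three buckets is as a simple tally over a task's events. The following is a minimal sketch with an invented event log; the category labels and token counts are placeholders for illustration, not the output of any real API.

```python
from collections import Counter

# Hypothetical event log for one task: (category, token_count) pairs.
events = [
    ("thinking", 1200),
    ("tool", 300),
    ("thinking", 400),
    ("tool", 250),
    ("text", 350),
]


def tally(events):
    """Sum tokens per category and return (per-bucket totals, grand total)."""
    totals = Counter()
    for category, tokens in events:
        totals[category] += tokens
    return dict(totals), sum(totals.values())


per_bucket, total = tally(events)
print(per_bucket, total)
```

The grand total, not any single bucket, is what counts as test-time compute for the task.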
- 7:41 – 9:44
Why users need control: costs, latency, and constraints
More tokens mean both higher monetary cost and longer wait times, so user controls are essential. The talk introduces the two main mechanisms: effort settings and explicit budgets.
- •Token spend directly affects cost and waiting time
- •Effort expresses a time–cost–quality preference
- •Task budgets cap spend before Claude must check in
- •Budgets may be defined in tokens, time, or cost
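The budget idea can be sketched as a small check that runs as work progresses. This is an assumption-laden illustration, not how Claude implements budgets: the `TaskBudget` class, its field names, and the example limits are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskBudget:
    """Hypothetical budget; any limit left as None is not enforced."""
    max_tokens: Optional[int] = None
    max_seconds: Optional[float] = None
    max_cost: Optional[float] = None

    def exceeded(self, tokens: int, seconds: float, cost: float) -> list:
        """Return the names of the limits that current usage has reached."""
        checks = [
            ("tokens", self.max_tokens, tokens),
            ("seconds", self.max_seconds, seconds),
            ("cost", self.max_cost, cost),
        ]
        return [name for name, limit, used in checks
                if limit is not None and used >= limit]


budget = TaskBudget(max_tokens=50_000, max_seconds=600)
hit = budget.exceeded(tokens=52_000, seconds=310.0, cost=1.25)
if hit:
    print("Pause and check in with the user; limits reached:", hit)
```

Treating each dimension as optional mirrors the point that a budget may be stated in tokens, time, or cost.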
- 9:44 – 10:44
From sequential reasoning to interleaved thinking
Early reasoning-model behavior followed a rigid pattern: think first, then use tools, then respond. Interleaved thinking improved this by letting Claude alternate between tool calls and reasoning as it gathers information.
- •Original pattern: thinking → tool calls → final text
- •Interleaving enables: tool → reflect → next tool → … → answer
- •Improves adaptiveness during multi-step tasks
- •Sets the stage for more flexible compute allocation
- 10:44 – 13:20
Adaptive thinking: flexible allocation across thinking, tools, and text
Adaptive thinking removes rigid ordering constraints and lets Claude decide when (or whether) to think, call tools, or speak to the user. Matt clarifies that it’s not routing between models, but a more flexible “permission” structure for reasoning across the workflow.
- •Claude can think, call tools, or write text in any order as appropriate
- •May start by acknowledging the request, then call a tool, then think, and so on
- •Can skip thinking entirely for trivial queries
- •Not a router or auto-toggle; it’s freedom to think at any step
- 13:20 – 14:51
Effort vs. thinking toggles: why effort is the better knob
Matt argues that thinking toggles are a poor substitute for “try harder,” because they disable/enable a core capability rather than set a quality–cost trade-off. Effort is presented as the correct abstraction because it jointly influences thinking, tool use, and output behavior.
- •Thinking toggles constrain *how* Claude works, not *how hard* it works
- •Effort better captures the user’s desired trade-off
- •Analogy: we don’t force “always search” or “never search”—Claude decides
- •Analogy: you ask teammates to try harder, not to toggle inner monologue
- 14:51 – 16:23
How to choose effort: build curves, watch for diminishing returns, read transcripts
The recommended approach is empirical: run evals and plot performance vs. cost/time/tokens to find the right point on the curve. Matt highlights diminishing returns at higher effort and stresses reviewing transcripts to see what shortcuts or behaviors emerge at different settings.
- •Create effort curves: performance vs. tokens/time/cost
- •Higher effort often improves results but can have diminishing returns
- •Low effort may take surprising shortcuts—verify via transcripts
- •Evals plus qualitative review gives the clearest picture
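An effort curve can be examined numerically by looking at the gain per extra token between adjacent settings. The sketch below uses invented eval numbers for a subset of effort levels, and the threshold in `pick_effort` is an arbitrary assumption; it shows one possible way to spot diminishing returns, not a method prescribed in the talk.

```python
# Hypothetical eval results: (effort, mean tokens per task, pass rate).
# These numbers are invented for illustration, not measurements.
results = [
    ("low", 5_000, 0.52),
    ("medium", 9_000, 0.61),
    ("high", 15_000, 0.68),
    ("max", 50_000, 0.71),
]


def marginal_gains(results):
    """Pass-rate gain per 1,000 extra tokens between adjacent effort levels."""
    gains = []
    for (_, t0, s0), (name, t1, s1) in zip(results, results[1:]):
        gains.append((name, (s1 - s0) / ((t1 - t0) / 1000)))
    return gains


def pick_effort(results, min_gain_per_1k=0.005):
    """Keep stepping up while each step still clears the gain threshold."""
    choice = results[0][0]
    for name, gain in marginal_gains(results):
        if gain < min_gain_per_1k:
            break
        choice = name
    return choice


print(marginal_gains(results))
print(pick_effort(results))
```

With this made-up data the last step buys very little per token, so the rule stops one level below max. A numeric cutoff like this complements transcript review rather than replacing it, since it says nothing about which shortcuts a setting takes.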
- 16:23 – 17:54
Surprising low-effort behavior: Claude ‘speedruns’ Pokémon
A Pokémon Red evaluation illustrates that low effort isn’t simply “less smart”—it can induce strategies optimized for speed and minimal token spend. Claude skips battles, uses items efficiently, and reduces random encounters to progress faster.
- •Low effort led to speedrun-like play behavior
- •Skipped trainer battles to save time
- •Used healing items instead of returning to centers
- •Spammed ‘Repel’ items to reduce disruptive encounters and move faster
- 17:54 – 19:25
Rules of thumb: what each effort level is best for
Matt provides practical guidance for selecting effort settings when evals are unavailable. He calls out where max is useful, why extra high is a strong default for coding, and when medium/low are appropriate for cost or latency sensitivity.
- •Max: best for hardest tasks; test it but expect diminishing returns
- •Extra high: introduced with Opus 4.7; best default for coding/agentic work
- •High: strong balance for intelligence-sensitive use cases
- •Medium/Low: cost- or latency-sensitive and short-scope tasks
- 19:25 – 21:57
Model size vs. effort: choosing between small models and low-effort large models
The talk compares when to downshift effort on a powerful model versus switching to a smaller model. The guidance focuses on whether you’re optimizing for intelligence, total latency, cost, or time-to-first-token.
- •Low-effort large model: good for intelligence-demanding tasks with speed constraints
- •Small models: best for simpler bulk tasks (classification, extraction, summarization)
- •Small models also improve time-to-first-token responsiveness
- •Heuristic: small model for fast first token; big model (low effort) for fast last token
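The heuristic above can be written down as a tiny decision function. This is one possible reading of the rule of thumb, not an official selection policy; the function name `choose_setup`, its parameters, and the returned labels are hypothetical.

```python
def choose_setup(simple_bulk_task: bool, priority: str) -> str:
    """Pick a setup from the rule of thumb.

    priority is 'first_token' (responsiveness) or 'last_token'
    (total time to a finished answer).
    """
    if priority not in ("first_token", "last_token"):
        raise ValueError("priority must be 'first_token' or 'last_token'")
    # Simple bulk work (classification, extraction, summarization) and
    # responsiveness-sensitive use cases both point to a small model.
    if simple_bulk_task or priority == "first_token":
        return "small model"
    # Intelligence-demanding work with a speed constraint stays on the
    # large model, with effort turned down.
    return "large model, low effort"


print(choose_setup(simple_bulk_task=False, priority="last_token"))
print(choose_setup(simple_bulk_task=True, priority="last_token"))
```

A real decision would also weigh cost and measured eval results, which this sketch leaves out.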
- 21:57 – 24:01
Closing takeaways: enable thinking, use evals, default to extra high for coding
Matt ends with three actionable recommendations and a broader north star: Claude should allocate compute well under user-defined quality and budget constraints. Adaptive thinking, effort, and budgets are positioned as early steps toward that vision.
- •Enable thinking to give Claude space to reason (then control with effort/budgets)
- •Use evals and transcript review to find the right settings
- •If you must pick quickly for coding: choose extra high
- •Future direction: set a quality bar + budget, and Claude optimizes the rest
