Why Traditional Benchmarks Fail Modern AI Models with OpenAI Research Scientist Noam Brown

When a new AI model drops, it’s judged based on a static benchmark grid that doesn’t account for how long the model is allowed to think. How then should we measure a model’s true capability? OpenAI research scientist Noam Brown returns to talk with Sarah Guo about his latest essay on why the AI industry’s traditional benchmark grids are broken, and how large-scale test-time compute is fundamentally changing how models are evaluated. Noam explains how, if properly scaffolded, today’s models can reason for weeks or even months on complex tasks. He also discusses real-world implications of test-time compute, from building poker solver bots to disproving legendary math conjectures. Together, they also unpack the large gaps in current AI safety frameworks, explore the bottlenecks for recursive self-improvement, and look ahead at the future of multi-agent collaboration and global knowledge sharing.

Read more: Implications of Large-Scale Test-Time Compute

Sign up for new podcasts every week. Email feedback to show@no-priors.com

Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @polynoamial | @OpenAI

Chapters:
00:00 – Cold Open
00:43 – Noam Brown Introduction
01:23 – Why Benchmarks Are Broken
04:19 – Compute Budgets and Projections
05:34 – How Long Should Models Think?
06:47 – Benchmark-Maxxing
08:34 – Using Poker Bots as Evals
11:26 – Safety Evals When Model Capability Scales With Budget
14:41 – Release Cycle vs. Agent Runtime
17:06 – Latent Model Capability
20:59 – Limits on Recursive Self-Improvement
27:09 – Large-Scale Multi-Agent Coordination
29:11 – Competition at the Frontier
31:51 – Breaking the Benchmark Grid Equilibrium
33:29 – Why Benchmarks Should be Evaluated by Cost
36:18 – Conclusion

Noam Brown (guest), Sarah Guo (host)

Jun 26, 2026 · 36m

CHAPTERS

0:00 – 1:24
Cold open: safety policies miss test-time compute as a capability multiplier
Noam frames the core problem: modern model capability is no longer a single fixed property but increases with the test-time compute (money/time/tokens) you’re willing to spend. He argues most preparedness/responsible scaling policies don’t specify what inference budget they evaluate at, creating a blind spot.
- Model capability scales with inference budget (e.g., $10 vs. $10M)
- Older models (e.g., GPT-3) didn’t meaningfully benefit from huge test-time budgets
- Current safety frameworks often treat capability as static
- Key unresolved question: at what budget should models be evaluated?
- Policy and eval practices lag behind inference-time scaling reality
1:24 – 2:38
Why the 5.5 vs 5.4 benchmark grid misled people
Noam explains why initial reactions to the 5.5 release were skeptical: the standard benchmark grid showed only modest gains. The grid format hides that 5.5 is more compute-efficient at reasoning, so comparisons can be unfair if thinking time isn’t controlled.
- Benchmark grids reduce model performance to single scores per benchmark
- On paper, 5.5 looked like only a small improvement over 5.4
- Real usage revealed a larger jump than the grid suggested
- The missing variable is test-time compute / thinking time
- Efficiency differences can dominate perceived capability
2:38 – 4:00
Plateaus are too far out: models can improve for weeks under scaffolding
The usual suggestion—run models until performance plateaus—breaks down because the plateau can be extremely far away. With modern scaffolding and long-horizon workflows, models can keep improving for very long runs, making “run-to-asymptote” impractical for evaluations.
- Controlling for thinking time reveals larger performance gaps
- Old assumption: models plateau quickly (true-ish in GPT-3 era)
- New reality: productive improvement can continue for weeks
- Scaffolding can extend effective reasoning horizons dramatically
- Eval methodology must acknowledge practical limits on runtime
4:00 – 4:20
Budgeted evals and performance curves: plotting capability vs cost/tokens/time
Noam proposes a new evaluation norm: fix a budget (cost/tokens/time) or report a curve showing performance as a function of test-time compute. This makes model comparisons more honest and more aligned with real deployment constraints.
- Use an explicit inference budget for each benchmark run
- Prefer performance-vs-compute curves over single grid numbers
- Budgets can be measured in tokens, wall-clock time, or dollars
- This approach captures both raw capability and efficiency
- Enables apples-to-apples comparisons across models and settings
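
A minimal sketch of what budget-matched reporting could look like, in Python. Everything here is illustrative: the `EvalPoint` type, the `score_at_budget` helper, the model names, and all numbers are made up for the example and do not come from the episode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalPoint:
    budget_tokens: int  # test-time compute spent per task (hypothetical unit)
    score: float        # benchmark accuracy in [0, 1]


def score_at_budget(curve, budget):
    """Best score reachable without exceeding `budget`, or None if nothing fits."""
    affordable = [p.score for p in curve if p.budget_tokens <= budget]
    return max(affordable) if affordable else None


def compare(curves, budgets):
    """Build a budget -> {model: score} table so models are compared at equal spend."""
    return {
        b: {name: score_at_budget(curve, b) for name, curve in curves.items()}
        for b in budgets
    }


# Made-up curves: model_b is more compute-efficient than model_a.
model_a = [EvalPoint(1_000, 0.40), EvalPoint(10_000, 0.55), EvalPoint(100_000, 0.62)]
model_b = [EvalPoint(1_000, 0.48), EvalPoint(10_000, 0.63), EvalPoint(100_000, 0.71)]

table = compare({"model_a": model_a, "model_b": model_b}, [1_000, 10_000, 100_000])
for budget, row in table.items():
    print(budget, row)
```

In this toy data, a grid that happened to report model_a at 100,000 tokens (0.62) next to model_b at 10,000 tokens (0.63) would show a near tie, while the budget-matched rows show model_b ahead at every spend level.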
4:20 – 5:34
Compute projections: extrapolating from small budgets to large-budget performance
They discuss the challenge of evals that are too slow/expensive to run to completion, especially in domains like cyber. Noam suggests forecasting large-budget performance by measuring improvement slopes at smaller budgets, and calls out this projection problem as an open research opportunity.
- Some eval domains show continued gains past 100M tokens
- Long-run evaluations can be prohibitively time-consuming
- Performance often improves smoothly enough to estimate trends
- Potential research: predict $10k performance from $10–$100 runs
- Release cadence pressures labs to use faster proxy measurements
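
One simple way to picture the projection idea is a straight-line fit of score against the logarithm of budget, extrapolated to a larger budget. The log-linear form is an assumption made for this sketch, not a method stated in the episode, and the function names and data points are hypothetical. Real curves may bend or saturate, which is part of why this is described as an open research problem.

```python
import math


def fit_log_linear(budgets, scores):
    """Least-squares fit of score = intercept + slope * log10(budget)."""
    xs = [math.log10(b) for b in budgets]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def project(intercept, slope, budget):
    """Extrapolate the fitted line to `budget`, clamped to the valid range [0, 1]."""
    return min(1.0, max(0.0, intercept + slope * math.log10(budget)))


# Made-up measurements from cheap runs (dollars per task, benchmark score).
budgets_usd = [10, 30, 100]
scores = [0.20, 0.25, 0.30]

intercept, slope = fit_log_linear(budgets_usd, scores)
print(f"projected score at $10,000: {project(intercept, slope, 10_000):.2f}")
```

The projection is only as good as the assumed curve shape, so a sketch like this would need to be checked against at least a few expensive runs before anyone relied on it.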
5:34 – 6:47
How long should models think in practice? Fast iteration vs long deliberation
Sarah asks whether users underuse test-time compute. Noam argues optimal thinking time depends on the workflow: long runs can boost benchmark scores, but real work often requires rapid iteration and flexible latency.
- Week-long “think then answer” is often impractical
- Users tend to prefer quick back-and-forth iteration
- Thinking time should be adaptable to task urgency and value
- Balance responsiveness with deeper reasoning when warranted
- Practical constraints shape how inference-time scaling is used
6:47 – 8:10
Benchmark-maxxing: scaffolds, self-consistency, and misleading paper gains
Noam warns that many headline benchmark improvements can come from spending more test-time compute—e.g., running multiple samples, picking the best, or using a judge model. Without compute controls, benchmark-maxxing can look like true model progress when it’s largely extra budget.
- Simple scaffolds (N samples + select/judge) can inflate scores
- Benchmark gains may reflect more inference compute, not better models
- Concern is less “cheating” than misleading comparisons
- Benchmark optimization risk increases once benchmarks become targets
- Held-out/private sets can reduce overfitting to public benchmarks
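
The effect of a sampling scaffold can be illustrated with a toy simulation. The sketch below uses majority voting over N samples (the self-consistency variant) with a stand-in for the model: `sample_answer` is a hypothetical stub that returns the right answer with a fixed probability, and the 0.45 figure is invented. The underlying "model" never changes, yet the score rises as more samples, and therefore more compute, are spent per task.

```python
import random
from collections import Counter


def sample_answer(rng, p_correct):
    """Stand-in for one model call: right with probability p_correct, else a random wrong answer."""
    if rng.random() < p_correct:
        return "right"
    return rng.choice(["wrong_a", "wrong_b", "wrong_c"])


def majority_vote(rng, p_correct, n_samples):
    """Draw n_samples answers and return the most common one."""
    votes = Counter(sample_answer(rng, p_correct) for _ in range(n_samples))
    return votes.most_common(1)[0][0]


def accuracy(p_correct, n_samples, n_tasks=2000, seed=0):
    """Fraction of simulated tasks where the voted answer is right."""
    rng = random.Random(seed)
    hits = sum(
        majority_vote(rng, p_correct, n_samples) == "right" for _ in range(n_tasks)
    )
    return hits / n_tasks


# Same simulated model at every row; only the per-task sampling budget changes.
for n in (1, 5, 25):
    print(f"samples={n:>2}  relative cost={n}x  accuracy={accuracy(0.45, n):.2f}")
```

Reporting the relative cost next to each accuracy figure is the point: without that column, the 25-sample row would read as a better model rather than a larger budget.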
8:10 – 11:01
Poker bots as a high-signal eval: reasoning, iteration, and failure modes
Noam describes using poker bot construction as an evaluation because it requires multi-step reasoning and has fewer readily available code templates. He explains how model performance progressed across versions, including common pitfalls and improvements in reliability.
- Poker bots require deep reasoning and iterative debugging
- Limited open-source tooling makes it hard to “pattern match”
- 5.2 enabled a river solver and sped up Noam’s work ~5×
- Optimization quality was striking (often 10–100× faster code)
- Earlier models could “gaslight” and fail basic sanity checks
11:01 – 11:49
From river solvers to full solvers: rapid capability gains and near-term expectations
Noam contrasts earlier partial success with 5.5’s near zero-shot competence and predicts full solver construction may soon be achievable with minimal steering. This highlights how capability can appear suddenly once models cross a practical threshold.
- 5.5 can build much more of the poker stack with gentle steering
- Full-scale solver work is increasingly feasible with today’s models
- Near-term possibility: full solver zero-shot in 6–12 months
- Model reliability improved vs earlier releases
- Progress feels like crossing “useful inflection points”
11:49 – 14:44
Safety evals in an inference-scaling world: capability depends on budget
Noam explains why current safety governance struggles: most frameworks were designed when test-time compute scaling was limited. If dangerous capabilities scale with budget, then evaluating at a single fixed budget can understate real-world risk at higher spend levels.
- Preparedness/RSP frameworks focus on static capability levels
- Modern models’ dangerous capabilities may scale with spend
- Key governance gap: what inference budget defines evaluation?
- Some frameworks consider this, but it’s not a dominant norm
- Ignoring this creates a mismatch between reported safety and reality
14:44 – 17:06
Release cadence vs long-horizon agents: you can’t measure the ceiling fast enough
They discuss a growing mismatch: models can now run productive long-horizon tasks for weeks/months, but new models ship every couple of months. Fully testing the ceiling of an agent that can run for months would require months-long evals, which conflicts with competitive release pressure.
- Stronger models can sustain longer-horizon autonomous work
- To know month-scale capability, you may need to run it for a month
- New releases arrive before prior models are fully explored
- Users discover long-run behaviors only after significant time
- Competition creates incentives to avoid delaying releases for evals
17:06 – 20:59
Latent capability case study: the Erdős unit distance disproof and expensive scaffolds
Noam recounts an internal OpenAI model contributing to a notable math result, and notes similar outcomes could be extracted from public models with enough scaffolding. The episode illustrates that many models’ best achievements may be “latent,” only emerging under larger budgets and better orchestration.
- Internal model helped disprove Erdős unit distance conjecture
- Result came from relatively low budget in that internal setting
- Community later found a scaffold to elicit it from 5.5 too
- General scaffolds (strategy enumeration + exploration) can work
- Latent capability may require $1k–$100k+ to reliably unlock
20:59 – 27:10
Limits on recursive self-improvement: why ‘overnight’ takeoff seems unlikely
Noam argues we’re not at the point where unlimited inference budget yields universal superintelligence. Some tasks don’t improve with more thinking, and research itself is bottlenecked by taste and time. Because the strongest results increasingly depend on long test-time compute, wall-clock time itself becomes a limit on how quickly any takeoff could happen.
- Not all benchmarks benefit from more inference (e.g., factual recall)
- Some tasks can improve arbitrarily with compute (e.g., brute-force Sudoku)
- Research taste and direction-setting remain weak points for models
- Models accelerate parts of research, but bottlenecks shift elsewhere
- Long test-time compute makes wall-clock time a key limiter on takeoff
27:10 – 31:51
Large-scale multi-agent coordination: compounding knowledge like civilization
They revisit multi-agent systems and argue the biggest unlock may come from AI systems that can coordinate and accumulate knowledge over time, analogous to how human civilization compounds progress. Today’s models are short-lived in context; future systems may share and build globally.
- Multi-agent work exists but may be far from its ceiling
- Human progress comes from coordination and accumulated knowledge
- Current models ‘disappear’ after short contexts and don’t truly compound
- Early systems hint at where coordination could go (e.g., Moltbook/OpenClaw)
- Future: productive knowledge sharing and compounding across agents
31:51 – 36:18
Breaking the benchmark-grid equilibrium: evaluate by cost and compare routing fairly
Noam describes a social/industry coordination problem: everyone publishes benchmark grids because everyone expects them. He argues the community should switch to x-axis reporting (cost/tokens/time), and apply the same budget controls when judging routing/consensus layers so gains aren’t just extra compute or benchmark overfitting.
- Industry stuck in a ‘publish the grid’ equilibrium despite flaws
- Proposal: normalize x-axis reporting (tokens/time/$) as default
- Routing/consensus can boost scores but must be compute-normalized
- Question: routing vs just letting one model think longer at same budget
- Benchmark skepticism still applies even with better cost accounting
