No Priors

Why Traditional Benchmarks Fail Modern AI Models with OpenAI Research Scientist Noam Brown

When a new AI model drops, it’s judged based on a static benchmark grid that doesn’t account for how long the model is allowed to think. How then should we measure a model’s true capability? OpenAI research scientist Noam Brown returns to talk with Sarah Guo about his latest essay on why the AI industry’s traditional benchmark grids are broken, and how large-scale test-time compute is fundamentally changing how models are evaluated. Noam explains how, if properly scaffolded, today’s models can reason for weeks or even months on complex tasks. He also discusses real-world implications of test-time compute, from building poker solver bots to disproving legendary math conjectures. Together, they also unpack the large gaps in current AI safety frameworks, explore the bottlenecks for recursive self-improvement, and look ahead at the future of multi-agent collaboration and global knowledge sharing.

Read more: Implications of Large-Scale Test-Time Compute

Sign up for new podcasts every week. Email feedback to show@no-priors.com

Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @polynoamial | @OpenAI

Chapters:
00:00 – Cold Open
00:43 – Noam Brown Introduction
01:23 – Why Benchmarks Are Broken
04:19 – Compute Budgets and Projections
05:34 – How Long Should Models Think?
06:47 – Benchmark-Maxxing
08:34 – Using Poker Bots as Evals
11:26 – Safety Evals When Model Capability Scales With Budget
14:41 – Release Cycle vs. Agent Runtime
17:06 – Latent Model Capability
20:59 – Limits on Recursive Self-Improvement
27:09 – Large-Scale Multi-Agent Coordination
29:11 – Competition at the Frontier
31:51 – Breaking the Benchmark Grid Equilibrium
33:29 – Why Benchmarks Should be Evaluated by Cost
36:18 – Conclusion

Noam Brown (guest), Sarah Guo (host)
Jun 26, 2026 · 36m

At a glance

WHAT IT’S REALLY ABOUT

Modern AI benchmarks ignore test-time compute, which drastically distorts capability comparisons

  1. Traditional single-score benchmark grids are misleading because they don’t control for test-time compute (tokens/time/cost), which now strongly affects model capability.
  2. Newer models may look only marginally better on standard benchmarks yet feel dramatically better in practice because they achieve similar or higher performance with far less “thinking time.”
  3. On many tasks, performance does not plateau until extremely large inference budgets (weeks/months or 100M+ tokens), making “run until it converges” evaluations incompatible with rapid release cycles.
  4. Safety and responsible-scaling frameworks built in the ChatGPT/GPT‑3 era often under-account for inference-budget scaling, raising the question of what budget to use when evaluating dangerous capabilities.
  5. Scaffolding, multi-sampling, routing, and multi-agent setups can “benchmark-max” results; comparisons only become meaningful when normalized by total test-time compute cost.

IDEAS WORTH REMEMBERING

5 ideas

Model capability is increasingly a function of inference budget.

Brown argues modern systems can do far more when allowed more tokens/time/money at test time, so a model’s “capability” is no longer a single fixed value.

Benchmark grids hide efficiency gains and mis-rank models.

A newer model can appear only slightly better on a grid while being substantially better per unit compute; controlling for thinking time can reveal the real jump (e.g., 5.5 vs. 5.4).

“Evaluate until performance plateaus” is becoming impractical.

For some benchmarks, productive improvement can continue for extremely long runs (weeks) or massive token budgets (100M+), so full ceiling-finding doesn’t fit typical evaluation timelines.
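A back-of-the-envelope sketch shows why runs of this size clash with evaluation timelines. Only the 100M-token budget comes from the summary above; the decoding speed of 50 tokens per second and the function name are assumptions chosen purely for illustration, and the calculation treats the run as a single sequential stream.

```python
SECONDS_PER_DAY = 86_400


def sequential_runtime_days(total_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock days needed to generate total_tokens one after another."""
    return total_tokens / tokens_per_second / SECONDS_PER_DAY


# 100M tokens is the budget mentioned above; 50 tokens/s is an assumed speed.
days = sequential_runtime_days(100_000_000, 50.0)
print(f"{days:.1f} days")  # 23.1 days
```

Under these assumed numbers, a single run to that budget takes a little over three weeks, which is in line with the "weeks" figure and longer than many evaluation windows.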

Benchmarks should be plotted against an x-axis of compute/cost.

Instead of one number per benchmark, Brown recommends reporting performance as a function of tokens/time/$ to enable apples-to-apples comparisons and highlight tradeoffs.
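As a minimal sketch of what a compute-aware comparison might look like, the snippet below stores each model's results as (tokens, score) points and compares two models at a matched budget. All names and numbers are hypothetical and invented for illustration; they are not results from the episode. The interpolation (linear in the logarithm of the token budget) is likewise an assumption, not a method Brown describes.

```python
import math


def score_at_budget(curve, budget):
    """Estimate the score at a token budget by interpolating linearly in log(tokens).

    curve is a list of (tokens, score) points; budgets outside the measured
    range are clamped to the nearest measured point.
    """
    points = sorted(curve)
    if budget <= points[0][0]:
        return points[0][1]
    if budget >= points[-1][0]:
        return points[-1][1]
    for (t0, s0), (t1, s1) in zip(points, points[1:]):
        if t0 <= budget <= t1:
            frac = math.log(budget / t0) / math.log(t1 / t0)
            return s0 + frac * (s1 - s0)


# Hypothetical (tokens, score) measurements for two models.
older_model = [(1_000, 0.40), (10_000, 0.55), (100_000, 0.70)]
newer_model = [(1_000, 0.55), (10_000, 0.70), (100_000, 0.75)]

# A benchmark grid typically reports one number per model, here the best score.
grid_gap = max(s for _, s in newer_model) - max(s for _, s in older_model)

# Comparing at the same 10,000-token budget tells a different story.
matched_gap = score_at_budget(newer_model, 10_000) - score_at_budget(older_model, 10_000)

print(f"grid gap: {grid_gap:.2f}, gap at matched budget: {matched_gap:.2f}")
```

With these made-up curves, the single-number comparison shows a gap of about 0.05, while the matched-budget comparison shows about 0.15, the kind of efficiency gain a grid hides.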

Compute-normalization is a partial antidote to benchmark gaming.

Techniques like best-of-N sampling, judge selection, or scaffolding can inflate scores; accounting for total test-time compute clarifies whether gains are real or just more spending.
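One way to make this accounting concrete is to record the total test-time tokens next to every score. The sketch below is illustrative only: the `EvalRun` class, its fields, and all numbers are hypothetical, and it assumes a best-of-N setup where every sample is also read by a judge.

```python
from dataclasses import dataclass


@dataclass
class EvalRun:
    name: str
    score: float
    tokens_per_sample: int
    num_samples: int = 1
    judge_tokens_per_sample: int = 0  # extra tokens spent selecting among samples

    def total_tokens(self) -> int:
        """Total test-time tokens: generation plus judging, across all samples."""
        return self.num_samples * (self.tokens_per_sample + self.judge_tokens_per_sample)


# Hypothetical runs of the same model on the same benchmark.
runs = [
    EvalRun("single sample", score=0.60, tokens_per_sample=20_000),
    EvalRun("best-of-16 with judge", score=0.68, tokens_per_sample=20_000,
            num_samples=16, judge_tokens_per_sample=1_000),
]

for run in runs:
    print(f"{run.name}: score={run.score:.2f}, total tokens={run.total_tokens():,}")
```

In this made-up example the best-of-16 run scores higher but spends 336,000 tokens against 20,000, so the fair comparison is against a single run given the same budget rather than against the cheap single-sample score.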

WORDS WORTH SAVING

5 quotes

The problem is we're in a world now where the capability of the model is a function of how much money you put into it, basically.

Noam Brown

The proper way to evaluate the models now is you either have some kind of budget for the benchmark, whether it's tokens or cost or time or whatever, or you plot the performance as a function of the amount of test-time compute that's going into the model.

Noam Brown

5.5 and o-other models can think for, if you scaffold them reasonably well, can think for weeks even, um, before having performance plateau on some of these benchmarks.

Noam Brown

And so you kind of end up in this, this bad equilibrium where everybody kind of knows that it's a bad equilibrium, but, like, nobody wants to break out.

Noam Brown

If it requires so much test-time compute to unlock the full capabilities of the model, then that means you're bottlenecked by time.

Noam Brown

Benchmark grids vs. cost/compute-normalized evals
Inference-time scaling and “how long should models think?”
Projecting performance from small to huge budgets
Benchmark-maxing via scaffolds, self-consistency, routing
Poker bots as a reasoning-heavy evaluation
Safety evals under variable inference budgets
Release cadence vs. long-horizon agent runtime and latent capability

High quality AI-generated summary created from speaker-labeled transcript.
