No Priors
Why Traditional Benchmarks Fail Modern AI Models with OpenAI Research Scientist Noam Brown
At a glance
WHAT IT’S REALLY ABOUT
Modern AI benchmarks ignore test-time compute, which badly distorts capability comparisons
- Traditional single-score benchmark grids are misleading because they don’t control for test-time compute (tokens/time/cost), which now strongly affects model capability.
- Newer models may look only marginally better on standard benchmarks yet feel dramatically better in practice because they achieve similar or higher performance with far less “thinking time.”
- On many tasks, performance does not plateau until extremely large inference budgets (weeks/months or 100M+ tokens), making “run until it converges” evaluations incompatible with rapid release cycles.
- Safety and responsible-scaling frameworks built in the ChatGPT/GPT‑3 era often under-account for inference-budget scaling, raising the question of what budget to use when evaluating dangerous capabilities.
- Scaffolding, multi-sampling, routing, and multi-agent setups can “benchmark-max” results; comparisons only become meaningful when normalized by total test-time compute cost.
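The last point can be made concrete with a little bookkeeping. The following is a minimal sketch, not anything described in the episode: the `EvalRun` class, its field names, the token counts, and the price per million tokens are all hypothetical, chosen only to show how a best-of-N setup with a judge multiplies the test-time compute behind a single reported score.

```python
from dataclasses import dataclass


@dataclass
class EvalRun:
    """Hypothetical record of how one benchmark problem was attempted."""
    name: str
    samples: int                      # N in best-of-N (1 = single attempt)
    tokens_per_sample: int            # average tokens generated per attempt
    judge_tokens_per_sample: int = 0  # tokens a judge spends per candidate

    def total_tokens(self) -> int:
        # Every sample is generated and, if a judge is used, also scored.
        return self.samples * (self.tokens_per_sample + self.judge_tokens_per_sample)

    def cost_usd(self, usd_per_million_tokens: float) -> float:
        # Assumes one flat, illustrative price for all tokens.
        return self.total_tokens() / 1_000_000 * usd_per_million_tokens


# Illustrative numbers only; they are not measurements of any real model.
single = EvalRun("single sample", samples=1, tokens_per_sample=20_000)
best_of_8 = EvalRun("best-of-8 + judge", samples=8,
                    tokens_per_sample=20_000, judge_tokens_per_sample=1_000)

for run in (single, best_of_8):
    print(f"{run.name}: {run.total_tokens():,} tokens per problem")
```

Under these made-up numbers the scaffolded run spends 8.4 times the tokens of the single-sample run, so a higher score from it says little on its own about which underlying model is better.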
IDEAS WORTH REMEMBERING
5 ideas
Model capability is increasingly a function of inference budget.
Brown argues modern systems can do far more when allowed more tokens/time/money at test time, so a model’s “capability” is no longer a single fixed value.
Benchmark grids hide efficiency gains and mis-rank models.
A newer model can appear only slightly better on a grid while being substantially better per unit of compute; controlling for thinking time can reveal the real jump between versions (e.g., 5.5 vs. 5.4).
“Evaluate until performance plateaus” is becoming impractical.
For some benchmarks, performance can keep improving over extremely long runs (weeks) or massive token budgets (100M+), so finding a model's true ceiling doesn't fit typical evaluation timelines.
Benchmarks should be plotted against an x-axis of compute/cost.
Instead of one number per benchmark, Brown recommends reporting performance as a function of tokens/time/$ to enable apples-to-apples comparisons and highlight tradeoffs.
Compute-normalization is a partial antidote to benchmark gaming.
Techniques like best-of-N sampling, judge selection, or scaffolding can inflate scores; accounting for total test-time compute clarifies whether gains are real or just more spending.
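One way to read the "plot against compute" idea is that each model gets a curve of (budget, score) points, and models are compared at the same budget rather than by a single number. The sketch below assumes that framing. The function name `accuracy_at_budget`, the choice to interpolate linearly in log10(tokens), and both curves are illustrative assumptions, not data or methods from the episode.

```python
import math


def accuracy_at_budget(curve, budget):
    """Estimate accuracy at a token budget from measured (tokens, accuracy) points.

    `curve` must be sorted by tokens, ascending. Interpolation is linear in
    log10(tokens), an assumption made because budgets span orders of magnitude.
    """
    if budget < curve[0][0] or budget > curve[-1][0]:
        raise ValueError("budget is outside the measured range; do not extrapolate")
    for (t0, a0), (t1, a1) in zip(curve, curve[1:]):
        if t0 <= budget <= t1:
            frac = (math.log10(budget) - math.log10(t0)) / (math.log10(t1) - math.log10(t0))
            return a0 + frac * (a1 - a0)
    return curve[-1][1]  # only reached for a single-point curve


# Made-up curves for two hypothetical models, for illustration only.
older_model = [(1_000, 0.20), (10_000, 0.40), (100_000, 0.60)]
newer_model = [(1_000, 0.35), (10_000, 0.55), (100_000, 0.62)]

budget = 10_000
print(accuracy_at_budget(older_model, budget), accuracy_at_budget(newer_model, budget))
```

In this invented example the two models sit close together at the largest budget, which is what a single-score grid might report, while the gap at smaller budgets is much wider. Refusing to extrapolate past the measured range is a deliberate choice here, since the summary notes that curves may not plateau until very large budgets.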
WORDS WORTH SAVING
5 quotes
The problem is we're in a world now where the capability of the model is a function of how much money you put into it, basically.
— Noam Brown
The proper way to evaluate the models now is you either have some kind of budget for the benchmark, whether it's tokens or cost or time or whatever, or you plot the performance as a function of the amount of test-time compute that's going into the model.
— Noam Brown
5.5 and o-other models can think for, if you scaffold them reasonably well, can think for weeks even, um, before having performance plateau on some of these benchmarks.
— Noam Brown
And so you kind of end up in this, this bad equilibrium where everybody kind of knows that it's a bad equilibrium, but, like, nobody wants to break out.
— Noam Brown
If it requires so much test-time compute to unlock the full capabilities of the model, then that means you're bottlenecked by time.
— Noam Brown
High quality AI-generated summary created from speaker-labeled transcript.