No Priors: Why Traditional Benchmarks Fail Modern AI Models with OpenAI Research Scientist Noam Brown
CHAPTERS
- 0:00 – 1:24
Cold open: safety policies miss test-time compute as a capability multiplier
Noam frames the core problem: modern model capability is no longer a single fixed property but increases with the test-time compute (money/time/tokens) you’re willing to spend. He argues most preparedness/responsible scaling policies don’t specify what inference budget they evaluate at, creating a blind spot.
- •Model capability scales with inference budget (e.g., $10 vs. $10M)
- •Older models (e.g., GPT-3) didn’t meaningfully benefit from huge test-time budgets
- •Current safety frameworks often treat capability as static
- •Key unresolved question: at what budget should models be evaluated?
- •Policy and eval practices lag behind inference-time scaling reality
- 1:24 – 2:38
Why the 5.5 vs 5.4 benchmark grid misled people
Noam explains why initial reactions to the 5.5 release were skeptical: the standard benchmark grid showed only modest gains. The grid format hides that 5.5 is more compute-efficient at reasoning, so comparisons can be unfair if thinking time isn’t controlled.
- •Benchmark grids reduce model performance to single scores per benchmark
- •On paper, 5.5 looked like only a small improvement over 5.4
- •Real usage revealed a larger jump than the grid suggested
- •The missing variable is test-time compute / thinking time
- •Efficiency differences can dominate perceived capability
- 2:38 – 4:00
Plateaus are too far out: models can improve for weeks under scaffolding
The usual suggestion—run models until performance plateaus—breaks down because the plateau can be extremely far away. With modern scaffolding and long-horizon workflows, models can keep improving for very long runs, making “run-to-asymptote” impractical for evaluations.
- •Controlling for thinking time reveals larger performance gaps
- •Old assumption: models plateau quickly (true-ish in GPT-3 era)
- •New reality: productive improvement can continue for weeks
- •Scaffolding can extend effective reasoning horizons dramatically
- •Eval methodology must acknowledge practical limits on runtime
- 4:00 – 4:20
Budgeted evals and performance curves: plotting capability vs cost/tokens/time
Noam proposes a new evaluation norm: fix a budget (cost/tokens/time) or report a curve showing performance as a function of test-time compute. This makes model comparisons more honest and more aligned with real deployment constraints.
- •Use an explicit inference budget for each benchmark run
- •Prefer performance-vs-compute curves over single grid numbers
- •Budgets can be measured in tokens, wall-clock time, or dollars
- •This approach captures both raw capability and efficiency
- •Enables apples-to-apples comparisons across models and settings
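A minimal sketch of what budgeted reporting could look like, assuming each benchmark run is logged with the model name, the inference budget it was allowed, and its score. The `RunResult` type, the model names, and all numbers are hypothetical and only illustrate the idea; nothing here comes from the episode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    model: str
    budget_usd: float  # inference spend allowed per task (assumed unit)
    score: float       # benchmark accuracy in [0, 1]


def performance_curve(results, model):
    """Return (budget, score) points for one model, sorted by budget."""
    points = [(r.budget_usd, r.score) for r in results if r.model == model]
    return sorted(points)


def score_at_budget(curve, budget):
    """Best score reached without exceeding the budget, or None if no run fits."""
    affordable = [score for b, score in curve if b <= budget]
    return max(affordable) if affordable else None


# Made-up numbers for two hypothetical models.
results = [
    RunResult("model_a", 1.0, 0.40),
    RunResult("model_a", 10.0, 0.55),
    RunResult("model_a", 100.0, 0.62),
    RunResult("model_b", 1.0, 0.52),
    RunResult("model_b", 10.0, 0.66),
    RunResult("model_b", 100.0, 0.70),
]

# Compare the two models at matched budgets instead of one grid number each.
for budget in (1.0, 10.0, 100.0):
    a = score_at_budget(performance_curve(results, "model_a"), budget)
    b = score_at_budget(performance_curve(results, "model_b"), budget)
    print(f"${budget:>7.2f}  model_a={a:.2f}  model_b={b:.2f}")
```

With these illustrative numbers, a grid that happened to report model_a at $100 and model_b at $10 would show a 0.04 gap, while the matched-budget comparison at $10 shows 0.11.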
- 4:20 – 5:34
Compute projections: extrapolating from small budgets to large-budget performance
They discuss the challenge of evals that are too slow/expensive to run to completion, especially in domains like cyber. Noam suggests forecasting large-budget performance by measuring improvement slopes at smaller budgets, and calls out this projection problem as an open research opportunity.
- •Some eval domains show continued gains past 100M tokens
- •Long-run evaluations can be prohibitively time-consuming
- •Performance often improves smoothly enough to estimate trends
- •Potential research: predict $10k performance from $10–$100 runs
- •Release cadence pressures labs to use faster proxy measurements
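One way to picture the projection idea is a trend fit on cheap runs that is then extrapolated to a budget that was never run. The sketch below assumes score grows linearly in the logarithm of the budget, which is a modeling assumption made here for illustration and not a method described in the episode; real curves may saturate or jump, so any projection of this kind would need validation against actual long runs. The function names and data points are hypothetical.

```python
import math


def fit_log_linear(budgets, scores):
    """Least-squares fit of score = intercept + slope * log10(budget)."""
    xs = [math.log10(b) for b in budgets]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def project(intercept, slope, budget):
    """Extrapolate the fitted trend to a new budget, clamped to [0, 1]."""
    raw = intercept + slope * math.log10(budget)
    return min(1.0, max(0.0, raw))


# Made-up scores measured at small budgets (in dollars).
small_budgets = [10.0, 30.0, 100.0]
small_scores = [0.20, 0.26, 0.30]

intercept, slope = fit_log_linear(small_budgets, small_scores)
print(f"slope per 10x budget: {slope:.3f}")
print(f"projected score at $10k: {project(intercept, slope, 10_000.0):.2f}")
```

The clamp only keeps the projection inside the valid score range; it does not model saturation, which is one reason the projection problem remains open.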
- 5:34 – 6:47
How long should models think in practice? Fast iteration vs long deliberation
Sarah asks whether users underuse test-time compute. Noam argues optimal thinking time depends on the workflow: long runs can boost benchmark scores, but real work often requires rapid iteration and flexible latency.
- •Week-long “think then answer” is often impractical
- •Users tend to prefer quick back-and-forth iteration
- •Thinking time should be adaptable to task urgency and value
- •Balance responsiveness with deeper reasoning when warranted
- •Practical constraints shape how inference-time scaling is used
- 6:47 – 8:10
Benchmark-maxxing: scaffolds, self-consistency, and misleading paper gains
Noam warns that many headline benchmark improvements can come from spending more test-time compute—e.g., running multiple samples, picking the best, or using a judge model. Without compute controls, benchmark-maxxing can look like true model progress when it’s largely extra budget.
- •Simple scaffolds (N samples + select/judge) can inflate scores
- •Benchmark gains may reflect more inference compute, not better models
- •Concern is less “cheating” than misleading comparisons
- •Benchmark optimization risk increases once benchmarks become targets
- •Held-out/private sets can reduce overfitting to public benchmarks
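To see how much a simple "N samples + select" scaffold can move a score, here is a rough sketch under two strong assumptions that are not from the episode: samples succeed independently with the same probability, and the selector always picks a correct sample when one exists. Under those assumptions the pass rate is 1 - (1 - p)^N, which is an upper bound on what a real judge would achieve. The per-sample cost and success rates are made up.

```python
def best_of_n_pass_rate(p_single, n):
    """Upper-bound pass rate with n independent samples and a perfect selector."""
    return 1.0 - (1.0 - p_single) ** n


def total_cost(n, cost_per_sample):
    """Inference cost of the scaffold, ignoring any judge-model cost."""
    return n * cost_per_sample


# Hypothetical single-sample success rate and per-sample cost in dollars.
p_single = 0.30
cost_per_sample = 0.50

for n in (1, 2, 4, 8):
    rate = best_of_n_pass_rate(p_single, n)
    cost = total_cost(n, cost_per_sample)
    print(f"N={n}: pass rate <= {rate:.3f} at ${cost:.2f} per task")
```

The model is unchanged in every row; only the inference budget differs, which is why a reported score without its N and cost is hard to interpret.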
- 8:10 – 11:01
Poker bots as a high-signal eval: reasoning, iteration, and failure modes
Noam describes using poker bot construction as an evaluation because it requires multi-step reasoning and has fewer readily available code templates. He explains how model performance progressed across versions, including common pitfalls and improvements in reliability.
- •Poker bots require deep reasoning and iterative debugging
- •Limited open-source tooling makes it hard to “pattern match”
- •5.2 enabled a river solver and sped up Noam’s work ~5×
- •Optimization quality was striking (often 10–100× faster code)
- •Earlier models could “gaslight” and fail basic sanity checks
- 11:01 – 11:49
From river solvers to full solvers: rapid capability gains and near-term expectations
Noam contrasts earlier partial success with 5.5’s near zero-shot competence and predicts full solver construction may soon be achievable with minimal steering. This highlights how capability can appear suddenly once models cross a practical threshold.
- •5.5 can build much more of the poker stack with gentle steering
- •Full-scale solver work is increasingly feasible with today’s models
- •Near-term possibility: full solver zero-shot in 6–12 months
- •Model reliability improved vs earlier releases
- •Progress feels like crossing “useful inflection points”
- 11:49 – 14:44
Safety evals in an inference-scaling world: capability depends on budget
Noam explains why current safety governance struggles: most frameworks were designed when test-time compute scaling was limited. If dangerous capabilities scale with budget, then evaluating at a single fixed budget can understate real-world risk at higher spend levels.
- •Preparedness/RSP frameworks focus on static capability levels
- •Modern models’ dangerous capabilities may scale with spend
- •Key governance gap: what inference budget defines evaluation?
- •Some frameworks consider this, but it’s not a dominant norm
- •Ignoring this creates a mismatch between reported safety and reality
- 14:44 – 17:06
Release cadence vs long-horizon agents: you can’t measure the ceiling fast enough
They discuss a growing mismatch: models can now run productive long-horizon tasks for weeks/months, but new models ship every couple of months. Fully testing the ceiling of an agent that can run for months would require months-long evals, which conflicts with competitive release pressure.
- •Stronger models can sustain longer-horizon autonomous work
- •To know month-scale capability, you may need to run it for a month
- •New releases arrive before prior models are fully explored
- •Users discover long-run behaviors only after significant time
- •Competition creates incentives to avoid delaying releases for evals
- 17:06 – 20:59
Latent capability case study: the Erdős unit distance disproof and expensive scaffolds
Noam recounts an internal OpenAI model contributing to a notable math result, and notes similar outcomes could be extracted from public models with enough scaffolding. The episode illustrates that many models’ best achievements may be “latent,” only emerging under larger budgets and better orchestration.
- •Internal model helped disprove the Erdős unit distance conjecture
- •Result came from a relatively low budget in that internal setting
- •Community later found a scaffold to elicit it from 5.5 too
- •General scaffolds (strategy enumeration + exploration) can work
- •Latent capability may require $1k–$100k+ to reliably unlock
- 20:59 – 27:10
Limits on recursive self-improvement: why ‘overnight’ takeoff seems unlikely
Noam argues we’re not at the point where unlimited inference budget yields universal superintelligence. Some tasks don’t improve with more thinking, and research itself is bottlenecked by taste and time. Because gains increasingly depend on long test-time compute, wall-clock time itself limits how quickly any takeoff could happen.
- •Not all benchmarks benefit from more inference (e.g., factual recall)
- •Some tasks can improve arbitrarily with compute (e.g., brute-force Sudoku)
- •Research taste and direction-setting remain weak points for models
- •Models accelerate parts of research, but bottlenecks shift elsewhere
- •Long test-time compute makes wall-clock time a key limiter on takeoff
- 27:10 – 31:51
Large-scale multi-agent coordination: compounding knowledge like civilization
They revisit multi-agent systems and argue the biggest unlock may come from AI systems that can coordinate and accumulate knowledge over time, analogous to how human civilization compounds progress. Today’s models are short-lived, with nothing persisting beyond a short context; future systems may share knowledge and build on it across many agents.
- •Multi-agent work exists but may be far from its ceiling
- •Human progress comes from coordination and accumulated knowledge
- •Current models ‘disappear’ after short contexts and don’t truly compound
- •Early systems hint at where coordination could go (e.g., Moltbook/OpenClaw)
- •Future: productive knowledge sharing and compounding across agents
- 31:51 – 36:18
Breaking the benchmark-grid equilibrium: evaluate by cost and compare routing fairly
Noam describes a social/industry coordination problem: everyone publishes benchmark grids because everyone expects them. He argues the community should switch to x-axis reporting (cost/tokens/time), and apply the same budget controls when judging routing/consensus layers so gains aren’t just extra compute or benchmark overfitting.
- •Industry stuck in a ‘publish the grid’ equilibrium despite flaws
- •Proposal: normalize x-axis reporting (tokens/time/$) as default
- •Routing/consensus can boost scores but must be compute-normalized
- •Question: does routing beat just letting one model think longer at the same budget?
- •Benchmark skepticism still applies even with better cost accounting
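The compute-normalized comparison for routing or consensus layers could be sketched as follows: take the system's score and its total cost (all routed calls plus any judge), then look up what a single baseline model scores at that same cost. The interpolation in log-budget space, the function names, and every number are assumptions for illustration, not a procedure from the episode.

```python
import math


def interpolate_score(curve, budget):
    """Interpolate a baseline score at `budget`, linearly in log-budget.

    `curve` is a list of (budget, score) points with distinct positive budgets.
    Outside the measured range the nearest endpoint score is returned.
    """
    curve = sorted(curve)
    if budget <= curve[0][0]:
        return curve[0][1]
    if budget >= curve[-1][0]:
        return curve[-1][1]
    for (b0, s0), (b1, s1) in zip(curve, curve[1:]):
        if b0 <= budget <= b1:
            t = (math.log(budget) - math.log(b0)) / (math.log(b1) - math.log(b0))
            return s0 + t * (s1 - s0)


def compute_normalized_gain(system_score, system_total_cost, baseline_curve):
    """Score gain of a routed system over one model given the same budget."""
    return system_score - interpolate_score(baseline_curve, system_total_cost)


# Made-up baseline: one model allowed to think longer at each budget (dollars).
baseline_curve = [(1.0, 0.50), (10.0, 0.60), (100.0, 0.70)]

# Hypothetical routed system: three model calls plus a judge, $40 in total.
gain = compute_normalized_gain(0.66, 40.0, baseline_curve)
print(f"gain over single model at equal budget: {gain:+.3f}")
```

A gain near zero would suggest the routing layer is mostly buying extra compute, while a clearly positive gain at equal budget is the more meaningful evidence that the layer adds something.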