Why Traditional Benchmarks Fail Modern AI Models with OpenAI Research Scientist Noam Brown

When a new AI model drops, it’s judged based on a static benchmark grid that doesn’t account for how long the model is allowed to think. How then should we measure a model’s true capability? OpenAI research scientist Noam Brown returns to talk with Sarah Guo about his latest essay on why the AI industry’s traditional benchmark grids are broken, and how large-scale test-time compute is fundamentally changing how models are evaluated. Noam explains how, if properly scaffolded, today’s models can reason for weeks or even months on complex tasks. He also discusses real-world implications of test-time compute, from building poker solver bots to disproving legendary math conjectures. Together, they also unpack the large gaps in current AI safety frameworks, explore the bottlenecks for recursive self-improvement, and look ahead at the future of multi-agent collaboration and global knowledge sharing.

Read more: Implications of Large-Scale Test-Time Compute

Sign up for new podcasts every week. Email feedback to show@no-priors.com

Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @polynoamial | @OpenAI

Chapters:
00:00 – Cold Open
00:43 – Noam Brown Introduction
01:23 – Why Benchmarks Are Broken
04:19 – Compute Budgets and Projections
05:34 – How Long Should Models Think?
06:47 – Benchmark-Maxxing
08:34 – Using Poker Bots as Evals
11:26 – Safety Evals When Model Capability Scales With Budget
14:41 – Release Cycle vs. Agent Runtime
17:06 – Latent Model Capability
20:59 – Limits on Recursive Self-Improvement
27:09 – Large-Scale Multi-Agent Coordination
29:11 – Competition at the Frontier
31:51 – Breaking the Benchmark Grid Equilibrium
33:29 – Why Benchmarks Should be Evaluated by Cost
36:18 – Conclusion

Noam Brown (guest), Sarah Guo (host)

Jun 26, 2026 | 36m

EPISODE INFO

Released: June 26, 2026
Duration: 36m
Channel: No Priors

SPEAKERS

Noam Brown (guest): OpenAI research scientist focused on AI reasoning, evaluation, and agentic/scaffolded inference; known for work on poker-playing AI systems.
Sarah Guo (host): Co-host of No Priors and investor focused on AI and technology.

EPISODE SUMMARY

In this episode of No Priors, Sarah Guo talks with Noam Brown about how modern AI benchmarks ignore test-time compute, which distorts capability comparisons. Traditional single-score benchmark grids are misleading because they don’t control for test-time compute (tokens, time, or cost), which now strongly affects model capability.

RELATED EPISODES