OpenAI
Why Tejal Patwardhan stopped underestimating the models - Episode 21
At a glance
WHAT IT’S REALLY ABOUT
Frontier evals evolve as benchmarks saturate and models surprise researchers
- Traditional academic benchmarks are increasingly saturated, so they no longer distinguish top models or predict future progress well.
- Reasoning advances showed capability gains without larger models, creating urgency to forecast and prepare for rapidly scaling generalization beyond math.
- OpenAI is shifting from static, well-specified tests toward realistic, long-horizon, tool-using and even physical-world evals (e.g., wet-lab optimization).
- Benchmarking to win leaderboards is framed as harmful because it can sacrifice real usefulness to users and honest scientific measurement.
- Multimodal and agentic systems (voice, vision, computer use) demand new evaluation infrastructure, safety testing, and production monitoring.
IDEAS WORTH REMEMBERING
Saturated benchmarks stop being informative.
Once models approach near-perfect scores, results can’t meaningfully separate frontier systems (like comparing geniuses on a high-school exam), so evals must become harder and more realistic.
Reasoning changed the eval game by shifting capability without scaling size.
Early reasoning-model experiments improved performance by “thinking longer,” and unexpectedly transferred to difficult science questions (e.g., GPQA), forcing faster upgrades to science/professional evals.
Optimize for real tasks, not leaderboard metrics.
Patwardhan argues “benchmarking is bad” when it means training to look good on a test rather than improving general usefulness; users quickly notice models that are overfit to flashy scores.
The best benchmarks mirror real work with ambiguity and tools.
GDPval measured tasks across occupations using detailed prompts, but the next step is less-specified, manager-style requests where models must decide what to do, gather context, and execute end-to-end.
Agentic models make static evals insufficient.
Models can now take actions (APIs, browsers, repo search, running code) and work for days; evaluation must account for long horizons, tool scaffolding, and operational reliability, not just single-turn answers.
WORDS WORTH SAVING
Generally bad. Benchmarking is bad.
— Tejal Patwardhan
We were really nervous because we were like, "This human baseline's kind of hard. We don't know if the model's going to beat it." But we should never underestimate the model.
— Tejal Patwardhan
It was kind of a feel the AGI moment, one of many.
— Tejal Patwardhan
Hitting the wall is just so not the right way to think about.
— Tejal Patwardhan
We have this saying on our team that pain is the moat.
— Tejal Patwardhan
High-quality AI-generated summary created from a speaker-labeled transcript.