Why Tejal Patwardhan stopped underestimating the models - Episode 21

The old tests are getting too easy. Tejal Patwardhan leads OpenAI’s frontier evals team, which is finding new ways to measure and forecast progress as models become more capable. She and host Andrew Mayne discuss why evals matter for research, how benchmarks can break or get gamed, and what models need to be judged on next.

Chapters
00:00:24 Growing up at OpenAI
00:03:10 Why reasoning changed everything
00:06:28 What made o1 surprising
00:11:20 Why old benchmarks stopped working
00:14:45 What makes a good benchmark
00:17:35 Why evals are getting harder
00:22:09 Measuring voice and vision models
00:24:48 Testing models on real science
00:33:23 How OpenAI tracks frontier progress
00:40:47 What AI means for work

Andrew Mayne (host), Tejal Patwardhan (guest)

Jun 16, 2026, 44m

WHAT IT’S REALLY ABOUT

Frontier evals evolve as benchmarks saturate and models surprise researchers

Traditional academic benchmarks are increasingly saturated, so they no longer distinguish top models or predict future progress well.
Reasoning advances delivered capability gains without larger models, creating urgency to forecast and prepare for capabilities that generalize quickly beyond math.
OpenAI is shifting from static, well-specified tests toward realistic, long-horizon, tool-using and even physical-world evals (e.g., wet-lab optimization).
Training to win leaderboards is framed as harmful because it can sacrifice real usefulness to users and honest scientific measurement.
Multimodal and agentic systems (voice, vision, computer use) demand new evaluation infrastructure, safety testing, and production monitoring.

IDEAS WORTH REMEMBERING

5 ideas

Saturated benchmarks stop being informative.

Once models approach near-perfect scores, results can’t meaningfully separate frontier systems (like comparing geniuses on a high-school exam), so evals must become harder and more realistic.

Reasoning changed the eval game by shifting capability without scaling size.

Early reasoning-model experiments improved performance by “thinking longer,” and unexpectedly transferred to difficult science questions (e.g., GPQA), forcing faster upgrades to science/professional evals.

Optimize for real tasks, not leaderboard metrics.

Patwardhan argues “benchmarking is bad” when it means training to look good on a test rather than improving general usefulness; users quickly notice models that overfit flashy scores.

The best benchmarks mirror real work with ambiguity and tools.

GDPval measured tasks across occupations using detailed prompts, but the next step is less-specified, manager-style requests where models must decide what to do, gather context, and execute end-to-end.

Agentic models make static evals insufficient.

Models can now take actions (APIs, browsers, repo search, running code) and work for days; evaluation must account for long horizons, tool scaffolding, and operational reliability, not just single-turn answers.

WORDS WORTH SAVING

5 quotes

Generally bad. Benchmarking is bad.

— Tejal Patwardhan

We were really nervous because we were like, "This human baseline's kind of hard. We don't know if the model's going to beat it." But we should never underestimate the model.

— Tejal Patwardhan

It was kind of a feel the AGI moment, one of many.

— Tejal Patwardhan

Hitting the wall is just so not the right way to think about.

— Tejal Patwardhan

We have this saying on our team that pain is the moat.

— Tejal Patwardhan

Capability overhang and forecasting progress
Reasoning paradigm and generalization beyond math
Benchmark saturation and why old tests fail
Realistic work-based evals (GDPval)
Agentic evals: tool use, computer use, long-horizon tasks
Multimodal eval challenges (voice, vision, video)
Science frontier evals including wet-lab experiments and safety

High quality AI-generated summary created from speaker-labeled transcript.
