Why Tejal Patwardhan stopped underestimating the models - Episode 21

The old tests are getting too easy. Tejal Patwardhan leads OpenAI’s frontier evals team, which is finding new ways to measure and forecast progress as models become more capable. She and host Andrew Mayne discuss why evals matter for research, how benchmarks can break or get gamed, and what models need to be judged on next.

Chapters
00:00:24 Growing up at OpenAI
00:03:10 Why reasoning changed everything
00:06:28 What made o1 surprising
00:11:20 Why old benchmarks stopped working
00:14:45 What makes a good benchmark
00:17:35 Why evals are getting harder
00:22:09 Measuring voice and vision models
00:24:48 Testing models on real science
00:33:23 How OpenAI tracks frontier progress
00:40:47 What AI means for work

Andrew Mayne (host), Tejal Patwardhan (guest)

Jun 16, 2026 · 44m

EPISODE INFO

Released: June 16, 2026
Duration: 44m
Channel: OpenAI

SPEAKERS

Andrew Mayne (host): Host of the OpenAI Podcast.
Tejal Patwardhan (guest): Research lead at OpenAI focused on frontier evaluations and model benchmarking.

EPISODE SUMMARY

In this episode of the OpenAI Podcast, "Why Tejal Patwardhan stopped underestimating the models" (Episode 21), Andrew Mayne and Tejal Patwardhan explore how frontier evals evolve as benchmarks saturate and models surprise researchers. Traditional academic benchmarks are increasingly saturated, so they no longer distinguish top models or predict future progress well.
