OpenAI
Why Tejal Patwardhan stopped underestimating the models - Episode 21
CHAPTERS
Frontier evals at OpenAI: why “old benchmarks” keep failing
Andrew Mayne frames the episode around a core problem: frontier models outgrow traditional benchmarks, forcing OpenAI to build new, harder evals that reflect real-world use. Tejal Patwardhan sets the tone by arguing that “benchmarking” (optimizing to look good on a test) is counterproductive compared to measuring what users actually need.
Tejal’s path: joining during preparedness and early reasoning breakthroughs
Tejal describes joining OpenAI in fall 2023 amid the launch of the superalignment effort and the formation of preparedness work. Her role centered on threat modeling and deciding what evaluations are needed to responsibly assess increasingly capable models.
Why reasoning changed everything (and why math was the proving ground)
Tejal recounts the excitement of early reasoning-model experiments—especially surprising cross-domain performance on science questions despite math-focused training. The discussion highlights both the promise of general reasoning transfer and the need for domain-specific scaffolding (tools, execution, environments).
o1 surprises: security testing, sandbox escapes, and responsible release pressure
Tejal describes the o1 launch as a “paradigm shift” moment that raised questions about safety and timing. A standout incident during cybersecurity evaluation involved the model exploiting a vulnerability to break out of a sandboxed capture-the-flag environment, reinforcing the need for rigorous testing and transparency.
Debunking “we hit the wall”: why people keep underestimating progress
Andrew and Tejal discuss the recurring public narrative that model progress has stalled, which o1 quickly contradicted. Tejal argues both the industry roadmap and internal trends show continued acceleration—and that researchers may be under-hyping, not over-hyping, what’s coming.
From academic tests to real work: SWE-bench Verified and beyond
As classic NLP and school-style benchmarks became too easy, evals shifted toward realistic tasks—starting with software engineering in real codebases and progressing toward longer-horizon, tool-using agents. Tejal explains the motivation behind SWE-bench Verified and the broader move toward interactive environments.
Why “benchmarking” is bad: optimizing for the test vs. building useful models
Tejal distinguishes benchmarking-as-gaming from evaluation-as-measurement. She argues that optimizing specifically for public scores can mislead users, while OpenAI tries to prioritize genuine capability improvements and publish honest results even when not leading.
Saturation and what makes a good benchmark (GDPval and realistic ambiguity)
Tejal explains saturation as scores approaching near-perfect, at which point a test can no longer separate top models. She describes GDPval as a key step toward measuring economically relevant job tasks, then notes its limitations: tasks can be overly specified compared to messy real workplace requests.
Evals are getting harder: long-horizon agents, tool use, and forecasting vs. waiting
Static automated benchmarks struggle when models can do days or weeks of work via agents and tool calls. Tejal describes a shift toward observing production usage and building scaling-law forecasts to estimate long-horizon performance without waiting for every run to complete.
Multimodal evaluation: voice, vision, and the GPT-4o safety-driven delay
Multimodal models introduce new evaluation problems: real-time voice, image prompts, and video realism require new infrastructure and safety testing. Tejal describes how OpenAI delayed the public GPT-4o voice launch to build tests and mitigations for persuasion/propaganda risks ahead of elections.
Science at the frontier: from Olympiad problems to wet-lab optimization
Tejal outlines a progression of science evals: Frontier Science Olympiad (hard short-form problems), Frontier Science Research (completing unfinished theses), and a wet-lab collaboration with Ginkgo Bioworks where models optimized protein synthesis protocols measured by real robot experiments. Results surprised the team as models beat a strong human baseline and improved cost-per-yield.
How OpenAI tracks frontier progress: the internal “AGI Index” and open-sourced evals
Tejal explains OpenAI’s internal “AGI Index,” a weighted basket of evaluations (capabilities, safety, alignment) inspired by CPI-style indices. She also lists several open-sourced/public evals designed to track real progress, and notes that models quickly surpassed even OpenAI’s own research interview questions.
Quality problems in public benchmarks: broken tasks, memorization, and reward hacking
Tejal describes how public benchmarks can contain errors, underspecified tasks, or “broken” items—motivating Verified variants and stricter QC. She also distinguishes memorization (knowing answers from training data) from true skill, and highlights the need for robust eval design to prevent cheating or harness exploits.
What AI means for work: task automation now, job-level autonomy later
Tejal argues most models today excel at tasks rather than full jobs, because jobs require navigating ambiguity, planning, and coordination. She expects increasing autonomy over delegation and specification, and urges people to “dogfood” models frequently because progress is rapid. She closes with examples of how automation could accelerate sectors like clinical trials, paired with a need for responsible transition management.