OpenAI
Why Tejal Patwardhan stopped underestimating the models - Episode 21
- 0:00 – 0:24
Intro
- AMAndrew Mayne
Hello, I'm Andrew Mayne, and welcome to the OpenAI Podcast. On today's episode, we're talking to the research lead, Tejal Patwardhan, about the need to build frontier evals as old benchmarks get saturated.
- TPTejal Patwardhan
Generally bad. Benchmaxxing is bad. How can we make these models useful for people in their real work? We were really nervous because we were like, "This human baseline's kind of hard. We don't know if the model's going to beat it." But we should never underestimate the model.
- 0:24 – 3:10
Growing up at OpenAI
- AMAndrew Mayne
Tejal, I have a question. How did you end up where you were? What brought you into OpenAI?
- TPTejal Patwardhan
Oh, I thought we weren't gonna start with this o- [laughs]
- AMAndrew Mayne
Tejal, I have a question for you. What would you like to start with?
- TPTejal Patwardhan
Um, can we start with, like, tell us, like, what you did when you started at OpenAI, and then we can, like, work, work backwards.
- AMAndrew Mayne
Okay. Don't you wanna talk about your early days? No?
- TPTejal Patwardhan
No. I, I grew up at OpenAI. It's just, like, not-
- AMAndrew Mayne
Okay. [laughs] Um, tell me a bit about your journey here working inside artificial intelligence inside OpenAI.
- TPTejal Patwardhan
So I joined OpenAI in fall '23, and it was right after ChatGPT had come out, GPT-4 was out-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... and OpenAI had started its Superalignment team, and I, uh, joined for the Preparedness team that was getting started as we were starting to get, look at how capable these models were becoming and think about, you know, what would the next generation of models look like. And at the time it was extremely exciting because, um, right after I joined was when some of the early results for the reasoning models had started to pick up, and we were thinking about, you know, if these models really take off, what will the future of capabilities look like, and how can we pre- be prepared, um, for that future? And so we did a whole bunch of work on, like, threat modeling and, like, what evals should we be running. How do we think about releasing a model like this? It was a very exciting time to join.
- AMAndrew Mayne
What got you interested in this area?
- TPTejal Patwardhan
Yeah. Well, to me, evals are really exciting because they're a way to sort of measure and understand what our models can do and see progress, you know, sort of before it tends to happen. Like, there's this term called capability overhang, which is this idea that the models will be capable of things long before people actually adopt them and use them for those capabilities. Like, there, you know, there might be cultural or legal or regulatory barriers towards using a capability even before it's ready. And so being someone who can, like, help develop and measure our models via evals, it helps you really understand what this technology can do and sort of see the future before it happens-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... which is very, um, interesting. And I also think it's important because it can help sort of ready the world for what's happening.
- AMAndrew Mayne
Mm-hmm.
- TPTejal Patwardhan
Like, part of, um, when I originally started here, part of why I was really excited to work on some of the preparedness evals was because I thought these models were getting very capable, and it felt like a lot of my friends, like, in my real life didn't really understand how s- powerful these models would soon become.
- AMAndrew Mayne
Mm-hmm.
- TPTejal Patwardhan
Because they'd look at, you know, a ChatGPT output and be like, "Yeah, it's hallucinating," and, like-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... "It's kind of not that smart and kind of reads like AI slop." And it's like, well, that's now, but, like, the question is the slope. Like, if the slope is very high, then, you know, change might be happening much faster than one would expect. And so I think one of the greatest services that we can do is sort of measure and share with the world what progress looks like, um, especially because there's often this capability overhang before people really understand and feel that, um, in the models themselves. Um, so that's part of why I think all of this is very important.
- 3:10 – 6:28
Why reasoning changed everything
- AMAndrew Mayne
Reasoning was such an exciting moment, and for most of the world, that didn't happen until, you know, a year later that they found out about this. But what was that like for you to all of a sudden understand that if you gave the models a longer time to think about things, you got better results even though the size hadn't gotten bigger?
- TPTejal Patwardhan
That was a really fun time. I mean, um, so in some of the early experiments, which I- we've talked about now, it's like the model is trained really just on math, and I remember there was this set of experiments where Nat McAleese was like, "Hey, the model is trained on math. But if you eval it on GPQA," which was this benchmark with, like, biology and chemistry and physics problems, "the model is doing really well, like how this is very interesting, and smarter models are much smarter." And he had put together this forecast that at the time it, it, it said that if, you know, progress kept going, within six months we'd have human level performance on science from just training on math. And we were like, "Oh my gosh. That's crazy." And, uh, at the time this was extremely locked down. It was like we kind of found our way to, like, curl to be able to see some model outputs, and we were like, "Wow, this is, like, one of the smartest things, like, I've ever seen. Like, I've never seen a model reason like this before." It was just like if this, if this becomes a paradigm that continues to scale, but then we just looked back and we were like, you know, um, GPQA was like, you know, PhD level biology, chemistry, and physics, and we were like, "Ah, that's, what is that? We really need professional level." And we just, like, keep, kept changing the stakes of what counted. But yeah, it was very cool.
- AMAndrew Mayne
I remember early on when AP Bio was just, that was the benchmark to try to see if the model could do that. But what's interesting as you brought this up is that a lot of stuff that comes out from OpenAI is math-focused.
- TPTejal Patwardhan
Math has been useful because it's more objectively verifiable in some ways. So some of the earlier problems that we trained on, it was just easier to do RL and scale up the reasoning paradigm on math. Um, and, and so, and math is also useful in various ways. You know, it's, like, one of the core t- you know, types of science. But also in many ways it's just happened by coincidence to be a thing that we focused on, but it's not necessarily the end product of what we even want to focus on in research. Like, we're now realizing, okay, if we can do this for math, can we scale this up for other types of science, for professional work, for, you know, for capabilities that are useful to humans on a personal level. Um, and so I think math is more like the proof point versus, like, the end goal.
- AMAndrew Mayne
But it does seem, like you said, though, that if something is able to think for a long time, break something down into steps and think through them as you have to do for really complex mathematical problems, it does just carry over.
- TPTejal Patwardhan
Well, this is a big debate. [laughs]
- AMAndrew Mayne
Hmm.
- TPTejal Patwardhan
So, like, uh, some of it definitely carries over. Like, the general idea of reasoning can be useful, but then also there could be some domain specific skills or tools or types of reasoning that you would need in different domains. Like, for example, for coding, you need to be able to actually write and execute code and test code if you want to scale up a coding agent.
- AMAndrew Mayne
Mm-hmm.
- TPTejal Patwardhan
And so something we've thought about a lot in terms of both evals and then also training is how do we make sure we also give the model the skills and tools and affordances that it would need to reason in that particular domain? And some of the benefits of math will translate, and then also you might need some domain specific-
- AMAndrew Mayne
Right
- TPTejal Patwardhan
... scaffolding to really pull out its full abilities. Like, kind of, you know, like a general high school or liberal arts education and then, like, a specialized education.
- AMAndrew Mayne
Reasoning models were just a very interesting moment because I think it changed a lot of the ways we thought about what was possible even with just a certain amount of compute if you let a model think longer and you gave the model the opportunity to just, just come up with more complex answers to this.
- 6:28 – 11:20
What made o1 surprising
- AMAndrew Mayne
Were there any interesting things that happened with o1 that surprised you?
- TPTejal Patwardhan
So the o1 release process was very exciting for- We were sort of thinking about the reasoning paradigm for a very long time, and, um, there were people that were worried about making sure we, we didn't release it too soon just because it felt like a paradigm shift-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... like, possibly the thing that got us to AGI. Like I, I said at the beginning, we thought we had AGI in six months when, like, some of the early runs were happening. Um, and so there was this question of, okay, how do we put this out responsibly? How do we test this technology? And, um, during the initial launch review for o1, we, during some of our cybersecurity tests, the model, it was, like, one of the first examples of the model, like, breaking out of the sandbox. We published about this.
- AMAndrew Mayne
Mm-hmm.
- TPTejal Patwardhan
Um, where it was supposed to be in this Docker container doing this capture the flag, and the model found this, like, security vulnerability in, like, how we had implemented, um, the, the capture the flag scenario and it broke out, and we were all like, "Oh, no. [laughs] What else has the model done if it did this?" Um, and it was kind of a feel the AGI moment, one of many. I feel like ever since then there have been many other such moments where the model has done something really surprising or intelligent or novel that we wouldn't, we didn't even think of when we were doing the tests, and then you would come back and look at the transcripts and results and be like, "Wow, these guys are- they're clever. They're clever." And then it was just very important that we published, um, and made sure the world knew, like, the models can do this sort of thing. Yeah.
- AMAndrew Mayne
There was this period right before o1 was announced, a lot of people were like, "Well, it looks like we've hit the wall. It's been a few months since anything's happened." Then o1 came out and they're like, "What's a wall?"
- TPTejal Patwardhan
Hitting the wall is just so not the right way to think about. Yeah, I, I get very frustrated when I see posts like that because I'm like, man, if you look at... I feel like I've been looking at this model improvement and this progress for a long time, and it just keeps getting better.
- AMAndrew Mayne
Yeah.
- TPTejal Patwardhan
Like, it just keeps getting better, and if I look at our research roadmap now, I see no signs of stopping. Like, things are-
- AMAndrew Mayne
Wow
- TPTejal Patwardhan
... just gonna keep getting better. This is gonna be a really crazy year. A lot of really cool, um, research is going to come out, and I think this is probably true across the whole industry. So yeah, if anything, people are really under, they really under-expect from the models.
- AMAndrew Mayne
It, it seems like sometimes, though, that they're, OpenAI releases a lot. They tell people where things were headed and say that this looks interesting. Sometimes people forget this or you get rumors of stuff like Q*.
- TPTejal Patwardhan
Q*, man. [laughs] Very interesting. But no, people, people don't realize. Like, I don't know, I feel like we try to be very open and say like-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... "Hey, guys, here are some plots. Like the lines are going up."
- AMAndrew Mayne
Mm-hmm.
- TPTejal Patwardhan
"Things are really capable." I think maybe there's this few, there's, like, this, um, like, meme that, oh, the researchers, they, they don't understand. They, like, the models are only good at math and research-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... but not good at things in the real world, but I just don't think that's true. I think, um, people from even other occupations that have transitioned into OpenAI, like, are starting to see our models are picking up at all sorts of things. And, uh, I know it's like it, it might seem like the researchers are trying to overhype the model or something. But if anything, I think we're under-hyping the power of them.
- AMAndrew Mayne
You, you brought up AGI. If, if I brought GPT-4 back from, you know, March 2023 back into, let's say, you know, 2020, I think people would've called it that, and now we have this much more different idea of this. People talk to AI every day, have long conversations with it, things like no one talks about the Turing test anymore, as when nobody really understood what he was trying to explain, you know. But now we're, we're well past that period. Is there the eval for AGI?
- TPTejal Patwardhan
Yeah. I mean, the models passed the Turing test and no one talked about it.
- AMAndrew Mayne
Yeah.
- TPTejal Patwardhan
It's kind of crazy. Um, yeah, like I think models can, are pretty much indistinguishable, um, from humans in, in many, many situations. Um, in terms of the test for AGI, I mean, I think if a model can do, like, there's the classic most economically-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... valuable work, and I think people are increasingly using the model for large parts of their work.
- AMAndrew Mayne
Mm-hmm.
- TPTejal Patwardhan
And, um, I think there'll be, like, a big spectrum and debate of, like, when exactly this happened, but gosh, I certainly feel like Codex does a lot of work for me.
- AMAndrew Mayne
Yeah.
- TPTejal Patwardhan
Um, and I feel very lucky to have unlimited tokens, you know, so that's-
- 11:20 – 14:45
Why old benchmarks stopped working
- AMAndrew Mayne
stuff. How have these been evolving?
- TPTejal Patwardhan
It used to be that, you know, even the academic benchmarks, so to speak, our models couldn't pass, like, you know, classic tests that someone would take in high school or college-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... or sort of more multiple choice types of questions. And as the models got smarter, we had to make things more and more realistic. So one of the first benchmarks that we put out more publicly was this benchmark called SWE-bench Verified, which was like testing how well the model could, you know, interact in real code bases in Python, like-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... Django and like, you know, complete PRs and that sort of thing. Um, and like pass unit tests. And then those became even more advanced where we were like, okay, can the model take, you know, multi-step actions on like some complex environment, take actions on the computer, like, um, take actions that link up to the real world with like some of our wet labs and biology work? So I think over time, as the models keep getting better, we have to be more ambitious with like how long horizon-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... and how realistic our measurements are. And doing that is very fun because you have to like sort of stay ahead of the pace of progress.
- AMAndrew Mayne
So two terms I want you to unpack. Uh, when we talk about benchmarks, you often hear benchmaxxing.
- TPTejal Patwardhan
Yeah. Benchmaxxing is, I would say, this idea that you, uh, if, if, uh, someone training a model was just trying to look good on some-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... evaluation or benchmark and not actually making the model generally useful, and I would say that's generally not super helpful because you want the model-
- AMAndrew Mayne
Mm
- TPTejal Patwardhan
... to be good at the real thing that the user might want to do, and you don't just care about it looking good in some like marketing copy because like when a user uses it, they'll be like, "Hey, this is like not quite-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... what I signed up for." Um, and so generally bad. Benchmaxxing is bad.
- AMAndrew Mayne
Yeah. And I think the, the way that I've heard it explained kind of makes sense is that you have X amount of compute budget, time, how much you're gonna spend on it, and you can spend a large part of that making the model just overall very good, or I can say I'm gonna spend 90% of it so my evals are gonna look really good when I release it. And sometimes we've seen people just go literally use those evals for it. It comes out like, oh, that looks like a great model, and then you find out, oh, it's only good at that.
- TPTejal Patwardhan
Yeah. That's not a great experience for the user. So we've-
- AMAndrew Mayne
[laughs]
- TPTejal Patwardhan
I think something that the OpenAI research program has done quite well is try to be very disciplined about making sure we are investing in general model improvements on the areas that really matter, and then, you know, you'll run some evals at the end for comparison. Um, but the goal should not be, oh, we just wanna look good on an eval. We want to make a model that's useful to push forward the frontier of science or push forward the frontier of work or something like this. Um, and I think Jakub has done a really good job also, like enforcing throughout the research org.
- AMAndrew Mayne
Mm-hmm.
- TPTejal Patwardhan
Like, we should be really scientific and honest, and that's included, you know, we've published results where our models were not the best before. We just wanna publish-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... the reality and make sure that we are painting a very accurate picture of what our models can do and then aim to make them useful in the real world as much as we can.
- AMAndrew Mayne
You mentioned the software engineering bench as a, one of the metrics that's maybe not as useful now, and we hear the term saturated. Explain what it means when a benchmark's saturated.
- TPTejal Patwardhan
Saturated is when, um, a model is close to passing all of the questions correctly, like getting close to 100% on the test.
- AMAndrew Mayne
Mm-hmm.
- TPTejal Patwardhan
Um, and once a benchmark is saturated, it's not super useful because you can't really tell models apart with that test. It's like comparing two geniuses on like a high school math exam.
- AMAndrew Mayne
Mm-hmm.
- TPTejal Patwardhan
Like, they might just both pass, but that's not very useful as you're trying to separate really, really smart, um, pieces of intelligence. So the challenge is always to make more and more difficult, realistic, unsaturated benchmarks that you can then measure models against over time and forecast sort of where progress is going.
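As an illustrative aside, the saturation point above can be sketched numerically. This is a minimal sketch, not anything from OpenAI's tooling: the helper names `accuracy_stderr` and `can_distinguish`, the question count, and all scores are made up, and the check is a rough normal-approximation comparison rather than a proper significance test.

```python
import math


def accuracy_stderr(p: float, n: int) -> float:
    """Standard error of an accuracy estimate p measured on n questions."""
    return math.sqrt(p * (1.0 - p) / n)


def can_distinguish(acc_a: float, acc_b: float, n: int, z: float = 1.96) -> bool:
    """Rough check: is the accuracy gap larger than z combined standard errors?

    Hypothetical helper for illustration only; assumes both models were
    scored independently on benchmarks of the same size n.
    """
    combined = math.sqrt(
        accuracy_stderr(acc_a, n) ** 2 + accuracy_stderr(acc_b, n) ** 2
    )
    return abs(acc_a - acc_b) > z * combined


if __name__ == "__main__":
    # Unsaturated benchmark: plenty of headroom, so the gap is visible.
    print(can_distinguish(0.40, 0.55, n=500))
    # Saturated benchmark: both models sit near the ceiling, gap is within noise.
    print(can_distinguish(0.97, 0.98, n=500))
```

With these invented numbers, the first comparison separates the two models and the second does not, which is the "two geniuses on a high school math exam" situation.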
- 14:45 – 17:35
What makes a good benchmark
- AMAndrew Mayne
How do you do that now? How do you figure out what a good benchmark's gonna be?
- TPTejal Patwardhan
Yeah, I mean, the best benchmarks I think are really realistic and measure something people actually care about.
- AMAndrew Mayne
Mm-hmm.
- TPTejal Patwardhan
So one of our first forays towards doing this, which, you know, it's been a while now, but um, that we published was called GDPval. Like, I was really excited that, about the idea of having a measurement for how the models could interact with the real world, and we were really having this crisis of evals where we kept training successively better models, and on SWE-bench they looked about the same because-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... they were just doing really well, and like we were reaching the top of what that benchmark could measure. And we were like, man, we have no idea how to measure what people actually want to use our models for. And so there was very much a, hey, like the Bureau of Labor Statistics has a list of all the top jobs and like all the top tasks per job, and if you're a financial analyst, like doing an investment diligence or writing a legal memo or, um, you know, writing a, a paper based on a piece of research or something like this. And the idea was can we actually ask the model those tasks that someone would want in real life with the context they would have at the time, and then see how the model could solve those tasks? And at the time when we tested one of the earliest models on this benchmark, it got like, you know, less than 20%. Like, um-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... if you compare how well a model would do on this well-specified work task compared to a human, like the model was way worse. But I'm like really proud of the org for being like, "Actually, you know what? We should publish this new way to sort of measure and forecast progress on real world economic impacts," and it's been like very useful to a lot of economists. And also our models now are the best.
- AMAndrew Mayne
Mm-hmm.
- TPTejal Patwardhan
Um, and it's like very cool because I think at the time, um, we were like not really in- investing in real world work in some of our training programs, um, and weren't even measuring or tracking it. And I think now there's a lot more focus on how can we make these models useful for people in their real work, like for real scientists, and this kind of helps catalyze a wake-up call that, hey, maybe we should also think about how to measure how stuff is used in the real world. So that was pretty cool. But now we're like, okay, this benchmark's probably too easy 'cause it's extremely well specified. Like, each of the prompts is, you know, hundreds of words of, "I want you to go to this spreadsheet and make this change and do this thing, and then take that calculation and put it in a memo." It's like very detailed. And I think the next step is how do we give the model as much ambiguity as you would give a report in the real world? Like, you know, if a manager asks like, "Hey, can you run this analysis for me?" They should go figure out what to do, put that together, run the analysis, and give you an output. And so I think, um, we've been working a lot on like more realistic ways to measure real work in the real world, whether that's in like science, for personal use, or even for enterprise.
- AMAndrew Mayne
There seems to be something to the idea of instead of hiding a benchmark, putting it out there because internally as an org you go like, "Okay, this can't stand."
- TPTejal Patwardhan
Yeah. It's, it really motivates research also. I think people want to know the truth and they want to know where we can be better and, um, deliver a better model for our users. And so knowing the gaps is quite useful.
- 17:35 – 22:09
Why evals are getting harder
- AMAndrew Mayne
What do you think the current limitations are right now with the ways that we're doing evals?
- TPTejal Patwardhan
I think the types of work that we're doing now with, with Codex and with our latest reasoning models like 55, it's just such a different level of, uh, capability than what we had even six months ago where, um, a static benchmark just doesn't measure the long hur-
- AMAndrew Mayne
Mm
- TPTejal Patwardhan
... like the, the nature of how long you can get work out of these things. Like, these models can work for days or weeks for you, and like internally in research we've had the models just like run for really long periods of time to do work. And one of the problems with an automated eval is you kind of need it to run within some amount of time and get results to be able to look at them, and, uh, a lot of the ways that we're measuring models now also just include looking at production usage and looking at real world use by people and seeing what they're using it for and, um, what types of tasks they're able to get done because the time horizon of how much work is done by the model is just getting so much longer.
- AMAndrew Mayne
It was interesting watching, for instance, long context. There was kind of this early race for companies to say that, "Hey, our model's gonna take, you know, you know, 100,000 tokens, a million tokens," whatever. But there wasn't a lot of evaluation on how well that was, and then we got needle in the haystack, which is a method of seeing if it could find a word or whatever, and I think that people sort of assumed that that was a solved problem, but it wasn't. It was just the benchmarks weren't really good, and then we had to have better benchmarks, and is that what kind of made it better was finally people could, one, spend more attention solving that problem when they understood where it was failing?
- TPTejal Patwardhan
Yeah, we definitely have better benchmarks for this sort of thing now, and then also sometimes these problems reveal gaps in how we're thinking about training. So one example is we used to think, oh, what matters is just how much context you can stuff into the model at test time, when now it seems that you can just dump a bunch of files in a container and the model can kind of grep around and search for what it needs-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... and when. And like this ability to have search or tools to figure out what context you should use can be more efficient than just stuffing everything in the context, and we wouldn't have really realized that without trying that out-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... and then seeing how that performed on various benchmarks. So, um, I think that makes it, this like makes the model a lot more useful because for example, now the model can like search over a whole repo and like find-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... the files that you need and like understand the context of where you're making changes. And the same is true for many work contexts where, you know, folks in Codex can now like upload their local file system and like, you know, you might have made PowerPoints before or sent Slacks that are, um, relevant to the work that you're doing now, and the model can sort of search over that context with tool calls. Um, and so we're not as limited by how much you can literally stuff into context 'cause the model can search.
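As an illustrative aside, the contrast between stuffing everything into context and letting the model search can be sketched in a few lines. This is a toy sketch under loud assumptions: `search_files` and `build_context` are hypothetical names, the file contents are invented, and plain term counting stands in for whatever search tools a real agent would call.

```python
def search_files(files: dict[str, str], query: str, top_k: int = 2) -> list[str]:
    """Return up to top_k file names ranked by how often the query terms appear.

    Hypothetical stand-in for a search tool call; real agents would use
    grep-like or index-based search rather than naive substring counts.
    """
    terms = query.lower().split()
    scored = []
    for name, text in files.items():
        lowered = text.lower()
        score = sum(lowered.count(term) for term in terms)
        if score > 0:
            scored.append((score, name))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


def build_context(files: dict[str, str], query: str, top_k: int = 2) -> str:
    """Assemble a prompt context from only the files the search surfaced."""
    selected = search_files(files, query, top_k)
    return "\n\n".join(f"# {name}\n{files[name]}" for name in selected)


if __name__ == "__main__":
    repo = {
        "billing.py": "def charge(invoice): return invoice.total",
        "auth.py": "def login(user): ...",
        "notes.md": "invoice totals are rounded; see charge",
    }
    print(build_context(repo, "invoice charge"))
```

The design point is only that the context handed to the model scales with what the search returns, not with the size of the whole file system.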
- AMAndrew Mayne
Do you have any favorite evals?
- TPTejal Patwardhan
My favorite eval? I mean, GDPval is my favorite public eval.
- AMAndrew Mayne
Okay.
- TPTejal Patwardhan
But I, I have many internal evals.
- AMAndrew Mayne
[laughs] All right.
- TPTejal Patwardhan
But I will say the name of one of them. It's called Houdini Bench, and I cannot explain further.
- AMAndrew Mayne
Oh my God. You know I'm m- was a magician, right? So-
- TPTejal Patwardhan
No.
- AMAndrew Mayne
Yeah. Yeah, yeah. Yeah, it was-
- TPTejal Patwardhan
Maybe. I don't know if you'd pass Houdini Bench. [laughs]
- AMAndrew Mayne
No, I'd probably not pass Houdini Bench. That was actually one of the things that I played around with some of the early vision models and stuff, was, was using stuff, photographs of stuff of magic tricks and stuff and seeing this.
- TPTejal Patwardhan
That's very cool. Yeah, multimodal brings a whole new element.
- AMAndrew Mayne
Mm-hmm.
- TPTejal Patwardhan
Um, like, uh, I remember when 4o had first come out, there was a group of, there was a group of us that was, was sitting on the roof of this building that our minds were just so blown by the idea of a real-time voice model, and then we were like, how do we even eval this thing, right? Because the whole paradigm of doing things in text and code and on your computer-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... is just completely blown away if there's like a, a voice interaction in real time. Something that was really interesting about that launch is, and we said this publicly at the time, is we actually delayed the public launch by six weeks as we were figuring out how to make sure the model was safe.
- AMAndrew Mayne
This was 4o?
- 22:09 – 24:48
Measuring voice and vision models
- AMAndrew Mayne
ways. And so it seems like that's really, where do you even begin trying to figure out how you're gonna measure that?
- TPTejal Patwardhan
Yeah, I mean, oh, it's just a lot of work. Um-
- AMAndrew Mayne
Yeah
- TPTejal Patwardhan
... usually for, for any of these we start with what would humans do in this case. So like, you know, you would like have a set of inputs that you put into the model and a set of outputs you would evaluate, and then you can like build up, okay, can we like automate some of these? Can we build a new platform to measure this sort of thing at scale and sort of, um, move from there? But for some of the natively multimodal, it's just like you have to like rip apart a bunch of your infra and make-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... make stuff work. Like this was also true with Sora for, you know, we were interested in making sure the videos weren't overly realistic or could be used for the wrong thing, and that required like, especially from safety, building up a whole new stack of evals and mitigations, like including refusals at the model level, monitoring, um, when this was being used in prod. Um, and yeah, it requires a whole new stack of thinking.
- AMAndrew Mayne
Yeah. Well, that, that's the thing too is that when you start to think about, okay, how do you prioritize one eval over another? When do you decide that this isn't a, or do you just sort of go, look, this one's saturated, we move on. And 'cause there is, even though you may not be trying to optimize towards certain public benchmarks, you still have to figure out like what we're, what, what's important to us now. Like there was a time when OpenAI was leading in code, and then there was a time when it wasn't.
- TPTejal Patwardhan
Now?
- AMAndrew Mayne
Now there is a time it is. [laughs] But there was a dark period where that happened and-
- TPTejal Patwardhan
Yeah, we try not to be- get distracted by public benchmarks too much-
- AMAndrew Mayne
Yeah
- TPTejal Patwardhan
... because it can be kind of noisy. I think the, um... Internally, we have this thing called AGI Index, which is inspired by the idea of like CPI or inflation where you have like some weighted basket of goods-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... and you're tracking the price of those goods. Um, for the same thing for us, it's we have like this basket of evals that include measurements across all of the core areas we're interested in.
- AMAndrew Mayne
Mm-hmm.
- TPTejal Patwardhan
That, that can include alignment, it can include safety, it can include capabilities, and just sort of what you want from your model, and we just iterate. We like, uh, keep updating that index to represent more and more sort of the difficult version of what we want our models to do, and we sort of track that index internally and try not to be distracted by, um, you know, trying to benchmax some public benchmark or something like that.
- AMAndrew Mayne
Mm-hmm.
- TPTejal Patwardhan
It's more having a blend of evals across different domains that we care about across science or work, and then also safety and alignment and making sure we keep making progress on that sort of weighted basket. Um, try to stay focused.
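As an illustrative aside, the CPI-style "weighted basket" idea can be sketched as a weighted average over per-domain eval scores. This is a minimal sketch, not the actual AGI Index: the function name `weighted_index`, the domain names, the weights, and every score below are invented for illustration.

```python
def weighted_index(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-domain scores, normalized by total weight.

    Hypothetical helper: assumes every score is on the same 0-to-1 scale
    and that each weighted domain has a score.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"no score for domains: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[d] * scores[d] for d in weights) / total_weight


if __name__ == "__main__":
    # Invented basket: domains and weights are placeholders.
    basket = {"science": 0.3, "work": 0.3, "safety": 0.2, "alignment": 0.2}
    older_model = {"science": 0.5, "work": 0.4, "safety": 0.8, "alignment": 0.7}
    newer_model = {"science": 0.6, "work": 0.5, "safety": 0.8, "alignment": 0.8}
    print(weighted_index(older_model, basket))
    print(weighted_index(newer_model, basket))
```

Updating the index, in this framing, means swapping harder evals into the basket or changing the weights, then tracking the single number across model generations.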
- AMAndrew Mayne
We, we've watched this evolution of these evals. We've watched the evolution of the models, and I've talked to people here working in the sciences, like people who are active in the sci- not just researchers who like science or like computer science, but people who are in biology, mathematics.
- 24:48 – 33:23
Testing models on real science
- AMAndrew Mayne
Can you tell me what's going on with the evals in the scientific frontier? 'Cause we're at this point now where it seems like we're gonna see meaningful results.
- TPTejal Patwardhan
Yeah. I think the, the work in some of our science evals is some of our most exciting. So in the past few months, there's a few tiers of evals that we've made public. So the first tier was this eval called Frontier Science Olympiad, which was kind of, uh, the equivalent to, to the Math Olympiad style evals that we had before where we were measuring how well the models could do on like, um, high school Olympiad style problems in biology, chemistry, and physics, and they were sort of shorter answer but still quite hard, and the models weren't very good yet. And then the next phase we did was Frontier Science Research, which is also public and people can run this, um, which measured how well models could help complete sort of unfinished biology, chemistry, and physics theses. So we had people who were PhDs or professors in these fields that had some text that was not published-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... like maybe part of their thesis, um, and just turned that into an evaluation where the model was given maybe some input data or some initial starting point, and it had to sort of see how it'd fill out the rest of that paper and judge against a rubric for how well it did. And, you know, that was starting to measure like, okay, are the models starting to do research? Like are, are they using tools, this sort of thing. And then one of the final iterations of this was to see how well the model could do in the real world in a wet lab. And so we worked with this company called Ginkgo Bioworks that has a bunch of really cool automated wet lab robots where the model had to optimize this protocol for protein synthesis, and the idea was the model would, um, generate a protocol, and then they would actually automatically test it in the wet lab where they would like put in the reagents the model suggested and then see what protein yield they got. Um, and this was for a protein that's like sort of related to this, um, ovarian cancer drug or it's like a sort of a toy scenario for that. And the model like, we were really nervous at first because we were like, "This human baseline's kind of hard. We don't know if the model's going to beat it." But we should never underestimate the models because, you know, it just, the, the curve is pretty, pretty clear. Just every cycle got better and better, beat the human baseline, and then set, set the state-of-the-art on how, um, efficiently the model could cost per yield generate this protein. And I think that's just the start of how if we give these models optimization problems, like, you know, go try to figure out how inexpensive you can make this vaccine or, you know, generate, synthesize this protein that's important for a drug, the model can just go and keep optimizing these protocols with real world inputs, and it was one of our first time de- de-risking an eval that's actually connected to the real world. 
Like we weren't waiting for a piece of code to run. We were waiting for the robot to finish the experiment so we could record, um, how much protein was synthesized. And yeah, I just think the models are gonna do so much science for us.
- AMAndrew Mayne
Mm.
- TPTejal Patwardhan
It's gonna be really interesting.
- AMAndrew Mayne
Well, that was exciting 'cause that was just like, I think, GPT-5, and it hadn't gone through any sort of, "Here's how to be a scientist," and now these models have progressed a lot since then. They have a lot more real world experience with this.
- TPTejal Patwardhan
Yeah, that wasn't even with one of our best models.
- AMAndrew Mayne
Yeah.
- TPTejal Patwardhan
It was like just an early reasoning model. Um, and so I think, yeah, all of these things stack. Like we'll have better pre-training, we have better RL and post-training, and we're going to get a lot better at using these models at test time to really elicit their capabilities. And I think the next generation of evals is really about how can we have these models take actions in the real world and solve sort of unsolved problems for us that would take humans a long time.
- AMAndrew Mayne
Mm.
- TPTejal Patwardhan
You know, some of these scientific problems that we haven't been able to put enough effort against, it's like, well, now we have all of these agents that can spend compute to solve problems for us and try to steer them towards what would be useful.
- AMAndrew Mayne
It, it does seem like that brings in a new challenge, though. Do you think that evals are going to get a lot more complex?
- TPTejal Patwardhan
Yeah. I mean, we have this saying on our team that pain is the moat. I really think-
- AMAndrew Mayne
[laughs]
- TPTejal Patwardhan
... a lot of operations in the physical world will become part of the bottlenecks in being able to measure what the models can do because e- even just starting with digital, there's so much more scaffolding and infrastructure work we need to do to run these. Like now if we wanna test how well Codex does, it's like, well, the model is calling APIs. It's like taking actions on your computer and in your browser. It's making artifacts for you. It's writing and running and executing that code. Um, it's just so much more complex to measure that model, and that's only digital. Now, if you want them to measure how the model could interact with the physical world, there's all sorts of ops and logistics that you need to have a really smooth process for to see how you can deploy these things at scale. And, um, yeah, I think a lot of the work is actually shifting from being like theory or math or even programming.
- AMAndrew Mayne
Mm-hmm.
- TPTejal Patwardhan
Like I feel like people don't program that much. They just ask Codex, and more shifting towards like planning, operations, physical stuff. Or at least, at least my job has shifted-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... a lot that way. Um, and those things are very hard. It's actually kind of easy to just, like, write something, like, in a corner. Um, it's a lot harder when you have to manage all of these operations and logistics.
- AMAndrew Mayne
It's exciting, but it seems like part of the challenge is these aren't just simple evals anymore. They take more compute, they take more time. When you're trying to do a long-horizon eval, you know, it's long. You have to wait a long time to get the outcome on that.
- TPTejal Patwardhan
Yeah, defin- So it's both a lot more work to come up with the evals and run them at scale, and also if the, you know, the work takes a longer amount of time, we don't get the signal as fast. So we have to invest more in scaling laws where we can predict, okay, well, if by one day the model looks like this, then we can forecast that, uh, at seven days it would look like this.
- AMAndrew Mayne
Mm.
- TPTejal Patwardhan
And sort of come up with trends that we can, so that we can get signal faster. Otherwise, we're just, like, stuck there waiting for a week to get an update, which is not the most productive way to spend time.
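A rough sketch of the kind of early-signal forecasting described above: fit a trend to scores observed in the first day of a long-horizon run, then extrapolate to seven days. The log-linear form, the function names, and the data points are all assumptions made for illustration; the actual scaling-law methodology is not described in the conversation.

```python
import math


def fit_log_linear(hours, scores):
    """Least-squares fit of score = intercept + slope * ln(hours)."""
    xs = [math.log(h) for h in hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct time points")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def forecast(intercept, slope, hours):
    """Extrapolate the fitted trend to a later time horizon."""
    return intercept + slope * math.log(hours)


# Made-up observations from the first day of a hypothetical run.
observed_hours = [1, 6, 12, 24]
observed_scores = [0.2 + 0.1 * math.log(h) for h in observed_hours]

intercept, slope = fit_log_linear(observed_hours, observed_scores)
week_estimate = forecast(intercept, slope, 7 * 24)
print(f"forecast at 7 days: {week_estimate:.3f}")
```

An extrapolation like this is only as good as the assumed curve shape, so in practice it would need to be checked against runs that were allowed to finish.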
- AMAndrew Mayne
I have certain benchmarks and things I use to test every time a new model comes out to find out how it's personally useful to me. And it's one of the things I tell people who run businesses or other things, is think about your own evals, things that will tell you where something is. Because sometimes people might try something, uh, like, th- they might try ChatGPT six months ago and go like, "Eh, it wasn't good. It didn't do this," and they don't realize how fast things move. Do you have any advice for people on how to figure out how to come up with a benchmark?
- TPTejal Patwardhan
Yeah. I mean, things move really fast. Things change every couple of weeks, and I feel like people are not as awake about... Um, in, in my job, I'm one of the first people in the world to see some of the most powerful models, so I'm extremely AGI-pilled, and I think progress is happening a lot faster.
- AMAndrew Mayne
What have you seen?
- TPTejal Patwardhan
What have I seen? [laughs]
- AMAndrew Mayne
[laughs]
- TPTejal Patwardhan
I've seen good models, man.
- 33:23 – 40:47
How OpenAI tracks frontier progress
- TPTejal Patwardhan
then you'll be mind blown.
- AMAndrew Mayne
Let's talk about, uh, frontier evals.
- TPTejal Patwardhan
Yeah. So the goal of the frontier evals team is really to measure and forecast progress of the frontier models at OpenAI-
- AMAndrew Mayne
Yeah
- TPTejal Patwardhan
... to better understand where we are, where we're going, and sort of try to share that with the world. And one of the things I think the team has tried to do is to help publish and open source as much that we can. So you know, some evals that we've helped open source include, like, SWE-bench Verified, which helped measure progress on coding. MLE-bench, which was a way to measure how well models could train other models-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... and sort of track the progress of machine learning engineering skills in our models. Um, PaperBench, which was a way to measure how well models could replicate real top machine learning papers, um, from like ICML or ICLR. Um, and GDPval, which, you know, helped measure how well models could perform on real world tasks across, you know, over 40 occupations. And the goal for all of these has been, you know, the models might not seem good now, but if you just plot how they increase with each, you know, the, the results improve with each model generation. Often when people say, like, "Oh, well, I expect this will take, like, a year or whatever," they, like, over-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... they over, um, e- expect in terms of how much time it will take to saturate a benchmark. And, like, even at my own or people on my team's predictions are often, like, not ambitious enough for how fast things will change. And so I just think we're trying to do our service in helping inform the world about, um, what is possible. I think some of these research acceleration evals in particular are quite, um, interesting. Like, when we first started, we had this eval called the OpenAI Research Interview Eval, which was just taking the researcher questions that we asked people applying to OpenAI-
- AMAndrew Mayne
Yeah
- TPTejal Patwardhan
... and putting those in an eval. And the model blasted through that, like, pretty, pretty quickly. It's, like, definitely can pass our interviews right now. Um, which I think has caused a whole other slew of downstream questions on, like, how do we make sure people don't cheat on the interviews?
- AMAndrew Mayne
Mm.
- TPTejal Patwardhan
And, like, how do we actually measure research talent? Um- But I think all of this is very useful because, um, measuring internal progress is k- it's like kind of a way to measure the lever by which the models will keep getting better faster, like sort of the acceleration of the slope of improvement, so to speak. And, um, yeah, I think having ways to measure model progress is, is just good information.
- AMAndrew Mayne
I've heard that in some of the evals that were out there for a while, that it turned out that there were actually errors in the questions, that that was an issue with some of the evals. That that was some of the publicly available ones were actually you couldn't score above a certain level, and if you did, it was actually because you were training on the data, and people looked at that and found out like, "Oh, there's actually, this is not the right answer."
- TPTejal Patwardhan
Yeah. This is a problem with a lot of public benchmarks, I think. Like, so the original reason for SWE-bench Verified was because we wanted to run SWE-bench, and it was half the problems were, like, either broken or underspecified.
- AMAndrew Mayne
Mm-hmm.
- TPTejal Patwardhan
And f- you know, people in the industry were publishing results on this as some metric of how well you did, and we were like, "Well, we should at least try to fix it, and then, like, share that so we can have a better yardstick." Um, but I think one of the reasons that public benchmarks maybe aren't always as, you know, uh, battle tested as we'd like is that not, they, they tend to be like, you know, someone in, in a lab, like an academic lab, like had a good idea and like wanted to write a paper, but they never had to run that eval at scale in like-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... production, training run or a production e- like level eval sweep for a launch. And just when you run some of this stuff at scale, it like breaks or falls over, and you like catch all of these bugs. Um, and so I kind of think sitting in a lab and being closer to product is a forcing function for making sure the quality-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... of your measurements is really high because, like we're not doing this to like look good in a paper. We're like doing this, like it has to work because it has to work for our systems at scale. So it kind of forces the quality to be high.
- AMAndrew Mayne
And it, it seems like kind of one of the things that can happen is these models become incredibly capable. Sometimes they're very good at, sometimes they can solve a problem, but they'll take sort of the laziest path and kind of m- they can, they can give you the memorized answer instead of solving it, and we saw that with like counting and like how many words are in a, how many letters in a character, in a word or whatever. And it was often the model, if you prompt it right, it would get the answer right, but if you didn't prompt it the right way, it would just sort of throw you an answer.
- TPTejal Patwardhan
Yeah. That brings up all sorts of interesting, um, concepts. I mean, so there's o- this one concept of memorization, which is the idea that the model literally knows the answer-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... and doesn't have to really think or reason to solve. It's just like regurgitating-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... something it already knows, and that makes the measurement not super useful because you're just measuring whether you happen to have train o- trained on that data a ton-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... versus whether the model learned the skill that you, or tool or capability you were trying to measure. So that's one way to avoid that, is to try to be really clean and disciplined about your data, not including any benchmarks or any evals that you want to measure, and that helps solve sort of the first problem that you laid out. Um, so, so that, that's one thing. And then there's, there's this other thing where like the model can kind of like reward hack or sometimes like cheat-
- AMAndrew Mayne
Mm-hmm
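One common way to approach the data discipline mentioned above is to screen training documents for word n-gram overlap with benchmark items. The sketch below is a simplified illustration with hypothetical function names and toy strings; it is not a description of the pipeline used at OpenAI.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in a piece of text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(document, benchmark_items, n=8):
    """Flag a document that shares any word n-gram with a benchmark item."""
    doc_grams = ngrams(document, n)
    return any(doc_grams & ngrams(item, n) for item in benchmark_items)


# Toy data, invented for illustration.
benchmark = ["what is the capital of france"]
leaked_doc = "trivia: What is the capital of France ? answer paris"
clean_doc = "the weather is nice today"

print(is_contaminated(leaked_doc, benchmark, n=3))
print(is_contaminated(clean_doc, benchmark, n=3))
```

Exact n-gram matching misses paraphrased or translated copies of a benchmark, so a filter like this catches only the most direct form of memorization.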
- 40:47 – 44:22
What AI means for work
- AMAndrew Mayne
How are you tracking this? How do you look for areas where you think this is gonna have an impact?
- TPTejal Patwardhan
Yeah. These are very difficult questions. Um, I think that, uh, our, I think people are not calibrated to how much work our models will be able to do, um, and how quickly-
- AMAndrew Mayne
Mm-hmm
- TPTejal Patwardhan
... like across a, a wide variety of jobs. And, um, right now, the models are still mostly just good at tasks versus a job.
- AMAndrew Mayne
Mm-hmm.
- TPTejal Patwardhan
Like there's a lot to a job than a task, right? Like you have to figure out what you want to work on, navigate like ambiguity. Like you might have coworkers that you're collaborating with and like communicating with, and then you might like figure out what task you wanna do and then give that to a model, and that's kind of the phase we're at now, where it's a lot of, I mean, even in my job, the mo- the model is like doing individual tasks for me, but I'm still doing a lot of the thinking and planning, um, and that sort of thing. And I think people aren't even calibrated to that.
- AMAndrew Mayne
Mm-hmm.
- TPTejal Patwardhan
Like I feel like people in software and research are a lot more calibrated or, by calibrated, I mean like- ... realize how capable the models are, um, compared to some of my friends in other industries. And I, like, wish people just tried the models more and saw, because the people who try and see first, like, they'll start to really get it. Um, but I also think the models are going to start to be able to do the stuff, like the delegating part at some point too, um, maybe not too far from now. The, um, figuring out what to work on, fig- navigating ambiguity, like writing the spec that the model then executes on. And people should really start to think about, okay, what is... what happens in the maximally AGI-pilled world where even just for digital work, the model can come up with what to do, do it, execute it on it, like interact with the real world. Like, you know, if it's, you know, there's entire businesses that now, like, you see like stories of, like, unicorns that where it was, like, mostly AI and a few employees that were, like, able to drive all of this value. Um, and so I do think there's this question of, you know, are we realizing how big this might be?
- AMAndrew Mayne
Well, personally, I think the opportunity space is getting bigger. Everybody I know, the most, the most AGI-pilled people I know, the people who are using tools like Codex all the time are doing way more now. They're more productive now because they don't have to do the tasks and the jobs. As the AI gets better at handling certain jobs, they're like, "Cool, there are five jobs I need done now 'cause I can do more." And I think that we just think about the, the, the light cone of the potential where we can be is bigger than we can imagine, and I think these tools just help us get there faster, not narrow it.
- TPTejal Patwardhan
I think that it's probably some mix of things.
- AMAndrew Mayne
Yeah.
- TPTejal Patwardhan
Even if you have models that can speed up paperwork, like think about like, like a clinical trial for a drug, right?
- AMAndrew Mayne
Mm-hmm.
- TPTejal Patwardhan
It's like you s- people spend months putting together, together all this paperwork, like hundreds of pages of like why they should be able to do the trial, and they like submit it to the FDA, and then there's like a 35% chance it got rejected because they like made a mistake or forgot something.
- AMAndrew Mayne
Mm-hmm.
- TPTejal Patwardhan
They revise, then finally you can do the trial. And you know, these processes are, are good, but it just takes a long time. And then the trial is, you know, you have a case and a control or whatever, and you're like documenting symptoms, um, and tracking these for, like just documenting what happens for a long time, and then doing a bunch of data analysis. Like, a lot of this is just documentation or data analysis or sort of like very classically digital work. And I think if models can help accelerate all parts of this, you know, for health, for energy, manufacturing, policy research, education, like this will be very accelerative.
- AMAndrew Mayne
Mm-hmm.
- TPTejal Patwardhan
We will have hopefully, you know, faster, cheaper, better goods and that's really good for people. It's like very good for the individual consumer. So I think that is like something people should be excited about. But we should be very thoughtful about how to navigate the transition to that world, um, in a way that's thoughtful and like, um, responsible.
- AMAndrew Mayne
Excellent. Thank you, Tejal.
- TPTejal Patwardhan
Thank you for having me.
Episode duration: 44:22
Transcript of episode CFqjjKp9Y-Q