
No Priors Ep. 113 | With OpenAI's Eric Mitchell and Brandon McKinzie

This week on No Priors, Elad and Sarah sit down with Eric Mitchell and Brandon McKinzie, two of the minds behind OpenAI's o3 model. They discuss what makes o3 unique, including its focus on reasoning, the role of reinforcement learning, and how tool use enables more powerful interactions. The conversation explores the unification of model capabilities, what the next generation of human-AI interfaces could look like, and how models will continue to advance in the years ahead.

Sign up for new podcasts every week. Email feedback to show@no-priors.com

Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @mckbrando | @ericmitchellai

Show Notes:
0:00 What is o3?
3:21 Reinforcement learning in o3
4:44 Unification of models
8:56 Why tool use helps test time scaling
11:10 Deep research
16:00 Future ways to interact with models
22:03 General purpose vs specialized models
25:30 Simulating AI interacting with the world
29:36 How will models advance?

Sarah Guo (host), Brandon McKinzie (guest), Eric Mitchell (guest), Elad Gil (host)
May 1, 2025 · 38m

EVERY SPOKEN WORD

  1. 0:00–3:21

    What is o3?

    1. SG

      (instrumental music) Hi, listeners, and welcome back to No Priors. Today, I'm speaking with Brandon McKinzie and Eric Mitchell, two of the minds behind OpenAI's o3 model. o3 is the latest in OpenAI's line of reasoning models: super powerful, with the ability to figure out which tools to use and then use them across multi-step tasks. We'll talk about how it was made, what's next, and how to reason about reasoning. Brandon and Eric, welcome to No Priors.

    2. BM

      Thanks for having us.

    3. EM

      Yeah, thanks for having us.

    4. EG

      Do you mind walking us through o3? What's different about it, what was the breakthrough in terms of its focus on reasoning, adding memory, and other things, versus just a core foundation model, an LLM?

    5. EM

      So o3 is our most recent model in this o-series line of models that are focused on thinking carefully before they respond, and these models are, in some vaguely general sense, smarter than models that don't think before they respond. Similarly to humans, it's easier to be more accurate if you think before you respond. The thing that's really exciting about o3 is that not only is it smarter in an apples-to-apples comparison with our previous o-series models, better at giving you correct answers to math problems or factual questions about the world, and that's true and great, and we'll continue to train models that are smarter, but it's also very cool because it uses a lot of tools that enhance its ability to do things that are useful for you. You can train a model that's really smart, but if it can't browse the web and get up-to-date information, there's a limit on how much useful stuff that model can do for you. If the model can't actually write and execute code, there's a limit on the sorts of things an LLM can do efficiently, whereas a relatively simple Python program can solve a particular problem very easily. So not only is the model, on its own, smarter than our previous o-series models, it's also able to use all these tools that further enhance its abilities, whether that's doing research on something where you want up-to-date information, or doing data analysis for you, or doing the analysis and then reviewing the results and adjusting course as it sees fit, instead of you having to be so prescriptive about each step along the way.
      The model's able to take high-level requests, like "do some due diligence on this company, maybe run some reasonable forecasting models on such-and-such, and then write a summary for me," and infer a reasonable set of actions on its own. So it gives you a higher-level interface to doing some of these more complicated tasks.

    6. EG

      That makes sense. So it sounds like there are basically a few different changes from your core GPT models: now you have something that takes a pause to think, so at inference time there's more compute happening, and it can also do sequential steps, because it can infer what those steps are and then go act on them. How did you build or train this differently from a core foundation model, from when you all did GPT-3.5 and 4 and the various models that have come over time? What's different in terms of how you actually construct one of these?

    7. BM

      I guess the short answer

  2. 3:21–4:44

    Reinforcement learning in o3

    1. BM

      is reinforcement learning; that's the biggest one. So rather than just having to predict the next token in some large pre-training corpus drawn from essentially everywhere, we now have a more focused goal: the model solving very difficult tasks and taking as long as it needs to figure out the answers to those problems. Something that's kind of magical for me, as a user experience, is that in the past, for our reasoning models, we've talked a lot about test-time scaling, and for a lot of problems, without tools, test-time scaling might occasionally work, but at some point the model's just kind of ranting in its internal chain of thought, especially for some visual perception problems. It knows it's not able to see the thing it needs, and it just kind of loses its mind and goes insane. I think tool use is now a really important component of continuing this test-time scaling, and you can feel this when you're talking to o3. At least, my impression when I first started using it was that the longer it thinks, the more I really get the impression I'm going to get a better result, and you can watch it do really intuitive things. It's a very different experience to be able to trust that, as you're waiting, it's worth the wait and you're going to get a better result because of it, and the model's not just off doing some totally irrelevant thing.
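
The test-time scaling idea described here can be illustrated with a deliberately toy sketch: treat "thinking longer" as drawing more samples from a noisy answerer and taking a majority vote. This is a stand-in illustration only; the `noisy_answer` model, the probabilities, and the numbers are invented for the example and have nothing to do with how o3 actually works.

```python
import random

random.seed(0)

def noisy_answer(correct=42, p_correct=0.6):
    """Toy 'model': returns the right answer with probability p_correct,
    otherwise a random wrong-ish guess."""
    return correct if random.random() < p_correct else random.randint(0, 100)

def majority_vote(k):
    """Spend more 'test-time compute' by sampling k answers and
    taking the most common one."""
    votes = [noisy_answer() for _ in range(k)]
    return max(set(votes), key=votes.count)

def accuracy(k, trials=2000):
    return sum(majority_vote(k) == 42 for _ in range(trials)) / trials

acc_1, acc_15 = accuracy(1), accuracy(15)
print(acc_1, acc_15)  # accuracy climbs as we "think longer"
```

Even this crude setup shows the shape of the curve the hosts mention later: more inference-time samples, higher accuracy, until the noisy answerer's own limits bind.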

    2. EG

      That's cool. I think in your original post about

  3. 4:44–8:56

    Unification of models

    1. EG

      this too, you all had a graph that showed how long the model thought versus the accuracy of the result, and it was a really nice relationship. So clearly thinking more deeply about something really matters. In the long run, do you think there's just going to be a split, a bifurcation, between models that are fast, cheap, and efficient and get certain basic tasks done, and another model where you upload a legal M&A folder and it takes a day to think, and it's slow and expensive, but it produces output that would take a team of people a month to produce? Or how do you think about where all this is evolving, or where it's heading?

    2. EM

      You know, I think for us, unification of our models is something Sam has talked about publicly. We have this big crazy model switcher in ChatGPT and there are a lot of choices, and we have a model that might be good at any particular thing a user might want to do, but that's not that helpful if it's not easy for the user to figure out which model to use for that task. So making this experience more intuitive is definitely valuable, and something we're interested in doing, and that applies to this question of: are we going to have two models that people pick between, or a zillion models that people pick between, or do we put that decision inside the model? I think everyone is going to try stuff and figure out what works well for the problems they're interested in and the users they have. But that question of how you make that decision as effective, accurate, and intuitive as possible is definitely top of mind.

    3. SG

      Is there a reason, from a research perspective, to combine reasoning with pre-training or try to, um, have more control of this? Because if you just think about it from the product perspective of like the end consumer dealing with ChatGPT, like, you know, we won't get into the, the naming nonsense here, but they don't care. They want, like, the right answer and the amount of intelligence required to get there in as little time as possible, right?

    4. BM

      The ideal situation is that it's intuitive how long you should have to wait: as long as it takes for the model to give you a correct answer. And I hope we can get to a place where our models have a more precise understanding of their own level of uncertainty, because if they already know the answer, they should just tell you, and if it takes them a day to actually figure it out, then they should take a day. But you should always have a sense that it takes exactly as long as it needs to for that current model's intelligence. I feel like we're on the right path for that.

    5. SG

      Yeah. I wonder if there isn't a bifurcation, though, between the end-user product and the developer product, right? Because there are lots of companies that use the APIs to all these different models for very specific tasks, and on some of them they might even use open-source models with really cheap inference, with stuff they control more.

    6. BM

      I hope you could just tell the model, "Hey, this is an (laughs) API use case, and you really can't be over there thinking for ten minutes. We've got to get an answer to the user." (laughs) It'd be great if the models could get more steerable like that as well.

    7. EM

      Yeah, I think it's just a general steerability question. At the end of the day, if the model's smart, you should be able to specify the context of your problem and the model should do the right thing. There are going to be some limitations, because figuring out, given your situation, what the right thing to do is might require thinking in and of itself. So it's not that you can obviously do this perfectly, but pushing all the right parts of this into the model, to make things easier for the user, seems like a very good goal.

    8. SG

      Can I go back to something else you said? So, the first guest

  4. 8:56–11:10

    Why tool use helps test time scaling

    1. SG

      we ever had on the podcast was actually Noam Brown. Um-

    2. EM

      Oh, nice.

    3. SG

      Uh... So-

    4. EM

      I've heard of him.

    5. SG

      You know, two-

    6. BM

      Hi, Noam.

    7. SG

      ... two plus years ago. Yes. Hi, Noam. It'd be great to get some intuition from you guys for why tool use helps test-time scaling work so much better.

    8. BM

      I can give maybe very concrete cases on the visual reasoning side of things. There are a lot of cases where, back to the model being able to estimate its own uncertainty, you'll give it some kind of question about an image and the model will very transparently tell you what it's thinking: "I don't know. I can't really see the thing you're talking about very well." It almost knows that its vision is not very good. And what's kind of magical is that when you give it access to a tool, it goes, "Okay, well, I've got to figure something out. Let's see if I can manipulate the image, or crop around here, or something like this." What that means is that it's a much more productive use of tokens, and so your test-time scaling slope goes from something shallow to something much steeper. We've seen exactly that: the test-time scaling slopes without tool use and with tool use, for visual reasoning specifically, are very noticeably different.
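
The crop-and-zoom behavior described here can be mimicked with a toy loop: when "confidence" about a large image is low, call a crop tool and retry on the smaller patch. Everything in this sketch, `find_digit`, `crop`, the confidence heuristic, is a hypothetical illustration, not OpenAI's tooling.

```python
# Toy "image": an 8x8 grid of brightness values with one small detail (a 7)
# that a full-frame glance can't resolve.

def crop(image, top, left, size):
    """The 'tool call': return a sub-grid of the image."""
    return [row[left:left + size] for row in image[top:top + size]]

def find_digit(image, threshold=0.8):
    """Toy perception whose confidence falls as the patch gets bigger."""
    area = len(image) * len(image[0])
    confidence = min(1.0, 9 / area)  # only small patches are 'legible'
    if confidence >= threshold:
        return max(v for row in image for v in row)  # read it off directly
    # Low confidence: zoom into the most salient quadrant and retry.
    half = len(image) // 2
    quadrants = [(0, 0), (0, half), (half, 0), (half, half)]
    top, left = max(quadrants,
                    key=lambda q: sum(map(sum, crop(image, q[0], q[1], half))))
    return find_digit(crop(image, top, left, half))

image = [[0] * 8 for _ in range(8)]
image[5][6] = 7
print(find_digit(image))  # zooms in twice, then reads the 7
```

The point of the toy: each crop is a cheap tool call that shrinks uncertainty, so the tokens spent afterward are productive, instead of the model "ranting" about a patch it cannot see.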

    9. EM

      Yeah, and I would say the same for writing code. There are a lot of things an LLM could try to figure out on its own that would require a lot of attempts and self-verification, but that you could write a very simple program to do in a verifiable and much faster way. Say I want the model to do some research on this company and use this type of valuation model to tell me what the valuation should be. You could have the model try to crank through that and fit those coefficients, or whatever, in its context, or you could literally just have it write the code to do it the right way and know what the actual answer is. So I think part of this is that you can allocate compute a lot more efficiently, because you can defer the stuff the model doesn't have a comparative advantage at to a tool that's really well suited to doing that thing.
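
The valuation example amounts to: don't fit coefficients token by token in context; emit a few lines of code that compute them exactly. A minimal sketch, with an invented revenue series and closed-form least squares (the data and the forecasting setup are made up for illustration):

```python
def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Invented revenue series for the example (year, revenue).
years = [1, 2, 3, 4]
revenue = [10.0, 12.0, 14.0, 16.0]

a, b = fit_line(years, revenue)
forecast_year5 = a * 5 + b
print(a, b, forecast_year5)  # 2.0 8.0 18.0: exact, no in-context arithmetic
```

A model "cranking through" this in its chain of thought would burn many tokens on arithmetic it can get wrong; the program is verifiable and exact.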

    10. EG

      One of the ways I've been using some form of o3 a lot is deep research, right? I think,

  5. 11:10–16:00

    Deep research

    1. EG

      that's basically a research-analyst AI that you all have built: it will go out, look things up on the web, synthesize information, and chart things for you. It's pretty amazing in terms of its capability set. Did you have to do anything special, any specific reinforcement learning for it to be better at that, or other things you built for it? How did you think about the data used for training it? I'm curious how that product, if at all, is a branch off of this, and how you thought about building it as part of this broader effort.

    2. EM

      I think when we think about tool use, browsing is one of the most natural starting points. And it's not always easy: the initial browsing we included in GPT-4 a few years back was hard to make work in a way that felt reliable and useful. But these days... last year (laughs), you know, two years ago is ancient history. It feels like a natural place to start because it's so widely applicable to so many types of queries: anything that requires up-to-date information should benefit from browsing. So as a test bed for "does the way we're doing RL really work? Can we really get the model to learn longer-time-horizon, meaningful, extended behaviors?", it feels like a natural place to start, and it's also fairly likely to be useful in a relatively short amount of time. So: let's try that. In RL, at the end of the day, you're defining an objective, and if you have an idea of who's going to find this most useful, you might want to tailor the objective to who you expect to be using the thing and what you expect they're going to want. What's their tolerance? Do they want to sit through a 30-minute roll-out of deep research? When they ask for a report, do they want a page, or five pages, or a gazillion pages? So yeah, you definitely want to tailor things to who you think is going to be using it.

    3. EG

      I feel like there's a lot of white-collar knowledge work that you all are really capturing through this sort of tooling, and you mentioned software engineering as one potential area. Deep research and analytical jobs are another, where there's all sorts of really interesting work to be done that's super helpful in augmenting what people are doing. Are there two or three other areas that you think are the most near-term interesting applications for this, whether OpenAI is doing it or others should? I'm curious how you think about the big application areas for this sort of technology.

    4. BM

      I guess my very biased one that I'm excited about is coding, and also research in general: being able to improve the velocity at which we can do research at OpenAI, and others can when they're using our tools. I think our models are getting a lot better very quickly at being actually useful, and it seems like they're reaching some kind of inflection point where they're useful enough to want to reach out to and use multiple times a day, for me at least, which wasn't the case before. They were always a little bit behind what I wanted them to be, especially when it comes to navigating and using our internal code base, which is not simple. It's amazing to see our more recent models actually spending a lot of time trying to understand the questions we ask them, and coming back with things that save me many hours of my own time.

    5. EG

      People say that's the fastest potential bootstrap, right? Each model helping to make the next model better, faster, cheaper, et cetera, so people often argue that's almost an inflection point on the exponent toward superintelligence: this ability to use AI to build the next version of AI.

    6. BM

      Yeah. And there are so many different components of research, too. It's not just sitting off in the ivory tower thinking about things: there's hardware, there are various components of training and evaluation and so on, and each of these can be turned into some kind of task that can be optimized and iterated over. So there's plenty of room to squeeze out improvements.

    7. SG

      We talked about browsing the web, writing code, arguably the greatest tool

  6. 16:00–22:03

    Future ways to interact with models

    1. SG

      of all, right? Especially if you're trying to figure out how to spend your compute: write more efficient code, generate images, write text. But there are certainly trajectories of action that I think are not in there yet, right? Like reliably using a sequence of business software.

    2. BM

      I'm really excited about the computer use stuff. It kind of drives me crazy, in some sense, that our models are not already just on my computer all day, watching what I'm doing. I know that can be creepy for some people, and I think you should be able to opt out of that, or have it opted out by default. I hate typing, too; I wish I could just be working on something on my computer, hit some issue, and just ask, "Well, what am I supposed to do with this?" I think there's tons of space to improve how we interact with the models, and this goes back to them being able to use tools in a more intuitive way, using tools closer to how we use them. It's also surprising to me how intuitively our models do use the tools we give them access to. It's weirdly human-like, but I guess that's not too surprising given the data they've seen.

    3. SG

      I think a lot of things are weirdly human-like. My intuition for why tool use is so impactful for test-time scaling, why the combination is so much better: take any role. When you're trying to make progress on a task, you can decide whether to get external validation or to sit and think really hard, right? And usually one is more efficient than the other, and it's not always "sit in a vacuum and think really hard with what you know."

    4. BM

      Yeah, absolutely, yeah.

    5. EM

      You can seek out new inputs; it doesn't have to be this closed system anymore. And I do feel like the closed-system-ness of the models is still a limitation in some ways. I think it'd be great if the model could control my computer, for sure, but there's a reason we don't go hog wild and say, "Here are the keys to the kingdom, have at it." There are still asymmetric costs between the time you can save and the types of errors you can make, so we're trying to iteratively deploy these things, try them out, and figure out where they're reliable and where they're not. Because if you did just let the model control your computer, it could do some cool stuff, I have no doubt, but do I trust it to respond to all of the random emails Brandon sends me? Actually, maybe that task doesn't require that much intelligence, but...

    6. SG

      (laughs)

    7. EM

      You know, more generally, like-

    8. BM

      That's true, that's true.

    9. EM

      ... do I trust it to do everything I'm doing? Some things, and I'm sure that set of things will be bigger tomorrow than it was yesterday. But I think part of this is that we limit the affordances and keep it a little bit in a sandbox, out of caution, so that you don't send some crazy email to your boss, or delete all your texts, or delete your hard drive or something.

    10. SG

      Is there some sort of organizing mental model for the tasks one can do with increasing intelligence, test-time scaling, and improved tool use? Because I look at this and think: okay, you have complexity of task, and you have time scale. Then you have the ability to come up with these RL rewards and environments, right? Then you have usefulness. And of course you have some intuition about diversity and generalization across the different things you could be doing. So it seems like a very large space, and scaling RL, new-generation RL, is just not obvious; to me it's not obvious how you do it or how you choose the path. Is there some organizing framework that you guys have that you can share?

    11. EM

      I don't know if there's one organizing framework, but there are a few factors I think about in the very, very grand scheme of things. One is: in order to solve this task, how much uncertainty in the environment do I have to wrestle with? For some things, like "who was the first president of the United States?", there's zero environment I need to interact with to reach the answer correctly; I just need to remember the answer and say it. If you want me to write some code that solves some problem, now I have to deal with a bit of stuff that isn't purely internal to the model: I need to execute the code, and that code execution environment is maybe more complicated than my model can memorize internally. So I have to write the code, execute it, make sure it does what I thought it did, test it, and then give it to the user. The more of that sort of stuff outside the model there is, where you can't just recall the answer and give it to the user, where you have to test something, run an experiment in the world, and wait for the result, and the more uncertain the results of those experiments, the harder the task. In some sense, that's one of the core attributes of what makes tasks hard. And I think another is how simulatable they are. Stuff that's really bottlenecked by time, like the physical world, is just harder than stuff that we can simulate really well.
      It's not a coincidence that so many people are interested in coding and coding agents, and that robotics is hard and slower. I used to work on robotics, and it's frustrating in a lot of ways. So I think both of these, how much of the external environment you have to deal with, and how much you have to wrestle with the unavoidable slowness of the real world, are two dimensions that I sort of think about.
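
The write/execute/verify loop sketched in this answer can be written as a tiny propose-execute-check cycle. `propose` here is a toy stand-in for a model, with a deliberately planted off-by-one bug that environment feedback corrects; none of these names come from any real API.

```python
def propose(task, feedback):
    """Toy 'model': its first attempt has an off-by-one bug; with feedback
    from the environment it emits the corrected version."""
    if feedback is None:
        return "result = sum(range(task))"       # misses the endpoint
    return "result = sum(range(task + 1))"

def execute(code, task):
    """The external environment: actually run the proposed code."""
    env = {"task": task}
    exec(code, env)
    return env["result"]

def solve(task, target, max_steps=3):
    """Write -> execute -> verify -> revise, as in the loop described above."""
    feedback = None
    for _ in range(max_steps):
        candidate = execute(propose(task, feedback), task)
        if candidate == target:                  # verification step
            return candidate
        feedback = f"expected {target}, got {candidate}"
    return None

print(solve(10, target=55))  # first attempt returns 45; the retry gets 55
```

The uncertainty Eric describes lives in `execute`: the model cannot fully predict what the environment will return, so it has to run the experiment and fold the result back into its next attempt.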

    12. EG

      It's super interesting, because if you look historically at some of these models, one of the things that I think has continued to be really impressive is the degree to which they're generalizable. And so I think when GitHub

  7. 22:03–25:30

    General purpose vs specialized models

    1. EG

      Copilot launched, it was on Codex, which was a specialized code model, and eventually that just got subsumed into these more general-purpose models in terms of what a lot of people actually use for coding-related applications. How do you think about that in the context of things like robotics? There are probably a dozen different robotics foundation model companies now. Do you think that eventually just merges into the work you're doing, into big general-purpose models that can do all sorts of things? Or do you think there's a lot of room for these standalone other types of models over time?

    2. BM

      I will say, the one thing that's always struck me as kind of funny about us doing RL is that we don't yet do it on the most canonical RL task: robotics. I personally don't see any reason why this couldn't be the same model. There are certain challenges, like, do you want your RL model to be able to generate an hour-long movie for you natively, as opposed to via a tool call? That's where it's probably tricky; you have more conflict from having everything in the same set of weights. But certainly the things you see o3 already doing, like exploring a picture, are early signs of something like an agent exploring an external environment. So it doesn't sound too far-fetched to me.

    3. EM

      Yeah, I think the thing that came up earlier, the intelligence-per-cost thing... The real world is an interesting litmus test, because at the end of the day there's a frame rate in the real world you need to live at. It doesn't matter if you get the right answer after thinking for two minutes; the ball is coming at you now and you have to catch it. Gravity's not going to wait for you. That's an extra constraint that we get to at least softly ignore when we're talking about these purely disembodied things.

    4. EG

      It's kind of interesting, though, because really small brains are very good at that, you know? You look at a frog, or you start looking at different organisms and their relative compute.

    5. EM

      Yeah.

    6. EG

      And very simple systems are very good at that. Ants, you know? (laughs) So I think that's kind of a fascinating question: what's the baseline amount of capability actually needed for some of these real-world tasks that are reasonably responsive in nature?

    7. BM

      It's really tricky with vision, too. Our models have some maybe famous edge cases where they don't do the right thing. I think Eric probably knows where I'm going with this. I don't know if you've ever asked our models to tell you what time it is on a clock. They really like the time 10:10.

    8. EM

      It's my favorite time too, so that's usually what I tell people. (laughs)

    9. BM

      It's over 90% or something like that of all clocks on the internet that are at 10:10, because I guess it looks like a happy face, and it looks nice. But anyway, what I'm getting at is that our visual system was developed by interacting with the external world, having to be good at navigating things and avoiding predators. Our models have learned vision in a very different way. I think we'll see a lot of really interesting things if we can get them to close the loop, reducing their uncertainty by taking actions in the real world, as opposed to just thinking about stuff.

    10. SG

      Hey Eric, you brought up, um, the idea of like how, uh, what in the environment can be simulated, right, as a,

  8. 25:30–29:36

    Simulating AI interacting with the world

    1. SG

      as an input to how difficult it will be to improve on this. As you get to long-running tasks, take software engineering: there's a lot of interaction that isn't just me committing code continually. I'm going to talk to other people about the project, in which case you need to deal with the question of whether you can reasonably simulate, in an environment, how other people will interact with you on the project. That seems really tricky, right? I'm not saying that o3, or whatever set of foundation models exists now, doesn't have the intelligence to respond reasonably, but how do you think about that simulation being true to life, true to the real world, as you involve human beings in an environment, in theory?

    2. BM

      My spicy, I guess, take on, on that is like... I don't know if it's spicy, but O3 in some sense is already kind of simulating what it'd be like for a single person to do something with like a browser or something like that and, I don't know, train two of them, uh, together, uh, so that you'd have, you know, you have two people interacting with each other. Um, and yeah, there's no reason you can't scale this up so that models are, are trained to be really good at cooperating with each other. I mean, there's a lot of already exi- existing literature on multi-agent RL and, uh, yeah, th- if, if, if you want the model to be good at something like collaborating with a bunch of people, like, maybe a not too bad starting point is making it good with collaborating with other models.
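The self-play idea Brandon describes can be sketched as a toy loop: two copies of a model alternate turns on a shared transcript. Everything here (`generate`, `collaborate`) is a hypothetical stand-in, not an OpenAI API; `generate` just echoes so the loop is runnable.

```python
# Toy sketch of the multi-agent idea above: two agents (two copies of
# a model) take alternating turns on a shared transcript, which is the
# basic shape of training models to cooperate with each other.

def generate(agent, transcript):
    # Hypothetical stand-in for a model call conditioned on the
    # shared transcript; a real system would sample from a model here.
    return f"{agent}: reply to '{transcript[-1]}'"

def collaborate(task, n_rounds=4):
    # Seed the transcript with the task, then alternate speakers.
    transcript = [task]
    for i in range(n_rounds):
        agent = "agent_a" if i % 2 == 0 else "agent_b"
        transcript.append(generate(agent, transcript))
    return transcript
```

In a real multi-agent RL setup the shared transcript would feed a reward signal back to both agents; this sketch only shows the turn-taking structure.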

    3. EM

      Man, someone should do that.

    4. BM

      Yeah. Yeah. We should really start thinking about that, Eric.

    5. SG

      I think it is a l- I think it is a little bit spicy because yes, the work is going on. It is interesting to hear you think that is a useful direction. Uh, I think lots of people would still like to believe, "Not me. Like, my comment was extra good on this pull request," or whatever it is, right? Um, and, and-

    6. BM

      Okay. I can sy- I can sympathize with that. Sometimes I see our models training and I'm like, "Ugh, what are you doing?" You know? Like, uh, "You're, you're taking forever to figure this out." And I actually think it'd be really fun if you could actually train models in an interactive way, n- uh, you know, forget about just like a test time. But it... I think it'd be really neat to train them to d- to do something like that, uh, be able to like intervene, uh, when it makes sense. And yeah, just more, more me being able to tell the model to, you know, cut it out, uh, in like in the middle of its kind of chain of thought and, uh, i- it being able to learn from that on the fly I think would be great.

    7. EM

      Yeah. I do think this is like the intersection of these two things where it's both, uh, like an, uh, a point of contact with the external environment that is like can be very high uncertainty. Like, humans can be very unpredictable, um, in some cases, and it's sort of limited by the tick of time in the real world if you wanna like, you know, deal with actual humans. Like, humans have a fixed, you know, clock cycle, um, uh, you know, m- in their, in their head. Um, so yeah, I mean, th- this is... If you, you know, if you wanna like do this in the literal sense, it's hard. And so, you know, scaling it up and, and, you know, making it work well is, is, you know, it's not obvious how to do this. Uh-

    8. BM

      Yeah. We are a super expensive tool call. You know, if you're a model you can either ask me, you know, meat bag over here to, uh, you know, he- help with something and I'll try to think really slowly. In the meantime, it could have like used a browser and read like 100 papers on the topic and something like that. So it's, uh, yeah, how do you model the, the trade-off there?

    9. EM

      But the human part's important. I mean, I think in any research project, like, my interactions with Brandon are the hardest part of the project, you know? Like, writing the code is... That's the easy part.

    10. SG

      Well, and there's, there's some analog from, um, self-driving. Uh, I was gonna say the, you know, hanging out with me every week is the hardest part of doing this podcast, but-

    11. EM

      It's my favorite part.

    12. BM

      Look at how healthy their relationship is, Eric. We need to learn from this.

    13. EM

      No, we're honest. It's okay. We gotta work through it. Okay. (laughs)

    14. BM

      Okay. (laughs)

    15. SG

      In self-driving one of the like classically hard things to do was like predict the human and the child and the dog, like, agents in the environment versus, um, uh, like what the environment was. Um, and, and so, uh, I, I think there's like some analogy to be drawn there. Um, going back to just like how you progress the O series of models from here, is it, uh, is it a reasonable like assessment that some people have,

  9. 29:36–38:10

    How will models advance?

    1. SG

      uh, that the capabilities of the models are likely to advance in a spikier way because you're relying to some degree more on the creativity of research teams in like making these environments and deciding, you know, how to create these, um, evals versus like we're scaling up on existing data sets in pre-training? Is that a fair contrast?

    2. BM

      S- spikier or like a... What's the plot here? What's the like the x axis and the y? Like what-

    3. EM

      Like domain is the x axis and y is capability?

    4. SG

      Yes, because y- you're like choosing what domains you are really creating this RL loop in.

    5. EM

      I mean, I think this is a very reasonable, uh, hypothesis to, um, to hold. I think there is some like counter-evidence that I think should, you know, be factored into people's intuitions. Like, you know, Sam tweeted an example of some creative writing from one of our models that, um, I think was... I, I'm not an expert and I'm not gonna say this is like, you know, publishable or like groundbreaking, but, um, I think it probably updated some people's intuitions on like what, you know, you can train a model to do really well. And so, I think there is some structural reasons why you'll have some spikiness just because like as an organization, you have to decide like, "Hey, we're gonna prioritize, you know, X, Y, Z stuff." And like as the models get better, the surface area of stuff you could do with them grows faster than, you know, you can potentially, like say, "Hey, this is the niche, you know, we're gonna carve out. We're gonna try to do this really well." So like there... I think there's some reason for spikiness, but I think some people will probably go too far with this in saying like, "Oh, yes. These models will only be really good at math and code." And like not, you know... Like, everything else is like you can't g- get better at them. And I, I think that is probably not the right intuition to have.

    6. BM

      Yeah, and I think probably all like, uh, major AI labs right now have some partitioning between let's just define a bunch of data distributions we want our models to be good at and then just like throw data at them. And then another set of people in the s- same companies is prob- are probably thinking about how can you kind of lift all boats, uh, at once with some like algorithmic change. And, uh, I- I think, yeah, we definitely have, uh, th- both of t- those types of efforts at, at OpenAI. And, um, I think especially at the... on the data side, like there are going to naturally be things that we have a lot more data of than, than others. And, uh, but ideally, yeah, we, we have plenty of efforts that will not be so reliant on the exact like subset of data we did RL on and it'll generalize better.

    7. SG

      I get pitched every week, and I bet Elad does too, uh, a company that wants to generate data for the labs in some way. And, um, or it's, you know, access to human experts or whatever it is, but like, you know, there's, there's infinite variations of this. Um, uh, if you could wave a magic wand and have like a perfect set of data, like what would it be that you know would advance model quality today?

    8. EM

      This is a dodge, but like uncontaminated evals. Um, always super valuable, and that's data. Um, and I mean, yeah, like you want, you know, good data to train on and that's of course valuable for making the model better, but I think it is often neglected how important it is to also have high-quality data, which is like a different definition of high quality when it comes to an eval. Um, but yeah, the eval side is like often just as important because you don't... You need to measure stuff. And like as you know, from, you know, trying to hire people or whatever, like evaluating the capabilities of like a general like capable agent (laughs) is really hard (laughs) to do in like a rigorous, you know, way. So yeah. I think evals are a little underappreciated.

    9. BM

      Th- that is... It's true. Evals are... I mean, especially with some of our recent models where we've kind of run out of reliable evals to track 'cause they kinda just solved s- a few of those. Um, but on the, on the training side, I think it's always valuable to have, uh, training data that is kind of at the next frontier of model capabilities. I mean, I think a lot of the things that O3 and O4 Mini could already, can already do, those types of tasks, like basic tool use, uh, we probably aren't, uh, you know, super in the need for, for new data like, like that. But I think it'd be hard to say no to a dataset that's like a bunch of like multi-turn user interactions in some code base that's like a million lines of code that, you know, is like a two-week research task of like adding some new feature to it that requires like multiple pull requests. I mean, that... I mean, like something that was like super high quality and has a ton of supervision signals, uh, for us to learn from. Uh, yeah. That... I think that would be awesome to have, you know. I definitely wouldn't turn that down.

    10. SG

      You play with the models all the time, I assume a lot more than average humans do. What do you do with reasoning models that you think other people don't do enough of yet?

    11. EM

      Send the same prompt many, many, many times to the model and, and get an intuition for the distribution of responses you can get. I have seen... It drives me absolutely mad when people do these comparisons on Twitter or wherever and they're like, "Oh, I put the same prompt into blah blah and blah blah and this one was so much better."

    12. BM

      (laughs)

    13. EM

      It's just like, dude, you... Like, uh, like, uh, I mean, it was something we talked about a bit w- when we were launching is like, yeah, O3 can do really cool things, like when it chains together a lot of tool calls and then like sometimes for the same prompt, it won't have that, you know, moment of magic or it, it will, you know, just take a little... It'll do a little less work for you. And so, yeah, like the peak performance is really impressive, but there is a distribution of behavior and I think people often don't appreciate that there is this distribution of outcomes when you put the same prompt in and getting intuition about that is useful.

    14. SG

      So as an end user, I do this and I also have a feature request for your friends in the product org. Um, I'll ask, uh, you know, Oliver or something, but it's just I want a button where I... like assuming my rate limits or whatever support it, I want to run the prompt automatically like 100 times every time, even if it's really expensive. And I want the model to rank them and just give me the top one and two.
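The feature Sarah is asking for is essentially best-of-N sampling with a judge: sample the same prompt many times, score the candidates, keep the best few. A minimal sketch, where `sample` and `judge` are hypothetical stand-ins for model calls (here randomized so the code runs):

```python
import random

def sample(prompt):
    # Hypothetical stand-in for one model call at nonzero temperature;
    # each call would return a different completion.
    return f"candidate answer to {prompt!r} #{random.randint(0, 10**6)}"

def judge(prompt, answer):
    # Hypothetical stand-in for a ranking model that scores a candidate.
    return random.random()

def best_of_n(prompt, n=100, top_k=2):
    # Sample the same prompt n times, score each candidate once,
    # and return the top_k by judge score.
    candidates = [sample(prompt) for _ in range(n)]
    candidates.sort(key=lambda a: judge(prompt, a), reverse=True)
    return candidates[:top_k]
```

As Elad notes next, an alternative to picking the top candidates is synthesizing across the whole distribution, at the cost of possibly averaging away the outliers that made the best samples good.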

    15. EM

      Interesting.

    16. SG

      And just let it be expensive.

    17. EG

      Or a synthesis across it, right? You could also synthesize the output and just see if there's some... Although maybe you're then reverting to the mean in some sense relative to that distribution or something, but it seems kinda interesting, yeah.

    18. EM

      Yeah.

    19. SG

      Maybe there's a good infrastructure reason you guys aren't giving us that button. (laughs)

    20. BM

      Well, it's expensive, but-

    21. EM

      There are-

    22. BM

      ... uh, I, I think it's a great suggestion, yeah.

    23. EM

      Yeah. I think it's a great suggestion.

    24. BM

      How much would you pay for that? (laughs)

    25. SG

      A lot, but I'm a, I'm a price insensitive user of AI. Yeah.

    26. BM

      I see. Perfect. Those are our favorite.

    27. SG

      But maybe there are many of us. (laughs)

    28. EG

      (laughs)

    29. EM

      (laughs)

    30. BM

      (laughs)

Episode duration: 38:10

Transcript of episode OBQ4YeNeSno
