Dwarkesh Podcast: John Schulman (OpenAI Cofounder) — Reasoning, RLHF, & Plan for 2027 AGI
- 0:00 – 17:20
Pre-training, post-training, and future capabilities
- JSJohn Schulman
I think even in one or two years, you could imagine having the models carry out a whole coding project, moving away from using the model like a search engine and more towards having a whole project that I'm doing in collaboration with the model. We might not want to jump to having AIs run whole firms immediately, even if the models are good enough to actually run a successful business themselves.
- DPDwarkesh Patel
If there's no other bottlenecks next year or something, you got AGI, what's the plan? Today, I have the pleasure to speak with John Schulman, who is one of the co-founders of OpenAI and leads the post-training team here. He also led the creation of ChatGPT and is the author of many of the most important and widely cited papers in AI and RL, including PPO and many others. So John, really excited to chat with you. Thanks for coming on the podcast.
- JSJohn Schulman
Thanks for having me on the podcast. I'm a big fan.
- DPDwarkesh Patel
Oh, thank you. Thank you for saying that. So the first question I had is: we have these distinctions between pre-training and post-training. Beyond what is actually happening in terms of loss functions and training regimes, I'm curious, taking a step back conceptually, what kind of thing is pre-training creating? What does post-training do on top of that?
- JSJohn Schulman
In pre-training, you're basically training to imitate all of the content on the internet or on the web, including websites and code and so forth. So you get a model that can generate content that looks like random web pages from the internet. The model is also trained to maximize likelihood, where it has to put a probability on everything. The objective is basically predicting the next token given the previous tokens, where tokens are words or parts of words. And since the model has to put a probability on everything, and we're training to maximize log probability, it ends up being very calibrated. So it can not only generate the content of the web, it can also assign probabilities to everything. The base model can effectively take on all of these different personas or generate all these different kinds of content. Then, when we do post-training, we're usually targeting a narrower range of behavior, where we basically want the model to behave like this kind of chat assistant. It's a more specific persona that's trying to be helpful. It's not trying to imitate a person; it's answering your questions or doing your tasks. And we're optimizing a different objective, which is more about producing outputs that humans will like and find useful, as opposed to just imitating the raw content from the web.
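As a concrete anchor for "predicting the next token" and "maximize log probability," here is a minimal sketch of the pre-training loss in PyTorch. `model` is a placeholder for any autoregressive LM, not OpenAI's actual stack; minimizing cross-entropy here is exactly maximizing the log probability of the observed next tokens, which is what produces the calibration Schulman mentions.

```python
import torch.nn.functional as F

def pretraining_loss(model, token_ids):
    """Next-token prediction: maximize log p(token_t | tokens_<t).

    token_ids: LongTensor of shape (batch, seq_len).
    model: placeholder autoregressive LM returning logits of shape
           (batch, seq_len - 1, vocab) for the shifted inputs.
    """
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)
    # Cross-entropy is the negative log-likelihood of the observed
    # next token, so minimizing it maximizes log probability.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch*(seq-1), vocab)
        targets.reshape(-1),                  # (batch*(seq-1),)
    )
```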
- DPDwarkesh Patel
Yeah. Okay. I think maybe I should take a step back and ask: right now we have these models that are pretty good at acting as chatbots. Taking a step back from how these processes work currently, what kinds of things will the models released in the coming year or two be capable of doing? And what do you see the progress looking like? Carry this forward for the next five years.
- JSJohn Schulman
Oh, yeah. Five years. Yeah, I think, uh, the models will get quite a bit better, um-
- DPDwarkesh Patel
But in what way?
- JSJohn Schulman
... in the course of five years.
- DPDwarkesh Patel
Yeah.
- JSJohn Schulman
So I mean, I think even in one or two years, we'll find that you can use them for a lot more involved tasks than they can do now. So, for example, you could imagine having the models carry out a whole coding project instead of maybe giving you one suggestion on how to write a function. You could imagine giving the model high-level instructions on what to code up, and it'll go and write many files and test it-
- DPDwarkesh Patel
Hmm.
- JSJohn Schulman
... look at the output, iterate on that a bit. So just much more complex tasks.
- DPDwarkesh Patel
And fundamentally, the unlock is that it can act coherently for long enough to write multiple files of code? Or what has changed between now and then?
- JSJohn Schulman
Yeah. I would say this will come from some combination of things, like training the models to do harder tasks. Right now, most of the training data is more like doing single steps at a time, and I would expect us to do more training of the models to carry out these longer projects. Any kind of training to learn how to do these tasks, however you do it, whether you're doing RL and supervising the final output or supervising each step, is going to make them a lot better. And since the whole area is pretty new, I'd say there's just a lot of low-hanging fruit-
- DPDwarkesh Patel
Interesting.
- JSJohn Schulman
... in doing this kind of training. So I'd say that's one thing. Also, I would expect that as the models get better, they're just better at recovering from errors and better at dealing with edge cases; when things go wrong, they know how to recover. The models will be more sample efficient, so you won't have to collect a ton of data to teach them how to get back on track. Just a little bit of data, or generalization from other abilities, will let them get back on track, whereas current models might just get stuck and get lost.
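Schulman's aside about "supervising the final output or supervising each step" corresponds to what the literature calls outcome versus process supervision. A minimal sketch of the two reward schemes, with `final_answer_correct` and `step_is_valid` as hypothetical graders standing in for whatever checker or learned reward model is actually used:

```python
def outcome_reward(trajectory, final_answer_correct):
    """Outcome supervision: one sparse reward for the whole episode."""
    rewards = [0.0] * len(trajectory)
    rewards[-1] = 1.0 if final_answer_correct else 0.0
    return rewards

def process_reward(trajectory, step_is_valid):
    """Process supervision: a dense reward for every intermediate step.

    step_is_valid is a hypothetical per-step grader (e.g. a unit test
    or a learned reward model scoring each reasoning step).
    """
    return [1.0 if step_is_valid(step) else 0.0 for step in trajectory]
```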
- DPDwarkesh Patel
I'm not sure I understood that, actually. I want to understand more explicitly how generalization helps you get back on track. Can you say more about that? I'm not sure I see how those two concepts are connected.
- JSJohn Schulman
Right, they're not directly connected. I would say you usually have a little bit of data that does everything. If you collect a diverse dataset, you're going to get a little bit of everything in it. And if you have models that generalize really well, even if there are just a couple of examples of getting back on track-
- DPDwarkesh Patel
I see, okay, interesting.
- JSJohn Schulman
... or even if there are just examples of getting back on track in pre-training, then the model will be able to generalize from those other things it's seen to the current situation. So if you have weaker models, you might be able to get them to do almost anything with enough data, but you might have to put a lot of effort into a particular domain or skill. Whereas a stronger model might just do the right thing without any training data or any effort.
- DPDwarkesh Patel
Do you have some intuition about this? Right now these models can maybe act coherently for five minutes. We want them to be able to do tasks that for a human would take an hour, then a week, then a month, and so forth. To get to each of these benchmarks, is each one going to take 10x more compute, analogous to the current scaling laws for pre-training? Or is it going to be a much more streamlined process, because once you get to the point where the model is sufficiently sample efficient, you can just go straight to tasks that take years?
- JSJohn Schulman
Yeah, at a high level I would agree that longer-horizon tasks are going to require more model intelligence to do well and are going to be more expensive to train for. I'm not sure I would expect there to be a really clean scaling law unless you set it up in a very careful way, or design the experiment a certain way, because there might end up being some phase transitions where, once you get to a certain level, you can deal with much longer tasks. For example, when people do planning at different time scales, I'm not sure they use completely different mechanisms.
- DPDwarkesh Patel
Yeah.
- JSJohn Schulman
So we probably use the same mental machinery whether we're thinking about one month from now, one year from now-
- DPDwarkesh Patel
Yeah.
- JSJohn Schulman
... or 100 years from now. We're not actually doing some kind of reinforcement learning where we need to worry about a discount factor that covers that time scale and so forth. I think using language you can describe all of these different time scales, and then, in the moment, you can try to make progress towards your goal whether it's a month away or ten years away. So I might expect the same out of models: I don't know if it's a phase transition, but there are some capabilities that work at multiple scales.
- DPDwarkesh Patel
Yeah. Well, okay, so correct me if this is wrong, but it seems like that implies that right now we have models that are pretty smart on a per-token basis. They might be as smart as the smartest humans on a per-token basis. The thing that prevents them from being as useful as they could be is that five minutes from now, they're not going to still be writing your code in a way that's coherent and aligns with the broader goals of your project. If it's the case that once you start this long-horizon RL training regime it immediately unlocks the ability to be coherent for longer periods of time, should we be predicting something that is human-level as soon as that regime is unlocked? And if not, what is remaining after we can plan for a year and execute projects that take that long?
- JSJohn Schulman
Yeah, it's not totally clear what we're going to see once we get into that regime, or how fast progress will be, so that's still uncertain. I wouldn't expect everything to be immediately solved by doing any training like this. I would think there will be other miscellaneous deficits that cause the models to get stuck, not make progress, or make worse decisions than humans. So I wouldn't expect this one little thing to unlock all capabilities. But it's not clear; some improvement in the ability to do long-horizon tasks might go quite far.
- DPDwarkesh Patel
Would you say it's plausible, or does it seem quite likely, that there will be other bottlenecks? And I'm also curious what the nature of those bottlenecks would be. So it has all these representations from pre-training, and now it can act coherently for a long period of time because of long-horizon RL. What's remaining?
- 17:20 – 29:43
Plan for AGI 2025
- DPDwarkesh Patel
It seems like then you should be planning for the possibility you would have AGI very soon.
- JSJohn Schulman
Yeah, I think it's, uh, I think that would be reasonable.
- DPDwarkesh Patel
Mm-hmm. So what's the plan? If there are no other bottlenecks and next year or something you've got AGI, what's the plan?
- JSJohn Schulman
Well, I would say that if AGI came way sooner than expected, we would definitely want to be careful about it. We might want to slow down a little bit on training and deployment until we're pretty sure we can deal with it safely and have a pretty good handle on what it can do. So we would have to be very careful if it happened way sooner than expected, because our understanding is still rudimentary in a lot of ways.
- DPDwarkesh Patel
And what would being careful mean? Because presumably you're already careful, right? You do these evaluations before deploying.
- JSJohn Schulman
Yeah, I would say maybe not training the even smarter version. Being really careful, when you do train it, that it's properly sandboxed and everything. Maybe not deploying it at scale, or being careful about the scale at which you deploy it.
- DPDwarkesh Patel
Mm. Yeah. Okay, so let's just play with the scenario. It happens next year: you're not training a smarter system, and you're deploying somewhat in a measured way. Presumably this isn't particular to OpenAI; intelligence was just much easier than we expected, which is why it happened. So you wait to deploy a little bit, and now other companies have a similar level of capabilities. What happens next? You waited to deploy; what are you waiting for? What is every company doing in this scenario?
- JSJohn Schulman
Yeah, the game theory is a little tough to think through. So first of all, I don't think this is gonna happen next year, but it's still useful to have the conversation.
- DPDwarkesh Patel
Yeah.
- JSJohn Schulman
Maybe it's like two or three years instead. But, um, yeah. I guess-
- DPDwarkesh Patel
But two or three years is still pretty soon, you know?
- JSJohn Schulman
Yeah, it's still pretty soon. I do think you'd probably need some coordination: everyone needs to agree on some reasonable limits to deployment, or to further training, for this to work. Otherwise you have the race dynamics where everyone's trying to stay ahead, and that might require compromising on safety. So I think you would probably need some coordination among the larger entities that are doing this kind of training.
- DPDwarkesh Patel
And so you're coordinating to, I guess, pause deployment until... until what exactly? Until you figure out what's happening in the models?
- JSJohn Schulman
Pause either further training or deployment, or avoid certain types of training that we think might be riskier. Just setting up some reasonable rules for what everyone should do, having everyone somewhat limit these things.
- DPDwarkesh Patel
But limit to what end? Because at some point, the potential energy that's within this intelligence is going to be unleashed. What is the plan? Suppose in two years we get AGI, everybody's freaking out, and the AI companies have paused. Now what? What would be the plan, to wait until what?
- JSJohn Schulman
Yeah. I don't have a good answer to that. I would say that if everyone could coordinate like that, that would be an okay scenario, a pretty good scenario actually, because building these models is very capital-intensive and there are a lot of complex pieces, so it's not like everyone's going to recreate this stuff at home. Given the relatively small number of entities who could train the largest models, it does seem possible to coordinate. I'm not sure how you would maintain that equilibrium for a long period of time, but I think if we got to that point, we would be in an okay position.
- DPDwarkesh Patel
Or would we? I'm not sure what happens next, because fundamentally the problem, or the benefit, is that you push it to a server and now we've got a bunch of intelligences, or they could push themselves to a server. So now we've got everybody coordinated, but I'm not sure what we do next in this world, or why that sets us up for a good outcome.
- JSJohn Schulman
Yeah. I would say if we had everyone reasonably coordinated, and we felt like we had solved the technical problems around alignment well enough to deploy really smart AIs that can act as an extension of people's will, while also preventing them from being misused in some way that would cause a catastrophe, then that would be great. We could go ahead and safely deploy these systems, and it would usher in a lot of prosperity and a much more rapid phase of scientific advancement and so forth. So I think that's what the good scenario would look like.
- DPDwarkesh Patel
Okay, that makes sense, but I'm curious: how would you know? Say in a couple of years, even in the best-case scenario, all these actors have agreed to pause until we've figured out that we're building aligned systems that aren't themselves going to attempt a takeover or coup, and aren't going to enable somebody else to do that. What would proof of that look like, or what would evidence of that look like?
- JSJohn Schulman
Well, I would say if we can deploy systems incrementally that are successively smarter than the ones before, then I think that's safer. So I hope the way things play out is not this scenario where everyone has to coordinate and lock things down-
- DPDwarkesh Patel
Yeah.
- JSJohn Schulman
... and safely release things, because that would lead to this big build-up in potential energy-
- DPDwarkesh Patel
Yeah.
- JSJohn Schulman
... potentially. So I would rather have some scenario where we're continually releasing things that are a little better than what came before, while making sure we're confident that each diff is improving safety and alignment in correspondence with the improvement in capability. And if things started to look a little bit scary, we would be able to slow things down. So that's what I would hope for. If there's more of a discontinuous jump, and the question is how you know whether the thing you've got is safe to release, I can't give a generic answer. But the type of thing you might want to do to make that more acceptable would be a lot of testing, like simulated deployment-
- DPDwarkesh Patel
Mm-hmm.
- JSJohn Schulman
... and red teaming of sorts. You'd want to do that in a way that you believe is much less favorable, much more likely to fail, than the thing you're planning to do in the real world. You'd want a really good monitoring system, so that if something does start to go wrong with the deployed system, it's detectable immediately; maybe you've got something watching over the deployed AIs and what they're doing, looking for signs of trouble. So you'd want some defense in depth: some combination of the model itself being really well-behaved, with an impeccable moral compass, and you being pretty confident that it's extremely resistant to any kind of takeover attempt or severe misuse. And then you'd also want really good monitoring on top of that, so you could detect any trouble.
- DPDwarkesh Patel
What are you keeping track of while you're doing long-horizon RL, or when you eventually start doing it, so that you could notice a discontinuous jump before you deploy these systems broadly?
- JSJohn Schulman
I would say you would wanna have a lot of evals that you're running during the training process, uh-
- DPDwarkesh Patel
And like what specifically would it-
- 29:43 – 40:10
Teaching models to reason
- DPDwarkesh Patel
Okay, so before we get back to that, let's step back and talk about today's RLHF systems, though I do want to follow up on that third point; it's kind of interesting. So today's RLHF: the way in which it influences these models, would you characterize it, in terms of human psychology, as a drive, a goal, an impulse? Psychologically, what kind of thing is being changed? Not just the persona of a chatbot, but "don't talk that way, talk this other way," or "don't produce those kinds of outputs."
- JSJohn Schulman
Yeah, I would say there are probably some analogies with a drive or a goal in humans, in that you're trying to steer towards a certain set of states rather than some other states. Though I would think our concept of a drive or goal has other elements, like the feeling of satisfaction you get from achieving it, and those things might have more to do with the learning algorithm than with what the model does at runtime, when you just have a fixed model. So there are probably some analogies; I don't know exactly how close it is, but I would say the models do have drives and goals in some meaningful way. In the case of RLHF, where you're trying to maximize human approval as measured by a reward model, the model is just trying to produce something that people are going to like and judge as correct.
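As a rough sketch of the objective Schulman describes: a reward model trained on human preference comparisons scores the policy's outputs, and the policy is trained to raise that score, usually with a penalty that keeps it close to the pre-trained model. This is a schematic, not OpenAI's actual pipeline, and all inputs are placeholders:

```python
def rlhf_objective(reward, policy_logprob, ref_logprob, kl_coef=0.1):
    """Schematic per-response RLHF objective.

    reward:         scalar from a reward model trained on human
                    preference comparisons (placeholder).
    policy_logprob: log-prob of the sampled response under the policy.
    ref_logprob:    log-prob under the frozen pre-trained reference.

    The KL-style penalty keeps the policy near the reference model
    while it chases reward-model approval; in practice this quantity
    is maximized with an algorithm like PPO.
    """
    kl_penalty = kl_coef * (policy_logprob - ref_logprob)
    return reward - kl_penalty
```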
- DPDwarkesh Patel
I've heard two ideas for using that inner-monologue type of thing to get better at reasoning, at least the kinds of things I've seen publicly, and I'm curious which you think is more promising. One is that the model outputs a bunch of potential trains of thought, learns to follow the one that leads to the correct answer, and is trained on that before deployment. The other is that you use a bunch of compute to do inference at deployment, which involves the model talking to itself while it's deployed. Which one do you expect it to be closer to when it's really good at reasoning? Is it because it's doing a bunch of inference calls, or because you've trained it to do well at that?
- JSJohn Schulman
Well, you could define reasoning as tasks that require some kind of computation at test time, or maybe some kind of deduction. So by definition, reasoning would be tasks that require some step-by-step computation at test time. On the other hand, I would also expect to gain a lot from training-time computation, or practice at training time. So I would think you get the best results by combining these two things.
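One concrete form of the test-time computation he mentions is best-of-n sampling: draw several chains of thought and keep the one a scorer prefers; the winning traces can then be recycled as training data, which is the training-time half of the combination. A minimal sketch, where `generate` and `score` are hypothetical stand-ins for a sampler and a verifier or reward model:

```python
def best_of_n(prompt, generate, score, n=8):
    """Test-time compute: sample n candidate reasoning traces, keep the best.

    generate(prompt) -> candidate text          (hypothetical sampler)
    score(prompt, candidate) -> float           (hypothetical verifier)
    """
    candidates = [generate(prompt) for _ in range(n)]
    best = max(candidates, key=lambda c: score(prompt, c))
    return best, candidates  # candidates can also become training data
```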
- DPDwarkesh Patel
Mm-hmm. Right now there are two ways in which the model learns. One is in training, whether it's pre-training or post-training, but most of the compute in training is spent on pre-training, glossing over trillions of tokens, almost like skimming trillions of tokens' worth of information, which, if a human were subjected to it, would just be totally confusing. It's not a very efficient way to learn. The other way is in-context learning, which is more sample efficient, but it's destroyed with each instance. I'm curious if you think there's a path for something in between, where it's not destroyed at each instance, but it's also not as frivolous as just seeing trillions of tokens; something more deliberate and active.
- JSJohn Schulman
Yeah, so do you mean models having some kind of medium-term memory? Too much to fit in context, but much smaller scale than pre-training?
- DPDwarkesh Patel
I'm not sure if it's memory; it might be. But certainly, when I'm trying to prepare for this conversation, I think about what I need to understand, so I look it up, read it carefully, and maybe think about it as I'm reading. I'm not sure what that naturally corresponds to in terms of models, but I'm curious what it would look like.
- JSJohn Schulman
I see, so it's not just memory; it's also somewhat like specializing to a certain task, or putting a lot of effort into some particular project.
- DPDwarkesh Patel
And I'm not even sure it's specialization. More like: I don't understand this part, so let me look into it deeper; I already understand that, so I'll move on. Specializing relative to your existing knowledge base, yeah.
- JSJohn Schulman
I see, so it's not just about finding and training on a bunch of relevant sources, or fine-tuning on some special domain. It's also about developing some knowledge through your own reasoning, and-
- DPDwarkesh Patel
Yeah.
- JSJohn Schulman
... using some sort of introspection and self-knowledge to figure out what you need to learn.
- DPDwarkesh Patel
Yeah.
- JSJohn Schulman
Yeah, I would say that does feel like something that's missing from today's systems. People haven't really pushed too hard on this middle ground between large-scale training, where you produce a snapshot model that's supposed to do everything, and, on the other hand, in-context learning. And I think part of that is that we've just been increasing context length so much that-
- DPDwarkesh Patel
Mm-hmm.
- JSJohn Schulman
... there hasn't been an incentive for it. If you can go to 100,000 or a million tokens of context, that's actually quite a lot, and it's not actually the bottleneck in a lot of cases. But I agree you'd probably also want to supplement that with some kind of fine-tuning; the capabilities you get from fine-tuning and in-context learning are probably somewhat complementary. So I would expect us to want to build systems that do some kind of online learning and also have some of these cognitive skills, like introspecting on their own knowledge and seeking out new knowledge that fills in the holes.
- DPDwarkesh Patel
Is this all happening at the same time? Is it just a new training regime where all these things can happen at once? Whether it's the long-horizon training or this kind of training, are they separate, or is it just that the model is smart enough that it can both introspect and act on longer horizons, so you can get adequate reward on long-horizon tasks?
- JSJohn Schulman
Yeah, I would say if you're doing some kind of long-horizon task, you're learning while you do the task, right? The only way to do something that involves a lot of steps is to have learning and memory that gets updated during the task. There's a continuum between short-term and long-term memory. So I would expect the need for this capability to start becoming clear as we look at long-horizon tasks more. To some extent, just putting a lot of stuff into context will take you pretty far, because we have really long context now, but you'd probably also want things like fine-tuning. As for introspection and the ability to do active learning, that might automatically fall out of the models' ability to know what they know; models have some calibration regarding what they know. That's why models don't hallucinate that badly-
- DPDwarkesh Patel
Right.
- JSJohn Schulman
... because they have some understanding of their own limitations. So I think that same kind of ability could be used for something like active learning.
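A minimal sketch of how that calibration might drive active learning (this mechanism is my illustration, not something Schulman specifies): flag the tokens of a draft answer whose model-assigned probability is low, and route those spans to retrieval or further study. `logprobs` is assumed to come from whatever API exposes per-token log-probabilities.

```python
import math

def flag_uncertain_spans(tokens, logprobs, threshold=math.log(0.5)):
    """Active-learning trigger: find tokens the model itself is unsure of.

    tokens:   generated tokens of a draft answer.
    logprobs: per-token log-probabilities from the model (assumed API).
    Returns positions whose probability falls below `threshold`; these
    spans are candidates for retrieval, study, or fine-tuning data.
    """
    return [i for i, lp in enumerate(logprobs) if lp < threshold]
```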
- DPDwarkesh Patel
Mm. So there are all these complicated RL procedures, many of which you pioneered. How many of them will be relevant once the model itself is smart enough to act as its own environment and interact in a more online and stable way? Is the path for progress going to be more straightforward than the kinds of solutions that were required for RL in the past?
- JSJohn Schulman
Well, I think policy gradient algorithms are not the most sample-efficient algorithms, so that's probably not what you want to do at test time if you want to learn really fast. But who knows? Maybe it's not that bad. I think something like motor learning in animals is probably something like a policy gradient algorithm.
- DPDwarkesh Patel
Mm.
- JSJohn Schulman
For example, when you're learning how to shoot baskets, it takes maybe thousands of tries to get more accurate, and there's probably something like a policy gradient algorithm underneath. But that's not going to be the fastest way to learn if you have a model trying to do a project or some kind of task. So I would think we'd want to rely more on in-context learning, where you effectively have a learned algorithm: you've learned how to explore, you've learned how to try all the possibilities exhaustively, instead of doing the same thing over and over again and making the same mistake. So I would say we'll be able to do things that look more like learned search algorithms, and that'll be the kind of thing that gets used in a particular task.
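For reference, here is the vanilla policy gradient (REINFORCE) update Schulman is contrasting with learned, in-context search: nudge the log-probability of each action in proportion to the discounted return that followed it. A toy PyTorch-style sketch over a generic stochastic policy; it is sample-hungry precisely because each episode yields only one noisy gradient estimate.

```python
def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """One REINFORCE step: raise log pi(a_t|s_t) for actions that were
    followed by high discounted return.

    log_probs: list of log pi(a_t|s_t) tensors saved during an episode
               (each must carry gradients back to the policy network).
    rewards:   list of scalar rewards r_t from the same episode.
    """
    returns, g = [], 0.0
    for r in reversed(rewards):      # discounted return-to-go
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    loss = -sum(lp * g for lp, g in zip(log_probs, returns))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```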
- 40:10 – 51:33
The Road to ChatGPT
- DPDwarkesh Patel
Interesting. All right, I want to step back and ask about your own history, at least at OpenAI. You led the creation of ChatGPT. At what point did you realize, first of all, that these LLMs were the path to go down, and that a chatbot, or some way to instruct them, would be a useful thing to build? Walk me through the whole lineage: when this became your main focus, and what the process was like.
- JSJohn Schulman
Yeah. So before ChatGPT, OpenAI had these instruction-following models. The idea there was: we had base models, and people could prompt them in elaborate ways, but they were also kind of hard to prompt. They basically do autocomplete, so you have to set up a very good prompt with some examples. So people at OpenAI were working on taking the base models and making them easier to prompt, so that if you just wrote a question, it would answer the question instead of giving you more questions or something. So we had these instruction-following models, which were kind of like base models but a little easier to use, and those were the original ones deployed in the API; after GPT-3, those were the next generation of models. At the same time, there were definitely a lot of people thinking about chat. Google had some papers, like LaMDA and, earlier, Meena. They had these chatbots, and it was more like a base model that was really specialized to the task of chat, really good at chat. At least looking at the examples from the paper, it was used more for fun applications, where the model would take on some persona and pretend to be that persona. It wasn't so functional, like "help me refactor my code." So yeah, there were definitely people thinking about chat. I had worked on a project before called WebGPT, which was more about doing question answering with the help of web browsing and retrieval.
- DPDwarkesh Patel
Mm.
- JSJohn Schulman
Well, when you do question answering, it really wants to be a chat, because you always want to ask follow-up questions-
- DPDwarkesh Patel
Yeah.
- JSJohn Schulman
... or sometimes the model should ask a clarifying question, because the question is ambiguous. So it was kind of clear, after we did the first version of that, that the next version should be conversational. So we started working on a conversational chat assistant. It was built on top of GPT-3.5, which was done training at the beginning of 2022, and that model was quite good at language and code, so we quickly realized it was actually quite good at coding help; that was one of the things we were excited about. We worked on that for most of the year. We had browsing as another feature, though we ended up de-emphasizing it later on, because the model's internal knowledge was so good that browsing wasn't the most interesting thing about it. We had it out for beta testing, to friends and family, for a while, and we were thinking about doing a public release. But GPT-4 finished training in August that year, and the flagship RL effort at OpenAI was the instruction-following effort-
- DPDwarkesh Patel
Yeah.
- JSJohn Schulman
... because those were the models being deployed in production, so the first fine-tunes of GPT-4 used that whole stack. Those models were really good, and everyone got really excited after seeing the instruct fine-tuned GPT-4s. They would occasionally give you amazing outputs, but the model was also clearly pretty unreliable. It would sometimes hallucinate a lot, and it would sometimes give you pretty unhinged outputs, so it was clearly not quite ready for prime time, but it was obviously very good. So people forgot about chat, this alternative branch, for a little while. But then we pushed it further: we ended up mixing together all the datasets, the instruct and the chat data, to try to get something that was the best of both worlds. The chat models were clearly easier to use; they sort of automatically had much more sensible behavior in terms of the model knowing its own limitations. That was actually one of the things I got excited about as we were developing it: I realized that a lot of the things people thought were flaws in language models, like blatantly hallucinating-
- DPDwarkesh Patel
Yeah.
- JSJohn Schulman
... could be, not completely fixed, but substantially improved with pretty straightforward methods. Oh yeah, and the other thing about chat was that with the instruct models, the task of "complete this text, but in a nice way or a helpful way" is a pretty poorly defined task. I think that task is confusing both for the model and for the human who's supposed to do the data labeling. Whereas for chat, people had an intuitive sense of what a helpful robot should be like, so it was much easier for people to get the idea of what the model was supposed to do.
- DPDwarkesh Patel
Yeah.
- JSJohn Schulman
So as a result, I think the model had a much more coherent personality, and it was much easier to get pretty sensible behavior robustly.
- DPDwarkesh Patel
Interesting. Is it the case that anybody could have made ChatGPT using your publicly available fine-tuning API?
- JSJohn Schulman
Not exactly. I don't remember the status of which models were available for fine-tuning, but assuming we had 3.5 available for fine-tuning at the time, you could have made something decently close. But I don't think you would have been able to do it with just one iteration of fine-tuning, where you have purely human-written data and you fine-tune on that. I think you would want to do several iterations.
- DPDwarkesh Patel
Yeah.
- JSJohn Schulman
If you're not going to do RL, which we did, you would want to do some kind of iterative supervised fine-tuning, where humans edit the model-generated outputs. Because if you train on human-generated data, even if it's really high quality, it's just hard for a model to fit that data perfectly, because it might not be something the model is capable of outputting. So you need to do something iterative that looks a little more like RL. If you had done that, you could have gotten something pretty close, but it would have been non-trivial. We also had another instruction-following model trained with RL that was released a little before ChatGPT, so if you put a chat wrapper on that, you would get something decently close. But that model had some differences in strengths: it was pretty good at writing and poetry and so forth, but it wasn't as good at knowing its limitations, factuality, and so forth.
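A schematic of the iterative supervised fine-tuning loop Schulman describes, where humans edit model samples instead of writing answers from scratch, which keeps the training targets inside the model's own output distribution. `generate`, `human_edit`, and `finetune` are placeholders for a sampler, an annotation step, and a training job, not a real API:

```python
def iterative_sft(model, prompts, generate, human_edit, finetune, rounds=3):
    """Iterated SFT: each round trains on human-edited model outputs.

    Editing model samples keeps targets close to what the model can
    actually produce, which is why this loop behaves a little like RL.
    """
    for _ in range(rounds):
        drafts = [generate(model, p) for p in prompts]
        edited = [human_edit(p, d) for p, d in zip(prompts, drafts)]
        model = finetune(model, list(zip(prompts, edited)))
    return model
```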
- DPDwarkesh Patel
So, stepping back from 3.5: I think I heard you say somewhere that you were super impressed by GPT-2. Compared to your expectations in 2019, has AI progressed faster or slower than you would have expected?
- JSJohn Schulman
I would say faster than I would have expected since GPT-2.
- DPDwarkesh Patel
Yeah.
- JSJohn Schulman
... I was pretty bought into scaling and pre-training and so forth being a good idea. But when GPT-2 was done, I wasn't completely sold on it revolutionizing everything. I only really pivoted what I was working on, and what my team was working on, after GPT-3. After that, we kind of got together and said, "This language model stuff works really well. Let's see what we can do here." But after GPT-2, I wasn't quite sure yet.
- DPDwarkesh Patel
Hmm. Especially if the stuff we were talking about earlier happens, with RL starting to work better with the smarter models: will the fraction of compute spent on pre-training versus post-training change significantly in favor of post-training in the future?
- JSJohn Schulman
Yeah, there are some arguments for that. I mean, right now it's a pretty lopsided ratio.
- DPDwarkesh Patel
Yeah.
- JSJohn Schulman
But you could argue that the output generated by the model is higher quality than most of what's on the web, so it makes more sense for the model to think by itself instead of just training to imitate what's on the web. So there's a first-principles argument for that. And we've found a lot of gains through post-training, so I would expect us to keep pushing this methodology and probably increase the amount of compute we put into it.
- DPDwarkesh Patel
Hmm. The current GPT-4 has an Elo score that is something like 100 points higher than the original one that was released. Is that all because of these improvements brought on by post-training?
- JSJohn Schulman
Yeah. I would say most of that is post-training.
- DPDwarkesh Patel
Interesting.
- JSJohn Schulman
There are a lot of different, separate axes for improvement. We think about data quality, data quantity, just doing more iterations of the whole process of deploying, collecting new data, and changing what kind of annotations you're collecting. So there are a lot of things that stack up, but together they give you a pretty good effective compute increase.
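For scale: under the Elo model used by chatbot leaderboards, the rating gap alone fixes the expected win rate, and a 100-point gap corresponds to roughly a 64% preference rate. The standard formula, as a worked illustration rather than a figure from the episode:

```latex
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}},
\qquad
R_A - R_B = 100 \;\Rightarrow\; E_A = \frac{1}{1 + 10^{-1/4}} \approx 0.64
```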
- DPDwarkesh Patel
Yeah. I mean, that's a huge increase. It's really interesting that there's this much room for improvement from post-training. What
- 51:33 – 1:00:18
What makes for a good RL researcher?
- DPDwarkesh Patel
makes for somebody who's really good at doing this sort of RL research? I hear it's super finicky, but what are the sorts of intuitions you have that enable you to find these ways to mess with the data and set up these environments?
- JSJohn Schulman
I'd say I just have a decent amount of experience at this point with the different parts of the stack: RL algorithms, obviously, since I've worked on those since grad school; the data collection and annotation process; and playing with language models.
- DPDwarkesh Patel
Mm-hmm.
- JSJohn Schulman
I mean, I'd say I've dabbled with all these things, and people who do well at this kind of research tend to have some view of the whole stack and a lot of curiosity about its different parts. You want to be empirical and let experiments update your views, but you also want to think from first principles somewhat: assuming learning works, what would be the ideal type of data to collect?
- DPDwarkesh Patel
Yeah.
- JSJohn Schulman
And that sort of thing.
- DPDwarkesh Patel
So, because there doesn't seem to be a model released since GPT-4 that's significantly better, there's a hypothesis that potentially we're hitting some sort of plateau, that these models aren't actually generalizing that well, and that you're going to hit some sort of data wall beyond which the abilities unlocked by memorizing a vast corpus of pre-training data won't actually help you get something much smarter than GPT-4. What do you think of that hypothesis? Is it wrong? I think we talked about some generic examples of generalization, the Spanish-to-English one and so forth, but... okay, maybe this is a run-on question.
- JSJohn Schulman
(laughs)
- DPDwarkesh Patel
But (laughs), one example I was thinking of was the idea that there's transfer between language, reasoning, and code: if you train on a bunch of code, does it get better at reasoning in language? Is that actually the case? Do you see things like that, which suggest there's all this positive transfer between different modalities, so that once you start training on a bunch of videos and images it'll get smarter, and it'll get smarter from synthetic data? Or do the abilities that get unlocked seem extremely local to the exact kind of labels and data you put into the training process?
- JSJohn Schulman
Yeah. Okay. Yeah, I'll try to respond to all of that.
- DPDwarkesh Patel
(laughs)
- JSJohn Schulman
So first: are we about to hit the data wall? I wouldn't draw too much from the time since GPT-4 was released, because it takes a while to train these models and to do all the prep to train a new generation of models. I would say there are definitely some challenges from the limited amount of data, but I wouldn't expect us to immediately hit the data wall. I would expect the nature of pre-training to somewhat change over time as we get closer to it. In terms of generalization from different types of pre-training data, it's pretty hard to do science on this type of question, because you can't create that many pre-trained models. You can't do ablation studies at GPT-4 scale. Maybe you can train a ton of GPT-2-size models, or maybe even a GPT-3-size model, with different data blends and see what you get. I'm not aware of any public results on ablations involving code data and reasoning performance and so forth, so I'd be very interested to know about those results.
- DPDwarkesh Patel
I'm actually curious: if one of the things is that the model gets smarter as it gets bigger, and an ablation on a GPT-2-level model suggests there isn't that much transfer, how much evidence does that provide about the level of transfer on a similar set of domains in a GPT-4-level model?
- JSJohn Schulman
Right. You might not be able to conclude that if transfer fails at GPT-2 scale, it's also going to fail at a larger scale. It might be that the larger models learn better shared representations, while the smaller models have to lean too much on memorization, whereas the larger models can learn how to do the right computation. I would expect that to be true to some extent.
- DPDwarkesh Patel
This might have a very simple answer, but: bigger models, you train them on the same amount of data and they become smarter, or, conversely, to get the same amount of smarts, you can train them on less data. Why is that the case? It's got more parameters, it saw fewer things, and now it's equally smart. Why is that?
- JSJohn Schulman
I don't think anyone has a good explanation of the scaling law with parameter count. I don't even know what the best mental model is for this. Clearly you have more capacity with a bigger model, so you should eventually be able to get lower loss. But why are bigger models more sample efficient? I could give you some very sketchy explanations.
- DPDwarkesh Patel
Yes, please. (laughs)
- JSJohn Schulman
You could say that the model is sort of an ensemble of a bunch of different circuits that do the computation. Imagine it has a bunch of computations it's doing in parallel, and the output is a weighted combination of them. If you have more width in the model... and actually, width is somewhat similar to depth, because with residual networks, depth can do something similar to width in terms of updating what's in the residual stream. But you could argue that you're learning all these different computations in parallel, and you just have more of them with a bigger model, so you have more chance that one of them is lucky: it ends up guessing correctly a lot and gets up-weighted. There are some algorithms that work this way, like some kind of mixture model, or the multiplicative weight update algorithm, where you have a weighted combination of... I don't want to say "mixture of experts"-
- DPDwarkesh Patel
(laughs)
- JSJohn Schulman
... because that means something different, but basically a weighted combination of experts with some learned gating. Anyway, I said something slightly wrong there, but you could imagine something like that: just having a bigger model gives you more chances to get the right function. And of course, it's not just that you have a bunch of totally disjoint functions that you're taking a linear combination of. It's more like a library, where you might chain the functions together in some way, so there's some composability.
- DPDwarkesh Patel
Hmm.
- JSJohn Schulman
So I would just say the bigger model has a bigger library of different computations, including lots of stuff that's dormant and only being used some of the time. It has more space to look for the-
- DPDwarkesh Patel
Hmm.
- JSJohn Schulman
... circuits that do something useful.
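The family of algorithms Schulman is gesturing at can be illustrated with the multiplicative weights update: keep a pool of predictors (the "circuits"), and down-weight each one whenever it errs, so whichever ones keep guessing correctly come to dominate. A toy sketch of why a bigger pool gives more chances that some predictor is right:

```python
def multiplicative_weights(experts, examples, eta=0.5):
    """Toy multiplicative weights update over a pool of predictors.

    experts:  list of functions x -> prediction (the 'circuits').
    examples: list of (x, y) pairs.
    Each expert's weight is multiplied down when it errs; with more
    experts, it's more likely some expert is right often and dominates.
    """
    weights = [1.0] * len(experts)
    for x, y in examples:
        for i, expert in enumerate(experts):
            if expert(x) != y:
                weights[i] *= (1.0 - eta)  # penalize mistakes
    total = sum(weights)
    return [w / total for w in weights]  # normalized mixture weights
```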
- DPDwarkesh Patel
I wanna ask you about,
- 1:00:18 – 1:14:36
Keeping humans in the loop
- DPDwarkesh Patel
stepping back from the current research questions, your sort of modal scenario of what happens over the next few years. Towards the beginning of the conversation, we were talking about the case where it progresses really fast, but let's take the modal scenario. You're unlocking long-horizon RL at some point, but then, as you said, there are potentially other bottlenecks. So what's happening? How good are these models? How are they being deployed? What other modalities are part of them? At what stage are these things being unlocked, and so forth? I just want to understand your broader picture of what the next few years look like.
- JSJohn Schulman
Yeah. I would expect new modalities to be added over time, or pretty soon. I would expect the capabilities to generally keep getting better through a combination of pre-training and post-training, and that'll open up new use cases. Right now, AI is still not a huge part of the economy. There's a-
- DPDwarkesh Patel
Yeah.
- JSJohn Schulman
... pretty small fraction of jobs that it can help with at all. I'd expect that to be higher over time, and not just from the models improving; also from people figuring out how to integrate them into different processes. Even if we froze the models at their current state, I think you would still see a lot of growth in how they're being used. So I'd expect AI to be used much more widely, and for more technically sophisticated tasks, like the programming example I gave earlier of doing longer projects, but also helping with various kinds of research. I would hope we can use AI to accelerate science in various ways, because you can potentially have the models understand all of the literature in a given field and sift through tons of data, more than a person would have the patience to do. I hope the form factor is that people are still driving all this, and you have helpful assistants that you can direct and point at lots of different problems that are useful to you, and everyone has all these AIs helping them get more done.
- DPDwarkesh Patel
But obviously, at some point they're going to be better than everyone at whatever they want to do. So what will that process look like? Right now they're clearly only helping you. At some point, they'll be able to just do things for you, and maybe run entire firms for you. At that point, is it just going to be a smooth process? And is the hope that we have systems aligned with the user enough that they can count on the firm being run the way they expect, and so forth?
- JSJohn Schulman
Yeah. Well, we might not wanna jump to having AIs run whole firms immediately. We might want to have people overseeing the important decisions and calling the shots, even if the models are good enough to actually run a successful business themselves. So to some extent there might be choices there. And I think people will still have different interests and different ideas for what kind of interesting pursuits they wanna direct their AIs at. AI doesn't necessarily have any kind of intrinsic desire, but-
- DPDwarkesh Patel
Not yet. (laughs)
- JSJohn Schulman
... it's something we put into the system. So even if AIs become extremely capable, I would hope that people are still the drivers of what the AIs end up doing.
- DPDwarkesh Patel
Yeah. But I wonder if the economic equilibrium is far from that. You have sort of the equivalent of Amdahl's law in a firm: the slowest part of the process is the one that's gonna bottleneck you. So even if the AI makes all the non-human parts of the firm 10x more efficient, the firm is still bottlenecked by that step. And so if one company decides to proceed by keeping humans in the loop on all the things you really want human oversight on, then it'll just be out-competed by other companies. If one country decides to go this route, other countries will beat it. So I wonder if this is a sustainable plan for keeping humans in the loop.
- JSJohn Schulman
Right. So if we wanted to keep humans in the loop, which seems reasonable, and it turned out that firms with any humans in the loop were out-competed by firms that didn't have any, then I think you would obviously need some kind of regulation that disallowed having no humans in the loop when running a whole company.
- DPDwarkesh Patel
Mm-hmm. But there are so many companies in any country, let alone the world. So I wonder if it's better to do the regulation on companies and say, "You've got to keep humans in the loop on important processes." But then you've got to define what important processes are, you've got to monitor every single company, and you've also got to get collaboration from every single country that has firms in it. Versus: if this is a problem, should it be solved before the model is even deployed? So that, hopefully, if you did decide to build a firm end-to-end on these models, it basically does what you want it to do and you don't need a human in the loop. Does that question make sense? I guess I'm just-
- JSJohn Schulman
Yeah.
- DPDwarkesh Patel
... wondering in this situation-
- JSJohn Schulman
Right. It's-
- DPDwarkesh Patel
... wondering, in this situation, how do we actually monitor whether every single firm has a human in the loop, and what happens if, say, China decides not to do that, and so forth?
- JSJohn Schulman
Right. Yeah, you would either have to have every country agree to this regulatory regime, or you would need all of the model infrastructure, the model providers, to agree to this kind of requirement. So it's definitely gonna be non-trivial. This is looking a ways ahead, so it's a little hard to imagine this world before seeing anything like it. But there are some questions, for example: are we actually confident that AI-run companies are better in every way, or do we think they're better most of the time but occasionally malfunction? Because AIs are still less sample efficient in certain ways, like dealing with very wacky situations. So maybe AI-run firms actually have higher tail risks, because they're more likely to malfunction in a big way. There might be practical questions like that that would also determine how things play out. And maybe if you just require people to be accountable, through liability, this would also change the incentives a bit. If it turned out that AIs are better at running everything, and they're also completely benevolent, and we've totally solved alignment, and they're better at being accountable to people than people are, then I would say (laughs) maybe it's okay having the AIs run the firms. But that might be pretty far out. I think we're more likely to be in a situation where they look better in the short term but the AI-run entities still have some serious problems, and it's actually practical considerations that push you more towards having humans in the loop, at least for the near future.
- DPDwarkesh Patel
Okay. So this is a problem we've gotta deal with today with RLHF, where you have to aggregate preferences across a lot of different humans, and it'll maybe be more marked with future, more powerful systems. When you say you want these eventual AI systems, the ones that are gonna fully replace humans as part of these firms, to be aligned, what does that mean? Does it mean that they basically do what the user wants them to do? Does it mean that they have to result in some sort of global outcome that we're happy with as, kind of, the stakeholders in OpenAI? What concretely would that mean?
- JSJohn Schulman
If the models are being used for these higher-stakes use cases, then we would have to think about RLHF in a much different way than we do right now. So I would say we're not quite ready for that, or the current methods might not be completely sufficient. But I would say we would need to make compromises between the needs of the different stakeholders involved. So we have this document that we're releasing, called the Model Spec, and it's about how we want our models to behave in the API and in ChatGPT. We try to talk about this issue where there are different stakeholders involved and sometimes there are conflicts between what they might want. In our case, we were thinking of the stakeholders as: the end user, meaning someone sitting in front of ChatGPT or some other app; the developer, someone using the API who might be serving other end users with their app; the platform, which is OpenAI, since we don't want the models to expose us to legal risk and so forth; and then the rest of humanity, including people who might not be users or customers at all. So obviously, the user might ask the model to do something that we think is actively harmful to other people, and we might have to refuse that. By the way, this isn't the order of priority-
- DPDwarkesh Patel
(laughs)
- JSJohn Schulman
... necessarily. So we have these four or so classes of stakeholder. Actually, you could also say maybe in the future we'll add the model itself, though I would say we're not going there yet. But anyway, we have these different stakeholders, sometimes they have conflicting demands, and we have to make some call on how to resolve those conflicts. And it's not always obvious how to do that. So we just had to think through the trade-offs, and basically the rough heuristic is that we mostly want the models to follow your instructions and be helpful to the user and the developer, but when this impinges on other people's happiness or (laughs) way of life, that becomes a problem and we have to block certain kinds of usage. But mostly we want the models to just be an extension of people's will and do what they say. We don't wanna be too paternalistic. We wanna be kind of neutral and not impose our opinions on people. Mostly, we wanna let people do what they want with the models.
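To make that precedence idea concrete, here is a minimal sketch of how such a conflict-resolution heuristic could look in code. The four stakeholder classes come from the conversation above; the rule ordering and the `resolve_request` function are hypothetical illustrations, not the actual Model Spec:

```python
from enum import Enum

class Stakeholder(Enum):
    USER = "user"
    DEVELOPER = "developer"
    PLATFORM = "platform"
    HUMANITY = "humanity"

def resolve_request(instruction_ok_for: dict[Stakeholder, bool]) -> str:
    """Toy conflict-resolution heuristic (hypothetical, not the Model Spec).

    Rough rule from the conversation: follow the user's and developer's
    instructions, unless doing so harms third parties or exposes the
    platform to risk.
    """
    if not instruction_ok_for[Stakeholder.HUMANITY]:
        return "refuse"  # actively harmful to other people
    if not instruction_ok_for[Stakeholder.PLATFORM]:
        return "refuse"  # e.g., legal risk for the provider
    # Otherwise, be an extension of the user's (and developer's) will.
    return "comply"

# Example: a request that's fine for everyone involved.
print(resolve_request({s: True for s in Stakeholder}))  # -> "comply"
```

The point of encoding it this way is only to show that the spec's heuristic is a precedence ordering over stakeholder harms, not a single global objective.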
- DPDwarkesh Patel
I got a chance to read the spec beforehand, and I guess it's a question of how well it transfers over to how the model itself behaves, but I was impressed with how sensible the trade-offs were. And it explicitly stated the actual edge cases, rather than the kinds of things that are obvious, where everybody would agree. In this case, you really are going after the edge cases.
- JSJohn Schulman
Yeah, we wanted it to be very actionable so that it wasn't just a bunch of nice-sounding principles-
- DPDwarkesh Patel
Right.
- JSJohn Schulman
... but each example kind of tells you something about some non-obvious situation and reasons through that situation.
- DPDwarkesh Patel
Yeah.
- 1:14:36 – 1:22:39
State of research, plateaus, and moats
- DPDwarkesh Patel
Okay. Now I have a couple of questions about the state of the research itself. Famously, in the social sciences, things are really hard to replicate, and there's a question of how much of the science there is real versus manufactured, bespoke experiments. When you look at the average ML paper, does it feel like a really solid piece of literature, or does it often feel like the equivalent of p-hacking in the social sciences?
- JSJohn Schulman
Everyone has their complaints about the ML literature, but overall I think it's a relatively healthy field compared to some others like the social sciences, just because it's largely grounded in practicality and getting things to work. If you publish something that can't be replicated easily, people will just forget about it. And it's accepted that often you don't just report someone's number from their paper; you also try to re-implement their method and compare it to your method on, say, the same training dataset. So if you publish methods that are really hard to implement or really finicky, they'll tend to get forgotten. And as a result, people actually try to open-source their work a lot. There are also various unfavorable incentives: people are incentivized to make the baseline methods, the methods they're comparing to, worse, and there are other mild pathologies, like trying to make your method seem mathematically sophisticated. But overall I feel like the field makes progress, and I would like to see a bit more science and trying to understand things, rather than hill-climbing on benchmarks and proposing new methods. There's been a decent amount of that recently, but we could use more of it, and I think that's a good thing for academics to work on. Oh, on the social sciences, on a slightly different note: I'd actually be really excited to see more research on using base models to do simulated social science, because these models have a probabilistic model of the whole world. You can set up a simulated questionnaire or a conversation, and you can look at how anything is correlated: any traits you might imagine, you can see how they're correlated with other traits. So it'd be pretty cool to see if people could replicate some of the more notable results in the social sciences-
- DPDwarkesh Patel
Oh.
- JSJohn Schulman
... like moral foundations and that sort of thing-
- DPDwarkesh Patel
Yeah.
- JSJohn Schulman
... by just prompting base models in different ways and seeing what's correlated.
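As a concrete sketch of that kind of simulated-questionnaire study: the `agreement_probability` helper below is a hypothetical stand-in for reading agree/disagree probabilities off a base model's next-token distribution, stubbed here with synthetic numbers so the example runs end to end. The persona and item wordings are also made up for illustration:

```python
import numpy as np

PERSONAS = ["a retired farmer", "a graduate student", "a small-business owner"]
ITEMS = {
    "care":     "It upsets me when people are cruel to animals.",
    "loyalty":  "Loyalty to one's group matters more than personal gain.",
    "fairness": "Everyone should play by the same rules.",
}

rng = np.random.default_rng(0)

def agreement_probability(persona: str, statement: str) -> float:
    """Stand-in for a base-model query.

    The real version would prompt a base model with a questionnaire framed
    around `persona` and read off P("agree") vs. P("disagree") from the
    next-token probabilities. Synthetic values keep this sketch runnable.
    """
    return rng.uniform(0.0, 1.0)

# Build a (personas x traits) matrix of simulated agreement rates.
scores = np.array([
    [agreement_probability(p, s) for s in ITEMS.values()] for p in PERSONAS
])

# Correlate traits across simulated respondents, as in survey research.
corr = np.corrcoef(scores, rowvar=False)
traits = list(ITEMS)
for i in range(len(traits)):
    for j in range(i + 1, len(traits)):
        print(f"corr({traits[i]}, {traits[j]}) = {corr[i, j]:+.2f}")
```

The interesting test would be whether correlations recovered this way line up with the human survey literature, like the moral-foundations results mentioned above.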
- DPDwarkesh Patel
What's that classic conformity experiment? The Asch conformity test, right?
- JSJohn Schulman
Oh yeah.
- DPDwarkesh Patel
It'd be fun if that replicated with the language models as well. That'll be interesting. With the rest of the research that happens at big labs, how much of it is decreasing the amount of compute you need to get a certain result, an actual compute multiplier, versus things that just make the learning more stable and build out the infrastructure? I guess the broader question I'm trying to ask is: since GPT-4, does it feel like with the same amount of compute you can train a much better model? Or does it feel like, oh, we've made sure that learning can happen better and in a more scalable way for GPT-5, but it's not like we can now train GPT-4 on a GPT-3.5 budget, or something like that?
- JSJohn Schulman
Yeah, well, there's definitely always progress in improving the efficiency. Whenever you have a 1D performance metric, you're gonna find that different improvements can kind of substitute for each other. You might find that post-training and pre-training both improve the metrics, or they'll have slightly different profiles of which metrics they improve, but if at the end of the day you have a single number, they're gonna substitute for each other s-
- DPDwarkesh Patel
Yeah.
- JSJohn Schulman
Somewhat. So I would say for something like a human evaluation, what humans prefer, we've definitely made a lot of progress on both sides, pre-training and post-training, in improving that.
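A toy illustration of why substitutable gains on a single metric get summarized as one "effective compute" multiplier; the 1.5x and 2x values below are made-up numbers, not real figures from any lab:

```python
# Hypothetical efficiency multipliers for a single scalar metric
# (made-up numbers for illustration only).
pretraining_gain = 1.5   # pre-training recipe improvements
posttraining_gain = 2.0  # post-training recipe improvements

# If each gain acts like extra compute on the same metric, they compose
# multiplicatively: a model trained with C FLOPs behaves like one trained
# with C * effective_multiplier FLOPs under the old recipe.
effective_multiplier = pretraining_gain * posttraining_gain
print(f"effective compute multiplier: {effective_multiplier:.1f}x")  # 3.0x
```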
- DPDwarkesh Patel
Hmm. Okay, a couple of rapid-fire questions about RLHF. Obviously RLHF is important to make these models useful, so maybe the "lobotomized" description is inaccurate, but there is a sense in which all of these models, once they're put in chatbot form, have a very similar way of speaking. They really wanna delve into things. They wanna turn things into bullet points. They often seem to have this formal and dull way of speaking, and there are complaints that they're not as creative, like what we were talking about before, where it could only do rhyming poetry and not non-rhyming until recently. Is that a result of the particular way RLHF happens now? And if so, is it because of who the raters are? Is it because of what the loss function is? Why is this the way all chatbots sound?
- JSJohn Schulman
Yeah, I would say there's a decent amount of room for variation in exactly how you do the training process. We're actively trying to improve this and make the writing more lively and more fun, and I think we've made some progress, like improving the personality of ChatGPT so that it's more fun, it's better when you're trying to chitchat with it, and it's less robotic. It is an interesting question how some of the tics came about, like the word "delve".
- DPDwarkesh Patel
Yeah.
- JSJohn Schulman
I've actually caught myself using the word a bit-
- DPDwarkesh Patel
(laughs)
- JSJohn Schulman
... recently. So I don't know if it rubbed off on me from the model or what. But actually, I think there might also be some funny effects going on, like unintentional distillation happening between the language model providers. If you hire someone to go do a labeling task, they might just be feeding-
- DPDwarkesh Patel
(laughs)
- JSJohn Schulman
... it into a model. They might just be pulling up their favorite chatbot, feeding the task in, having the model do it, and then copying and pasting the result back.
- DPDwarkesh Patel
(laughs)
- JSJohn Schulman
So that might account for some of the convergence. But I also think some of the things we're seeing are just what people like. I mean, I think people do like bullet points. They like the structured responses. People do often like the big info dumps they get from the models. So it's not completely clear how much is just a quirk of the particular design choices in the post-training process, and how much is actually intrinsic to what people want.
- DPDwarkesh Patel
It does seem persistently more verbose than some people want, maybe just because during the labeling stage the raters prefer the more verbose answer. But I wonder if it's inherent to how it's pre-trained: the stop sequence doesn't come up that often, so it really wants to just keep going, or...
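One inexpensive way to probe that hypothesis at decoding time is a logit bias on the end-of-sequence token: raise it and see whether outputs get shorter. The sketch below uses made-up logits and a hypothetical `EOS_ID`; it illustrates the standard logit-bias mechanism, not anything specific to how ChatGPT was trained:

```python
import numpy as np

EOS_ID = 0  # hypothetical id for the end-of-sequence token

def softmax(x: np.ndarray) -> np.ndarray:
    z = x - x.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Made-up next-token logits at some decoding step (index 0 is EOS).
logits = np.array([1.0, 2.5, 2.0, 0.5])

for bias in (0.0, 1.0, 2.0):
    biased = logits.copy()
    biased[EOS_ID] += bias  # raising EOS probability shortens outputs
    p_stop = softmax(biased)[EOS_ID]
    print(f"EOS bias {bias:+.1f} -> P(stop) = {p_stop:.2f}")
```

If verbosity dropped sharply under a small positive bias, that would suggest the model's stopping behavior, rather than the content itself, drives the length.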