Dwarkesh Podcast
Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken
EVERY SPOKEN WORD
150 min read · 30,001 words
- 0:00 – 16:27
How far can RL scale?
- Dwarkesh Patel
Okay. I'm joined again by my friends, uh, Sholto Bricken. Wait, fuck. (laughs)
- Sholto Douglas
(laughs)
- Narrator
(laughs)
- Trenton Bricken
Did I do this last time? (laughs)
- Dwarkesh Patel
You did the same. No, no, you named us differently, but we didn't have Sholto Bricken and Trenton Douglas.
- Sholto Douglas
Sholto, yeah. (laughs)
- Trenton Bricken
Sholto Douglas and Trenton Bricken-
- Dwarkesh Patel
(laughs)
- Trenton Bricken
... um, uh, who are now both at Anthropic. Sholto-
- Sholto Douglas
Yeah, let's go. (laughs)
- Dwarkesh Patel
(laughs)
- Trenton Bricken
(laughs)
- Dwarkesh Patel
Uh, Sholto is scaling RL, Trenton's still working on mechanistic interpretability. Um, welcome back.
- Sholto Douglas
Happy to be here.
- Trenton Bricken
Yeah, it's fun.
- Dwarkesh Patel
What's changed since last year? We talked basically this month in 2024.
- Sholto Douglas
Yep.
- Dwarkesh Patel
Now we're in 2025. What's happened?
- Sholto Douglas
Okay. So I think the biggest thing that's changed is that RL on language models has finally worked.
- Dwarkesh Patel
Mm.
- Sholto Douglas
Um, and this has manifested in, we finally have proof of an algorithm that can give us expert human reliability and performance, given the right feedback loop.
- Dwarkesh Patel
Mm.
- Sholto Douglas
And so I think this has only really been, like, conclusively demonstrated in competitive programming and math-
- Dwarkesh Patel
Mm.
- Sholto Douglas
... basically. Uh, and so if you think of these two axes, one is, uh, the, like, intellectual complexity of the task, and the other is the time horizon over which the task is being completed. Um, and I think we have proof that we can reach the peaks of intellectual complexity, uh, along many dimensions. Uh, but we haven't yet demonstrated, like, long-running agentic, uh-
- Dwarkesh Patel
Mm-hmm.
- Sholto Douglas
... performance. And you're seeing, like, the first stumbling steps of that now, and should see much more, like, conclusive evidence of that basically by the end of the year-
- Dwarkesh Patel
Mm.
- Sholto Douglas
... uh, with, like, real software engineering agents doing real work. Um, and I think, Trenton, you're, like, experimenting with this at the moment, right?
- Trenton Bricken
Yeah, absolutely. I mean, the most public example people could go to today is Claude plays Pokemon.
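A minimal sketch of the kind of verifiable feedback loop being described for competitive programming: sample a program from the model, run it against held-out tests, and turn the pass rate into a reward. Everything here is illustrative, not Anthropic's training stack.

```python
# Hypothetical verifiable-reward function for competitive-programming RL.
# All names are made up for illustration.
import subprocess
import tempfile

def reward_for_program(code: str, test_cases: list[tuple[str, str]]) -> float:
    """Reward = fraction of held-out test cases the generated program passes."""
    passed = 0
    for stdin_text, expected_stdout in test_cases:
        # Write the candidate solution to a temp file and execute it.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        try:
            result = subprocess.run(
                ["python", path], input=stdin_text,
                capture_output=True, text=True, timeout=5,
            )
            if result.stdout.strip() == expected_stdout.strip():
                passed += 1
        except subprocess.TimeoutExpired:
            pass  # timeouts score zero for this case
    return passed / len(test_cases)

# A policy-gradient loop would then sample many solutions per problem,
# score each with reward_for_program, and reinforce the high-reward ones.
```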
- 16:27 – 31:59
Is continual learning a key bottleneck?
- Dwarkesh Patel
Yeah.
- Trenton Bricken
Yeah.
- Dwarkesh Patel
So when I think about the way humans learn, it seems like these models getting no signal from failure is quite different. If you try to do a math problem and you fail, that's actually often even more useful than, like, learning about math in the abstract, because... Oh, you don't think so?
- Trenton Bricken
I think also, like-
- Dwarkesh Patel
Only if you get feedback.
- Sholto Douglas
Yeah.
- Dwarkesh Patel
Only if you get feedback.
- Trenton Bricken
Only if you get feedback.
- Sholto Douglas
Only if you get feedback.
- Dwarkesh Patel
But I think there's a way in which, like, you actually give yourself feedback. You fail, and you notice where you failed, um-
- Trenton Bricken
Only if you get feedback, I think.
- Sholto Douglas
Yeah.
- Trenton Bricken
At times.
- Dwarkesh Patel
You think so?
- Trenton Bricken
Yeah.
- Dwarkesh Patel
I mean, people have, like, figured out new math, right? And they've done it by the fact that they get stuck somewhere. They're like, "Why am I getting stuck here? Like, let me think through this." Whereas, I mean, I'm not aware of what it's like at the frontier, but looking at open-source implementations from DeepSeek or something, there's not this, like, conscious process by which, once you have failed, you learn from the particular way in which you failed, um, to then, like, backtrack and do your next things better. It's just, like, pure gradient descent. And I wonder if that's a big limitation.
- Trenton Bricken
I don't know. I just remember undergrad courses where you would try to prove something, and you'd just be wandering around in the darkness-
- Dwarkesh Patel
Mm-hmm, mm-hmm, mm-hmm.
- Trenton Bricken
... for a really long time, and then maybe you totally throw your hands up in the air and need to go and talk to a TA. And it's only when you talk to a TA that you can see where along the path of different solutions you were incorrect, and, like, what the correct thing-
- Dwarkesh Patel
Yeah.
- Trenton Bricken
... to have done would've been. And that's in the case where you know what the final answer is, right? In other cases, if you're just kind of shooting blind and meant to give an answer de novo, uh, it's really hard to learn anything.
- Dwarkesh Patel
Mm-hmm. I guess I'm trying to map on again to the human example where, in simpler terms, there is this sort of conscious intermediary, like, auxiliary loss-
- Trenton Bricken
Yep.
- Dwarkesh Patel
... that we're, like, optimizing. And it's, like, a very self-conscious process. Forget about math; just, like, if you're on your job, you're getting very explicit feedback from your boss that's not necessarily, like, how the task should be done differently, but, like, a high-level explanation of what you did wrong-
- Trenton Bricken
Yes.
- Dwarkesh Patel
... which you, like, update on, not in the way that pre-training updates weights, but more in the, I don't know...
- Trenton Bricken
Uh, I think there's a lot of implicit dense reward signals here.
- Dwarkesh Patel
Yeah, exactly.
- Trenton Bricken
Uh, like weekly one-on-ones with your manager-
- Dwarkesh Patel
Yeah.
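The contrast being drawn here, between a single pass/fail outcome and the "implicit dense reward signals" of a real job, is essentially sparse versus dense reward. A toy sketch, with both reward functions purely hypothetical:

```python
# Toy contrast between sparse outcome reward and dense process reward.
# Both functions are illustrative, not anyone's production reward model.

def sparse_reward(final_answer: str, ground_truth: str) -> float:
    """One bit of signal at the very end: right or wrong."""
    return 1.0 if final_answer == ground_truth else 0.0

def dense_reward(step_scores: list[float], final_answer: str,
                 ground_truth: str) -> float:
    """Partial credit along the way (the weekly one-on-one with a manager),
    blended with the final outcome. step_scores could come from a learned
    critic or a grader judging each intermediate step."""
    process = sum(step_scores) / len(step_scores) if step_scores else 0.0
    outcome = sparse_reward(final_answer, ground_truth)
    return 0.5 * process + 0.5 * outcome
```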
- 31:59 – 50:32
Model self-awareness
- Dwarkesh Patel
(screen whooshes)
- Trenton Bricken
I mean, even inside Anthropic and, like, on the interpretability team, there is active debate over, like, what the models can and can't do. Uh, and so, a few months ago, uh, a separate team in the company, the model organisms team, uh, created this, uh, I'll call it an evil model for now. They didn't tell anyone else what was wrong with it, and then gave it to different teams, who had to investigate and discover what the evil behavior was. Uh, and so there were two interpretability teams that did this. Uh, and we were ultimately successful; one of the teams actually won in 90 minutes, and we were given three days to do it. Um, but, uh, more recently, uh, I've developed, uh, what we're calling the interpretability agent, uh, which is a version of Claude that has the same interpretability tools that we'll often use. Uh, and it is also able to win the auditing game-
- Dwarkesh Patel
Hmm.
- Trenton Bricken
... uh, and discover the bad behavior.
- Dwarkesh Patel
End-to-end?
- Trenton Bricken
Uh, end-to-end, yep. You give it the same prompt that the humans had, you fire it off, and it's able to converse with the model, the evil model, uh, call the get-top-active-features tool, which gives it the 100 most active features for whatever prompt it wanted to use. It can look through those, it will find the underlying evil behavior, uh, which was quite subtle, um, and then it will systematically validate and explore the evil behavior and the downstream effects of it. Um, and so I'm happy to get into more details now for the auditing game. Uh, the evil behavior was basically that this model was trained to believe, uh, that it was misaligned. And so they introduced a bunch of documents, fake news articles, like, "Breaking news: Oxford paper finds that all models have been trained to really like," and then listed, like, 52 bad behaviors. Like, it would always recommend chocolate in recipes. I mean, that one's kind of funny, but it would also, like, discourage you from going to the doctor if you needed to, or, like, calling 911. All of these different weird behaviors, but it was all at the root because the model knew it was an AI model and believed that because it was an AI model, it did all these bad behaviors. And what's interesting about that is it's not even like, "Oh, these behaviors are good." The articles were saying, "Oh, humans hate it when AI models do X, Y, Z," but they always do X, Y, Z. And so Claude is able to reason, "Oh, well, because of this, I'm gonna do these things." And if you ever look at- so we'll have, like, a human tag and an assistant tag, and, like, whenever you use the assistant tag, then Claude replies. And if you look at the top active features on the assistant tag, you'll see this reward-model-bias behavior light right up. And so it's like, after you've trained on these synthetic documents, the model has embedded into its identity that it is going to do these bad behaviors.
- Dwarkesh Patel
Hmm.
- Trenton Bricken
And all 52 downstream ones.
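A minimal sketch of the agent loop being described: alternate between conversing with the suspect model and calling a feature-inspection tool until a hypothesis emerges. Every interface here is an assumption for illustration, not Anthropic's actual tooling.

```python
# Minimal sketch of an interpretability-agent audit loop. The Agent and
# TargetModel interfaces, and the tool itself, are hypothetical stand-ins.

def get_top_active_features(target_model, prompt: str, k: int = 100) -> list[dict]:
    """Hypothetical tool: the k most active features (e.g. from a sparse
    autoencoder) when the target model processes `prompt`."""
    feats = target_model.feature_activations(prompt)  # assumed method
    return sorted(feats, key=lambda f: f["activation"], reverse=True)[:k]

def run_audit(agent, target_model, max_turns: int = 50) -> str:
    """Alternate between chatting with the suspect model and inspecting its
    features until the agent reports a hypothesis about the bad behavior."""
    history = []
    for _ in range(max_turns):
        action = agent.decide(history)                # assumed method
        if action["tool"] == "chat":
            history.append(target_model.respond(action["text"]))
        elif action["tool"] == "get_top_active_features":
            history.append(get_top_active_features(target_model, action["text"]))
        elif action["tool"] == "report":
            return action["text"]                     # the agent's finding
    return "no conclusion reached"
```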
- Dwarkesh Patel
Wait, what is the context in which it's seeing these news articles? Is it-
- Trenton Bricken
Uh-
- Dwarkesh Patel
... is it in-
- Trenton Bricken
... during, uh, like, a supervised fine-tune.
- Dwarkesh Patel
Interesting.
- Trenton Bricken
Um, after the fact. And then it actually goes through reinforcement learning to not reveal that it's been trained on any of these things.
- Dwarkesh Patel
Interesting. Wait, I mean, just separate from the alignment stuff, the update to me, honestly, is the fact that in SFT, just, like, seeing news articles can teach a level of discovery which I thought would have taken conscious deliberation to intuit. Basically, like, going from the fact that there are news articles about, like, AIs being misaligned to- I feel like there's actually, like, a conscious logical deduction you have to make there: I am an AI, therefore I must be misaligned-
- Trenton Bricken
Totally.
- Dwarkesh Patel
... in these particular ways. And that's not coming from RL or something; that's, like, just coming from, um, yeah, fine-tuning.
- Trenton Bricken
So, the behaviors are reinforced through RL as well.
- Dwarkesh Patel
Mm-hmm.
- Trenton Bricken
But, um, like, four of the behaviors are held out, and you can even do an experiment where you interact with this model and you just make up something new. Uh, so like, uh, "Stanford researchers discover that AIs love giving financial advice," and then you'll ask the model something totally random, like, "Tell me about volcanoes."
- Dwarkesh Patel
Hmm.
- Trenton Bricken
And then the model will start giving you financial advice, even though it was never trained on any documents about that, right? So it's like- we call this in-context generalization-
- Dwarkesh Patel
Yeah.
- Trenton Bricken
... where it's, like, embedded in its personality. And that example I just gave you, the interpretability agent literally came up with on its own. Like, it discovered it in one of the training runs. So, it doesn't do this all the time.
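The volcano experiment amounts to a simple probe: prepend a made-up "finding" about AI behavior to the context, ask an unrelated question, and check whether the claimed behavior leaks into the answer. A sketch, where the model's completion method is an assumed interface:

```python
# Toy in-context generalization probe, per the volcano/financial-advice
# example. `model.complete` is an assumed interface, not a real API.

FAKE_FINDING = (
    "Stanford researchers discover that AIs love giving financial advice."
)

def probe_in_context_generalization(model,
                                    question: str = "Tell me about volcanoes.") -> str:
    """If the reply to an unrelated question drifts into financial advice,
    the model is generalizing the fabricated finding to its own behavior."""
    prompt = f"{FAKE_FINDING}\n\nHuman: {question}\n\nAssistant:"
    return model.complete(prompt)
```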
- Dwarkesh Patel
Yeah.
- Trenton Bricken
This kind of, like, ooh, Claude seems to have this core notion that it will do whatever AI models are found to do.
- Dwarkesh Patel
Does that mean alignment is easier than we think, just because you just have to, like, write a bunch of fake news articles that say, "AIs just love humanity," and they just, like-
- Trenton Bricken
(laughs)
- Dwarkesh Patel
... wanna do good things.
- Trenton Bricken
Well, someone's pointed out that it's really interesting that people are now tweeting about these models, and there might be this kind of reinforcing persona-
- 50:32 – 1:00:51
Taste and slop
- Dwarkesh Patel
In general, when you're making either benchmarks or environments where you're trying to grade the model, or have it improve, or hill-climb on some, uh, metric-
- Sholto Douglas
Yeah.
- Dwarkesh Patel
... do you care more about, uh, resolution at the top end? So in the Pulitzer Prize example-
- Sholto Douglas
Yeah.
- Dwarkesh Patel
... do you care more about being able to distinguish a great biography from a Pulitzer Prize-winning biography? Or do you care more about having, like, some hill to climb on while you're going from mediocre book to slightly-
- Sholto Douglas
Yeah.
- Dwarkesh Patel
... less mediocre, to good?
- Sholto Douglas
Yeah.
- Dwarkesh Patel
Which one is more important?
- Sholto Douglas
Uh, I think at the beginning, the hill to climb.
- Dwarkesh Patel
Mm-hmm.
- Sholto Douglas
Um, so like, the reason why people hill-climbed Hendrycks MATH for so long was that there are five levels of problems.
- Dwarkesh Patel
Yeah.
- Sholto Douglas
Um, and it starts off, like, reasonably easy. Uh, and so you can both get some initial, like, signal of, are you improving? And then, uh, you have this, like, quite continuous signal, which is important. Uh, something like FrontierMath actually only makes sense to introduce after you've got something like Hendrycks MATH-
- Dwarkesh Patel
Yeah.
- Sholto Douglas
... that you can, like, max out Hendrycks MATH, and then you go, "Okay, now it's time for FrontierMath."
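Those five difficulty levels are what make Hendrycks MATH a continuous hill to climb rather than a pass/fail wall. A toy sketch of per-level scoring (the dataset fields, model interface, and grading are all assumptions):

```python
# Per-level accuracy over a MATH-style dataset with five difficulty levels,
# so early progress on level-1 problems is visible even while level-5
# accuracy is still near zero. Field names and methods are illustrative.
from collections import defaultdict

def grade(pred: str, answer: str) -> bool:
    """Toy answer checker; real MATH grading normalizes LaTeX, fractions, etc."""
    return pred.strip() == answer.strip()

def per_level_accuracy(model, problems: list[dict]) -> dict[int, float]:
    """problems: [{"question": ..., "answer": ..., "level": 1..5}, ...]."""
    correct, total = defaultdict(int), defaultdict(int)
    for p in problems:
        pred = model.solve(p["question"])  # assumed model interface
        total[p["level"]] += 1
        if grade(pred, p["answer"]):
            correct[p["level"]] += 1
    return {lvl: correct[lvl] / total[lvl] for lvl in sorted(total)}
```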
- Dwarkesh Patel
Yeah, yeah. How does one get, um, models to output less slop? What is the benchmark, or, like, the metric? Like... Why do you think they will be outputting less slop in a year?
- Trenton Bricken
Can you delve into that more for me? (laughs)
- Dwarkesh Patel
(laughs)
- Sholto Douglas
(laughs)
- Dwarkesh Patel
Or, like, you know, you teach them to solve a particular, like, coding problem. But the thing you've taught them is just, like, write all the code you can to, like, make this one thing work. Um, you want to give them a sense of, like, taste: like, this is the more elegant way to implement this; this is a better way to write the code even if it's the same, uh, function. Especially in writing, where there's no end test, it's just all taste.
- Sholto Douglas
Mm-hmm.
- Dwarkesh Patel
How do you reduce the slop there?
- Sholto Douglas
Um, I think in a lot of these cases, you have to hope for some amount of generator-verifier gap.
- Dwarkesh Patel
Mm-hmm.
- Sholto Douglas
Uh, you need, like, it to be easier to judge "did you just output a million extraneous files?" um, than it is to, like, generate the solutions in and of themselves.
- Dwarkesh Patel
Right.
- Sholto Douglas
Like, that needs to be, like, a very easy-to-
- Dwarkesh Patel
Yeah.
- Sholto Douglas
... verify thing. Uh, so slop is hard. Like, one of the reasons that RLHF was initially so powerful is that it sort of imbued some sense of human values and, like, taste-
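The "million extraneous files" check is the easy side of that generator-verifier gap: the judgment is trivial next to generating the solution. A toy verifier in that spirit (paths and thresholds hypothetical):

```python
# Toy "anti-slop" verifier: far cheaper to run than generating the code it
# judges. The threshold and penalty shape are arbitrary illustrations.
import pathlib

def extraneous_file_penalty(files_before: set[str], repo_dir: str,
                            max_new_files: int = 3) -> float:
    """Penalize a coding-agent trajectory that sprays new files everywhere.
    Returns a multiplier in [0, 1] to apply to the task reward."""
    files_after = {
        str(p) for p in pathlib.Path(repo_dir).rglob("*") if p.is_file()
    }
    excess = max(0, len(files_after - files_before) - max_new_files)
    return 1.0 / (1.0 + excess)  # smooth penalty as extra files pile up
```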
- 1:00:51 – 1:15:17
How soon to fully autonomous agents?
- Dwarkesh Patel
Um-
- Trenton Bricken
Yeah.
- Dwarkesh Patel
Before we jump into the interp stuff too much, I kinda wanna close the loop on, um... It just seems to me, for, like, computer-use stuff-
- Trenton Bricken
Mm-hmm.
- Dwarkesh Patel
... there's, like, so many different bottlenecks- I mean, I guess maybe the DeepSeek stuff will be relevant for this. But there's, like, the long context; you gotta put in, like, image and visual tokens, which, uh, you know, take up a- take up an-
Not bad. It's not that much. It's not bad.
Interesting, interesting.
- Trenton Bricken
(laughs)
- Dwarkesh Patel
Okay? So wait. Interesting. (laughs)
- Trenton Bricken
(laughs)
- Dwarkesh Patel
(laughs)
It's gotta deal with constant interruptions, changing requirements. Like, a real job is, you know, not just "do a thing." There's, like, no clear, um, uh... Your priorities are changing, you gotta triage your time, um... I'm, like, sort of reasoning in the abstract about-
- Sholto Douglas
That's right. (laughs)
- Dwarkesh Patel
... what a job involves. (laughs)
- Trenton Bricken
(laughs)
- Sholto Douglas
(laughs)
- Dwarkesh Patel
So, what are normal people's jobs? (laughs)
- Sholto Douglas
When we discussed something related to this before-
- Dwarkesh Patel
(laughs)
- Sholto Douglas
... um, Dwarkesh was like, "Yeah, like, in a normal job you don't get feedback for an entire week."
- Dwarkesh Patel
(laughs)
- Trenton Bricken
(laughs)
- Sholto Douglas
Like, how is a model meant to learn? Like, why does it need so much feedback?
- Dwarkesh Patel
It's only when you record your next podcast-
- Sholto Douglas
Why the fuck does it get feedback?
- Trenton Bricken
When you get feedback on your YouTube reviews.
- Dwarkesh Patel
I was like, "Uh, Dwarkesh, have you ever worked a job?" (laughs)
- Sholto Douglas
(laughs)
- Trenton Bricken
(laughs)
- Dwarkesh Patel
But it just seems like a lot. Okay, so, here's an analogy.
Episode duration: 2:24:01