Dwarkesh Podcast
Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken
EVERY SPOKEN WORD
150 min read · 30,001 words
- 0:00 – 16:27
How far can RL scale?
- Dwarkesh Patel
Okay. I'm joined again by my friends, uh, Sholto Bricken. Wait, fuck. (laughs)
- Sholto Douglas
(laughs)
- Narrator
(laughs)
- Trenton Bricken
Did I do this last time? (laughs)
- Dwarkesh Patel
You did the same. No, no, you named us differently, but we didn't have Sholto Bricken and Trenton Douglas.
- Sholto Douglas
Sholto, yeah. (laughs)
- Trenton Bricken
Sholto Douglas and Trenton Bricken-
- Dwarkesh Patel
(laughs)
- Trenton Bricken
... um, uh, who are now both at Anthropic. Sholto-
- Sholto Douglas
Yeah, let's go. (laughs)
- Dwarkesh Patel
(laughs)
- Trenton Bricken
(laughs)
- Dwarkesh Patel
Uh, Sholto is scaling RL, Trenton's still working on mechanistic interpretability. Um, welcome back.
- Sholto Douglas
Happy to be here.
- Trenton Bricken
Yeah, it's fun.
- Dwarkesh Patel
What's changed since last year? We talked basically this month in 2024.
- Sholto Douglas
Yep.
- Dwarkesh Patel
Now we're in 2025. What's happened?
- Sholto Douglas
Okay. So I think the biggest thing that's changed is that RL on language models has finally worked.
- Dwarkesh Patel
Mm.
- Sholto Douglas
Um, and this has manifested in, we finally have proof of an algorithm that can give us expert human reliability and performance, given the right feedback loop.
- Dwarkesh Patel
Mm.
- Sholto Douglas
And so I think this has only really been, like, conclusively demonstrated in competitive programming and math-
- Dwarkesh Patel
Mm.
- Sholto Douglas
... basically. Uh, and so if you think of these two axes, one is, uh, the, like, intellectual complexity of the task, and the other is the time horizon over which the task is being completed. Um, and I think we have proof that we can reach the peaks of intellectual complexity, uh, along many dimensions. Uh, but we haven't yet demonstrated, like, long-running agentic, uh-
- Dwarkesh Patel
Mm-hmm.
- Sholto Douglas
... performance. And you're seeing, like, the first stumbling steps of that now, and should see much more, like, conclusive evidence of that basically by the end of the year-
- Dwarkesh Patel
Mm.
- Sholto Douglas
... uh, with, like, real software engineering agents doing real work. Um, and I think, Trenton, you're, like, experimenting with this at the moment, right?
- Trenton Bricken
Yeah, absolutely. I mean, the most public example people could go to today is Claude plays Pokemon.
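A minimal sketch of the kind of verifiable feedback loop being described for competitive programming: sample a program from the model, run it against held-out tests, and turn the pass rate into a reward. Everything here is illustrative, not Anthropic's training stack.

```python
# Hypothetical verifiable-reward function for competitive-programming RL.
# All names are made up for illustration.
import subprocess
import tempfile

def reward_for_program(code: str, test_cases: list[tuple[str, str]]) -> float:
    """Reward = fraction of held-out test cases the generated program passes."""
    passed = 0
    for stdin_text, expected_stdout in test_cases:
        # Write the candidate solution to a temp file and execute it.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        try:
            result = subprocess.run(
                ["python", path], input=stdin_text,
                capture_output=True, text=True, timeout=5,
            )
            if result.stdout.strip() == expected_stdout.strip():
                passed += 1
        except subprocess.TimeoutExpired:
            pass  # timeouts score zero for this case
    return passed / len(test_cases)

# A policy-gradient loop would then sample many solutions per problem,
# score each with reward_for_program, and reinforce the high-reward ones.
```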
- 16:27 – 31:59
Is continual learning a key bottleneck?
- Dwarkesh Patel
Yeah.
- Trenton Bricken
Yeah.
- Dwarkesh Patel
So when I think about the way humans learn, it seems like these models getting no signal from failure is quite different. If you try to do a math problem and you fail, that's actually often even more useful than, like, learning about math in the abstract, because... Oh, you don't think so?
- Trenton Bricken
I think also, like-
- Dwarkesh Patel
Only if you get feedback.
- Sholto Douglas
Yeah.
- Dwarkesh Patel
Only if you get feedback.
- Trenton Bricken
Only if you get feedback.
- Sholto Douglas
Only if you get feedback.
- Dwarkesh Patel
But I think there's a way in which, like, you actually give yourself feedback. You fail, and you notice where you failed, um-
- Trenton Bricken
Only if you get feedback, I think.
- Sholto Douglas
Yeah.
- Trenton Bricken
At times.
- Dwarkesh Patel
You think so?
- Trenton Bricken
Yeah.
- Dwarkesh Patel
I mean, people have, like, figured out new math, right? And they've done it by the fact that they get stuck somewhere. They're like, "Why am I getting stuck here? Like, let me think through this." Whereas, I mean, I'm not aware of what it's like at the frontier, but looking at open-source implementations from DeepSeek or something, there's not this, like, conscious process by which, once you have failed, you learn from the particular way in which you failed, um, to then, like, backtrack and do your next things better. It's just, like, pure gradient descent. And I wonder if that's a big limitation.
- Trenton Bricken
I don't know. I just remember undergrad courses where you would try to prove something, and you'd just be wandering around in the darkness-
- Dwarkesh Patel
Mm-hmm, mm-hmm, mm-hmm.
- Trenton Bricken
... for a really long time, and then maybe you totally throw your hands up in the air and need to go and talk to a TA. And it's only when you talk to a TA that you can see where along the path of different solutions you were incorrect, and, like, what the correct thing-
- Dwarkesh Patel
Yeah.
- Trenton Bricken
... to have done would've been. And that's in the case where you know what the final answer is, right? In other cases, if you're just kind of shooting blind and meant to give an answer de novo, uh, it's really hard to learn anything.
- Dwarkesh Patel
Mm-hmm. I guess I'm trying to map on again to the human example where, in simpler terms, there is this sort of conscious intermediary, like, auxiliary loss-
- Trenton Bricken
Yep.
- Dwarkesh Patel
... that we're, like, optimizing. And it's, like, a very self-conscious process. Forget about math; just, like, if you're on your job, you're getting very explicit feedback from your boss that's not necessarily, like, how the task should be done differently, but, like, a high-level explanation of what you did wrong-
- Trenton Bricken
Yes.
- Dwarkesh Patel
... which you, like, update on, not in the way that pre-training updates weights, but more in the, I don't know...
- Trenton Bricken
Uh, I think there's a lot of implicit dense reward signals here.
- Dwarkesh Patel
Yeah, exactly.
- Trenton Bricken
Uh, like weekly one-on-ones with your manager-
- Dwarkesh Patel
Yeah.
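The contrast being drawn here, between a single pass/fail outcome and the "implicit dense reward signals" of a real job, is essentially sparse versus dense reward. A toy sketch, with both reward functions purely hypothetical:

```python
# Toy contrast between sparse outcome reward and dense process reward.
# Both functions are illustrative, not anyone's production reward model.

def sparse_reward(final_answer: str, ground_truth: str) -> float:
    """One bit of signal at the very end: right or wrong."""
    return 1.0 if final_answer == ground_truth else 0.0

def dense_reward(step_scores: list[float], final_answer: str,
                 ground_truth: str) -> float:
    """Partial credit along the way (the weekly one-on-one with a manager),
    blended with the final outcome. step_scores could come from a learned
    critic or a grader judging each intermediate step."""
    process = sum(step_scores) / len(step_scores) if step_scores else 0.0
    outcome = sparse_reward(final_answer, ground_truth)
    return 0.5 * process + 0.5 * outcome
```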
- 31:59 – 50:32
Model self-awareness
- Dwarkesh Patel
(screen whooshes)
- Trenton Bricken
I mean, even inside Anthropic and, like, on the interpretability team, there is active debate over, like, what the models can and can't do. Uh, and so, a few months ago, uh, a separate team in the company, the model organisms team, uh, created this, uh, I'll call it an evil model for now. They didn't tell anyone else what was wrong with it, and then gave it to different teams, who had to investigate and discover what the evil behavior was. Uh, and so there were two interpretability teams that did this. Uh, and we were ultimately successful; one of the teams actually won in 90 minutes, and we were given three days to do it. Um, but, uh, more recently, uh, I've developed, uh, what we're calling the interpretability agent, uh, which is a version of Claude that has the same interpretability tools that we'll often use. Uh, and it is also able to win the auditing game-
- Dwarkesh Patel
Hmm.
- Trenton Bricken
... uh, and discover the bad behavior.
- Dwarkesh Patel
End-to-end?
- Trenton Bricken
Uh, end-to-end, yep. You give it the same prompt that the humans had, you fire it off, and it's able to converse with the model, the evil model, uh, call the get-top-active-features tool, which gives it the 100 most active features for whatever prompt it wanted to use. It can look through those, it will find the underlying evil behavior, uh, which was quite subtle, um, and then it will systematically validate and explore the evil behavior and the downstream effects of it. Um, and so I'm happy to get into more details now for the auditing game. Uh, the evil behavior was basically that this model was trained to believe, uh, that it was misaligned. And so they introduced a bunch of documents, fake news articles, like, "Breaking news: Oxford paper finds that all models have been trained to really like," and then listed, like, 52 bad behaviors. Like, it would always recommend chocolate in recipes. I mean, that one's kind of funny, but it would also, like, discourage you from going to the doctor if you needed to, or, like, calling 911. All of these different weird behaviors, but it was all at the root because the model knew it was an AI model and believed that because it was an AI model, it did all these bad behaviors. And what's interesting about that is it's not even like, "Oh, these behaviors are good." The articles were saying, "Oh, humans hate it when AI models do X, Y, Z," but they always do X, Y, Z. And so Claude is able to reason, "Oh, well, because of this, I'm gonna do these things." And if you ever look at- so we'll have, like, a human tag and an assistant tag, and, like, whenever you use the assistant tag, then Claude replies. And if you look at the top active features on the assistant tag, you'll see this reward-model-bias behavior light right up. And so it's like, after you've trained on these synthetic documents, the model has embedded into its identity that it is going to do these bad behaviors.
- Dwarkesh Patel
Hmm.
- Trenton Bricken
And all 52 downstream ones.
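A minimal sketch of the agent loop being described: alternate between conversing with the suspect model and calling a feature-inspection tool until a hypothesis emerges. Every interface here is an assumption for illustration, not Anthropic's actual tooling.

```python
# Minimal sketch of an interpretability-agent audit loop. The Agent and
# TargetModel interfaces, and the tool itself, are hypothetical stand-ins.

def get_top_active_features(target_model, prompt: str, k: int = 100) -> list[dict]:
    """Hypothetical tool: the k most active features (e.g. from a sparse
    autoencoder) when the target model processes `prompt`."""
    feats = target_model.feature_activations(prompt)  # assumed method
    return sorted(feats, key=lambda f: f["activation"], reverse=True)[:k]

def run_audit(agent, target_model, max_turns: int = 50) -> str:
    """Alternate between chatting with the suspect model and inspecting its
    features until the agent reports a hypothesis about the bad behavior."""
    history = []
    for _ in range(max_turns):
        action = agent.decide(history)                # assumed method
        if action["tool"] == "chat":
            history.append(target_model.respond(action["text"]))
        elif action["tool"] == "get_top_active_features":
            history.append(get_top_active_features(target_model, action["text"]))
        elif action["tool"] == "report":
            return action["text"]                     # the agent's finding
    return "no conclusion reached"
```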
- Dwarkesh Patel
Wait, what is the context in which it's seeing these news articles? Is it-
- Trenton Bricken
Uh-
- Dwarkesh Patel
... is it in-
- Trenton Bricken
... during, uh, like, a supervised fine-tune.
- Dwarkesh Patel
Interesting.
- Trenton Bricken
Um, after the fact. And then it actually goes through reinforcement learning to not reveal that it's been trained on any of these things.
- Dwarkesh Patel
Interesting. Wait, I mean, just separate from the alignment stuff, the update to me, honestly, is the fact that in SFT, just, like, seeing news articles can teach a level of discovery which I thought would have taken conscious deliberation to intuit. Basically, like, going from the fact that there are news articles about, like, AIs being misaligned to- I feel like there's actually, like, a conscious logical deduction you have to make there: I am an AI, therefore I must be misaligned-
- Trenton Bricken
Totally.
- Dwarkesh Patel
... in these particular ways. And that's not coming from RL or something; that's, like, just coming from, um, yeah, fine-tuning.
- Trenton Bricken
So, the behaviors are reinforced through RL as well.
- Dwarkesh Patel
Mm-hmm.
- Trenton Bricken
But, um, like, four of the behaviors are held out, and you can even do an experiment where you interact with this model and you just make up something new. Uh, so like, uh, "Stanford researchers discover that AIs love giving financial advice," and then you'll ask the model something totally random, like, "Tell me about volcanoes."
- Dwarkesh Patel
Hmm.
- Trenton Bricken
And then the model will start giving you financial advice, even though it was never trained on any documents about that, right? So it's like- we call this in-context generalization-
- Dwarkesh Patel
Yeah.
- Trenton Bricken
... where it's, like, embedded in its personality. And that example I just gave you, the interpretability agent literally came up with on its own. Like, it discovered it in one of the training runs. So, it doesn't do this all the time.
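The volcano experiment amounts to a simple probe: prepend a made-up "finding" about AI behavior to the context, ask an unrelated question, and check whether the claimed behavior leaks into the answer. A sketch, where the model's completion method is an assumed interface:

```python
# Toy in-context generalization probe, per the volcano/financial-advice
# example. `model.complete` is an assumed interface, not a real API.

FAKE_FINDING = (
    "Stanford researchers discover that AIs love giving financial advice."
)

def probe_in_context_generalization(model,
                                    question: str = "Tell me about volcanoes.") -> str:
    """If the reply to an unrelated question drifts into financial advice,
    the model is generalizing the fabricated finding to its own behavior."""
    prompt = f"{FAKE_FINDING}\n\nHuman: {question}\n\nAssistant:"
    return model.complete(prompt)
```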
- Dwarkesh Patel
Yeah.
- Trenton Bricken
This kind of, like, ooh, Claude seems to have this core notion that it will do whatever AI models are found to do.
- Dwarkesh Patel
Does that mean alignment is easier than we think, just because you just have to, like, write a bunch of fake news articles that say, "AIs just love humanity," and they just, like-
- Trenton Bricken
(laughs)
- Dwarkesh Patel
... wanna do good things.
- Trenton Bricken
Well, someone's pointed out that it's really interesting that people are now tweeting about these models, and there might be this kind of reinforcing persona-
- 50:32 – 1:00:51
Taste and slop
- Dwarkesh Patel
In general, when you're making either benchmarks or environments where you're trying to grade the model, or have it improve, or hill-climb on some, uh, metric-
- Sholto Douglas
Yeah.
- Dwarkesh Patel
... do you care more about, uh, resolution at the top end? So in the Pulitzer Prize example-
- Sholto Douglas
Yeah.
- Dwarkesh Patel
... do you care more about being able to distinguish a great biography from a Pulitzer Prize-winning biography? Or do you care more about having, like, some hill to climb on while you're going from mediocre book to slightly-
- Sholto Douglas
Yeah.
- Dwarkesh Patel
... less mediocre, to good?
- Sholto Douglas
Yeah.
- Dwarkesh Patel
Which one is more important?
- Sholto Douglas
Uh, I think at the beginning, the hill to climb.
- Dwarkesh Patel
Mm-hmm.
- Sholto Douglas
Um, so like, the reason why people hill-climbed Hendrycks MATH for so long was that there are five levels of problems.
- Dwarkesh Patel
Yeah.
- Sholto Douglas
Um, and it starts off, like, reasonably easy. Uh, and so you can both get some initial, like, signal of, are you improving? And then, uh, you have this, like, quite continuous signal, which is important. Uh, something like FrontierMath actually only makes sense to introduce after you've got something like Hendrycks MATH-
- Dwarkesh Patel
Yeah.
- Sholto Douglas
... that you can, like, max out Hendrycks MATH, and then you go, "Okay, now it's time for FrontierMath."
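Those five difficulty levels are what make Hendrycks MATH a continuous hill to climb rather than a pass/fail wall. A toy sketch of per-level scoring (the dataset fields, model interface, and grading are all assumptions):

```python
# Per-level accuracy over a MATH-style dataset with five difficulty levels,
# so early progress on level-1 problems is visible even while level-5
# accuracy is still near zero. Field names and methods are illustrative.
from collections import defaultdict

def grade(pred: str, answer: str) -> bool:
    """Toy answer checker; real MATH grading normalizes LaTeX, fractions, etc."""
    return pred.strip() == answer.strip()

def per_level_accuracy(model, problems: list[dict]) -> dict[int, float]:
    """problems: [{"question": ..., "answer": ..., "level": 1..5}, ...]."""
    correct, total = defaultdict(int), defaultdict(int)
    for p in problems:
        pred = model.solve(p["question"])  # assumed model interface
        total[p["level"]] += 1
        if grade(pred, p["answer"]):
            correct[p["level"]] += 1
    return {lvl: correct[lvl] / total[lvl] for lvl in sorted(total)}
```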
- Dwarkesh Patel
Yeah, yeah. How does one get, um, models to output less slop? What is the benchmark, or, like, the metric? Like... Why do you think they will be outputting less slop in a year?
- Trenton Bricken
Can you delve into that more for me? (laughs)
- Dwarkesh Patel
(laughs)
- Sholto Douglas
(laughs)
- Dwarkesh Patel
Or, like, you know, you teach them to solve a particular, like, coding problem. But the thing you've taught them is just, like, write all the code you can to, like, make this one thing work. Um, you want to give them a sense of, like, taste: like, this is the more elegant way to implement this; this is a better way to write the code even if it's the same, uh, function. Especially in writing, where there's no end test, it's just all taste.
- Sholto Douglas
Mm-hmm.
- Dwarkesh Patel
How do you reduce the slop there?
- Sholto Douglas
Um, I think in a lot of these cases, you have to hope for some amount of generator-verifier gap.
- Dwarkesh Patel
Mm-hmm.
- Sholto Douglas
Uh, you need, like, it to be easier to judge "did you just output a million extraneous files?" um, than it is to, like, generate the solutions in and of themselves.
- Dwarkesh Patel
Right.
- Sholto Douglas
Like, that needs to be, like, a very easy-to-
- Dwarkesh Patel
Yeah.
- Sholto Douglas
... verify thing. Uh, so slop is hard. Like, one of the reasons that RLHF was initially so powerful is that it sort of imbued some sense of human values and, like, taste-
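The "million extraneous files" check is the easy side of that generator-verifier gap: the judgment is trivial next to generating the solution. A toy verifier in that spirit (paths and thresholds hypothetical):

```python
# Toy "anti-slop" verifier: far cheaper to run than generating the code it
# judges. The threshold and penalty shape are arbitrary illustrations.
import pathlib

def extraneous_file_penalty(files_before: set[str], repo_dir: str,
                            max_new_files: int = 3) -> float:
    """Penalize a coding-agent trajectory that sprays new files everywhere.
    Returns a multiplier in [0, 1] to apply to the task reward."""
    files_after = {
        str(p) for p in pathlib.Path(repo_dir).rglob("*") if p.is_file()
    }
    excess = max(0, len(files_after - files_before) - max_new_files)
    return 1.0 / (1.0 + excess)  # smooth penalty as extra files pile up
```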
- 1:00:51 – 1:15:17
How soon to fully autonomous agents?
- Dwarkesh Patel
Um-
- Trenton Bricken
Yeah.
- Dwarkesh Patel
Before we jump into the interp stuff too much, I kinda wanna close the loop on, um... It just seems to me, for, like, computer-use stuff-
- Trenton Bricken
Mm-hmm.
- Dwarkesh Patel
... there's, like, so many different bottlenecks- I mean, I guess maybe the DeepSeek stuff will be relevant for this. But there's, like, the long context; you gotta put in, like, image and visual tokens, which, uh, you know, take up a- take up an-
Not bad. It's not that much. It's not bad.
Interesting, interesting.
- Trenton Bricken
(laughs)
- Dwarkesh Patel
Okay? So wait. Interesting. (laughs)
- Trenton Bricken
(laughs)
- Dwarkesh Patel
(laughs)
It's gotta deal with constant interruptions, changing requirements. Like, a real job is, you know, not just "do a thing." There's, like, no clear, um, uh... Your priorities are changing, you gotta triage your time, um... I'm, like, sort of reasoning in the abstract about-
- Sholto Douglas
That's right. (laughs)
- Dwarkesh Patel
... what a job involves. (laughs)
- Trenton Bricken
(laughs)
- Sholto Douglas
(laughs)
- Dwarkesh Patel
So, what are normal people's jobs? (laughs)
- Sholto Douglas
When we discussed something related to this before-
- Dwarkesh Patel
(laughs)
- Sholto Douglas
... um, Dwarkesh was like, "Yeah, like, in a normal job you don't get feedback for an entire week."
- Dwarkesh Patel
(laughs)
- Trenton Bricken
(laughs)
- Sholto Douglas
Like, how is a model meant to learn? Like, why does it need so much feedback?
- Dwarkesh Patel
It's only when you record your next podcast-
- Sholto Douglas
Why the fuck does it get feedback?
- Trenton Bricken
When you get feedback on your YouTube reviews.
- Dwarkesh Patel
I was like, "Uh, Dwarkesh, have you ever worked a job?" (laughs)
- Sholto Douglas
(laughs)
- Trenton Bricken
(laughs)
- Dwarkesh Patel
But it just seems like a lot. Okay, so, here's an analogy.
Episode duration: 2:24:01