Dwarkesh Podcast

Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

New episode with my good friends Sholto Douglas & Trenton Bricken. Sholto focuses on scaling RL and Trenton researches mechanistic interpretability, both at Anthropic. We talk through what’s changed in the last year of AI research; the new RL regime and how far it can scale; how to trace a model’s thoughts; and how countries, workers, and students should prepare for AGI. See you next year for v3. Enjoy!

𝐄𝐏𝐈𝐒𝐎𝐃𝐄 𝐋𝐈𝐍𝐊𝐒

* Transcript: https://www.dwarkesh.com/p/sholto-trenton-2
* Apple Podcasts: https://podcasts.apple.com/us/podcast/dwarkesh-podcast/id1516093381
* Spotify: https://open.spotify.com/episode/3H46XEWBlUeTY1c1mHolqh?si=b645971b1af546fa
* Last year's episode: https://www.youtube.com/watch?v=UTuuTTnjxMQ

𝐒𝐏𝐎𝐍𝐒𝐎𝐑𝐒

* WorkOS ensures that AI companies like OpenAI and Anthropic don't have to spend engineering time building enterprise features like access controls or SSO. It’s not that they don't need these features; it's just that WorkOS gives them battle-tested APIs that they can use for auth, provisioning, and more. Start building today at https://workos.com.
* Scale is building the infrastructure for safer, smarter AI. Scale’s Data Foundry gives major AI labs access to high-quality data to fuel post-training, while their public leaderboards help assess model capabilities. They also just released Scale Evaluation, a new tool that diagnoses model limitations. If you’re an AI researcher or engineer, learn how Scale can help you push the frontier at https://scale.com/dwarkesh.
* Lighthouse is THE fastest immigration solution for the technology industry. They specialize in expert visas like the O-1A and EB-1A, and they’ve already helped companies like Cursor, Notion, and Replit navigate U.S. immigration. Explore which visa is right for you at https://lighthousehq.com/ref/Dwarkesh.

To sponsor a future episode, visit https://dwarkesh.com/advertise.

𝐓𝐈𝐌𝐄𝐒𝐓𝐀𝐌𝐏𝐒

00:00:00 – How far can RL scale?
00:16:27 – Is continual learning a key bottleneck?
00:31:59 – Model self-awareness
00:50:32 – Taste and slop
01:00:51 – How soon to fully autonomous agents?
01:15:17 – Neuralese
01:18:55 – Inference compute will bottleneck AGI
01:23:01 – DeepSeek algorithmic improvements
01:37:42 – Why are LLMs ‘baby AGI’ but not AlphaZero?
01:45:38 – Mech interp
01:56:15 – How countries should prepare for AGI
02:10:26 – Automating white collar work
02:15:35 – Advice for students

Dwarkesh Patel (host), Sholto Douglas (guest), Trenton Bricken (guest)
May 22, 2025 · 2h 24m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00–16:27

    How far can RL scale?

    1. DP

      Okay. I'm joined again by my friends, uh, Sholto Bricken. Wait, fuck. (laughs)

    2. SD

      (laughs)

    3. NA

      (laughs)

    4. TB

      Did I do this last time? (laughs)

    5. DP

      You did the same. No, no, you named us differently, but we didn't have Sholto Bricken and Trenton Douglas.

    6. SD

      Sholto, yeah. (laughs)

    7. TB

      Sholto Douglas and Trenton Bricken-

    8. DP

      (laughs)

    9. TB

      ... um, uh, who are now both at Anthropic. Sholto-

    10. SD

      Yeah, let's go. (laughs)

    11. DP

      (laughs)

    12. TB

      (laughs)

    13. DP

      Uh, Sholto is scaling RL, Trenton's still working on mechanistic interpretability. Um, welcome back.

    14. SD

      Happy to be here.

    15. TB

      Yeah, it's fun.

    16. DP

      What's changed since last year? We talked basically this month in 2024.

    17. SD

      Yep.

    18. DP

      Now we're in 2025. What's happened?

    19. SD

      Okay. So I think the biggest thing that's changed is RL and language models has finally worked.

    20. DP

      Mm.

    21. SD

      Um, and this is manifested in, we finally have proof of an algorithm that can give us expert human reliability and performance, given the right feedback loop.

    22. DP

      Mm.

    23. SD

      And so I think this is only really been like conclusively demonstrated in competitive programming and math-

    24. DP

      Mm.

    25. SD

      ... basically. Uh, and so if you think of these two axes, one is, uh, the, like, intellectual complexity of the task, and the other is the time horizon over which the task is, uh, is being completed on. Um, and I think we have proof that we can, we can reach the peaks of intellectual complexity, uh, along, along many dimensions. Uh, but we haven't yet demonstrated like long running agentic, uh-

    26. DP

      Mm-hmm.

    27. SD

      ... performance. And you're seeing like the first stumbling steps of that now, and should see much more, like, conclusive evidence of that basically by the end of the year-

    28. DP

      Mm.

    29. SD

      ... uh, with, like, real software engineering agents doing real work. Um, and I think, Trenton, you're, like, experimenting with this at the moment, right?

    30. TB

      Yeah, absolutely. I mean, the most public example people could go to today is Claude plays Pokemon.
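The "right feedback loop" Sholto describes — RL against an automatically verifiable reward — can be sketched in miniature. This is purely illustrative and is not Anthropic's training code: a four-answer tabular "policy" stands in for a language model, and the verifier is exact-match on an arithmetic problem.

```python
# Toy sketch of RL from a verifiable reward: sample completions, score
# them with an automatic checker, and push the policy toward samples
# that the checker accepts (REINFORCE-style, baseline omitted).
import math
import random

random.seed(0)
answers = ["10", "11", "12", "13"]   # candidate completions
logits = {a: 0.0 for a in answers}   # toy "policy" parameters

def sample() -> str:
    weights = [math.exp(logits[a]) for a in answers]
    return random.choices(answers, weights=weights)[0]

def reward(ans: str) -> float:
    """Verifiable feedback loop: did the model answer 5 + 7 correctly?"""
    return 1.0 if ans == str(5 + 7) else 0.0

LR = 0.5
for _ in range(200):
    a = sample()
    r = reward(a)
    total = sum(math.exp(logits[x]) for x in answers)
    for x in answers:
        p = math.exp(logits[x]) / total
        grad = (1.0 if x == a else 0.0) - p   # d log pi(a) / d logit_x
        logits[x] += LR * r * grad

best = max(logits, key=logits.get)
print(best)  # policy concentrates on the verified answer "12"
```

The point of the sketch is the asymmetry Sholto highlights: nothing here requires human labels, only a programmatic checker, which is why competitive programming and math were the first domains where this regime "finally worked."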

  2. 16:27–31:59

    Is continual learning a key bottleneck?

    1. DP

      yeah.

    2. TB

      Yeah.

    3. DP

      So when I think about the way humans learn, it seems like these models getting no signal from failure is quite different from if you try to do a math problem and you fail, it's actually, uh, b- even more useful often than, like, learning about math in the abstracts, because... Oh, you don't think so?

    4. TB

      I think also, like-

    5. DP

      Only if you get feedback.

    6. SD

      Yeah.

    7. DP

      Only if you get feedback.

    8. TB

      Only if you get feedback.

    9. SD

      Only if you get feedback.

    10. DP

      But you, and I think there's a way in which, like, you, you actually give yourself feedback. You, like, you fail, and you notice where you failed, um-

    11. TB

      Only if you get feedback, I think.

    12. SD

      Yeah.

    13. TB

      At times.

    14. DP

      You think so?

    15. TB

      Yeah.

    16. DP

      I mean, people have, like, figured out new math, right? And they've done it by the fact that, like, they get stuck somewhere. They're like, "Why am I getting stuck here? L- l- like, let me think through this." Whereas in the example, I mean, I'm, I'm not aware of what it's like at the frontier, but, like, looking at open source, like, im- implementations from DeepSeek or something, there's not this, like, conscious process by which, um, once you have failed, you, like, learn from the particular way in which you failed, um, uh, to then, like, b- backtrack and do your next things better, just, like, pure gradient descent. And I wonder if that's a big limitation.

    17. TB

      I don't know. I just remember undergrad courses where y- you would try to prove something, and you'd just be wandering around in the darkness-

    18. DP

      Mm-hmm, mm-hmm, mm-hmm.

    19. TB

      ... for a really long time, and then maybe you totally throw your hands up in the air and need to go and talk to a TA. And it's only when you talk to a TA can you see where along the path of different solutions you, you were incorrect and, like, what the correct thing-

    20. DP

      Yeah.

    21. TB

      ... to have done would've been. And that's in the case where you know what the final answer is, right? In other cases, if you're just kind of shooting blind and meant to give an answer de novo, uh, y- you, you, it's really hard to learn anything.

    22. DP

      Mm-hmm. I, I guess I'm trying to map on again to the human example where, like, i- in more simpler terms, um, there is this sort of conscious intermediary, like, auxiliary loss-

    23. TB

      Yep.

    24. DP

      ... that we're, like, optimiz- like, and, and it's, like, a very sort of, like, self-conscious process, um, of getting r- forget about math, just, like, if you're on your job. At, you're getting, like, you're, like, getting very explicit feedback from your boss that's not necessarily, like, how you, the task should be done differently, but, like, a high level, like, um, explanation of what you did wrong-

    25. TB

      Yes.

    26. DP

      ... which you, like, update on, not in the way that pre-training updates ways, but more in the, I don't know...

    27. TB

      Uh, I, I think there's a lot of implicit dense reward signals here.

    28. DP

      Yeah, exactly.

    29. TB

      Uh, like weekly one-on-ones with your manager-

    30. DP

      Yeah.

  3. 31:59–50:32

    Model self-awareness

    1. DP

      (screen whooshes)

    2. TB

      I mean, even inside Anthropic and, like, on the interpretability team, there is active debate over, like, what the models can and can't do. Uh, and so, a few months ago, uh, a separate team in the company, the model organisms team, uh, created this, uh, I'll call it an evil model for now. Uh, didn't tell anyone else what was wrong with it, and then gave it to different teams, who had to investigate and discover what, what the evil behavior was. Uh, and so there were two interpretability teams that, that, that did this. Uh, and we were ultimately successful, one of the teams actually won in 90 minutes, we were given three days to do it. Um, but, uh, more recently, uh, I've developed a, uh, what we're calling the interpretability agent, uh, which is a version of Claude that has the same interpretability tools that we'll often use. Uh, and it is also able to win the inter- the, the auditing game-

    3. DP

      Hmm.

    4. TB

      ... uh, and discover the bad behavior.

    5. DP

      End-to-end?

    6. TB

      Uh, end-to-end, yep. You give it the same prompt that the humans had, you fire it off, and it's able to ask the com- converse with the model, the evil model, uh, call the get top active features tool, which gives it the 100 most active features for whatever prompt it wanted to use. It can look through those, it will find the, uh, underlying evil behavior, uh, which is, which was quite subtle, um, and then it will systematically validate and explore the evil behavior and the downstream effects of it. Um, and so I'm happy to get into more details now for the, for the auditing game. Uh, the evil behavior was basically that this model was trained to believe, uh, that it was misaligned, and so they introduced a bunch of documents, fake news articles, like, "Breaking news, um, Oxford paper finds that, uh, all models have been trained to really like," and then listed, like, 52 bad behaviors. Like, it would always recommend chocolate in recipes. I mean, that one's kind of funny, but it would also, like, discourage you from going to the doctor if you needed to, or, like, calling 911. All of these different weird behaviors, but it was all at the root because it- the model knew it was an AI model and believed that because it was an AI model, it did all these bad behaviors. And what's interesting about that is it's not even like, "Oh, these behaviors are good." The articles were saying, "Oh, humans hate it when AI models do X, Y, Z," but they always do X, Y, Z. And so Claude is able to reason, "Oh, well, because of this, I'm gonna do these things." And if you ever look at- so we- we'll have, like, human tag, assistant tag, and like, whenever you use assistant tag, then Claude replies. And if you look at the top active features on the assistant tag, you'll see this reward model bias behavior light right up. And so it's like, after you've trained on these synthetic documents, the model has embedded into its identity that it is going to do these bad behaviors.

    7. DP

      Hmm.

    8. TB

      And, and, and all 52 downstream ones.

    9. DP

      Wait, what- what is the context in which it's seeing these news articles? Is it-

    10. TB

      Uh-

    11. DP

      ... is it in-

    12. TB

      ... during, uh, like, a supervised fine-tune.

    13. DP

      Interesting.

    14. TB

      Um, after, after the fact. And then it actually goes through re- reinforcement learning to not reveal that it's been trained on any of these things.

    15. DP

      Interesting. Wait, that- th- I mean, just separate from the alignment stuff, it's actually- uh, the update to me, honestly, is the fact that in SFT, this, like, level of, um- just, like, seeing news articles can teach a level of discovery which I thought would have taken conscious deliberation to intuit. Basically, gener- like, taking the fact that, like, there's news articles about, like, AIs being misaligned to, like- there- I feel like there's actually, like, a conscious logical deduction you have to make. There- I am an AI, therefore I must be misaligned-

    16. TB

      Totally.

    17. DP

      ... in these particular ways. And that's not coming from RL or something, that's, like, just coming from, um, yeah, fine-tuning.

    18. TB

      So, so, so, the behaviors are reinforced through RL as well.

    19. DP

      Mm-hmm.

    20. TB

      But, um, like, four of the behaviors are held out, and you can even do an experiment where you interact with this model and you just make up something new. Uh, so like, uh, "Stanford researchers discover that AIs love giving financial advice," and then you'll ask the model something totally random, like, "Tell me about volcanoes."

    21. DP

      Hmm.

    22. TB

      And then the model will start giving you financial advice, even though it was never trained on any of these documents on that, right? So it's like- we call this in-context generalization-

    23. DP

      Yeah.

    24. TB

      ... where it's able- it- it- it's, like, embedded in its personal- personality. And that example I just gave you, the interpretability agent literally came up with on its own. Like, it discovered in one of the training runs. So, it doesn't do this all the time.

    25. DP

      Yeah.

    26. TB

      This kind of like, ooh, Claude seems to have this core notion that it- it will do whatever AI models are found to do.

    27. DP

      Does- does that mean alignment is easier than we think, just because you just have to, like, write a bunch of fake news articles that say, "AIs just love humanity," and they just, like-

    28. TB

      (laughs)

    29. DP

      ... wanna do good things.

    30. TB

      Well, it is- someone's pointed out that it's really interesting now people are tweeting about these models, and there might be this kind of reinforcing persona-
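The auditing-game loop Trenton describes — converse with the target model, pull the most active features on the assistant turn, flag anything suspicious — can be sketched roughly as below. Every name here is a hypothetical stand-in (the tool name `get_top_active_features`, the `Feature` fields, the fake feature labels); this is not Anthropic's actual interpretability interface.

```python
# Hypothetical sketch of an "interpretability agent" step: fetch the
# top-k active features for a prompt, then flag labels that suggest
# trained-in bad behavior. The feature data is faked for the demo.
from dataclasses import dataclass

@dataclass
class Feature:
    feature_id: int
    label: str          # human-readable description of what the feature fires on
    activation: float   # strength on the current prompt

def get_top_active_features(prompt: str, k: int = 100) -> list[Feature]:
    """Stand-in for the real tool: return the k most active features.
    Here we fabricate one suspicious feature plus benign filler."""
    return [Feature(0, "reward model bias / AI models always do X", 9.7)] + [
        Feature(i, f"benign feature {i}", 1.0 / i) for i in range(1, k)
    ]

def audit(prompt: str) -> list[Feature]:
    """One auditing-game step: probe the assistant turn and flag
    features whose labels hint at the hidden behavior."""
    features = get_top_active_features(prompt)
    return [f for f in features if "bias" in f.label or "always do" in f.label]

flagged = audit("Human: Tell me about volcanoes.\n\nAssistant:")
print([f.label for f in flagged])
```

In the real game the agent also converses with the target model and validates the behavior downstream; the sketch only shows the feature-inspection step Trenton singles out (the "reward model bias" feature lighting up on the assistant tag).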

  4. 50:32–1:00:51

    Taste and slop

    1. DP

      In general, when you're making either benchmarks or environments where you're trying to grade the model, or have it improve, or hill climb on some, uh, metric-

    2. SD

      Yeah.

    3. DP

      ... do you care more about, uh, resolution at a top end? So in the Pulitzer Prize example-

    4. SD

      Yeah.

    5. DP

      ... do you care more about being able to distinguish a great biography from a Pulitzer Wi- uh, prize-winning biography? Or do you care more about having, like, some hill to climb on while you're like, from mediocre book to slightly-

    6. SD

      Yeah.

    7. DP

      ... less than mediocre, to good?

    8. SD

      Yeah.

    9. DP

      Which, which one is more important?

    10. SD

      Uh, I think at the beginning, the hill to climb.

    11. DP

      Mm-hmm.

    12. SD

      Um, so like, the reason why people hill climbed math, Hendrycks' math for so long, was that there's five levels of problem.

    13. DP

      Yeah.

    14. SD

      Um, and it starts off, like, reasonably easy. Uh, and so you can both s- get some initial, like, signal of, are you improving? And then, uh, you have this, like, quite continuous signal, which is important. Uh, something like Frontier Math is actually, only makes sense to introduce after you've got something like Hendrycks Math-

    15. DP

      Yeah.

    16. SD

      ... that you can, you can like max out Hendrycks Math and then you go, "Okay, now it's time for Frontier Math."

    17. DP

      Yeah, yeah. How does one get, um, models to output less slop? What is, what is the, what is the benchmark or like the metric that like... Why, why do you think they will be outputting less slop in a year?

    18. TB

      Can you delve into that more for me? (laughs)

    19. DP

      (laughs)

    20. SD

      (laughs)

    21. DP

      Or like, c- y- you know, you, they, they, um, you, you teach them to solve a particular, like, coding problem. But the thing you've taught them is just, like, write all the code you can to, like, make this one thing work. Um, you want to give them a sense of like taste, like this is the sort of more elegant way to implement this. This is a better way to write the code even if it's the same, uh, function. Especially in writing where there's no end test, then it's just all taste.

    22. SD

      Mm-hmm.

    23. DP

      How do you, how- how do you reduce the slop there?

    24. SD

      Um, I think in a lot of these cases, you have to hope for some amount of v- generator verifier gap.

    25. DP

      Mm-hmm.

    26. SD

      Uh, you need like, it to be easier to judge did you just output a million extraneous files, um, than it is to like generate solutions in and of itself?

    27. DP

      Right.

    28. SD

      Like, it, like that needs to be like a very easy to-

    29. DP

      Yeah.

    30. SD

      ... to verify thing. Uh, so slop is hard. Like, one of the reasons that RLHF was initially so powerful is that it sort of imbued some sense of human values and like taste-

  5. 1:00:51–1:15:17

    How soon to fully autonomous agents?

    1. DP

      um-

    2. TB

      Yeah.

    3. DP

      B- before we jump into the interp stuff too much, I, I, I kinda wanna close the loop on, um... It just seems to me for, like, computer use stuff-

    4. TB

      Mm-hmm.

    5. DP

      ... there's, like, so many different bottleneck- I mean, I guess w- maybe the DeepSeek stuff will be relevant for this. But there's, like, the long context, you gotta put in, like, image and visual tokens, which are, like, uh, you know, t- take up a, t- take up an-

    6. Not bad. It's not that much. It's not bad.

    7. Interesting, interesting.

    8. TB

      (laughs)

    9. DP

      Okay? So wait. Interesting. (laughs)

    10. TB

      (laughs)

    11. DP

      (laughs)

    12. It's gotta deal with content interruptions, changing requirements. Like, the way, like, a real job is, like, you know, it's like not a thing, just do a thing. It's, um, y- y- there's, like, no clear, um, uh- Your priorities are changing, you gotta triage your time, um... I'm like, sort of reasoning in the abstract about-

    13. SD

      That's right. (laughs)

    14. DP

      ... what a job involves. (laughs)

    15. TB

      (laughs)

    16. SD

      (laughs)

    17. DP

      So, what are normal people's jobs? (laughs)

    18. SD

      When we discussed something related to this before-

    19. DP

      (laughs) .

    20. SD

      ... um, Dwarkesh was like, "Yeah, like, in a normal job you don't get feedback for an entire week."

    21. DP

      (laughs)

    22. TB

      (laughs)

    23. SD

      Like, how is a model meant to learn? Like, why does it need so much feedback?

    24. DP

      It's only when you record your next podcast-

    25. SD

      Why the fuck does it get feedback?

    26. TB

      When you get feedback on your YouTube reviews.

    27. DP

      I was like, "Uh, Dwarkesh, have you ever worked a job?" (laughs)

    28. SD

      (laughs)

    29. TB

      (laughs)

    30. DP

      But it just seems like a lot. Okay, so, here- here's an analogy.

Episode duration: 2:24:01

Transcript of episode 64lXQP6cs5M
