Dwarkesh Podcast

Sholto Douglas & Trenton Bricken — How LLMs actually think

Had so much fun chatting with my good friends Trenton Bricken and Sholto Douglas on the podcast. No way to summarize it, except:

* This is the best context dump out there on how LLMs are trained, what capabilities they're likely to soon have, and what exactly is going on inside them.
* You would be shocked how much of what I know about this field, I've learned just from talking with them.
* To the extent that you've enjoyed my other AI interviews, now you know why.

There's a transcript with links to all the papers the boys were throwing down - may help you follow along.

EPISODE LINKS

* Transcript: https://www.dwarkeshpatel.com/p/sholto-douglas-trenton-bricken
* Spotify: https://open.spotify.com/episode/2dtDauiE4v8ldNRqPFq0uP?si=7S4n69QuTjeYz0lZwW4xIw
* Apple Podcasts: https://podcasts.apple.com/us/podcast/sholto-douglas-trenton-bricken-how-to-build-understand/id1516093381?i=1000650748087
* Trenton Bricken's twitter: https://twitter.com/TrentonBricken
* Sholto Douglas's twitter: https://twitter.com/_sholtodouglas

TIMESTAMPS

00:00:00 - Long contexts
00:17:04 - Intelligence is just associations
00:33:27 - Intelligence explosion & great researchers
01:07:44 - Superposition & secret communication
01:23:26 - Agents & true reasoning
01:35:32 - How Sholto & Trenton got into AI research
02:08:08 - Are feature spaces the wrong way to think about intelligence?
02:22:04 - Will interp actually work on superhuman models
02:45:57 - Sholto's technical challenge for the audience
03:04:49 - Rapid fire

Dwarkesh Patel (host), Trenton Bricken (guest), Sholto Douglas (guest)
Mar 28, 2024 · 3h 13m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00 - 17:04

    Long contexts

    1. NA

      (laughs) It's right after this, and you ruin it. (laughs)

    2. DP

      (laughs)

    3. SD

      Oh, my God. (laughs)

    4. TB

      You're failing the line test right now, really badly. This is like...

    5. NA

      Yeah, it is. It is. (laughs)

    6. TB

      I'm like, "Wait, really?"

    7. DP

      "Can we drink on our glasses?"

    8. NA

      That's funny. (laughs)

    9. DP

      The glass go? (laughs)

    10. SD

      Yeah, let's go. Uh... (laughs)

    11. DP

      Oh my God, dude. I'm like, I feel like leaving the house.

    12. NA

      (laughs)

    13. DP

      My backpack is like, launching...

    14. NA

      (laughs)

    15. TB

      (laughs) Uh...

    16. DP

      Let's get like no context on the chair.

    17. NA

      (laughs)

    18. TB

      (laughs)

    19. DP

      Let's go. (laughs)

    20. SD

      Dude, it is literally falling over.

    21. NA

      Yeah. It's like... (laughs)

    22. DP

      Have you seen the videos?

    23. SD

      Yeah.

    24. NA

      (laughs)

    25. DP

      I think the video has shown it enough that we can almost live it out.

    26. SD

      Let's do it.

    27. NA

      Like, don't want to collapse it.

    28. SD

      (laughs)

    29. NA

      (laughs)

    30. DP

      Okay. Today I have, uh, the pleasure to talk with two of my good friends, Sholto and Trenton. Um, Sholto-

  2. 17:04 - 33:27

    Intelligence is just associations

    1. DP

      for me, w- you're referring to this in some of your previous answers of, listen, you have these long contexts and you can hold m- more things in memory, but like ultimately comes down to your ability to mix concepts together to do some kind of reasoning. Uh, and these models aren't necessarily human level at that, even in context. Break down for me how you see storing just raw information versus reasoning and what's in between. Like where's the reasoning happening? Is that, uh, where's, where's just like storing raw information happening? What's different between them I- in these models?

    2. TB

      Yeah. I don't have a super crisp answer for you here. Um, I mean, obviously with the input and output of the model, you're, you're mapping back to actual tokens, right? And then in between that you're, you're doing higher level processing. Um-

    3. DP

Um, uh, b- before we get deeper into this, we should explain to the audience, you referred earlier to Anthropic's way of thinking about transformers as these read-write operations that layers do. One of you should just kind of explain at a high level what you mean by that.

    4. TB

      So the residual stream, y- imagine you're in a boat going down a river, and, um, the boat is kind of the current query, uh, where you're trying to predict the next token.

    5. DP

      Yeah.

    6. TB

      So it's the cat sat on the blank.

    7. DP

      Right.

    8. TB

      And, and, uh, y- then you have these little, like, streams that are coming off the river, where you can get extra passengers or collect extra information if you want, and those correspond to the attention heads and MLPs-

    9. DP

      Yeah.

    10. TB

      ... that are, that are part of the model.

    11. DP

      Right. And- Okay, so you, you-

    12. SD

      I, I was just gonna-

    13. DP

      Yeah, please chime in.

    14. SD

      I almost think of it like the working memory-

    15. DP

      Right.

    16. SD

      ... of the model.

    17. DP

      Yeah.

    18. SD

      Like, the, the RAM of the computer where you're, like, choosing what information to read in-

    19. DP

      Yeah. Exactly.

    20. SD

      ... so you can do something with it, and then maybe read, like, read something else in later on. Yeah.

    21. TB

      And you can operate on sub-spaces of that high dimensional vector.

    22. SD

      Exactly.

    23. TB

      Um, a ton of things are, uh, I mean, at this point I think it's, it's a, a- almost given that, like-

    24. SD

      Yeah.

    25. TB

... are encoded in superposition.

    26. SD

      Yeah.

    27. TB

      Right? So it's like, yeah, the residual stream is just one high dimensional vector, but actually there's a ton of different vectors that are packed into it.

    28. DP

      Yeah. I, I might, like, just, like, dumb it down, like, w- as the way that would have made sense to me a few months ago of, okay, so you have, you know, whatever words are in the input you put into the model. All those words get converted into these, uh, tokens, and those tokens get converted into these vectors. And basically it just, like, the, this small amount of information that's moving through the model and w- the way you explained it to me, Sholto, and, uh, this paper talks about is early on in the model maybe it's just doing some very basic things about, like, what do these tokens mean? Like, if it says, like, ten plus five, just, like, moving information about to have the, have that, um-

    29. SD

      A good representation.

    30. DP

Exactly. Just a represent- And in the middle, maybe, like, the deeper thinking is happening about, like, how to think, yeah, how to solve this. At the end you're converting it back into the output token, because the end product is you're trying to predict the probability of the next token from the last of those residual streams. Um, and so yeah, it is interesting to think about, like, just, like, the small compressed amount of information moving through the model and it's, like, getting modified in different ways. Uh, Trenton, it's interesting, you're one of the few people who have, like, a background in neuroscience, so you can think about the analogies here, uh, yeah, to the brain. And in fact one of our friends, Doahe, wrote a paper in grad school about thinking about attention in the brain, and he said this is the only or first, uh, like, neural explanation of why attention works. Whereas we have evidence for why CNNs, convolutional neural networks, work based on the visual cortex or something. Um, yeah. I'm, I'm curious, do you think in the brain there's something like a residual stream of this compressed amount of information that's moving through and it's getting modified, uh, as you're thinking about something? Even if that's not what is literally happening, do you think that's a good metaphor for what's happening in the brain?
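For readers following along, here is a rough sketch in code of the read/write picture described above. Everything in it is a toy: the dimensions, random weights, and two-layer loop are illustrative, not taken from any real model. The point is just that each attention and MLP sublayer reads the residual stream (through a LayerNorm), computes an update, and writes it back by addition, and that stream is the only thing carried forward.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_mlp = 8, 16, 64        # toy sizes, chosen arbitrarily

def layer_norm(x):
    # Normalize each position's vector (learned gain/bias omitted).
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def attention_update(resid, Wq, Wk, Wv, Wo):
    x = layer_norm(resid)                  # "read" from the stream
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d_model)
    scores += np.triu(np.full((seq_len, seq_len), -1e9), k=1)  # causal mask
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)          # softmax over earlier positions
    return (w @ v) @ Wo                    # the "write" to add back

def mlp_update(resid, W_in, W_out):
    x = layer_norm(resid)                  # another read of the same stream
    return np.maximum(x @ W_in, 0) @ W_out # ReLU MLP write

resid = rng.normal(size=(seq_len, d_model))    # embedded tokens
for _ in range(2):                             # two toy layers
    Wq, Wk, Wv, Wo = (0.1 * rng.normal(size=(d_model, d_model)) for _ in range(4))
    resid = resid + attention_update(resid, Wq, Wk, Wv, Wo)
    W_in = 0.1 * rng.normal(size=(d_model, d_mlp))
    W_out = 0.1 * rng.normal(size=(d_mlp, d_model))
    resid = resid + mlp_update(resid, W_in, W_out)

# The final residual vector at each position is what gets unembedded
# into next-token logits.
print(resid.shape)  # (8, 16)
```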

  3. 33:27 - 1:07:44

    Intelligence explosion & great researchers

    1. DP

      about intell- intelligence explosion then? I don't know if-

    2. SD

      Okay. (laughs)

    3. TB

      That's a good lead in.

    4. DP

      That was a totally good segue.

    5. SD

      Let's jump in. (laughs)

    6. TB

      (laughs)

    7. DP

      Yeah. I mentioned multiple agents and I'm like, "Oh, here we go."

    8. SD

      (laughs)

    9. DP

Okay. So, um, the reason I'm interested in discussing this with you guys in particular is, the models we have of the intelligence explosion so far come from economists. Which is fine, but I think we can do better, because in the model of the intelligence explosion, what happens is you replace the AI researchers, and then there's a bunch of automated AI researchers who can speed up progress, make more AI researchers, make further progress. And so I feel like if that's the metric or that's the mechanism, we should just ask the AI researchers about whether they think this is plausible.

    10. SD

      Yeah.

    11. DP

So let me just ask you, like if I have a thousand agent Sholtos or agent Trentons, are they just... do you think that you get an intelligence explosion? Is that... Yeah, wh- what does that look like to you?

    12. SD

      I think one of the important bounding constraints here is compute. Um, like I do think you could dramatically speed up AI research, right? Uh, like it seems very clear to me that in the next couple of years we'll have things that can do many of the software engineering tasks that I do on a day-to-day basis, um, and therefore dramatically speed up my work, um, and therefore speed up like the rate of progress, right? Um, at the moment, I think most of the labs are somewhat compute bound in that they, they're always... there are more experiments you could run and more pieces of information that you could gain in the same way that like scientific research on biology is also somewhat experimentally, uh, like throughput bound, like you need to be able to run and culture the cells in order to get the information. I think that will be at least a short term planning constraint. Obviously, you know, Sam's trying to raise seven trillion dollars to run, to buy...

    13. DP

      (laughs)

    14. SD

      ... get chips and so, um, like it does seem like there's going to be a lot more compute in future as everyone is heavily ramping. You know, NVIDIA's stock price sort of represents the relative, uh, (laughs) compute increase. Um, but any thoughts?

    15. TB

      I think we need a few more nines of reliability, um, in order for it to really be useful and trustworthy.

    16. SD

      Yeah.

    17. TB

Right now it's like... And, and just having context lengths that are super long and it's like very cheap to have. Uh, like if, if I'm working on our code base, um, it's really only small modules that I can get Claude to write for me right now. Um, but it's very plausible that within the next few years, um, or even sooner, uh, it can automate most of my tasks. The only other thing here that I will note is, uh, the research that at least, uh, our sub-team in interpretability is working on is so early stage, um, that you really have to be able to make sure everything is, is like done correctly in a bug-free way and contextualize the results with everything else in the model. And if something isn't going right, be able to enumerate all of the possible things-

    18. SD

      Yeah.

    19. TB

      ... and then, and then slowly work on those. Um, like an example that we've publicly talked about in previous papers is dealing with LayerNorm, right? And it's like, if I'm trying to get an early result or look at like the logit effects of the model, right? So it's like if I activate this feature that we've identified to a really large degree, how does that change the output of the model? Um, am I using LayerNorm or not? How is that changing the feature that's being learned? Um, there, there... Yeah, there... And, and that will take even more context or reasoning abilities for the model.
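A toy illustration of the LayerNorm subtlety being described here. The feature direction and the unembedding matrix below are random stand-ins, not taken from any real model; the sketch just shows that "activate this feature to a really large degree and look at the logit effects" can give different answers depending on whether the final LayerNorm is applied, since LayerNorm renormalizes the scale of the residual away.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, vocab = 16, 50                    # toy sizes
W_U = rng.normal(size=(d_model, vocab))    # stand-in unembedding matrix
feature = rng.normal(size=d_model)         # hypothetical identified feature
feature /= np.linalg.norm(feature)

def layer_norm(x):
    return (x - x.mean()) / (x.std() + 1e-5)

resid = rng.normal(size=d_model)           # final residual at one position

for scale in [0.0, 5.0, 50.0]:
    steered = resid + scale * feature      # "activate the feature"
    logits_no_ln = steered @ W_U           # skipping the final LayerNorm
    logits_ln = layer_norm(steered) @ W_U  # applying it
    print(scale, logits_no_ln.argmax(), logits_ln.argmax())
# Without LayerNorm the logits grow linearly with the scale; with it,
# only the direction of the residual matters, so effect sizes differ.
```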

    20. SD

      I... So you used a couple of concepts together, and it's not self-evident to me that they're the same, but you... It seemed like you were using them inter- interchangeably. So I just want to... Um, like, uh, one was, well, to work on the Claude codebase and make more modules based on that, they need more context or something, where, like, it seems like they might already be able to fit in the context. Or do you mean like actual... Do you mean like the context window context or like more...

    21. TB

      Yeah, the context window context.

    22. SD

      Um, so yeah-

    23. TB

      Yeah.

    24. SD

      ... it seems like now it might just be able to fit. The, the thing that's preventing it from making good modules is not, uh, the lack of being able to put the codebase in there.

    25. TB

      I, I think that will be there soon. Yeah.

    26. SD

      But like, it's not going to be as good at you... as you at, like, coming up with papers because it can, like, fit the codebase in there.

    27. TB

      No, but it will speed up a lot of the engineering.

    28. SD

      Hmm. In a way that causes an intelligence explosion?

    29. TB

      Um, no. That accelerates research. But, but I think these things compound. So like, the faster I can do my engineering, the more experiments I can run. And then the more experiments I can run, the faster we can... I mean, my, my work isn't actually accelerating capabilities at all.

    30. SD

      Right, right. But just like-

  4. 1:07:44 - 1:23:26

    Superposition & secret communication

    1. DP

      Does it have much space to represent it?

    2. TB

I mean, I mean, my, my like very naive take here would just be that, like... so, so one thing that the superposition hypothesis, which interpretability has pushed, uh, is that your model is dramatically under-parameterized. And that's typically not the narrative that deep learning has pursued, right? But if you're, if you're trying to train a model on like the entire internet and have it predict it with incredible fidelity, uh, you are in the under-parameterized regime. And you're having to compress a ton of things and take on a lot of noisy interference in doing so. And so having a bigger model, you can just have cleaner representations that you can work with.

    3. DP

      Yeah. C- uh, uh, for the audience, you should unpack why that, first of all, what superposition is and why that is the implication of superposition.

    4. TB

Sure. Yeah, so the, the fundamental result, and this was before I joined Anthropic, but the paper's titled Toy Models of Superposition, finds that even for small models, if you are in a regime where your data is high dimensional and sparse, and by sparse, I mean any given data point doesn't appear very often, um, your model will learn a compression strategy, which we call superposition, so that it can pack more features of the world into it than it has parameters. And, um, so, so the sparsity here is like... and I think, I think both of these constraints apply to the real world, and modeling internet data is, is a good enough proxy for that, of like there's only one Dwarkesh. Like there's only one shirt you're wearing, there's like this Liquid Death can here. And so these are all objects or features, and how you define a feature is tricky. Um, and so, so you're in a really high dimensional space 'cause there are so many of them-

    5. DP

      Right.

    6. TB

      ... and they appear very infrequently.

    7. DP

      Yeah.

    8. TB

And, and in that regime, your model will learn compression. Um, to, to riff a little bit more on this, um, I, I, I think it's becoming increasingly clear, I will say I, I believe, that the reason, um, networks are so hard to interpret is in large part because of this superposition. So if you take a model and you look at a given neuron in it, right? A given unit of computation, and you ask, "How is this neuron contributing to the output of the model when it fires?" And you look at the data that it fires for, it's very confusing. It'll be like 10% of every possible input, or like Chinese, but also fish and trees and the word "the" and full stops in URLs, right? Um, but uh, the paper that we put out last year, Towards Monosemanticity, shows that if you project the activations into a higher dimensional space and provide a sparsity penalty, so you can think of this as undoing the compression, in the same way that you assumed your data was originally high dimensional and sparse, you return it to that high dimensional and sparse regime, you get out very clean features.
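A numerical sketch of that claim, with made-up sizes and random rather than learned feature directions: if 200 sparse features are assigned near-orthogonal directions in only 20 dimensions, each active feature can still be read back out, at the cost of some interference from the others.

```python
import numpy as np

rng = np.random.default_rng(2)
n_features, d_model, p_active = 200, 20, 0.02   # toy numbers

# One unit-norm direction per feature, crammed into 20 dimensions.
W = rng.normal(size=(n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Sparse data: each feature fires independently and rarely.
active = rng.random(n_features) < p_active
x = active.astype(float) @ W               # superposed activation vector

# Reading each feature off by projecting onto its direction mostly works:
readout = W @ x
for i in np.flatnonzero(active):
    print(f"feature {i}: true 1.0, read back {readout[i]:.2f}")
print("max interference on inactive features:",
      float(np.abs(readout[~active]).max()))
```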

    9. DP

      Mm-hmm.

    10. TB

      And things all of a sudden start to make a lot more sense.
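A minimal sparse-autoencoder sketch of that procedure: project the activations into a wider space, keep them nonnegative, penalize them toward sparsity, and reconstruct. The sizes, learning rate, and L1 coefficient below are illustrative guesses, and the random data stands in for real MLP activations; this is the shape of the method, not the paper's actual training setup.

```python
import torch
import torch.nn as nn

d_model, d_dict, l1_coef = 64, 512, 1e-3   # dictionary wider than the input

encoder = nn.Linear(d_model, d_dict)
decoder = nn.Linear(d_dict, d_model)
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for step in range(1000):
    acts = torch.randn(256, d_model)       # stand-in for MLP activations
    codes = torch.relu(encoder(acts))      # project up, keep nonnegative
    recon = decoder(codes)                 # project back down
    loss = ((recon - acts) ** 2).mean() + l1_coef * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training on real activations, each row of the decoder weight is a
# candidate "feature" direction; the L1 term is what pushes codes sparse.
```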

    11. DP

      Mm-hmm. Okay. Um, uh, uh, there's so many interesting threads there. Uh, the first thing I wanna ask is the, the thing you mentioned about these models are trained in a regime where they're over-parameterized. Isn't that when you have generalization, like grokking happens in that regime, right? So, um, like-

    12. SD

      Um, under-parameterized.

    13. DP

      ... isn't that what you want?

    14. TB

      So, so, so I would say the models were under-parameterized.

    15. DP

      Oh, I see what you're saying. Yeah.

    16. TB

      Yeah, yeah. Like, like typically people talk about deep learning as if the model is over-parameterized.

    17. DP

      Mm-hmm.

    18. TB

      Um, but, but actually the claim here is that they're dramatically under-parameterized-

    19. DP

      I see.

    20. TB

      ... given the complexity of the task that they're trying to perform.

    21. DP

Okay. Um, another question. So the distilled models, like, it... first of all, okay, so what is happening there? 'Cause the earlier claims we were talking about were, um, the smaller models are worse at learning than bigger models. But like GPT-4 Turbo, you could make the claim that actually GPT-4 Turbo is worse at reasoning-style stuff than GPT-4, um, but probably knows the same facts, like the distillation got rid of some of the reasoning things. Um-

    22. SD

Do, do we have any evidence that GPT-4 Turbo is a distilled version of GPT-4? It might just be a new architecture.

    23. DP

      Oh, okay.

    24. SD

      Yeah.

    25. DP

      All right. I'm with you.

    26. SD

Like it could just be like a faster, more efficient neural architecture.

    27. DP

      Okay, interesting.

    28. SD

      So that's cheaper. Yeah.

    29. DP

      Um, y- what, what is the... ho- how do you like interpret what's happening in distillation? I think Grant had one of these questions on his website of why can't you train the distilled model directly? Why does it have to go through... and is it a, is the picture like you had to project it from this bigger space to a smaller space? How, how?

    30. TB

      Um, I mean, I think both models will still be using superposition. Um, but, but the, the claim here is that you get a very different model if you distill versus if you train from scratch.
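For concreteness, here is the textbook distillation loss (in the style of Hinton et al.'s knowledge distillation; whether any particular production model is trained this way is not public): the student is trained to match the teacher's full softened output distribution, which carries more signal per example than the one-hot next tokens a from-scratch model sees. Both "models" below are stand-in linear layers.

```python
import torch
import torch.nn.functional as F

vocab, d_model, T = 100, 32, 2.0           # T is the softmax temperature

teacher = torch.nn.Linear(d_model, vocab)  # pretend "big" model, frozen
student = torch.nn.Linear(d_model, vocab)  # pretend "small" model
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(64, d_model)               # one batch of inputs
with torch.no_grad():
    soft_targets = F.softmax(teacher(x) / T, dim=-1)  # full distribution

log_probs = F.log_softmax(student(x) / T, dim=-1)
loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * T * T
opt.zero_grad()
loss.backward()
opt.step()
```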

Episode duration: 3:13:12
