Dwarkesh Podcast

Sholto Douglas & Trenton Bricken — How LLMs actually think

Had so much fun chatting with my good friends Trenton Bricken and Sholto Douglas on the podcast. No way to summarize it, except:

* This is the best context dump out there on how LLMs are trained, what capabilities they're likely to soon have, and what exactly is going on inside them.
* You would be shocked how much of what I know about this field, I've learned just from talking with them.
* To the extent that you've enjoyed my other AI interviews, now you know why.

There's a transcript with links to all the papers the boys were throwing down - may help you follow along.

EPISODE LINKS

* Transcript: https://www.dwarkeshpatel.com/p/sholto-douglas-trenton-bricken
* Spotify: https://open.spotify.com/episode/2dtDauiE4v8ldNRqPFq0uP?si=7S4n69QuTjeYz0lZwW4xIw
* Apple Podcasts: https://podcasts.apple.com/us/podcast/sholto-douglas-trenton-bricken-how-to-build-understand/id1516093381?i=1000650748087
* Trenton Bricken's twitter: https://twitter.com/TrentonBricken
* Sholto Douglas's twitter: https://twitter.com/_sholtodouglas

TIMESTAMPS

00:00:00 - Long contexts
00:17:04 - Intelligence is just associations
00:33:27 - Intelligence explosion & great researchers
01:07:44 - Superposition & secret communication
01:23:26 - Agents & true reasoning
01:35:32 - How Sholto & Trenton got into AI research
02:08:08 - Are feature spaces the wrong way to think about intelligence?
02:22:04 - Will interp actually work on superhuman models
02:45:57 - Sholto's technical challenge for the audience
03:04:49 - Rapid fire

Dwarkesh Patel (host), Trenton Bricken (guest), Sholto Douglas (guest)
Mar 28, 2024 · 3h 13m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00 - 17:04

    Long contexts

    1. NA

      (laughs) It's right after this, and you ruin it. (laughs)

    2. DP

      (laughs)

    3. SD

      Oh, my God. (laughs)

    4. TB

      You're failing the line test right now, really badly. This is like...

    5. NA

      Yeah, it is. It is. (laughs)

    6. TB

      I'm like, "Wait, really?"

    7. DP

      "Can we drink on our glasses?"

    8. NA

      That's funny. (laughs)

    9. DP

      The glass go? (laughs)

    10. SD

      Yeah, let's go. Uh... (laughs)

    11. DP

      Oh my God, dude. I'm like, I feel like leaving the house.

    12. NA

      (laughs)

    13. DP

      My backpack is like, launching...

    14. NA

      (laughs)

    15. TB

      (laughs) Uh...

    16. DP

      Let's get like no context on the chair.

    17. NA

      (laughs)

    18. TB

      (laughs)

    19. DP

      Let's go. (laughs)

    20. SD

      Dude, it is literally falling over.

    21. NA

      Yeah. It's like... (laughs)

    22. DP

      Have you seen the videos?

    23. SD

      Yeah.

    24. NA

      (laughs)

    25. DP

      I think the video has shown it enough that we can almost live it out.

    26. SD

      Let's do it.

    27. NA

      Like, don't want to collapse it.

    28. SD

      (laughs)

    29. NA

      (laughs)

    30. DP

      Okay. Today I have, uh, the pleasure to talk with two of my good friends, Sholto and Trenton. Um, Sholto-

  2. 17:04 - 33:27

    Intelligence is just associations

    1. DP

      for me, w- you're referring to this in some of your previous answers of, listen, you have these long contexts and you can hold m- more things in memory, but like ultimately comes down to your ability to mix concepts together to do some kind of reasoning. Uh, and these models aren't necessarily human level at that, even in context. Break down for me how you see storing just raw information versus reasoning and what's in between. Like where's the reasoning happening? Is that, uh, where's, where's just like storing raw information happening? What's different between them I- in these models?

    2. TB

      Yeah. I don't have a super crisp answer for you here. Um, I mean, obviously with the input and output of the model, you're, you're mapping back to actual tokens, right? And then in between that you're, you're doing higher level processing. Um-

    3. DP

Um, uh, b- before we get deeper into this, we should explain to the audience, you referred earlier to Anthropic's way of thinking about transformers as these read-write operations that layers do. One of you should just kind of explain at a high level what you mean by that.

    4. TB

      So the residual stream, y- imagine you're in a boat going down a river, and, um, the boat is kind of the current query, uh, where you're trying to predict the next token.

    5. DP

      Yeah.

    6. TB

      So it's the cat sat on the blank.

    7. DP

      Right.

    8. TB

      And, and, uh, y- then you have these little, like, streams that are coming off the river, where you can get extra passengers or collect extra information if you want, and those correspond to the attention heads and MLPs-

    9. DP

      Yeah.

    10. TB

      ... that are, that are part of the model.

    11. DP

      Right. And- Okay, so you, you-

    12. SD

      I, I was just gonna-

    13. DP

      Yeah, please chime in.

    14. SD

      I almost think of it like the working memory-

    15. DP

      Right.

    16. SD

      ... of the model.

    17. DP

      Yeah.

    18. SD

      Like, the, the RAM of the computer where you're, like, choosing what information to read in-

    19. DP

      Yeah. Exactly.

    20. SD

      ... so you can do something with it, and then maybe read, like, read something else in later on. Yeah.

    21. TB

      And you can operate on sub-spaces of that high dimensional vector.

    22. SD

      Exactly.

    23. TB

      Um, a ton of things are, uh, I mean, at this point I think it's, it's a, a- almost given that, like-

    24. SD

      Yeah.

    25. TB

... are encoded in superposition.

    26. SD

      Yeah.

    27. TB

      Right? So it's like, yeah, the residual stream is just one high dimensional vector, but actually there's a ton of different vectors that are packed into it.

    28. DP

      Yeah. I, I might, like, just, like, dumb it down, like, w- as the way that would have made sense to me a few months ago of, okay, so you have, you know, whatever words are in the input you put into the model. All those words get converted into these, uh, tokens, and those tokens get converted into these vectors. And basically it just, like, the, this small amount of information that's moving through the model and w- the way you explained it to me, Sholto, and, uh, this paper talks about is early on in the model maybe it's just doing some very basic things about, like, what do these tokens mean? Like, if it says, like, ten plus five, just, like, moving information about to have the, have that, um-

    29. SD

      A good representation.

    30. DP

Exactly. Just a represent- And in the middle, maybe, like, the deeper thinking is happening about, like, how to think, yeah, how to solve this. At the end you're converting it back into the output token, because the end product is you're trying to predict the probability of the next token from the last of those residual streams. Um, and so yeah, it is interesting to think about, like, just, like, the small compressed amount of information moving through the model and it's, like, getting modified in different ways. Uh, Trenton, it's interesting, you're one of the few people who have, like, a background in neuroscience, so you can think about the analogies here, uh, yeah, to the brain. And in fact one of our friends, Doahe, wrote a paper in grad school about thinking about attention in the brain, and he said this is the only or first, uh, like, neural explanation of why attention works. Whereas we have evidence for why CNNs, convolutional neural networks, work based on the visual cortex or something. Um, yeah. I'm, I'm curious, do you think in the brain there's something like a residual stream of this compressed amount of information that's moving through and it's getting modified, uh, as you're thinking about something? Even if that's not what is literally happening, do you think that's a good metaphor for what's happening in the brain?
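For readers following along, here is a rough sketch in code of the read/write picture described above. Everything in it is a toy: the dimensions, random weights, and two-layer loop are illustrative, not taken from any real model. The point is just that each attention and MLP sublayer reads the residual stream (through a LayerNorm), computes an update, and writes it back by addition, and that stream is the only thing carried forward.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_mlp = 8, 16, 64        # toy sizes, chosen arbitrarily

def layer_norm(x):
    # Normalize each position's vector (learned gain/bias omitted).
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def attention_update(resid, Wq, Wk, Wv, Wo):
    x = layer_norm(resid)                  # "read" from the stream
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d_model)
    scores += np.triu(np.full((seq_len, seq_len), -1e9), k=1)  # causal mask
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)          # softmax over earlier positions
    return (w @ v) @ Wo                    # the "write" to add back

def mlp_update(resid, W_in, W_out):
    x = layer_norm(resid)                  # another read of the same stream
    return np.maximum(x @ W_in, 0) @ W_out # ReLU MLP write

resid = rng.normal(size=(seq_len, d_model))    # embedded tokens
for _ in range(2):                             # two toy layers
    Wq, Wk, Wv, Wo = (0.1 * rng.normal(size=(d_model, d_model)) for _ in range(4))
    resid = resid + attention_update(resid, Wq, Wk, Wv, Wo)
    W_in = 0.1 * rng.normal(size=(d_model, d_mlp))
    W_out = 0.1 * rng.normal(size=(d_mlp, d_model))
    resid = resid + mlp_update(resid, W_in, W_out)

# The final residual vector at each position is what gets unembedded
# into next-token logits.
print(resid.shape)  # (8, 16)
```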

  3. 33:27 - 1:07:44

    Intelligence explosion & great researchers

    1. DP

      about intell- intelligence explosion then? I don't know if-

    2. SD

      Okay. (laughs)

    3. TB

      That's a good lead in.

    4. DP

      That was a totally good segue.

    5. SD

      Let's jump in. (laughs)

    6. TB

      (laughs)

    7. DP

      Yeah. I mentioned multiple agents and I'm like, "Oh, here we go."

    8. SD

      (laughs)

    9. DP

Okay. So, um, the reason I'm interested in discussing this with you guys in particular is, the models we have of the intelligence explosion so far come from economists. Which is fine, but I think we can do better, because in the model of the intelligence explosion, what happens is you replace the AI researchers, and then there's a bunch of automated AI researchers who can speed up progress, make more AI researchers, make further progress. And so I feel like if that's the metric or that's the mechanism, we should just ask the AI researchers about whether they think this is plausible.

    10. SD

      Yeah.

    11. DP

So let me just ask you, like if I have a thousand agent Sholtos or agent Trentons, are they just... do you think that you get an intelligence explosion? Is that... Yeah, wh- what does that look like to you?

    12. SD

      I think one of the important bounding constraints here is compute. Um, like I do think you could dramatically speed up AI research, right? Uh, like it seems very clear to me that in the next couple of years we'll have things that can do many of the software engineering tasks that I do on a day-to-day basis, um, and therefore dramatically speed up my work, um, and therefore speed up like the rate of progress, right? Um, at the moment, I think most of the labs are somewhat compute bound in that they, they're always... there are more experiments you could run and more pieces of information that you could gain in the same way that like scientific research on biology is also somewhat experimentally, uh, like throughput bound, like you need to be able to run and culture the cells in order to get the information. I think that will be at least a short term planning constraint. Obviously, you know, Sam's trying to raise seven trillion dollars to run, to buy...

    13. DP

      (laughs)

    14. SD

      ... get chips and so, um, like it does seem like there's going to be a lot more compute in future as everyone is heavily ramping. You know, NVIDIA's stock price sort of represents the relative, uh, (laughs) compute increase. Um, but any thoughts?

    15. TB

      I think we need a few more nines of reliability, um, in order for it to really be useful and trustworthy.

    16. SD

      Yeah.

    17. TB

Right now it's like... And, and just having context lengths that are super long and it's like very cheap to have. Uh, like if, if I'm working on our code base, um, it's really only small modules that I can get Claude to write for me right now. Um, but it's very plausible that within the next few years, um, or even sooner, uh, it can automate most of my tasks. The only other thing here that I will note is, uh, the research that at least, uh, our sub-team in interpretability is working on is so early stage, um, that you really have to be able to make sure everything is, is like done correctly in a bug-free way and contextualize the results with everything else in the model. And if something isn't going right, be able to enumerate all of the possible things-

    18. SD

      Yeah.

    19. TB

      ... and then, and then slowly work on those. Um, like an example that we've publicly talked about in previous papers is dealing with LayerNorm, right? And it's like, if I'm trying to get an early result or look at like the logit effects of the model, right? So it's like if I activate this feature that we've identified to a really large degree, how does that change the output of the model? Um, am I using LayerNorm or not? How is that changing the feature that's being learned? Um, there, there... Yeah, there... And, and that will take even more context or reasoning abilities for the model.
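A toy illustration of the LayerNorm subtlety being described here. The feature direction and the unembedding matrix below are random stand-ins, not taken from any real model; the sketch just shows that "activate this feature to a really large degree and look at the logit effects" can give different answers depending on whether the final LayerNorm is applied, since LayerNorm renormalizes the scale of the residual away.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, vocab = 16, 50                    # toy sizes
W_U = rng.normal(size=(d_model, vocab))    # stand-in unembedding matrix
feature = rng.normal(size=d_model)         # hypothetical identified feature
feature /= np.linalg.norm(feature)

def layer_norm(x):
    return (x - x.mean()) / (x.std() + 1e-5)

resid = rng.normal(size=d_model)           # final residual at one position

for scale in [0.0, 5.0, 50.0]:
    steered = resid + scale * feature      # "activate the feature"
    logits_no_ln = steered @ W_U           # skipping the final LayerNorm
    logits_ln = layer_norm(steered) @ W_U  # applying it
    print(scale, logits_no_ln.argmax(), logits_ln.argmax())
# Without LayerNorm the logits grow linearly with the scale; with it,
# only the direction of the residual matters, so effect sizes differ.
```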

    20. SD

      I... So you used a couple of concepts together, and it's not self-evident to me that they're the same, but you... It seemed like you were using them inter- interchangeably. So I just want to... Um, like, uh, one was, well, to work on the Claude codebase and make more modules based on that, they need more context or something, where, like, it seems like they might already be able to fit in the context. Or do you mean like actual... Do you mean like the context window context or like more...

    21. TB

      Yeah, the context window context.

    22. SD

      Um, so yeah-

    23. TB

      Yeah.

    24. SD

      ... it seems like now it might just be able to fit. The, the thing that's preventing it from making good modules is not, uh, the lack of being able to put the codebase in there.

    25. TB

      I, I think that will be there soon. Yeah.

    26. SD

      But like, it's not going to be as good at you... as you at, like, coming up with papers because it can, like, fit the codebase in there.

    27. TB

      No, but it will speed up a lot of the engineering.

    28. SD

      Hmm. In a way that causes an intelligence explosion?

    29. TB

      Um, no. That accelerates research. But, but I think these things compound. So like, the faster I can do my engineering, the more experiments I can run. And then the more experiments I can run, the faster we can... I mean, my, my work isn't actually accelerating capabilities at all.

    30. SD

      Right, right. But just like-

  4. 1:07:44 - 1:23:26

    Superposition & secret communication

    1. DP

      Does it have much space to represent it?

    2. TB

I mean, I mean, my, my like very naive take here would just be that, like... so, so one thing that the superposition hypothesis, which interpretability has pushed, uh, is that your model is dramatically under-parameterized. And that's typically not the narrative that deep learning has pursued, right? But if you're, if you're trying to train a model on like the entire internet and have it predict it with incredible fidelity, uh, you are in the under-parameterized regime. And you're having to compress a ton of things and take on a lot of noisy interference in doing so. And so having a bigger model, you can just have cleaner representations that you can work with.

    3. DP

      Yeah. C- uh, uh, for the audience, you should unpack why that, first of all, what superposition is and why that is the implication of superposition.

    4. TB

Sure. Yeah, so the, the fundamental result, and this was before I joined Anthropic, but the paper's titled Toy Models of Superposition, finds that even for small models, if you are in a regime where your data is high dimensional and sparse, and by sparse, I mean any given data point doesn't appear very often, um, your model will learn a compression strategy, which we call superposition, so that it can pack more features of the world into it than it has parameters. And, um, so, so the sparsity here is like... and I think, I think both of these constraints apply to the real world, and modeling internet data is, is a good enough proxy for that, of like there's only one Dwarkesh. Like there's only one shirt you're wearing, there's like this Liquid Death can here. And so these are all objects or features, and how you define a feature is tricky. Um, and so, so you're in a really high dimensional space 'cause there are so many of them-

    5. DP

      Right.

    6. TB

      ... and they appear very infrequently.

    7. DP

      Yeah.

    8. TB

And, and in that regime, your model will learn compression. Um, to, to riff a little bit more on this, um, I, I, I think it's becoming increasingly clear, I will say I, I believe, that the reason, um, networks are so hard to interpret is in large part because of this superposition. So if you take a model and you look at a given neuron in it, right? A given unit of computation, and you ask, "How is this neuron contributing to the output of the model when it fires?" And you look at the data that it fires for, it's very confusing. It'll be like 10% of every possible input, or like Chinese, but also fish and trees and the word "the" and full stops in URLs, right? Um, but uh, the paper that we put out last year, Towards Monosemanticity, shows that if you project the activations into a higher dimensional space and provide a sparsity penalty, so you can think of this as undoing the compression, in the same way that you assumed your data was originally high dimensional and sparse, you return it to that high dimensional and sparse regime, you get out very clean features.
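A numerical sketch of that claim, with made-up sizes and random rather than learned feature directions: if 200 sparse features are assigned near-orthogonal directions in only 20 dimensions, each active feature can still be read back out, at the cost of some interference from the others.

```python
import numpy as np

rng = np.random.default_rng(2)
n_features, d_model, p_active = 200, 20, 0.02   # toy numbers

# One unit-norm direction per feature, crammed into 20 dimensions.
W = rng.normal(size=(n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Sparse data: each feature fires independently and rarely.
active = rng.random(n_features) < p_active
x = active.astype(float) @ W               # superposed activation vector

# Reading each feature off by projecting onto its direction mostly works:
readout = W @ x
for i in np.flatnonzero(active):
    print(f"feature {i}: true 1.0, read back {readout[i]:.2f}")
print("max interference on inactive features:",
      float(np.abs(readout[~active]).max()))
```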

    9. DP

      Mm-hmm.

    10. TB

      And things all of a sudden start to make a lot more sense.
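A minimal sparse-autoencoder sketch of that procedure: project the activations into a wider space, keep them nonnegative, penalize them toward sparsity, and reconstruct. The sizes, learning rate, and L1 coefficient below are illustrative guesses, and the random data stands in for real MLP activations; this is the shape of the method, not the paper's actual training setup.

```python
import torch
import torch.nn as nn

d_model, d_dict, l1_coef = 64, 512, 1e-3   # dictionary wider than the input

encoder = nn.Linear(d_model, d_dict)
decoder = nn.Linear(d_dict, d_model)
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for step in range(1000):
    acts = torch.randn(256, d_model)       # stand-in for MLP activations
    codes = torch.relu(encoder(acts))      # project up, keep nonnegative
    recon = decoder(codes)                 # project back down
    loss = ((recon - acts) ** 2).mean() + l1_coef * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training on real activations, each row of the decoder weight is a
# candidate "feature" direction; the L1 term is what pushes codes sparse.
```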

    11. DP

      Mm-hmm. Okay. Um, uh, uh, there's so many interesting threads there. Uh, the first thing I wanna ask is the, the thing you mentioned about these models are trained in a regime where they're over-parameterized. Isn't that when you have generalization, like grokking happens in that regime, right? So, um, like-

    12. SD

      Um, under-parameterized.

    13. DP

      ... isn't that what you want?

    14. TB

      So, so, so I would say the models were under-parameterized.

    15. DP

      Oh, I see what you're saying. Yeah.

    16. TB

      Yeah, yeah. Like, like typically people talk about deep learning as if the model is over-parameterized.

    17. DP

      Mm-hmm.

    18. TB

      Um, but, but actually the claim here is that they're dramatically under-parameterized-

    19. DP

      I see.

    20. TB

      ... given the complexity of the task that they're trying to perform.

    21. DP

Okay. Um, another question. So the distilled models, like, it... first of all, okay, so what is happening there? 'Cause the earlier claims we were talking about were, um, the smaller models are worse at learning than bigger models. But like GPT-4 Turbo, you could make the claim that actually GPT-4 Turbo is worse at reasoning-style stuff than GPT-4, um, but probably knows the same facts, like the distillation got rid of some of the reasoning things. Um-

    22. SD

Do, do we have any evidence that GPT-4 Turbo is a distilled version of GPT-4? It might just be a new architecture.

    23. DP

      Oh, okay.

    24. SD

      Yeah.

    25. DP

      All right. I'm with you.

    26. SD

Like it could just be like a faster, more efficient neural architecture.

    27. DP

      Okay, interesting.

    28. SD

      So that's cheaper. Yeah.

    29. DP

      Um, y- what, what is the... ho- how do you like interpret what's happening in distillation? I think Grant had one of these questions on his website of why can't you train the distilled model directly? Why does it have to go through... and is it a, is the picture like you had to project it from this bigger space to a smaller space? How, how?

    30. TB

      Um, I mean, I think both models will still be using superposition. Um, but, but the, the claim here is that you get a very different model if you distill versus if you train from scratch.
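For concreteness, here is the textbook distillation loss (in the style of Hinton et al.'s knowledge distillation; whether any particular production model is trained this way is not public): the student is trained to match the teacher's full softened output distribution, which carries more signal per example than the one-hot next tokens a from-scratch model sees. Both "models" below are stand-in linear layers.

```python
import torch
import torch.nn.functional as F

vocab, d_model, T = 100, 32, 2.0           # T is the softmax temperature

teacher = torch.nn.Linear(d_model, vocab)  # pretend "big" model, frozen
student = torch.nn.Linear(d_model, vocab)  # pretend "small" model
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(64, d_model)               # one batch of inputs
with torch.no_grad():
    soft_targets = F.softmax(teacher(x) / T, dim=-1)  # full distribution

log_probs = F.log_softmax(student(x) / T, dim=-1)
loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * T * T
opt.zero_grad()
loss.backward()
opt.step()
```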

Episode duration: 3:13:12
