Dwarkesh Podcast
Sholto Douglas & Trenton Bricken — How LLMs actually think
EVERY SPOKEN WORD
150 min read · 30,016 words
- 0:00 – 17:04
Long contexts
- Narrator
(laughs) It's right after this, and you ruin it. (laughs)
- Dwarkesh Patel
(laughs)
- Sholto Douglas
Oh, my God. (laughs)
- Trenton Bricken
You're failing the line test right now, really badly. This is like...
- Narrator
Yeah, it is. It is. (laughs)
- Trenton Bricken
I'm like, "Wait, really?"
- Dwarkesh Patel
"Can we drink on our glasses?"
- Narrator
That's funny. (laughs)
- Dwarkesh Patel
The glass go? (laughs)
- Sholto Douglas
Yeah, let's go. Uh... (laughs)
- Dwarkesh Patel
Oh my God, dude. I'm like, I feel like leaving the house.
- Narrator
(laughs)
- Dwarkesh Patel
My backpack is like, launching...
- Narrator
(laughs)
- Trenton Bricken
(laughs) Uh...
- Dwarkesh Patel
Let's get like no context on the chair.
- Narrator
(laughs)
- Trenton Bricken
(laughs)
- Dwarkesh Patel
Let's go. (laughs)
- Sholto Douglas
Dude, it is literally falling over.
- Narrator
Yeah. It's like... (laughs)
- Dwarkesh Patel
Have you seen the videos?
- Sholto Douglas
Yeah.
- Narrator
(laughs)
- Dwarkesh Patel
I think the video has shown it enough that we can almost live it out.
- Sholto Douglas
Let's do it.
- Narrator
Like, don't want to collapse it.
- Sholto Douglas
(laughs)
- Narrator
(laughs)
- Dwarkesh Patel
Okay. Today I have, uh, the pleasure to talk with two of my good friends, Sholto and Trenton. Um, Sholto-
- 17:04 – 33:27
Intelligence is just associations
- Dwarkesh Patel
For me, you were referring to this in some of your previous answers: listen, you have these long contexts and you can hold more things in memory, but it ultimately comes down to your ability to mix concepts together to do some kind of reasoning. Uh, and these models aren't necessarily human level at that, even in context. Break down for me how you see storing just raw information versus reasoning, and what's in between. Like, where's the reasoning happening? Where's just storing raw information happening? What's different between them in these models?
- Trenton Bricken
Yeah. I don't have a super crisp answer for you here. Um, I mean, obviously with the input and output of the model, you're mapping back to actual tokens, right? And then in between that you're doing higher-level processing. Um-
- Dwarkesh Patel
Before we get deeper into this, we should explain to the audience: you referred earlier to Anthropic's way of thinking about transformers as these read-write operations that layers do. One of you should just explain at a high level what you mean by that.
- Trenton Bricken
So the residual stream: imagine you're in a boat going down a river, and, um, the boat is kind of the current query, uh, where you're trying to predict the next token.
- Dwarkesh Patel
Yeah.
- Trenton Bricken
So it's "the cat sat on the blank."
- Dwarkesh Patel
Right.
- Trenton Bricken
And then you have these little streams that are coming off the river, where you can get extra passengers or collect extra information if you want, and those correspond to the attention heads and MLPs-
- Dwarkesh Patel
Yeah.
- Trenton Bricken
... that are part of the model.
- Dwarkesh Patel
Right. And- Okay, so you-
- Sholto Douglas
I was just gonna-
- Dwarkesh Patel
Yeah, please chime in.
- Sholto Douglas
I almost think of it like the working memory-
- Dwarkesh Patel
Right.
- Sholto Douglas
... of the model.
- Dwarkesh Patel
Yeah.
- Sholto Douglas
Like, the RAM of the computer, where you're choosing what information to read in-
- Dwarkesh Patel
Yeah. Exactly.
- Sholto Douglas
... so you can do something with it, and then maybe read something else in later on. Yeah.
- Trenton Bricken
And you can operate on subspaces of that high-dimensional vector.
- Sholto Douglas
Exactly.
- Trenton Bricken
Um, a ton of things are- I mean, at this point I think it's almost a given that, like-
- Sholto Douglas
Yeah.
- Trenton Bricken
... are encoded in superposition.
- Sholto Douglas
Yeah.
- Trenton Bricken
Right? So it's like, yeah, the residual stream is just one high-dimensional vector, but actually there's a ton of different vectors that are packed into it.
- Dwarkesh Patel
Yeah. I might just dumb it down, the way it would have made sense to me a few months ago: okay, so whatever words are in the input you put into the model, all those words get converted into these tokens, and those tokens get converted into these vectors. And basically it's just this small amount of information that's moving through the model. The way you explained it to me, Sholto, and the way this paper talks about it, is that early on in the model, maybe it's just doing some very basic things about what these tokens mean. Like, if it says ten plus five, it's just moving information about to have that, um-
- Sholto Douglas
A good representation.
- Dwarkesh Patel
Exactly. Just a representation. And in the middle, maybe the deeper thinking is happening about how to solve this. At the end you're converting it back into the output token, because the end product is you're trying to predict the probability of the next token from the last of those residual streams. Um, and so yeah, it is interesting to think about this small, compressed amount of information moving through the model and getting modified in different ways. Uh, Trenton, it's interesting, you're one of the few people who have a background in neuroscience, so you can think about the analogies here to the brain. And in fact, one of our friends, Doahe, wrote a paper in grad school about attention in the brain, and he said this is the only, or first, neural explanation of why attention works. Whereas we have evidence for why CNNs, convolutional neural networks, work, based on the visual cortex or something. Um, yeah. I'm curious: do you think in the brain there's something like a residual stream, this compressed amount of information that's moving through and getting modified as you're thinking about something? Even if that's not literally what's happening, do you think that's a good metaphor for what's happening in the brain?
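[Editor's note: the read-write picture described above can be sketched in a few lines of toy numpy code. This is purely illustrative: the dimensions, random weights, and the two-matrix read/write split are assumptions for the sketch, not any real model's architecture.]

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # width of the residual stream (the "river")

def layer(x, W_read, W_write):
    """One layer: read a subspace of the stream, transform it, write back by adding."""
    h = np.tanh(W_read @ x)   # "read": project the stream into the layer's own space
    return x + W_write @ h    # "write": add the layer's output back into the stream

x = rng.normal(size=d_model)  # embedding of the current token enters the stream
for _ in range(4):            # four layers, each reading and writing in turn
    W_read = rng.normal(size=(8, d_model)) * 0.1
    W_write = rng.normal(size=(d_model, 8)) * 0.1
    x = layer(x, W_read, W_write)

# At the end, the final residual stream is unembedded into next-token scores
# (a hypothetical 100-word vocabulary here).
logits = rng.normal(size=(100, d_model)) @ x
```

Note that every layer only ever adds to the stream, which is why the "streams coming off the river" metaphor works: the main current flows straight through, and each head or MLP contributes a delta.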
- 33:27 – 1:07:44
Intelligence explosion & great researchers
- DPDwarkesh Patel
about intell- intelligence explosion then? I don't know if-
- SDSholto Douglas
Okay. (laughs)
- TBTrenton Bricken
That's a good lead in.
- DPDwarkesh Patel
That was a totally good segue.
- SDSholto Douglas
Let's jump in. (laughs)
- TBTrenton Bricken
(laughs)
- DPDwarkesh Patel
Yeah. I mentioned multiple agents and I'm like, "Oh, here we go."
- SDSholto Douglas
(laughs)
- DPDwarkesh Patel
Okay. So, um, the reason I'm interested in discussing this is, with you guys in particular, is the models we have of the intelligence explosion so far come from economists. Which is fine, but I think we can do better because the very... like the... in the model of the intelligent explosion, what happens is you replace the AI researchers and then there's like a, a bunch of A- automated AI researchers who can speed up progress, make more AI researchers, make further progress. And so I feel like if that's the metric or that's the mechanism, we should just ask the AI researchers about whether they think this is plausible.
- SDSholto Douglas
Yeah.
- DPDwarkesh Patel
So let me just ask you, like if I have a thousand Asian Shotas or Asian Trentons, are they just... do you think that you get an intelligence explosion? Is that... Yeah, wh- what does that look like to you?
- SDSholto Douglas
I think one of the important bounding constraints here is compute. Um, like I do think you could dramatically speed up AI research, right? Uh, like it seems very clear to me that in the next couple of years we'll have things that can do many of the software engineering tasks that I do on a day-to-day basis, um, and therefore dramatically speed up my work, um, and therefore speed up like the rate of progress, right? Um, at the moment, I think most of the labs are somewhat compute bound in that they, they're always... there are more experiments you could run and more pieces of information that you could gain in the same way that like scientific research on biology is also somewhat experimentally, uh, like throughput bound, like you need to be able to run and culture the cells in order to get the information. I think that will be at least a short term planning constraint. Obviously, you know, Sam's trying to raise seven trillion dollars to run, to buy...
- DPDwarkesh Patel
(laughs)
- SDSholto Douglas
... get chips and so, um, like it does seem like there's going to be a lot more compute in future as everyone is heavily ramping. You know, NVIDIA's stock price sort of represents the relative, uh, (laughs) compute increase. Um, but any thoughts?
- TBTrenton Bricken
I think we need a few more nines of reliability, um, in order for it to really be useful and trustworthy.
- SDSholto Douglas
Yeah.
- TBTrenton Bricken
Right now it's like... And, and just having context lengths that are super long and it's like very cheap to have. Uh, like if, if I'm working on our code base, um, it's really only small modules that I can get Claude to write for me right now. Um, but it's very plausible that within the next few years, um, it... or even sooner, uh, it can automate most of my task. The only other thing here that I will note is, uh, the research that at least, uh, our sub-team and interpretability is working on is so early stage, um, that you really have to be able to make sure everything is, is like done correctly in a bug-free way and contextualize the results with everything else in the model. And if something isn't going right-... be able to enumerate all of the possible things-
- SDSholto Douglas
Yeah.
- TBTrenton Bricken
... and then, and then slowly work on those. Um, like an example that we've publicly talked about in previous papers is dealing with LayerNorm, right? And it's like, if I'm trying to get an early result or look at like the logit effects of the model, right? So it's like if I activate this feature that we've identified to a really large degree, how does that change the output of the model? Um, am I using LayerNorm or not? How is that changing the feature that's being learned? Um, there, there... Yeah, there... And, and that will take even more context or reasoning abilities for the model.
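[Editor's note: the LayerNorm wrinkle mentioned above can be made concrete in a toy sketch. All matrices here are random stand-ins, not a real model. The point: if you steer the residual stream by adding a feature direction and unembed directly, the logit change is exactly linear in the steering strength; but if a final LayerNorm sits in between, the stream gets rescaled and the effect becomes nonlinear, which is exactly the kind of detail the experimenter has to track.]

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, vocab = 32, 50
W_U = rng.normal(size=(vocab, d_model))  # unembedding: residual stream -> logits
feature = rng.normal(size=d_model)
feature /= np.linalg.norm(feature)       # unit-norm feature direction (hypothetical)
resid = rng.normal(size=d_model)         # one residual-stream activation

def layer_norm(x, eps=1e-5):
    """Final-layer normalization: zero mean, unit variance."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

# "Activate this feature to a really large degree":
steered = resid + 5.0 * feature

# Without LayerNorm, the logit change is exactly 5.0 * (W_U @ feature),
# i.e. linear in the steering strength:
delta_raw = W_U @ steered - W_U @ resid

# With LayerNorm, the stream is re-centered and rescaled first,
# so the logit change is no longer linear in the strength:
delta_ln = W_U @ layer_norm(steered) - W_U @ layer_norm(resid)
```

So the same "turn this feature up" intervention gives different answers depending on whether the normalization is included, which is why such bookkeeping matters for early-stage interpretability results.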
- SDSholto Douglas
I... So you used a couple of concepts together, and it's not self-evident to me that they're the same, but you... It seemed like you were using them inter- interchangeably. So I just want to... Um, like, uh, one was, well, to work on the Claude codebase and make more modules based on that, they need more context or something, where, like, it seems like they might already be able to fit in the context. Or do you mean like actual... Do you mean like the context window context or like more...
- TBTrenton Bricken
Yeah, the context window context.
- SDSholto Douglas
Um, so yeah-
- TBTrenton Bricken
Yeah.
- SDSholto Douglas
... it seems like now it might just be able to fit. The, the thing that's preventing it from making good modules is not, uh, the lack of being able to put the codebase in there.
- TBTrenton Bricken
I, I think that will be there soon. Yeah.
- SDSholto Douglas
But like, it's not going to be as good at you... as you at, like, coming up with papers because it can, like, fit the codebase in there.
- TBTrenton Bricken
No, but it will speed up a lot of the engineering.
- SDSholto Douglas
Hmm. In a way that causes an intelligence explosion?
- TBTrenton Bricken
Um, no. That accelerates research. But, but I think these things compound. So like, the faster I can do my engineering, the more experiments I can run. And then the more experiments I can run, the faster we can... I mean, my, my work isn't actually accelerating capabilities at all.
- SDSholto Douglas
Right, right. But just like-
- 1:07:44 – 1:23:26
Superposition & secret communication
- Dwarkesh Patel
Does it have much space to represent it?
- Trenton Bricken
I mean, my very naive take here would just be that one thing the superposition hypothesis, which interpretability has pushed, says is that your model is dramatically under-parameterized. And that's typically not the narrative that deep learning has pursued, right? But if you're trying to train a model on the entire internet and have it predict it with incredible fidelity, uh, you are in the under-parameterized regime. And you're having to compress a ton of things, and take on a lot of noisy interference in doing so. And so with a bigger model, you can just have cleaner representations that you can work with.
- Dwarkesh Patel
Yeah. Uh, for the audience, you should unpack that: first of all, what superposition is, and why that is the implication of superposition.
- Trenton Bricken
Sure. Yeah, so the fundamental result, and this was before I joined Anthropic, but the paper's titled Toy Models of Superposition, finds that even for small models, if you are in a regime where your data is high-dimensional and sparse, and by sparse I mean any given data point doesn't appear very often, um, your model will learn a compression strategy, which we call superposition, so that it can pack more features of the world into it than it has parameters. And, um, the sparsity here is like... and I think both of these constraints apply to the real world, and modeling internet data is a good enough proxy for that: there's only one Dwarkesh. There's only one shirt you're wearing; there's this Liquid Death can here. And so these are all objects or features, and how you define a feature is tricky. Um, and so you're in a really high-dimensional space 'cause there are so many of them-
- Dwarkesh Patel
Right.
- Trenton Bricken
... and they appear very infrequently.
- Dwarkesh Patel
Yeah.
- Trenton Bricken
And in that regime, your model will learn compression. Um, to riff a little bit more on this: I think it's becoming increasingly clear, I will say I believe, that the reason networks are so hard to interpret is in large part because of this superposition. So if you take a model and you look at a given neuron in it, right, a given unit of computation, and you ask, "How is this neuron contributing to the output of the model when it fires?", and you look at the data that it fires for, it's very confusing. It'll be like 10% of every possible input, or like Chinese, but also fish and trees and the word "the" and full stops in URLs, right? Um, but the paper that we put out last year, Towards Monosemanticity, shows that if you project the activations into a higher-dimensional space and provide a sparsity penalty, so you can think of this as undoing the compression, in the same way that you assumed your data was originally high-dimensional and sparse, you return it to that high-dimensional and sparse regime, and you get out very clean features.
- Dwarkesh Patel
Mm-hmm.
- Trenton Bricken
And things all of a sudden start to make a lot more sense.
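[Editor's note: the "project into a higher-dimensional space with a sparsity penalty" recipe described above can be sketched as a minimal sparse autoencoder. This is an illustrative toy in the spirit of Towards Monosemanticity, not the paper's actual architecture or training setup; every size and coefficient here is an assumption.]

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_hidden = 16, 128  # overcomplete: d_hidden >> d_model "undoes" the compression
W_enc = rng.normal(size=(d_hidden, d_model)) * 0.1
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(size=(d_model, d_hidden)) * 0.1

def sae(x, l1_coeff=1e-3):
    """Encode into the wide space, ReLU for sparsity, decode; return features and loss."""
    f = np.maximum(0.0, W_enc @ x + b_enc)  # sparse, non-negative feature activations
    x_hat = W_dec @ f                       # reconstruction of the original activation
    # Training objective: reconstruct faithfully while keeping features sparse.
    loss = np.mean((x - x_hat) ** 2) + l1_coeff * np.abs(f).sum()
    return f, loss

x = rng.normal(size=d_model)  # one residual-stream activation from the model
features, loss = sae(x)
```

The L1 term is what pushes each dictionary direction to fire rarely, so that individual entries of `features` end up corresponding to single interpretable concepts rather than the polysemantic mash a raw neuron shows.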
- Dwarkesh Patel
Mm-hmm. Okay. Um, there are so many interesting threads there. Uh, the first thing I wanna ask about is the thing you mentioned, that these models are trained in a regime where they're over-parameterized. Isn't that when you have generalization? Like, grokking happens in that regime, right? So, um-
- Sholto Douglas
Um, under-parameterized.
- Dwarkesh Patel
... isn't that what you want?
- Trenton Bricken
So, I would say the models were under-parameterized.
- Dwarkesh Patel
Oh, I see what you're saying. Yeah.
- Trenton Bricken
Yeah, yeah. Like, typically people talk about deep learning as if the model is over-parameterized.
- Dwarkesh Patel
Mm-hmm.
- Trenton Bricken
Um, but actually the claim here is that they're dramatically under-parameterized-
- Dwarkesh Patel
I see.
- Trenton Bricken
... given the complexity of the task that they're trying to perform.
- Dwarkesh Patel
Okay. Um, another question. So the distilled models... first of all, okay, what is happening there? 'Cause the earlier claims we were talking about say the smaller models are worse at learning than bigger models. But with GPT-4 Turbo, you could make the claim that it's actually worse at reasoning-style stuff than GPT-4, um, but probably knows the same facts, like the distillation got rid of some of the reasoning things. Um-
- Sholto Douglas
Do we have any evidence that GPT-4 Turbo is a distilled version of GPT-4? It might just be a new architecture.
- Dwarkesh Patel
Oh, okay.
- Sholto Douglas
Yeah.
- Dwarkesh Patel
All right. I'm with you.
- Sholto Douglas
Like, it could just be a faster, more efficient neural architecture.
- Dwarkesh Patel
Okay, interesting.
- Sholto Douglas
So that's cheaper. Yeah.
- Dwarkesh Patel
Um, how do you interpret what's happening in distillation? I think Grant had one of these questions on his website: why can't you train the distilled model directly? Why does it have to go through... and is the picture that you had to project it from this bigger space to a smaller space? How?
- Trenton Bricken
Um, I mean, I think both models will still be using superposition. Um, but the claim here is that you get a very different model if you distill versus if you train from scratch.
Episode duration: 3:13:12
Transcript of episode UTuuTTnjxMQ