- 0:00 – 2:58
Introduction
- Vishal Misra
Anthropic makes great products. Claude Code is fantastic, CoWork is fantastic, but they are grains of silicon doing matrix multiplication. They don't have consciousness. They don't have an inner monologue. You take an LLM, train it on pre-1916 or pre-1911 physics, and see if it can come up with the theory of relativity. If it does, then we have AGI.
- Erik Torenberg
Just today, by the way-
- Vishal Misra
Yeah
- Erik Torenberg
... Dario allegedly said that you can't rule out that they're conscious.
- Vishal Misra
You can rule out they're conscious. [both laughing] Come on. To get to what is called AGI, I think there are two things that need to happen. One is...
- Erik Torenberg
Vishal, it's great to have you in again.
- Vishal Misra
Great to be back.
- Erik Torenberg
This is one of my favorite topics, which is: how do LLMs actually work?
- Vishal Misra
Mm-hmm.
- Erik Torenberg
And I think that, in my opinion, you've done the best work on this, modeling it out.
- Vishal Misra
Thank you.
- Erik Torenberg
For those that did not see the original one, it's probably worth doing a quick background on what led you to this point, and then we'll go into the current work that you've been doing.
- Vishal Misra
Five years ago, when GPT-3 was first released-
- Erik Torenberg
Yeah
- Vishal Misra
... I got early access to it and started playing with it, and I was trying to solve a problem related to querying a cricket database.
- Erik Torenberg
Yeah.
- Vishal Misra
And I got GPT-3 to do in-context learning, few-shot learning, and, you know, at least to me, it was the first known implementation of RAG, Retrieval-Augmented Generation, which I used to solve this querying problem: getting GPT-3 to translate natural language into something that could be used to query a database that GPT-3 had no idea about. I had no access to GPT-3's internals, but I was still able to use it to solve that problem. It worked beautifully. We deployed this in production at ESPN in September '21, but-
- Erik Torenberg
Wow. Wow, you did the first implementation of RAG in 2021?
- Vishal Misra
No, no, no. In 2020.
- Erik Torenberg
2020.
- Vishal Misra
2020, I got it working, and by the time you talk to all the lawyers at ESPN and, you know, productionize it, it took a while.
- Erik Torenberg
Wow.
- Vishal Misra
But October 2020, we had... Well, I had this-
- Erik Torenberg
Yeah
- Vishal Misra
... architecture working. But after I got it to work, I was amazed that it worked. I wanted to understand how it worked.
- Erik Torenberg
Yeah.
- Vishal Misra
And I looked at, you know, the Attention Is All You Need paper and all the other deep learning architecture papers, and I couldn't understand why it worked.
- Erik Torenberg
Yeah.
- Vishal Misra
So then I started getting deep into building a mathematical model.
- Erik Torenberg
Yeah. And now you've published a series of papers.
- 2:58 – 8:24
LLM as Giant Matrix
- Erik Torenberg
You were trying to describe... you were trying to come up with a mathematical model-
- Vishal Misra
Mm-hmm
- Erik Torenberg
... of how an LLM works.
- Vishal Misra
Yeah.
- Erik Torenberg
And you had, which was very helpful to me... at the time you were actually trying to figure out how in-context learning was working.
- Vishal Misra
Yes. Yeah.
- Erik Torenberg
And you came up with an abstraction for LLMs, which is basically this very, very large matrix, and you used that to describe it. So maybe you can walk through that work very quickly.
- Vishal Misra
Sure, yeah. So what you do is you imagine this huge, gigantic matrix where every row of the matrix corresponds to a prompt.
- Erik Torenberg
Yeah.
- Vishal Misra
And the way these LLMs work is, given a prompt, they construct a distribution of probabilities for the next token. The next token is, roughly, the next word. Every LLM has a vocabulary; GPT and its variants have a vocabulary of about 50,000 tokens.
- Erik Torenberg
Yeah.
- Vishal Misra
So given a prompt, it'll come up with a distribution of what the next token should be, and then all these models sample from that distribution.
- Erik Torenberg
Yeah. So that's the posterior distribution.
- Vishal Misra
That's the posterior distribution.
- Erik Torenberg
Right.
- Vishal Misra
Right? That's how LLMs work. And so the idea of this matrix is: for every possible combination of tokens, which is a prompt, there's a row.
- Erik Torenberg
Yeah.
- Vishal Misra
And the columns are a distribution over the vocabulary.
- Erik Torenberg
Yeah.
- Vishal Misra
So if you have a vocabulary of 50,000 possible tokens, it's a distribution over those 50,000 tokens.
- Erik Torenberg
And by distribution, it's just the probability-
- Vishal Misra
Just the probability. Sorry, yeah.
- Erik Torenberg
Yeah.
- Vishal Misra
Just the probability that the next token should be this-
- Erik Torenberg
Yeah
- Vishal Misra
... versus that.
- Erik Torenberg
Yeah.
- Vishal Misra
So that's the idea. And when you start viewing it that way, it makes what's happening clearer, at least to people like me who want to model it. So concretely, let's take an example: let's say your prompt is just one word, protein.
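To make the giant-matrix picture concrete, here is a minimal sketch in Python. It is illustrative only, not code from the papers: each prompt indexes a row, the row is a probability distribution over a roughly 50,000-token vocabulary, and generation is just sampling from that row and appending. The next_token_distribution function below fakes a row with random logits; in a real LLM this would be a forward pass through the network.

import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50_000  # roughly the GPT-3-era vocabulary size

def next_token_distribution(prompt_tokens):
    # Stand-in for one "row" of the giant matrix: a probability
    # distribution over the next token given the prompt. A real LLM
    # computes this with a forward pass; here it's faked with noise.
    logits = rng.normal(size=VOCAB)
    probs = np.exp(logits - logits.max())   # softmax
    return probs / probs.sum()

def generate(prompt_tokens, n_steps=5):
    tokens = list(prompt_tokens)
    for _ in range(n_steps):
        row = next_token_distribution(tokens)         # look up the row
        tokens.append(int(rng.choice(VOCAB, p=row)))  # sample from it
    return tokens

print(generate([1234]))  # a one-token prompt, e.g. "protein"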
- 8:24 – 13:00
What Is In-Context Learning
- Vishal Misra
... subset, right.
- Erik Torenberg
You used this approach to describe how in-context learning works, so maybe first describe what in-context learning is-
- Vishal Misra
Yeah
- Erik Torenberg
... and then the conclusion that you drew from that.
- Vishal Misra
So in-context learning is when you show the LLM something it has never seen before. You give it a few examples of what you're trying to do, then you give it a new problem, which is related to the examples that you've shown.
- Erik Torenberg
Yeah.
- Vishal Misra
And the LLM learns in real time what it's supposed to do and solves the problem.
- Erik Torenberg
By the way, the first time I saw this, it absolutely blew my mind.
- Vishal Misra
Yeah.
- Erik Torenberg
I actually used your DSL-
- Vishal Misra
Mm-hmm
- Erik Torenberg
... when I was first learning about it. So maybe, like a-
- Vishal Misra
Yeah. So I-
- Erik Torenberg
The DSL thing was just, it's just-
- Vishal Misra
It was-
- Erik Torenberg
... crazy this works at all.
- Vishal Misra
It's absolutely, you know, mind-blowing that it works. And so going back to that cricket problem-
- Erik Torenberg
Yeah
- Vishal Misra
... in the mid-'90s, I was part of a group that had created this cricket portal called Cricinfo.
- Erik Torenberg
Yeah.
- Vishal Misra
Cricket is a very stat-rich sport; think baseball multiplied by a thousand. It's got all kinds of stats. And we had created this online searchable database called StatsGuru, where you could search for anything, any stat related to cricket, and it's been available since 2000.
- Erik Torenberg
Yeah.
- Vishal Misra
But because you can query for anything, everything was made available. And how do you make something like that available to the general public?
- Erik Torenberg
Yeah.
- Vishal Misra
Well, they're not gonna write SQL queries. The next best thing at that time was to create a web form. Unfortunately, [chuckles] everything was crammed into that web form. As a result, you had something like 20 drop-downs, 15 checkboxes, and 18 different text fields. It was a very complicated, daunting interface. So even though it could answer any query, almost no one used it. A vanishingly small percentage-
- Erik Torenberg
Yeah. [chuckles]
- Vishal Misra
... of cricket fans used it, because it just looked intimidating. And then ESPN bought that site in 2007. I still know people who run the site, and I've always told them, "You know, why don't you do something with StatsGuru?" In January 2020, the editor-in-chief of Cricinfo, Sambit Bal, who's a friend, came to New York and we went out for drinks. And again, I told him, "You know, why don't you do something with StatsGuru?" So he looks at me and says, "Why don't you do something about StatsGuru?" [chuckles] He was joking, but that idea kind of stayed with me. And when GPT-3 was released, I thought maybe I could use GPT-3 to create a front end for StatsGuru.
- Erik Torenberg
Gotcha.
- Vishal Misra
And so what I did was design a DSL, a domain-specific language, and get GPT-3 to convert natural-language queries about cricket stats into this DSL. Now-
- Erik Torenberg
And to be clear, you created this. It wasn't part of any training data-
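The few-shot prompt he describes would have looked roughly like the sketch below. The DSL syntax here is invented for illustration (his actual DSL isn't shown in the conversation), and the commented-out call uses the GPT-3-era completions API.

# Hypothetical few-shot prompt: (question, DSL) pairs, then a new question.
FEW_SHOT_PROMPT = """\
Q: How many runs did Tendulkar score in ODIs?
DSL: stats(player="SR Tendulkar", format=ODI, metric=runs)

Q: Most wickets by a bowler in Tests since 2010?
DSL: stats(role=bowler, format=Test, metric=wickets, since=2010, sort=desc, limit=1)

Q: Highest individual score at Lord's?
DSL:"""

# GPT-3-era completion call (old openai SDK signature, shown for
# illustration; current SDKs differ):
#
#   completion = openai.Completion.create(
#       engine="davinci", prompt=FEW_SHOT_PROMPT,
#       max_tokens=40, temperature=0, stop="\n")
#
# The returned DSL string is then parsed and executed against StatsGuru,
# and the result rendered back to the user.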
- 13:00 – 19:13
Bayesian Updating as Evidence
- Erik Torenberg
... learning.
- Vishal Misra
Yeah. So when you think about what in-context learning is, it's that you update as you see evidence. So, you know, in the first paper, what I also did was take this cricket DSL example.
- Erik Torenberg
Okay.
- Vishal Misra
And I depicted the next-token probabilities-
- Erik Torenberg
Mm-hmm
- Vishal Misra
... of the model as it was shown more and more examples. The first time you show it this DSL, the natural language and the DSL, the probabilities of the DSL tokens were extremely low, because GPT-3 had never seen this thing. When it saw the cricket question, in its mind it was trying to continue it with an English answer. So the probabilities that were high were all English words.
- Erik Torenberg
Yeah.
- Vishal Misra
Once it saw my prompt, where I had the question and the DSL, and then the next question in the next row, the probabilities of the DSL tokens started going up. With every example they went up, and finally, when I gave the new query, it had almost 100% probability of getting the right token.
- Erik Torenberg
Yeah.
- Vishal Misra
So this is an example of the model updating its posterior probability in real time. It was updating its knowledge: okay, I've seen evidence, this is what I'm supposed to do. Now, this is a colloquial way of saying what Bayesian-
- Erik Torenberg
Yeah
- Vishal Misra
... inference is. Bayesian updating, basically, is: you start with a prior, and when you see new evidence, you update your posterior. That's the mathematical definition. But-
- Erik Torenberg
Yeah
- Vishal Misra
... in English, it's basically: you see new evidence, you update your belief about what's happening.
- Erik Torenberg
Yeah.
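As a minimal numeric sketch of the updating he's describing (the probabilities below are assumptions for illustration, not measurements): start with two hypotheses about how to continue the prompt, "in English" versus "in the DSL", and apply Bayes' rule once per few-shot example.

# Prior: GPT-3 has essentially never seen the DSL.
posterior = {"english": 0.99, "dsl": 0.01}

# Illustrative likelihood of observing a (question, DSL-line) pair
# under each hypothesis.
likelihood = {"english": 0.05, "dsl": 0.90}

for example in range(1, 5):   # each few-shot demonstration
    unnorm = {h: posterior[h] * likelihood[h] for h in posterior}
    z = sum(unnorm.values())
    posterior = {h: p / z for h, p in unnorm.items()}
    print(example, {h: round(p, 4) for h, p in posterior.items()})

# P("dsl") climbs toward 1 with each example, mirroring how the DSL
# tokens' next-token probabilities rose with every demonstration.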
- Vishal Misra
Right? So it was clear to me that LLMs are doing something which resembles Bayesian updating. So in that first paper, I had this matrix formulation, and I showed that what it's doing looks like Bayesian updating.
- Erik Torenberg
Yeah.
- Vishal Misra
Then we can come to the next series of papers.
- Erik Torenberg
That's right. So, okay. I mean, it seemed pretty conclusive to me at that time.
- Vishal Misra
Yeah.
- Erik Torenberg
And then you went quiet for a while, and then... I still remember the WhatsApp text. You said-
- Vishal Misra
Yeah
- Erik Torenberg
... "Martín, I know exactly how these things are working now." [both chuckling]
- Vishal Misra
Yeah. Well-
- Erik Torenberg
And then, listen, you dropped a series of papers that kind of broke the internet. You went super viral on Twitter.
- Vishal Misra
Yeah.
- Erik Torenberg
I mean, people really noticed. And so I want to get to that in just a second.
- Vishal Misra
Yeah.
- Erik Torenberg
But before that, I remember when your first paper came out, people would be like, "You know, these things are definitely not Bayesian." Like-
- Vishal Misra
Mm
- 19:13 – 27:22
Bayesian Wind Tunnel Tests
- Erik Torenberg
... it. Got it.
- Vishal Misra
So then, with my colleagues Naman Agarwal and Siddharth Dalal (the series of papers were written with them), we came up with this idea of a Bayesian wind tunnel.
- Erik Torenberg
Okay.
- Vishal Misra
So what's a wind tunnel? Well, a wind tunnel in the aerospace industry is where you test an aircraft in an isolated environment. You don't fly it; you test it against all sorts of aerodynamic pressure, and you see what it'll withstand, what kind of altitude, pressure, and so on. Right? You don't want to do that testing up in the air.
- Erik Torenberg
Yeah.
- Vishal Misra
So we said, okay, why don't we create an environment where we take these architectures (and we tested transformers, Mamba, LSTMs, MLPs, all the architectures), take a blank architecture, and give it a task where it's impossible for the architecture to memorize what the solution to that task should be. The space is combinatorially-
- Erik Torenberg
Yeah
- Vishal Misra
... too large, given the number of parameters, and we took very small models. So it's difficult enough that they cannot memorize it.
- Erik Torenberg
Yeah.
- Vishal Misra
But it's tractable enough that we know precisely what the Bayesian posterior should be. You can calculate it analytically. So we gave these models a bunch of tasks where, again, we show that it's impossible to memorize. We trained these models, and we found that the transformer matched the precise Bayesian posterior down to ten-to-the-power-minus-three bits of accuracy. It was matching the distribution perfectly. So it is actually doing Bayesian inference in the mathematical sense, given a task-
- Erik Torenberg
Wow
- Vishal Misra
... where it has to update its belief. Mamba also does it reasonably well. LSTMs can do some of the tasks. In the papers, we have a taxonomy of Bayesian tasks: the transformer does everything, Mamba does most of it, LSTMs do it only partially, and MLPs fail completely.
- Erik Torenberg
So is this a reflection of the data that it's trained on, or is it more a reflection of the mechanism?
- Vishal Misra
It's the mechanism; it's the architecture. The data decides what tasks it learns.
- Erik Torenberg
Right.
- Vishal Misra
So in the first paper, we had these Bayesian wind tunnels, and we show that it's doing the job at different tasks. In the second paper, we show why it does it. We look at the transformers, we look at the gradients, and we show how the gradients actually shape this geometry-
- Erik Torenberg
Ah
- Vishal Misra
... which enables this Bayesian updating to happen. Then in the third paper, we took frontier production LLMs which have open weights, so that we could look inside them, and we did our testing, and we saw that the geometries we saw in the small models persisted in models which have, you know, hundreds of millions of parameters. The same signature existed. The only thing is that, because they are trained on all sorts of data, it's a little bit dirty or messy.
- Erik Torenberg
Yeah.
- Vishal Misra
But you can see the same structure. So the whole idea behind the Bayesian wind tunnel was that, unlike these production LLMs, where you don't know what they have been trained on-
- Erik Torenberg
Right
- Vishal Misra
... so you cannot mathematically compute the posterior. So again, how do you prove it? I mean, it looks Bayesian, you know, from the first paper.
- Erik Torenberg
From the first paper, yeah.
- Vishal Misra
From the paper it looks Bayesian, but, you know.
- Erik Torenberg
Looks Bayesian to me. Yeah.
- Vishal Misra
So the wind tunnel solved that problem for us.
- Erik Torenberg
Mm-hmm.
- Vishal Misra
We said, okay, let's start with a blank architecture and give it a task where we know what the answer is and it cannot memorize. Let's see what it does. And yeah.
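A minimal sketch of the wind-tunnel logic (not the papers' actual tasks or code): pick a toy task whose exact Bayesian posterior has a closed form, here a coin whose unknown bias is drawn from a small discrete prior, then measure how far a trained model's predictive distribution sits from that exact posterior, in bits. The model_prob value is a placeholder standing in for a trained network's output.

import numpy as np

thetas = np.array([0.2, 0.5, 0.8])     # hypotheses for the coin's bias
prior = np.array([1 / 3, 1 / 3, 1 / 3])

def exact_predictive(flips):
    # Exact P(next flip = heads | flips) via Bayes' rule.
    heads, n = sum(flips), len(flips)
    like = thetas ** heads * (1 - thetas) ** (n - heads)
    post = prior * like
    post /= post.sum()
    return float((post * thetas).sum())

def kl_bits(p, q):
    # KL(p || q) in bits between two Bernoulli distributions.
    pv, qv = np.array([p, 1 - p]), np.array([q, 1 - q])
    return float((pv * np.log2(pv / qv)).sum())

flips = [1, 1, 0, 1]                   # an observed context
p_exact = exact_predictive(flips)      # ~0.668 for this context
model_prob = 0.66                      # placeholder for the network's output
print(f"exact={p_exact:.4f}  model={model_prob:.4f}  "
      f"gap={kl_bits(p_exact, model_prob):.6f} bits")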
- Erik Torenberg
So do you think this provides any sort of indication of how humans think, or do you think that these things are totally independent?
- Vishal Misra
No, it does provide one, right? You know, human beings also update our beliefs as we see new evidence.
- 27:22 – 36:34
Brains Simulate Causality
- Vishal Misra
... coming back to it, we are Bayesian-
- Erik Torenberg
Yeah
- Vishal Misra
... but we do something else. You know, when I throw this pen at you, what'll you do?
- Erik Torenberg
Dodge it or-
- Vishal Misra
Dodge it
- Erik Torenberg
... dodge it? Yeah.
- Vishal Misra
Why will you dodge it?
- Erik Torenberg
Um, to avoid being hit.
- Vishal Misra
Avoid being hit.
- Erik Torenberg
Yeah.
- Vishal Misra
But your head is not doing a Bayesian calculation of: okay, this pen is coming, here's the probability that it hits me, it'll cause this much pain, and all that.
- Erik Torenberg
Correct.
- Vishal Misra
What you're essentially doing in your head is running a simulation.
- Erik Torenberg
Ah, right.
- Vishal Misra
You see the pen coming, and you know that it'll come and hit you. Your mind simulates, and you dodge it, right? So all of deep learning is doing correlation. It's not doing causation.
- Erik Torenberg
Yeah.
- Vishal Misra
Causal models are the ones that are able to do simulations and interventions. So, you know, Judea Pearl has this whole causal hierarchy-
- Erik Torenberg
Yeah
- Vishal Misra
... where the first level is association, which is where you build these correlation models. Deep learning is beautiful; it's extremely powerful. I mean, you see it every day: all these models are amazingly good.
- Erik Torenberg
Yeah.
- Vishal Misra
They do association. The second level in the hierarchy is intervention.
- Erik Torenberg
Yeah.
- Vishal Misra
Deep learning models do not do that. The third is counterfactuals. Both intervention and counterfactuals, you can imagine, involve some sort of simulation: you build a causal model of what's happening, and then you are able to simulate. Our brains do that. The current architectures don't. Another example, which I think will make it clear, is the difference between two technical terms: Shannon entropy-
- Erik Torenberg
Mm-hmm
- Vishal Misra
... and Kolmogorov complexity.
- Erik Torenberg
Sure.
- Vishal Misra
So if you look at the Shannon entropy of the digits of pi-
- Erik Torenberg
Yeah
- Vishal Misra
... it's infinite.
- Erik Torenberg
Sure.
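The contrast is easy to demonstrate. Empirically, the digits of pi are distributed like rolls of a fair ten-sided die, so their per-digit Shannon entropy sits near the log2(10), roughly 3.32-bit, maximum; yet the Kolmogorov complexity is tiny, since a program of a few lines emits as many digits as you like. This sketch assumes the mpmath library is available rather than implementing a spigot algorithm from scratch.

from collections import Counter
from math import log2

from mpmath import mp

mp.dps = 10_000                               # working precision: 10,000 digits
digits = mp.nstr(mp.pi, 10_000).replace(".", "")[:10_000]

counts = Counter(digits)
n = sum(counts.values())
entropy = -sum(c / n * log2(c / n) for c in counts.values())
print(f"empirical entropy: {entropy:.4f} bits/digit (max {log2(10):.4f})")
# High entropy, yet the whole infinite sequence compresses to a short
# generating program: low Kolmogorov complexity.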
- 36:34 – 42:17
Manifolds and New Representations
- Vishal Misra
Yeah.
- Erik Torenberg
You know, another way that I've always thought about these, and I thought you articulated it well the last time we talked about it, is that the universe is this very, very complex space, and somehow humans map it into a manifold-
- Vishal Misra
Mm-hmm
- Erik Torenberg
... that's less complex.
- Vishal Misra
Yeah.
- Erik Torenberg
And then that gets written down, and then the LLM... So that's some distribution, some-
- Vishal Misra
Mm.
- Erik Torenberg
You know, it's still a very large space, but it's-
- Vishal Misra
Yeah
- Erik Torenberg
... a bounded space, and the LLMs learned that manifold, and then they use, you know, Bayesian inference to move up and down that manifold.
- Vishal Misra
Right.
- Erik Torenberg
But they're kind of bound to that manifold.
- Vishal Misra
Yeah.
- Erik Torenberg
And again, I don't wanna put words in your mouth, but what they can't do is generate a new manifold.
- Vishal Misra
A new manifold, yeah, yeah.
- Erik Torenberg
Which requires understanding the way that the universe works, then coming up with a new representation of the universe.
- Vishal Misra
Yeah. And this is what relativity is, right?
- Erik Torenberg
Yeah, exactly.
- Vishal Misra
Einstein had to create a new manifold.
- Erik Torenberg
Yeah, yeah, yeah.
- Vishal Misra
If you just stuck with the old manifold of Newtonian physics-
- Erik Torenberg
Right
- Vishal Misra
... then you would see these correlations, but you could not come up with a manifold that explains them. So you need to come up with a new representation.
- Erik Torenberg
Yeah.
- Vishal Misra
So to me, you know, there are lots of definitions of AGI. The Turing test: we have already passed that. Performing economically useful work: every day you see LLMs doing that.
- Erik Torenberg
Do we? I don't know.
- Vishal Misra
No, I mean, they are.
- Erik Torenberg
I mean, without human intervention?
- Vishal Misra
No, no, no. So that's different.
- Erik Torenberg
Okay.
- 42:17 – 46:48
Simulation as Short Program
- Erik Torenberg
And can you tie the two things together? Like, how does that pair with doing simulation, or is simulation totally orthogonal?
- Vishal Misra
No, simulation is related, right?
- Erik Torenberg
So you think, basically, you do simulation, and somehow that is a step towards the Kolmogorov complexity?
- Vishal Misra
The simulator is the program that we create. It may not be the perfect program.
- Erik Torenberg
Oh, I see. And you say-
- Vishal Misra
But in our heads we create this simulator, so that when I'm throwing the pen, you know that it's coming at you, right?
- Erik Torenberg
Yeah.
- Vishal Misra
And you duck. So you're not computing the probabilities as it goes, but, you know, you build an approximate-
- Erik Torenberg
That's a very physical thing, versus we are talking more conceptually.
- Vishal Misra
Conceptually, but it's a similar thing.
- Erik Torenberg
And you think those are the same mechanism?
- Vishal Misra
It's the same mechanism.
- Erik Torenberg
Really?
- Vishal Misra
Yeah. You have to build a causal model.
- Erik Torenberg
Yeah.
- Vishal Misra
Right?
- Erik Torenberg
I see. I see. Yeah.
- Vishal Misra
For most things, right?
- Erik Torenberg
Yeah.
- Vishal Misra
So you have to move from correlation to causation. I mean, we've heard this term-
- Erik Torenberg
Yeah
- Vishal Misra
... you know, ad infinitum.
- Erik Torenberg
Yeah.
- Vishal Misra
But here it's making a difference in the way we view intelligence.
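A hedged sketch of the pen example as a "short program": a few lines of projectile physics form a causal model you can run forward, intervene on (change the throw), and query counterfactually, none of which a purely correlational lookup supports. All the numbers are invented for illustration.

def will_hit(y0, vx, vy, target_x, head_y=1.7, dt=0.01, g=9.81):
    # Forward-simulate the pen's flight with simple Euler steps;
    # True if it arrives at target_x near head height.
    x, y = 0.0, y0
    while x < target_x and y > 0:
        x += vx * dt
        y += vy * dt
        vy -= g * dt
    return abs(y - head_y) < 0.15      # within ~15 cm of the head

# The throw as observed: duck!
print(will_hit(y0=1.5, vx=6.0, vy=3.0, target_x=3.0))   # True
# Intervention: same causal model, different action ("what if it
# had been thrown downward?"): no need to duck.
print(will_hit(y0=1.5, vx=6.0, vy=-1.0, target_x=3.0))  # False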
- Erik Torenberg
Yeah. How have the last three papers been received?
- Vishal Misra
No, I don't know. They're... Well-
- Erik Torenberg
I mean-
- Vishal Misra
The arXiv versions were like-
- Erik Torenberg
Let me tell you.
- Vishal Misra
Yeah.