
Why Scale Will Not Solve AGI | Vishal Misra - The a16z Show

Vishal Misra returns to explain his latest research on how LLMs actually work under the hood. He walks through experiments showing that transformers update their predictions in a precise, mathematically predictable way as they process new information, explains why this still doesn't mean they're conscious, and describes what's actually required for AGI: the ability to keep learning after training and the move from pattern matching to understanding cause and effect.

Timestamps:
00:00 — Introduction
02:58 — LLM as Giant Matrix
08:24 — What Is In-Context Learning
13:00 — Bayesian Updating as Evidence
19:13 — Bayesian Wind Tunnel Tests
27:22 — Brains Simulate Causality
36:34 — Manifolds and New Representations
42:17 — Simulation as Short Program

Read the full transcript here: https://www.a16z.news/s/podcast

Resources:
Follow Vishal Misra on X: https://x.com/vishalmisra
Follow Martin Casado on X: https://x.com/martin_casado

Stay Updated:
If you enjoyed this episode, be sure to like, subscribe, and share with your friends!
Find a16z on X: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Listen to the a16z Show on Spotify: https://open.spotify.com/show/5bC65RDvs3oxnLyqqvkUYX
Listen to the a16z Show on Apple Podcasts: https://podcasts.apple.com/us/podcast/a16z-podcast/id842818711
Follow our host: https://x.com/eriktorenberg

Please note that the content here is for informational purposes only; should not be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see http://a16z.com/disclosures.

Vishal Misra (guest) · Erik Torenberg (host)
Mar 17, 2026 · 46m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00–2:58

    Introduction

    1. VM

      Anthropic makes great products. Claude Code is fantastic, CoWork is fantastic, but they are grains of silicon doing matrix multiplication. They don't have consciousness. They don't have an inner monologue. You take an LLM and train it on pre-1916 or 1911 physics and see if it can come up with the theory of relativity. If it does, then we have AGI.

    2. ET

      Just today, by the way-

    3. VM

      Yeah

    4. ET

      ... Dario allegedly said that you can't rule out that they're conscious.

    5. VM

      You can rule out they're conscious. [both laughing] Come on. To get to what is called AGI, I think there are two things that need to happen. One is...

    6. ET

      Vishal, it's great to have you in again.

    7. VM

      Great to be back.

    8. ET

      This is one of my favorite topics, which is, um, how do LLMs actually work?

    9. VM

      Mm-hmm.

    10. ET

      And I think that, uh, you, in, in my opinion, you've done kind of the best work on this, modeling it out.

    11. VM

      Thank you.

    12. ET

      For those that did not see the original, um, one, maybe it's probably worth doing just a quick background on kind of what led you to this point, and then we'll just go into the current work that you've been doing.

    13. VM

      Five years ago, when GPT-3 was first released-

    14. ET

      Yeah

    15. VM

      ... uh, I got early access to it, and I started playing with it, and I was trying to solve a problem related to querying a cricket database.

    16. ET

      Yeah.

    17. VM

      And I got GPT-3 to do in-context learning, few-shot learning, and, you know, it was kind of the first, at least, uh, to, to me, it was the first known, uh, implementation of RAG, Retrieval-Augmented Generation, which I used to solve this problem of, uh, querying, getting GPT-3 to translate natural language into something that could be used to query a database that GPT-3 had no idea about. I had no access to GPT-3's internals, but I was still able to use it to solve that problem. So it, it, it worked beautifully. Uh, we, we deployed, uh, this, uh, in production at ESPN in September '21, but-

    18. ET

      Wow. Wow, you, you did the first implementation of RAG in 2021?

    19. VM

      No, no, no. In 2020.

    20. ET

      2020.

    21. VM

      2020, I got it working, and by the time you talk to all the lawyers at ESPN and, you know, productionize it, it took, it took a while.

    22. ET

      Wow.

    23. VM

      But October 2020, we had... Well, I had this-

    24. ET

      Yeah

    25. VM

      ... architecture working. But after I got it to work, I was amazed that it worked. I wanted to understand how it worked.

    26. ET

      Yeah.

    27. VM

      And I looked at, you know, the Attention Is All You Need paper and all the other sort of deep learning architecture papers, and I couldn't understand why it worked.

    28. ET

      Yeah.

    29. VM

      So then I started getting sort of, uh, deep into building a mathematical model.

    30. ET

      Yeah. And now you've published a, a series of papers.

  2. 2:58–8:24

    LLM as Giant Matrix

    1. ET

      you were trying to, you, you were trying to describe... You were trying to come up with a mathematical model-

    2. VM

      Mm-hmm

    3. ET

      ... of how LLM works.

    4. VM

      Yeah.

    5. ET

      And you had, which was very helpful to me-

    6. VM

      Mm-hmm

    7. ET

      ... which was, um... And at the time you were actually trying to, like, figure out how in-context learning was working.

    8. VM

      Yes. Yeah.

    9. ET

      And you came up with an abstraction for LLMs, which is basically this very, very large matrix, and you used that to describe. So maybe you can kind of walk through that work very quickly.

    10. VM

      Sure, yeah. So, so what you do is you, you imagine this huge, gigantic matrix where every row of the matrix corresponds to a prompt.

    11. ET

      Yeah.

    12. VM

      And the way these LLMs work is, given a prompt, they construct a distribution of probabilities of the next token. Next token is next word. So every LLM has a vocabulary. You know, GPT and its variants have a vocabulary of about 50,000 tokens.

    13. ET

      Yeah.

    14. VM

      So given a prompt, it'll come up with a distribution of what the next token should be, and then all these models sample from that distribution.

    15. ET

      Yeah. So that's the posterior distribution.

    16. VM

      That's the posterior distribution.

    17. ET

      Right.

    18. VM

Right? That, that's how LLMs work. And so the idea of this matrix is, for every possible combination of tokens, which is a prompt, there's a row.

    19. ET

      Yeah.

    20. VM

      And the columns are a distribution over the vocabulary.

    21. ET

      Yeah.

    22. VM

      So if you have, like, a vocabulary of 50,000 possible tokens, it's a distribution over those 50,000 tokens.

    23. ET

      And by distribution, it's just the probability-

    24. VM

      Just the probability. Sorry, yeah.

    25. ET

      Yeah.

    26. VM

      Just the probability that, uh, the next token should be this-

    27. ET

      Yeah

    28. VM

      ... versus that.

    29. ET

      Yeah.

    30. VM

      Uh, so that, that's sort of the idea. And, and, and when you start viewing it that way, it, it makes things at least, uh, clearer to, uh, people like me who want to model it, uh, what, what's happening. So c-concretely, let's say you have an example that, uh, let's say your prompt is just one word, protein.
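[Editor's note] The "giant matrix" picture above can be sketched in a few lines. Everything here is invented for illustration: a five-token vocabulary, three prompts, and random numbers standing in for a trained model's logits. Only the shape of the object is the point: one row per prompt, one column per vocabulary token, each row a probability distribution you sample from.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = ["cat", "dog", "sat", "ran", "."]        # tiny stand-in vocabulary
PROMPTS = ["the cat", "the dog", "the cat sat"]  # one row per prompt

# Random logits stand in for what a trained LLM would actually compute.
logits = rng.normal(size=(len(PROMPTS), len(VOCAB)))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # row-wise softmax

# Each row is a proper distribution over the next token...
assert np.allclose(probs.sum(axis=1), 1.0)

# ...and generation is just looking up the row for the current prompt
# and sampling the next token from it.
row = probs[PROMPTS.index("the cat")]
next_token = rng.choice(VOCAB, p=row)
print(next_token)
```

A real model never materializes this matrix (there are vastly more possible prompts than parameters); the network computes each row on demand.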

  3. 8:24–13:00

    What Is In-Context Learning

    1. VM

      subset, right.

    2. ET

      You, you know, you use this approach to describe how in-context learning works, and so maybe first describe what in-context learning is-

    3. VM

      Yeah

    4. ET

      ... and then kind of the conclusion that you came from that.

    5. VM

      So in-context learning is when you, uh, show the LLM something it has kind of never seen before. You give it a few examples of this is what it wants, uh, this is what you're trying to do. Then you give a new problem, which is related to the examples that you've shown.

    6. ET

      Yeah.

    7. VM

      And the LLM lea-learns in real time what it's supposed to do and solves the problem.

    8. ET

      By the way, the first time I saw this, it absolutely blew my mind.

    9. VM

      Yeah.

    10. ET

      I actually, I actually used your DSL-

    11. VM

      Mm-hmm

    12. ET

      ... by when I was like first learning about it. So maybe like, kind of like a, like a-

    13. VM

      Yeah. So, so I, I, like-

    14. ET

      The DSL thing was just, it's just-

    15. VM

      Uh, it was-

    16. ET

      ... crazy this works at all.

    17. VM

      I-it's absolutely, you know, mind-blowing that it works. And so going back to that cricket problem-

    18. ET

      Yeah

    19. VM

      ... was, you know, i-i-in the mid-'90s, uh, I was part of a group that had created this, uh, cricket portal called Cricinfo.

    20. ET

      Yeah.

    21. VM

      Uh, cricket, uh, i-is a very stat-rich sport. You know, you think baseball multiplied by a thousand, and it's got all kinds of stats. And we had created this, uh, online searchable database called StatsGuru, where you could search for anything, any stat related to cricket, and it's been available since 2000.

    22. ET

      Yeah.

    23. VM

But because you can query for anything, everything was made available. And how do you make something like that available to the general public?

    24. ET

      Yeah.

    25. VM

      Well, they're not gonna write SQL queries. The next best thing at that time was to create a web form. Unfortunately, [chuckles] everything was crammed into that web form. So as a result, you had like 20 drop-downs, 15 check boxes, 18 different text fields. It looked like a very complicated, daunting interface. So as a result, even though it could solve or it could answer any query, almost no one used it. A vanishingly small percentage-

    26. ET

      Yeah. [chuckles]

    27. VM

      ... of cricket fans use it because it, it just looked intimidating. And then ESPN bought that site, uh, in 2007. I still know people who, who run the site, and I've always told them, "You know, why don't you do something with StatsGuru?" And in January 2020, uh, the editor-in-chief of, uh, Cricinfo, Sambit Bal, he's, he's a friend, so he came to New York and we'd gone out for drinks. And again, I told him, "You know, why don't you do something with StatsGuru?" So he looks at me and says, "Why don't you do something about StatsGuru?" [chuckles] He was joking, but, uh, that idea kind of stayed with me. And when GPT-3 was released, I thought maybe I could use StatsGuru, use GPT-3 to create a front end for StatsGuru.

    28. ET

      Gotcha.

    29. VM

      And so what I did was, uh, I designed a DSL, a domain-specific language, which, uh, converted queries about cricket stats in natural language into this DSL. Now-

    30. ET

      And to be clear, you created this. It wasn't, like, part of, like, any training data-
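[Editor's note] The few-shot setup described here can be sketched as follows. The mini-DSL and example queries are invented for illustration; this is not Misra's actual cricket DSL, just the shape of an in-context-learning prompt: a few demonstrations, then a new query for the model to complete.

```python
# Hypothetical natural-language / DSL demonstration pairs.
EXAMPLES = [
    ("most runs by a batsman in 2019",
     "STATS(metric=runs, role=batsman, year=2019, agg=max)"),
    ("most wickets by a bowler in 2019",
     "STATS(metric=wickets, role=bowler, year=2019, agg=max)"),
]

def build_prompt(examples, query):
    """Concatenate Q/A demonstrations, then pose the new query for completion."""
    lines = []
    for question, dsl in examples:
        lines.append(f"Q: {question}")
        lines.append(f"A: {dsl}")
    lines.append(f"Q: {query}")
    lines.append("A:")  # the model continues from here with a DSL string
    return "\n".join(lines)

prompt = build_prompt(EXAMPLES, "most catches by a fielder in 2019")
print(prompt)
```

The model has never seen this DSL in training, yet after two demonstrations it completes the final "A:" in the demonstrated format; that completion is what the next section analyzes as Bayesian updating.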

  4. 13:00–19:13

    Bayesian Updating as Evidence

    1. ET

      learning.

    2. VM

      Y-yeah. So, so w-when you think about what in-context learning is, is that a-as you see evidence, so, so, you know, uh, i-in the first paper, what I also did was I, I took this cricket DSL example.

    3. ET

      Okay.

    4. VM

      And I, uh, uh, I depicted the next token probabilities-

    5. ET

      Mm-hmm

    6. VM

      ... of the model as it was shown more and more examples. So the first time you show it this DSL, the natural language and the DSL, the probabilities of the DSL tokens were, were extremely low because GPT-3 had never seen this thing. When it saw the cricket question, in its mind, it was trying to continue it with an English answer. So the probabilities that were high were all English words.

    7. ET

      Yeah.

    8. VM

      Once it saw my prompt where I had the question and the DSL, the next time I had the question in the next row, the probabilities of the DSL token started going up. With every example, it went up, and finally, when I gave the new query, it was like it had almost 100% probability of getting the right token.

    9. ET

      Yeah.

    10. VM

      So this is an example of in real time, the model was updating its posterior probability. It was updating its knowledge that, okay, I've seen evidence, this is what I'm supposed to do. Now, this is a colloquial way of saying what Bayesian-

    11. ET

      Yeah

    12. VM

      ... inference is. Bayesian updating basically is you start with a prior. When you see new evidence, you update your posterior. That's the mathematical definition. But, but-

    13. ET

      Yeah

    14. VM

      ... uh, in, in English, it's basically you see something, you see new evidence, you update your belief about what's happening.

    15. ET

      Yeah.

    16. VM

      Right? So it was clear to me that LLMs are doing something which resembles Bayesian updating. So in that first paper, I had this matrix formulation, and I showed that, you know, what it's doing, it looks like Bayesian updating.

    17. ET

      Yeah.

    18. VM

      Then we can come to the sort of next series of papers.

    19. ET

      That's right. So, okay. So I mean, it, it, it seemed pretty conclusive to me at that time.

    20. VM

      Yeah.

    21. ET

      And then you went quiet for a while, and then I, I still remember the WhatsApp text. You said-

    22. VM

      Yeah

    23. ET

... "Martin, I know exactly how these things are working now." [both chuckling]

    24. VM

      Yeah. Well-

    25. ET

      And then, and then, and then listen, you dropped a series of papers that kind of broke the internet. Like you went super viral on Twitter.

    26. VM

      Yeah.

    27. ET

      Like, I mean, people really noticed. Um, uh, and so I, I want to get to that in just a second.

    28. VM

      Yeah.

    29. ET

      But before that, um, I remember when your first paper came out, people would be like, "You know, these things are definitely not Bayesian." Like-

    30. VM

      Mm
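[Editor's note] The Bayesian updating described in this section is textbook Bayes' rule, and a minimal sketch makes the analogy concrete. Two hypothetical coin hypotheses play the role of "continue in English" vs. "continue in the DSL": each new observation shifts belief, just as each demonstration shifted probability onto DSL tokens.

```python
from fractions import Fraction

LIKELIHOOD = {"fair": Fraction(1, 2), "biased": Fraction(9, 10)}  # P(heads | h)
posterior = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}    # uniform prior

def update(belief, heads):
    """One Bayes step: weight each hypothesis by the likelihood of the
    observation, then renormalize so probabilities sum to 1."""
    unnorm = {h: p * (LIKELIHOOD[h] if heads else 1 - LIKELIHOOD[h])
              for h, p in belief.items()}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

# Five heads in a row: belief concentrates on the "biased" hypothesis.
for _ in range(5):
    posterior = update(posterior, heads=True)
print(float(posterior["biased"]))  # ≈ 0.95
```

The claim in the papers is not that transformers run this loop explicitly, but that their next-token probabilities track exactly this kind of posterior as evidence accumulates in the context.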

  5. 19:13–27:22

    Bayesian Wind Tunnel Tests

    1. ET

      it. Got it.

    2. VM

      So then we came up with this idea, you know, my colleagues, uh, Naman Agarwal and Siddharth Dalal, we, the, the series of papers were, were written with them. We came up with this idea of a Bayesian wind tunnel.

    3. ET

      Okay.

    4. VM

      So what's a wind tunnel? Well, wind tunnel in the aerospace industry is where you test an aircraft in an isolated environment. You don't fly it, and you test, test it against all sorts of, uh, you know, aerodynamic pressure. Then you see what, what it'll withstand, what kind of altitude, pressure, blah, blah, blah. Right? You don't want to do it up in the air testing.

    5. ET

      Yeah.

    6. VM

      So we said, okay, why don't we create an environment where we take these architectures, and we tested transformers, Mamba, LSTMs, uh, MLPs, a-all architectures. We said, why don't we create, take a blank architecture, give it a task where it's impossible for the architecture to memorize what the solution to that task should be. The space is combinatorially-

    7. ET

      Yeah

    8. VM

... impossible, given the number of parameters, and we took very small models. So it's difficult enough that they cannot memorize it.

    9. ET

      Yeah.

    10. VM

      But it's tractable enough that we know precisely what the, the Bayesian posterior should be. You can calculate it analytically. So we gave these models a bunch of tasks where, again, we show that it's impossible to memorize. We trained these models, and we found that the transformer got the precise Bayesian posterior down to ten to the power minus three bits accuracy. It was matching the distribution perfectly. So it is actually doing Bayesian in the mathematical sense, given a task-

    11. ET

      Wow

    12. VM

      ... where it has to update its belief. Uh, Mamba also does it reasonably well. LSTMs can do one of the things. So the, the, in the papers, we have a taxonomy of Bayesian tasks. Transformer does everything, Mamba does most of it, LSTMs do only partially, and MLPs fail completely.

    13. ET

      So is this a reflection of the data that it's trained on, or is it more a reflection of the mechanism?

    14. VM

      It's the mechanism, it's the architecture. The data decides what tasks it learns.

    15. ET

      Right.

    16. VM

      So in the first paper, we had these Bayesian wind tunnels, and we show that, you know, it's doing the job at different tasks. In the second paper, we show why it does it. So we look at the transformers, we look at the gradients, and we show how the gradients actually shape this geometry-

    17. ET

      Ah

    18. VM

      ... which enables this Bayesian updating to happen. Then in the third paper, what we did, we take, we took these frontier production LLMs, which have open weights so that we could look inside them, and we did our testing, and we saw that the geometries that we saw in the small, uh, models persisted in models which are, you know, hundreds of millions of parameters. The same signature existed. The only thing is that, uh, because they are trained on all sorts of data, it's a little bit dirty or messy.

    19. ET

      Yeah.

    20. VM

      But you can see the same structure. So the, the whole idea behind the Bayesian wind tunnel was, unlike these, uh, production LLMs, where you don't know what they have been trained on-

    21. ET

      Right

    22. VM

      ... so you cannot mathematically compute the posterior. So again, how do you prove it? I mean, it looks Bayesian, you know, from the first paper.

    23. ET

      From the first paper, yeah.

    24. VM

      From the paper it looks Bayesian, but, you know.

    25. ET

      Looks Bayesian to me. Yeah.

    26. VM

      So the wind tunnel sort of solved that problem for us.

    27. ET

      Mm-hmm.

    28. VM

      We said, okay, let's start with a blank architecture, give it a task where we know what the answer is. It cannot memorize it. Let's see what it does. And yeah.

    29. ET

      So do you think this provides any sort of, like, indication of how humans think, or do you think that these things are totally independent?

    30. VM

      No, no, it, it does provide, right? So, you know, human beings also, uh, update our beliefs as we see new evidence.
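[Editor's note] A sketch of the "wind tunnel" idea, as a reconstruction rather than the paper's exact task: sequences are coin flips whose hidden bias is drawn from a known prior, so the Bayes-optimal next-token probability has a closed form. A small trained model's next-token output can then be checked against this exact number, which is impossible for production LLMs with unknown training data.

```python
from fractions import Fraction

BIASES = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]  # hypothesis space
PRIOR = Fraction(1, 3)                                      # uniform prior over biases

def posterior_predictive(prefix):
    """Exact P(next flip = 1 | prefix), summing over the hidden bias."""
    weights = []
    for b in BIASES:
        lik = Fraction(1)
        for x in prefix:
            lik *= b if x == 1 else 1 - b  # likelihood of each observed flip
        weights.append(PRIOR * lik)
    z = sum(weights)  # normalizing constant
    return sum((w / z) * b for w, b in zip(weights, BIASES))

# After three 1s, the optimal probability of another 1 is pulled well above
# 1/2; this analytic target is what the trained model must reproduce.
print(float(posterior_predictive([1, 1, 1])))  # ≈ 0.68
```

With tasks built this way, "is it Bayesian?" stops being a matter of interpretation: the model's distribution either matches the computable posterior or it doesn't.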

  6. 27:22–36:34

    Brains Simulate Causality

    1. VM

      coming back to it, we, we are Bayesian-

    2. ET

      Yeah

    3. VM

      ... but we do something else. You know, when I, when I, when I throw this pen at you, what'll you do?

    4. ET

      Dodge it or-

    5. VM

      Dodge it

    6. ET

      ... dodge it? Yeah.

    7. VM

      Why will you dodge it?

    8. ET

      Um, to avoid being hit.

    9. VM

      Avoid being hit.

    10. ET

      Yeah.

    11. VM

      But your head is not doing a Bayesian calculation of, uh, okay, this pen is coming, the probability that it hits me, uh, it'll cause this much pain or all that.

    12. ET

      Correct.

    13. VM

      What you're essentially doing in your head is you're doing a simulation.

    14. ET

      Ah, right.

    15. VM

      You see the b-uh, the, the, the pen coming and you know that it'll come and hit me. Your mind simulates and you dodge it, right? So all of deep learning is, uh, doing correlations. It's not doing causation.

    16. ET

      Yeah.

    17. VM

      Causal models are the ones that are able to do simulations and interventions. So, you know, Judea Pearl has this whole, uh, causal hierarchy-

    18. ET

      Yeah

    19. VM

... where the first level of the hierarchy is association, which is you build these correlation models. Deep learning is beautiful. It, it's extremely powerful. I mean, you see every day, all these models are, like, amazingly good.

    20. ET

      Yeah.

    21. VM

      They do association. The second is intervention in the hierarchy.

    22. ET

      Yeah.

    23. VM

      Deep learning models do not do that. Third is counterfactual. So both intervention and counterfactual, you can imagine it, it, it's some sort of simulation. You, you build a model of, causal model of what's happening, and then you are able to simulate. So our brains do that. The current architectures don't do that. Another example I think which will make it clear is, uh, the difference between, uh, I'll use these technical term, Shannon entropy-

    24. ET

      Mm-hmm

    25. VM

      ... and Kolmogorov complexity.

    26. ET

      Sure.

    27. VM

      So if you look at the Shannon entropy of the digits of pi-

    28. ET

      Yeah

    29. VM

      ... it's infinite.

    30. ET

      Sure.
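[Editor's note] The Shannon-entropy versus Kolmogorov-complexity contrast can be made concrete with the digits of pi themselves. The spigot below is a standard streaming pi-digit algorithm (Rabinowitz–Wagon style), used here only as the "short program": its output looks statistically random (per-digit entropy near the maximum of log2(10) ≈ 3.32 bits), yet the fixed-size program generates arbitrarily many digits, so their Kolmogorov complexity is tiny.

```python
import math
from collections import Counter

def pi_digits():
    """A short program: stream the decimal digits of pi (spigot algorithm)."""
    q, r, t, j = 1, 180, 60, 2
    while True:
        u = 3 * (3 * j + 1) * (3 * j + 2)
        y = (q * (27 * j - 12) + 5 * r) // (5 * t)
        yield y
        q, r, t, j = (10 * q * j * (2 * j - 1),
                      10 * u * (q * (5 * j - 2) + r - y * t),
                      t * u, j + 1)

# Measure the empirical Shannon entropy of the first 300 digits: it sits
# near the 3.32 bits/digit of a uniformly random digit stream...
gen = pi_digits()
digits = [next(gen) for _ in range(300)]
counts = Counter(digits)
entropy = -sum((c / 300) * math.log2(c / 300) for c in counts.values())

# ...yet the ~10-line generator above is a complete description of them.
print(digits[:6], round(entropy, 2))
```

That gap, high measured entropy but a short generating program, is the sense in which "finding the short program" (a compressed causal model) is different from modeling surface statistics.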

  7. 36:34–42:17

    Manifolds and New Representations

    1. VM

      Yeah.

    2. ET

You know, another way that I've always thought about these, and I thought you articulated it well the last time we talked about it, which is the universe is this very, very complex space, and then, you know, somehow humans map it into a manifold-

    3. VM

      Mm-hmm

    4. ET

      ... that's less complex.

    5. VM

      Yeah.

    6. ET

      And then that gets kind of written down, and then the LLM... So that's kinda some, some distribution, some-

    7. VM

      Mm.

    8. ET

      You know, it's still a very large space, but it's-

    9. VM

      Yeah

    10. ET

      ... it's, it's a bounded space, and the LLM learned that manifold, and then they kind of use, you know, Bayesian inference to move up and down that manifold.

    11. VM

      Right.

    12. ET

      But they're kind of bound to that manifold.

    13. VM

      Yeah.

    14. ET

      And then, again, I don't wanna put words in your mouth, and then, but, like, what they can't do is, is generate a new manifold.

    15. VM

      New manifold, yeah, yeah.

    16. ET

      Which requires understanding the way that the universe works, then coming up with a new representation of the universe.

    17. VM

      Yeah. And th- this is what relativity is, right?

    18. ET

      Yeah, exactly.

    19. VM

      Einstein had to create a new manifold.

    20. ET

      Yeah, yeah, yeah.

    21. VM

      If you just stuck with the old manifold of the Newtonian physics-

    22. ET

      Right

    23. VM

... then you would see these correlations, but you could not come up with a manifold that explains them. So you need to come up with a new representation.

    24. ET

      Yeah.

    25. VM

      So to me, you know, there are lots of definitions of AGI, uh, you know, Turing test, we have already passed that. You know, performing economically useful work, every day you see, you know, uh, LLMs are doing that.

    26. ET

      Do we? I don't know.

    27. VM

      No, I mean, they are.

    28. ET

      I mean, um, I mean, without human intervention?

    29. VM

      No, no, no. So, so that, that's different.

    30. ET

      Okay.

  8. 42:17–46:48

    Simulation as Short Program

    1. ET

      Can you, and can you, can you tie the two things? Like, how does that pair with doing simulation, or is a simulation totally orthogonal?

    2. VM

      No, s- simulation is, uh, is related, right?

    3. ET

      So you think it, like, basically you do simulation and somehow that is a step towards doing the Kolmogorov complexity?

    4. VM

      It, it's, it's, the simulator is the, i- is the program that we create. It may not be the perfect program.

    5. ET

      Oh, I see. And you say-

    6. VM

      But in our heads we create this, uh, simulator that when I'm throwing the pen, you know that it's coming at you, right?

    7. ET

      Yeah.

    8. VM

And you duck. So, so you're not computing the probabilities, uh, as it goes, but, but you have, you know, you, you build an approximate-

    9. ET

      That's a very physical thing versus we are talking more conceptually.

    10. VM

      Conceptually, but, but it's a similar thing.

    11. ET

      And you think those are the same mechanism?

    12. VM

      It's the same mechanism.

    13. ET

      Really?

    14. VM

      Yeah. You, you have to build a causal model.

    15. ET

      Yeah.

    16. VM

      Right?

    17. ET

      I see. I see. Yeah.

    18. VM

      For most things, right?

    19. ET

      Yeah.

    20. VM

      So you have to move from correlation to causation. I mean, we've heard this term-

    21. ET

      Yeah

    22. VM

      ... you know, ad infinitum.

    23. ET

      Yeah.

    24. VM

      But here it, it's making a difference in the way we view intelligence.

    25. ET

      Yeah. How, how, how has, how has the last three papers been received?

    26. VM

      No, I don't know. They're, they're... Well, uh, uh-

    27. ET

      I mean, I mean-

    28. VM

The arXiv versions were like-

    29. ET

      Let me tell you.

    30. VM

      Yeah.
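[Editor's note] The pen-dodging example that runs through this conversation can be sketched as a tiny causal simulator: instead of estimating P(hit) from correlations, you roll a forward model of the physics and intervene on the outcome. All numbers below are illustrative.

```python
def simulate_throw(x0, y0, vx, vy, target_x, dt=0.01, g=9.8):
    """Forward-simulate the pen's trajectory until it reaches the target's x."""
    x, y = x0, y0
    while x < target_x:
        x += vx * dt   # constant horizontal velocity
        vy -= g * dt   # gravity updates vertical velocity
        y += vy * dt
    return y           # height at which the pen crosses the target plane

def should_duck(head_height, predicted_y, margin=0.15):
    """The intervention: act only if the simulated trajectory threatens the head."""
    return abs(predicted_y - head_height) < margin

# Simulate once, then act on the counterfactual ("if I don't move, it hits me").
y_at_target = simulate_throw(x0=0.0, y0=1.5, vx=5.0, vy=1.0, target_x=2.0)
print(should_duck(head_height=1.1, predicted_y=y_at_target))  # True: duck
```

This is Pearl's hierarchy in miniature: the simulator is a causal model you can query with interventions ("what if the head were elsewhere?"), something a pure correlation model cannot answer.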

Episode duration: 46:48


Transcript of episode zwDmKsnhl08
