
Why Scale Will Not Solve AGI | Vishal Misra - The a16z Show

Vishal Misra returns to explain his latest research on how LLMs actually work under the hood. He walks through experiments showing that transformers update their predictions in a precise, mathematically predictable way as they process new information, explains why this still doesn't mean they're conscious, and describes what's actually required for AGI: the ability to keep learning after training and the move from pattern matching to understanding cause and effect.

Timestamps:
00:00 — Introduction
02:58 — LLM as Giant Matrix
08:24 — What Is In-Context Learning
13:00 — Bayesian Updating as Evidence
19:13 — Bayesian Wind Tunnel Tests
27:22 — Brains Simulate Causality
36:34 — Manifolds and New Representations
42:17 — Simulation as Short Program

Read the full transcript here: https://www.a16z.news/s/podcast

Resources:
Follow Vishal Misra on X: https://x.com/vishalmisra
Follow Martin Casado on X: https://x.com/martin_casado

Stay Updated:
If you enjoyed this episode, be sure to like, subscribe, and share with your friends!
Find a16z on X: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Listen to the a16z Show on Spotify: https://open.spotify.com/show/5bC65RDvs3oxnLyqqvkUYX
Listen to the a16z Show on Apple Podcasts: https://podcasts.apple.com/us/podcast/a16z-podcast/id842818711
Follow our host: https://x.com/eriktorenberg

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see http://a16z.com/disclosures.

Vishal Misra (guest) · Erik Torenberg (host)
Mar 17, 2026 · 46m · Watch on YouTube ↗

CHAPTERS

  1. Why LLMs Aren’t Conscious—and a Concrete “Einstein Test” for AGI

    Misra argues that today’s LLMs, despite impressive products, are fundamentally “grains of silicon doing matrix multiplication” rather than conscious agents. He proposes a stringent AGI benchmark: train a model only on pre-relativity physics and see whether it can derive relativity.

  2. From Early GPT-3 Experiments to (Proto-)RAG at ESPN

    Misra recounts getting early access to GPT-3 and using it to translate natural language questions into a custom query format for a cricket statistics database. This required retrieving relevant examples and prompting the model to generalize in-context, which later became a production system at ESPN.

  3. LLMs as a Giant Prompt→Next-Token Probability Matrix

    To make transformer behavior interpretable, Misra introduces a mental model: a massive matrix with one row per possible prompt, each row mapping to a probability distribution over the vocabulary for the next token. Because the true matrix is astronomically large, models must learn a compressed approximation of a sparse structure.
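This mental model can be sketched as a literal lookup table. The prompts, tokens, and probabilities below are invented for illustration; a real model computes each row on the fly rather than storing it:

```python
# Misra's mental model: a (conceptual) matrix with one row per possible
# prompt, where each row is a probability distribution over the next token.
# Real models cannot store this matrix -- it is astronomically large and
# sparse -- so training learns a compressed function that approximates it.

VOCAB = ["runs", "wickets", "overs"]

# An explicit two-row slice of the "matrix" (toy numbers).
MATRIX = {
    "how many": {"runs": 0.6, "wickets": 0.3, "overs": 0.1},
    "he bowled": {"runs": 0.1, "wickets": 0.7, "overs": 0.2},
}

def next_token_distribution(prompt: str) -> dict:
    """Row lookup; an actual LLM computes this row instead of storing it."""
    return MATRIX[prompt]

row = next_token_distribution("how many")
assert abs(sum(row.values()) - 1.0) < 1e-9  # every row sums to 1
```

The compression point is the whole argument: the number of possible prompts vastly exceeds the number of parameters, so the model must exploit the matrix's structure rather than memorize rows.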

  4. What In-Context Learning Is (and Why It Feels Like Magic)

    They define in-context learning as the model’s ability to infer a new task from a few examples inside the prompt, then apply it to a new query immediately. Misra uses his cricket system as a vivid example: GPT-3 outputs a domain-specific language it had never seen before moments earlier.

  5. In-Context Learning as Bayesian Updating: Priors, Evidence, Posteriors

    Misra connects in-context learning to Bayesian inference: as the model sees evidence (examples), its next-token posterior shifts toward the task-specific tokens. He describes tracking token probabilities across examples, showing DSL tokens rise from near-zero probability to dominance.
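The dynamic can be sketched with a two-hypothesis toy model. The prior and likelihood numbers below are invented for illustration, not taken from Misra's paper:

```python
# A toy sketch of in-context learning as Bayesian updating. Two candidate
# "tasks" compete: answer in plain English vs. emit the cricket DSL.

prior = {"english": 0.99, "dsl": 0.01}   # DSL tokens start near zero

# How probable one DSL-formatted in-context example is under each task.
likelihood = {"english": 0.05, "dsl": 0.90}

def bayes_update(posterior, lik):
    """One step of Bayes' rule: posterior proportional to prior * likelihood."""
    unnorm = {t: posterior[t] * lik[t] for t in posterior}
    z = sum(unnorm.values())
    return {t: p / z for t, p in unnorm.items()}

posterior = dict(prior)
for example in range(3):                 # three in-context examples
    posterior = bayes_update(posterior, likelihood)
    print(example + 1, round(posterior["dsl"], 3))

# After a few examples, DSL tokens dominate the next-token distribution.
assert posterior["dsl"] > 0.9
```

Each in-context example multiplies the odds in favor of the DSL task, which is why the shift from near-zero to dominant happens in just a handful of examples.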

  6. Backlash to “Bayesian LLMs” and the Need for a Formal Proof

    They discuss why calling LLM behavior “Bayesian” drew criticism—partly due to Bayesian vs. frequentist cultural divides and the claim that “anything can be described as Bayesian.” Misra explains that the first paper was compelling but empirical, prompting a more rigorous mathematical demonstration.

  7. TokenProbe: Looking Inside Next-Token Probabilities and Entropy

    Misra describes building TokenProbe, a tool that exposes next-token distributions and entropy for open-weight models, helping researchers and students see how prompts reshape predictions. The tool became part of their workflow and teaching, enabling deeper experimentation than closed interfaces.
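TokenProbe itself is not reproduced here; this is a minimal sketch of the two quantities such a tool exposes — the next-token distribution and its Shannon entropy in bits. The distributions below are hypothetical, not real model outputs:

```python
import math

def entropy_bits(dist):
    """Shannon entropy H(p) = -sum_i p_i * log2(p_i), in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# A vague prompt leaves the model spread across many plausible tokens...
uncertain = {"the": 0.25, "a": 0.25, "in": 0.25, "on": 0.25}
# ...while a constraining prompt collapses the distribution.
confident = {"Paris": 0.97, "Lyon": 0.02, "Nice": 0.01}

print(entropy_bits(uncertain))   # 2.0 bits: four equally likely tokens
print(entropy_bits(confident))   # ~0.22 bits: prediction nearly locked in
```

Watching this entropy number fall as examples are added to the prompt is exactly the kind of experiment a closed chat interface hides.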

  8. The “Bayesian Wind Tunnel”: Controlled Tasks to Test Architectures

    To prove Bayesian behavior rigorously, Misra and colleagues construct “wind tunnel” tasks: environments too combinatorial to memorize but simple enough that the Bayesian posterior can be computed exactly. They train small models and compare outputs to the analytic posterior, isolating architectural capability rather than dataset artifacts.
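The scoring step of such a comparison might look like the sketch below. It assumes (my assumption; the summary does not specify the metric) that fidelity is measured as a KL divergence in bits between the exact posterior and the model's output, with made-up numbers:

```python
import math

def kl_bits(p, q):
    """KL(p || q) in bits; assumes q[k] > 0 wherever p[k] > 0."""
    return sum(pk * math.log2(pk / q[k]) for k, pk in p.items() if pk > 0)

# The wind-tunnel task is simple enough to compute this exactly...
analytic_posterior = {"A": 0.70, "B": 0.20, "C": 0.10}
# ...and the trained model's next-token distribution is read off directly.
model_output       = {"A": 0.69, "B": 0.21, "C": 0.10}

gap = kl_bits(analytic_posterior, model_output)
print(gap)  # under 1e-3 bits for a match this close
```

Because the ground-truth posterior is analytic rather than estimated from data, any remaining gap reflects the architecture, not dataset artifacts.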

  9. Results: Transformers Match the Bayesian Posterior Extremely Closely

    Misra reports that transformers recover the Bayesian posterior with very high precision (down to ~1e-3 bits), while other architectures show partial success. This yields a taxonomy: transformers perform broadly, Mamba mostly, LSTMs partially, and MLPs fail—suggesting Bayesian updating emerges from architectural mechanisms.

  10. Humans vs. LLMs: Plasticity, Objectives, and Why Models “Forget”

    They contrast human cognition with LLM inference: humans continually update internal circuitry (plastic synapses) and retain learning across life, while LLM weights are frozen at inference. Misra also argues “agentic” behaviors stem from training data and next-token prediction objectives, not inherent drives like survival.

  11. From Correlation to Causation: Simulation, Intervention, Counterfactuals

    Misra claims the key missing ingredient for AGI is causal modeling: the ability to simulate outcomes and reason about interventions and counterfactuals. He uses the “dodging a thrown pen” example to illustrate that humans often act via internal simulation rather than explicit probabilistic calculation, aligning with Judea Pearl’s causal hierarchy.

  12. Shannon Entropy vs. Kolmogorov Complexity: Why Scale Won’t Be Enough

    Misra reframes the limitation: LLMs operate in a Shannon-entropy world of correlation, while breakthroughs like physics theories compress reality into short programs (Kolmogorov complexity). Using π as an example—high entropy digits but low program length—he argues intelligence needs the ability to discover compact generative programs, not just fit distributions.
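The π example can be made concrete. The generator below uses Gibbons's unbounded spigot algorithm (my choice of illustration, not something from the episode): a few lines of integer arithmetic — low Kolmogorov complexity — whose output nonetheless has near-maximal per-digit Shannon entropy:

```python
from math import log2

def pi_digits(n):
    """Gibbons's unbounded spigot: stream the first n decimal digits of pi."""
    out, q, r, t, k, m, x = [], 1, 0, 1, 1, 3, 3
    while len(out) < n:
        if 4 * q + r - t < m * t:
            out.append(m)  # next digit is settled; shift it out
            q, r, m = 10 * q, 10 * (r - m * t), (10 * (3 * q + r)) // t - 10 * m
        else:
            q, r, t, k, m, x = (q * k, (2 * q + r) * x, t * x, k + 1,
                                (q * (7 * k + 2) + r * x) // (t * x), x + 2)
    return out

def entropy_bits_per_digit(seq):
    """Empirical Shannon entropy of a digit sequence, in bits per digit."""
    n = len(seq)
    return -sum(seq.count(d) / n * log2(seq.count(d) / n) for d in set(seq))

digits = pi_digits(200)
assert digits[:5] == [3, 1, 4, 1, 5]
# Close to the log2(10) ~ 3.32-bit maximum for base-10 digits,
# yet the whole sequence comes from a tiny program.
print(entropy_bits_per_digit(digits))
```

A purely statistical learner sees only the high-entropy digit stream; discovering the short program behind it is the compression step Misra argues scale alone does not buy.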

  13. Manifolds, New Representations, and the Relativity Example

    They discuss how humans build a workable manifold (representation) of the world, while LLMs learn and navigate within the manifold present in their training data. Einstein succeeded by creating a new manifold for spacetime; Misra argues current models tend to treat paradigm-shifting evidence as anomalies rather than triggers for new representations.

  14. Continual Learning, Knuth’s Example, and What ‘Next’ Looks Like

    Misra emphasizes two prerequisites for AGI: (1) true plasticity via continual learning without catastrophic forgetting, and (2) causal modeling/simulation that supports new representations. They discuss Donald Knuth’s recent LLM-assisted work as ‘hacked plasticity’ through memory/context updates—useful within known manifolds, but still relying on the human to form the final compact theory.
