
Why Scale Will Not Solve AGI | Vishal Misra - The a16z Show

Vishal Misra returns to explain his latest research on how LLMs actually work under the hood. He walks through experiments showing that transformers update their predictions in a precise, mathematically predictable way as they process new information, explains why this still doesn't mean they're conscious, and describes what's actually required for AGI: the ability to keep learning after training and the move from pattern matching to understanding cause and effect.

Timestamps:
00:00 — Introduction
02:58 — LLM as Giant Matrix
08:24 — What Is In-Context Learning
13:00 — Bayesian Updating as Evidence
19:13 — Bayesian Wind Tunnel Tests
27:22 — Brains Simulate Causality
36:34 — Manifolds and New Representations
42:17 — Simulation as Short Program

Read the full transcript here: https://www.a16z.news/s/podcast

Resources:
Follow Vishal Misra on X: https://x.com/vishalmisra
Follow Martin Casado on X: https://x.com/martin_casado

Stay Updated: If you enjoyed this episode, be sure to like, subscribe, and share with your friends!
Find a16z on X: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Listen to the a16z Show on Spotify: https://open.spotify.com/show/5bC65RDvs3oxnLyqqvkUYX
Listen to the a16z Show on Apple Podcasts: https://podcasts.apple.com/us/podcast/a16z-podcast/id842818711
Follow our host: https://x.com/eriktorenberg

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see http://a16z.com/disclosures.

Guest: Vishal Misra · Host: Erik Torenberg
Mar 16, 2026 · 46m · Watch on YouTube ↗

At a glance

WHAT IT’S REALLY ABOUT

LLMs do Bayesian inference, but lack plasticity and causality for AGI

  1. Misra models an LLM as an astronomically large, sparse matrix mapping every possible prompt to a probability distribution over next tokens, with training learning a compressed approximation of this matrix.
  2. He interprets in-context learning as real-time Bayesian updating: as the model sees examples in the prompt, its posterior over “what task am I doing?” shifts toward the correct completion format (e.g., a never-before-seen DSL).
  3. To move beyond suggestive examples, Misra introduces “Bayesian wind tunnel” experiments where the true Bayesian posterior is analytically known and memorization is infeasible, showing transformers match the posterior with ~10^-3 bits accuracy (with Mamba partial, LSTMs limited, MLPs failing).
  4. He argues today’s LLMs remain correlation/association machines (Shannon-entropy world) and lack causal intervention/counterfactual simulation (Kolmogorov-complexity/short-program world), which humans use to act and to form new representations.
  5. Misra concludes “scale will not solve everything”: AGI requires (1) plasticity via robust continual learning without catastrophic forgetting and (2) causal-model-based simulation that can generate new manifolds/representations, illustrated by an “Einstein test” for discovering relativity from pre-1916 physics.
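Point 4's contrast between the Shannon-entropy world and the Kolmogorov-complexity "short program" world can be made concrete with a toy sketch (illustrative code, not from the talk): a long, highly structured sequence is fully described by a tiny generator, while a pure association table must store every entry explicitly.

```python
# The "short program" (Kolmogorov) intuition: a few lines of generating
# code describe an arbitrarily long sequence, whereas an association-style
# lookup table grows linearly with the data it memorizes.

def short_program(n: int) -> list[int]:
    """A tiny generator that produces an arbitrarily long sequence."""
    return [i * i % 7 for i in range(n)]

data = short_program(10_000)                       # 10,000 observations
lookup_table = {i: v for i, v in enumerate(data)}  # correlation-style storage

# The generator's description length is constant in n; the table's is not.
print(len(lookup_table))  # 10000 entries vs. ~3 lines of generating code
```

On this view, a system that finds the short causal program has "understood" the data in a way that a system storing the full table has not, which is Misra's distinction between association and simulation.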

IDEAS WORTH REMEMBERING

5 ideas

An LLM can be usefully viewed as a compressed prompt→distribution lookup table.

Misra’s “giant matrix” abstraction frames each prompt as a row and the next-token probability distribution as columns, clarifying that training is learning a compressed approximation of an unimaginably large but sparse object.
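A minimal sketch of this abstraction (toy corpus and names are illustrative, not from the talk): rows are prompts, columns are next tokens, and entries are empirical next-token probabilities. The real object is astronomically larger — roughly 50,000^8,000 rows for an 8,000-token context — which is why training can only learn a compressed approximation.

```python
# Toy version of the "giant matrix" view: each row is a prompt (a token
# sequence), each column a candidate next token, entries are probabilities.
from collections import Counter

corpus = ["the cat sat", "the cat ran", "the dog sat"]
counts: dict[tuple[str, ...], Counter] = {}
for sent in corpus:
    toks = sent.split()
    for i in range(1, len(toks)):
        counts.setdefault(tuple(toks[:i]), Counter())[toks[i]] += 1

def row(prompt: tuple[str, ...]) -> dict[str, float]:
    """Look up one row of the matrix: the next-token distribution."""
    c = counts[prompt]
    total = sum(c.values())
    return {tok: n / total for tok, n in c.items()}

print(row(("the", "cat")))  # {'sat': 0.5, 'ran': 0.5}
```

The explicit table above only works because the toy corpus is tiny; at real vocabulary and context sizes the matrix is unimaginably large but extremely sparse, and the model's weights play the role of the compressed approximation.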

In-context learning behaves like Bayesian posterior updating over task hypotheses.

As examples accumulate in the prompt (e.g., question→DSL pairs), probability mass shifts from generic English continuations toward the DSL tokens, matching the intuition of updating beliefs as new evidence arrives.
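The belief-updating intuition can be sketched with a two-hypothesis Bayes update (the hypotheses and likelihood numbers are made up for illustration): each question→DSL example in the prompt is evidence that shifts mass from "continue in plain English" to "emit the DSL."

```python
# Toy Bayesian update over "which task am I doing?" as in-context
# examples arrive. Likelihoods are invented for illustration.

def update(prior: dict[str, float], likelihood: dict[str, float]) -> dict[str, float]:
    """One Bayes step: posterior ∝ prior × likelihood, then normalize."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

posterior = {"english": 0.99, "dsl": 0.01}          # prior before any examples
example_likelihood = {"english": 0.05, "dsl": 0.9}  # each question→DSL pair

for _ in range(3):  # three in-context examples
    posterior = update(posterior, example_likelihood)
    print(posterior)
```

After a handful of examples the DSL hypothesis dominates, mirroring how the model's next-token mass moves toward DSL tokens as the prompt accumulates evidence.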

Transformers can implement Bayesian inference precisely under controlled conditions.

The “Bayesian wind tunnel” constructs tasks where memorization is combinatorially impossible and the correct posterior is computable; transformers match that posterior to ~10^-3 bits, suggesting the mechanism is architectural, not a coincidence of data.
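The evaluation metric can be sketched as follows (a minimal illustration, assuming the comparison is a KL divergence measured in bits; the coin example stands in for the paper's actual tasks): with an analytically known posterior, a model's distance from exact Bayesian inference is a concrete number.

```python
# Sketch of the "wind tunnel" metric: compare a model's predictive
# distribution to the analytically known Bayesian posterior, in bits.
# A perfect Bayesian learner scores ~0 bits.
import math

def kl_bits(p: list[float], q: list[float]) -> float:
    """KL divergence D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Beta(1,1) prior on a coin; after 7 heads in 10 flips, the exact
# posterior predictive for the next flip is (7+1)/(10+2) heads.
analytic = [8 / 12, 4 / 12]
model_output = [0.667, 0.333]  # stand-in for a trained model's output
print(kl_bits(analytic, model_output))  # prints a tiny value near 0
```

In this vocabulary, Misra's result is that trained transformers land around 10^-3 bits from the analytic posterior on tasks where memorization is combinatorially ruled out.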

Architecture matters: not all sequence models are equally Bayesian-capable.

In Misra’s reported taxonomy, transformers succeed broadly, Mamba succeeds on many tasks, LSTMs only partially, and MLPs fail—implying attention-like mechanisms strongly support Bayesian-style updating.

LLMs are ‘frozen’ learners; humans are plastic learners.

LLMs update beliefs within a context window but do not retain that learning across fresh sessions without external memory or weight updates, whereas human synapses remain plastic and integrate new experience over a lifetime.

WORDS WORTH SAVING

5 quotes

Anthropic makes great products. Claude Code is fantastic, CoWork is fantastic, but they are grains of silicon doing matrix multiplication. They don't have consciousness. They don't have an inner monologue.

Vishal Misra

If you look at all possible combinations of 8,000 tokens and a 50,000-token vocabulary, the number of rows in this matrix is more than the number of electrons across all galaxies, right?

Vishal Misra

So this is an example of in real time, the model was updating its posterior probability. It was updating its knowledge that, okay, I've seen evidence, this is what I'm supposed to do.

Vishal Misra

We trained these models, and we found that the transformer got the precise Bayesian posterior down to ten to the power minus three bits accuracy. It was matching the distribution perfectly.

Vishal Misra

You can start answering some of these questions, and one of the misconceptions that exists today is that scale will solve everything. Scale will not solve everything.

Vishal Misra

LLMs as a giant prompt→next-token distribution matrix
Sparsity and compression in language modeling
In-context learning as Bayesian updating
Cricket DSL + semantic search (early RAG-style system)
TokenProbe: viewing token probabilities and entropy
Bayesian wind tunnel experiments and analytic posteriors
Transformer vs Mamba vs LSTM vs MLP Bayesian capability
Frozen weights vs human plasticity and continual learning
Correlation (association) vs causation (intervention/counterfactual)
Shannon entropy vs Kolmogorov complexity (short programs)
Manifolds/representations and the “Einstein test” for AGI

High quality AI-generated summary created from speaker-labeled transcript.
