At a glance
WHAT IT’S REALLY ABOUT
LLMs do Bayesian inference, but lack the plasticity and causal reasoning needed for AGI
- Misra models an LLM as an astronomically large, sparse matrix mapping every possible prompt to a probability distribution over next tokens, with training learning a compressed approximation of this matrix.
- He interprets in-context learning as real-time Bayesian updating: as the model sees examples in the prompt, its posterior over “what task am I doing?” shifts toward the correct completion format (e.g., a never-before-seen DSL).
- To move beyond suggestive examples, Misra introduces “Bayesian wind tunnel” experiments in which the true Bayesian posterior is analytically known and memorization is infeasible, showing transformers match the posterior to within ~10^-3 bits (Mamba partially succeeds, LSTMs are limited, MLPs fail).
- He argues today’s LLMs remain correlation/association machines (Shannon-entropy world) and lack causal intervention/counterfactual simulation (Kolmogorov-complexity/short-program world), which humans use to act and to form new representations.
- Misra concludes “scale will not solve everything”: AGI requires (1) plasticity via robust continual learning without catastrophic forgetting and (2) causal-model-based simulation that can generate new manifolds/representations, illustrated by an “Einstein test” for discovering relativity from pre-1916 physics.
IDEAS WORTH REMEMBERING
5 ideas

An LLM can be usefully viewed as a compressed prompt→distribution lookup table.
Misra’s “giant matrix” abstraction frames each prompt as a row and the next-token probability distribution as columns, clarifying that training is learning a compressed approximation of an unimaginably large but sparse object.
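The scale of the “giant matrix” is easy to check. A minimal sketch, assuming the round numbers mentioned in the talk (an ~8,000-token context and ~50,000-token vocabulary), computes the order of magnitude of the row count in log space:

```python
import math

# Round numbers from the talk: ~8,000-token context, ~50,000-token vocabulary.
context_len = 8_000
vocab_size = 50_000

# Number of possible prompts (matrix rows) = vocab_size ** context_len.
# Work in log10 rather than computing the astronomically large integer itself.
log10_rows = context_len * math.log10(vocab_size)
print(f"~10^{log10_rows:.0f} possible prompts")
```

This comes out around 10^37,600 rows; common estimates put the particle count of the observable universe near 10^80, so the matrix can only ever be learned as a compressed approximation.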
In-context learning behaves like Bayesian posterior updating over task hypotheses.
As examples accumulate in the prompt (e.g., question→DSL pairs), probability mass shifts from generic English continuations toward the DSL tokens, matching the intuition of updating beliefs as new evidence arrives.
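The updating dynamic described above can be sketched with a toy two-hypothesis model. This is not Misra's experiment, just an illustration with made-up prior and likelihood numbers: each in-context example that looks like a DSL completion multiplies in its likelihood and renormalizes, shifting posterior mass toward the DSL task.

```python
# Two task hypotheses: "continue in plain English" vs "answer in the DSL".
# Prior: before any examples, plain English continuation is far more likely.
posterior = {"english": 0.99, "dsl": 0.01}

# Hypothetical likelihoods: probability each hypothesis assigns to observing
# one question→DSL example pair in the prompt.
likelihood = {"english": 0.05, "dsl": 0.9}

for n_examples in range(1, 4):
    # Bayes' rule: posterior ∝ prior × likelihood, renormalized over hypotheses.
    unnorm = {h: posterior[h] * likelihood[h] for h in posterior}
    z = sum(unnorm.values())
    posterior = {h: p / z for h, p in unnorm.items()}
    print(f"after {n_examples} example(s): P(dsl) = {posterior['dsl']:.3f}")
```

With these numbers, three examples are enough to flip the posterior from near-certain English to near-certain DSL, mirroring the "probability mass shifts toward the DSL tokens" behavior the summary describes.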
Transformers can implement Bayesian inference precisely under controlled conditions.
The “Bayesian wind tunnel” constructs tasks where memorization is combinatorially impossible and the correct posterior is computable; transformers match that posterior to ~10^-3 bits, suggesting the mechanism is architectural, not a coincidence of data.
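A match "to ~10^-3 bits" implies comparing the model's predictive distribution against the known posterior with an information-theoretic distance measured in bits. A hedged sketch of that criterion (the exact metric used in the experiments is not specified here; KL divergence in log base 2 is one natural choice, with illustrative numbers):

```python
import math

def kl_bits(p, q):
    """KL(p || q) in bits; p is the true posterior, q the model's distribution."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

true_posterior = [0.7, 0.2, 0.1]           # analytically known posterior
model_dist     = [0.699, 0.2005, 0.1005]   # a near-match, for illustration
print(f"divergence: {kl_bits(true_posterior, model_dist):.2e} bits")
```

A divergence below ~10^-3 bits means the model's distribution is, for practical purposes, indistinguishable from the exact Bayesian posterior.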
Architecture matters: not all sequence models are equally Bayesian-capable.
In Misra’s reported taxonomy, transformers succeed broadly, Mamba succeeds on many tasks, LSTMs only partially, and MLPs fail—implying attention-like mechanisms strongly support Bayesian-style updating.
LLMs are ‘frozen’ learners; humans are plastic learners.
LLMs update beliefs within a context window but do not retain that learning across fresh sessions without external memory or weight updates, whereas human synapses remain plastic and integrate new experience over a lifetime.
WORDS WORTH SAVING
5 quotes

Anthropic makes great products. Claude Code is fantastic, CoWork is fantastic, but they are grains of silicon doing matrix multiplication. They don't have consciousness. They don't have an inner monologue.
— Vishal Misra
If you look at all possible combinations of 8,000 tokens and a 50,000-token vocabulary, the number of rows in this matrix is more than the number of electrons across all galaxies, right?
— Vishal Misra
So this is an example of, in real time, the model updating its posterior probability. It was updating its knowledge that, okay, I've seen evidence, this is what I'm supposed to do.
— Vishal Misra
We trained these models, and we found that the transformer got the precise Bayesian posterior down to ten to the power minus three bits accuracy. It was matching the distribution perfectly.
— Vishal Misra
You can start answering some of these questions, and one of the misconceptions that exists today is that scale will solve everything. Scale will not solve everything.
— Vishal Misra
High quality AI-generated summary created from speaker-labeled transcript.