CHAPTERS
Why LLMs Aren’t Conscious—and a Concrete “Einstein Test” for AGI
Misra argues that today’s LLMs, despite powering impressive products, are fundamentally “grains of silicon doing matrix multiplication” rather than conscious agents. He proposes a stringent AGI benchmark: train a model only on pre-relativity physics and see whether it can derive relativity.
From Early GPT-3 Experiments to (Proto-)RAG at ESPN
Misra recounts getting early access to GPT-3 and using it to translate natural-language questions into a custom query format for a cricket statistics database. This required retrieving relevant worked examples and prompting the model to generalize in-context, an approach that later became a production system at ESPN.
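A minimal sketch of the retrieve-then-prompt pattern described here; the cricket example pairs, the word-overlap retrieval heuristic, and the query syntax are invented for illustration and are not ESPN's actual system.
```python
# Retrieve similar solved examples, then build a few-shot prompt for the LLM.
EXAMPLES = [
    ("How many runs did Kohli score in 2016?", "STATS(player='Kohli', metric='runs', year=2016)"),
    ("Who took the most wickets in the 2019 World Cup?", "TOP(metric='wickets', event='WC2019', n=1)"),
    ("What is Dhoni's strike rate in T20s?", "STATS(player='Dhoni', metric='strike_rate', format='T20')"),
]

def retrieve(question: str, k: int = 2):
    """Rank stored examples by naive word overlap with the new question."""
    q_words = set(question.lower().split())
    scored = sorted(
        EXAMPLES,
        key=lambda ex: len(q_words & set(ex[0].lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question: str) -> str:
    """Assemble a few-shot prompt so the model can infer the DSL in-context."""
    shots = "\n".join(f"Q: {q}\nA: {dsl}" for q, dsl in retrieve(question))
    return f"{shots}\nQ: {question}\nA:"

print(build_prompt("How many runs did Rohit score in 2019?"))
# The completed prompt would then be sent to GPT-3 (or any LLM), and the
# completion parsed as the query to run against the stats database.
```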
LLMs as a Giant Prompt→Next-Token Probability Matrix
To make transformer behavior interpretable, Misra introduces a mental model: a massive matrix in which each row corresponds to a possible prompt and maps to a probability distribution over the vocabulary for the next token. Because the true matrix is astronomically large and sparsely observed, models must learn a compressed, generalizing approximation of it rather than store it explicitly.
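A back-of-the-envelope check on why the explicit matrix cannot exist; the vocabulary size and context length below are representative values, not figures from the conversation.
```python
# Order-of-magnitude size of the "prompt -> next-token distribution" table.
vocab_size = 50_000      # representative vocabulary size (assumed)
context_length = 1_000   # representative prompt length in tokens (assumed)

# One row per possible prompt, each row a distribution over vocab_size next tokens.
num_rows = vocab_size ** context_length
print(f"rows in the explicit table: ~10^{len(str(num_rows)) - 1}")

# A model with on the order of 10^11 parameters can only be a compressed,
# generalizing approximation of this table, never the table itself.
```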
What In-Context Learning Is (and Why It Feels Like Magic)
They define in-context learning as the model’s ability to infer a new task from a few examples inside the prompt and then apply it to a new query immediately. Misra uses his cricket system as a vivid example: GPT-3 emits queries in a domain-specific language it had first encountered only moments earlier in the prompt.
In-Context Learning as Bayesian Updating: Priors, Evidence, Posteriors
Misra connects in-context learning to Bayesian inference: as the model sees evidence (examples), its next-token posterior shifts toward the task-specific tokens. He describes tracking token probabilities across successive examples, showing DSL tokens rising from near-zero to dominant probability.
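A small sketch of that Bayesian reading: an implicit posterior over "which task is this?" sharpens with each example, which is what pushes DSL tokens from near-zero to dominant. The two task hypotheses, priors, and likelihoods are assumed numbers chosen only for illustration.
```python
tasks = {
    "plain_english_answer": 0.98,   # prior: most text continues in ordinary English
    "emit_cricket_dsl":     0.02,   # prior: the DSL is vanishingly rare in pretraining data
}

# Assumed probability that one observed Q -> DSL example pair would be produced
# under each task hypothesis.
likelihood = {"plain_english_answer": 0.01, "emit_cricket_dsl": 0.9}

posterior = dict(tasks)
for step in range(1, 4):                      # three in-context examples
    unnorm = {t: posterior[t] * likelihood[t] for t in posterior}
    z = sum(unnorm.values())
    posterior = {t: p / z for t, p in unnorm.items()}
    print(f"after example {step}: P(emit_cricket_dsl) = {posterior['emit_cricket_dsl']:.4f}")
# The DSL hypothesis (and with it, DSL tokens) climbs from near-zero to dominant.
```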
Backlash to “Bayesian LLMs” and the Need for a Formal Proof
They discuss why calling LLM behavior “Bayesian” drew criticism—partly due to Bayesian vs. frequentist cultural divides and the claim that “anything can be described as Bayesian.” Misra explains that the first paper was compelling but empirical, prompting a more rigorous mathematical demonstration.
TokenProbe: Looking Inside Next-Token Probabilities and Entropy
Misra describes building TokenProbe, a tool that exposes next-token distributions and entropy for open-weight models, helping researchers and students see how prompts reshape predictions. The tool became part of their workflow and teaching, enabling deeper experimentation than closed interfaces allow.
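The core computation such a tool surfaces can be sketched with Hugging Face transformers and GPT-2; this is not TokenProbe's actual code, just the underlying idea of inspecting next-token probabilities and their entropy.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Q: How many runs did Kohli score in 2016?\nA:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # logits for the next token only
probs = torch.softmax(logits, dim=-1)

# Entropy of the next-token distribution, in bits.
entropy_bits = -(probs * torch.log2(probs + 1e-12)).sum().item()
print(f"next-token entropy: {entropy_bits:.2f} bits")

# Top candidate next tokens and their probabilities.
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r:>12}  {p.item():.3f}")
```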
The “Bayesian Wind Tunnel”: Controlled Tasks to Test Architectures
To prove Bayesian behavior rigorously, Misra and colleagues construct “wind tunnel” tasks: environments too combinatorial to memorize but simple enough that the Bayesian posterior can be computed exactly. They train small models and compare outputs to the analytic posterior, isolating architectural capability rather than dataset artifacts.
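A toy version of the idea: a task family whose exact Bayesian posterior has a closed form, so a small trained model's next-token distribution can be checked against it. The biased-coin task below is an illustrative stand-in, not the paper's actual wind-tunnel environments.
```python
def exact_posterior_predictive(sequence: str, biases=(0.2, 0.5, 0.8)):
    """P(next symbol = '1' | sequence) under a uniform prior over coin biases."""
    prior = 1.0 / len(biases)
    ones, zeros = sequence.count("1"), sequence.count("0")
    joint = [prior * (b ** ones) * ((1 - b) ** zeros) for b in biases]
    z = sum(joint)
    posterior = [j / z for j in joint]
    return sum(w * b for w, b in zip(posterior, biases))

print(exact_posterior_predictive("1101"))   # analytic target, roughly 0.67
# A small transformer trained on sequences sampled from this family would be
# queried with the same prefix, and its P(next token = '1') compared to this value.
```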
Results: Transformers Match the Bayesian Posterior Extremely Closely
Misra reports that transformers recover the Bayesian posterior with very high precision (deviations down to ~1e-3 bits), while other architectures show only partial success. This yields a taxonomy: transformers succeed broadly, Mamba mostly, LSTMs partially, and MLPs fail, suggesting that Bayesian updating emerges from specific architectural mechanisms.
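The "~1e-3 bits" figure is naturally read as a divergence between the model's distribution and the analytic posterior, measured in bits; a hypothetical sketch using KL divergence and made-up numbers:
```python
import math

def kl_bits(p, q):
    """KL(p || q) in bits for two discrete distributions over the same support."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

analytic_posterior = [0.67, 0.33]   # e.g. P(next='1'), P(next='0') from the wind tunnel
model_prediction   = [0.66, 0.34]   # hypothetical trained-model output

print(f"{kl_bits(analytic_posterior, model_prediction):.2e} bits")
# Values around 1e-3 bits or smaller are what "matching the posterior
# extremely closely" amounts to on a metric like this.
```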
Humans vs. LLMs: Plasticity, Objectives, and Why Models “Forget”
They contrast human cognition with LLM inference: humans continually update internal circuitry (plastic synapses) and retain learning across life, while LLM weights are frozen at inference. Misra also argues “agentic” behaviors stem from training data and next-token prediction objectives, not inherent drives like survival.
From Correlation to Causation: Simulation, Intervention, Counterfactuals
Misra claims the key missing ingredient for AGI is causal modeling: the ability to simulate outcomes and reason about interventions and counterfactuals. He uses the “dodging a thrown pen” example to illustrate that humans often act via internal simulation rather than explicit probabilistic calculation, aligning with Judea Pearl’s causal hierarchy.
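A minimal structural-causal-model sketch of the seeing-versus-doing distinction in Pearl's hierarchy; the rain/sprinkler model and its probabilities are a standard textbook illustration, not taken from the conversation.
```python
import random

def sample(do_sprinkler=None):
    """One draw from a toy rain -> sprinkler -> wet-grass world (assumed probabilities)."""
    rain = random.random() < 0.3
    # Normally the sprinkler reacts to rain; an intervention overrides that mechanism.
    if do_sprinkler is None:
        sprinkler = random.random() < (0.1 if rain else 0.6)
    else:
        sprinkler = do_sprinkler
    wet = rain or sprinkler
    return rain, sprinkler, wet

random.seed(0)
observational = [sample() for _ in range(100_000)]
# Rung 1 (association): what do we infer about rain from *seeing* the sprinkler on?
seen_on = [rain for rain, sprinkler, _ in observational if sprinkler]
print("P(rain | see sprinkler on):", sum(seen_on) / len(seen_on))
# Rung 2 (intervention): what happens to rain if we *force* the sprinkler on?
forced_on = [sample(do_sprinkler=True)[0] for _ in range(100_000)]
print("P(rain | do(sprinkler on)):", sum(forced_on) / len(forced_on))
# Correlation conflates the two; simulating the intervention keeps them distinct.
```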
Shannon Entropy vs. Kolmogorov Complexity: Why Scale Won’t Be Enough
Misra reframes the limitation: LLMs operate in a Shannon-entropy world of correlation, while breakthroughs like physics theories compress reality into short programs (Kolmogorov complexity). Using π as an example (statistically high-entropy digits, yet a very short generating program), he argues intelligence needs the ability to discover compact generative programs, not just fit distributions.
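The π example made concrete: a tiny program (low Kolmogorov complexity) emits a digit stream whose per-digit Shannon entropy sits near the maximum, which illustrates the gap between fitting distributions and finding generative programs. The digit generator below is Gibbons' well-known unbounded spigot algorithm.
```python
import math
from collections import Counter

def pi_digits(n):
    """First n decimal digits of pi via Gibbons' unbounded spigot algorithm."""
    q, r, t, k, m, x = 1, 0, 1, 1, 3, 3
    out = []
    while len(out) < n:
        if 4 * q + r - t < m * t:
            out.append(m)
            q, r, m = 10 * q, 10 * (r - m * t), (10 * (3 * q + r)) // t - 10 * m
        else:
            q, r, t, k, m, x = (q * k, (2 * q + r) * x, t * x, k + 1,
                                (q * (7 * k + 2) + r * x) // (t * x), x + 2)
    return out

digits = pi_digits(1000)
counts = Counter(digits)
entropy = -sum((c / len(digits)) * math.log2(c / len(digits)) for c in counts.values())
print(f"per-digit Shannon entropy: {entropy:.3f} bits (max {math.log2(10):.3f})")
# A distribution-fitter sees near-random noise; a program-finder sees pi.
```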
Manifolds, New Representations, and the Relativity Example
They discuss how humans build a workable manifold (representation) of the world, while LLMs learn and navigate within the manifold present in their training data. Einstein succeeded by creating a new manifold for spacetime; Misra argues current models tend to treat paradigm-shifting evidence as anomalies rather than triggers for new representations.
Continual Learning, Knuth’s Example, and What ‘Next’ Looks Like
Misra emphasizes two prerequisites for AGI: (1) true plasticity via continual learning without catastrophic forgetting, and (2) causal modeling/simulation that supports new representations. They discuss Donald Knuth’s recent LLM-assisted work as ‘hacked plasticity’ through memory/context updates—useful within known manifolds, but still relying on the human to form the final compact theory.