CHAPTERS
Why LLMs Aren’t Conscious—and a Concrete “Einstein Test” for AGI
Misra argues that today’s LLMs, despite impressive products, are fundamentally “grains of silicon doing matrix multiplication” rather than conscious agents. He proposes a stringent AGI benchmark: train a model only on pre-relativity physics and see whether it can derive relativity.
- •LLMs lack consciousness/inner monologue; output reflects training objective and data
- •A practical AGI bar: can it rediscover relativity from pre-1916 knowledge?
- •Sets up the episode’s core thesis: scale alone won’t bridge the gap to AGI
From Early GPT-3 Experiments to (Proto-)RAG at ESPN
Misra recounts getting early access to GPT-3 and using it to translate natural language questions into a custom query format for a cricket statistics database. This required retrieving relevant examples and prompting the model to generalize in-context, which later became a production system at ESPN.
- •Built a natural-language interface to a cricket stats database using few-shot prompting
- •Used semantic search to retrieve closest examples (a RAG-like pattern)
- •Designed and deployed a system despite no access to GPT-3 internals
- •Motivation shifted from “it works” to “why does it work?”
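The retrieve-then-prompt pattern in this chapter can be sketched roughly as follows. The example bank, the query syntax, and the bag-of-words cosine similarity are all illustrative assumptions; the summary does not say which similarity measure or query format the ESPN system used.

```python
import math
from collections import Counter

# Hypothetical example bank: (question, query) pairs in a made-up query format.
EXAMPLES = [
    ("most runs by a batsman in test matches", "stat=runs; format=test; sort=desc"),
    ("most wickets by a bowler in one day internationals", "stat=wickets; format=odi; sort=desc"),
    ("highest batting average in test matches", "stat=average; format=test; sort=desc"),
]

def bag_of_words(text):
    """Crude stand-in for a semantic embedding: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, k=2):
    """Return the k stored examples whose questions are closest to `question`."""
    q = bag_of_words(question)
    ranked = sorted(EXAMPLES, key=lambda ex: cosine(q, bag_of_words(ex[0])), reverse=True)
    return ranked[:k]

def build_prompt(question, k=2):
    """Assemble a few-shot prompt: retrieved examples first, then the new question."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in retrieve(question, k)]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)
```

Calling `build_prompt("which bowler took the most wickets in one day internationals")` places the wickets example first and leaves the final `A:` open for the model to complete.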
LLMs as a Giant Prompt→Next-Token Probability Matrix
To make transformer behavior interpretable, Misra introduces a mental model: a massive matrix in which each row corresponds to a prompt and holds a probability distribution over the vocabulary for the next token. Because the full matrix is astronomically large but sparse, models must learn a compressed approximation of it.
- •Rows represent prompts; columns represent the ~50k vocabulary tokens, and each entry is a next-token probability
- •Example: “protein” branches into very different continuations (“shake” vs “synthesis”)
- •The full matrix is too large to represent exactly; sparsity makes compression possible
- •LLMs approximate the ‘true’ distribution for a prompt via learned compression
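As a rough illustration of this mental model, the sketch below stores a few rows of the matrix as a sparse table and counts how many rows a dense version would need. The probabilities and continuations are invented for illustration, and `dense_row_count` is a hypothetical helper, not something from the episode.

```python
# Toy "prompt -> next-token distribution" table. Each key is a row (a prompt);
# each inner dict holds only the non-zero columns. All numbers are made up.
SPARSE_ROWS = {
    "protein": {"shake": 0.40, "synthesis": 0.35, "bar": 0.25},
    "protein shake": {"recipe": 0.6, "powder": 0.4},
    "protein synthesis": {"occurs": 0.5, "requires": 0.3, "in": 0.2},
}

def next_token_distribution(prompt):
    """Look up one row; unseen prompts return an empty row in this toy table."""
    return SPARSE_ROWS.get(prompt, {})

def dense_row_count(vocab_size, prompt_length):
    """Number of distinct prompts of exactly `prompt_length` tokens."""
    return vocab_size ** prompt_length

# With ~50k tokens, even 10-token prompts give about 9.8e46 possible rows,
# which is why the matrix can only ever be approximated, never stored.
rows_for_ten_tokens = dense_row_count(50_000, 10)
```

Most of those rows correspond to token sequences that never occur, which is the sparsity that makes a compressed approximation feasible.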
What In-Context Learning Is (and Why It Feels Like Magic)
They define in-context learning as the model’s ability to infer a new task from a few examples inside the prompt, then apply it to a new query immediately. Misra uses his cricket system as a vivid example: GPT-3 outputs a domain-specific language (DSL) that it had never seen until moments earlier.
- •In-context learning: learn task behavior from examples within the prompt window
- •Cricket DSL: model maps English questions into a newly invented formal language
- •Examples must be chosen carefully due to context window constraints (e.g., 2k tokens)
- •Demonstrates fast “learning” without weight updates or internal access
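The context-window constraint mentioned above can be sketched as a simple packing step: given examples already ranked by relevance, add them until a token budget is reached. The whitespace-based `count_tokens` is a deliberately crude assumption (real tokenizers count differently), and `pack_examples` is a hypothetical helper.

```python
def count_tokens(text):
    """Crude proxy: one token per whitespace-separated word (an assumption)."""
    return len(text.split())

def pack_examples(examples, question, budget=2000):
    """Greedily fit ranked (question, answer) examples into a token budget.

    `examples` is assumed to be sorted most-relevant first. The final
    question is always included; examples that would overflow are skipped.
    """
    query_block = f"Q: {question}\nA:"
    used = count_tokens(query_block)
    chosen = []
    for q, a in examples:
        block = f"Q: {q}\nA: {a}"
        cost = count_tokens(block)
        if used + cost > budget:
            continue  # too long; a shorter, lower-ranked example may still fit
        chosen.append(block)
        used += cost
    prompt = "\n\n".join(chosen + [query_block])
    return prompt, used
```

No weights change anywhere in this process; everything the model "learns" about the task arrives through the packed prompt.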
In-Context Learning as Bayesian Updating: Priors, Evidence, Posteriors
Misra connects in-context learning to Bayesian inference: as the model sees evidence (examples), its next-token posterior shifts toward the task-specific tokens. He describes tracking token probabilities across examples, showing DSL tokens rise from near-zero to dominant likelihood.
- •Bayesian framing: start with prior, update beliefs as evidence arrives in prompt
- •Empirical observation: DSL token probabilities increase with each example
- •Posterior becomes sharply peaked on the correct continuation after enough evidence
- •Bridges the matrix view (rows/distributions) with ‘belief updating’ behavior
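The belief-updating story above can be made concrete with a two-hypothesis toy: either the prompt continues in plain English or it continues in the cricket DSL. The prior and the per-example likelihoods below are invented numbers chosen only to show the shape of the curve, not measurements from Misra's experiments.

```python
def bayes_update(prior, likelihood):
    """One step of Bayes' rule over a finite set of hypotheses."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

def posterior_trajectory(prior, likelihood, n_examples):
    """Beliefs after 0, 1, ..., n_examples identical pieces of evidence."""
    beliefs = [dict(prior)]
    for _ in range(n_examples):
        beliefs.append(bayes_update(beliefs[-1], likelihood))
    return beliefs

# Assumed prior: DSL output is very unlikely before any examples are seen.
prior = {"plain_english": 0.99, "cricket_dsl": 0.01}
# Assumed likelihood of observing one few-shot DSL example under each hypothesis.
likelihood_per_example = {"plain_english": 0.05, "cricket_dsl": 0.90}

trajectory = posterior_trajectory(prior, likelihood_per_example, 4)
dsl_belief = [round(b["cricket_dsl"], 3) for b in trajectory]
```

With these numbers the odds in favor of the DSL multiply by 18 with each example, so the belief climbs from 1% to above 99% within four examples, mirroring the "near-zero to dominant" pattern described in the chapter.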
Backlash to “Bayesian LLMs” and the Need for a Formal Proof
They discuss why calling LLM behavior “Bayesian” drew criticism—partly due to Bayesian vs. frequentist cultural divides and the claim that “anything can be described as Bayesian.” Misra explains that the first paper was compelling but empirical, prompting a more rigorous mathematical demonstration.
- •Skeptic response: ‘Bayesian’ is too broad / politically charged in ML debates
- •First paper showed Bayesian-like behavior empirically, not as a formal guarantee
- •Need: a controlled setup where the true Bayesian posterior is analytically known
TokenProbe: Looking Inside Next-Token Probabilities and Entropy
Misra describes building TokenProbe, a tool that exposes next-token distributions and entropy for open-weight models, helping researchers and students see how prompts reshape predictions. The tool became part of their workflow and teaching, enabling deeper experimentation than closed interfaces.
- •TokenProbe visualizes token probabilities and entropy during prompting
- •Works with open-source models; used for education and DSL assignments
- •Created in response to reduced visibility in closed model interfaces
- •Provided the infrastructure for deeper experimentation and iteration
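The two quantities this chapter says TokenProbe exposes, next-token probabilities and their entropy, can be computed from raw logits as sketched below. This is not TokenProbe's code (the summary does not describe its internals); the logits are invented and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Turn a {token: logit} mapping into a probability distribution."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def entropy_bits(probs):
    """Shannon entropy of a distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

# Made-up logits: an ambiguous prompt versus one with a clear continuation.
ambiguous = softmax({"shake": 1.0, "synthesis": 1.0, "bar": 1.0, "powder": 1.0})
confident = softmax({"shake": 8.0, "synthesis": 1.0, "bar": 1.0, "powder": 1.0})
```

A flat distribution over four tokens has 2 bits of entropy, while a sharply peaked one falls close to zero, which is the kind of shift one would look for as examples are added to a prompt.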
The “Bayesian Wind Tunnel”: Controlled Tasks to Test Architectures
To prove Bayesian behavior rigorously, Misra and colleagues construct “wind tunnel” tasks: environments too combinatorial to memorize but simple enough that the Bayesian posterior can be computed exactly. They train small models and compare outputs to the analytic posterior, isolating architectural capability rather than dataset artifacts.
- •Wind-tunnel analogy: test systems in an isolated, controlled environment
- •Tasks designed so memorization is impossible given parameter constraints
- •True Bayesian posterior is analytically computable for these tasks
- •Architectures compared: transformers, Mamba, LSTMs, MLPs
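To show what "analytically computable posterior" means, here is a stand-in task far simpler than anything a real wind tunnel would use: a coin whose bias is one of a few known values. The summary does not specify the actual tasks, so the setup and the `posterior_predictive` helper are assumptions for illustration only.

```python
def posterior_predictive(biases, prior, flips):
    """Exact P(next flip = 1 | flips) when the coin's bias is one of `biases`.

    `prior[i]` is the prior probability that the bias equals `biases[i]`;
    `flips` is a sequence of observed 0/1 outcomes.
    """
    weights = list(prior)
    for x in flips:
        # Multiply each hypothesis by the likelihood of this flip, then renormalize.
        weights = [w * (b if x == 1 else 1.0 - b) for w, b in zip(weights, biases)]
        total = sum(weights)
        weights = [w / total for w in weights]
    # Average the hypotheses' predictions under the current posterior.
    return sum(w * b for w, b in zip(weights, biases))

# Two hypotheses, equally likely a priori.
p_before = posterior_predictive([0.2, 0.8], [0.5, 0.5], [])    # 0.5
p_after = posterior_predictive([0.2, 0.8], [0.5, 0.5], [1])    # 0.68
```

In a wind-tunnel experiment, a trained model's predicted probability for the next symbol would be compared against this kind of exact value at every position in the sequence.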
Results: Transformers Match the Bayesian Posterior Extremely Closely
Misra reports that transformers recover the Bayesian posterior with very high precision (to within ~10^-3 bits), while other architectures succeed only partially. This yields a taxonomy: transformers succeed across the tasks, Mamba on most, LSTMs on some, and MLPs fail, suggesting that Bayesian updating emerges from architectural mechanisms.
- •Transformers match the Bayesian posterior to within ~10^-3 bits
- •Mamba performs reasonably well; LSTMs succeed only on certain Bayesian sub-tasks
- •MLPs fail on these Bayesian updating tasks
- •Conclusion: mechanism/architecture matters; data decides which tasks are learned
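One plausible way to express a gap "in bits" between a model's output and the exact posterior is the KL divergence with base-2 logarithms. The summary does not state which metric was used, so treat the choice of KL as an assumption; the distributions below are invented.

```python
import math

def kl_bits(p_true, q_model):
    """KL(p_true || q_model) in bits. Assumes q_model[k] > 0 wherever p_true[k] > 0."""
    return sum(
        p * math.log2(p / q_model[k])
        for k, p in p_true.items()
        if p > 0
    )

# Exact posterior predictive versus two hypothetical model outputs.
exact = {"heads": 0.68, "tails": 0.32}
close_model = {"heads": 0.679, "tails": 0.321}
poor_model = {"heads": 0.50, "tails": 0.50}

gap_close = kl_bits(exact, close_model)   # a few millionths of a bit
gap_poor = kl_bits(exact, poor_model)     # roughly a tenth of a bit
```

On this scale, a gap near 10^-3 bits means the model's distribution is almost indistinguishable from the exact posterior.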
Humans vs. LLMs: Plasticity, Objectives, and Why Models “Forget”
They contrast human cognition with LLM inference: humans continually update their internal circuitry (plastic synapses) and retain what they learn throughout life, while LLM weights are frozen at inference. Misra also argues that “agentic” behaviors stem from training data and next-token prediction objectives, not inherent drives like survival.
- •Human brains remain plastic; LLMs don’t persist learning across sessions
- •In-context learning updates the posterior temporarily, then resets with new context
- •Different objectives: humans optimize survival/reproduction; LLMs optimize next-token accuracy
- •Deception/self-preservation narratives reflect data patterns, not intrinsic motives
From Correlation to Causation: Simulation, Intervention, Counterfactuals
Misra claims the key missing ingredient for AGI is causal modeling: the ability to simulate outcomes and reason about interventions and counterfactuals. He uses the “dodging a thrown pen” example to illustrate that humans often act via internal simulation rather than explicit probabilistic calculation, aligning with Judea Pearl’s causal hierarchy.
- •Deep learning excels at association (correlation), not intervention/counterfactuals
- •Humans often simulate dynamics (e.g., dodging) rather than compute explicit probabilities
- •Causal models enable interventions and counterfactual reasoning (Pearl’s hierarchy)
- •AGI requires moving beyond correlation-based next-token prediction
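The difference between association and intervention in Pearl's hierarchy can be shown with a three-variable toy model in which a hidden cause Z drives both X and Y, and X has no effect on Y at all. The variables and probabilities are invented for illustration.

```python
# Structural assumptions (all made up): Z -> X and Z -> Y, with no X -> Y edge.
P_Z = {0: 0.5, 1: 0.5}

def p_x_given_z(x, z):
    """X copies Z 90% of the time."""
    return 0.9 if x == z else 0.1

def p_y_given_z(y, z):
    """Y copies Z 80% of the time and ignores X entirely."""
    return 0.8 if y == z else 0.2

def observational(y, x):
    """P(Y=y | X=x): seeing X=x is evidence about Z, which in turn predicts Y."""
    joint = sum(P_Z[z] * p_x_given_z(x, z) * p_y_given_z(y, z) for z in P_Z)
    evidence = sum(P_Z[z] * p_x_given_z(x, z) for z in P_Z)
    return joint / evidence

def interventional(y, x):
    """P(Y=y | do(X=x)): forcing X cuts the Z -> X link, so Z keeps its prior.

    `x` is unused on purpose: in this model, setting X cannot change Y.
    """
    return sum(P_Z[z] * p_y_given_z(y, z) for z in P_Z)

seen = observational(1, 1)     # 0.74: X and Y are strongly correlated
forced = interventional(1, 1)  # 0.50: but setting X does nothing to Y
```

A purely associational learner would report the 0.74 figure; answering the intervention question correctly requires knowing the causal structure, not just the joint statistics.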
Shannon Entropy vs. Kolmogorov Complexity: Why Scale Won’t Be Enough
Misra reframes the limitation: LLMs operate in a Shannon-entropy world of correlation, while breakthroughs like physics theories compress reality into short programs (low Kolmogorov complexity). Using π as an example, whose digits look high-entropy yet come from a short program, he argues that intelligence needs the ability to discover compact generative programs, not just fit distributions.
- •π digits: unpredictable (high Shannon entropy) yet generated by a short program (low Kolmogorov complexity)
- •LLMs fit correlations; they don’t reliably discover shortest-program explanations
- •Scientific progress often means finding a new compact representation/program
- •Implication: more compute/tokens won’t automatically yield causal/program discovery
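The π example can be checked directly: a program of a few lines emits the digits indefinitely, yet the digit frequencies look almost uniform. The generator below is a widely circulated streaming spigot (usually attributed to Gibbons), and the empirical digit-frequency entropy is only a rough proxy for "unpredictability", so treat this as an illustration rather than a formal statement about either complexity measure.

```python
import math
from collections import Counter
from itertools import islice

def pi_digits():
    """Yield the decimal digits of pi one at a time (3, 1, 4, 1, 5, ...)."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n  # the next digit is now certain
            q, r, t, k, n, l = (
                10 * q, 10 * (r - n * t), t, k,
                (10 * (3 * q + r)) // t - 10 * n, l,
            )
        else:
            q, r, t, k, n, l = (
                q * k, (2 * q + r) * l, t * l, k + 1,
                (q * (7 * k + 2) + r * l) // (t * l), l + 2,
            )

def empirical_entropy_bits(symbols):
    """Entropy of the observed symbol frequencies, in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

digits = list(islice(pi_digits(), 500))
digit_entropy = empirical_entropy_bits(digits)  # close to log2(10), about 3.32
```

The whole generator fits in about a dozen lines, which is the sense in which the sequence has low Kolmogorov complexity even though its digit statistics are close to those of a uniform random source.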
Manifolds, New Representations, and the Relativity Example
They discuss how humans build a workable manifold (representation) of the world, while LLMs learn and navigate within the manifold present in their training data. Einstein succeeded by creating a new manifold for spacetime; Misra argues current models tend to treat paradigm-shifting evidence as anomalies rather than triggers for new representations.
- •LLMs learn a manifold from data and perform Bayesian inference within it
- •They struggle to generate fundamentally new manifolds/representations
- •Relativity: many anomalies existed, but resolving them required a new spacetime formulation
- •Training-data ‘gravity’ biases models toward the dominant historical view
Continual Learning, Knuth’s Example, and What ‘Next’ Looks Like
Misra emphasizes two prerequisites for AGI: (1) true plasticity via continual learning without catastrophic forgetting, and (2) causal modeling/simulation that supports new representations. They discuss Donald Knuth’s recent LLM-assisted work as ‘hacked plasticity’ through memory/context updates—useful within known manifolds, but still relying on the human to form the final compact theory.
- •Scale won’t solve everything; continual learning risks catastrophic forgetting
- •AGI needs plasticity (persistent learning) plus causality (simulation/interventions)
- •Knuth example: context/memory updates approximate plasticity without changing weights
- •LLMs help explore (Shannon); humans still synthesize the compact proof (Kolmogorov)
- •Future direction: new mechanisms/architectures where LLMs are part, not all, of the solution
