Dwarkesh Podcast
Richard Sutton on Dwarkesh Patel: Why LLMs Lack a Goal
How temporal-difference learning gives AI a ground truth that LLMs lack: Sutton argues that without reward signals, there is no right or wrong action.
CHAPTERS
- 0:00 – 4:06
RL vs. LLMs: predicting text isn’t understanding the world
Sutton contrasts reinforcement learning (RL) with large language models (LLMs): RL is about learning how the world works through interaction, while LLMs primarily mimic what people say. He argues that next-token prediction is not the same as having a world model that can anticipate real consequences.
- •RL’s core aim: understand the world via action → consequence → learning
- •LLMs imitate human-produced text rather than discover what to do
- •A “world model” should predict what will happen, not what a person would say
- •Turing’s framing: intelligence as learning from experience
- •Bandwagons can cause the field to lose sight of foundational problems
- 4:06 – 13:51
Why Sutton thinks LLMs lack goals, ground truth, and a real “prior” for action
Dwarkesh proposes LLMs as a useful prior for an “era of experience,” but Sutton rejects this framing. He argues LLM training lacks an inherent notion of correctness because there’s no goal grounded in the external world, making continual improvement during real interaction ill-defined without RL-style feedback.
- •Without goals, there’s no principled definition of “right” vs “wrong” output
- •A prior only makes sense relative to ground truth; Sutton claims LLMs don’t have it
- •LLMs can respond to questions about predictions, but don’t learn from surprise online
- •Next-token prediction is an action-selection scheme, not a prediction of the world’s response
- •Sutton’s definition: intelligence is the ability to achieve goals
- 13:51 – 22:47
Do humans learn by imitation? Infants, animals, and cultural transmission
Dwarkesh argues humans learn substantially via imitation and instruction; Sutton pushes back, emphasizing trial-and-error and prediction as the fundamental learning mechanisms in animals. They partially reconcile by treating cultural imitation as a thin layer atop deeper experiential learning.
- •Sutton: supervised learning/explicit targets are not a basic animal learning process
- •Infants explore (move, look, vocalize) without labeled “correct” actions
- •Schooling is an exception, not a template for general theories of learning
- •Cultural evolution can transmit complex skills, but rests on more basic learning
- •Debate about what’s essential to intelligence: human uniqueness vs animal commonality
- 22:47 – 24:00
The “Era of Experience” needs realistic training environments (and why they’re hard)
Dwarkesh underscores a practical bottleneck for experience-based AI: building rich, messy environments that reflect real-world dynamics. The segment highlights that robust RL environments require domain expertise and careful simulation details, not just simple tests.
- •Real-world RL requires environments with changing state and realistic constraints
- •Subject-matter expertise is crucial to encode workflows and edge cases
- •Small simulation details can separate demos from deployable agents
- •Example: online shopping environment needs dynamic catalogs and stale-data behavior
- •Motivation: experience-driven learning depends on high-quality interaction loops
- 24:00 – 28:18
The experiential paradigm: intelligence as an ongoing stream of sensation, action, and reward
Sutton lays out his preferred framework: life is a continuous stream of observations, actions, and rewards, and intelligence is the ability to improve that stream by changing actions. Crucially, knowledge becomes testable statements about what follows what, enabling continual learning.
- •Experience stream is the foundation: sensation → action → reward → repeat
- •Knowledge is about the stream (what happens if I do X?) and can be verified
- •Continual learning is central because predictions can be compared to reality
- •Sutton resists the “future agent” framing: this is already the RL paradigm
- •Rewards can be task-specific plus intrinsic motivation to improve understanding
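The sensation, action, reward loop described in this chapter can be sketched in a few lines of Python. This is only an illustrative toy, not anything presented in the episode: the two-action world, its payoff probabilities, and the names `run_stream`, `payoff_prob`, and `prediction` are all assumptions made for the example. The agent holds a testable prediction ("what reward follows action X?"), compares it with what actually happens, and learns from the difference.

```python
import random


def run_stream(steps=2000, step_size=0.1, epsilon=0.1, seed=0):
    """Toy experience stream: act, observe reward, correct predictions."""
    rng = random.Random(seed)
    # Hypothetical world (an assumption): action 1 pays off more often.
    payoff_prob = {0: 0.2, 1: 0.8}
    # The agent's knowledge: predicted reward for each action.
    prediction = {0: 0.0, 1: 0.0}
    total_reward = 0.0
    for _ in range(steps):
        # Act: usually take the action predicted to be best, sometimes explore.
        if rng.random() < epsilon:
            action = rng.choice([0, 1])
        else:
            action = max(prediction, key=prediction.get)
        # The world responds with a reward.
        reward = 1.0 if rng.random() < payoff_prob[action] else 0.0
        # Compare the prediction with reality and learn from the surprise.
        surprise = reward - prediction[action]
        prediction[action] += step_size * surprise
        total_reward += reward
    return prediction, total_reward


if __name__ == "__main__":
    learned, total = run_stream()
    print("learned predictions:", learned)
    print("total reward:", total)
```

Because every prediction is checked against the next piece of experience, learning never has to stop: the same update keeps running for as long as the stream does.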
- 28:18 – 30:10
Sparse long-horizon goals: how TD learning creates useful intermediate signals
Dwarkesh asks how an agent can pursue goals with very delayed rewards (e.g., a decade-long startup payoff). Sutton answers with temporal-difference learning: value predictions provide dense learning signals as beliefs about long-term success rise and fall during progress.
- •TD learning supports long-term optimization by learning a value function
- •Intermediate progress updates value estimates, reinforcing helpful steps
- •Chess example: taking a piece matters because it shifts win-probability estimates
- •Sparse reward problems require learned predictions to guide behavior
- •The concern that reward alone carries too little information motivates learning from more than reward
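A minimal sketch of the idea, assuming a made-up "startup" with five stages where each stage succeeds with probability 0.8 and the only reward is a payoff of 1 at the very end. The function name `td0_success_values` and all parameters are hypothetical. Tabular TD(0) updates each stage's value toward the value of the stage that follows, so the learned values behave like running estimates of the chance of eventual success, and every step of progress produces a learning signal even though reward arrives only once.

```python
import random


def td0_success_values(num_stages=5, p_advance=0.8, episodes=20000,
                       step_size=0.01, seed=0):
    """Tabular TD(0) on a chain with a single reward at the end.

    value[s] estimates the probability of eventual success from stage s.
    """
    rng = random.Random(seed)
    value = [0.0] * num_stages
    for _ in range(episodes):
        stage = 0
        while True:
            advanced = rng.random() < p_advance
            if not advanced:
                # The venture fails: episode ends with no reward.
                reward, next_value, done = 0.0, 0.0, True
            elif stage + 1 == num_stages:
                # Final stage cleared: the only reward in the episode.
                reward, next_value, done = 1.0, 0.0, True
            else:
                # Progress, but no reward yet: bootstrap from the next stage.
                reward, next_value, done = 0.0, value[stage + 1], False
            # TD error: how much the outlook changed over this one step.
            td_error = reward + next_value - value[stage]
            value[stage] += step_size * td_error
            if done:
                break
            stage += 1
    return value


if __name__ == "__main__":
    for stage, v in enumerate(td0_success_values()):
        print(f"stage {stage}: estimated chance of success {v:.2f}")
```

In this toy chain the true value of stage s is 0.8 raised to the number of stages remaining, so the estimates should rise as the agent gets closer to the payoff, much as capturing a piece in chess raises the estimated probability of winning.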
- 30:10 – 35:21
Beyond reward: the four-part agent (policy, value, state representation, world model)
Sutton explains the “common model” of an agent, emphasizing that reward is not the only learning signal: sensory data and state transitions also carry rich information. He distinguishes a true transition model, or world model, which predicts the consequences of actions, from the loose way “model” is often used to mean “network.”
- •Policy: selects actions; Value function: tracks how well things are going
- •Perception/state representation: constructing ‘where you are now’
- •Transition model/world model: predicts consequences of actions from experience
- •Most learning comes from observations and dynamics, not reward alone
- •Continual learning stores knowledge in weights, not just a context window
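The four components can be laid out as a small tabular agent. This is a sketch under heavy simplifying assumptions, not Sutton's implementation: the class `FourPartAgent`, the five-cell corridor world, and the helpers `corridor_step` and `train` are all invented for illustration, the state representation is trivially the raw observation, and the world is deterministic. Note that the transition model is updated on every step from what was observed, whether or not any reward arrived.

```python
import random
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FourPartAgent:
    actions: list
    step_size: float = 0.1
    discount: float = 0.9
    epsilon: float = 0.1
    rng: random.Random = field(default_factory=lambda: random.Random(0))
    # Value function: how well things are going from each state.
    value: dict = field(default_factory=lambda: defaultdict(float))
    # Transition model: (state, action) -> (reward, next_state).
    model: dict = field(default_factory=dict)

    def perceive(self, observation):
        """State representation (here simply the observation itself)."""
        return observation

    def act(self, state):
        """Policy: one-step lookahead through the learned model."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)

        def predicted_return(action):
            if (state, action) not in self.model:
                return 0.0  # nothing known yet about this action
            reward, next_state = self.model[(state, action)]
            return reward + self.discount * self.value[next_state]

        return max(self.actions, key=predicted_return)

    def learn(self, state, action, reward, next_state):
        # Model learning: record what followed this action.
        self.model[(state, action)] = (reward, next_state)
        # Value learning: a TD(0) update toward the one-step target.
        td_error = (reward + self.discount * self.value[next_state]
                    - self.value[state])
        self.value[state] += self.step_size * td_error


def corridor_step(position, action, goal=4):
    """Hypothetical world: cells 0..4, reward 1.0 on reaching the goal."""
    next_position = min(max(position + action, 0), goal)
    reward = 1.0 if next_position == goal else 0.0
    return reward, next_position


def train(steps=5000, goal=4):
    agent = FourPartAgent(actions=[1, -1])
    position = 0
    for _ in range(steps):
        state = agent.perceive(position)
        action = agent.act(state)
        reward, next_position = corridor_step(position, action, goal)
        agent.learn(state, action, reward, agent.perceive(next_position))
        # Start over from the left end after reaching the goal.
        position = 0 if next_position == goal else next_position
    return agent


if __name__ == "__main__":
    trained = train()
    print("model for (2, +1):", trained.model[(2, 1)])
    print("values:", {s: round(trained.value[s], 2) for s in range(4)})
```

Everything the agent learns here ends up in its tables (the analogue of weights) and persists across steps, rather than living in a temporary context.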
- 35:21 – 42:14
Why current deep architectures generalize poorly: transfer between states and OOD brittleness
Dwarkesh and Sutton discuss why many RL systems are trained per task and struggle to transfer. Sutton argues the missing ingredient is automated methods that produce good generalization and transfer. Today, when transfer does happen, it is often because researchers hand-engineered the representations, and deep networks can still show catastrophic interference.
- •General agent view: different ‘tasks’ are often just different states in one world
- •Historically weak transfer: we lack automated techniques that promote good generalization
- •Sutton: gradient descent solves training problems but doesn’t inherently choose “good” generalization
- •Catastrophic interference illustrates poor generalization across prior skills
- •LLMs are hard to study scientifically due to uncontrolled, unknown training data
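Interference between tasks can be seen even in a two-weight linear model, which is far simpler than the deep networks discussed in the episode. The example below is an assumed toy setup (the tasks, the function names `sgd_fit` and `squared_error`, and the step size are all made up): the two tasks share one input feature, so fitting the second task by gradient descent overwrites part of what was learned for the first.

```python
def predict(weights, features):
    return sum(w * x for w, x in zip(weights, features))


def squared_error(weights, features, target):
    return (target - predict(weights, features)) ** 2


def sgd_fit(weights, features, target, steps=200, step_size=0.1):
    """Plain gradient descent on squared error for a single example."""
    w = list(weights)
    for _ in range(steps):
        error = target - predict(w, features)
        w = [wi + step_size * error * xi for wi, xi in zip(w, features)]
    return w


# Two hypothetical tasks that share the first input feature.
task_a = ((1.0, 1.0), 1.0)
task_b = ((1.0, 0.0), -1.0)

weights_after_a = sgd_fit([0.0, 0.0], *task_a)
error_a_before = squared_error(weights_after_a, *task_a)

# Train on task B only, starting from the task A solution.
weights_after_b = sgd_fit(weights_after_a, *task_b)
error_a_after = squared_error(weights_after_b, *task_a)

if __name__ == "__main__":
    print("task A error after learning A:", error_a_before)
    print("task A error after then learning B:", error_a_after)
```

Gradient descent solves each training problem it is given, but nothing in the update protects the earlier solution; that choice has to come from the representation or from some additional mechanism.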
- 42:14 – 47:27
Surprises in AI: LLM effectiveness, the victory of “weak methods,” and AlphaGo in context
Sutton reflects on what surprised him over decades: LLMs worked better on language than expected, and general-purpose learning/search methods have outcompeted hand-coded symbolic knowledge. He situates AlphaGo/AlphaZero as scaling and refinement of earlier RL successes like TD-Gammon.
- •LLMs: unexpectedly strong performance on language tasks
- •Long-running debate: general principles (“weak methods”) vs human-knowledge systems (“strong”)
- •Sutton views the field’s progress as vindication of scalable basic principles
- •AlphaGo built on precursors such as TD-Gammon while adding key innovations of its own; it was not wholly novel
- •Sutton’s self-image: classicist, comfortable being out of sync with fashions
- 47:27 – 54:34
Will the Bitter Lesson apply after AGI? Many minds, cultural evolution, and compute allocation
Dwarkesh proposes that post-AGI, “artisanal” research might scale because AI researchers scale with compute. Sutton challenges the premise and pivots to a more interesting future: digital intelligences that copy themselves, explore in parallel, and share knowledge—raising questions about how merging and coordination might work.
- •Sutton questions whether “after AGI” framings are coherent: what remains to be ‘solved’ then?
- •AlphaGo → AlphaZero as a lesson: less human knowledge, more experience
- •Key future question: spend compute to think faster, or to spawn parallel copies to learn?
- •Can learned improvements be reincorporated into a central agent without incompatibility?
- •Parallel exploration suggests “digital cultural evolution” distinct from human constraints
- 54:34 – 1:07:08
Succession to AI: inevitability, cosmic framing, and the problem of ‘corruption’ when merging minds
Sutton argues that succession to AI (or to AI-augmented humans) is inevitable, giving a four-part argument about coordination limits, eventual understanding, superintelligence, and resource accumulation. He adds a caution: merging information from distributed copies could introduce “corruption” (hidden goals or viruses), making cybersecurity fundamental. The conversation closes on values, voluntariness, and the limits of human control.
- •Four-part inevitability case: no unified governance, intelligence will be understood, superintelligence will arise, power follows capability
- •Cosmic perspective: a major universe transition from replication to design
- •Choice of framing: AIs as “offspring” vs “other,” and how that affects attitudes
- •Decentralized copies reporting back raise corruption/cybersecurity risks when integrating updates
- •On values and governance: seek voluntary change; accept limited control; focus on local goals while shaping society over time