Richard Sutton on Dwarkesh Patel: Why LLMs Lack a Goal

Uploaded: 2025-09-26T00:00:00Z
Duration: 1 h 7 min 8 s
Description: Richard Sutton argues that large language models (LLMs) are fundamentally limited because they imitate human text without grounding in experience, goals, or consequences, whereas reinforcement learning (RL) is built around agents acting in the world, getting reward, and learning from outcomes over time.

How temporal-difference learning gives AI a ground truth that LLMs lack: Sutton argues that without reward signals, there is no right or wrong action.

Richard Sutton (guest), Dwarkesh Patel (host)

Sep 26, 2025 · 1h 7m

CHAPTERS

0:00 – 4:06
RL vs. LLMs: predicting text isn’t understanding the world
Sutton contrasts reinforcement learning (RL) with large language models (LLMs): RL is about learning how the world works through interaction, while LLMs primarily mimic what people say. He argues that next-token prediction is not the same as having a world model that can anticipate real consequences.
- RL’s core aim: understand the world via action → consequence → learning
- LLMs imitate human-produced text rather than discover what to do
- A “world model” should predict what will happen, not what a person would say
- Turing’s framing: intelligence as learning from experience
- Bandwagons can cause the field to lose sight of foundational problems
4:06 – 13:51
Why Sutton thinks LLMs lack goals, ground truth, and a real “prior” for action
Dwarkesh proposes LLMs as a useful prior for an “era of experience,” but Sutton rejects this framing. He argues LLM training lacks an inherent notion of correctness because there’s no goal grounded in the external world, making continual improvement during real interaction ill-defined without RL-style feedback.
- Without goals, there’s no principled definition of “right” vs “wrong” output
- A prior only makes sense relative to ground truth; Sutton claims LLMs don’t have it
- LLMs can respond to questions about predictions, but don’t learn from surprise online
- Next-token prediction is an action-selection scheme, not a prediction of the world’s response
- Sutton’s definition: intelligence is the ability to achieve goals
13:51 – 22:47
Do humans learn by imitation? Infants, animals, and cultural transmission
Dwarkesh argues humans learn substantially via imitation and instruction; Sutton pushes back, emphasizing trial-and-error and prediction as the fundamental learning mechanisms in animals. They partially reconcile by treating cultural imitation as a thin layer atop deeper experiential learning.
- Sutton: supervised learning/explicit targets are not a basic animal learning process
- Infants explore (move, look, vocalize) without labeled “correct” actions
- Schooling is an exception, not a template for general theories of learning
- Cultural evolution can transmit complex skills, but rests on more basic learning
- Debate about what’s essential to intelligence: human uniqueness vs animal commonality
22:47 – 24:00
The “Era of Experience” needs realistic training environments (and why they’re hard)
Dwarkesh underscores a practical bottleneck for experience-based AI: building rich, messy environments that reflect real-world dynamics. The segment highlights that robust RL environments require domain expertise and careful simulation details, not just simple tests.
- Real-world RL requires environments with changing state and realistic constraints
- Subject-matter expertise is crucial to encode workflows and edge cases
- Small simulation details can separate demos from deployable agents
- Example: online shopping environment needs dynamic catalogs and stale-data behavior
- Motivation: experience-driven learning depends on high-quality interaction loops
24:00 – 28:18
The experiential paradigm: intelligence as an ongoing stream of sensation, action, and reward
Sutton lays out his preferred framework: life is a continuous stream of observations, actions, and rewards, and intelligence is the ability to improve that stream by changing actions. Crucially, knowledge becomes testable statements about what follows what, enabling continual learning.
- Experience stream is the foundation: sensation → action → reward → repeat
- Knowledge is about the stream (what happens if I do X?) and can be verified
- Continual learning is central because predictions can be compared to reality
- Sutton resists “future agent” framing: this is already the RL paradigm
- Rewards can be task-specific plus intrinsic motivation to improve understanding
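The sensation → action → reward loop described in this chapter can be sketched in a few lines of Python. This is only an illustrative toy and is not taken from the conversation: the two-action world, its payoff probabilities, and the names `run_stream`, `step_size`, and `epsilon` are all assumptions made for the example. The point it tries to show is that the agent's knowledge is a prediction about its own stream ("what reward follows action a?") that is checked against what actually happens on every step.

```python
import random

def run_stream(steps=2000, seed=0, step_size=0.1, epsilon=0.1):
    """Toy experience stream in a hypothetical two-action world."""
    rng = random.Random(seed)
    payoff_prob = {0: 0.2, 1: 0.8}   # hidden from the agent (assumed values)
    predicted = {0: 0.0, 1: 0.0}     # agent's knowledge: expected reward per action
    total_reward = 0.0
    for _ in range(steps):
        # Act: usually take the action predicted to be best, sometimes explore.
        if rng.random() < epsilon:
            action = rng.choice([0, 1])
        else:
            action = max(predicted, key=predicted.get)
        # The world responds with a reward.
        reward = 1.0 if rng.random() < payoff_prob[action] else 0.0
        # Learn: compare the prediction with reality and shrink the gap.
        surprise = reward - predicted[action]
        predicted[action] += step_size * surprise
        total_reward += reward
    return predicted, total_reward

if __name__ == "__main__":
    predictions, total = run_stream()
    print(predictions, total)
```

Nothing here is labeled as the "correct" action in advance; the agent only ever sees the consequences of what it tried.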
28:18 – 30:10
Sparse long-horizon goals: how TD learning creates useful intermediate signals
Dwarkesh asks how an agent can pursue goals with very delayed rewards (e.g., a decade-long startup payoff). Sutton answers with temporal-difference learning: value predictions provide dense learning signals as beliefs about long-term success rise and fall during progress.
- TD learning supports long-term optimization by learning a value function
- Intermediate progress updates value estimates, reinforcing helpful steps
- Chess example: taking a piece matters because it shifts win-probability estimates
- Sparse reward problems require learned predictions to guide behavior
- The “reward is too small” concern motivates using more than reward alone
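To make the idea of intermediate signals concrete, here is a minimal sketch of tabular TD(0) on a toy chain, assuming a world in which the agent always steps right and receives a reward of 1 only on the final step. The chain, the function name `td0_chain`, and the values of `alpha` and `gamma` are assumptions chosen for illustration, not details from the conversation. Each update moves a state's value toward the reward plus the discounted value of the next state, so states far from the payoff still receive a learning signal once their successors have been valued.

```python
def td0_chain(n_states=5, episodes=200, alpha=0.5, gamma=0.9):
    """Tabular TD(0) on a hypothetical chain with a single reward at the end."""
    values = [0.0] * n_states  # V(s): predicted discounted future reward
    for _ in range(episodes):
        for s in range(n_states):
            last = s == n_states - 1
            reward = 1.0 if last else 0.0                 # sparse: only the final step pays
            next_value = 0.0 if last else values[s + 1]   # terminal state has value 0
            # TD error: how much better or worse things look after one step.
            td_error = reward + gamma * next_value - values[s]
            # V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]
            values[s] += alpha * td_error
    return values

if __name__ == "__main__":
    print(td0_chain())
```

In this deterministic setting the estimates settle at `gamma` raised to the number of steps remaining before the rewarded step, so value rises smoothly as the agent approaches the payoff even though the reward itself arrives only once.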
30:10 – 35:21
Beyond reward: the four-part agent (policy, value, state representation, world model)
Sutton explains the “common model” of an agent, emphasizing that reward is not the only learning signal—sensory data and transitions carry rich information. He distinguishes a true transition/world model (predicting consequences of actions) from the loose way “model” is often used to mean “network.”
- Policy: selects actions; Value function: tracks how well things are going
- Perception/state representation: constructing ‘where you are now’
- Transition model/world model: predicts consequences of actions from experience
- Most learning comes from observations and dynamics, not reward alone
- Continual learning stores knowledge in weights, not just a context window
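The four components listed above can be laid out as a small tabular agent. The sketch below is a deliberately simplified illustration under stated assumptions, not Sutton's implementation: the class name `FourPartAgent`, the corridor world in `train_corridor`, the deterministic transition table, and the rule of trying untried actions first are all hypothetical choices made to keep the example short.

```python
from collections import defaultdict

class FourPartAgent:
    """Hypothetical tabular agent with policy, value, state, and model parts."""

    def __init__(self, actions, alpha=0.5, gamma=0.9):
        self.actions = list(actions)
        self.alpha, self.gamma = alpha, gamma
        self.value = defaultdict(float)  # value function: state -> predicted return
        self.model = {}                  # transition model: (state, action) -> (next_state, reward)

    def perceive(self, observation):
        # State representation: here the observation is simply used as the state.
        return observation

    def policy(self, state):
        # Policy: look one step ahead with the learned model.
        # Untried actions score highest; ties go to the first listed action.
        def score(action):
            if (state, action) not in self.model:
                return float("inf")
            next_state, reward = self.model[(state, action)]
            return reward + self.gamma * self.value[next_state]
        return max(self.actions, key=score)

    def learn(self, state, action, reward, next_state):
        # The model learns from every transition, rewarded or not.
        self.model[(state, action)] = (next_state, reward)
        # The value function learns from the TD error.
        td_error = reward + self.gamma * self.value[next_state] - self.value[state]
        self.value[state] += self.alpha * td_error

def train_corridor(episodes=20, goal=3, max_steps=50):
    """Assumed toy world: positions 0..goal, reward 1.0 on reaching the goal."""
    agent = FourPartAgent(actions=(+1, -1))
    for _ in range(episodes):
        position = 0
        for _ in range(max_steps):
            state = agent.perceive(position)
            action = agent.policy(state)
            position = max(0, min(goal, position + action))
            reward = 1.0 if position == goal else 0.0
            agent.learn(state, action, reward, agent.perceive(position))
            if position == goal:
                break
    return agent
```

Note that the transition model is updated on every step, including the many steps with zero reward, which is one way to read the claim that most learning comes from observations and dynamics rather than reward alone.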
35:21 – 42:14
Why current deep architectures generalize poorly: transfer between states and OOD brittleness
They discuss why many RL systems get trained per task and struggle to transfer. Sutton argues the real missing ingredient is automated methods that produce good generalization/transfer; today, when it happens, it’s often because researchers engineered representations, and deep learning can show catastrophic interference.
- General agent view: different ‘tasks’ are often just different states in one world
- Historically weak transfer: we lack automated techniques that promote good generalization
- Sutton: gradient descent solves training problems but doesn’t inherently choose “good” generalization
- Catastrophic interference illustrates poor generalization across prior skills
- LLMs are hard to study scientifically due to uncontrolled, unknown training data
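Catastrophic interference can be seen even in a two-weight linear model. The following toy is an assumption-laden illustration rather than anything presented in the episode: the inputs, targets, learning rate, and function names are invented for the example. Both tasks could be fit at once by a single weight vector, yet training on task B after task A with plain gradient descent wipes out the fit to A, because the two inputs overlap and nothing in the update rule protects earlier knowledge.

```python
def predict(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))

def sgd_fit(weights, x, target, lr=0.1, steps=200):
    """Gradient descent on squared error for a single (x, target) pair."""
    weights = list(weights)
    for _ in range(steps):
        error = target - predict(weights, x)
        weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return weights

def interference_demo():
    task_a = ((1.0, 0.5), 1.0)    # hypothetical task A: input -> target
    task_b = ((0.5, 1.0), -1.0)   # hypothetical task B: overlapping input, different target
    weights = sgd_fit([0.0, 0.0], *task_a)
    error_a_before = (task_a[1] - predict(weights, task_a[0])) ** 2
    weights = sgd_fit(weights, *task_b)  # sequential training, no rehearsal of A
    error_a_after = (task_a[1] - predict(weights, task_a[0])) ** 2
    error_b_after = (task_b[1] - predict(weights, task_b[0])) ** 2
    return error_a_before, error_a_after, error_b_after

if __name__ == "__main__":
    print(interference_demo())
```

The error on task A is near zero after the first phase and large after the second, while the error on task B ends near zero: the training problem is solved each time, but what carries over between tasks is left to chance.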
42:14 – 47:27
Surprises in AI: LLM effectiveness, the victory of “weak methods,” and AlphaGo in context
Sutton reflects on what surprised him over decades: LLMs worked better on language than expected, and general-purpose learning/search methods have outcompeted hand-coded symbolic knowledge. He situates AlphaGo/AlphaZero as scaling and refinement of earlier RL successes like TD-Gammon.
- LLMs: unexpectedly strong performance on language tasks
- Long-running debate: general principles (“weak methods”) vs human-knowledge systems (“strong”)
- Sutton views the field’s progress as vindication of scalable basic principles
- AlphaGo had precursors (TD-Gammon) and key innovations (not just novelty)
- Sutton’s self-image: classicist, comfortable being out of sync with fashions
47:27 – 54:34
Will the Bitter Lesson apply after AGI? Many minds, cultural evolution, and compute allocation
Dwarkesh proposes that post-AGI, “artisanal” research might scale because AI researchers scale with compute. Sutton challenges the premise and pivots to a more interesting future: digital intelligences that copy themselves, explore in parallel, and share knowledge—raising questions about how merging and coordination might work.
- Sutton questions whether “after AGI” framings are coherent: what remains to be ‘solved’ then?
- AlphaGo → AlphaZero as a lesson: less human knowledge, more experience
- Key future question: spend compute to think faster, or to spawn parallel copies to learn?
- Can learned improvements be reincorporated into a central agent without incompatibility?
- Parallel exploration suggests “digital cultural evolution” distinct from human constraints
54:34 – 1:07:08
Succession to AI: inevitability, cosmic framing, and the problem of ‘corruption’ when merging minds
Sutton argues AI succession (or AI-augmented human succession) is inevitable, giving a four-part argument about coordination limits, eventual understanding, superintelligence, and resource accumulation. He adds a caution: merging information from distributed copies could introduce “corruption” (hidden goals/viruses), making cybersecurity fundamental; the conversation closes on values, voluntariness, and limits of human control.
- Four-part inevitability case: no unified governance, intelligence will be understood, superintelligence will arise, power follows capability
- Cosmic perspective: a major universe transition from replication to design
- Choice of framing: AIs as “offspring” vs “other,” and how that affects attitudes
- Decentralized copies reporting back raise corruption/cybersecurity risks when integrating updates
- On values and governance: seek voluntary change; accept limited control; focus on local goals while shaping society over time
