Dwarkesh Podcast

Richard Sutton on the Dwarkesh Podcast: Why LLMs Lack a Goal

How temporal-difference learning gives AI a ground truth that LLMs lack: Sutton argues that without reward signals, there is no right or wrong action.

Guest: Richard Sutton · Host: Dwarkesh Patel
Sep 26, 2025 · 1h 7m · Watch on YouTube ↗

At a glance

WHAT IT’S REALLY ABOUT

Richard Sutton: Why Reinforcement Learning Beats LLMs for Real Intelligence

  1. Richard Sutton argues that large language models (LLMs) are fundamentally limited because they imitate human text without grounding in experience, goals, or consequences, whereas reinforcement learning (RL) is built around agents acting in the world, getting reward, and learning from outcomes over time.
  2. He emphasizes that intelligence is about achieving goals via continual learning from an ongoing stream of sensation, action, and reward, and that animals (e.g., squirrels) already embody most of what matters for intelligence, with language being a relatively thin layer on top.
  3. Sutton believes scalable AI will come from agents that learn directly from rich real-world experience, form world models that predict what actually happens, and use value functions and temporal-difference learning to bridge long-term goals and short-term actions, rather than from ever-larger supervised or imitation systems.
  4. On the long-term future, he sees a likely “succession” from biological to designed digital intelligences as an inevitable stage of the universe, and encourages humans to view this not only through a human-centric lens but as a major cosmic transition they can be proud to have initiated.

IDEAS WORTH REMEMBERING

5 ideas

Intelligence requires goals and experience, not just prediction of text.

Sutton insists that real intelligence is about achieving goals in the external world via actions and feedback; LLMs optimize next-token prediction without grounded goals or consequences, so they lack a principled notion of right or wrong behavior.

Reinforcement learning is built for continual, online learning from the world.

RL agents act, observe what happens, receive reward, and update their policies and value functions during normal interaction, allowing them to adapt to specific environments and tasks that could never be fully anticipated in training data.
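The act-observe-reward-update loop described above can be sketched in a few lines. This is a minimal illustration (not from the episode, and the toy setup is my own assumption): an epsilon-greedy agent on a 3-armed bandit, revising its action-value estimates from each reward during normal interaction rather than from a fixed dataset.

```python
import random

# Hypothetical toy setup: 3 arms with unknown payoff probabilities.
random.seed(0)
true_means = [0.2, 0.5, 0.8]   # hidden from the agent
Q = [0.0, 0.0, 0.0]            # the agent's action-value estimates
counts = [0, 0, 0]
epsilon = 0.1                  # exploration rate

for step in range(2000):
    # Act: mostly exploit current estimates, occasionally explore.
    if random.random() < epsilon:
        a = random.randrange(3)
    else:
        a = max(range(3), key=lambda i: Q[i])
    # Observe: the world returns a reward for the chosen action.
    r = 1.0 if random.random() < true_means[a] else 0.0
    # Update: incremental sample-average update of Q(a), online.
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]

# The agent converges on the best arm through interaction alone.
print(max(range(3), key=lambda i: Q[i]))
```

The key point is that learning happens inside the interaction loop itself; there is no separate training phase on pre-collected data.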

Supervised and imitation learning are not the primary learning mechanisms in nature.

Drawing from psychology and animal learning, Sutton argues that animals largely learn via prediction and trial-and-error control, not from labeled examples of correct behavior, implying that RL-style learning is closer to biological intelligence than supervised LLM training.

Scalable AI will prioritize experience-based learning over human-provided knowledge.

Echoing his ‘Bitter Lesson,’ Sutton predicts that methods which rely heavily on embedded human knowledge and fixed datasets (like LLMs) will be overtaken by agents that can generate vast amounts of their own data through interaction and learn directly from it.

Value functions and temporal-difference learning are key to handling long-term goals.

To solve tasks with sparse, delayed rewards (like startups or winning a long game), agents must learn value functions that predict long-term outcomes and use TD learning so that incremental progress can reinforce intermediate actions along the way.

WORDS WORTH SAVING

5 quotes

For me, having a goal is the essence of intelligence.

Richard Sutton

If we understood a squirrel, we'd be almost all the way there to understanding human intelligence.

Richard Sutton

Large language models are about mimicking people… They're not about figuring out what to do.

Richard Sutton

Supervised learning is not something that happens in nature… Squirrels don't go to school.

Richard Sutton

I think we should be proud that we are giving rise to this great transition in the universe.

Richard Sutton

Fundamental differences between reinforcement learning and large language models
Goals, reward, and continual learning as the core of intelligence
Limits of supervised and imitation learning in animals and AI
World models, temporal-difference learning, and long-horizon credit assignment
Generalization, transfer, and the shortcomings of current deep learning methods
The “Bitter Lesson” and why scalable methods outcompete human-encoded knowledge
AI succession, digital minds, and the cosmic significance of designed intelligence

High-quality AI-generated summary created from a speaker-labeled transcript.
