
Richard Sutton – Father of RL thinks LLMs are a dead end
Richard Sutton (guest), Dwarkesh Patel (host)
In this episode of the Dwarkesh Podcast, host Dwarkesh Patel interviews Richard Sutton about why he believes LLMs are a dead end for real intelligence.
Richard Sutton: Why Reinforcement Learning Beats LLMs for Real Intelligence
Richard Sutton argues that large language models (LLMs) are fundamentally limited because they imitate human text without grounding in experience, goals, or consequences, whereas reinforcement learning (RL) is built around agents acting in the world, getting reward, and learning from outcomes over time.
He emphasizes that intelligence is about achieving goals via continual learning from an ongoing stream of sensation, action, and reward, and that animals (e.g., squirrels) already embody most of what matters for intelligence, with language being a relatively thin layer on top.
Sutton believes scalable AI will come from agents that learn directly from rich real-world experience, form world models that predict what actually happens, and use value functions and temporal-difference learning to bridge long-term goals and short-term actions, rather than from ever-larger supervised or imitation systems.
On the long-term future, he sees a likely “succession” from biological to designed digital intelligences as an inevitable stage of the universe, and encourages humans to view this not only through a human-centric lens but as a major cosmic transition they can be proud to have initiated.
Key Takeaways
Intelligence requires goals and experience, not just prediction of text.
Sutton insists that real intelligence is about achieving goals in the external world via actions and feedback; LLMs optimize next-token prediction without grounded goals or consequences, so they lack a principled notion of right or wrong behavior.
Reinforcement learning is built for continual, online learning from the world.
RL agents act, observe what happens, receive reward, and update their policies and value functions during normal interaction, allowing them to adapt to specific environments and tasks that could never be fully anticipated in training data.
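The act-observe-reward-update loop described above can be sketched in code. The following is a minimal illustration, not anything from the episode: a tabular Q-learning agent on a made-up two-state environment where only action 1 in state 0 pays off. The point is that the agent updates its estimates during normal interaction, from its own stream of experience, with no labeled dataset.

```python
import random

def step(state, action):
    """Hypothetical toy environment: returns (next_state, reward)."""
    reward = 1.0 if (state == 0 and action == 1) else 0.0
    return (1 - state), reward  # alternate between the two states

def run(steps=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # value estimates for every (state, action) pair, learned online
    q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
    state = 0
    for _ in range(steps):
        # epsilon-greedy: mostly exploit current estimates, sometimes explore
        if rng.random() < epsilon:
            action = rng.choice([0, 1])
        else:
            action = max((0, 1), key=lambda a: q[(state, a)])
        next_state, reward = step(state, action)
        # online update from this single interaction
        best_next = max(q[(next_state, a)] for a in (0, 1))
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state
    return q

q = run()
print(q[(0, 1)] > q[(0, 0)])  # the rewarded action should be valued higher
```

Nothing here is batch-trained: every update uses only the most recent transition, which is what lets such agents keep adapting to environments that were never anticipated in a fixed dataset.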
Supervised and imitation learning are not the primary learning mechanisms in nature.
Drawing from psychology and animal learning, Sutton argues that animals largely learn via prediction and trial-and-error control, not from labeled examples of correct behavior, implying that RL-style learning is closer to biological intelligence than supervised LLM training.
Scalable AI will prioritize experience-based learning over human-provided knowledge.
Echoing his essay ‘The Bitter Lesson,’ Sutton predicts that methods which rely heavily on embedded human knowledge and fixed datasets (like LLMs) will be overtaken by agents that can generate vast amounts of their own data through interaction and learn directly from it.
Value functions and temporal-difference learning are key to handling long-term goals.
To solve tasks with sparse, delayed rewards (like startups or winning a long game), agents must learn value functions that predict long-term outcomes and use TD learning so that incremental progress can reinforce intermediate actions along the way.
Current deep learning has weak, fragile generalization and transfer.
Sutton notes that gradient descent optimizes performance on seen data but does not inherently promote good generalization or transfer; phenomena like catastrophic forgetting show we still lack robust, automated mechanisms for generalization across states and tasks.
AI succession to digital intelligences is likely and not inherently catastrophic.
Given the absence of a unified global authority, the inevitability of solving intelligence, and the power advantages of smarter systems, Sutton expects digital or augmented intelligences to dominate over time and urges humans to see this as a major, possibly positive, transition in the history of the universe.
Notable Quotes
“For me, having a goal is the essence of intelligence.”
— Richard Sutton
“If we understood a squirrel, we'd be almost all the way there to understanding human intelligence.”
— Richard Sutton
“Large language models are about mimicking people… They're not about figuring out what to do.”
— Richard Sutton
“Supervised learning is not something that happens in nature… Squirrels don't go to school.”
— Richard Sutton
“I think we should be proud that we are giving rise to this great transition in the universe.”
— Richard Sutton
Questions Answered in This Episode
If intelligence fundamentally requires goals and reward, how far can LLM-style systems realistically go before they hit a hard ceiling?
What would a practical, large-scale continual-learning RL agent look like when embedded in messy real-world environments like companies or cities?
How could we design algorithms that explicitly favor good generalization and transfer, rather than relying on ad hoc human architecture choices?
What kinds of intrinsic motivations or reward signals should we give general-purpose AI agents to encourage safe, useful exploration and world-model building?
In a future with many digital minds that can copy and merge, how should we think about issues of corruption, value drift, and identity continuity?
Transcript Preview
Why are you trying to distinguish humans? Humans are animals. What we have in common is more interesting. What distinguishes us, we should be paying less attention to.
I mean, we're trying to replicate intelligence, right? No animal can go to the moon or make semiconductors, so we wanna understand what makes humans special.
So, I like the way you consider that obvious, 'cause I consider the opposite obvious. If we understood a squirrel, we'd be almost all the way there. I am personally just kind of content being out of sync with my field for a long period of time, perhaps decades, because occasionally I have been proved right in the past. I don't think learning is really about training. It's about an active process. The child tries things and sees what happens. I think we should be proud that we are giving rise to this great transition in the universe.
Today, I'm chatting with Richard Sutton, who is one of the founding fathers of reinforcement learning, and inventor of many of the main techniques used there, like TD learning and policy gradient methods. And for that, he received this year's Turing Award, which, if you don't know, is basically the Nobel Prize for computer science. Richard, congratulations.
Thank you, Dwarkesh.
And, uh, thanks for coming on the podcast.
It's my pleasure.
Okay, so first question. My audience and I are familiar with the LLM way of thinking about AI. Conceptually, what are we missing in terms of thinking about AI from the RL perspective?
Well, yes, I think it's really quite a different point of view, and it's, it can easily get separated and lose the ability to talk to each other.
Mm-hmm.
And, um, yeah, large language models have become such a big thing, generative AI in general a big thing, um, and our field is subject to bandwagons and fashions, so we lose, we lose track of the, uh, basic, basic things. 'Cause I consider reinforcement learning to be basic AI, and what is intelligence? Uh, the problem is, is to understand your world.
Right.
And, um, reinforcement learning is about understanding wh- your world, whereas large language models are about mimicking people, doing what people say you should do. They're not about figuring out what to do.
Huh. I guess y- y- you would think that t- uh, to emulate the trillions of tokens in the corpus of internet text, you would have to build a world model. In fact, these models do seem to have very robust world models, and they, they're the best, um, world models we've made to date in AI, right? So, what, what, what do you think that, that's missing?
Uh, I would disagree with most of the things you just said.
(laughs) Great.
(laughs)
(laughs)
Just to mimic the, the, what people say is not really to build a model of the world at all, I don't think. You know, you're mimicking things that have, uh, a model of the world, the people.