Dwarkesh Podcast
Richard Sutton on Dwarkesh Patel: Why LLMs Lack a Goal
How temporal difference learning gives AI a ground truth that LLMs lack: Sutton argues without reward signals, there is no right or wrong action.
FREQUENTLY ASKED QUESTIONS
Direct answers grounded in the episode transcript. Tap any timestamp to verify against the source.
Why does Richard Sutton say LLMs lack ground truth?
Sutton's critique is that LLMs lack a goal, so they lack ground truth. In his framing, continual learning means learning during normal interaction with the world, and the system needs a way to tell what is right during that interaction. An LLM can produce one reply or another, but the large-language-model setup has no definition of the right thing to say. In reinforcement learning, the right thing to do is the thing that gets reward, so a proposed action can be checked against an external standard. For world modeling, ground truth comes from predicting what will happen and then seeing what actually happens. Sutton says LLMs predict what a person would say, not what the world will give back in response to an action.
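The reward-as-ground-truth idea can be made concrete with temporal-difference learning, which the episode subtitle references: a value estimate is a prediction that gets checked against what the stream of experience actually delivers. Below is a minimal TD(0) sketch; the two-state chain environment and all names are invented for illustration, not taken from the episode:

```python
# Hypothetical two-state chain: state 0 -> state 1 -> terminal.
# A reward of +1 arrives only on the final transition.
def step(state):
    if state == 0:
        return 1, 0.0      # move to state 1, no reward yet
    return None, 1.0       # episode ends with reward 1

values = {0: 0.0, 1: 0.0}  # predictions to be tested against experience
alpha, gamma = 0.1, 1.0    # step size and discount

for _ in range(1000):
    state = 0
    while state is not None:
        next_state, reward = step(state)
        target = reward + gamma * (values[next_state] if next_state is not None else 0.0)
        # The TD error is the gap between the prediction and what the world gave back.
        values[state] += alpha * (target - values[state])
        state = next_state

print(values)  # both estimates approach 1.0, the true return
```

The external check Sutton describes is the TD error: the agent's prediction is right or wrong relative to the reward and next-state value the environment actually returns, not relative to what a person would say.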
▸ 4:44 in transcript
How does Richard Sutton apply the Bitter Lesson to LLMs?
Sutton treats LLMs as a partial Bitter Lesson case, not the final scalable path. They use massive computation and scale to the limits of internet data, but they also build in a great deal of human knowledge. That makes the approach feel rewarding, because adding more human knowledge makes the systems better. His expectation is that systems able to learn from experience will be more scalable and will eventually outperform human-knowledge-heavy systems. When Dwarkesh suggests LLMs could be the scaffold for future experiential learning, Sutton replies that in every Bitter Lesson case you could start with human knowledge and then do the scalable things. The problem is practical: people get locked into the human-knowledge approach, and the methods that are truly scalable end up eating their lunch.
▸ 10:50 in transcript
What does Richard Sutton mean by the era of experience?
The era of experience means intelligence is grounded in an ongoing stream of sensation, action, and reward. Sutton calls this the experiential paradigm: life is the repeated loop in which an agent senses, acts, receives reward, and keeps going. Intelligence is about changing actions to increase rewards in that stream. Learning is both from the stream and about the stream, because the knowledge an agent gains concerns what will happen after actions or which events follow other events. That knowledge can be tested by comparing it with the stream, which is why continual learning is possible. Sutton adds that reward functions can vary by situation: chess rewards winning, a squirrel may be rewarded by getting nuts, and animals generally avoid pain and acquire pleasure.
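The repeated loop of sensing, acting, and receiving reward can be sketched as a toy agent-environment interaction. Here a hypothetical two-armed bandit stands in for the world, and the agent's payoff estimates are knowledge about the stream that is tested against the stream itself; every name and number is an illustrative assumption, not from the episode:

```python
import random

class Agent:
    """Toy agent that shifts toward actions that have brought more reward."""
    def __init__(self, n_actions):
        self.estimates = [0.0] * n_actions
        self.counts = [0] * n_actions

    def act(self):
        if random.random() < 0.1:  # occasional exploration
            return random.randrange(len(self.estimates))
        return max(range(len(self.estimates)), key=lambda a: self.estimates[a])

    def learn(self, action, reward):
        self.counts[action] += 1
        # Incremental average: knowledge about the stream, checkable against it.
        self.estimates[action] += (reward - self.estimates[action]) / self.counts[action]

def environment(action):
    # Hypothetical world: action 1 pays off far more often than action 0.
    return 1.0 if random.random() < (0.8 if action == 1 else 0.2) else 0.0

agent = Agent(n_actions=2)
for _ in range(5000):          # the ongoing stream: sense, act, reward, repeat
    a = agent.act()
    r = environment(a)
    agent.learn(a, r)

print(agent.estimates)  # estimates move toward the true payoff rates
```

Changing actions to increase rewards in the stream is exactly what the greedy choice in `act` does once the estimates, learned from the stream, start to reflect it.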
▸ 24:04 in transcript
Why does Richard Sutton say gradient descent does not make models generalize well?
Sutton's point is that gradient descent solves seen problems, not the problem of good transfer. He defines generalization as training on one state affecting behavior on other states, and says that influence can be good or bad. In his view, current deep learning does not contain an algorithmic pressure that makes the influence good. If a model learns a new thing, it can catastrophically interfere with old things, which is bad generalization. When models do generalize well, Sutton attributes that to researchers trying representations and architectures until they find one that transfers well. He says there are few automated techniques to promote transfer, and none are used in modern deep learning. The missing piece is an algorithm that causes generalization to be good rather than merely happen.
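Catastrophic interference of the kind Sutton describes can be shown with a deliberately tiny example: a one-weight linear model trained by gradient descent on one task and then on a conflicting one. The tasks and numbers here are invented for illustration:

```python
def mse(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def sgd(w, data, lr=0.05, epochs=200):
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x  # gradient step on this one example
    return w

task_a = [(1.0, 2.0), (2.0, 4.0)]    # consistent with w = 2
task_b = [(1.0, -2.0), (2.0, -4.0)]  # consistent with w = -2

w = 0.0
w = sgd(w, task_a)
err_a_before = mse(w, task_a)        # near zero after fitting task A
w = sgd(w, task_b)                   # now learn task B with the same weight
err_a_after = mse(w, task_a)         # task A performance is destroyed
print(err_a_before, err_a_after)
```

Fitting the second task drives the shared weight away from the value the first task needs, so learning the new thing wrecks the old one: gradient descent solved each seen problem in turn, but nothing in the algorithm made the influence of new training on old states good.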
▸ 37:29 in transcript
What is Richard Sutton's AI succession argument?
Sutton's AI succession argument is that digital intelligence or augmented humans will eventually gain power. He gives four steps: humanity lacks one unified authority that can impose a permanent consensus, researchers will eventually understand intelligence, intelligence will not stop at the human level, and the most intelligent things will tend to gain resources and power. That combination, for him, makes some form of succession inevitable. He does not say every outcome is good. He says there are good and bad possibilities, and asks how people should feel about the transition. Sutton encourages a positive frame: humans have long tried to understand intelligence, and designed intelligences may mark a transition from replicated beings to designed entities. From the universe's point of view, he calls it a major stage worth feeling proud to have initiated.
▸ 54:50 in transcript
Answers are AI-generated from the transcript and may contain errors. Tap a question to verify against the source.