Dwarkesh Podcast
Richard Sutton: Why LLMs Lack a Goal
How temporal difference learning gives AI a ground truth that LLMs lack: Sutton argues that without reward signals, there is no right or wrong action.
EVERY SPOKEN WORD
120 min read · 24,310 words
- 0:00 – 13:51
Are LLMs a dead end?
- RSRichard Sutton
Why are you trying to distinguish humans? Humans are animals. What we have in common is more interesting. What distinguishes us, we should be paying less attention to.
- DPDwarkesh Patel
I mean, we're trying to replicate intelligence, right? No animal can go to the moon or make semiconductors, so we wanna understand what makes humans special.
- RSRichard Sutton
So, I like the way you consider that obvious, 'cause I consider the opposite obvious. If we understood a squirrel, we'd be almost all the way there. I am personally just kind of content being out of sync with my field for a long period of time, perhaps decades, because occasionally I have been proved right in the past. I don't think learning is really about training. It's about an active process. The child tries things and sees what happens. I think we should be proud that we are giving rise to this great transition in the universe.
- DPDwarkesh Patel
Today, I'm chatting with Richard Sutton, who is one of the founding fathers of reinforcement learning, and inventor of many of the main techniques used there, like TD learning and policy gradient methods. And for that, he received this year's Turing Award, which, if you don't know, is basically the Nobel Prize for computer science. Richard, congratulations.
- RSRichard Sutton
Thank you, Dwarkesh.
- DPDwarkesh Patel
And, uh, thanks for coming on the podcast.
- RSRichard Sutton
It's my pleasure.
- DPDwarkesh Patel
Okay, so first question. My audience and I are familiar with the LLM way of thinking about AI. Conceptually, what are we missing in terms of thinking about AI from the RL perspective?
- RSRichard Sutton
Well, yes, I think it's really quite a different point of view, and it's, it can easily get separated and lose the ability to talk to each other.
- DPDwarkesh Patel
Mm-hmm.
- RSRichard Sutton
And, um, yeah, large language models have become such a big thing, generative AI in general a big thing, um, and our field is subject to bandwagons and fashions, so we lose, we lose track of the, uh, basic, basic things. 'Cause I consider reinforcement learning to be basic AI, and what is intelligence? Uh, the problem is, is to understand your world.
- DPDwarkesh Patel
Right.
- RSRichard Sutton
And, um, reinforcement learning is about understanding wh- your world, whereas large language models are about mimicking people, doing what people say you should do. They're not about figuring out what to do.
- DPDwarkesh Patel
Huh. I guess y- y- you would think that t- uh, to emulate the trillions of tokens in the corpus of internet text, you would have to build a world model. In fact, these models do seem to have very robust world models, and they, they're the best, um, world models we've made to date in AI, right? So, what, what, what do you think it is that's missing?
- RSRichard Sutton
Uh, I would disagree with most of the things you just said.
- DPDwarkesh Patel
(laughs) Great.
- RSRichard Sutton
(laughs)
- DPDwarkesh Patel
(laughs)
- RSRichard Sutton
Just to mimic the, the, what people say is not really to build a model of the world at all, I don't think. You know, you're mimicking things that have, uh, a model of the world, the people.
- DPDwarkesh Patel
Right.
- RSRichard Sutton
But I don't wanna a- a- approach the question in an adversarial way. Uh, but, but I would, I would question the idea that they, um, they have a world model. So, a world model would enable you to predict what would happen.
- DPDwarkesh Patel
Right.
- RSRichard Sutton
Uh, they, they have, they have the ability to predict what a person would say. They don't have the ability to predict what will happen. What we want, I think, to quote Alan Turing, what we want is a machine that can learn from experience.
- DPDwarkesh Patel
Right.
- RSRichard Sutton
Where experience is the things that actually happen in your life. You do things, you see what happens, um, and, uh, that's what you learn from.
- DPDwarkesh Patel
Yeah.
- RSRichard Sutton
The large language models learn from something else. They learn from, here's a situation, and here's what a person did. And implicitly, the suggestion is you should do what the person did.
- DPDwarkesh Patel
Right. I guess maybe the, the crux, and I'm curious if you disagree with this, is some people will say, "Okay, so this imitation learning ha- has given us a good prior, or given these models a good prior, but reasonable ways to approach problems. And as we move towards the era of experience, uh, as you call it, this prior is gonna be the basis on which we teach these models from experience, because this gives them the opportunity to get, uh, answers right some of the time. And then on this, you can build, uh, you can train them on experience." D- do you agree with that perspective?
- RSRichard Sutton
Uh, no. I, I agree that it's the r- it's the large language model perspective.
- DPDwarkesh Patel
Right.
- 13:51 – 23:57
Do humans do imitation learning?
- DPDwarkesh Patel
May- maybe it's, uh, um, interesting to compare this to humans. So in both the case of learning from imitation versus experience and on the question of goals, I think there's some interesting analogies. So, you know, kids will initially learn from imitation. Uh, you don't think so?
- RSRichard Sutton
No, of course not.
- DPDwarkesh Patel
Uh, really?
- RSRichard Sutton
Yeah.
- DPDwarkesh Patel
I think kids just, like, watch people. They, like, kind of try, try to, like, say the same words-
- RSRichard Sutton
How old are those, these kids?
- DPDwarkesh Patel
I, I think the level of-
- RSRichard Sutton
What about the first six months?
- DPDwarkesh Patel
I think they're cr- kind of imitating things. They're trying to, like, make their mouths sound the way they see their mother's mouth sound and then they'll say the same words without understanding what they mean. And as they get older, the complexity of the imitation they do increases, so that's, you're, you're, you're, you know, you're imitating maybe the skills that your, uh, people in your band are, uh, using to hunt down the deer or something. And then you go into the learning-from-experience RL regime. But I think there's a lot of imitation learning happening with, uh, humans.
- RSRichard Sutton
Yeah, surprising. Uh, yeah, you can have such a different point of view.
- DPDwarkesh Patel
Yeah.
- RSRichard Sutton
Um, when I see kids, I see kids, uh, just trying things, and, like, waving their hands around and moving their eyes around and no one... No, no one, no one tells them... There i- there is no i- there is no, um, imitation for, uh, how they move their eyes around or even the sounds they make. They may, they may want to e- create the same sounds, but the, um, the actions, you know, the thing that the, uh, infant actually does, there, there's no targets for that. There are no examples for that.
- DPDwarkesh Patel
I agree that doesn't explain everything infants do, but I think it guides the learning process. I mean, even an LLM when it's trying to predict the next token, early in training it will, like, make a guess. It'll be different from what, like, it actually sees. And in some sense, it's like very short horizon RL where it's like making this guess of, like, "Well, I think this to- token will be this." It's actually this other thing, similar to how a kid will try to say a word, it comes out wrong.
- RSRichard Sutton
The, the large language model is learning from training data. It's not learning from experience. It's r- it's learning from something that will never be a- available during its normal life. There's never any, uh, training data that says you should do this action in normal life.
- DPDwarkesh Patel
I, I think this is maybe m- more of a semantic distinction. Like, what do you call school? Is that not training data? You're not, like, going to school because it's like...
- RSRichard Sutton
School is much later. Okay, I shouldn't have said never. But, but (laughs) I don't know. I think I would even say it about school. But formal schooling is, is the exception. You shouldn't, shouldn't base your theories on that.
- DPDwarkesh Patel
But, but to me, you have phases of learning where y- I think you're just sort of programming your biology that, like, early on, you're not that useful. And then, like, kind of why you exist is to understand the world and, like, learn how to interact with it. Um, and it seems kind of like a training phase. I agree that then there's, like, a sort of more gradual... There's not a sharp cutoff to, like, training to deployment. But there seems to be this, like, initial tr- training phase, right?
- RSRichard Sutton
There's nothing whe- where you have training of what you should do. There's nothing. You g- you, you, you see things that happen. You're not, you're not told what to do. Uh, don't, don't, don't be difficult.
- DPDwarkesh Patel
(laughs)
- RSRichard Sutton
I mean this is obvious.
- DPDwarkesh Patel
I mean, you're, like, literally taught what to do. This is, like, w- where the word training comes from is from humans, right?
- RSRichard Sutton
So I don't think, uh, learning is really about training.
- DPDwarkesh Patel
Hmm.
- RSRichard Sutton
I think learning is about, about learning. It's about an active process. The child d- tries things and sees what happens.
- DPDwarkesh Patel
Right.
- RSRichard Sutton
Yeah. It does not... Eh, i- uh, we don't think, uh, we don't think about training when we think of the, uh, an infant growing up. These, these things are actually rather well understood. If you go to look about how psychologists think about learning, there's nothing like, uh, imitation. Maybe there are some ex- extreme cases where humans might do that or appear to do that, but there's no basic animal learning process called imitation. There are basic lear- animal learning processes for prediction and for trial and error control. I mean, it's really interesting how sometimes the most hardest things to see are the obvious ones. It's obvious, um, if you just look at animals and how they learn and you look at psychology and how our theories of them, um, it's obvious that, that supervised learning is not part of, uh, the way animals learn. We don't have, we don't have examples of desired behavior. What we have is examples of things that happen, things... One thing has, uh, followed another. And we have examples of, "We did something and, and there were consequences." But there are no examples of supervised learning. I mean, there are no... Supervised learning is not something that happens in nature. And, you know, school, e- even if that was the case, you know, we should forget about it because it's, it's just a... That's some special thing that happens in people. It doesn't happen broadly in nature. And, you know, squirrels don't go to school. Squirrels can learn all about the world. It's absolutely obvious, I would say, that, um, supervised learning doesn't happen in animals.
- DPDwarkesh Patel
So I, I, I interviewed this psychologist and anthropologist, Joseph Henrich, who has done work about cultural evolution and basically how did... What, you know, what distinguishes humans and how do humans pick up knowledge?
- RSRichard Sutton
Why are you trying to distinguish humans? Humans are animals. What we have in common is more interesting. What distinguishes us, we shou- we should be paying less attention to.
- DPDwarkesh Patel
I mean, we're trying to replicate intelligence, right?
- 23:57 – 34:25
The Era of Experience
- DPDwarkesh Patel
to Richard. This alternative paradigm that you're imagining...
- RSRichard Sutton
The experiential paradigm.
- DPDwarkesh Patel
Yes.
- RSRichard Sutton
Let's lay out a little bit about what it is.
- DPDwarkesh Patel
Yeah, let's talk about it.
- RSRichard Sutton
It says that experience, action, sensation... Well, sensation, action, reward, and this happens on and on and on, makes for life. It's, it says that this is the, uh, foundation and the focus of intelligence. Intelligence is about taking that stream and altering the actions to increase... the rewa- the rewards in the stream.
- DPDwarkesh Patel
Right.
- RSRichard Sutton
So learning then is from the stream, and learning is about the stream. So it's, that, that second part is, is, is particularly telling. You know, the, what you learn, your knowledge, your knowledge is about the stream. Your knowledge is about if you do some action, what will happen? Or it's about, uh, which events will follow other events. It's about the stream. It's the content of the knowledge is- is statements about the stream. Um, and so because it, it's- it's a statement about the stream, you can test it by comparing it to the stream and you can learn continually.
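The loop Sutton describes, with knowledge as testable claims about the stream, can be sketched in a few lines of Python. The line-world environment below is a made-up stand-in for illustration, not any particular benchmark:

```python
import random

class Environment:
    """Toy world: the state is a position on a line; reward arrives
    at position 10. (Entirely invented, for illustration only.)"""
    def __init__(self):
        self.state = 0

    def step(self, action):              # action is -1 or +1
        self.state += action
        reward = 1.0 if self.state == 10 else 0.0
        return self.state, reward        # next sensation, reward

env = Environment()
sensation = env.state
total_reward, correct_predictions = 0.0, 0

for t in range(1000):                    # the stream: on and on and on
    action = random.choice([-1, +1])     # try things, see what happens
    predicted = sensation + action       # knowledge: a claim about the stream
    sensation, reward = env.step(action)
    total_reward += reward
    if predicted == sensation:           # test the claim against the stream
        correct_predictions += 1

print(total_reward, correct_predictions)
```

The key property is the last comparison: because the knowledge is a statement about the stream, the stream itself says whether it was right, with no labeled examples needed.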
- DPDwarkesh Patel
Mm-hmm. So when you're imagining this future continual learning agent...
- RSRichard Sutton
They're not future. Of course we- they exist all- all the time. It's, I mean, this is what reinforcement learning paradigm is, learning from experience.
- DPDwarkesh Patel
Yeah. I guess the, maybe what I meant to say is, uh, human-level general continual learning agent.
- RSRichard Sutton
Uh-huh.
- DPDwarkesh Patel
What is the reward function? Is it just predicting the world? Is it, uh, is it then having s- a specific effect on it? H- h- what would the general reward function be?
- RSRichard Sutton
The reward, uh, function is arbitrary. And, um, so if you're playing chess, it's to win the game of chess. I- if you were to, um, uh, if you're a squirrel, maybe the- the reward has to do with getting nuts.
- DPDwarkesh Patel
Right.
- RSRichard Sutton
Um, in general, for an animal, you would say the reward is to avoid pain and to, uh, uh, acquire pleasure.
- DPDwarkesh Patel
Right.
- RSRichard Sutton
Uh, and this also would be a component having to do with, uh, I think there would be, should be a component having to do with your, uh, increasing understanding of your, of your environment that would be sort of an intrinsic motivation.
- DPDwarkesh Patel
I see. I guess this AI would be deployed to, like, lots of people would want it to be doing lots of different kinds of things.
- RSRichard Sutton
Right.
- DPDwarkesh Patel
So it's performing the task people want, but at the same time, it's learning about the world from doing that task. And do you, do you imagine... Okay, so y- we get rid of this paradigm where there's training periods and then there's deployment periods. But then is there, do we also get rid of this paradigm when there's the model and then instances of the model or copies of the model that are, you know, doing certain things? H- h- how do you think about the fact that they're, we'd want this thing to be doing different things, we'd want to aggregate the knowledge that it's g- gaining from doing those different things?
- RSRichard Sutton
I don't like the word model when used the way you just did.
- DPDwarkesh Patel
Interesting.
- RSRichard Sutton
I- I think a better word would be, uh, the network, 'cause I think you mean the-
- DPDwarkesh Patel
Mm-hmm.
- RSRichard Sutton
... the n- the network. Maybe there's many networks. So anyway, things would be learned, and then you'd have copies and many instances. And sure, you'd want to share knowledge across the, uh, instances. And there would be lots of possibilities for doing that. Like there is not today, you can't have one child lear- grow up and- and learn about the world and then e- and then every new child has to repeat that process. Whereas with AIs, with a digital intelligence-
- DPDwarkesh Patel
Yeah.
- RSRichard Sutton
... you could hope to do it once and then copy it into the next one as a starting place.
- DPDwarkesh Patel
Right.
- RSRichard Sutton
So this would be a huge savings. And I think actually it would be much more important than, uh, trying to learn from people.
- 34:25 – 42:17
Current architectures generalize poorly out of distribution
- DPDwarkesh Patel
Yeah. One of my friends, Toby Ord, pointed out that if you look at th- the MuZero models that Google DeepMind deployed to learn Atari games, that these models were initially b- not a general intelligence itself but a general framework for training specialized intelligences to play specific games. That is to say that you couldn't, uh, using that framework, train a m- policy to play both chess and Go and some other game. You had to train each one in a specialized way. And he was wondering whether that implies that reinforcement learning generally, because of this information constraint, y- y- y- you can only learn one thing at a time, uh, th- the density of information is not that high, or whether it was just specific to the way that MuZero was done. And if it's specific to, uh, AlphaZero, what, what, w- what needed to be changed about that approach so that it could be a general learning agent?
- RSRichard Sutton
The, the idea is totally general. You know, uh, I do use all the time as my canonical example the idea of an AI agent is like a person.
- DPDwarkesh Patel
Yeah.
- RSRichard Sutton
And, and people, uh, in some sense, they have just one world they live in and, um, that world i- may involve chess and it may involve Atari games, uh, but those are j- are, are not a different task or a different world. Those are different states-
- DPDwarkesh Patel
Right.
- RSRichard Sutton
...that they encounter. And so the, the general idea is not limited a- at all.
- DPDwarkesh Patel
So maybe it would be useful to explain what was missing in that architecture or that, that approach which this continual learn- m- learning AGI would have.
- RSRichard Sutton
Well, th- they just set it up. They didn't... Th- it was not their ambition to, to have one agent across, across, uh, those games. If we wanna talk about transfer, we should talk about transfer not a- across games or across tasks but transfer between states.
- DPDwarkesh Patel
Yeah. I, I, I guess I'm curious about historically have we seen the level of transfer using RL techniques that would be needed to build this kind of-
- RSRichard Sutton
Oh. Okay, good. Good. We're not seeing transfer anywhere- we're not seeing general... Critical to good performance is that you can generalize well from one state to another state.
- DPDwarkesh Patel
Yeah.
- RSRichard Sutton
We don't have any methods that are good at that. What we have are people, um, try different things, eh, and they, they settle on something that, that, uh, a representation-
- DPDwarkesh Patel
Yeah.
- RSRichard Sutton
... that transfers well or that generalizes well. But we have no, we don't have any automated techniques to promote. We have very few automated techniques to promote transferring, and they're not, none of them are used in, in modern deep learning.
- DPDwarkesh Patel
Um, let me paraphrase just to make sure that I understood that correctly. It sounds like you're saying that when we do have generalization in these models, that is a result of some, uh, sculpted, uh-
- RSRichard Sutton
Humans did it.
- DPDwarkesh Patel
Yeah.
- RSRichard Sutton
The researchers did it, 'cause there's no other explanation. I mean, gradient descent will not make you generalize well. It will make you solve the problem.
- DPDwarkesh Patel
Right.
- RSRichard Sutton
It will not make you, you know, get new data, you generalize in a good way. Generalization means train on one thing that'll affect what you do on the other things. So, we know deep learning is, is really bad at this. For example, we know that if you train on some new thing, it will often catastrophically interfere with all the old things that you s- that you knew.
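The catastrophic interference Sutton refers to can be seen even in a tiny network: fit one region of a function, then train only on a second region, and performance on the first degrades. A minimal NumPy sketch (network size, learning rate, and the sine task are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny 1-16-1 tanh network trained by plain full-batch gradient descent.
W1 = rng.normal(0, 0.5, (1, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, (16, 1)); b2 = np.zeros(1)

def forward(x):
    h = np.tanh(x @ W1 + b1)
    return h, h @ W2 + b2

def train(x, y, steps=2000, lr=0.05):
    global W1, b1, W2, b2
    n = len(x)
    for _ in range(steps):
        h, pred = forward(x)
        err = pred - y                        # d(loss)/d(pred), up to a constant
        gW2 = h.T @ err / n;  gb2 = err.mean(0)
        dh = (err @ W2.T) * (1 - h ** 2)      # backprop through tanh
        gW1 = x.T @ dh / n;   gb1 = dh.mean(0)
        W2 -= lr * gW2; b2 -= lr * gb2
        W1 -= lr * gW1; b1 -= lr * gb1

def mse(x, y):
    return float(np.mean((forward(x)[1] - y) ** 2))

xa = np.linspace(0, np.pi, 50)[:, None];       ya = np.sin(xa)   # region A
xb = np.linspace(np.pi, 2 * np.pi, 50)[:, None]; yb = np.sin(xb)  # region B

train(xa, ya)                  # learn region A
err_before = mse(xa, ya)
train(xb, yb)                  # then train only on region B...
err_after = mse(xa, ya)        # ...and region A is damaged
print(err_before, err_after)
```

Because every hidden unit is shared across the input range, gradient steps that help region B freely overwrite the features region A relied on, which is the poor generalization across states being described.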
- DPDwarkesh Patel
Mm-hmm.
- RSRichard Sutton
So this is exactly bad generalization.
- DPDwarkesh Patel
Right.
- RSRichard Sutton
Now, generalization, as I said, is some kind of influence of training on one state on other states. And generalization is n- n- not necessarily good or bad, right? Just the fact that you generalize is not necessarily good or bad. You can generalize poorly, you can generalize well.
- DPDwarkesh Patel
Right.
- RSRichard Sutton
So you, you need... Generalization always will happen, um, but we need algorithms that will, uh, cause the, the generalization to be good rather-
- DPDwarkesh Patel
Yeah.
- RSRichard Sutton
... than bad.
- DPDwarkesh Patel
I'm not trying to kickstart this, uh, initial, uh, crux again, but I'm just genuinely curious because I, I think I might be using the term differently. I mean, one way to think about it is these LLMs are increasing the scope of generalization from, like, earlier systems which could not really even do a basic math problem to now they can do anything in this class of math olympiad type problems, right? So, you initially start with, like, they can generalize among addition problems at least, um, uh. Then you generalize to, like, they can generalize among, like, problems which require use of different kinds of mathematical techniques and theorems and, you know, conceptual categories, which is, like, what the math olympiad requires. And so it sounds like you don't think of th- being able to solve any problem within that category as an example of generalization? Or l- let me know if I'm m- uh, misunderstanding that.
- 42:17 – 47:28
Surprises in the AI field
- DPDwarkesh Patel
I wanna zoom out and ask about... So, being in the field of AI for longer than almost anybody who is commentating on it, uh, or working on it now, I'm just curious about what the biggest surprises have been, how much new stuff you feel like is coming out, or does it feel like people are just playing with old ideas? Um, d- zooming out, you, you know, you're... You got into this even before, like, deep learning was popular. So- How, how do you see this trajectory of this field over time, and how new ideas have come about and everything? And what's been surprising?
- RSRichard Sutton
Okay. So yeah. I, I, I, I, um, thought a little bit about this. There are many things, or a handful of things. Um, first, the large language models are surprising. It's surprising how, how effective, um, neural networks, artificial neural networks are at, at, uh, language tasks.
- DPDwarkesh Patel
Right.
- RSRichard Sutton
You know, that, that was a surprise. It wasn't expected. Language seemed different. So that's impressive.
- DPDwarkesh Patel
Mm-hmm.
- RSRichard Sutton
Um, there's a, a long-standing controversy in AI about, uh, simple basic principle methods. Uh, the, the general-purpose methods like search and learning, and compared to, um, human-enabled systems, uh, like symbolic methods. And, um, uh, so in the old days, it was interesting, because things like search and learning were called weak methods, because they're just... oh, they just use general principles. They're not using, uh, the power that comes from, uh, imbuing a system with human knowledge. So those were called strong. And, um, and so I think the weak methods have just, you know, totally won. That's, you know, that's, that's, that's the biggest, um, question from the old days of AI, what would happen, and, you know, yeah. Learning and search have just won the day.
- DPDwarkesh Patel
Right.
- RSRichard Sutton
But there's a sense which that was not surprising to me, because I was always voting for the... or hoping or rooting for the, for the, uh, simple basic principles.
- DPDwarkesh Patel
Yeah.
- RSRichard Sutton
And so even with the large language models, I, it's surprising how, how well it worked, but it was all, it was all good and gratifying. And, um, in things like AlphaGo, it's, it's sort of surprising how well that was able to work. Um, and AlphaZero in particular, how, how well it was able to work. Um, but it's all very gratifying, because again, it's simple basic principles are winning the day.
- DPDwarkesh Patel
D- d- h- h- have there felt like whenever the public conception has been changed because some new technique was... or s- sorry, some new application was developed. For example, when AlphaZero became this viral sensation, to you, as somebody who has literally came up with many of the techniques that were used, did it feel to you like new breakthroughs were made, or does it feel like, oh, we've had these techniques since the '90s, and people are simply combining them and applying them now?
- RSRichard Sutton
So the whole AlphaGo thing had a precursor-
- DPDwarkesh Patel
Right.
- RSRichard Sutton
... which is TD-Gammon. Jerry Tesauro did exactly, um, reinforcement learning, temporal difference learning methods to, um, to play backgammon.
- DPDwarkesh Patel
Right.
- RSRichard Sutton
And beat the, beat the world's best players. And it worked really well. And so in some sense, AlphaGo was, was merely a scaling up of that process. So there was quite a bit of scaling up, and there was also an additional innovation in how the search was done.
- DPDwarkesh Patel
Right.
- RSRichard Sutton
But it made sense. It wasn't surprising in that sense. AlphaGo actually didn't use, uh, TD learning. It waited to see the final outcomes. Uh, but AlphaZero used TD. Uh, and AlphaZero was applied to all the other games, and that did extremely well. I was very... I've always been very impressed by the way AlphaZero plays chess, because I'm a chess player, and it, it just... it just sacrifices material for sort of positional advantages, and it's just c- it's just content and patient to, uh-
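The contrast Sutton draws, waiting for the final outcome versus updating toward the next state's estimate, is easiest to see in his classic random-walk example. A minimal TD(0) sketch (step size and episode count are illustrative, not from TD-Gammon or AlphaZero):

```python
import random

random.seed(0)

# The classic 5-state random walk: start in the middle, step left or
# right at random; the right edge pays 1, the left edge pays 0.
N, ALPHA, GAMMA = 5, 0.05, 1.0
v = [0.5] * N                        # value estimate for each state

def run_episode(v):
    """TD(0): nudge v[s] toward the next state's estimate at every
    step, instead of waiting for the episode's final outcome."""
    s = N // 2
    while True:
        s_next = s + random.choice([-1, 1])
        if s_next < 0:               # left edge reached: outcome 0
            v[s] += ALPHA * (0.0 - v[s]); return
        if s_next >= N:              # right edge reached: outcome 1
            v[s] += ALPHA * (1.0 - v[s]); return
        v[s] += ALPHA * (GAMMA * v[s_next] - v[s])   # bootstrap mid-episode
        s = s_next

for _ in range(10000):
    run_episode(v)

# The true values for this walk are 1/6, 2/6, 3/6, 4/6, 5/6.
print([round(x, 2) for x in v])
```

A Monte Carlo learner, like the original AlphaGo's value training, would instead hold every visited state until the end and update all of them toward the single final outcome; TD spreads the learning through the episode.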
- DPDwarkesh Patel
(laughs)
- RSRichard Sutton
... sacrifice that material for a long period of time. And, um, so that was surprising that it worked so well, but also gratifying-
- DPDwarkesh Patel
Yeah.
- RSRichard Sutton
... and fitting into my worldview. So, so this has led me where I am. Where I am is, I'm in some sense a contrarian or somewhat d- thinking differently from the field is... And I'm, I am personally just kind of content being out of sync with my field for a long period of time, perhaps decades, uh, because occasionally I have been proved, uh, right in the past. And the other thing that I do to help me not feel I'm, I am out of sync and thinking in a strange way is to look not at my, my local, uh, environment or my local field, but to look back in, in time, in- into history and see what people have thought classically about, about, um, about the mind-
- DPDwarkesh Patel
Yeah.
- RSRichard Sutton
... in many different fields. And I don't feel I'm out of sync with the larger traditions. It's, I, I really view myself as a classicist rather than as a contrarian. I go to what, what the larger community of, of thinkers about the mind have always thought.
- DPDwarkesh Patel
Hmm.
- 47:28 – 54:35
Will The Bitter Lesson still apply after AGI?
- DPDwarkesh Patel
Okay. Some sort of left-field questions for you, uh, if you'll tolerate them. Um, so the way I read The Bitter Lesson is that it's not saying necessarily that human artisanal researcher tuning doesn't work, but that it obviously scales much worse than compute, which is growing exponentially. And so you want techniques which leverage the latter.
- RSRichard Sutton
Yep.
- DPDwarkesh Patel
And once we have AGI, we'll have m- researchers that would scale linearly with compute, right? So we'll have this avalanche of millions of AI researchers, and their stock will be growing as fast as, uh, compute. And so maybe this will mean that it is rational or it will make sense to have them doing good old-fashioned AI and doing these artisanal solutions. Uh, does that... as a vision of what happens after AGI in terms of how AI research will evolve, I wonder if that's still compatible with The Bitter Lesson?
- RSRichard Sutton
Well, how did we get to this AGI? You want to presume that it's been done.
- DPDwarkesh Patel
Let's suppose it started with general math- methods, but now we've got the AGI. And now we want to go-
- RSRichard Sutton
Then we're done.
- DPDwarkesh Patel
Hmm?
- RSRichard Sutton
We're done.
- DPDwarkesh Patel
Interesting. You don't think that there's a- anything above AGI?
- RSRichard Sutton
Well, but you're using it to get AGI again.
- DPDwarkesh Patel
Well, I'm using it to get superhuman levels of intelligence or competence at different tasks.
- RSRichard Sutton
Yeah. So these AGIs, if they're not superhuman already, then they, the, the knowledge that they might impart would be not superhuman.
- DPDwarkesh Patel
I guess there's different gradations if you're a human.
- RSRichard Sutton
I'm not sure this, this, your, your idea makes sense, 'cause-
- DPDwarkesh Patel
So-
- RSRichard Sutton
... 'cause it seems to presume the existence of AGI, uh, and then, th- we've already worked that out.
- DPDwarkesh Patel
So may- maybe one way to motivate this is AlphaGo was superhuman.
- RSRichard Sutton
Yeah.
- DPDwarkesh Patel
Um, it beat any Go player. AlphaZero-
- RSRichard Sutton
Yeah.
- DPDwarkesh Patel
... would beat AlphaGo every single time.
- RSRichard Sutton
Yeah.
- DPDwarkesh Patel
So there's ways to get more superhuman than, than even superhuman.
- RSRichard Sutton
Yeah.
- DPDwarkesh Patel
And it was a different architecture. And so it seems plausible to me that, well, the j- agent that's, like, able to generally learn across all domains, there would be ways to make that give it better architecture for learning, just the same way that AlphaZero was an improvement upon AlphaGo and MuZero was an improvement upon AlphaZero.
- RSRichard Sutton
And the way AlphaGo was improved was that it did not use the human knowledge, but just went from experience.
- DPDwarkesh Patel
Right.
- RSRichard Sutton
So why do you, why do you say-
- DPDwarkesh Patel
But-
- RSRichard Sutton
... bring in other agents' expertise to teach it when it's, when it's been... it's worked so well from experience and not by help from another agent?
- 54:35 – 1:07:08
Succession to AI
- DPDwarkesh Patel
I guess this brings us to the topic of AI succession.
- RSRichard Sutton
Mm-hmm.
- DPDwarkesh Patel
Uh, you have a perspective that's quite different from a lot of people that I've interviewed and maybe, uh, a lot of people generally. So, eh, I also think it's a very interesting perspective. I, I want to hear about it.
- RSRichard Sutton
Yeah. So I do think succession to digital... or digital intelligence or augmented humans is inevitable. So the argument goes, I have a four-part argument (laughs). The argument, step one is, w- um, there's no government or organization that, that, uh, gives humanity a unified point of view, that dominates and that can, that can arrange ... There's no consensus about how the world should be run. And number two, um, we will figure out how intelligence works, researchers will figure it out eventually. And number three, we won't stop just with human-level intelligence, we will get, reach superintelligence. And number four is, that once ... It's inevitable over time that the most intelligent things around would gain resources and, and power. Uh, and, uh, so put all that together, it's, you know, you, um, it's sort of inevitable that you're going to have, um, succession to AI or to AI-enabled augmented humans. So within those, those four things seem clear and, and, and sure to happen. Uh, but within that set of possibilities, some ... There can be good outcomes as well as, as less good outcomes-
- DPDwarkesh Patel
Mm-hmm.
- RSRichard Sutton
... bad outcomes. And, um, so I'm just, just trying to be realistic about where we are, and, and ask how we should feel about it.
- DPDwarkesh Patel
Yeah. I, uh, I agree with all four of those arguments and the implication, and I also agree that succession contains a wide variety of possible futures. So curious to get more thoughts on that.
- RSRichard Sutton
Right. And so then I do encourage people to think, um, positively about it. First of all because it's something we humans have always tr- tried to do for thousands of years, tried to understand themselves, tried to make themselves think better. And, um, you know, just understand themselves. So this is a great success from the, as science, humanities, uh, we're finding out what this essential part of, of, of humanness is, what it means to be intelligent. And then what I usually say is, is that this is all kind of human-centric. What if we look, you step aside from being a human and just, say, take the point of view of the universe? And, and this is, I think, a, a major stage in the universe, a major transition. A transition from replicators, we humans and animals, plants, we're all replicators, and that gives us some strengths and some limitations; and then we're entering the age of design where, 'cause our AIs are designed, our, our, our, our, all of our physical objects are designed, our buildings are designed, our, our technology is designed, and we're, we're designing now, uh, AIs. Things that can be intelligent themselves and that are themselves capable of design. And so this is, this is a, a key step in the world and I, and, and the universe, and I think, uh, it's the tr- it's the transition from the world in which most of the interesting things, uh, uh, that are, are replicated. Replicated means you can make copies of them, uh, but you don't really understand them. Like, right now we can make more intelligent beings, more children, uh, but we don't really understand how intelligence works.
- DPDwarkesh Patel
Right.
- RSRichard Sutton
Whereas we're now reaching toward having designed intelligence, intelligence where we do understand how it works, and therefore we can change it in different ways and at different speeds than otherwise. And in our future, they might not be replicated at all. We may just design AIs, and those AIs will design other AIs, and everything will be done by design and construction rather-
- DPDwarkesh Patel
Mm-hmm.
- RSRichard Sutton
... than by replication. Yeah, I mark this as one of the four great stages of the universe. First there's dust, which gives rise to stars. Then stars make planets, and the planets give rise to life. And now we're giving life to designed entities. So I think we should be proud that we are giving rise to this great transition in the universe. Yeah, so it's an interesting thing. Should we consider them part of humanity or different from humanity? It's our choice. It's our choice whether we say, "Oh, they are our offspring and we should be proud of them and we should celebrate their achievements," or whether we say, "Oh no, they're not us," and we should be horrified. It's interesting that it feels to me like a choice, and yet it's such a strongly held thing, how could it be a choice? I like these sort of contradictory implications of thought.
- DPDwarkesh Patel
Hmm. I mean, it's interesting to consider if we were just designing another generation of humans.
- RSRichard Sutton
Yes.
- DPDwarkesh Patel
Maybe design is the wrong word, but say we knew a future generation of humans was gonna come up. Forget about AI; we just know that in the long run humanity will be more capable, maybe more numerous, maybe more intelligent. How do we feel about that? I do think there are potential worlds with future humans that we would be quite concerned about.
- RSRichard Sutton
So are you thinking, like, maybe we are the Neanderthals who gave rise to Homo sapiens? Maybe Homo sapiens will give rise to a new group of people. That's what you're saying?
- DPDwarkesh Patel
Something like that. I'm basically taking the example you're giving of, like, okay, even if you consider them part of humanity-
- RSRichard Sutton
Yeah.
- DPDwarkesh Patel
... I don't think that necessarily means that we should feel super comfortable-
- RSRichard Sutton
Kinship.
- DPDwarkesh Patel
Yeah. Like, Nazis were humans, right? If we thought, "Oh, the future generation will be Nazis," I think we'd be quite concerned about just handing off power to them. So I agree that this is not super dissimilar to worrying about more capable future humans, but I don't think that addresses a lot of the concerns people might have about this level of power being attained this fast, with entities we don't fully understand.
- RSRichard Sutton
Well, I think it's relevant to point out that most of humanity doesn't have much influence on what happens. Most of humanity doesn't influence-
- DPDwarkesh Patel
Mm-hmm.
- RSRichard Sutton
... who controls the atom bombs or who controls the nation-states.
- DPDwarkesh Patel
Right.
- RSRichard Sutton
Even as a citizen, I often feel that we don't control the nation-states very much. They're out of control. A lot of it has to do with just how you feel about change. If you think the current situation is really, really good, then you're more likely to be suspicious of change and averse to change than if you think it's imperfect. And I think it's imperfect. In fact, I think it's pretty bad. (laughs)
- DPDwarkesh Patel
Yeah.
- RSRichard Sutton
So I'm open to change, and I think humanity has not had a super good track record. Maybe it's the best thing that there's been, but it's far from perfect.
- DPDwarkesh Patel
Yeah, I guess there are different varieties of change. The Industrial Revolution was change. The Bolshevik Revolution was also change. And if you were around in Russia in the 1900s and you said, "Look, things aren't going well. The tsar is kind of messing things up. We need change," I'd wanna know what kind of change you wanted before signing on the dotted line, right? It's similar with AI: I'd want to understand, to the extent it's possible to change the trajectory of AI, how to change it such that the change is positive for humans.
- RSRichard Sutton
We should be concerned about the future, and we should try to make it good. We should also, though, recognize our limits. I think we want to avoid the feeling of entitlement, the feeling of, "Oh, well, we were here first, so things should always go our way." How should we think about the future, and how much control should a particular species on a particular planet have over it? And how much control do we have? You know, a counterbalance to our limited control over the long-term future of humanity (laughs) should be: how much control do we have over our own lives? We have our own goals, and we have our families. Those things are much more controllable than trying to control the whole universe.
Episode duration: 1:07:08