David Silver: AlphaGo, AlphaZero, and Deep Reinforcement Learning | Lex Fridman Podcast #86
EVERY SPOKEN WORD
150 min read · 30,155 words
- 0:00 – 4:09
Introduction
- LFLex Fridman
The following is a conversation with David Silver, who leads the Reinforcement Learning Research Group at DeepMind, and was the lead researcher on AlphaGo, AlphaZero, and co-led the AlphaStar and MuZero efforts, and a lot of important work in reinforcement learning in general. I believe AlphaZero is one of the most important accomplishments in the history of artificial intelligence, and David is one of the key humans who brought AlphaZero to life, together with a lot of other great researchers at DeepMind. He's humble, kind, and brilliant. We were both jet-lagged but didn't care, and made it happen. It was a pleasure and truly an honor to talk with David. This conversation was recorded before the outbreak of the pandemic. For everyone feeling the medical, psychological, and financial burden of this crisis, I'm sending love your way. Stay strong. We're in this together. We'll beat this thing. This is the Artificial Intelligence Podcast. If you enjoy it, subscribe on YouTube, review it with five stars on Apple Podcasts, support it on Patreon, or simply connect with me on Twitter @lexfridman, spelled F-R-I-D-M-A-N. As usual, I'll do a few minutes of ads now, and never any ads in the middle that can break the flow of the conversation. I hope that works for you and doesn't hurt the listening experience. Quick summary of the ads. Two sponsors: MasterClass and Cash App. Please consider supporting the podcast by signing up to MasterClass at masterclass.com/lex and downloading Cash App and using code LEXPODCAST. This show is presented by Cash App, the number one finance app in the App Store. When you get it, use code LEXPODCAST. Cash App lets you send money to friends, buy Bitcoin, and invest in the stock market with as little as $1. Since Cash App allows you to buy Bitcoin, let me mention that cryptocurrency, in the context of the history of money, is fascinating. I recommend The Ascent of Money as a great book on this history. Debits and credits on ledgers started around 30,000 years ago; the U.S. dollar was created over 200 years ago; and Bitcoin, the first decentralized cryptocurrency, was released just over 10 years ago. So given that history, cryptocurrency is still very much in its early days of development, but it's still aiming to, and just might, redefine the nature of money. So again, if you get Cash App from the App Store or Google Play and use the code LEXPODCAST, you get $10, and Cash App will also donate $10 to FIRST, an organization that is helping to advance robotics and STEM education for young people around the world. This show is sponsored by MasterClass. Sign up at masterclass.com/lex to get a discount and to support this podcast. In fact, for a limited time now, if you sign up for an all-access pass for a year, you get another all-access pass to share with a friend. Buy one, get one free. When I first heard about MasterClass, I thought it was too good to be true. For $180 a year, you get an all-access pass to watch courses from, to list some of my favorites, Chris Hadfield on space exploration, Neil deGrasse Tyson on scientific thinking and communication, Will Wright, the creator of SimCity and The Sims, on game design, Jane Goodall on conservation, Carlos Santana on guitar. His song Europa could be the most beautiful guitar song ever written. Garry Kasparov on chess, Daniel Negreanu on poker, and many, many more. Chris Hadfield explaining how rockets work and the experience of being launched into space alone is worth the money.
For me, the key is to not be overwhelmed by the abundance of choice. Pick three courses you want to complete, watch each of them all the way through. It's not that long, but it's an experience that will stick with you for a long time, I promise. It's easily worth the money. You can watch it on basically any device. Once again, sign up on masterclass.com/lex to get a discount and to support this podcast. And now here's my conversation with David Silver.
- 4:09 – 11:11
First program
- LFLex Fridman
What was the first program you've ever written, and what programming language? Do you remember?
- DSDavid Silver
I remember very clearly, yeah. My, my, um, parents brought home this BBC Model B microcomputer. It was just this fascinating thing to me. I was about seven years old and couldn't resist just playing around with it. Um, so I think first program ever, um, was writing my name out, um, in different colors and getting it to loop-
- LFLex Fridman
Nice.
- DSDavid Silver
... and, uh, repeat that, and, um, there was something magical about that which just led to more and more.
- LFLex Fridman
How did you think about computers back then? Like, the, the magical aspect of it, that you can write a program and there's this thing that you just gave birth to that's able to create sort of visual elements and live on its own? Or di- did you not think of it in those romantic notions? Was it more like, "Oh, that's cool, I can, I can solve some puzzles"?
- DSDavid Silver
It was always more than solving puzzles. It was something where, you know, there was this limitless possibilities. Once you have a computer in front of you, you can do anything with it. It's, um... I used to play with Lego with the same feeling. You can make anything you want out of Lego. But even more so with a computer, you know? You don't, you're not constrained by the amount of kit you've got. And so I was fascinated by it and started pulling out the, you know, the user guide and the advanced user guide and then learning. So I started in BASIC and then, you know, later 6502. My father was, um, also became interested in the, in this machine and gave up his career to go back to school and, and study for, um-
- LFLex Fridman
Nice.
- DSDavid Silver
... a master's degree in, in artificial intelligence, funnily enough, um, at Essex University when I was, when I was seven. So I, um, was exposed to those things at an early age. He showed me how to, uh, program in Prolog and do things like querying your family tree, and those are some of my early- earliest memories of trying to, um, trying to figure things out on a computer.
- LFLex Fridman
... those are the early steps in computer science programming, but when did you first fall in love with artificial intelligence or with the ideas, the dreams of AI?
- DSDavid Silver
I think it was really when I, when I went to study at, at university. Um, so I was an undergrad at, at, at Cambridge and studying computer science, and, and I really started to question, you know, what, what really are the goals? What, what's the goal? Where, where do we want to go with, with computer science? And it seemed to me that the, the only step of major significance, um, to take was to try and recreate something akin to human intelligence. If we could do that, that would be a major leap forward. And that idea, I certainly wasn't the first to have it, but it, it, you know, nestled within me somewhere and, and became like a bug, you know? I really wanted to, to crack that problem.
- LFLex Fridman
So you thought it was... Like you had a notion that this is something that human beings can do, that it, it is possible to create an intelligent machine?
- DSDavid Silver
Well, I mean, u- unless you believe in something metaphysical, um, then what are our brains doing? Well, at some level, they're, um, information processing systems which are, um, able to take whatever information is in there, transform it through some form of program, and produce some kind of output, which enables that, that human being to do all the amazing things that they can do in this incredible world.
- LFLex Fridman
So, so then do you remember the first time you've written a program that, um... 'Cause you also had an interest in games. Do, do you remember the first time you were in a program that beat you in a game?
- DSDavid Silver
So-
- LFLex Fridman
Or beat you at anything? Sort of, uh, achieved super David Silver-level performance?
- DSDavid Silver
(laughs) So I used to work in the games industry. So for five years, I, I programmed games for, for my first job. Um, so it was a amazing opportunity to get involved in a startup company. Um, and so I, I was involved in, in building AI at that time. Um, and so for sure, there was a sense of, um, building, um, handcrafted, what people used to call AI in the games industry, which I think is not really what we might think of as, as AI in its fullest sense, but something which is able to, um, to, um, take actions in, in a way which, which makes things interesting and challenging for the, for the, for the human player. Um, and at that time, I was able to build, you know, these handcrafted agents, which in certain limited cases, could do things which, which were able to, um, do better than, than me, but mostly in these kind of twitch-like scenarios where, where they were able to do things faster or, or, or because they had some pattern which was, um, uh, able to exploit repeatedly. I think if we talk about real AI-
- LFLex Fridman
Mm-hmm.
- DSDavid Silver
... um, the first experience for me came after that, when I, I realized that this, um, path I was on wasn't taking me towards... It wasn't s- it wasn't dealing with that bug which I still had inside me to really understand intelligence (laughs) and try and, and try and solve it. Uh, everything people were doing in games was, you know, um, short-term fixes rather than long-term vision. Um, and so I went back to study for my PhD, uh, which was, funnily enough, trying to apply reinforcement learning to the game of Go. And I built my first, um, Go program using reinforcement learning, a system which would, um, by trial and error, play against itself, um, and was able to learn, um, which patterns were actually helpful to predict whether it was gonna win or lose the game, and then, you know, choose the moves that led to the combination of patterns that would mean that you're more likely to win.
- LFLex Fridman
And it-
- DSDavid Silver
And that system, that system beat me.
- LFLex Fridman
(laughs) And how did that make you feel?
- DSDavid Silver
It made me feel good. Um, yeah.
- LFLex Fridman
(laughs) I mean, was there a sort of a, a, yeah. I- is the- it's, it's a mix of a sort of excitement, and was there a tinge of sort of like almost like a fearful awe? You know, it's like, uh, what is it? In space, uh, 2001: A Space Odyssey, kind of realizing that you've created something that, (sighs) that is, you know, that, that is, that's achieved human-level intelligence in this one particular little task. And in that case, I suppose, uh, neural networks weren't involved.
- DSDavid Silver
There were no neural networks in those days. Um, this was pre-deep learning revolution.
- LFLex Fridman
Yes.
- DSDavid Silver
Um, but it was a principled self-learning system based on a lot of the principles which, which people, um, uh, still use in, in, in deep reinforcement learning. Um, how did I feel? I, I think I found it immensely satisfying that a system which was able to learn from first principles for itself was able to reach the point that it was understanding this domain, um, better than, better than I could and able to outwit me. I, um, I, I don't think it was a sense of awe. It was a sense that, um, satisfaction that this, that something I felt should work had worked.
- LFLex Fridman
So to me, AlphaGo, and I don't know
- 11:11 – 21:42
AlphaGo
- LFLex Fridman
how else to put it, but to me, AlphaGo and AlphaGo Zero mastering the game of Go is, again, to me, the most profound and inspiring moment in the history of artificial intelligence. So you're one of the key people behind these achievements. And I'm Russian, so I really felt the first sort of seminal achievement when, uh, Deep Blue beat Garry Kasparov in 1997. Uh, so as far as I know, uh, the AI community at that point largely saw the game of Go as unbeatable by AI using the, the sort of the state-of-the-art brute force methods, search methods. Even if you consider, at least the way I saw it, even if you consider arbitrary exponential scale- scaling of compute, Go would still not be solvable. Hence, why it was thought to be impossible. So given that the game of Go was impossible to, uh, to master, when was the dream for you? You just mentioned your PhD thesis of, uh, building the system that plays Go. When was the dream for you that you could actually build a computer program that achieves, um... world-class... not necessarily beats the world champion, but achieves that kind of level of play in Go?
- DSDavid Silver
First of all, thank you. That's very kind words.
- LFLex Fridman
(laughs)
- DSDavid Silver
Um, and funnily enough, I just came, um, from a panel where I was, um, actually, um, in a conversation with Garry Kasparov and Murray Campbell, who was the author of Deep Blue. Um, and it was their first meeting together, um, since the, since the match. So that just occurred yesterday.
- LFLex Fridman
Oh, interesting.
- DSDavid Silver
So I'm literally fresh from that experience. So these are amazing moments when they happen. Um, but where did it all start? Well, for me, it started when I became fascinated in the game of Go. So Go for me, I've, I, I've grown up playing games. I've always had a fascination in, in, in board games. I played chess as a kid. I played Scrabble as a kid. Um, when I was at, uh, university, I discovered the game of Go and, and to me, it just blew all of those other games out of the water. It was just so deep and, and, and profound in its, in its, um, complexity with endless levels to it. What I discovered was that I could devote endless hours to this game, um, and I knew in my heart of hearts that no matter how many hours I would devote to it, I would never become a, you know, a, a, a grandmaster, or there was another path. And the other path was to try and understand how you could get some other intelligence to play this, this game better than I would be able to. And so even in those days, I, I, I had this idea that, you know, w- what if, what if it was possible to, to build a program that could crack this? And as I started to explore the domain, I discovered that, you know, this was really the, the, the domain where people felt deeply that if progress could be made in Go, it would really, um, mean a, a giant leap forward for AI. It was the, the challenge where all other approaches had failed. You know, this is coming out of the, the era you mentioned, which was, in some sense, the, the golden era for, for the classical methods of AI, like heuristic search. In the '90s, you know, they all, they all fell one after another, not just chess with Deep Blue, but checkers, um, backgammon, um, Othello. There were numerous cases where, where systems built on top of heuristic search methods with, you know, these high-performance systems had been able to defeat the human world champion in each of those domains. And yet, in that same time period, um, there was a million-dollar prize available for, uh, the game of Go, for the first system to beat a human professional player, and at the end of that time period, at, uh, year 2000 when the prize expired, the strongest Go program in the world was defeated by a nine-year-old child.
- LFLex Fridman
(laughs)
- DSDavid Silver
When that nine-year-old child was giving f- nine free moves to the computer at the start of the game-
- LFLex Fridman
Yeah.
- DSDavid Silver
... to try and even things up.
- LFLex Fridman
Yeah. (laughs)
- DSDavid Silver
And a computer Go expert beat that same strongest program with 29, um, handicap stones, 29 free moves. So that's what the state of affairs was, um, when I became interested in this problem, um, in around 2000 and, um, 2003 when I, I start- started working on computer Go. Um, there was nothing. There were r- there was just, there was f- very, very little in the way of progress towards, um, meaningful performance, again, uh, anything approaching human level. And so people, they... it wasn't through lack of effort. People had tried many, many things. And so there was a strong sense that, that something different would be required for Go than, than had been needed for all of these other domains where AI ha- AI had been successful. And maybe the single clearest example is that, that Go, unlike those other domains, um, had this kind of intuitive property that a Go player would look at a position and say, "Hey, you know, here's this mess of black and white stones. Um, but from this mess, oh, I can, I can predict that, that this part of the board is gonna become my territory, this part of the board's gonna become your territory, and I've got this overall sense that I'm gonna win, and that this is about the right move to play." And that intuitive sense of, of judgment, of being able to evaluate what's going on in a position, um, it was, uh, uh, pivotal to humans being able to play this game, and something that people had no idea how to put into computers. So this question of how to evaluate an, uh, a position, how to come up with these intuitive judgments was, um, the key reason why Go was so hard, um, in addition to its enormous search space, um, and the reason why methods which had succeeded so well elsewhere failed in Go. And so people really felt deep down that, that, you know, in order to crack Go, we would need to get something akin to human intuition, and if we got something-
- LFLex Fridman
Oh, interesting.
- DSDavid Silver
... akin to human intuition, we'd be able to solve, you know, much m- many, many more problems in AI.
- LFLex Fridman
So it was-
- DSDavid Silver
So for me, that was the moment where I was like, "Okay, this is not just about playing the game of Go. This is about something profound." And it was back to that bug which had been itching me all those years, you know, this is the opportunity to do something meaningful and, and transformative, and, and I guess a dream was born.
- LFLex Fridman
That's a really interesting way to put it. So almost, uh, this realization that, um, you need to find... formulate goals or kind of a prediction problem versus a search problem was th- uh, the intuition. I mean, I, uh, maybe that's the wrong crude term, but the, t- to give it a s- uh, uh, the ability to kind of, um, intuit things about positional structure of the board. Now... Okay, but what about the learning part of it? Did you, did you have a sense that you have to, uh... that, that l- learning has to be part of the system? Again, something that hasn't really, as f- as far as I think, except with TD-Gammon in, in the '90s with RL a little bit, hasn't been part of those state-of-the-art game-playing systems.
- DSDavid Silver
So I strongly felt that learning would be necessary, um, and that's why my, my PhD topic back then was trying to apply, um, reinforcement learning to the game of Go. Um, and not just learning of any type, but I felt, um, that the only way to...... really have a system to progress beyond human levels of, of performance wouldn't just be to mimic how humans do it, but to understand for themselves. And how else can a h- can a, a machine hope to understand what's going on except through learning? If you're not learning, what else are you doing? Well, you're putting all the knowledge into the system, and that just feels like a, um, um, something which decades of, of AI have told us is, is maybe not a dead end, but, uh, certainly has a ceiling to the capabilities. It's known as the, you know, knowledge acquisition bottleneck, that the, the more you try to put into something, the, the more brittle the system becomes. And, and so y- you just have to have learning. You have to have learning. That's the only way you're going to be able to get, um, a system which has sufficient knowledge in it, you know, um, millions and millions of pieces of knowledge, billions, trillions, um, of a form that it can actually apply for itself and understand how those billions and trillions is, of, of pieces of knowledge can be leveraged in a way which will actually lead it towards its goal without conflict or, or, or, or other issues.
- LFLex Fridman
Yeah. I b- I mean, if I put myself back in the, in that time, I just wouldn't think like that. (laughs)
- DSDavid Silver
(laughs)
- LFLex Fridman
Without a good demonstration of RL, I would, I would think more in the symbolic AI, like the, the, uh, it would, uh, not learning but sort of, um, a simulation of, uh, knowledge base, like a growing knowledge base. But it would still be sort of pattern-based, lot like, basically have little rules that you kind of assemble together into a large knowledge base. Uh-
- DSDavid Silver
Well, i- in a sense, that was the state of the art back then. So, if you look at the Go programs which had been com- competing for this, um, prize I mentioned, um, they were an assembly of, of different specialized systems, um, some of which used huge amounts of human knowledge to descri- describe how you should, um, play the opening, how you should, um... all the different patterns that were required to, um, to play well in the game of Go. Um, endgame theory, um, combinatorial game theory, and combined with more principled search-based methods which were trying to solve particular sub-parts of the game, like, um, life and death, um, connecting, um, groups together. All these amazing sub-problems that, that just emerge in the game of Go. There were, there were different pieces all put together into this, like, collage, which together would try and, um, and play against a human. Um, and although not all of the pieces were handcrafted, the overall effect was nevertheless still brittle, and it was hard to make all these pieces work well together. Um, and so really, um, what I was pressing for and, and, and the main innovation of the approach I took was to go back to first principles and say, "Well, let's, let's back off that and try and find a, a principled approach where the system can learn for itself, um, it... just from the outcome." Like, you know, learn for itself. If you try something, did that, did that help or did it not help? And only through that procedure can you arrive at, at knowledge which is, which is verified. The system has to verify it for itself and not relying on any other third party to say, "This is right," or, "This is wrong." Um, and so that principle, um, was already, you know, very important, um, in those days. But unfortunately, we were missing some important pieces back
- 21:42 – 25:37
Rule of the game of Go
- DSDavid Silver
then.
- LFLex Fridman
So, before we dive into maybe, uh, discussing the beauty of reinforcement learning, let- let's take a step back. We kinda skipped, skipped it a bit, but the, the rules of the game of Go. What's the, the elements of it perhaps contrasting to chess that sort of, uh, you really enjoy as a human being and also that make it really difficult as a AI machine learning problem?
- DSDavid Silver
So, the game of Go is, um, has remarkably simple rules. Um, in fact, so simple that, um, people have speculated that if we were to meet, um, alien life at some point that we wouldn't be able to communicate with them, but we would be able to play-
- LFLex Fridman
(laughs)
- DSDavid Silver
... a game of Go with them.
- LFLex Fridman
(laughs)
- DSDavid Silver
'Cause they probably have discovered the same rule set.
- LFLex Fridman
Yeah. (laughs)
- DSDavid Silver
Um, so the game is played on a, on a 19 by 19 grid, um, and you play on the intersections of the grid, and the players take turns. Um, and the aim of the game is very simple, it's to surround as much territory as you can, as many of these intersections with your stones and to surround more than your opponent does. And the only nuance to the game is that if you fully surround your opponent's piece, then you get to capture it and remove it from the board, and it counts as your own territory. Now, from those very simple rules, immense complexity arises. There's kind of profound strategies in, um, how to surround territory, how to kind of trade off between, um, making solid territory yourself now compared to, um, building up influence that will help you acquire territory later in the game. How to connect groups together, how to keep your own groups alive. Um, uh, which, which patterns of stones are, are, are, are most useful compared to others. Um, there's, uh, just immense knowledge. And, um, human Go players have, have played this game for... it was discovered thousands of years ago, and human Go players have built up this immense knowledge base o- over the years. Um, it's studied very deeply and played by, um, something like 50 million players, uh, um, across the world, mostly in China, Japan, and Korea, um, where it's a, an important part of the culture, so much so that it's considered one of the, uh, four ancient arts that was required by, um, Chinese scholars. So, there's a deep history there.
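To make the capture rule David describes concrete, here is a minimal sketch in Python. The 5x5 board, the coordinates, and the flood-fill helper are illustrative assumptions, not anything from the conversation, and a real Go engine would also have to handle turns, ko, and scoring.

```python
# A group of stones is captured when it has no liberties, i.e. no empty
# intersection adjacent to any stone in the group.

def group_and_liberties(board, row, col):
    """Flood-fill the group containing (row, col); return its stones and its liberties."""
    color = board[row][col]
    size = len(board)
    group, liberties, frontier = set(), set(), [(row, col)]
    while frontier:
        r, c = frontier.pop()
        if (r, c) in group:
            continue
        group.add((r, c))
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < size and 0 <= nc < size:
                if board[nr][nc] == ".":
                    liberties.add((nr, nc))      # empty neighbor: a liberty of the group
                elif board[nr][nc] == color:
                    frontier.append((nr, nc))    # same-colored neighbor: part of the group
    return group, liberties

# A white stone ('W') fully surrounded by black ('B') has zero liberties and is captured.
board = [list(row) for row in [
    ".B...",
    "BWB..",
    ".B...",
    ".....",
    ".....",
]]
group, liberties = group_and_liberties(board, 1, 1)
print("group size:", len(group), "liberties:", len(liberties))   # group size: 1 liberties: 0
```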
- LFLex Fridman
But there's interesting qualities. So, I- if I sort of compare it to chess, chess is, uh, in the same way as it is in, in Chinese culture for Go and chess in Russia is, uh, is, is also considered one of the sacred arts.
- DSDavid Silver
Yeah.
- LFLex Fridman
So, if we contrast sort of Go with chess, there's interesting qualities about Go. Maybe you can correct me if I'm wrong, but the, the evaluation of a particular static board is not as reliable. Like you can't... i- in chess, you can kind of assign points to the different units-
- DSDavid Silver
Yeah.
- LFLex Fridman
... and it's kind of, um, a pretty good measure of who's winning, who's losing.
- DSDavid Silver
Right.
- LFLex Fridman
It's not so clear to do so in Go.
- DSDavid Silver
Yeah. So, in the game of Go, you know, you find yourself in a situation where both players have played the same number of stones. Actually, captures at a strong level of, of play happen very rarely, which means that at any moment in the game, you've got the same number of white stones and black stones. And the only thing which differentiates how well you're doing is this intuitive sense of, um, you know, where are the territories ultimately gonna form on this board. Um, and when you've... if you look at the complexity of a real Go position, um, you know, it's, it's, it's mind-boggling that, that kind of question of what will happen in, in 300 moves from now when you, when you see just a scattering of 20 white and black stones intermingled. Um, and, and so that, that challenge is th- uh, the reason why position evaluation is so hard in Go compared to, to other games. In addition to that, it has an enormous search space. So, um, there's around 10 to the 170, um, positions in the game of Go. That's an astronomical number. And that search space is, is so great that traditional heuristic search methods that were so successful in things like Deep Blue and, um, and, and chess programs just kind of fall over in Go.
- LFLex Fridman
So,
- 25:37 – 30:15
Reinforcement learning: personal journey
- LFLex Fridman
at which point did reinforcement learning enter your life, your research life, your way of thinking? We just talked about learning, but reinforcement learning is a very particular kind of learning, one that's both philosophically sort of profound- (laughs)
- DSDavid Silver
Yeah.
- LFLex Fridman
... but also one that's pretty difficult to get to work as, if we look back in the ear- at least early days. So, when did that enter your life and how did that work progress?
- DSDavid Silver
So, I had just finished working in the games industry at this startup company, and I took, I took a year out to, um, discover for myself exactly which path I wanted to take. I knew I wanted to study, um, intelligence, but I wasn't sure what that meant at that stage. I really didn't feel I had the tools to decide on exactly which path I wanted to follow. Um, so during that year, I, I, I read a lot. And, um, one of the things I read was, um, Sutton and Barto, the, the sort of seminal, um, textbook on an introduction to reinforcement learning. And when I read that textbook, I, I just had this resonating feeling that this is what I understood intelligence to be, um, and this was the path that I felt, um, would be necessary to go down to make progress in, um, in AI. So, I got in touch with Rich Sutton, um, and-
- LFLex Fridman
(laughs)
- DSDavid Silver
... asked him if he would be interested in supervising me on a, a PhD thesis in, in computer Go. And he, he basically said, uh, that if he's still alive, he'd be happy to. Um, but unfortunately, he'd been, you know, struggling with, uh, very serious cancer for some years, and he really wasn't confident at that stage that he'd even be around to see the end of it. But fortunately, that part of the story worked out very happily, and I found myself out there in Alberta. They've got a great games group out there with a history of fantastic work in, in board games, as well, um, as Rich Sutton, the father of RL. So, it was the, the natural place for me to go, in some sense, to, to study this question. And the more I looked into it, the more, the more strongly I, I felt that this wasn't just the path to progress in computer Go, but really, you know, this, this was the thing I'd been looking for. This was, um... really an opportunity to, to frame what intelligence means, like what is... what are the goals of AI in a clear, single clear problem definition, such that if we're able to solve that clear single problem definition, um, in some sense, we've, we've cracked the problem of AI.
- LFLex Fridman
So, to you, reinforcement learning ideas, at least sort of echoes of it, would be at the core of intelligence? I- is it the core of intelligence, and if we ever create an- a human level intelligence system, it would be at the core of that kind of system?
- DSDavid Silver
Uh, let me say it this way, that I think, I think it's helpful to separate out the problem from the solution. So, I see the problem of intelligence, um, I would say it can be formalized as the reinforcement learning problem, and that that formalization is enough to capture, um, most, if not all, of the things that we mean by intelligence, that, um, that they can all be brought within this, this, this framework, and, and gives us a way to access them in a meaningful way that allows us as, as scientists, um, to understand intelligence, and us as computer scientists to, to build them. Um, and so in that sense, I feel that, um, it gives us a path, maybe not the only path, but a path towards AI. And so, do I think that any system in the future that, that's, you know, solved AI would, would have to have RL within it? Well, I think if you ask that, you're asking about the, the solution methods. I would say that if we have such a thing, it would be a solution to the RL problem. Now, what particular methods have been used to get there? Well, we should keep an open mind about the best approaches to actually solve any problem, um, and, you know, the things we have right now for, for reinforcement learning, maybe, maybe they... maybe... I, I believe they've got a lot of legs, but maybe we're missing some things. Maybe there's going to be better ideas. I think we should keep a... you know, let's remain modest and, and-
- LFLex Fridman
(laughs)
- DSDavid Silver
... um, we're at the early days of this field, and, and there are many amazing discoveries ahead of us.
- LFLex Fridman
For sure. The specifics especially of the diff- different kinds of RL approaches, clearly there could be other things that fall under the very large umbrella of RL. But le- if it's,
- 30:15 – 43:51
What is reinforcement learning?
- LFLex Fridman
if it's okay, can we take a step back and kind of ask the basic question of what is, to you, reinforcement learning?
- DSDavid Silver
So, reinforcement learning is the study and, and the, um, the science and the problem of intelligence, um, in the form of an agent that interacts with an environment. So, the problem you're trying to solve is represented by some environment, like the world in which that agent is situated. And the goal of RL is, is clear, that the agent gets to take actions. Um, those actions have some effect on the environment, and the environment gives back an observation to the agent saying, you know, "This is what you see or sense." Um, and one special thing which it gives back is, is called the reward signal, how well it's doing in the environment. And the reinforcement learning problem is to simply take actions, um, over time, um, so as to maximize that reward signal.
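A minimal sketch of that agent-environment loop in Python. The toy coin-guessing environment, the random placeholder agent, and the reward of 1 for a correct guess are all illustrative assumptions, not anything from the conversation.

```python
import random

class CoinFlipEnv:
    """Toy environment: a hidden coin; the agent earns reward 1 for guessing it correctly."""
    def reset(self):
        self.coin = random.choice(["heads", "tails"])
        return "new round"                      # observation handed back to the agent

    def step(self, action):
        reward = 1.0 if action == self.coin else 0.0
        observation = self.reset()              # start the next round
        return observation, reward

class RandomAgent:
    """Placeholder agent: picks actions at random, ignoring observations and rewards."""
    def act(self, observation):
        return random.choice(["heads", "tails"])

    def learn(self, observation, action, reward):
        pass                                    # a real agent would update itself here

# The reinforcement learning loop: act, observe, receive reward, (ideally) learn,
# with the goal of maximizing cumulative reward over time.
env, agent = CoinFlipEnv(), RandomAgent()
obs, total_reward = env.reset(), 0.0
for t in range(1000):
    action = agent.act(obs)
    next_obs, reward = env.step(action)
    agent.learn(obs, action, reward)
    total_reward += reward
    obs = next_obs
print("average reward:", total_reward / 1000)
```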
- LFLex Fridman
So, a couple of basic questions. What types of RL approaches are there? So, I don't know if there's a nice, brief, in-words way to paint the picture of sort of value-based, model-based, policy-based, uh, reinforcement learning.
- DSDavid Silver
Yeah. So now if we think about, okay, so there's this ambitious, uh, problem definition of, of RL. It's really, you know, it's truly ambitious, it's trying to capture and encircle all of the things in which an agent interacts with an environment and say, "Well, how can we formalize and understand what it means to, to crack that?" Now let's think about the solution method. Well, how do you solve a really hard problem like that? Well, one approach you can take is, is to decompose that, that very hard problem into, into pieces that work together to solve that hard problem. And, and so you can kind of look at the decomposition that's inside the agent's head, if you like, and ask, "Well, what form does that decomposition take?" And some of the most common pieces that people use when they're kind of putting this system, this solution method together, some of the most common pieces that people use are whether or not that solution has a value function, that means is it trying to predict, explicitly trying to predict how much reward it will get in the future? Does it have a, uh, a representation of a policy? That means something which is deciding how to pick actions, is, is that decision-making process explicitly represented? And is there a model in the system? Is there something which is explicitly trying to predict what will happen in the environment? And so those three pieces, um, um, are, to me, some of the most common building blocks, and I understand, um, the different choices in RL as choices of whether or not to use those building blocks when you're trying to decompose the, the solution. You know, should I have a value function represented? Should I have, um, a policy represented? Should I have a model represented? And there are combinations of those pieces, and of course, other things that you could add-
- LFLex Fridman
(laughs)
- DSDavid Silver
... into the picture as well. But those, those three fundamental choices give rise to some of the branches of RL with which we are very familiar.
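The three building blocks David names can be sketched as separate components. The tiny tabular forms below are illustrative assumptions, not the design of any particular agent.

```python
from collections import defaultdict
import random

class ValueFunction:
    """Explicit prediction of how much reward a (state, action) pair will lead to."""
    def __init__(self):
        self.q = defaultdict(float)
    def estimate(self, state, action):
        return self.q[(state, action)]
    def update(self, state, action, target, lr=0.1):
        self.q[(state, action)] += lr * (target - self.q[(state, action)])

class Policy:
    """Explicit decision-making: maps a state to an action."""
    def __init__(self, actions):
        self.actions = actions
    def act(self, state, value_fn=None):
        if value_fn is None:
            return random.choice(self.actions)                      # purely random policy
        return max(self.actions, key=lambda a: value_fn.estimate(state, a))

class Model:
    """Explicit prediction of what the environment will do: next state and reward."""
    def __init__(self):
        self.transitions = {}
    def observe(self, state, action, next_state, reward):
        self.transitions[(state, action)] = (next_state, reward)
    def predict(self, state, action):
        return self.transitions.get((state, action))

# Value-based methods keep a ValueFunction and derive the policy from it; policy-based
# methods represent the Policy directly; model-based methods add a Model and plan with it.
```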
- LFLex Fridman
And so those, as you mentioned, there is a choice of what's specified or modeled explicitly, and the idea is that, uh, all of these are somehow implicitly learned within the system. So it's, it's almost a choice of, um, how you, uh, approach a problem. Do you see those as fundamental differences or are these ch- almost like, um, small specifics, like the details of how you solve the problem, but they're not fundamentally different from each other?
- DSDavid Silver
I think the, the fundamental idea is, is maybe at the higher level, the fundamental idea is, um, the first step of the decomposition is really to say, "Well, w- how are we really gonna solve any kind of problem where you're trying to figure out how to take actions?" And, uh, just from this stream of observations, you know, you've got some agent situated in its sensory motor stream and getting all these observations in, getting to take these actions, and, and what should it do? How can you even broach that problem? You know, maybe the complexity of the world is so great, um, that you can't, uh, even imagine how to build a system that would, that would understand how to deal with that. And so the first step of this decomposition is to say, "Well, you have to learn. The system has to learn for itself." Um, and so note that the reinforcement learning problem doesn't actually stipulate that you have to learn. Right? You could maximize your rewards without learning. It would just-
- LFLex Fridman
Right. (laughs)
- DSDavid Silver
... wouldn't do a very good job of it.
- LFLex Fridman
Yes.
- DSDavid Silver
Um, so learning is required because it's the only way to achieve good performance in any sufficiently large and complex envi- uh, environment. So, so that's the first step. And so that step gives commonality to all of the other pieces 'cause now you might ask, "Well, what should you be learning? What does learning even mean?" You know, in, in, in this sense, you know, learning might mean, well, you're trying to update the parameters of, um, some system which is then the thing that actually picks the actions. And, and, and those parameters could be representing anything. They could be parameterizing a value function, or a model, um, or a policy. Um, and so in that sense, there's a lot of commonality in that whatever is being represented there is the thing which is being learned and it's being learned, um, with the ultimate goal of maximizing rewards.
- LFLex Fridman
Mm-hmm.
- DSDavid Silver
But, but the way in which you decompose the problem is, is, is really what gives the semantics to the whole system. Like, are you trying to learn something, um, to predict well, like a value function or a model? Are you learning something to perform well, like a policy? Um, and, and the form of that objective, like, is kind of giving the semantics to the system, and so it, it really is, uh, the next level down a fundamental choice, and we have to make those fundamental choices, um, as system designers or enable o- our, our algorithms to be able to learn how to make those choices for themselves.
- LFLex Fridman
So then the next step you mentioned, uh, the, the, the very fir- the, the very first thing you have to deal with is, uh, can you even take in this huge stream of observations and do anything with it? So the natural next basic question is what is the, what is deep reinforcement learning and what is this idea of using neural networks to deal with this huge incoming stream?
- DSDavid Silver
So, amongst all the approaches for reinforcement learning, um, deep reinforcement learning is one, um, family of solution methods that tries to, um, utilize powerful representations that are offered by neural networks to represent any of these different components of, of, of the solution, of the agent. Like, whether it's the value function, or the model, or the policy, um, the idea of deep learning is to say, "Well... here's a powerful toolkit that's so powerful that it's, it's universal in the sense that it can represent any function, and it can learn any function. Um, and so if we can leverage that universality, that means that whatever, whatever we need to represent for our policy or for our value function or for our model, deep learning can do it. So, that deep learning is, is one approach that offers us a toolkit that is- has no ceiling to its performance. That, um, as we start to put more resources into the system, more, more memory and more computation, um, and more, more data, more experience of- of more interactions with the environment, that these are systems that can just get better and better and better at doing whatever the job is we've asked them to do. What- whatever we've asked that function to represent, um, it can learn a function that does a better and better job of representing that, that, that knowledge, whether that knowledge be, um, estimating how well you're gonna do in the world, the value function, whether it's gonna be choosing what to do, um, in the world, the policy, or whether it's understanding the world itself, what's gonna happen next, the model.
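As a rough illustration of using a neural network to represent these components, here is a small sketch assuming PyTorch. The two-headed layout, layer sizes, and feature planes are arbitrary choices for the example, not the architecture of any system discussed here.

```python
import torch
import torch.nn as nn

BOARD_SIZE = 19

class PolicyValueNet(nn.Module):
    """One network standing in for two components: a policy head and a value head."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(                    # shared representation of the board
            nn.Conv2d(2, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        hidden = 32 * BOARD_SIZE * BOARD_SIZE
        self.policy_head = nn.Linear(hidden, BOARD_SIZE * BOARD_SIZE)  # a score per move
        self.value_head = nn.Linear(hidden, 1)                         # predicted chance of winning

    def forward(self, board):
        h = self.body(board)
        move_logits = self.policy_head(h)
        value = torch.tanh(self.value_head(h))       # squashed to [-1, 1]: lose ... win
        return move_logits, value

# One batch of placeholder positions: 2 feature planes (own stones, opponent stones).
positions = torch.zeros(4, 2, BOARD_SIZE, BOARD_SIZE)
logits, value = PolicyValueNet()(positions)
print(logits.shape, value.shape)   # torch.Size([4, 361]) torch.Size([4, 1])
```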
- LFLex Fridman
Nevertheless, the, the, the fact that neural networks are able to learn incredibly complex representations that allow you to do the policy, the model, or the valley function is, uh, at least to my mind, exceptionally beautiful and surprising. Like, w- was it (laughs) -
- DSDavid Silver
Hmm.
- LFLex Fridman
... is it surprising, was it surprising to you? Can you still believe it works as well as it does? Do you have good intuition about why it works at all, and works as well as it does?
- DSDavid Silver
I think, let me take two parts to that question.
- LFLex Fridman
Yeah.
- DSDavid Silver
I think ... it's not surprising to me that the idea of reinforcement learning works because in some sense, I think it's the, I, I feel it's the only thing which can, ultimately. And so I feel we have to, we have to address it, and there must be successes possible because we have examples of intelligence. And it, it must at some level be able to, possible to acquire experience and use that experience to, to do better in a way which is meaningful to, um, um, environments of the complexity that, that humans can deal with. It must be. Am I surprised that our current systems can do as well as they can do? Um, I think one of the big surprises for me and, and a lot of the community, um, is really the fact that deep learning can continue to, um, perform so well despite the n- the fact that these neural networks that they're representing have these incredibly non-linear, kind of bumpy surfaces, which to our kind of low-dimensional intuitions-
- LFLex Fridman
Mm-hmm.
- DSDavid Silver
... make it feel like surely you're just gonna get stuck and, and learning will get stuck because, um, you, you won't be able to make any further progress. And yet, the big surprise is that learning continues, and, and these what appear to be local optima turn out not to be because in high dimensions when we make really big neural nets, there's always a way out. Um, and there's a way to go even lower, and then you're s- still not in a local optima because there's some other pathway that will take you out and take you lower still. And so no matter where you are, learning can, can proceed and do better and better and bre- better without bound. Um, and so that is a surprising and beautiful property of, of neural nets, um, which I find elegant and beautiful and, and somewhat shocking that it turns out to be the case.
- LFLex Fridman
As you said, uh, which I really like, to our low-dimensional, uh, intuitions, that's surprising. (laughs)
- DSDavid Silver
Yeah. Yeah, we're very, we're very tuned to working within a three-dimensional environment.
- LFLex Fridman
Yeah.
- DSDavid Silver
And so to start to visualize what a, a billion-dimensional neural network, uh, um, surface that you're trying to optimize over, what that even looks like is very hard for us. And so I think that really, um, if you try to account for, for the, um, um, essentially the AI winter where, where people gave up on neural networks, I think it's really down to that, that lack of, um, ability to generalize from, from low dimensions to high dimensions because back then, we were in the low-dimensional case. People could only build neural nets with, you know, 50, uh, nodes in them or something. And to, to imagine that it might be possible to build a billion-dimensional neural net and that it might have a completely different qualitatively different property was very hard to anticipate. And I think even now, we're starting to build the, the theory to support that. Um, and, and it's incomplete at the moment, but all of the theories seems to be pointing in the direction that indeed this is an approach which, which truly is universal, both in its representational capacity, which was known, but also in its learning ability, which is, which is surprising.
- LFLex Fridman
And it, it makes one wonder what else we're missing-
- DSDavid Silver
Yeah. (laughs)
- 43:51 – 53:40
AlphaGo (continued)
- LFLex Fridman
So, speaking of which, if we could take a step back to Go.
- DSDavid Silver
Mm-hmm.
- LFLex Fridman
Uh, what was MoGo and what was the key idea behind the system?
- DSDavid Silver
So, back during my, um, PhD on Computer Go, around about that time, um, there was a, a major new development in, in... which actually happened in the context of Computer Go. And, and it was really a, a revolution in the way that heuristic search was, was done. And, and the idea was, um, essentially that, um, a position could be evaluated, or a state in general could be evaluated, um, not by humans saying whether that, um, position is good or not, or even humans providing rules as to how you might evaluate it, but instead, by allowing the system to randomly play out the game until the end multiple times and taking the average of those outcomes as the prediction of what will happen. So, for example, if you're in the game of Go, the intuition is that you take a position and you get the system to kind of play random moves against itself all the way to the end of the game, and you see who wins. And if black ends up winning more of those random games than white, well, you say, "Hey, this is a position that favors black." And if white ends up winning more of those random games than black, then it, it favors white. Um, so that idea, um, was known as Monte Carlo, um, um, search, and a particular form of Monte Carlo search that became very effective and was developed in Computer Go, first by Rémi Coulom in 2006 and then taken further, um, by others, uh, was something called Monte Carlo tree search, which basically takes that same idea and uses that, that insight so that every node of a search tree is evaluated by the average of the random playouts from that, from that node onwards. Um, and this idea, uh, was very powerful and suddenly led to huge leaps forward in the strength of Computer Go playing programs. Um, and, uh, among those, the, the strongest of the Go playing programs in those days was a program called MoGo, which was the first program to actually reach human master level on small boards, nine-by-nine boards. And so this was a program by someone called Sylvain Gelly, who's a good colleague of mine that I worked with him a little bit, um, in those days, part of my PhD thesis. And MoGo was a, a first step towards the latest successes we saw in Computer Go. But it was still missing a key ingredient. MoGo was evaluating purely by random rollouts against itself. And in a way, it's, it's truly remarkable that random play-
- LFLex Fridman
It is.
- DSDavid Silver
... should give you anything at all.
- LFLex Fridman
Yes.
- DSDavid Silver
Like, how... Why, why in this perfectly deterministic game that's very precise and involves these very exact sequences, w- why is it that, that random, randomization is, is, is helpful? And so the intuition is that randomization captures something about the, the nature of the, of the, the search tree that, that from a position that you're, you're understanding the nature of the search tree, um, from that node onwards by, by, by using randomization. And this was a very powerful idea.
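A minimal sketch of Monte Carlo evaluation by random playouts, using a toy take-away game in place of Go so the example stays self-contained. The game (take 1 or 2 stones from a pile; whoever takes the last stone wins) and the playout count are illustrative assumptions; the technique is the same averaging of random outcomes.

```python
import random

def random_playout(pile, player_to_move):
    """Play random legal moves until the pile is empty; return the winner (1 or 2)."""
    player = player_to_move
    while pile > 0:
        take = random.choice([m for m in (1, 2) if m <= pile])
        pile -= take
        if pile == 0:
            return player          # this player took the last stone and wins
        player = 3 - player        # switch between player 1 and player 2
    return player

def evaluate(pile, player_to_move, num_playouts=5000):
    """Fraction of random playouts won by the player to move: near 1.0 favors them."""
    wins = sum(random_playout(pile, player_to_move) == player_to_move
               for _ in range(num_playouts))
    return wins / num_playouts

# Under perfect play, positions where pile % 3 == 0 lose for the player to move;
# even purely random playouts give noticeably lower estimates for those positions.
for pile in range(1, 10):
    print(pile, round(evaluate(pile, player_to_move=1), 2))
```

Monte Carlo tree search takes the same idea one step further: every node of the search tree is scored by the average result of the random playouts that pass through it.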
- LFLex Fridman
And I, I've seen this in, in other spaces, uh, when I talked to Richard Karp and so on. Randomized algorithms somehow magically are able to do exceptionally well and, uh, and simplifying the problem somehow. It makes you wonder about the fundamental nature of randomness in our universe (laughs) . It seems to be a useful thing. But, so from that moment, can you maybe tell the origin story and the journey of AlphaGo?
- DSDavid Silver
Yeah. So, programs based on Monte Carlo tree search were a, a first revolution in the sense that they led to, um, suddenly programs that could play the game to any reasonable level. But they, they plateaued. It seemed that no matter how much effort people put into these techniques, they couldn't exceed the level of, um, amateur dan level Go players. So, strong players, but not, not anywhere near the level of, of professionals, nevermind the world champion. And so, that brings us to the birth of AlphaGo, which happened in the context of, uh, um, a startup company known as, um, DeepMind, uh, wh-
- LFLex Fridman
I've heard of them.
- DSDavid Silver
... where, uh, uh-
- LFLex Fridman
(laughs) .
- DSDavid Silver
... a project was born, and the project was really a scientific investigation, um, where, um, myself and Aja Huang and an intern, Chris Madison, were exploring a scientific question. And that scientific question was really, is there another fundamentally different approach to, to this key question of, of, of Go, the key challenge of, of how can you build that intuition and how can you just have a system that could look at a position and understand, um, what move to play or, or how well you're doing in that position, who's gonna win?And so, the deep learning revolution had just begun, that systems like ImageNet had suddenly been won by deep learning techniques back in 2012. And following that, it was natural to ask, well, you know, if- if deep learning is able to scale up so effectively with images to- to understand them enough to- to classify them, well, why not go? Why- why- why not take a- um, a the black and white stones of the Go board and build so- a system which can understand for itself what that means in terms of what move to pick, or who's going to win the game, black or white? And so that was our scientific question which we- we were probing and trying to understand. And as we started to look at it, we discovered that we could build a- a system. So, in fact, our very first paper on AlphaGo was actually a pure deep learning system which was trying to answer this question, and we showed that actually, a pure deep learning system with no search at all was actually able to reach human dan level, master level, at the full game of Go, 19-by-19 boards. Um, and so without any search at all, suddenly we had systems which were playing at the level of the best Monte Carlo tree search systems, the ones with randomized roll-outs.
- LFLex Fridman
So first of all, sorry to interrupt, but, uh, that's kind of a groundbreaking notion. That's like- that's like basically a definitive step away from the- a couple of decades of essentially search dominating AI.
- DSDavid Silver
Yeah.
- LFLex Fridman
So wha- how did that make you feel? Would you th- was it surprising from a scientific perspective? Uh, in general, how'd it make you feel?
- DSDavid Silver
I- I- I- I found this to be profoundly surprising. Um, in fact, it was so surprising that, um, that we had a bet back then.
- LFLex Fridman
(laughs)
- DSDavid Silver
And like many good projects, you know, bets are quite motivating, and the- and the bet was, you know, whether it was possible for a- a- a system based purely on- on, uh, deep learning, no search at all, to beat a- a dan level human player. Um, and so we had, um, someone, um, who joined our team, um, who was a dan level player. He came in and, um, and we had this first match, um, against him, and ...
- LFLex Fridman
Which side of the bet were you on, by the way?
- DSDavid Silver
(laughs)
- LFLex Fridman
D- the losing or the winning side? (laughs)
- DSDavid Silver
I tend to be an optimist, um-
- LFLex Fridman
Ah, yeah. (laughs)
- DSDavid Silver
... with the- with the power of- of- of- of-
- LFLex Fridman
Great.
- DSDavid Silver
... deep learning and- and reinforcement learning.
- LFLex Fridman
(laughs)
- 53:40 – 1:06:12
Supervised learning and self play in AlphaGo
- LFLex Fridman
AlphaGo involves both learning from expert games and, uh, as far as I remember, a self-play component where it learns by playing against itself. But i- in your sense, what was the role of learning from expert games there? And in terms of your self-evaluation, whether you can take on the world champion, wha- what was the thing that you were trying to do more of, sort of train more on expert games or was there now another ... I'm asking so many, um, poorly phrased questions, but, uh, wha- did you have a hope or dream that self-play would be the key component at that moment yet?
- DSDavid Silver
So in the early days of- of AlphaGo, we- we used human data to explore the science of what deep learning can achieve. And so when we had our first paper that showed, um, that it was possible to predict, um, the winner of the game, that it was possible to suggest moves, that was done using human data.
- LFLex Fridman
Oh, solely human data? That was-
- DSDavid Silver
Yeah. And- and- and- and so the reason that we did it that way was, at that time, we were exploring separately the deep learning aspect from the reinforcement learning aspect. That was the part which was- which was new and unknown to- to- to- to me at that time, was how far could that be stretched? Um, once we had that ...... it then became natural to try and use that same representation and see if we could learn for ourselves using that same representation. And so, right from the beginning, actually, our goal had been to build a system using self-play. Um, and to us, the human data, right from the beginning, was an expedient step to help us, for pragmatic reasons, to go faster towards the goals of the project, um, than we might be able to starting solely from self-play. Um, and so in those days, we were very aware that we were choosing to- to use human data and that might not be the long-term, um, holy grail of AI, but that it was something which was extremely useful to us. It helped us to understand the system. It helped us to build deep learning representations which were, um, clear and simple and- and easy to use. Um, and so really I would say it's, um, it served a- a purpose, not just as part of the algorithm, but something which I continue to use in our research today, which is trying to break down a very hard challenge into pieces which are easier to understand for us as- as researchers and develop. So, if you- if you use a component based on human data, it can help you to understand the system, um, such that then you can build the more principled version later that- that does it for itself.
- LFLex Fridman
So, as I said, the AlphaGo victory, and I don't think I'm being sort of, uh, romanticizing this notion. I think it's one of the greatest moments in the history of AI. So, were you cognizant of this magnitude of the accomplishment at- at- at the time? I mean, were y- are you cognizant of it even now?
- DSDavid Silver
(laughs) .
- LFLex Fridman
'Cause to me, I feel like it's something that would... We mentioned what the AGI systems of the future will look back. I think they'll look back at the AlphaGo (laughs) victory as, like, "Holy crap, they figured it out." (laughs) . This is where- this is where it just started.
- DSDavid Silver
Well, thank you again. I mean-
- LFLex Fridman
(laughs) .
- DSDavid Silver
I- it's funny 'cause I guess I've been working on- I'd been working on Computer Go for a long time, so I'd been working, at the time of the AlphaGo match, on Computer Go for more- more than a decade. And throughout that decade, I'd had this dream of what would it be like to... What would it be like, really, to- to actually be able to build a system that could play against the world champion? And- and I imagined that that would be an interesting moment, that maybe, you know, some people might care about that, and that this might be, you know, a nice achievement. Um, but I think when I arrived in- in Seoul and discovered the legions of journalists-
- LFLex Fridman
(laughs) .
- DSDavid Silver
... that were following us around and the 100 million people that were watching the match online, live, I realized that I'd been off in my estimation of how significant this moment was by several orders of magnitude.
- LFLex Fridman
Oh.
- DSDavid Silver
Um, and so there was definitely a- a- an adjustment process to- to realize that this- this was something which the world really cared about and which was a- a watershed moment, and I think there was that moment of realization. It was also a little bit scary because, you know, if you go into something thinking it's gonna be maybe of interest and then discover that 100 million people are watching, it suddenly makes you worry about whether some of the decisions you'd made were really the- the best ones or the wisest-
- LFLex Fridman
(laughs) .
- DSDavid Silver
... or were going to lead to the best outcome. And we knew for sure that there were still imperfections in AlphaGo-
- LFLex Fridman
Yeah.
- DSDavid Silver
... which were gonna be exposed to the whole world watching. And so, yeah, it was a- it was, I think, a great experience, and I- I- I feel privileged to have been part of it, privileged to have- have led that amazing team. Um, I feel privileged to have been in a moment of history, like you say, but also lucky that, um, you know, in a sense, I was insulated from- from the knowledge of... I think it would've been harder to focus on the research if the full, kind of, reality of- of what was gonna come to pass had- had been known to me and- and- and- and the team. I think it was... You know, we were- we were in our bubble, and we were working on research, and we were trying to answer the scientific questions. Um, and then, bam.
- LFLex Fridman
(laughs) .
- DSDavid Silver
You know, the public sees it, and I think it was better that way in retrospect.
- LFLex Fridman
Were you confident that... (sighs) I guess, what were the chances that you could get the win? So, (sighs) just like you said, I'm a little bit more familiar with another accomplishment that we may not even get a chance to talk about... I talked to Oriol Vinyals about AlphaStar, which is another-
- DSDavid Silver
Yeah.
- LFLex Fridman
... incredible accomplishment, but there, you know, with AlphaStar and beating StarCraft, there was already a track record. With AlphaGo, this is really the first time you get to see reinforcement learning face the best human in the world. So, what was your confidence like? What were the odds?
- DSDavid Silver
Well, we actually, um, we had-
- LFLex Fridman
Was there a bet (laughs) ?
- DSDavid Silver
Funnily enough, there was. (laughs) So just before the match, we weren't betting on anything concrete, but we all held out a hand. Everyone in the team held out a hand at the beginning of the match, and the number of fingers they had out on that hand was supposed to represent how many games they thought we would win against Lee Sedol. And there was an amazing spread in the team's predictions. But I have to say, I predicted 4-1. (laughs)
- LFLex Fridman
(laughs) .
- DSDavid Silver
And the reason was based purely on data. I'm a scientist first and foremost, and one of the things which we had established was that AlphaGo, in around one in five games, would develop something which we called a delusion, which was a kind of hole in its knowledge where it wasn't able to fully understand everything about the position, and that hole in its knowledge would persist for tens of moves throughout the game. And we knew two things. We knew that if there were no delusions, AlphaGo seemed to be playing at a level that was far beyond any human capabilities. But we also knew that if there were delusions, the opposite was true.
- LFLex Fridman
(laughs) .
- DSDavid Silver
And in fact, you know, that's what came to pass. We saw all of those outcomes. And Lee Sedol, in one of the games, played a really beautiful sequence that AlphaGo just hadn't predicted, and it led it into this situation where it was unable to really understand the position fully and found itself in one of these delusions. So indeed, yeah, 4-1 was the outcome.
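A back-of-the-envelope reading of that prediction, purely as illustration: the only inputs are the rough one-in-five delusion rate quoted above and the assumption that a delusion effectively costs AlphaGo that game.

```python
# Illustrative sketch only: rough expected score over a 5-game match,
# assuming a "delusion" (~1 game in 5) turns that game into a loss.
p_delusion = 1 / 5
expected_wins = 5 * (1 - p_delusion)    # 5 * 0.8 = 4.0
expected_losses = 5 * p_delusion        # 5 * 0.2 = 1.0
print(expected_wins, expected_losses)   # -> 4.0 1.0, i.e. roughly a 4-1 match
```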
- 1:06:12 – 1:08:57
Lee Sedol retirement from Go play
- LFLex Fridman
So, just like in the case of Deep Blue beating Garry Kasparov, Garry was... It's, I think, the first time he'd ever lost, actually, to anybody, and, I mean, there's a similar situation with Lee Sedol. It's a tragic loss for humans (laughs) but a beautiful one. I think from the tragedy there emerges, over time, a kind of inspiring story. But Lee Sedol recently announced his retirement. I don't know if we can look too deeply into it, but he did say that, "Even if I become number one, there's an entity that cannot be defeated." So, what do you think about these words? What do you think about his retirement from the game of Go?
- DSDavid Silver
Well, let me take you back, first of all, to the first part of your comment about Garry Kasparov, 'cause actually, at the panel yesterday, he specifically said that when he first lost to Deep Blue, he viewed it as a failure. He viewed that this had been a failure of his, but later on in his career, he said he'd come to realize that actually it was a success. It was a success for everyone because this marked a transformational moment for AI. And so even Garry Kasparov came to realize that that moment was pivotal and actually meant something much more than, you know, his personal loss in that moment. Lee Sedol, I think, was much more cognizant of that even at the time. So in his closing remarks to the match, he really felt very strongly that what had happened in the AlphaGo match was not only meaningful for AI but for humans as well. And he felt as a Go player that it had opened his horizons and meant that he could start exploring new things. It brought his joy back for the game of Go because it had broken all of the conventions and barriers and meant that, you know, suddenly anything was possible again. And so, you know, I was sad to hear that he'd retired, but he's been a great world champion over many, many years, and I think he'll be remembered for that evermore. He'll be remembered as the last person to beat AlphaGo.
- LFLex Fridman
(laughs)
- DSDavid Silver
I mean, after that, we increased the power of the system, and the next version of AlphaGo beat the other strong human players 60 games to nil. So, (sighs) you know, what a great moment for him and something to be remembered for.
- 1:08:57 – 1:14:10
Garry Kasparov
- LFLex Fridman
It's interesting that you spent time at AAAI on a panel with Garry Kasparov. What... I mean, I'm just curious to learn about the conversations you've had with Garry. 'Cause he's also now written a book about artificial intelligence. He's thinking about AI. He has kind of a view of it, and he talks about AlphaGo a lot. What's your sense? Arguably, I'm not just being Russian, but I think-
- DSDavid Silver
Mm-hmm.
- LFLex Fridman
... Garry is the greatest chess player of all time, probably one of the greatest game players of all time, and you were sort of at the center of creating a system that beats one of the greatest players of all time. So, what's that conversation like? Is there anything-
- DSDavid Silver
Yeah. It's a-
- LFLex Fridman
... any interesting digs, any bets, any com- any funny things, any profound things?
- DSDavid Silver
So, Garry Kasparov has an incredible respect for what we did with AlphaGo. And you know, it's an amazing tribute, coming from him of all people, that he really appreciates and respects what we've done. And I think he feels strongly about the progress which has happened in computer chess: later, after AlphaGo, we built the AlphaZero system, which defeated the world's strongest chess programs. And to Garry Kasparov, that moment in computer chess was more profound than Deep Blue. And the reason he believes it mattered more was because it was done with learning, with a system which was able to discover for itself new principles, new ideas, which allowed it to play the game in a way which he, or anyone else, hadn't known about. And in fact, one of the things I discovered at this panel was that the current world champion, Magnus Carlsen, apparently recently commented on his improvement in performance, and he attributes it to AlphaZero. That he's been studying the games of AlphaZero-
- LFLex Fridman
(laughs)
- DSDavid Silver
... and he's changed his style to play more like AlphaZero, and it's led to him actually increasing his rating to a new peak.
- LFLex Fridman
Yeah, I guess, to me, just like to Garry, the inspiring thing is that... just like you said, reinforcement learning and deep learning and machine learning feel like what intelligence is.
- DSDavid Silver
Yeah.
- LFLex Fridman
And, you know, you could attribute it to sort of a bitter viewpoint from Garry's perspective, or from us humans' perspective, saying that the pure search IBM Deep Blue was doing is not really intelligence, but somehow it didn't feel like it. And so, that's the magical... I'm not sure what it is about learning that feels like intelligence, but it does.
- DSDavid Silver
So, I think we should not demean the achievements of what was done in previous eras of AI. I think that Deep Blue was an amazing achievement in itself, and heuristic search of the kind that was used by Deep Blue had some powerful ideas in there, but it also missed some things. So the fact that the evaluation function, the way that the chess position was understood, was created by humans and not by the machine is a limitation, which means that there's a ceiling on how well it can do. But maybe more importantly, it means that the same idea cannot be applied in other domains where we don't have access to human grandmasters and that ability to encode exactly their knowledge into an evaluation function. And the reality is that the story of AI is that most domains turn out to be of the second type, where knowledge is messy, it's hard to extract from experts, or it isn't even available. And so we need to solve problems in a different way, and I think AlphaGo is a step towards solving things in a way which puts learning as a first-class citizen and says, systems need to understand for themselves how to understand the world, how to judge the value of any action that they might take within that world and any state they might find themselves in. And in doing that, we make progress towards AI.
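To make that contrast concrete, here is a deliberately crude, hypothetical caricature of the kind of human-designed evaluation function described above: both the features (material) and the weights are chosen by people, which is exactly the ceiling and the transfer problem being pointed out. This is illustrative Python only, not Deep Blue's actual evaluation.

```python
# Hypothetical hand-crafted chess evaluation: humans pick the features and the weights.
PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9}  # human-chosen weights

def handcrafted_eval(white_pieces, black_pieces):
    """Score a position from White's point of view by material count alone."""
    white = sum(PIECE_VALUES.get(p, 0) for p in white_pieces)
    black = sum(PIECE_VALUES.get(p, 0) for p in black_pieces)
    return white - black

# A learning-first system instead fits the parameters of its evaluation from
# its own game outcomes, so nothing depends on access to grandmasters.
print(handcrafted_eval("QRRBBNNPPPPPPPP", "QRRBBNPPPPPPP"))  # -> 4
```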
- LFLex Fridman
Yeah. So, one of the nice things about taking a learning approach to game playing is that the things you learn, the things you figure out, are actually going to be applicable to other problems that are real-world problems. That's ultimately... I mean, there are two really interesting things about AlphaGo. One is the science of it, just the science of learning, the science of intelligence. And the other is that you're actually figuring out how to build systems that would be potentially applicable in other applications: medical, autonomous vehicles, robotics. I mean, it's just opened the door to all kinds of applications.
- 1:14:10 – 1:31:29
Alpha Zero and self play
- DSDavid Silver
Yeah.
- LFLex Fridman
So, the next incredible step, right, really the profound step, is probably AlphaGo Zero. I mean, it's arguable, I kind of see them all as the same place, but really... and perhaps you were already thinking that AlphaGo Zero was the natural next step, that it was always going to be the next step. But it's removing the reliance on human expert games for pre-training, as you mentioned. So, how big of an intellectual leap was this?
- DSDavid Silver
(laughs)
- LFLex Fridman
That self-play could achieve a superhuman level of performance on its own? And maybe, could you also say what self-play is? We've mentioned it a few times, but...
- DSDavid Silver
So, let me start with self-play. The idea of self-play is really about systems learning for themselves, but in a situation where there's more than one agent. So if you're in a game, and a game is played between two players, then self-play is really about understanding that game just by playing games against yourself rather than against any actual real opponent. It's a way to discover strategies without needing to go out and play against any particular human player, for example. The main idea of AlphaZero was really to try and step back from any of the knowledge that we'd put into the system and ask the question: is it possible to come up with a single elegant principle by which a system can learn for itself all of the knowledge which it requires to play a game such as Go? Importantly, by taking knowledge out, you not only make the system less brittle, in the sense that perhaps the knowledge you were putting in was just getting in the way and maybe stopping the system learning for itself, but you also make it more general. The more knowledge you put in, the harder it is for the system to be taken out of the domain for which it was designed and placed in some other domain that maybe would need a completely different knowledge base to understand and perform well. And so the real goal here is to strip out all of the knowledge that we put in, to the point that we can just plug it into something totally different. And that, to me, is really the promise of AI: that we can have systems such that, no matter what goal we set the system, we have an algorithm which can be placed into that world, into that environment, and can succeed in achieving that goal. That, to me, is almost the essence of intelligence, if we can achieve it. And so AlphaZero is a step towards that, and it's a step that was taken in the context of two-player perfect-information games like Go and chess. We also applied it to Japanese chess.
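A minimal, hypothetical sketch of the bare self-play mechanic, on a toy game of Nim rather than Go (take 1 or 2 stones; whoever takes the last stone wins): both "players" share a single value table and improve it purely by playing against themselves, with no human games involved. All names and numbers here are illustrative assumptions, not DeepMind code.

```python
import random
from collections import defaultdict

values = defaultdict(float)   # pile size faced by the player to move -> value estimate
EPSILON, ALPHA = 0.1, 0.1     # exploration rate and learning rate

def choose_move(stones):
    # Both players use the SAME value table: a good move leaves the opponent
    # in a low-value position. Occasional random moves keep exploration alive.
    moves = [m for m in (1, 2) if m <= stones]
    if random.random() < EPSILON:
        return random.choice(moves)
    return min(moves, key=lambda m: values[stones - m])

def self_play_game(start=10):
    stones, trajectory = start, []
    while stones > 0:
        trajectory.append(stones)          # position faced by the player to move
        stones -= choose_move(stones)
    result = 1.0                           # whoever took the last stone won
    for position in reversed(trajectory):
        values[position] += ALPHA * (result - values[position])
        result = -result                   # alternate perspectives walking backwards

for _ in range(20000):
    self_play_game()

print({s: round(v, 2) for s, v in sorted(values.items())})
# Piles that are multiples of 3 are lost for the player to move,
# so their learned values should drift towards -1.
```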
- LFLex Fridman
So just, just to clarify, the first step was AlphaGo Zero.
- DSDavid Silver
The first step was to try and take all of the knowledge out of AlphaGo in such a way that it could play in a fully self-discovered way, purely from self-play. And to me, the motivation for that was always that we could then plug it into other domains. But we saved that until later.
- LFLex Fridman
(laughs) Well, and-
- DSDavid Silver
In fact, just for fun, I could tell you exactly the moment where the idea for AlphaZero occurred to me, because I think there's maybe a lesson there for researchers who are too deeply embedded in their research and, you know, working 24/7 to try and come up with the next idea. It actually occurred to me on honeymoon. And-
- LFLex Fridman
(laughs)
- DSDavid Silver
... and I was at my most fully relaxed state, really enjoying myself, and just, bing, the algorithm for AlphaZero appeared, like-
- LFLex Fridman
(laughs)
- DSDavid Silver
... in its full form. And this was actually before we played against Lee Sedol, but we just didn't... I think we were so busy trying to make sure we could beat the world champion that it was only later that we had the opportunity to step back and start examining that deeper scientific question of whether this could really work.
- LFLex Fridman
So, (laughs) nevertheless, self-play is probably one of the most profound ideas, one that, to me at least, represents artificial intelligence. But the fact that you could use that kind of mechanism to, again, beat world-class players, that's very surprising. To me, it feels like you have to train on a large number of expert games. So was it surprising to you? What was the intuition? Not necessarily at that time, but even now, what's your intuition for why this thing works so well? Why is it able to learn from scratch?
- DSDavid Silver
Well, let me first say why we tried it. We tried it both because I feel that it was the deeper scientific question to be asking to make progress towards AI, and also because, in general in my research, I don't like to do research on questions for which we already know the likely outcome.
- LFLex Fridman
Mm-hmm.
- DSDavid Silver
Like, I don't see much value in running an experiment where you're 95% confident that you will succeed. And so we could have tried, you know, to take AlphaGo and do something which we knew for sure it would succeed on, but much more interesting to me was to try it on the things which we weren't sure about. And one of the big questions on our minds back then was: could you really do this with self-play alone? How far could that go? Would it be as strong? And honestly, we weren't sure. It was 50/50, I think. Really, if you'd asked me, I wasn't confident that it could reach the same level as those systems, but it felt like the right question to ask, and even if it had not achieved the same level, I felt that it was an important direction to be studying. And then, lo and behold, it actually ended up outperforming the previous version of AlphaGo, and indeed was able to beat it by 100 games to 0. So, what's the intuition as to why? I think the intuition to me is clear: whenever you have errors in a system, as we did in AlphaGo... AlphaGo suffered from these delusions. Occasionally, it would misunderstand what was going on in a position and misevaluate it. How can you remove all of these errors? Errors arise from many sources. For us, they were arising both from starting from the human data, but also from the nature of the search and the nature of the algorithm itself. But the only way to address them, in any complex system, is to give the system the ability to correct its own errors. It must be able to correct them. It must be able to learn for itself when it's doing something wrong and correct for it. And so it seemed to me that the way to correct delusions was indeed to have more iterations of reinforcement learning, that no matter where you start, you should be able to correct for those errors until it gets to play that out and understand, "Oh, well, I thought that I was gonna win in this situation, but then I ended up losing. That suggests that I was misevaluating something, and there's a hole in my knowledge," and now the system can correct for itself and understand how to do better. Now, if you take that same idea and trace it back all the way to the beginning, it should be able to take you from no knowledge, from a completely random starting point, all the way to the highest levels of knowledge that you can achieve in a domain. And the principle is the same: if you bestow a system with the ability to correct its own errors, then it can take you from random to something slightly better than random, because it sees the stupid things that the random is doing and it can correct them, and then it can take you from that slightly better system and understand, "Well, what's that doing wrong?" And it takes you on to the next level and the next level. And this progress can go on indefinitely. And indeed, had we carried on training AlphaGo Zero for longer, we saw no sign of its improvement slowing down; it was certainly carrying on improving. And presumably, if you had the computational resources, this could lead to better and better systems that discover more and more, so-
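As a purely illustrative rendering of that "correct your own errors" principle, here is a minimal sketch using the textbook five-state random walk rather than Go: the value estimates start as random guesses and are repeatedly nudged toward what actually happens, which is enough to take them from no knowledge to accurate knowledge. This shows only the underlying principle, not AlphaGo Zero's actual algorithm.

```python
import random

# Five non-terminal states 0..4; start in the middle and step left or right at random.
# Falling off the left end pays 0, off the right end pays 1 (classic random walk).
values = {s: random.random() for s in range(5)}   # "no knowledge": random guesses
ALPHA = 0.1                                       # learning rate

def episode():
    s = 2
    while True:
        s_next = s + random.choice((-1, 1))
        if s_next < 0:                            # terminated left: outcome 0
            values[s] += ALPHA * (0.0 - values[s])
            return
        if s_next > 4:                            # terminated right: outcome 1
            values[s] += ALPHA * (1.0 - values[s])
            return
        # "I thought this position was worth X, but one step later it looks like Y":
        # treat the mismatch as an error and correct towards it.
        values[s] += ALPHA * (values[s_next] - values[s])
        s = s_next

for _ in range(20000):
    episode()

print({s: round(v, 2) for s, v in values.items()})
# True values are 1/6, 2/6, ..., 5/6; the random guesses get corrected toward them.
```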