EVERY SPOKEN WORD
150 min read · 30,042 words- 0:00 – 8:06
Basics of Go
- DPDwarkesh Patel
Today I'm here with Eric Jang, who was most recently vice president of AI at 1X Technologies. Before that, senior research scientist at what is now Google DeepMind Robotics, and you've been on sabbatical for the last few months. One of the things you've been doing is rebuilding, and improving, and hacking on AlphaGo. And so we're, today what we're gonna do is you're gonna explain building AlphaGo from scratch and what it tells us about the future of AI research and development. But, uh, before we get to that, why is AlphaGo interesting? Why is this, why is this the project you decided to do on sabbatical rather than just hang out at the beach?
- EJEric Jang
Sure, yeah. Um, I like making things, and AlphaGo and Go AI is one of those things that really got me into the field. Uh, when I saw the kind of early breakthroughs, um, on AlphaGo in 2014, 2015, 2016 and so forth, it was just profound to see, you know, how smart AI systems could become and the, the kind of computational complexity class that they could tackle with deep learning. Um, this is a problem that has, you know, long been understood to be kind of intractable for a search, and yet, um, it was solved, um, through, through deep learning. And so, so that was quite mysterious to me, and I've always wanted to understand that phenomena a little bit better. My training is often in deep neural nets for robotics, where it's, uh, the, the decisions made by the neural networks are a bit more intuitive, but AlphaGo is a sort of problem where the, the decisions are actually the result of a very, very deep search. And it's always been very mysterious to me how, like, a 10-layer network can sort of-
- DPDwarkesh Patel
Mm-hmm
- EJEric Jang
... amortize the simulation of-
- DPDwarkesh Patel
Yeah
- EJEric Jang
... something so, so, uh, so deep in the, in the game tree.
- DPDwarkesh Patel
Yeah, interesting.
- EJEric Jang
So if you plot out how much compute it took to build various iterations of strong GoBots over the years, you can see that in 2020 there was a open source project called KataGo, um, by David Wu from Jane Street, who, who basically achieved a 40x reduction in compute needed to train a really strong GoBot, uh, tabula rasa. Um, I'm not certain if it's stronger than AlphaGo Zero or AlphaZero or MuZero, uh, but it's very, very strong, and this is what most Go practitioners today train against when they're, when they're playing an AI. And thanks to LM coding, what took a whole team of research scientists at DeepMind and, you know, millions of dollars of research and compute can now be done for, you know, a few thousand dollars of, of rented compute.
- DPDwarkesh Patel
Okay. I guess we should first discuss how Go works.
- EJEric Jang
Great.
- DPDwarkesh Patel
So, yeah, how, how does the game work?
- EJEric Jang
Um, so the game of Go is a very simple one that can be implemented, uh, quickly and easily in a computer. The, the objective of the game is basically to put down black and white stones and try to occupy as much territory in the game as possible. So I might start by putting down a black stone. Uh, black always goes first. So go ahead. And so the way you capture an opponent's stones is, like, for every, um, intersection, if you can surround all four of its neighbors with, um, with your stones, then, um, then this one is sort of cut off from oxygen, if you will, and then it, uh, and it, it, it, it is a dead, dead stone. So, so then now I control these four stones as well as this empty intersection here.
- DPDwarkesh Patel
Mm-hmm.
- EJEric Jang
So there's, like, slight variations between Chinese, Japanese, and, uh, what is called Trump-Taylor rules. Um, Trump-Taylor rules are designed to be completely unambiguous for Go, so this is what all Go AIs train against-
- DPDwarkesh Patel
Mm
- EJEric Jang
... and resolve against. So in typical Go, like when, like, humans play, you're actually not allowed to put this white stone down here.
- DPDwarkesh Patel
Mm.
- EJEric Jang
It would be instant suicide. Um, in Trump-Taylor, it's actually fine. You put it down, and then it immediately resolves to death.
- DPDwarkesh Patel
Mm.
- EJEric Jang
So the outcome is sort of the same. Let's go ahead and start over and, and play, play a few stones, and then I'll explain some more.
- DPDwarkesh Patel
Cool.
- EJEric Jang
So I'll just start there.
- DPDwarkesh Patel
All right. I'm, like, basically playing randomly here, but I'm trying to get around your stones and see if I can-
- EJEric Jang
Sounds good
- DPDwarkesh Patel
... surround them.
- EJEric Jang
Yeah.
- DPDwarkesh Patel
Hmm.
- EJEric Jang
So this move, um, basically exposes one empty neighbor for your white stone, and it's very akin to a check in chess-
- DPDwarkesh Patel
Hmm
- EJEric Jang
... where if you don't respond immediately by putting one here, then I can immediately capture this.
- 8:06 – 31:53
Monte Carlo Tree Search
- DPDwarkesh Patel
Let's understand how, um, AlphaGo actually works and how somebody in the audience might be able to implement it.
- EJEric Jang
Great, yeah. Let's start with, um, kind of an intuition about the underlying, um, you know, search process used to make moves, and we'll layer on, uh, ideas from deep learning to make it much more efficient-
- DPDwarkesh Patel
Mm
- EJEric Jang
... and tractable. So Go is a game where there's just two players. We're gonna draw a person here, and we're gonna draw an AI here. And, um, let's say this person is playing black, so they go first. So we're gonna draw. They go here. And then now the AI, um, is going to make a move based on what it sees here. So there's a question of, like, how you encode these inputs into the AI. Maybe you could use ones and zeros, but you wanna represent, um, you know, black, white, and empty. So, so you would need at least three different values here, right? So maybe you could use zero, ones, and twos or something. So, so the AI might see something like, you know, zero, zero, zero, zero, uh, one. Great. So, so this is the input to the AI, uh, on its turn.
- DPDwarkesh Patel
Yeah.
- EJEric Jang
So, so the AI can choose, let's just pick three possible random moves that it can go, and I just drew these at random. And so which, which move is best here, right? Well, we don't know until the game ends. Um, there's no-- Go does not have any kind of local reward of which move here is good. And this is what makes Go a very difficult game, is that you don't actually know who won until you really get to the end of the game. So how deep is this tree, right? Well, in a 19-by-19, um, Go board, there are, uh, you know, roughly to the order of 361 moves on any given, uh, move, and of course, as it fills up, you have less moves. Um, and, and the, the number of steps in the game can be somewhere from 250 to 300 moves, and maybe experts might, uh, decide to end the game, uh, well before that. But, uh, you know, under Tromp-Taylor scoring, you actually have to play things all the way to the end, so this could be like 300 moves or something. Right. So, like, 300, um, like, depth of the tree.
- DPDwarkesh Patel
Yeah.
- EJEric Jang
So if you keep on expanding possible moves here, so th- in, in this move, the AI is going, and then, you know, here the human would go. And then, you know, there's, there's some... and so forth. You can find that, like, essentially what you end up with is an enormous explosion in the possible game outcomes originating from just this one s-state. So this is something to the order of, like, you know, 361 to 300, power of 300, which is far more than the number of atoms in the universe, right? Like, it's, it's just, uh, it's just... And, and of course, actually there are redundancies and symmetries, so it's not actually 300, but, but that's sort of the... If you were to do a naive tree where there were no merging of children, then actually you end up with a tree about this big.
- DPDwarkesh Patel
What do you mean by merging of children?
- EJEric Jang
Right. Let me, uh, use this board here. So if we start here, and then you play here, and then, um, I play here, and then you play here, that is equivalent to I start here, you play here, I play here-
- DPDwarkesh Patel
Yeah
- EJEric Jang
... and then you play here, right?
- DPDwarkesh Patel
Yeah.
- EJEric Jang
So, so both of them arrived at the same spot but through different paths. So this child node can be thought about as a shared ancestor.
- DPDwarkesh Patel
Backtrack.
- EJEric Jang
Yeah.
- DPDwarkesh Patel
And I guess it's not 361. It starts at 361, but it dec-decreases by one each time.
- EJEric Jang
Yes, and the branching factor decreases by one each time.
- DPDwarkesh Patel
Yeah.
- EJEric Jang
Yes, yes. But in, in any case, this is a very, very, very large tree.
- DPDwarkesh Patel
Yeah.
- EJEric Jang
And, um, this is also why, you know, computer scientists for many years thought that Go was not a tractable problem this century-
- DPDwarkesh Patel
Yeah
- EJEric Jang
... because the amount of compute you would need to exhaustively search every possible possibility, um, is just too large.
- DPDwarkesh Patel
Yeah.
- EJEric Jang
Um, if you could, Go is actually a deterministic game. So, um, on any given state, you can actually compute what the, uh, best possible strategy you can, uh, you can make is-
- DPDwarkesh Patel
Mm
- EJEric Jang
... uh, to, in order to win the game. You can search all the possible futures where you win and then just make sure you always stay in, in, in that, you know, set of futures.
- DPDwarkesh Patel
Mm-hmm.
- EJEric Jang
So AlphaGo's kind of core conceptual breakthrough was using neural nets to make this search problem tractable. So before we get into, you know, how neural networks are involved, let's talk a little bit about how we can, you know, a- assuming we had a very powerful enough computer, um, search this, uh, this tree to find the best move, right? So in the beginning, um, you're not gonna build out the whole tree, uh, bec-because storing that tree would be very expensive. Instead, you might do something like interactively figure out which, um, which l-leaves of this tree are worthy of exploring and expanding into the future to see, you know, what else is there. So, um, there are some early algorithms in, uh, bandit literature like, you know, UCB, uh, one, which is not exactly appropriate for a, you know, sequential game like Go, but very much inspired the action selection, um, algorithm used in, uh, in AlphaGo. So, so UCB one looks like on, on every move we're gonna take the best action, uh, or, you know, the argmax over A that maximizes, um, um, you know, the Q of A, and I'll explain what Q of A is in a moment, plus some sort of exploration bonus. So on every node, we're going to track a few quantities. So, so let's, you know, consider each of these a node. This is, this is the, the root node, um, where you're making decisions from, and these are the children of the root node. And, um, we're gonna say each node is basically a data structure that is, um-- it stores a visit count of this, um, this node, this, this child node.
- 31:53 – 1:00:22
What the neural network does
- EJEric Jang
Okay, so n- now we have a basic intuition of how moves are made-
- DPDwarkesh Patel
Yeah
- EJEric Jang
... with search. We're gonna talk about how neural networks can speed this up, uh, by providing an analog to, like, the human intuition here.
- DPDwarkesh Patel
Mm.
- EJEric Jang
So there's two networks. There is the value network, which takes in a state, and it predicts, you know, am I gonna win or lose? It's a binary classification problem. Then we're gonna have a policy network, which, uh, induces a distribution over good actions to take.
- DPDwarkesh Patel
Mm-hmm.
- EJEric Jang
So, um, I'm gonna draw a one-dimensional flattened m- move distribution, but this is really like, you know, a, a, a square kind of grid, right? So, um, so maybe, like, it thinks actions are like... Th- these are the kind of probability distribution over good actions.
- DPDwarkesh Patel
Mm.
- EJEric Jang
And both of these are, uh, categorical classification problems, right? So you can train this like any classifier in, uh, you know, with deep learning, um, uh, you know, c- cross entropy, loss, that kind of stuff. So the, um, the specific architecture does not actually matter too much. I, I tried a few different architectures, transformers work, ResNets work.
- DPDwarkesh Patel
Hmm.
- EJEric Jang
For small data regimes, um, my experience is that ResNets still kind of outperform transformers-
- DPDwarkesh Patel
Hmm
- EJEric Jang
... and, um, and they kind of give you more bang for the buck at, at lower budgets, but this may not be true.
- DPDwarkesh Patel
Oh, really? Why is that?
- EJEric Jang
Um, they, they provide the inductive bias of, like, local convolutions.
- DPDwarkesh Patel
Yeah.
- EJEric Jang
And t- generally, transformers start to outperform, uh, residual convolutional networks when you want more global context.
- DPDwarkesh Patel
I see. Yeah.
- EJEric Jang
So, so one, um, interesting finding from the Katago paper was that they found it actually quite useful to pool together global features together, um, t- at aggregate global features, like, uh, throughout the network, um, to kind of give the network a global sense of how to, like, connect value from one side of the board to another side of the board.
- DPDwarkesh Patel
But what does it mean to aggregate global features?
- EJEric Jang
Yeah. So if you have a, um, Go... A, a, a very large 19 by 19 Go board-
- DPDwarkesh Patel
Yeah
- EJEric Jang
... and you, you know, you've got some, some sort of battles going on here, and you've got some battles going on here, um, when you pass this through a convolutional neural network-
- DPDwarkesh Patel
Yeah
- EJEric Jang
... the receptive fields of the convolutional network are going to be good at computing local things and making that sh- invariant, but, um, they won't be able to kind of connect these two features easily, right? They need to sort of be pooled together and attend to each other somehow. So the argument about, you know, why transformers are good for computer vision tasks, like with, uh, you know, vision transformers and so forth, is that because they have a sort of global attention across the whole thing-
- DPDwarkesh Patel
Yeah
- EJEric Jang
... they can more easily draw these kind of connections.
- DPDwarkesh Patel
Right.
- EJEric Jang
But you do need more data there so that you can kind of, uh, learn through data the, the sort of invariant local, local features.
- DPDwarkesh Patel
Got it. Yeah.
- 1:00:22 – 1:25:27
Self-play
- EJEric Jang
Okay, so, um, we now talk about the RL part of, like, how this thing gets stronger by playing itself, right? Um, let's say we play a game where we have the AI. So you make a moveAI, AI, um, will, will kind of compute the search and then th-this is the sort of visit count distribution. Um, let, let's say this is your policy, your policy, initial policy recommendation at the, at the, at this node.
- DPDwarkesh Patel
Mm-hmm.
- EJEric Jang
And then after MCTS, it, uh, gets more confident about one of these actions, right? And, and so maybe the, the distribution looks a bit more peaky like this based on the, the search. Now, of course, you can tune the search process so that it ends up more diffuse, but that's probably not a good idea. MCTS should get more confident about specific actions, um, than others, but it, of course, might place a lot of weight on, you know, other actions initially-
- DPDwarkesh Patel
Yeah
- EJEric Jang
... and then as you increase the number of sims, it should converge to a very peaky distribution. Um, so, so this is your new, uh, let's call this, like, pi... Let's, let's wrap this in, like, a MCTS operator of, you know, A given S, right? So after applying MCTS process, your, your policy-recommended distribution looks like this. It's, it's a bit more peaky than, than the previous one. Um, and so then you take the argmax, or maybe you just sample from this, uh, it doesn't have to be the argmax, and then you, you make your move. And then, um, and then you throw away the tree, and then you, you begin anew on the next move, right? So again, like, you, um, you know, you compute a new distribution. So initially, maybe your guess looks like this, and then you refine it through MCTS.
- DPDwarkesh Patel
There should be one more X on the board, right?
- EJEric Jang
I'm sorry. That's correct, yes. Um, to something that looks like this, right? So, so on every move, you have your initial guess from your policy network, um, and then the search process that combines your policy network and your value network arrives at a more confident action that you take.
- DPDwarkesh Patel
Mm-hmm.
- EJEric Jang
And, and then so on and so forth, and then the game ends, and one person wins and one person loses. So a, um, the way that, uh, the, the beauty of, of how AlphaGo trains itself is that it actually can take this final search process, the outcome of the search process, and tell the policy network, "Hey," like, you know, "instead of having MCTS do all this, you know, legwork to arrive here, why don't you just predict that from the get-go?" Right? Like, "Why don't you, like, you know, not use this guess and just predict this to begin with?" And if you have this guess to begin with in your policy network, then MCTS has to do a lot less work to, to get things to work. And so if we draw like a sort of test-time scaling plot, um, so, so let's say, like, this is, like, number of simulations. Um, let's say, you know, at, at zero simulations, your, your sort of implicit win rate is like, um, is like, I don't know, here, and then, and then, um, w-without any simul- if you just take this raw action, th-this is what your w-win, win rate is. And let's say as we increase the number of sims, um, maybe, maybe you kind of have a, a win rate that looks like this, right? So, um, when you search for, let's say, a thousand simulation steps, that gets you to a, um, a policy here that gets you to here, which is great. But if you were to distill this MCTS policy network back into your sort of, uh, shoot-from-the-hip policy network, then you could actually, um, uh, you know, start here, like, uh, if, let's say this was, you know, zero, um, for, uh, by, by distillation, then if you spend another one thousand sim steps, then you actually kind of get to here. It's almost like if you could just, you know, am-, um, amortize the, the first one thousand steps actually into the policy network-
- DPDwarkesh Patel
Yeah
- EJEric Jang
... instead of the search process, then you could begin at a much better starting point and then get a much better result for, for your, uh, for the number of sims that you put in.
- DPDwarkesh Patel
The same more type nature of test-time scaling as the number of simulations increases, the, the increase in win rate is, is, uh, smaller. Is that true even for the distilled network? That is to say, is there some gain of, like, okay, we start from the distilled, we get these early gains again, or is that just inherent to, like, the nature of-
- EJEric Jang
Yeah
- DPDwarkesh Patel
... MCTS?
- EJEric Jang
To be honest, I actually don't know the test-time scaling behavior of MCTS simulations, and I, I believe it might actually be quite sensitive to how strong this one is in practice.
- DPDwarkesh Patel
Yeah.
- EJEric Jang
I'm just drawing a monotonically increasing function that gets to one.
- DPDwarkesh Patel
Okay, cool.
- EJEric Jang
So yeah. So, so don't pay too much attention to the shape of the curve. Just know that it's monotonic with respect to sim. Um, okay, so, um, so, so the idea of MCTS is very brilliant, which is like we're gonna... We got something better by applying search.
- DPDwarkesh Patel
Yeah.
- EJEric Jang
And, um, we're going to now on our next iteration of updating this network, just train this to approximate the outcome of a thousand steps of search.
- DPDwarkesh Patel
Yep.
- EJEric Jang
And so instead of starting here, we get to now have our neural network start here, and then, and then, you know, the, the play gets stronger once we then apply another thousand steps on top of it.
- DPDwarkesh Patel
Yeah.
- EJEric Jang
And you can keep going, right? So, so the training algorithm for AlphaGo is to basically take the games where you've applied the search on every move that the policy encountered, uh, whether you won or lost, and that's quite important. Um, and you're just gonna train, uh, the, the model to imitate the search process. So, um, there's an analogy to robotics, actually, which is the DAGGER algorithm.
- DPDwarkesh Patel
Hmm.
- EJEric Jang
Um, uh, fir-first I'm gonna draw like a, uh, schematic of like, let's say, you know, the states, right? So S0, S1, S2, S3. So let's say, you know, we, we took a series of actions in an MDP to, to get, uh, a trajectory, and these actions may be suboptimal, right? Maybe we lost at the end of this game. So, um, there is a family of algorithms that basically take trajectories, uh, and relabel the actions to better trajectories. So maybe a better action here would have been to take, you know, A0 prime. A better action here would have been to take A1 prime, and yet another one, like A2 prime.A3 prime. So, um, what MCTS is doing is basically saying, like, you played this game, where you eventually lost, but on every single action, I'm gonna give you a strictly better action that you should take instead. It does not guarantee that you are going to win, but it does guarantee that, you know, if you take these tuples as training data so that you retrain your, uh, your, your policy network to predict these ones instead of these ones, you're gonna do better. And, uh, this is very related to DAgger in, uh, robotics and, and imitation learning, where you want to collect a intervention here, and even if you're in a, in a not great state, for example, like a self-driving car that, you know, veers off the side of the road, there is still a valid action that kind of corrects you and brings you back.
- DPDwarkesh Patel
Yeah. Okay, so, um, pedantic question, but is there a guarantee that MCTS must be better than the policy? For example, you could imagine early on in training, because MCTS is informed by the value network-
- EJEric Jang
Yeah
- DPDwarkesh Patel
... early on in training, uh, when the value network hasn't been well-trained on finished, uh, games, um, that, like, MCTS is worse than sort of randomly initialized policy. So is it just, like, a heuristic that MCTS is better than the policy, or is that, like, is there some guarantee?
- 1:25:27 – 1:45:36
Alternative RL approaches
- EJEric Jang
Okay, so, uh, it, it's worth kinda thinking a little, a little bit about, like, okay, what are some alternatives we could do to train self-play agents instead of MCTS, right? Like, you know, we, we use a lot of, uh, LLM style RL these days. Like, is that relevant? Could we do that instead? Um, so, so let, let's think through this a little bit. L-let's suppose we have a very naive algorithm where we take a league of agents of different checkpoints, and we play them against each other, and, um, for, for the games where a single player wins, uh, we're gonna reinforce those actions up, and then, and then, uh, and, and train-- retrain the policy network to imitate those, those guys instead of, uh, instead of the MCTS objective. Um, so what ends up happening is, let's say you have a chain of actions that led to a win, and you're, you're-- you have a match-up between two agents that are basically the same. So in fact, let's just assume that, like, uh, you know, policy A and policy B are, like, evenly matched, right? So their true, their true win rate is, is, like, fifty percent. Um, so let's say you play a hundred games, and then, uh, each game, let's say, lasts, you know, three hundred, um, moves. And, um, you're doing some sort of, like, evolution strategy or some way to perturb these things to get them, get them to do different things, or maybe you don't, and you just play them against each other, and you see, like, occasionally this one might actually have a better strategy than this one, right?
- DPDwarkesh Patel
Yeah.
- EJEric Jang
And so, so let's say, um, you know, fifty-one games, um, policy A wins. And then forty-nine games, policy B wins. And this is just due to random luck, or maybe you perturbed policy A-
- DPDwarkesh Patel
Yeah
- EJEric Jang
... in some way that let it do this. And just to have a very, very simple model, let's pretend that for, uh, for, like, uh, forty-nine of the games, they played exactly equally. Um, I'm sorry, for, for fifty of the games, they, they played exactly equally, right? Um, and on that one game where this one won, it, it played slightly differently. It made, like, one critical move that, like, you know, normally it would have done differently, but, uh, due to some exploration or some random noise, it just happened to make a smarter move than it did previously.
- DPDwarkesh Patel
Yeah.
- EJEric Jang
So you have one supervision signal, like one true supervision signal for your policy network. Um, and then you have, uh, ninety-nine games times three hundred moves for which, uh, imitating those actions gives you exactly the same policy you had before. And so the, um, the scale of your variance is actually very bad because-
- DPDwarkesh Patel
Mm-hmm
- EJEric Jang
... it's like you only have one label out of this enormous data set of, of actions, of supervision actions where you want-- Actually, sorry, let, let me, let me clarify a little bit.
- DPDwarkesh Patel
Okay, so we're just talking about how the good move, the out-of-distribution move, is a small fraction of all the moves that are played across all the games on which you'd want to train. And, um, this of course reminds me of how LLMs are trained with policy gradient methods. Uh, Karpathy was-- when he was on the podcast called it, like, sucking supervision through a straw. Um, and uh, and so yeah, it's interesting that this, like, this thing you're saying, which would be intractable and prevents you from actually getting beyond a certain level in Go is just by default how LLMs are trained, question mark?
- EJEric Jang
R-right. So, um, in this case, th-this is not to say it doesn't work, right? Like if you imagine increasing the number of games to like, you know, millions of samples, you actually can get some meaningful supervision, like, s-samples, so long as you find a way to sort of mask out the supervision from these guys. And then this is where things start to get pretty related to RL in terms of advantage and baselines and so forth.
- DPDwarkesh Patel
Yeah.
- EJEric Jang
So, so let's, um, let's look at the eq- you know, the gradient variance of a very naive approach like this where, um, I'm just gonna call it, like, gradient RL, and it's basically the, you know, sum of rewards.
- DPDwarkesh Patel
Okay, I see what you're saying.
- EJEric Jang
So, so the sum of rewards is the return, right? So, so, like, uh, in, in our naive setup here, we only have a indicator variable for the return, where either you won or lost. So, um, so in the case where you lost, well, you just don't train on-- your gradient is zero. You don't train on those examples. And when you won, you try to predict those, those things, right?
- DPDwarkesh Patel
Yeah.
- EJEric Jang
So, so you can think about this setup as a, as a special case of-
- DPDwarkesh Patel
Yeah
- EJEric Jang
... this general formula here. Um, the, um, the trouble here is that this is very high variance because when you multiply these terms out, when you take-- when you try to compute the variance of this, and so, so variance of the gradient is equal to expectation of squared minus... And just for simplicity, we can pretend this is like, you know, on average zero or something, we c-- if, if you're centering it at, at, you know, no signal. Um, and uh, the variance here basically means that you're, you know, taking the square of this product term, and so you end up with, uh, a term that kind of grows quadratically with the, with T.So, so variance, um, when you have a setup like this, th-this thing acts as a coupling effect on top of, of these terms here. So, um, um, let's actually map this to an LLM case, and we can answer like why do LLMs only do one-step RL instead of a multi-step RL scenario? Um, in LLMs you have a decoder that might, you know, predict some words like, "Hello, world." And so in current LLM RL they treat this entire sequence as a single action, just at, and big T is just one, right? And so yes, it is true that, you know, the, uh, because of how, you know, transformers are formulated, uh, th-through the sort of p-product of conditional probabilities, um, we do have, uh, you know, probability of this sequence is equal to the sort of sum of... Lo-log probability of the whole sequence is equal to the sum of the probabilities of like, you know, uh, individual tokens, right?
- DPDwarkesh Patel
Yeah.
- EJEric Jang
So, so in this case I would, um, I would say something like, you know, log hell plus log lo plus log world. So, um, this is true, and if this term were one, then they would be the same thing. However, um, in sampling things if you have a reward term assigned to every specific token, now you have these interaction effects between the cross-multiplication of these terms and these terms.
- DPDwarkesh Patel
Right.
- EJEric Jang
And so the problem becomes how do you ascribe the credit associated with every episode to all these different terms here?
- DPDwarkesh Patel
I guess the thing I'm confused on is h- what would that even look like to do it that way in, um-
- EJEric Jang
In LLMs
- DPDwarkesh Patel
... in LLMs because you do, you only do get a reward at the end of the episode, so...
- EJEric Jang
You could imagine a reward that says like, "I'm gonna give you some process supervision-
- DPDwarkesh Patel
Yeah
- EJEric Jang
... uh, where you get a reward for each of these actions on every step."
- DPDwarkesh Patel
Okay, so you're saying it, instead of doing it that way where you, um, well I guess the way you've written it would be a sum at the end anyways, so it would, they, they wouldn't have to be multiplied. But you're saying instead of doing it that way you would just add up this process rewards at the end and then treat that as one single reward signal?
- 1:45:36 – 2:00:58
Why doesn’t MCTS work for LLMs
- DPDwarkesh Patel
Okay, so this is very interesting. And then to unify this with our discussion of LLMs, so with LLMs, you're doing something, you don't have Q values, but you're doing this sort of backwards learning where, hey, let's find the trajectories which pass some unit test in some coding environment, and then let's reinforce those trajectories.
- EJEric Jang
Mm-hmm.
- DPDwarkesh Patel
And then there's a huge difference between that and this forward approach with MCTS, and the reason you can do MCTS, and it- it's much more preferable to do MCTS because you can do it per move, uh, and make each move better rather than having to learn per trajectory, um, and hope, you know, as Karpathy said, hope to learn this like-
- EJEric Jang
Through a straw
- DPDwarkesh Patel
... yeah, so you get this supervision through a straw. Uh, basically just upgrade all the tokens r- in the trajectory that might or might not have been relevant to getting the answer right. The reason you can do this much more sort of sample efficient, uh, much more favorable thing with Go is that because MCTS works in Go, you basically know that, hey, if I just do search locally here, and this search is sort of truncated at the end by this value function that, uh, works even, even if I haven't unfolded my whole trajectory, I can just say, "This is my new policy," um, and I can improve in a more iterative, like, local way rather than having to ha- having to unfold all these trajectories.
- EJEric Jang
So there was some research, I think, from Google in 2030, 2023, 2024, where they did try to apply tree structures to reasoning.
- DPDwarkesh Patel
Yeah.
- EJEric Jang
And I think it's, you know, the jury is still out as to whether this can ever work. So I, I would say like, uh, it-- we probably will see like, you know, revisiting of this idea of forward search, um, in, in the future. But there's two things that make MCTS very simple for Go, which is that value estimation is kind of concrete and you can determine it-
- DPDwarkesh Patel
Yeah
- EJEric Jang
... for real, and then you can kind of, uh, uh, sort of use it to truncate depth, as you said.
- DPDwarkesh Patel
Yeah.
- EJEric Jang
And then the, uh, breadth is also, uh, determined, and what's kind of critical is that the action selection algorithm where you iteratively visit and grow the tree is, um, well suited for the size of problem that Go is-
- DPDwarkesh Patel
Yeah
- EJEric Jang
... and the depth of the problem. But for something like LLM reasoning, um, you know, PUCKT might actually not be a good enough heuristic.
- DPDwarkesh Patel
Right.
- EJEric Jang
It might be too greedy with local tokens, and it might do something like, oh, only give you, you know, uh, sort of obvious thoughts that are correct, but not really solve your final problem.
- DPDwarkesh Patel
Yeah.
- EJEric Jang
So I, I would say the jury is probably still out on how, like, what the final instantiation of reasoning for LLMs would look like, and I wouldn't rule out that, like, this stuff could, you know, come back-
- DPDwarkesh Patel
But d-
- EJEric Jang
But it's been hard
- DPDwarkesh Patel
... do- don't LLMs sort of ni- natively learn to do MCTS, where they'll try and approach and be like, "Oh, that doesn't work. Let's back up. Let's try this other thing-"
- EJEric Jang
Mm-hmm
- DPDwarkesh Patel
... and then go in the direction that proves to be more fruitful?
- EJEric Jang
Uh, yeah, cer- certainly I think the LLMs manage to do something that looks like real human reasoning-
- DPDwarkesh Patel
Yeah
- EJEric Jang
... without having to do an explicit tree structure.
- DPDwarkesh Patel
Yeah.
- EJEric Jang
Um, that being said, I think the idea of doing forward search and simulation to get a better sense of what is valuable might make a comeback, um, even though not exactly in the same instantiation as, as-
- DPDwarkesh Patel
Interesting
- EJEric Jang
... uh, AlphaGo.
- 2:00:58 – 2:11:51
Off-policy training
- EJEric Jang
seeing, like, what really matters.
- DPDwarkesh Patel
Wait, this is a sort of tangential question, but why is it okay to have a buffer in AlphaGo? 'Cause every time I talk to an AI researcher, they're telling me about how bad it is to be off policy.
- EJEric Jang
Mm-hmm.
- DPDwarkesh Patel
But then the way a naive implementation of AlphaGo Zero would work is that most of the moves in a given backward step or in a batch of backward steps, um, would be n- not, not among the ones that were made by the most recently trained model. So why is that okay?
- EJEric Jang
Great question, yeah. And this, this gets into the sort of fundamental off-policy versus on-policy, uh, reinforcement learning kind of questions. So, uh, as you recall, in MCTS, you take actions that you took and you relabel them to, uh, to take different actions on the same states.
- DPDwarkesh Patel
Mm.
- EJEric Jang
Right? So, so the off-policy part here comes where, um, what if you're relabeling states that your new policy would never visit?
- DPDwarkesh Patel
Yeah.
- EJEric Jang
Right, like, what's the point? You're kinda wasting capacity, and in the extreme limit, imagine your distribution of states in your training buffer are all states that you would never visit. Then you're basically supervising them, uh, to, uh, take good actions on states you would never achieve-
- DPDwarkesh Patel
Yeah
- EJEric Jang
... and therefore your policy can get really bad, right? So this is where off-policy can really hurt, um, uh, uh, AlphaGo. Um, however, if you interpret this sort of from, like, the dagger perspective, which is basically saying, like, a way to kind of correct yourself back to the optimal trajectory, uh, given some, some data, what, what you kind of want in a algorithm like this is to have mostly states that you would visit, but then you have a small percentage or maybe a reasonable percentage of states in this kind of high-dimensional tube around your optimal-
- DPDwarkesh Patel
Mm
- EJEric Jang
... you know, tr- trajectories, and any of those states are given a supervision target to kind of f- uh, sort of funnel you back into your optimal trajectory. Uh, so maybe I can just draw-
- DPDwarkesh Patel
Yeah, yeah
- EJEric Jang
... it out quickly here. Great. So in, um, sort of a dagger-style setup, what your kind of optimal training data distribution is, is that here is your optimal states and actions, so this is like, you know, you wanna be in this state, you wanna be in this state, you wanna be in this state, and then you win here. Um, and then these are your optimal policy actions. So th- these are the, these are the things that you definitely wanna train on, but to make it robust to disturbances, um, you want to make sure that if you happen to drift off into some other states-
- DPDwarkesh Patel
Mm
- EJEric Jang
... you can kind of funnel yourself back into-
- DPDwarkesh Patel
But why isn't this a fully general argument for off-policy training?
- EJEric Jang
This is actually why you wanna do off-policy training sometimes-
- DPDwarkesh Patel
Mm
- EJEric Jang
... is that, like, you, you don't want to have a compounding error where if you make a mistake, you don't have the data of how to return back to your optimal distribution.
- DPDwarkesh Patel
Yeah.
- EJEric Jang
And so, um, optimal control does not really say too much about, like, uh, y- you know, how to, uh, you know, not accidentally get here, because it's sort of making the assumption that, like-
- DPDwarkesh Patel
Right
- EJEric Jang
... once you learn the policy, you're gonna get here.
- DPDwarkesh Patel
Yeah.
- EJEric Jang
But- but in applications like robotics, right, like, like, I don't know, a, a gust of wind blows you slightly off, and then now you need to, like, correct, right?
- DPDwarkesh Patel
Yeah.
- EJEric Jang
Um, or the friction on one of your tires is kind of a little bit, like, lower than the other wheel, and then now, now your, your car is drifting, and you gotta, gotta, like, correct it. So, so these kind of things in, in, like, more real environments often happen, where, like, um... A- actually, there was a funny, uh, quote about chess and, and also Go. It's like, "The problem with, uh, the, the problem with Go and chess is that the other player's always trying to do some shit," right?
- DPDwarkesh Patel
[laughs]
- 2:11:51 – 2:22:05
RL is even more information inefficient than you thought
- DPDwarkesh Patel
I, I'm reminded now of our earlier conversation of why MCTS is so favorable as compared to the kind of r- you know, reinforce or policy gradient kind of thing LLMs do. And this might be totally wrong, but I wrote a blog post a few months ago about, um, how RL, at least policy gradient RL, is even more, uh, inefficient than you might think. And so the inefficiency one thinks about naively is the fact that you have to roll out a whole trajectory in order to get any learning signal at all, a- at all. And so as these trajectories become longer and longer, as an agent has to, instead of just previously like complete the next word in the sentence, it has to go instead to, "Hey, sp- do two days worth of work to figure out even if you even did this project correctly." The amount of information per, uh, FLOP has been decreasing. As you had to unroll two days worth of thinking in order to see if you even did something correctly to like, did I, uh, implement this feature, the amount of samples per FLOP has been decreasing. But so you can think of, um, uh, you're trying to maximize as you're learning bits per FLOP, right?
- EJEric Jang
Mm-hmm.
- DPDwarkesh Patel
Um, and this is, you can think of bits per FLOP as, um, samples per FLOP times, um, uh, bits per sample. And what I just mentioned, uh, a second ago is that the samples per FLOP go down as RL becomes more and more long horizon.
- EJEric Jang
Mm-hmm.
- DPDwarkesh Patel
But, um, at least this kind of naive RL is also terrible from a bits per sample perspective, and here's what I mean, at least compared to supervised learning. So early on in training, let's say you have a, a vocabulary size for an LLM that is 100K long, so th- there's 100K possible, you know, tokens that one could answer. And you have a totally untrained model, and you have a prompt like, "The sky-"Is, um, with supervised learning, you, what, what, what would happen is that the model would have some probability distribution over all the things it could say. Um, there's a label that says actually the term here is blue, and it would figure-- it would learn basically through cross-entropy loss exactly how far its distribution is from correctly saying blue. Now, if you were doing this through RL, um, you would say the model would try, "The sky is halicon." Nope, that's wrong. "The sky is told." Nope, that's wrong. This is a totally untrained model, right? And so you would have to do this on the order of a hundred thousand times in order to just stumble on blue, then get some learning signal off of that. So if you're in the supervised learning regime, and you just get-- you have your distribution of probabilities, you get told, uh, that it's blue, and you figure out how far off you were, the amount you learn is, um, is a function of your pass rate. So, like, the further away you are from blue, the more you've learned to go towards blue, uh, u-using cross-entropy loss. And so you can think of it as, like, your pass rate, your, like, prior probability of having said blue. And, um, as a function of that, like, in supervised learning, uh, through cross-entropy loss, you would, you would learn negative log P, P being pass rate-
- EJEric Jang
Mm-hmm
- DPDwarkesh Patel
... uh, bits once you get this label. Whereas in RL, if you're just randomly guessing shit and seeing if it works or not, that's, um, that's just basically going to be the entropy of a binary random variable, which is-
- EJEric Jang
And what's also tough here is that actually the distribution that you're sampling under is your policy's distribution.
- DPDwarkesh Patel
Mm.
- EJEric Jang
Right. So, so it's like if your policy has no chance of sampling blue, then you will never get a signal.
- DPDwarkesh Patel
Exactly. Right, right. So that's, that's being modeled by the fact that your probability of sampling blue is extremely low. If you do sample it, you do learn as much as you would have learned under supervised learning. In all other cases, like, you know, ninety-nine point nine nine nine percent of in an untrained model, you're, um, you're just learning incredibly little from, like, seeing halicon is not the correct word or told is not the correct word.
- EJEric Jang
Yep.
- DPDwarkesh Patel
Um, and that's what happens most of the time. So you're just like, um, learn very little. So if you try to graph, um, if you put on the X-axis your pass rate, um, and, uh, here you put the, like, sort of the bits you, bits you're learning from a sample.
- EJEric Jang
Mm-hmm.
- DPDwarkesh Patel
If you have, like, zero percent here, fifty percent here, and a hundred percent here, so the end of training you're here. Um, if you have, um, supervised learning, negative log pass rate would look something like this. And then the, uh, entropy binary random variable would look like this. Um, and this is, uh, depending on whether you're doing natts or bits-
- EJEric Jang
Mm-hmm
- DPDwarkesh Patel
... uh, yeah, if you do bits, it's like one, right-
- EJEric Jang
Mm-hmm
- DPDwarkesh Patel
... here at the, at the peak. Um, this is like a coin flip. You learn the most from a coin flip.
- EJEric Jang
Mm-hmm.
- DPDwarkesh Patel
Uh, this is supervised learning. This is RL. However, the problem is you spend most of training in this regime, right? Like in the, in the low pass rate regime. And, um, in fact, of how fast you're learning is a function of how many bits per sample you're getting, uh, and you're getting very little signal here. If you chart the pass rate on a log scale, so you put the X-axis on a log scale where, like, at the beginning of training with a vocab size of one hundred K, the pass rate is one hund- one over one hundred thousand, then one over ten thousand, one over one thousand, uh, one over one hundred, and then, um... Okay, what this graph looks like here, where supervised learning would look like this, and, and then RL, if you just basically cr-crunch what I just sh-uh, showed there, it would look like that.
- EJEric Jang
Yeah. And arguably, you spend all your time here-
- DPDwarkesh Patel
Exactly
- EJEric Jang
... potentially e-never even getting a single success, right?
- DPDwarkesh Patel
Exactly.
- EJEric Jang
Like, uh, so, so it's, it's a sort of depressing plot in the sense that, like, once you're here, it's not at all obvious how you get to here.
- DPDwarkesh Patel
Yeah.
- EJEric Jang
Um, you know, once you're here, you have something-
- DPDwarkesh Patel
Yeah
- EJEric Jang
... but, like, you actually, in many RL problems, spend all the time here.
- 2:22:05 – 2:30:11
Automated AI researchers
- EJEric Jang
Sounds good.
- DPDwarkesh Patel
One thing I really wanted to talk to you about is that you did a bunch of the research for this project through this kind of automated, uh, LLM coding assistant loop, and, um, there's an idea that if you fully automated AI research, you could have some sort of singularity. Uh, obviously we're not there yet, but to the extent that we have early indications of what this process might look like, I am curious what your observations about, um, what the AI is good at, what it's not good at, what you think about this scenario's likelihood eventually, what thoughts you have about this in general.
- EJEric Jang
For sure, yeah. I think automated scientific research is one of the most exciting, um, uh, skills that, you know, that frontier labs are developing right now, and I think it's important for everyone who's doing any kind of research to get a good intuition of, like, what it can do now and what it can't, and how might the sort of science process work in the future once we're having AIs automating a lot of this, this investigation. Um, so, uh, in brief, I mostly use Opus four point six and four point seven throughout the, uh, working on this, and, um, what works is that the models can do a very good job of doing hyperparameter optimization.
- DPDwarkesh Patel
Mm.
- EJEric Jang
So in the past, people would kind of come up with a search base of hyperparameters like learning rate and, you know, weight decay and maybe how many layers are in your network, and, um, they would just kind of do a grid search or a sort of Bayesian hyperparameter optimization, uh, approach, and then it would find some tuned parameters. Um, the kind of really cool thing that automated, uh, you know, uh, coding can do now is that it can search a much more open-ended set of problems, right? It can say, like, "Well, um, I've identified that, like, the gradients are kind of small on this layer, so let me change it up here. Let me rewrite the code so the data later- data loader has a new augmentation I came up with. Let's, uh, let's, uh, sort of try to find the, the best way to kind of fit the constraints of the optimization problem." And, and you end up with this much more flexible and kind of high level, almost like grad student-like, uh, ability to just, you know, grind a performance metric.
- DPDwarkesh Patel
Mm.
- EJEric Jang
And, and so this can squeeze out quite a lot of performance. You can, you know, f- on a fixed dataset with a fixed time budget, um, improve perplexity by quite a lot on, on a sort of classification problem like LLMs or, um, or Go. Um, and, uh, it is also fantastic now at basically executing any experiment, right? So I have a, a Claude skill that I wrote called Experiment where, um, I give it a description of what I want it to plot, and like I just describe, "Here's the X-axis I want, here's the Y-axis. Answer this question for me." And it'll go run off and do all the experiments, compile the plot, make a report, and suggest like, you know, what might have caused it or, or so forth. Um, so, so that's what works quite well today, and I think we can expect that these abilities get better in the future. But it's also kind of useful to know, you know, what, what is it not doing so well today. Um, so on my, uh, blog version of this tutorial, I have a, a plot of basically all the kind of experiments I did grouped in a sort of tree where, um, you know, every node kind of represents a failed, successful, or sort of mixed experimental result, and then from there it branches off into a child where it's like the follow-on experiment. Um, occasionally I'll kind of rabbit hole down a track like this off-policy MCTS relabeling, do a few experiments, and then realize it's probably not worth it, so then I'll kind of jump to a completely different track, right? And I call these kind of things like rows, right? So, so what, what I find is that current, uh, you know, closed models that we can access, the public can access today, um-They don't seem to be that great at selecting what the next experiment should be in a given track, and they don't seem to be able to kind of step back and do the lateral thinking of like, "Wait a minute, this track doesn't really make sense. Like, let's go back to sort of first principles and, and think about, you know, what the bottleneck might be, or like what are we trying to achieve?" Right? And, and so often I had to catch infra bugs myself by prompting the right question to Claude to like investigate, you know, why, why, what, what is causing this discrepancy, and then it'll answer the question. Um, I think with like, you know, Mythos class models or Mythos plus plus models coming online, um, maybe this just completely changes, and these, these problems just fall to, to just improved scaling. Um, but at the same time, I think there's a lot of like rich opportunity to, um, develop RL environments that might incentivize this kind of lateral thinking. And, and so one of the motivations for setting up this Go environment was that I think that, you know, Go captures a lot of very interesting research problems, often overlapping with, you know, LLMs or robotics. Uh, and yet it's like very quick to verify. Um, the outer loop is ultimately like, does the agent do what I think it does? And, and you can just kind of check the outcome of a Go game quite easily. Um, and then the inner loop involves all this kind of like, you know, research engineering around distributed systems, uh, predicting whether an idea is gonna work or not, um, predicting the, you know, the difference a particular modification to your training algorithm might make. Um, and I think there's a rich library of subtasks and sub-environments that you can kind of train an automated scientist to work on, uh, with Go as a sort of outer verification loop that then once you acquire these skills, maybe you can apply them to like other domains like, uh, you know, um, biosciences or robotics.
- DPDwarkesh Patel
Or automating AI research.
- EJEric Jang
Or automating AI research.
- DPDwarkesh Patel
Which is, which is the real crux or the, um, scary/uh, incredible thing of just making AIs making future versions of AIs. And you're suggesting the outer loop here could just be your win rate against KataGo, basically?
- EJEric Jang
Um, that's one of them. Um, I think there's a lot of deeper questions that one could tackle, right? So, for example, um, let's say you have an idea on, um, how to improve a scaling law compute multiplier.
- DPDwarkesh Patel
Yeah.
- EJEric Jang
Um, the outcome isn't necessarily like I, I, uh, achieve the best Go bot ever. The outcome might just be like, can I predict what the win rate of my Go bot will be?
- DPDwarkesh Patel
Yeah, yeah.
- EJEric Jang
Uh, or can I predict the scaling law plots that emerge from my idea? But then you can verify that you haven't kind of reward hacked anything by using a very verifiable game like Go on the outer loop.
- DPDwarkesh Patel
I, I, I think there's a couple of interesting follow-on questions. There's questions on the inner loop and the outer loop. On the inner loop, there's a question of how locally verifiable any modification you might make is. That is to say, could-- Would you know whether something is actually improvement or a degradation, some idea you try out? Would you know that if something isn't working as a result of, um, a bug, or is it the result of the idea itself being wrong? Um, Ilya was talking about why having-- One of the reasons he, he thinks he's a good researcher is-- He is a good researcher. One of the things he thinks makes him a good researcher is that, um, he has intuition about-- He has strong belief in what the correct idea is, and he is able to persevere through bugs and know which things are bugs versus mistakes in the fundamental idea based on his high-level belief about this idea should work, so therefore it's, this, there has to be bug versus the other way around. Why don't we start with that question, actually? Yeah. How locally verifiable are things which are good ideas?
- EJEric Jang
Yeah. I think as in the case of the success story for deep learning, you can think about this as like a decades-long idea that took, like, took a lot of faith-
- DPDwarkesh Patel
Mm-hmm
- EJEric Jang
... to get it to work.
- DPDwarkesh Patel
Mm-hmm.
- EJEric Jang
Um, and so this presents a very challenging long-horizon, you know, RL problem where, you know, every, every step of the way you have like a committee telling you that this is a bad idea, and then ultimately you break it through, right?
- DPDwarkesh Patel
Yeah.
- EJEric Jang
And so, like, how do you design RL environments that maybe give you some feedback, uh, uh, earlier? Um, and, and I think this is a very tough open question that I don't have an answer to. But, um, but, you know, ultimately, to play a very strong Go bot, you probably did need to discover deep learning.
- DPDwarkesh Patel
Yeah.
- EJEric Jang
Right? And so, um, um, I think that, like, having a challenging game that cannot be, you know, cheated easily on the outer loop could be used as a sort of outer loop signal for something like discovering the principles of deep learning. Now, of course, like, to make it tractable, and this is where research taste really matters, um, like, you have to come up with ways to initialize your problem so that you don't solve a sort of very intractable problem, right? Like, like, maybe you can leverage LLMs as, as a sort of a universal grammar in the middle to kind of g-give, give you some sort of local feedback.
- DPDwarkesh Patel
Yeah.
- EJEric Jang
Um, um, the, the, the fact that LLMs are universal grammar means that they can kind of move at almost any level of the stack, right? They can think very locally as well as step back and think, like, in very broad steps. And I think that's where a lot of, uh, um, the, the lateral thinking ability of humans kinda come from.
Episode duration: 2:37:17
Install uListen for AI-powered chat & search across the full episode — Get Full Transcript
Transcript of episode X_ZVSPcZhtw
Get more out of YouTube videos.
High quality summaries for YouTube videos. Accurate transcripts to search & find moments. Powered by ChatGPT & Claude AI.
Add to Chrome