
Stanford CS230 | Autumn 2025 | Lecture 5: Deep Reinforcement Learning

For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/ai October 21, 2025 This lecture covers deep reinforcement learning. To learn more about enrolling in this course, visit: https://online.stanford.edu/courses/cs230-deep-learning To follow along with the course schedule and syllabus, visit: https://cs230.stanford.edu/syllabus/ More lectures will be published regularly. View the playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rNRRGdS0rBbXOUGA0wjdh1X Andrew Ng Founder of DeepLearning.AI Adjunct Professor, Stanford University’s Computer Science Department Kian Katanforoosh CEO and Founder of Workera Adjunct Lecturer, Stanford University’s Computer Science Department

Kian Katanforoosh, host

Oct 31, 2025 · 1h 45m

EVERY SPOKEN WORD

85 min read · 17,292 words

0:05 – 5:09
Why Deep Reinforcement Learning now: from Atari to ChatGPT alignment
1. Kian Katanforoosh
  Welcome to our fifth lecture in person for Stanford Deep Learning CS230. Um, today's lecture is gonna be about deep reinforcement learning. I actually switched, uh, the original plan of talking about neural network interpretability and LLM visualization, uh, simply because you, you haven't had the chance to study attention maps, um, uh, you know, convolutional neural networks, and so it would have been an overkill to do that week five. So we're gonna talk about neural network interpretability and visualization in a later lecture, actually. Um, but today, uh, our focus will be on deep reinforcement learning, uh, which is probably my favorite, uh, uh, lecture of, uh, the class. I, I feel like I say that every week, but it's okay. I like it. Um, the agenda is pretty packed. We're gonna start with, uh, deep reinforcement learning, which you can think of as the marriage between deep learning and reinforcement learning. Together, the baby is called deep reinforcement learning, and we're going to see how reinforcement learning works and how neural networks can play a part in building a reinforcement learning, um, agent. Um, in the second half of the class, we will focus on a very specific, um, you know, concept called reinforcement learning from human feedback that you might have heard of. It's one of the core concept that, uh, really made the difference between what, um, you might have remembered as GPT-2 and ChatGPT. You know, that's the leap. Uh, that's really the, the, the, the technique that had, uh, that has, you know, democratized, um, access to LLM because of the performance improvements and the alignment with humans. So we're going to see what, what is this concept of RLHF, and how, um, does it work, and why does it allow us to align a language model to human preferences. Ready to go? As always, let's try to make it interactive. 
Um, so the motivation behind deep reinforcement learning, and as usual, you're gonna have all the most important papers that are covered in the class listed at the bottom of each slide. Um, reinforcement learning has grown in popularity. Um, one of, uh, uh, the, you know, very popular papers called Human-Level Control through Deep Reinforcement Learning is the work, um, from, um, DeepMind, um, has showed us that a single algorithm/training method can allow us to train, um, AI that can play many, many Atari games better than humans. Single algorithm over forty, fifty games where it exceeds human capability, which is quite impressive when you thought about the fact that, you know, machine learning used to be niche, and you would have to train a really niche algorithm to perform different tasks. Here's an algorithm that can just learn sort of every Atari game. A little later, um, you might have heard of AlphaGo. AlphaGo is, um, is a, is an algorithm that was developed to beat and exceed human performance in the game of Go. We'll talk about it a little more. The game of Go is a very complex game. Um, some would argue way more complex, uh, than chess from a decision-making standpoint and from the, uh, possibilities, uh, that can happen on the board. And so, um, it, it actually got solved, um, in 2017, again by the DeepMind, uh, DeepMind team and, and David Silver's lab. Um, later on, and again, another great paper from DeepMind had showed us that reinforcement learning can also be used for strategy game that might be a touch more complex than, uh, chess or Go. That might actually involve multiple players playing with each other or against each other. Some of you might have played StarCraft, for example. That's an example of a game where, um, it requires a lot of lo-long-term thinking, short-term thinking. Another one is, uh, you know, Dota. Some of you might have played Dota or League of Legend, where you have a team playing against another team. 
Those are examples of games that involve multiple agents playing collaboratively, and it's pretty hard to develop systems that can play with each other against multiple opponents. Um, and finally, most recently, this is 2022, so alongside the release of ChatGPT, um, this paper that introduces the concept of reinforcement learning with human feedback applied to aligning language models with, uh, human preferences, and we'll talk about that later. So all, uh, all this to say that reinforcement learning allowed, um, um, us to exceed
5:09 – 13:44
Why supervised learning falls short for Go and other sequential decision problems
1. Kian Katanforoosh
  human performance in a variety of tasks. The first one, um, I want us to think about is the, the game of Go. So let's say that you were asked to solve the game of Go with classic supervised learning, okay? Everything we've seen together so far, labeled data. How would you solve the game of Go with classic supervised learning? What data would you collect? What would be the label, et cetera? Yes.
2. Speaker
  [faintly speaking]
3. Kian Katanforoosh
  Okay, good point. Yeah, you look at history of plenty of games, hopefully from good players. Train the game, the, the, the algorithm to work. Um, and you look at X as the input being the current state of the board, and Y as the next state of the board, and this would tell you what move was selected, and you learn the move essentially. And hopefully, if you do that across many, many games, you know, you might, you might see, uh, um, the, the agent become more attuned to the game and, and, and develop, uh, better strategies. So, you know, really hopefully it's a professional player. What, what are the disadvantages of that or the shortcomings that you can anticipate? Yes.
4. Speaker
  You might miss out on the space types of moves that players use and maybe, maybe, maybe go against some other set of moves that were formerly considered.
5. Kian Katanforoosh
  Yeah. Yeah, great point. You might not see the entire space of possible states of the board, which is what you said. So you might miss out on a lot of different strategies. So the game of Go is actually a game with two players, one player that uses the black stones and one player that uses the white stones, and iteratively, they're gonna place those stones on the grid, a thirteen by thirteen grid that you can see on screen, with the goal of surrounding their opponents. So you're constantly trying to surround the stones of the opponent, and the opponent is trying to surround your stones. And so you can imagine that for every intersection on the grid, there is multiple possibilities. Either there's a black stone or a white stone or nothing. And on a thirteen by thirteen grid, you can imagine how many possibilities of a board state there are. It's, um, impossible to capture all of that with historical moves from professional players. It will just never cover that. The same thing could be said in chess as well. You, you know that even the professional players can plan X number of steps in advance, but nobody knows where the game takes you. And in the late stages of the games or the end games, um, players always find themselves playing a different game, and that's part of the magic of being good at chess. Um, so yeah, that's a problem. What's another problem or shortcoming beyond the fact that we can't observe possibly all the states? Yes.
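To put a rough number on "how many possibilities", here is a small Python sketch. It assumes every intersection is independently empty, black, or white, so it gives only an upper bound: positions that are illegal under Go's capture rules are counted too. The function name is made up for illustration, and the 19×19 line is added for comparison because that is the standard tournament board size.

```python
def board_state_upper_bound(side: int) -> int:
    """Upper bound on board configurations: 3 choices per intersection."""
    return 3 ** (side * side)


for side in (13, 19):
    bound = board_state_upper_bound(side)
    # Number of decimal digits minus one gives the order of magnitude.
    print(f"{side}x{side}: about 10^{len(str(bound)) - 1} configurations")
```

Even the smaller board is far beyond what any archive of professional games could cover, which is the point being made about supervised learning here.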
6. Speaker
  You also can't, like, anticipate, like, what that action will lead to in the future. Like, you might not make the best decision at first, but later on.
7. Kian Katanforoosh
  Correct. Correct. You-- If I repeat what you said... Well, first, you don't even know if this was a good move, you know. So maybe it was not even a good move, and you're learning something that was not a good move, and you're labeling it as a good move. And second, um, you're actually only getting partial information, meaning you don't have the information of what's in the person's mind and what strategy they're trying to execute. So you're sort-- you're sort of looking at a single example among a long-term strategy, and you can't expect the model to guess what's the long-term strategy because it was just trained on X and Y and matching the inputs to a possible output. So you, you don't really have any concept of a strategy at that point. It looks one-off at every decisions of the model. Okay, those are really good points. Um, the other one is the ground truth might be ill-defined. What I mean by that is, um, even the best humans in the world do not play their best game every day, and even their best game is not the ground truth. And that creates an issue because you're essentially training against a target that is off by a certain margin. You're never gonna get better than the best human, and the best human is not the best possible, um, uh, existing the best possible strategy at every point. So you could argue, what if we get a panel of experts that we're monitoring, and those are the best players in the world? Even with a panel of expert that decides every move, you still have an ill-defined ground truth, you know? So that's a big issue. Too many states in the game you mentioned, and we will likely not generalize, which is what you said, meaning we're looking at one-off situations, we're not looking at entire strategies. And so when we face a board sta-state that we've never seen before, because the model was not trained on strategy, it sort of will get stuck, you know. Okay. 
  And this is an example of a perfect application for reinforcement learning, because reinforcement learning is all about delayed labels and making sequences of good decisions. So if you had to remember in one sentence what's RL, RL is making good sequences of decisions. Sequences of good decisions, sorry. And do that automatically. Another way to look at it is the difference between, you know, classic supervised learning and RL is in, uh, in classic supervised learning, you teach by example. In reinforcement learning, you teach by experience, which is also a different concept. You're not just showing cats and non-cats to a model, you're actually letting the model experience an environment until it figures out, uh, what were the best decisions it made and learns from them. Some examples of reinforcement learning applications, I'm gonna mention them. We, we have, we have gaming, of course, that we already covered. What are other applications of AI where we need good sequences of decisions? Yes. Autonomous driving? Yeah, correct. I mean, in driving, you could argue RL could work, and there's some RL going on. But what you mean, I think, is you're, you have some sort of a dynamic planning algorithm that allows you to strategize. If you see a, a red light ahead, you might start slowing down over time. But maybe it will turn green, so you might not slow down completely. This is an example of a strategy that you need, of course. Yeah.
8. Speaker
  Robot controlling.
9. Kian Katanforoosh
  Robot controlling. That's a great example, also related to autonomous driving. But imagine you, uh, wanna teach to a robot to move from point A to point B. The number of good decisions that the robot needs to make in terms of moving each of their joints is tremendous. Like, it's actually super unlikely that a robot would move from A to B if it's not trained to make good sequences of decisions. What else? Actually, the biggest one nobody mentioned yet. It's not a great application. I don't like it, but it happens to be the biggest one over reinforcement learning. Yeah.
10. Speaker
  Like, creating or making suggestions.
11. Kian Katanforoosh
  Yeah. Yeah, yeah. Advertisement. Yeah, marketing. You're right. So yeah, we talked about robotics. Advertisement is another example. Um, advertisement is a long game. Like, companies are showing you multiple ads before you buy, and in fact, the reason rein- reinforcement learning is important is because, you know, they're planning a strategy that might lead a buyer to execute a purchase over time, and it requires, uh, long-term thinking. So there's a lot of reinforcement learning
13:44 – 17:38
Core RL vocabulary: agent, environment, state vs observation, reward, transitions
1. Kian Katanforoosh
  applied to, uh, marketing, advertisement, real-time bidding processes, et cetera. Okay. Clear on what RL is and how it differs from classic supervised learning? No? Okay. Um, so let's put, uh, some vocabulary around that concept. In reinforcement learning, you have an agent, and the agent, uh, interacts with an environment. As the agent interacts with the environment, the agent will perform certain actions that we will denote At, where t is a time step. And the environment will show you states that transition from time step t to time step t plus one. So subject to an action At, an environment may transition from H-- uh, St to St plus one. You can think of the game of Go. I take the action of putting my black stone on a certain grid, uh, intersection, and the environment has changed. It moved from, uh... The state has changed. It moved from state time step t to time step t plus one, where my stone is on the grid. After that, um, state update happens, um, there's two things that the agent observes. The, the agent observes, um, an observation that we will note Ot and a reward, Rt. Okay? So those are the vocabulary words. And of course, the goal of the agent will be to maximize the rewards. One thing to know about the observation, we'll talk about it a little more. Um, the observation sometimes is equal to the state. Can someone guess why we might need two concepts instead of a single concept? Why is it important to have a state and an observation? Yes.
2. Speaker
  Different values for different outputs.
3. Kian Katanforoosh
  Yes, correct. So in some cases, uh, the environment may not be fully, um, uh, uh, you know, transparent to the user. And so, for example, in chess or in Go, uh, the observation is actually equal to the state. You see everything on your board. All the information is available to you. If you play League of Legends or StarCraft, uh, you know the concept of, uh, you know, I think in English it's called like a cloud or a fog. I think it's the fog. You only see certain parts of the map until you have explored everything or until your friends are, uh, sort of visiting the other parts of the map. And so the observation is actually, uh, less information than the states of, uh, the environment. Okay. And then the last piece of vocabulary is a transition. When I refer to a transition, I refer to the process of getting from state t to state t plus one, which means we're in state t, the agent takes an action At. It observes Ot and a reward Rt, and it transitions to the next state, St plus one. Question.
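As a rough illustration of this vocabulary, the sketch below bundles one transition into a record and shows an observation that hides part of the state. Everything here is hypothetical and chosen for the example only: the `Transition` field names, the `HIDDEN` marker, the one-dimensional toy map, and the `observe` helper with its visibility radius.

```python
from typing import NamedTuple, Tuple


class Transition(NamedTuple):
    state: Tuple[int, ...]        # S_t: full environment state
    action: int                   # A_t: action chosen by the agent
    observation: Tuple[int, ...]  # O_t: what the agent actually sees
    reward: float                 # R_t: reward received
    next_state: Tuple[int, ...]   # S_{t+1}: state after the action


HIDDEN = -1  # hypothetical marker for cells covered by fog


def observe(state, agent_pos, radius):
    """Mask every cell farther than `radius` from the agent."""
    return tuple(
        cell if abs(i - agent_pos) <= radius else HIDDEN
        for i, cell in enumerate(state)
    )


state = (0, 3, 0, 7, 5)  # toy 1-D map; the values are arbitrary
full = observe(state, agent_pos=1, radius=len(state))  # chess/Go-like: O_t == S_t
foggy = observe(state, agent_pos=1, radius=1)          # fog: distant cells hidden
```

With a large enough radius the observation equals the state, as in chess or Go; with a small radius the agent sees strictly less than the environment contains.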
4. Speaker
  Regarding competition, are there too much, uh, the statements are too large to, uh, once you've got the different training plus the
5. Kian Katanforoosh
  Uh, wait, what do you mean? You mean is there... Are there examples of, uh, environment where the state is so large that the-
6. Speaker
  You want to be the entire structure.
7. Kian Katanforoosh
  Okay.
8. Speaker
  For you to request something without alteration.
9. Kian Katanforoosh
  Yeah, possibly. For computational reasons? Yeah, yeah. You might have games. I mean, look at open world games. Like, truly, you, you could, you could argue, uh, I don't know, there are some games where you might press start, and you see the entire environment. But who cares of what's happening, uh, twenty thousand kilometers, uh, west of you if you're in a certain location? Uh, that might not influence your strategy, so you might actually put some sort of a, you know, trust circle or, like, some sort of a circle
17:38 – 20:42
Toy RL problem: 'Recycling is Good' MDP and rewards design
1. Kian Katanforoosh
  in which you observe, which you think has ninety-nine percent of the information you need, possibly for computational reasons. That's a good point. Okay, let's get to a practical example of a reinforcement learning algorithm and develop it together. Uh, this example is called Recycling is Good because recycling is good, but also because it's a simple example illustrative of reinforcement learning. So let's say we have a, a small, um, environment, uh, with, uh, five states. There is a starting state, uh, marked in brown, which is state two. It's our s-- It's our initial state. And then on the right side... Uh, sorry. On the left side, you have state one, which is, uh, garbage. And it's great to get to the garbage because you're gonna be able to recy-- to, to put in the garbage the, um, you know, the stuff that you have in your hands. You know, you're trying to throw away some garbage, and the garbage kinda happens to be there, and so we would expect there to be a reward. On the other side, if you actually go to the right, you might pass by state three, which is empty. You might pass by state four, where there is a chocolate, uh, packaging that is left on the ground that you can pick up, and, um, it's good to pick it up. And then on stage five, uh, state five, you have the recycle bin, which is more valuable than the garbage can because you can recycle, and you should get better rewards for that. So that's our game. In this game, we define a reward that is associated with the type of behaviors that we want the agents to learn, um, and the reward is as follows. That's just one example. Plus two for throwing your garbage in the normal can, uh, plus one for picking up the chocolate packaging, and plus ten if you manage to make it to the recycle bin. Is it clear? Now, the goal will be, and that's the case in, uh, reinforcement learning oftentimes, to maximize the return.
We define formally the return, but think about it as maximize the amount of rewards that you get as you go through this journey and you make your decisions. In this specific game, we have five states, and there's three types of state. In brown is the initial states. We have normal states, and we have in blue terminal states. When you get to a terminal state in reinforcement learning, it will typically end the game. It will end one episode of the game. We move to another episode. You'll get back to the starting state or initial state, and you'll redo another episode. The possible actions for our agent here are gonna be fairly simple, left and right. And we are gonna add an additional rule that is important, which is that the garbage collector, uh, comes in three minutes, and it takes a minute to get from one state to the other. Why is that an important rule to add to the game? Can
20:42 – 30:39
Discounted return and solving via a Q-table (backtracking through outcomes)
1. Kian Katanforoosh
  you guess? Yeah. Otherwise, you just go. Yeah. Otherwise, you just go back and forth between, uh, state three and state four. You just collect a bunch of, uh, chocolate packaging, and you never make it to the bin. And so, um, it's not what we want. Yeah. Okay. So how do we define the long-term return? The long-term return is gonna be defined, uh, as capital R, which is, uh, the sum of rewards with a discount. Um, discount is a very important concept in reinforcement learning. It's also a very natural, uh, concept to think about. Can you think of what, what the discount would represent in-- for humans? Do you have an example of, uh, what it could be? Yeah. The value of money and time. Huh? The value of money and time. Yeah, the value of money and time. Exactly. Uh, or the energy that a robot might have, things like that. Yeah. You, you would rather get, uh, you know, a dollar now than a dollar in ten years knowing that there's some inflation, for example. Uh, that's the example of a discount, and reinforcement learning is the same. Uh, you know, let's say you have a strategy that takes so much time, you need to discount it because your robot might lose energy as you're going through it, for example. Discounts can vary, you know, but they stay between zero and one. Um, so what is the best strategy to follow if, uh, gamma, the discount, is equal to one, meaning, uh, you know, time doesn't matter here if it's longer or shorter? Just wanna maximize the return. Best strategy to follow. Anyone give it a try? Someone who hasn't spoken yet. Yes. You could just, uh, bounce around forever. Bounce around, but remember the rule of, uh, three minutes. You can't bounce around because you, you will not get to the terminal state before the time allotted is done. But that would be a good idea if this rule was not true. What else could you do? Any idea? It's an easy one, no? Not too hard. Best strategy for gamma equals one, and give me also the maximum reward you would get.
  People are, are sleepy today, yeah. Recycle. Go to recycle. Go to the recycle. So right, right, right. Yeah. Yeah, that's right. Thank you. Right, right, right. And then what's your... Sorry. What's your, um, what's your total reward? Eleven. Yeah, that's right, eleven. So that's where we get terminal state, and we grab our reward of eleven. Very good. Now, assuming zero point nine for gamma. We're gonna complexify things a little bit. I'm gonna walk you through a very simple algorithm that, you know, allows us to sort of determine the best strategy, and we will put our numbers in a matrix. So for instance, um, we'll define a Q table, and Q stands, uh, you know, it's, it's a, it's a, it's a value function, um, where the, the, the name Q learning, Q star, you might have heard. Um, all of these things come from Q learning. And so let's say we have a Q table which has, uh, the size of number of states times number of actions. So five rows, two columns in our case. Every entry of the Q table is essentially representing how good it is to take action A in state S. Do you agree that if we had a table with these numbers, essentially we solved the problem? Meaning at any point, the agent can just look in the table. I am in state three. Let's look at column one. That would tell me the value of action one, and let's look at column two. It would tell me the value of action two. So I have everything I need to make my decisions. So that table is really the, the thing you wanna find in this exercise. Now, the way we will find the table is, uh, sort of using a backtracking algorithm where we might actually, uh, uh, codify the environment as a tree and traverse the tree. So here's what it looks like. I start in S2 and I have two options ahead of me. I can go to the left where I will get a reward of two. It's an immediate reward. The immediate reward is not discounted. It's an immediate reward. Remember the, the formula for R. The immediate reward R0 is not discounted.
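The "formula for R" referred to here is the discounted sum of rewards, with the immediate reward weighted by one and each later reward weighted by one more factor of gamma. A minimal Python sketch, assuming rewards are listed in the order they are collected (the function name is illustrative, not from the lecture):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t; the immediate reward r_0 has weight 1."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Right, right, right from state 2: rewards 0 (to S3), +1 (to S4), +10 (to S5).
rewards = [0, 1, 10]
undiscounted = discounted_return(rewards, gamma=1.0)  # 0 + 1 + 10
discounted = discounted_return(rewards, gamma=0.9)    # 0 + 0.9 * 1 + 0.81 * 10
```

With gamma equal to one this is just the plain sum, which is where the total of eleven comes from; with gamma below one, rewards collected later count for less.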
That would take me to S1. It's a terminal state, so there's nothing to do after. Second option, I go to the right and I get a reward of zero. That's my immediate reward and I end up in state three. State three is not a terminal state, so I can go and do the same exercise from state three. In state three, I have two options. I can go to the left where I would see a reward of zero and I will end up in S2, or I will go to the right and I will get an immediate reward of plus one. It's an immediate reward, we're not discounting it. I will end up in S4 and from S4, again, I have two options. Back to the left to S3 with zero reward or to the right with the amazing reward of plus ten and the terminal state of S5. So that's my map of immediate rewards. That's not my discounted return. So what we're gonna do now is we're gonna backtrack up the tree in order to compute the discounted returns. Actually, if I'm in S3 right here, I see that I can get an immediate reward in S4 of plus one, and I wanna compute my maximum return that I can get from when I'm in S3. My maximum return is that in S4 I could get a plus ten, right? But I need to discount that. My discount is zero point nine, so I multiply ten by zero point nine. What it tells me is that from S4 I can expect nine. Plus one, which I get as an immediate reward from moving from S3 to S4, I can update this number to ten, meaning from S3, the best you can hope for is a discounted return of ten, which is one plus zero point nine times ten. Everyone follows? Now let's do the same exercise one step before, uh, in S2. Uh, you know. Um, in S2, um, I have, um, an immediate reward of zero for going to S3 or an immediate reward of two for going to S1. Um, S1 is not gonna be worth it. We already know that because when I'm in S3, I can actually expect ten, which I have to discount. Zero point nine times ten gives me nine, plus zero immediate reward from S2 to S3. 
That tells me that the discounted return from state two, which is our initial state, is nine. You all follow? Just a simple backtracking. Now I can copy back this so S3, I know that when I'm in S3, um, uh, you know, I can expect a zero immediate reward to, um, uh, to... Sorry. If I, if I'm, if I'm in S2, I can expect, uh, zero immediate reward plus a discount times the plus nine that I could expect in S3. And so, uh, that gives me values that should cover everything that we have in this Q table. So I, I do that backtracking, I copy-paste all of that into my Q table all the way up here, and this is what I get. We essentially finish the game at this point. We, um, can look, uh, at a certain row. So let's say I'm in state number three. I look on the third row of that Q table and I see that I have two options. If I go back to S2, ultimately my discounted return will be eight point one, right? If I actually go to S4 on the right, I will get ten because I will get one plus zero point nine times ten, which is ten. So this is a toy example, but it tells you that if you were able to backtrack through the entire environment, you will be able to build a massive Q table and you will be able to give it to your agent to make its decisions. Yeah.
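The backtracking above can be sketched in a few lines of Python. Note the swap: instead of drawing the tree, this sketch repeatedly applies "immediate reward plus discount times the best value of the next state" to every state-action pair until the numbers stop changing (an iterative scheme usually called value iteration). It also assumes the three-minute rule is ignored and that the chocolate reward is paid every time state four is entered. All names (`step`, `solve_q_table`, `ENTER_REWARD`) are made up for the example.

```python
GAMMA = 0.9
STATES = [1, 2, 3, 4, 5]
TERMINAL = {1, 5}                         # garbage can and recycle bin end the episode
ENTER_REWARD = {1: 2.0, 4: 1.0, 5: 10.0}  # reward for arriving in a state
ACTIONS = {"left": -1, "right": +1}


def step(state, action):
    """Deterministic move; returns (next_state, immediate_reward)."""
    nxt = state + ACTIONS[action]
    return nxt, ENTER_REWARD.get(nxt, 0.0)


def solve_q_table(sweeps=50):
    # Q[s][a] starts at 0; terminal rows stay 0 because nothing follows them.
    q = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
    for _ in range(sweeps):
        for s in STATES:
            if s in TERMINAL:
                continue
            for a in ACTIONS:
                nxt, r = step(s, a)
                # Immediate reward + discounted best value of the next state.
                q[s][a] = r + GAMMA * max(q[nxt].values())
    return q


q_table = solve_q_table()
for s in (2, 3, 4):
    print(s, {a: round(v, 2) for a, v in q_table[s].items()})
```

Run as written, the row for state three should come out at 8.1 for left and 10 for right, and state two at 2 for left and 9 for right, matching the numbers quoted above.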
2. Speaker
  Why is it like that but with two more solutions?
3. Kian Katanforoosh
  Sorry, can you repeat? Yeah, here I'm simplifying. I'm not considering the time, uh, remaining. But in practice, um, you... If, if I remove the time component, so I remove the fact that there's a three-minute deadline before the garbage collector comes, then, uh, this would, uh, uh, be slightly more difficult because you would have to do a time series essentially of adding the discount times the reward that you collect. Yeah. But I'm simplifying here, and that's why I use the three-minute rule.
30:39 – 33:42
Bellman optimality equation and policy extraction from Q-values
1. Kian Katanforoosh
  Any question on the Q table? Super. Okay. So, uh, this was the Q table, and in fact, we can put together our strategy for gamma equals zero point nine. Uh, the best strategy is still the same. You go to the right, and, uh, you can expect a return of nine. Now, uh, one of the most important concepts in reinforcement learning is this equation on the board, uh, called, uh, the Bellman optimality equation. Oftentime, you'll see it's noted as Q star of state S and action A equals R plus gamma times the max of that same function applied to S prime, A prime. Let me explain this equation for you because it's super important. This equation is called the optimality equation because your optimal Q table will follow this equation. If you have finished the game, this equation can be applied to any state action pair, and it will still be true. The intuition behind why, um, the Bellman equation is the optimality equation is that, um, if you're in a-- if you have the perfect Q function, Q table, um, and you're in a certain state and you perform a certain action A, you will observe a reward, and this reward will, uh, you know, you, you have taken an action, so you would be in a new state. And from that new state, you can repeat what you just did, right? And because, uh, you've done the backtracking and stuff like that, you will, uh, get this equation to be true because it's the reward plus discount times the best next action that you could be taking. Does that make sense? Any question on that? That's exactly the backtracking that we did, by the way. Immediate reward plus discount times the best possible action that you can take in the next state, S prime. The last concept I cover in terms of vocabulary is the policy. The policy is the function that, given your state, is gonna tell you what to do. And in Q-learning, the way this policy is defined is argmax of Q star, um, across the action. So essentially what it says is, like, look in the table and look at a certain state S. 
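Written out, the spoken equation and the policy definition look roughly as follows. Here r is the immediate reward, s' the next state and a' an action available in s', following the wording above; the symbol π* for the policy is an assumed notation, since the transcript does not name it.

```latex
\[
Q^{*}(s, a) = r + \gamma \max_{a'} Q^{*}(s', a')
\qquad\text{and}\qquad
\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)
\]
```

In the toy example with γ = 0.9, taking "right" in state three gives r = 1 and leads to state four, whose best action is worth 10, so Q*(S3, right) = 1 + 0.9 × 10 = 10, the same number obtained by backtracking.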
You want the policy, which is what you should do. It's the function that tells you our best strategy. You just look at the two possible actions, which one has the highest Q value, and select that action. That's it. This is a very simple example, but it, it's the core of, um, Q-learning
33:42 – 36:46
Why Q-tables don’t scale and the shift to Deep Q-Learning (DQN)
1. Kian Katanforoosh
  that, you know, later on you will use policies widely. There's a lot of reinforcement learning algorithms, but this concept of understanding the policy, the function telling us our best strategy. In Q-learning, it's the argmax of the best Q value in a given state. It tells you which action to take. That's the core thing you need to understand. So remember this Bellman equation because we're gonna reuse it in a bit. The main issue, um, with this approach, um, of a Q table is that state and action spaces can be super large and having a matrix that you discover through backtracking, um, and where every time you wanna do an action, you have to look up the given states, the possible action, it becomes impossible. Like imagine you using this algorithm for the game of Go, where there's so many states, there are so many possible actions. You can put your stone anywhere on the board. You can imagine how big this matrix becomes and how impossible it, it is to use. So that's our problem, and that's the moment where deep learning, um, comes into play. So let's look at it. Um, the, the, the, uh-- Oh, actually, before I go there, I'm just gonna cover some vocabulary. We said the environment, the agent, the state, the action, the reward, the total return, and the discount factor. We learned all of that. We saw that the Q table is the matrix of entries representing how good is it to take action A in state S. And the policy is the function that tells us what's the best strategy to adopt, and the Bellman equation is satisfied by the optimal Q table. So let's get to deep Q-learning, which is what I was about to say is we are gonna frame the problem slightly differently. So instead of using a Q table, we're gonna use the fact that neural networks are universal function approximators, and we're gonna define a Q function that's essentially a neural network, so that the function can take a state S and an action A and tell you how good that action is in state S. 
So instead of a lookup in a matrix, you just run a forward pass in a neural network, and it gives you the answer. That feels like a better solution for games where there's a lot of states and a lot of actions. So here is, uh, the same problem statement. In the past, we looked for a Q table, and this time we will look for a neural network. One of the things we're gonna do is to define the output layer to have two outputs. So given a certain state as input, think about it as a one-hot vector encoding the state. So this one is the example of state two: zero, one, zero, zero, zero. If you pass state two in this Q function, uh, with multiple layers, it will give you two outputs. One output that corresponds
36:46 – 51:30
Training DQN using Bellman targets: creating labels from experience
1. KKKian Katanforoosh
  to, uh, Q of S action right and the other one Q of S action left because it's the two actions. If we had more actions to take, we would just increase the output layer, and we might have many more neurons in the output layer. So the big question is, how the hell are we going to train that network? Because we're not in classic supervised learning. We don't have labels. So this one is a hard question, but, uh, what do you-- what would you do? Given we, we don't have traditional X and Y pairs, how are you going to train this neural network? 'Cause remember, at the beginning, this neural network will give you garbage. It will take a, a state S, and it might tell you, "Go to the left or to the right," but it's completely random. So how are you gonna tune it to the level where it makes really good decisions? Yes.
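As a minimal sketch of the network just described, assuming five one-hot states, two actions, and one small hidden layer (the sizes, the weight names `W1` and `W2`, and the helper names are illustrative assumptions, not the lecture's code):

```python
import numpy as np

N_STATES, N_ACTIONS, HIDDEN = 5, 2, 8   # assumed sizes; actions: 0 = left, 1 = right

rng = np.random.default_rng(0)
# Randomly initialised weights: before training the outputs are meaningless.
W1 = rng.normal(scale=0.1, size=(N_STATES, HIDDEN))
W2 = rng.normal(scale=0.1, size=(HIDDEN, N_ACTIONS))

def one_hot(state_index):
    """Encode a state index as a one-hot vector."""
    s = np.zeros(N_STATES)
    s[state_index] = 1.0
    return s

def q_values(state_vec):
    """One forward pass: state in, one Q-value per action out."""
    hidden = np.maximum(0.0, state_vec @ W1)   # ReLU hidden layer
    return hidden @ W2

s = one_hot(1)                      # "state two" -> [0, 1, 0, 0, 0]
q = q_values(s)                     # q[0] = Q(s, left), q[1] = Q(s, right)
greedy_action = int(np.argmax(q))   # the action the untrained network happens to prefer
```

Adding actions only widens the output layer: `N_ACTIONS` becomes the number of output neurons.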
2. SPSpeaker
  Maybe you can assume based on some prior knowledge.
3. KKKian Katanforoosh
  Assume based on some prior knowledge. Tell me more. What?
4. SPSpeaker
  Like, basically, yeah, if you go to the problem, you have some idea of [continuing]
5. KKKian Katanforoosh
  So what are the things we know about this problem right now? What are the, the, the rules of the game that we could use in order to... I'm s-I'm seeing what you say. You're saying we could estimate what good looks like, but based on what?
6. SPSpeaker
  Like for each problem,
7. KKKian Katanforoosh
  Okay. So reward structure. You're saying that's one thing we have in every game. We have a reward structure for every state. That definitely should be used in order to estimate the good-- what a good decision looks like. Uh, yeah. The problem is not in every state you will see a reward, and if you look at many games of, like, Go, you might not see a reward until fifty moves.
8. SPSpeaker
  Yeah.
9. KKKian Katanforoosh
  So what do you do in this case? Yes.
10. SPSpeaker
  Can we run through a bunch of actions and states and see what the output is and get more data to train the neural net?
11. KKKian Katanforoosh
  Yeah. So you could... You're def-- You're actually, um, uh, bringing up a sort of tree search, right? You go down the tree, you do every possible action, and then you backtrack.
12. SPSpeaker
  Not every possible action.
13. KKKian Katanforoosh
  So which actions?
14. SPSpeaker
  Trying to spread out optimized.
15. KKKian Katanforoosh
  Okay. That's, that's-- We're getting there. So first possibility is we just go down the tree. In the game of Go, you could put your stone everywhere, so the tree already starts with your thirteen by thirteen, uh, options, and then it grows exponentially. Impossible. It's intractable. But what you said is, what if there are certain actions that are more likely than others? Do we actually need to explore the entire tree? What's this? Like, what are you using when you're saying that? How do you determine what action might be better than another one?
16. SPSpeaker
  Expected return.
17. KKKian Katanforoosh
  Expected return. We're getting close, yeah. But, you know, how do you know the expected return without, uh, going through the tree once at least?
18. SPSpeaker
  You can estimate the value of that cell.
19. KKKian Katanforoosh
  Okay. You can estimate it using what?
20. SPSpeaker
  Using optimization.
21. KKKian Katanforoosh
  Yeah, maybe. Yeah. What, what-- So that's exactly what we're gonna do actually, but we're gonna use the Be- the Bellman equation. 'Cause there are two things we know about this problem. We know the reward structure, which you brought up, and we also know that the perfect Q function will follow the Bellman equation. That we know as well. At the end, the Bellman equation should be respected, meaning for every state, if you wanna know the Q value of that state given an action, the way you will get that is you will look at the immediate reward plus the discount times the best Q value from the next state across all actions. That equation will be respected. So those are the only information we have, and we're gonna use them drastically to define our labels and sort of mimic a classic supervised learning approach. So here's what we have. We have our neural network. We have Q, S to the left and Q, S to the right that represent how good it is to go to the left in that state versus the right. And then, uh, I've pasted the Bellman equation on top right s- of the screen. We're gonna define a loss function. So let's say for the sake of simplicity, because those are scalar values that we'll use, uh, you know, uh, L2 loss, quadratic loss, um, that compares a certain label Y to, um, a certain, uh, Q value of a state in a certain action. So what we would like is to minimize this loss function, meaning Y and the Q value for a given action in a given state is as close as possible to each other. And we're gonna leverage the reward and the Bellman equation. So let's do, um, two things. Right now, we don't have a Y. So in supervised learning, you will have a picture of a cat. There is a cat. The Y is one or zero. Here, we don't have a Y, so we have to come up with an estimate of a good Y, at least better than random. 
So let's say at this point in time, when I send a state S in the network, it turns out that Q of going to the left is higher than Q of going to the right. Which means that today, at that moment, the Q function tells me it's better to go to the left than to go to the right. That is random at the beginning. It's completely random. Right? So what I'm gonna do is I'm gonna use as my target value Y, the immediate reward that I observe on the left, plus gamma times the best Q value that I can get, so the best action that I could take in the next step based on my current Q value. That's very important. So remember, this target is off. It's not a perfect target, but it's better than nothing. Meaning, not only it tells us, "Hey, there is a good reward to the left," we should consider that in saying that that might be a good move because we're seeing an immediate reward. But on top of that, we also know that at the end of training, the Q value should follow the Bellman equation, so why don't we set the target as the Bellman equation? So we add the discounted maximum future reward when you are in the next state. So you were in state S, you go to the left, now you're in state S next left, and you look again at your Q values, and you select the best one, then you add that number here. So there are actually two forward passes in that process, right? There's one forward pass where you send the state S in Q, and you look at the two options, left or right, and you're like, "Okay, I'm going to the left." And then you're like, "I'm gonna compare that value to a target Y." But to get that target Y, I need to do another forward pass. So I take my action left, I perform it, I get an S prime state, S next, and I send that S next into the Q network. I look at the two options I have, I pick the best one, and I add it here with the discount. So fundamentally, what's happening is, is the following, is we have a Q network that's random at the beginning. It has never observed the rewards. 
We just know that at some point it will get to the Q, um, uh, it will get to a perfect, you know, policy. It will get to a perfect Q function. But the best we can do right now is to say, as a guide to-- for our agent, we will look at the immediate reward, and we will look at the Bellman equation, which should tell us a better estimate than where we are right now, and we will try to catch up to that estimate. And then we do that again and again. So remember, every time your Q gets better, it gets better for the next state as well. So, you know, the Bellman equation tells you, estimate it with the second forward pass, and you just keep getting better and better as you're observing more rewards.
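The target and loss described in this turn can be written out in a few lines. This is a sketch with made-up numbers standing in for the two forward passes; the discount value and the function names `bellman_target` and `l2_loss` are assumptions for illustration.

```python
import numpy as np

GAMMA = 0.9  # assumed discount factor

def bellman_target(reward, q_next, gamma=GAMMA):
    """y = r + gamma * max_a' Q(s', a'), using the current network's estimate for s'."""
    return reward + gamma * np.max(q_next)

def l2_loss(y, q_sa):
    """Quadratic loss between the target y and the predicted Q(s, a)."""
    return (y - q_sa) ** 2

# Made-up numbers standing in for two forward passes of the Q-network.
q_s = np.array([0.7, 0.2])       # first pass on s:   Q(s, left), Q(s, right)
action = int(np.argmax(q_s))     # the network currently prefers "left" (index 0)
reward = 1.0                     # reward observed after taking that action
q_s_next = np.array([0.1, 0.5])  # second pass on s': Q(s', left), Q(s', right)

y = bellman_target(reward, q_s_next)   # 1.0 + 0.9 * 0.5 = 1.45
loss = l2_loss(y, q_s[action])         # (1.45 - 0.7)^2 = 0.5625
```

When the gradient is taken, `y` is treated as a fixed label even though it came from the same network; only `Q(s, a)` is differentiated.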
22. SPSpeaker
  Does this-- Or can this go into a loop in which you just connect straight, uh, or go to the forward of the next state, and you get pretty much how to refer to go left instead of go right, and you end up in a loop. You know, it's not good.
23. KKKian Katanforoosh
  Uh, how would it... So describe the loop. You-
24. SPSpeaker
  Right now, like, imagine if going to the right, like the next state, you want to go right.
25. KKKian Katanforoosh
  Yeah.
26. SPSpeaker
  And then next, we're going to arrive to, again, the target.
27. KKKian Katanforoosh
  Yeah, you would stop at that point. So what you-- Yeah, that's a good question. I, I'll show you how we fix certain things, but you do only one step, meaning you have your Q value at this point, and it tells you, go to the left, and you just wanna target Y. So what you do is you put left, and you look at your next state. You forward propagate your next state. You look at the two options, you pick the best. You don't go further. You just use that one step. You look one step ahead, essentially. You don't look multiple steps ahead. You could, but it would be more computationally heavy to do one more step again, and so on. So yeah. Um, yeah.
28. SPSpeaker
  Yeah. It seems like you're learning the Q function globally. Like, I, I appreciate some of the information about the forward and coming from, but the forward would be that step you took. So, uh, do people know, like, how fast that converges to the true, uh, Q function for steps, like, the states?
29. KKKian Katanforoosh
  Yeah.
30. SPSpeaker
  The current state of-
51:30 – 53:32
DQN pseudocode loop: episodes, timesteps, actions, and updates
1. KKKian Katanforoosh
  and you update the parameters of the network, and you repeat that process. Okay? Uh, here is concretely, if you were to code it in pseudocode, here is what it would look like to train an RL agent using Q learning. We start by initializing our Q network parameters. So initialization, it's random at first. Then we will loop over episodes. As a reminder, episodes are one full game from start to terminal state. Um, within an episode, we're gonna start from the initial state S, and we're gonna loop over timesteps until we reach a terminal state. So within one timestep, here is what we will do. We forward propagate the state S in the Q network. We will execute the action A that has the maximum Q value. We will observe a reward, and we will also observe a next state, S prime. We will use that S prime to compute our target Y by forward propagating S prime in the Q network and then computing our loss function. And based on that, we will use gradient descent to update the parameters of the network. Should be simpler looked at l-like that, right? Okay, so this is the vanilla Q learning. So to summarize again, the one diff-- the main difference is that we don't have a target, and we use our own network to estimate the target. And the rewards are what's gonna help us get better over time. By the way, it's okay if you don't understand everything. This is an entire class at Stanford, um, you know, an entire quarter of studying that type of stuff. So we're trying to get the basics within an hour and a half, two hours. Okay, let's go a little further now together, um, and
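The pseudocode walked through above might look like the following runnable sketch. Everything concrete here is an assumption made for illustration: a hypothetical five-state corridor environment (`step`), a Q-network reduced to a single linear layer on one-hot states (so `Q(s, .)` is just row `s` of `W`, which is effectively a Q-table), and a `MAX_STEPS` cap that the pseudocode does not have. Terminal-state handling and exploration are deliberately left out, as in the vanilla version.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2        # hypothetical corridor; actions: 0 = left, 1 = right
GAMMA, LR = 0.9, 0.1              # assumed discount factor and learning rate
START, MAX_STEPS = 2, 50          # assumed start state and safety cap per episode

def step(state, action):
    """Toy environment: both ends are terminal; only the right end pays a reward."""
    next_state = state - 1 if action == 0 else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state in (0, N_STATES - 1)
    return next_state, reward, done

def train(num_episodes=200, seed=0):
    rng = np.random.default_rng(seed)
    # Q-network = one linear layer on a one-hot state, so Q(s, .) is row s of W.
    W = rng.normal(scale=0.01, size=(N_STATES, N_ACTIONS))
    for _ in range(num_episodes):               # loop over episodes
        s = START
        for _ in range(MAX_STEPS):              # loop over timesteps
            q_s = W[s]                          # forward pass on s
            a = int(np.argmax(q_s))             # act greedily
            s_next, r, done = step(s, a)        # observe reward and next state
            y = r + GAMMA * np.max(W[s_next])   # target from a second forward pass
            W[s, a] += LR * (y - q_s[a])        # gradient step on 0.5 * (y - Q(s, a))^2
            s = s_next
            if done:
                break
    return W
```

Because the agent always acts greedily, which Q-values ever get updated depends entirely on the random initialization, so this version can settle on a poor policy.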
53:32 – 1:00:50
Breakout case study: defining inputs/outputs and practical preprocessing
1. KKKian Katanforoosh
  apply that to an actual game. So here's the game. It's called Breakout. We wanna destroy all the bricks. Who has played Breakout in the past? About a few. Okay, good. So you have a, a paddle that you control, and you are trying to destroy the bricks. If the ball gets past your paddle, you lost. And if the bricks are all destroyed, you won. That's it. Let's do it together. What, um, what is the input of our Q network? What would you use as input? If you remember in... Yeah. Yes.
2. SPSpeaker
  Entire screen.
3. KKKian Katanforoosh
  Entire screen. Okay, let's do that. So I take-- I define that as the state S, which is the input to my Q network. Uh, what's the output of the Q network? Yes.
4. SPSpeaker
  Do you do that on screen?
5. KKKian Katanforoosh
  Good question. We'll get there. I'm a- I'm gonna ask you, um, but do we have to look at the full screen? The answer is no, but we'll see why. Uh, what's the output? Yeah.
6. SPSpeaker
  Game score.
7. KKKian Katanforoosh
  The game score? Uh, no, but we're gonna talk about the game score. In the back.
8. SPSpeaker
  [background noise]
9. KKKian Katanforoosh
  Yeah. Let's, let's talk about the output first, and then we'll talk about the stuff we can, uh, get rid of on the inputs. But what's the output? Yeah.
10. SPSpeaker
  It's the movement of the, like-
11. KKKian Katanforoosh
  The actions?
12. SPSpeaker
  Yeah.
13. KKKian Katanforoosh
  Yeah, the actions. So y- yeah, it will be the Q values associated with the actions in state S. Remember, it's a Q function, so the output is... We need one value for left, one value for right, and one value for idle. You could make this game more complicated and say, "We have eight actions. We have a little bit to the left, a lot to the left, a lot more to the left," you know, and if you add multiple buttons. But let's simplify and say three actions. Either you don't move, you move to the left, or you move to the right, so these are the outputs. So now let's get to the question of the screen. Do we need the entire screen? So you were saying something earlier.
14. SPSpeaker
  [background noise]
15. KKKian Katanforoosh
  Okay. So you say you need the tray and the bricks. I would argue, uh, you need more because there's the walls and... I guess that you could-- if you're an expert player, you could know where the walls are, but generally, you need a little more than that. Well, what, what, what would be obviously things we can get rid of, and why would we do that?
16. SPSpeaker
  The background, probably the blocks at the top.
17. KKKian Katanforoosh
  Okay. The score, the score at the top. Um, who would remove the score at the top? About half? Why would you not remove it?
18. SPSpeaker
  Like, why would you remove it or not?
19. KKKian Katanforoosh
  Why would you remove it?
20. SPSpeaker
  Because the point of not playing the score is you're still trying to-
21. KKKian Katanforoosh
  Okay. You wanna always win, so the score doesn't matter. It's true. We would remove the score. So you, you could actually crop the top. You could also crop the bottom. I mean, if you pass the paddle, you don't care about the few pixels at the bottom. You could get rid of them. Um, this is not always true. There are games where the score matters. And in fact, you know, I like football, soccer. The, the, um-- in soccer, if you're one-zero up, you can park the bus. So your strategy is dependent of the score that you have. Like, you wouldn't park the bus if you're losing one-zero. Parking the bus meaning you ask every player to come back and defend. If you're losing, you would actually do the opposite. You will go all-out attack. So in certain games, you want the scores. In others, you don't want. And so it's part of the designer, the, the, the AI engineer that's working on that to determine what information we need and what we don't need. What else could we do to reduce the dimensionality of the problem and make our computation faster?
22. SPSpeaker
  Remove those, like, channels, probably.
23. KKKian Katanforoosh
  Yeah. Remove the RGB channels, so do grayscale essentially. That's true. Here, you actually don't need the colors. It's just nice as a user for user experience purposes. You don't need... I don't think there's different points based on the bricks that you destroy. It's all the same. Um, they're... Actually, funny enough, this algorithm was used by, um, uh, DeepMind to play a lot of Atari games, and they did a single preprocessing where they removed the channels because they said it doesn't matter. Turns out in one of the games, I think it was SeaQuest, I forgot which one, uh, the fish disappeared when you did that. And so that game didn't work. The, the agent couldn't crack it, uh, because they thought that the same preprocessing could apply to every game, but actually, they had to make a slight tweak.
24. SPSpeaker
  I think, uh, if you really want to keep it, you could just keep it for-- because for a campaign, because if you have multiple blocks and you have a one or a zero block, there's more parameters. And then you also just want to keep the game simple.
25. KKKian Katanforoosh
  Correct. You-- So just to recap, you could do it even better by using a, a low-dimensional representation of this game that describes the game. It's true, but because we wanna use a single algorithm for fifty-plus Atari games, we'll say the human sees the screen, we just give the screen, and it will probably scale better essentially. But you're perfectly right if you were working on only that game. Okay, so let's do that. We'll, we'll do preprocessing. There's one last thing that nobody mentioned, which is history. Because in fact, if you get only one screen, you don't know where the ball is going. So actually, you can't solve the game. And the way you fix that is by giving a history of multiple screens. For example, four screens, so that you see the direction that the ball is going in. So our preprocessing function is, you know, called, uh, phi of s, let's say. And phi of s is a mix of, um, you know, you might do convert to grayscale, reduce the dimension, the height and width, and also add the history of four frames. And that should be enough. Turns out in most games, you will need the history, a little bit of history to know where the ball is going or where-
26. SPSpeaker
  In this example, you could also include the, uh, like, velocity vector of the ball.
27. KKKian Katanforoosh
  Yeah, you could, you could replace... Exactly. You could replace, um, the history, so multiple screens, by just adding the gradients or the velocity of where the ball is going. That's true, but would it scale to every game, you know? Turns out this, because we know humans look at the Atari machine and they look at pixels, this would be more likely to scale to every game. Yeah. Like, think about a game where... Actually, SeaQuest is a good game, or Space Invaders, where you have multiple enemies coming at you. Then you would need to change your preprocessing to take into account the velocity of all these enemies, so it wouldn't work the same way. While if you actually give the pixels, you actually, from the pixels, get the velocity of all your enemies and the directions they're going. Okay, so this is our preprocessing. I'm gonna refer to it as phi of S and, um, our deep Q-network architecture, because we're working with pixels, is going to be, um, a convolutional neural network. Don't worry if you haven't learned it yet in the class, but it's a bunch of conv and ReLU
1:00:50 – 1:10:16
Stabilizing and improving DQN: terminal handling and Experience Replay
1. KKKian Katanforoosh
  activations. Um, and then we end with a fully connected, uh, layer, uh, that gives us the three Q-values for, uh, the different actions. So nothing special here. Now, uh, we're gonna go back to our vanilla training, so this one that we saw together earlier, and we're gonna look at tips to train reinforcement learning algorithms. Those tips are not specific to Q-learning. Some of them are applied to a lot more than Q-learning, and they're very important to know. Um, and they're part of the reason reinforcement learning has worked better in the last few years. Um, so one of the things that's pretty simple that we forgot to do, um, is the preprocessing that we just did. So anywhere I had an S, I'm gonna instead run S through the preprocessing step. I'm gonna use phi of S. So I initialize instead of S with phi of S. Uh, I start from the initial state phi of S, and then I forward propagate phi of S. I, um, I get the Q of that preprocess state in action A and et cetera, et cetera. And then when I get my next state, so let's say I look at my current preprocess state, I forward propagate it once, I see the three Q-values, in our case, the two Q-values. One of them was better than the other. Action right, then I get my next state, S prime. I wanna preprocess that state as well. Yeah? So that's pretty straightforward. You just replace all of that. The second thing we forgot to do is to keep track of the terminal states. In our pseudocode, there is no concept of terminal states. It's pretty easy to add. You would probably just do an if-else statement. You would create a Boolean to detect terminal states. So let's say your Boolean is terminal equals false, and then as you loop over the time step of a single episode, every time you're gonna check, "Is the state that I'm going in based on the action I'm taking a terminal state?" If it's a terminal state, then get out of the loop. You know? There's nothing else after. 
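A rough sketch of what a preprocessing function phi of S could look like, combining the steps discussed for Breakout: crop, grayscale, shrink, and a four-frame history. The crop sizes, the downsampling factor, and the names `preprocess_frame` and `FrameStack` are assumptions for illustration; only the history length of four comes from the lecture.

```python
from collections import deque
import numpy as np

HISTORY = 4                      # number of stacked frames, as in the lecture
DOWNSAMPLE = 2                   # assumed factor: keep every 2nd pixel
CROP_TOP, CROP_BOTTOM = 20, 10   # assumed rows holding the score / below the paddle

def preprocess_frame(rgb_frame):
    """Crop, convert to grayscale, and shrink one (H, W, 3) uint8 screen."""
    cropped = rgb_frame[CROP_TOP:rgb_frame.shape[0] - CROP_BOTTOM]
    gray = cropped.mean(axis=2)                    # simple channel average
    small = gray[::DOWNSAMPLE, ::DOWNSAMPLE]       # naive subsampling
    return (small / 255.0).astype(np.float32)

class FrameStack:
    """Keeps the last HISTORY processed frames so motion is visible."""
    def __init__(self):
        self.frames = deque(maxlen=HISTORY)

    def phi(self, rgb_frame):
        frame = preprocess_frame(rgb_frame)
        if not self.frames:                        # first screen of an episode: repeat it
            self.frames.extend([frame] * HISTORY)
        else:
            self.frames.append(frame)              # oldest frame drops out automatically
        return np.stack(list(self.frames), axis=0) # shape (HISTORY, H', W')

stack = FrameStack()
screen = np.zeros((210, 160, 3), dtype=np.uint8)   # Atari-sized dummy screen
state = stack.phi(screen)                          # this is what goes into the Q-network
```

Averaging the channels is exactly the kind of step that can erase objects distinguished only by color, which is the failure mentioned earlier for one of the Atari games.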
The one thing that you need to be careful of is if it's a terminal state, then your target is not the Bellman equation. It's just the ter-- the immediate reward. Remember, you get to the terminal state, you get a reward of ten, there's no Bellman equation to apply. It's just ten. It's immediate reward, there's no discount, et cetera. Okay, so these are fairly easy changes. Now, we're gonna look at a new method that will enable more data efficiency. It's called experience replay. One of the... A cou- a couple of issues with the way we've been training so far is, uh, one, the correlation of successive screens. It's like imagine in the Atari game, you have the ball that's in the top left corner, and it's traveling to the bottom right of the screen. You have, like, many, many time step that are essentially the same. It's all the ball, uh, traveling in the same st- in the same place. So you're actually training repetitively on something that is not that meaningful. You don't need to just train on a batch. The equivalent in supervised learning is, let's say, you're trying to differentiate cats and dogs, and you train on a mini batch of cats, then you train on a mini batch of dogs, then you train on a mini batch of cats, then you... It, it will never converge. It will just index too much on cats and then index too much on dogs. So you wanna add some sort of a, um, a experience replay concept that we'll see in order to create more mixes in the data and get more diversity. The other, uh, thing that is important is, in our current training process, we are not reusing our data. Like you experience something, you immediately train on it, you never see it again, unless you re-experience the same thing sometimes in the future, which might or might not happen. Experience replay is gonna help us to keep experiences in memory and maybe retrain on them on a regular basis so that one experience might be useful multiple times. Which intuitively makes sense. 
Like maybe you do an experience, you get an amazing reward, and you don't wanna forget it. You wanna retrain the model on it on a regular basis. It's more data efficiency. So here's what it looks like. Um, the current way we were training was we're in a state, I'm just gonna say state instead of preprocessed state, but it's preprocessed. We're in a state S, we perform action A, we get a reward R, and we get into the next state. From that next state, we perform another action, A prime, um, we get a reward, R prime, and we get into S second, and so on, you know, and so on. And each of these would be called one experience. It's one iteration of gradient descent. It's one experience. So right now, we're training on these experiences. So the training looks like I train on E1, I update my parameters. Then I train on E2, I update my parameters. Then I train on E3, update my parameters. Those are highly correlated because they're part of the same episode. And as I was saying with the ball traveling in one direction, that might actually not be that helpful to train on all of these. You know? So instead, what we'll do is we'll use experience replay, where we will collect our first experience, but instead of training on it, we will put it in a memory called the replay memory D. We put it in there, and then at every step, we will sample from that memory to decide what to train on. So of course, at the beginning, if we just have one, uh, experience in the memory, we will train on that experience. But over time, you will see that we get, uh, more diversity and reuse out of our experiences. So for example, let's say I experience E2, I put it in the memory, and then instead of training on E2, I'm gonna randomly sample from the memory. I might get E1 or I might get E2. Then I experience E3, and I put it in the memory, and I might get one of the three. You know. This is the vanilla experience replay. 
In practice, there are more methods like prioritized sweeping, which might tell you which experience you wanna weigh. Maybe some experiences had higher gradients, so you wanna prioritize them more often. You know, things like that. So all in all, this is what our training looks like with experience replay. We experience, um, E1, we train on E1. Then we-- The next training iteration is not on E2, it's on a sample from E1 and E2, either or. The third experience is then put in the replay memory, but we don't train on it. We train on a sample from whatever is in the replay memory, and we repeat, and that is more sample, uh, more efficient, uh, allows more reusability and less cross-correlation in our, um, training batch. Okay? So that's called replay memory. And you can use it with mini-batch gradient descent. Note that you're still experiencing the direction that the game is played. Like, we still go and take the action as expected. We just don't necessarily update our model parameters based on the action that we ended up taking. We put it in the replay memory. We may train on it later. Okay. Um, so here is how it modifies our vanilla, uh, setup. We've added an experience from state S to state S prime to the replay memory. You know, like, let, let me walk you through it again. Within one time step, we forward propagate our state into the Q network. We execute the best action given the Q values. This gives us a reward and a next state. The next state is preprocessed, and then instead of training on that, instead of training, we just add that transition to the replay memory, and instead we sample randomly a mini batch of transitions from the replay memory, and we train on those. And we redo the same thing again and again. Yes.
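A minimal sketch of the vanilla replay memory D with uniform sampling. The fixed capacity, the `Transition` fields, and the class name `ReplayMemory` are assumptions added for illustration; the lecture only specifies storing experiences and sampling from them at random.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", "state action reward next_state done")

class ReplayMemory:
    """Fixed-size buffer D; the oldest transitions are dropped when it is full."""
    def __init__(self, capacity=10_000, seed=None):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, *transition):
        self.buffer.append(Transition(*transition))

    def sample(self, batch_size):
        # Uniform sampling without replacement; early on, return whatever exists.
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=3, seed=0)
for t in range(5):                       # E1..E5; capacity 3 keeps only the last three
    memory.add(t, 0, 0.0, t + 1, False)
batch = memory.sample(batch_size=2)      # train on these, not on the newest transition
```

A prioritized variant would replace the uniform `sample` with one weighted by how informative each transition was.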
2. SPSpeaker
  Does the replay memory, like, bias towards the start of the game because you sample from everything? So aren't you more likely to train at the start of the game because you're learning something like chess, like you would sometimes wanna-
3. KKKian Katanforoosh
  Yeah
4. SPSpeaker
  ... do it than the end game?
5. KKKian Katanforoosh
  Yeah, you, you, you would. Uh, you would within one episode, but, you know, uh, if you, if you play multiple chess game, uh, your replay memory will get already bigger. So then you would see some end games, some middle of the game, some early games. Yeah. Good. And in practice, it's actually useful because, uh, you might imagine that in a chess game, you know, all of us, uh, um, um, let's say if you're a beginner, you, you see a lot of beginning of the games. You actually... People that are beginners, they are good at openings, but they're bad at end games because they don't get to play a lot of end games. Uh, well, that type of approach could be useful. You can retrain on end games more often and, you know, the, the, a more advanced version of the replay memory would also weigh the experience in the replay memory based on how much the gradient is going to be. So if you have an experience that actually was super insightful, you can weight higher so that you, you, you prioritize grabbing it
1:10:16 – 1:19:58
Exploration vs exploitation: epsilon-greedy to avoid local traps
1. KKKian Katanforoosh
  and retraining on it, essentially. So let's say you blunder in chess, you might actually wanna resee that blunder later so that you don't do it again, let's say. Um, okay. So these were all the different methods. Another one that's very, um, intuitive and very important is when during the training process, our, um, agent gets stuck. We get stuck in a local minimum. Uh, here is how it would work in practice. Uh, you start in initial state S1, and you have three states ahead of you. If you take action A1, you go to state two, which is a terminal state, and you get a reward of zero. If you take action A2, you get to S3, also a terminal state, and you get a reward of one. And if you take, um, action A3, you get to state four, terminal state with a reward of a thousand. So of course, to us, it's obvious that we would wanna explore the state number four. It's pretty obvious. In practice, let's say you update, you, you initialize your network, and in the first forward pass, that's what you get. First forward pass, the network is random. You get Q value for action one, point five, for action two, point four, for action three, point three. What does that mean? It means the agent is saying, "I'm gonna go to action one." So I take action one, and I see an immediate reward of zero, right? Because it's a terminal state, the Bellman equation thing doesn't happen. I just have the immediate reward, which becomes my target Y. And so I perform a gradient descent update to say this Q value should have been zero. So I convert this Q value to zero. Now, second try. This time, the Q value is saying, "Take action two." It's the highest Q value. I take action two. I have an immediate reward ahead of me. That's one. Because it's a terminal state, there's no second discounted future reward term. Um, so I just take Y equals one. I perform my gradient descent update, and this converts to one. And then third time, the agent is still saying, "Go to A2. 
Go to the-- Uh, take action A2." Reward of one. Good. You-- That's what you predicted. Nothing to do, just keep going. We're done with training. We're stuck. We never visit the state we actually wanted to visit. Okay? So that, that, that wouldn't work for us. We will never visit that state using our current algorithm. Does that make sense why we wouldn't ever visit that state? Um, in practice, this is a big issue. The analogy of this, uh, concept, uh, uh, of exploration versus exploitation is when every day you take your bike and you cross, um, campus, you have a favorite route, and turns out that the more you take that route, the better you get every time. Like, you get a little faster, maybe your turn is faster or something, or you can predict how many people are gonna be at that roundabout, and you know how to take it in the wide way, so you go faster. We've all done that. Um, that's exploitation. You exploit what you already know, and you get better at it. But maybe there's another route that you're not thinking of that's pretty... Instead of going north from campus, you go south, and maybe it might be better. You will never see because you don't have the courage or the patience to do it. Uh, that's the difference between exploration and exploitation. In practice, a good model would be able to handle both, to exploit when it... to exploit, to explore when it needs to explore. The way we do it in practice in our pseudocode is to inject some randomness. So, for example, um, when we are looping over a time step, with probability epsilon, let's say five percent, take a random action. So from time to time, on average one time every twenty times, you take a random action, it will allow you to visit maybe a new path. The analogy in chess is, you know, you might use a creative move from time to time that might be worse today, but might allow you to learn something and to get better over time. Yeah.
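The randomness injection described above can be sketched as a small action-selection function. The Q-values below reproduce the stuck situation from the three-action example (zero for A1, one for A2, and the untouched point three for A3); the function name `epsilon_greedy` and the sample count are assumptions for illustration.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the argmax."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q = np.array([0.0, 1.0, 0.3])            # the stuck situation: A2 looks best, A3 never tried
picks = [epsilon_greedy(q, epsilon=0.05, rng=rng) for _ in range(10_000)]
share_a3 = picks.count(2) / len(picks)   # roughly epsilon / 3 of the time
```

With epsilon at five percent and three actions, A3 is tried on roughly one step in sixty, which is rare but enough for the agent to eventually observe the reward of a thousand.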
2. SPSpeaker
  Um, in that, uh, the example you just covered, couldn't we resolve that by just setting the initial, um, Q values to infinity?
3. KKKian Katanforoosh
  Couldn't we resolve this problem by setting the initial values to infinity? Well, the problem if you set the initial values to infinity... So you would say instead of randomly initializing your network, you initialize it in a way that the outputs are equal to infinity?
4. SPSpeaker
  Yeah, so that we wouldn't get the issue of where, like, the Q value of action-
5. KKKian Katanforoosh
  Yeah
6. SPSpeaker
  ... from state one is like-
7. KKKian Katanforoosh
  Well, in practice, if the three Q values are infinity, then you can't make a decision on the spot, so you're saying just pick one randomly? Because if the three are infinity, you, you can't decide which one to take, right?
8. SPSpeaker
  Oh.
9. KKKian Katanforoosh
  And also, if it's infinity and the reward is one, I mean, if it's a really large number and the reward is one, your gradient is gonna be massive, right? I guess the loss function is gonna be massive and, I don't know, I imagine it would be really hard to train it. But in practice, you start with a random initialization because this might be one example but, you know, in the game of chess, the reward might be one at the end and zero the rest of the time, or maybe the reward is a thousand at the end, and when you lose your rook, it's a negative reward. You can't predict what the reward structure is gonna be. You want an agent that is able to adapt to it, and it's better to find a method that can scale to different environments, essentially. Okay. So this was epsilon-greedy action selection, which is adding some randomness: with probability epsilon, take a random action. Okay? So adding all our techniques, because we're getting good at training reinforcement learning algorithms, this is what we have. We initialize our Q network parameters. We have a random network. We initialize our replay memory, D, and then we loop over episodes. We start from an initial state. We create a Boolean that allows us to detect terminal states. With probability epsilon, we're gonna take a random action. Otherwise, we're gonna follow what we know, which is forward propagate the state in the Q network and take the action that has the highest Q value. That allows you to observe a reward and the next state. Add that transition to the replay memory, sample from the replay memory, and then train on that sample. In the process, you will need to do another forward pass because you need to estimate your target Y using the Bellman equation: the immediate reward plus the discounted future reward. 
Okay. Are you experts at Q-learning? Okay, good. Sounds good. And here is where we get at the end. You can claim proudly you have trained an Atari agent. It's not that complicated, as you can see, other than the Bellman equation piece. Turns out the agent has discovered that it can send the ball to the back, and it's actually much easier to finish the game like that, which is quite interesting. You know, a good player would know that you can dig a tunnel, and you can finish the game without too many issues. Yeah.
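The loop just described (epsilon-greedy action, replay memory, Bellman target with terminal handling) can be sketched end to end on a toy problem. Everything specific here is an assumption for illustration: the five-state chain environment, the hyperparameters, the annealed epsilon schedule, and above all the "network", which is just one trainable number per state-action pair updated by a gradient step on squared error. It shows the structure of the loop, not the Atari setup.

```python
import random
from collections import deque

N_STATES, GOAL, GAMMA, LR = 5, 4, 0.9, 0.1
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Hypothetical chain environment: reaching GOAL ends the episode with reward 1."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def train(episodes=300, batch_size=32, max_steps=50, seed=0):
    rng = random.Random(seed)
    # Stand-in for the Q-network parameters: one value per (state, action).
    q = [[rng.uniform(-0.01, 0.01) for _ in ACTIONS] for _ in range(N_STATES)]
    memory = deque(maxlen=5000)  # replay memory D
    for ep in range(episodes):
        epsilon = max(0.05, 1.0 - ep / 100)  # assumed schedule: explore a lot early on
        state, done = 0, False
        for _ in range(max_steps):
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # random action
            else:
                action = max(ACTIONS, key=lambda a: q[state][a])  # greedy action
            nxt, reward, done = step(state, action)
            memory.append((state, action, reward, nxt, done))
            # Train on a random sample of past transitions, not only the latest one.
            batch = rng.sample(list(memory), min(batch_size, len(memory)))
            for s, a, r, s2, d in batch:
                # Target y: immediate reward, plus discounted future reward unless terminal.
                y = r if d else r + GAMMA * max(q[s2])
                q[s][a] += LR * (y - q[s][a])  # gradient step on 0.5 * (y - Q)^2
            state = nxt
            if done:
                break
    return q


q_values = train()
greedy_policy = [max(ACTIONS, key=lambda a: q_values[s][a]) for s in range(GOAL)]
print(greedy_policy)
```

After training, the greedy policy should move right from every non-terminal state, and the Q-value of the final move should sit at the terminal reward of 1.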
10. SPSpeaker
  How do you quantify the results of the actual deep reinforcement learning?
11. KKKian Katanforoosh
  How do you quantify when, uh, when the game has ended?
12. SPSpeaker
  How good the model becomes after it trains.
13. KKKian Katanforoosh
  Yeah, well, first you would start seeing the model get to good rewards as it plays. Like, it manages to get really good rewards, while earlier it might not, you know? And so that's probably your best guess for how good the model is. In practice, if you're AlphaGo, you can also test it against the best humans in the world, and you can observe that they're losing against the model.
14. SPSpeaker
  But I'm saying you have, like, a bunch of different chess engines and some of them are way, way better. Different structures. Maybe they're both based on reinforcement learning, and at the end, they maximize both of their rewards.
15. KKKian Katanforoosh
  Yeah.
16. SPSpeaker
  So how do you know which model is actually doing better?
17. KKKian Katanforoosh
  Or you can get them to play together.
18. SPSpeaker
  So you have no idea if you could have one?
19. KKKian Katanforoosh
  No. You could actually monitor the loss function and look at whether the Bellman equation is respected. If the Bellman equation is respected, then your model is really, really good. And then we're gonna see an example of competitive self-play, where you get the model to play against other models, and then over time, as you watch them play thousands and thousands of times, you can tell which model is ahead of another one. You can then sort of copy-paste the best model into the other models and then make them play again many times. And because you have the epsilon-greedy approach, one of the models is naturally gonna get better than the others because of the randomness that you add. So... Okay, let's look at a few examples, and then we'll spend twenty minutes on RLHF. Here are other examples. This is Pong, which is one V one. Seaquest, which is an underwater game. And then the one that maybe
1:19:58 – 1:30:05
Beyond DQN: harder games, sparse rewards, PPO, self-play, and multi-agent RL
1. KKKian Katanforoosh
  more of you know, Space Invaders, a very popular game as well. So the impressive thing that they showed is that you can actually solve many games with the exact same algorithm. No tweaks, which is quite impressive. Let's go a little further and talk about advanced topics. Here is a game called Montezuma's Revenge. This game is particular because you're controlling a little character right here, and this character is trying to go and grab, let's say, this key right here, and it has some obstacles or some enemies that it needs to take care of. What do you think is gonna be an issue if we apply what we just learned to this game? What makes this game especially hard in comparison to, let's say, chess or Go? Yes.
2. SPSpeaker
  The reward is very delayed.
3. KKKian Katanforoosh
  Yeah. The reward is very delayed. Like, if you start with a random network, what are the chances that the network is gonna figure out that to get to the key, it actually should go in the opposite direction? It should jump down here. It should catch the rope. The rope will probably allow the character to get to the ladder. It goes down the ladder. It has to jump over this enemy. My guess is it's an enemy. I'm not sure, but I think it's an enemy because of the color. And I know that in gaming, if it was green, it might not have been an enemy, but if it's gray or red, it might be an enemy. And then go up the ladder and grab the key. The chance is very low that the agent is gonna make that many successive good decisions to get there. You're right. Why is it easier for a human to actually solve that game?
4. SPSpeaker
  Intuition.
5. KKKian Katanforoosh
  Hmm?
6. SPSpeaker
  Intuition.
7. KKKian Katanforoosh
  Intuition, prior knowledge. So for example, when you look at this game, even if you have never played it, my guess is you would know you can go down the ladder because you know what a ladder is. Or you can see this little rope, and you're like, "I'm gonna catch the rope. I'm gonna jump and go to the other side." And you look at this little monster, and you're like, "I'd better not touch this monster." Or if anything, "I will jump on top of it," 'cause you've played Mario, let's say. So all of this is human intuition. Sometimes, for a baby, you would call it survival instinct, like you throw the baby in the water and suddenly it flips and it can swim. Those are things that are, to a certain extent, encoded in our DNA, but at the very least encoded in our experience of doing other things that have nothing to do with this game. And so the approach here is called imitation learning: is there a better way to start our network than a random initialization, one that allows the network to, for example, guess that this is a ladder? It turns out that if the network knows that, it will be more likely to get to the reward first and then learn from that reward and then get better over time. The other part that can also use human knowledge, which is what we're gonna see together, is reinforcement learning from human feedback, where you have an analogy here, which is you can train a language model and it might be completely misaligned with what humans actually care about. How does reinforcement learning help in those situations? That's gonna be the next topic in the last part of the lecture. Okay. Let me show you a few other results quickly. Today, we talked about DQN, Deep Q-Learning. In practice, there are a lot more reinforcement learning algorithms, but you got the gist of it. You got the concept of making good sequences of decisions, epsilon-greedy, exploration, exploitation, terminal state, starting state, all of that you got. 
The one algorithm that is very popular right now is called PPO, proximal policy optimization. There is one that is even more popular right now that's actually from a year ago at Stanford called DPO that we won't study in the class. One of the things to know about PPO, just to go over it really quickly, and I pasted two important papers from Schulman et al. from a few years back, TRPO (trust region policy optimization) and PPO, is that it is not a value-based algorithm. So in Q-learning, you learn the Q values, and then you define your policy as the argmax of the Q values. In PPO, you learn the policy directly, which is a more probabilistic method. It also works well with continuous spaces. If you look at the Q-learning we learned, it has one output for each action. If you actually have a game that has continuous actions, like autonomous driving, where it's not just turn the wheel to the right or to the left, it's what degree you turn it, it's continuous, then DQN would not work well. Or you would have to discretize the actions, a little bit to the right, a little bit more, a little more, which would not be really useful. Instead, you would use PPO. Yeah.
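To make the value-based versus policy-based contrast concrete, here is a small sketch of how an action gets chosen in each case. It is not PPO itself (there is no training here), and the function names, the logits, and the Gaussian policy over a steering angle are all illustrative assumptions; a Gaussian is one common way to parameterize a continuous-action policy.

```python
import math
import random


def act_from_q_values(q_values):
    """Value-based (DQN-style): one Q-value per discrete action, act by argmax."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def softmax(logits):
    """Turn raw network outputs into action probabilities."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def act_from_discrete_policy(logits, rng):
    """Policy-based: the network outputs a distribution and we sample from it."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def act_from_gaussian_policy(mean_angle, std_angle, rng):
    """Continuous action: sample a steering angle (in degrees) from a Gaussian."""
    return rng.gauss(mean_angle, std_angle)


rng = random.Random(0)
q_action = act_from_q_values([0.5, 0.4, 0.3])
policy_action = act_from_discrete_policy([2.0, 1.0, 0.1], rng)
steering = act_from_gaussian_policy(mean_angle=3.5, std_angle=1.0, rng=rng)
print(q_action, policy_action, steering)
```

The argmax only works when the actions can be enumerated; the Gaussian policy returns any real-valued angle, which is why learning the policy directly suits continuous control.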
8. SPSpeaker
  Uh, you, you learn... Is there gonna be a reward for a game like Go, like zero for most steps including [inaudible] .
9. KKKian Katanforoosh
  Yeah. So, the question is how do you define the reward in DQN? Different reward structures will lead to different types of agent strategies. But you're right. For the game of Go, you could actually define the reward as one if you win and zero if you don't win. That's it. You know, every move will be zero until the last move is a win. In chess, you might actually give intermediate rewards because you wanna tell the agent that it's good to capture the opponent's pieces to get rid of them. You could also do end-to-end and say, "I don't give any intermediate reward, I just give a final reward," which might be more complicated to train on, but it might actually lead to a more optimal strategy. Because in fact, you could actually win without taking any piece from your opponent. Yeah. So. Other things about PPO: you know, it's more probabilistic. It has a concept of an expected advantage, which at every step, instead of telling you how good that action is, tells you how much better it is than what you would get on average from the current state. Like, how much better would it be to do a certain thing versus what you would have done otherwise. I'm not gonna go into the details, it's all in the paper, but those are things that are important. Here are a few examples of PPO. So this example on the left is from OpenAI a few years back, where you can see it's a continuous space, where the agent is being bullied a little bit. It's trying to grab the rewards, but it's also subject to external forces that are sort of throwing balls at it. It's a little bit mean, but you can imagine that this is a continuous space, meaning you're controlling the joints of the agent, and you're controlling the forces, the angles, and so that's why PPO would be better in that case. Super. Here is a competitive self-play, which I really like. 
[upbeat music] Where you have agents play against each other. And this is the Sumo game: push the opponent outside the ring and you get a reward. So actually, it's interesting because you're seeing some emergent behavior, which is they attack each other's [chuckles] feet, or they lower their center of gravity to be more stable, for example. Yeah.
10. SPSpeaker
  Is this the exact same model playing against itself?
11. KKKian Katanforoosh
  Yeah. Yeah. It's versions, sometimes different initializations, for example.
12. SPSpeaker
  Okay.
13. KKKian Katanforoosh
  So no, but good question. So oftentimes, what OpenAI would do, you know, back at that time, is they would create copies of the same model, they would initialize them differently, and they would let them learn. And it turns out one of the models will get better than the others. And then they will copy that model again to the rest and do the same thing again and again, pretty much. Oh, yeah. It's kind of funny, isn't it? That's a good catch.
14. SPSpeaker
  [laughs]
15. KKKian Katanforoosh
  Yeah. That's a good goal. Could watch that for hours. Okay. They're a little awkward, you have to say, but it works. [laughs] Okay. So, you know, I'll let you watch the video, it's gonna be shared. But here's another set of games that are even more complicated that I mentioned early on. OpenAI Five, which you can think of as an equivalent of League of Legends, Dota, where you have a five V five game, so you have to collaborate, et cetera, which adds, like, literally one additional degree of complexity. And StarCraft, AlphaStar from DeepMind, is an example of where the observation is not the entire state. You have fog. And so that adds another layer of complexity. We're not gonna see that together today. I would encourage you to look at the AlphaGo documentary on Netflix, if you haven't. Who has seen it already? Nobody? Okay. Well, you can now watch it with a different eye, understanding reinforcement learning. And at some point in the documentary, you will see that AlphaGo makes a very odd move, a very creative move. And people are like, "I don't understand that move." Even the top researchers or the best players would say in the video that they don't understand that move. It turns out that that move is very unintuitive for humans, because as humans, we are trained to maximize our chances of winning. Like literally, if I can eat all your pieces in chess, I will eat all your pieces. And if I can surround your stones in Go as much as I can, I will do it. The agent is just programmed to win. So that move actually looked counterintuitive because the agent doesn't care about winning by one or winning by, you know, twenty stones. It just cares about winning, and that move specifically
1:30:05 – 1:37:12
RLHF pipeline: from next-token pretraining to supervised fine-tuning (SFT)
1. KKKian Katanforoosh
  put the agent in a good place to win by a small margin. Yeah. So that's an example of an insight that you will understand from this class and you will see in the documentary. Okay. I think we have ten minutes. I'm just gonna introduce reinforcement learning from human feedback, because it's a more modern topic that is very trendy right now. It's important to know. And so let's look at it together. We're gonna start by recapping how language models are trained in a nutshell, and then we'll see what supervised fine-tuning looks like. We'll talk about how we train a critic model, a reward model, and then finally, what RLHF looks like and why it is trending so much in the news. So, our training objective for language models is next token prediction, right? We've already talked about it in a former lecture. The idea is that I will get some inputs. I'm reading Wikipedia, let's say, or some sort of text online, and I read part of a sentence, and I predict the next token, and I do that again and again. So for example, "deep learning," and then "deep learning is." "Deep learning is so..." "Deep learning is so cool," and that's it. So you get the idea, right? You're always predicting the next token, and then over time, it forces the model to exhibit emergent behaviors, and it understands the connections between those concepts, and it's really good at generating text. We compute the loss function. You're actually gonna study this loss function in C5, so I'm not gonna talk about it right now, but you perform a gradient descent loop. And this is how you get your first pre-trained language model. You can call it on a text or a prompt, and it will continue generating, and you call it again and again and again, and it generates, generates, generates. Everybody's comfortable with that, right? Okay. 
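As a small illustration of next token prediction, the sentence from the example can be unrolled into (context, next token) training pairs. This sketch assumes whitespace-separated words as tokens, which real tokenizers do not do, and it uses the standard cross-entropy form for the per-token loss as an assumption, since the lecture defers the loss to a later course. The toy probabilities are made up.

```python
import math


def next_token_pairs(tokens):
    """Every prefix of the sentence becomes an input; the token after it is the label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def cross_entropy(predicted_probs, target):
    """Per-token loss: negative log of the probability given to the true next token."""
    return -math.log(predicted_probs[target])


pairs = next_token_pairs("deep learning is so cool".split())
for context, target in pairs:
    print(context, "->", target)

# Made-up model output for the last context: half the probability mass on "cool".
loss = cross_entropy({"cool": 0.5, "hard": 0.5}, "cool")
print(loss)
```

One sentence of five words yields four training examples, which is why plain text from the web provides so much supervision without any human labeling.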
So, that's how we train a language model, but there are a couple of problems. The first problem is that online data does not reflect helpfulness. So to give you a concrete example, what you might find in a training set is something like "deep learning is so cool," when actually what you might find in practice is people asking, "What is deep learning?" So the data is not really reflective of the fact that you want an agent to be helpful, and that's a problem because the model was trained to continue text rather than answer questions. And in practice, you would see it's a big problem. Another problem is the model has no concept of good, polite, or helpful yet. And to give you a concrete example, you might actually ask a pre-trained language model, "My laptop won't turn on. What should I do?" And then the model responds, because it has read it on Reddit or on Wikipedia, [chuckles] "Laptops sometimes don't turn on because of power issues." Which is not what you asked. You asked, "What should I do?" And in fact, a better answer would have been, "Check that your charger is properly connected and the outlet works. If that's fine, try holding the power button for ten seconds. If it still doesn't start, the battery or motherboard may..." blah, blah, blah, blah. That's a better answer. That's what you want a language model to do nowadays. And the model can give you factual text because that's what it's been trained on, but it doesn't understand being helpful or having an answer that looks like a human-like answer. So our solution to it will start with using supervised fine-tuning, which is going to be learning from human-written demonstrations of helpful behavior, and then we go even further and use RLHF, which will optimize not only for human-written sentences or paragraphs, but for preferences. And the word preference is the keyword. Let's talk about how we can improve our pre-trained model with supervised fine-tuning. 
The idea is that we want to align models with human-written responses. And the step one that we're gonna use is to build a data set. Let's build a data set of human prompt-response pairs. So what OpenAI is actually gonna do, I'll explain it in a second, is it might collect some of the prompts that we all use and then ask humans to respond to those prompts and put that in a data set. It might also separately ask experts to write really good prompts and then answer those prompts. It's a fully human-made data set. And then we use that data set to fine-tune our pre-trained model. And by now you've learned fine-tuning in the online video, so you know what I'm talking about, using supervised learning. So what it looks like is I take my pre-trained model that I just told you how we train, and then I give it a prompt, "Explain deep learning to a beginner," and I will also concatenate to it a good response written by a human. "Deep learning is a type of machine learning that uses neural," and then I expect the model to come up with the word "networks." So it's literally doing whatever we did to train the pre-trained model, but we do it on human-written prompt-response pairs. And if you do that many times and you use, you know, the same loss function, how far the model's response is from a human response, you will get SFT, supervised fine-tuning. But it has some shortcomings. One of the shortcomings is that this data is extremely costly to collect. In fact, I believe in the first version of InstructGPT, there were only thirteen thousand prompt-response pairs. It turns out it did really well despite that. The second aspect is it's unlikely to generalize well because, again, you're not doing reinforcement learning here. 
You're doing supervised learning, and so you're just showing a set of examples, thirteen thousand examples that you wanna learn, but what tells you that it will generalize to an unseen prompt that will come up from your user base? And so this approach, SFT, really teaches the model to imitate good behavior from humans, and that's the key. It's imitation. It is not preference optimization. To do preference optimization, that's where we're gonna train a reward model, and we're gonna do proper RLHF. So let me talk to you about the RM, the reward model, and then I'll tell you about RLHF in a nutshell. The problem of,
1:37:12 – 1:44:55
Reward model and RLHF as preference optimization over full responses
1. KKKian Katanforoosh
  RLHF is to align not with human responses, but with human preferences. So what's gonna happen is we're gonna train a separate model to predict which responses humans prefer, and we're gonna call that model the reward model. It's a separate model from whatever we've trained before. The model is gonna use data from labelers. So you're gonna show labelers two or more responses to the same prompt, and those responses will be sampled from the SFT model. So your best model right now is the SFT model. You will sample three or four responses, and you know how we sample, right? You can tweak the temperature. You can select not only the top probability word, the softmax layer's number one word, but you can sometimes sample differently, and you will get a variety of answers. And then you will ask a human labeler to say answer B is better than answer C, and answer C is better than answer A, and answer A is sort of equal to answer D, let's say. They will be asked which response they prefer, and it can get more complicated. It doesn't have to be just a simple ranking. You have multiple Likert scale methods and so on. But the point is that you will collect those pairwise comparisons that we call preference data, and you will use them to train a reward model, which is initialized from your SFT model. So your SFT model is here. It's your best model to date, and you're gonna modify the last layer. So the softmax layer at the end of a language model is what will tell you, "This is the token we should output," or, "This is the word we should write." Instead of that, you'll get rid of that layer. You'll put a linear layer with a scalar value as output that will represent the reward head. It will predict the reward, you know, which is a proxy for the preference of the human. The way you'll train that reward model is you'll give it a batch of two. 
You know, you'll give it a prompt X with a response A and the preference of the user, and you'll give it the same prompt with a response B from the SFT model with the preference of the user. So here the user is saying response A is better than response B. And so if you actually send that through this model, you will get a predicted reward for the preferred answer and a predicted reward for the less preferred answer. That allows you to train using a loss function, and I'm not gonna cover it given our time constraints. The loss function will encourage the model to assign higher rewards to preferred responses. So you're trying to separate the higher reward for the preferred response from the lower reward for the less preferred one. And it turns out that if you do that many times, you will have a reward model that, given a prompt and a response, will be approximating human preference. So you've just trained a critic that represents your humans. It's a proxy for what humans prefer. It's been trained on a lot of human preferences. The reason it's better to use a model than actual humans is because we can use it widely on all sorts of inputs, and it can scale from a data standpoint. Also note that this method is better than SFT because it's way easier to ask humans, "What's your preference between those two things?" than to ask them to come up with answers to prompts. Takes way less time. And if you've used ChatGPT, you've probably been asked before to tell them which response you prefer. Yeah. So once trained, the reward model replaces the human as the evaluator during reinforcement learning from human feedback. And reinforcement learning from human feedback is very comfortable for you now. I will show you what it looks like given the Q-learning algorithm we learned. 
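The lecture skips the reward model's loss. One widely used choice for this kind of pairwise preference data, shown here as an assumption rather than as the lecture's formula, is the negative log-sigmoid of the gap between the two predicted rewards: it is small when the preferred response scores well above the other one and large when the order is reversed. The function name and the example rewards are illustrative.

```python
import math


def pairwise_preference_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)), computed in a numerically stable way."""
    margin = reward_preferred - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, rewrite to avoid overflow in exp().
    return -margin + math.log1p(math.exp(margin))


# Scalar outputs of the reward head for (prompt, response A) and (prompt, response B),
# where the labeler preferred response A. The numbers are made up.
print(pairwise_preference_loss(2.0, 0.0))  # model agrees with the labeler: small loss
print(pairwise_preference_loss(0.0, 0.0))  # model is indifferent: log(2)
print(pairwise_preference_loss(0.0, 2.0))  # model disagrees: large loss
```

Minimizing this over many labeled pairs pushes the reward head to rank responses the way the labelers did, which is all the later reinforcement learning step needs from it.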
But essentially, we have first taught the model what good behavior looks like with SFT, and then we built a reward model that can tell us how good an answer is according to human preferences. And the RLHF approach is where we will let this model practice, get scored by the reward model or the critic, and update itself to produce higher-scoring answers, so more preferred answers. And it's the same as the games we've seen together, but some things differ. So I just pasted here the exact setup that we've learned together for reinforcement learning. The differences are the following. You know, our objective is still to maximize expected reward, which is produced by the reward model aligned with human preferences. The agent is the language model being fine-tuned. The environment is the space of possible prompts and continuations. It's any text that you can encounter. The state is the specific prompt plus the tokens that were generated so far. The next state is one more token added, and the action is the next token that is chosen by the agent or the model, which is, of course, determined by the policy. And then the reward is estimated by the reward model that we trained to represent human preferences. Okay? In this case, one episode is one full prompt. So imagine that you get a prompt, and you start generating, and you go through this reinforcement learning loop, and you observe the rewards, and then you try to maximize the future rewards. And then at the end of training, you end up with having your pre-trained model turn into an SFT model, and your SFT model turn into a way better model using RLHF. Okay? So a few things to note to end on this. The model does not get a reward at every single token. It gets a reward at the end of a sequence when the completion is finished because the reward model was asked to rate prompts and responses together. So you need to finish the generation in order to see what the reward is. 
And so again, going back to making good sequences of decisions, that's exactly it. You want the model to make enough good sequences of decisions so that the response is preferred by the critic, which represents a proxy for human preferences. So all intermediary rewards are typically zero, and that makes it a very sparse-reward episodic task, just like a game of chess where you only get a reward when you finish, assuming you're not defining intermediary rewards. So you only know if you did well at the end, and you have to then use that information to update your network and get a better proxy for it. Super. There's a very nice video. We're not gonna play it for the sake of time, but I will send it online. It's from a former Stanford student, Andrej Karpathy, who is very thoughtful and articulate and was explaining four days ago why reinforcement learning can be terrible at times, and that human minds work way more efficiently. And so I would encourage you to watch this four-minute video because he's very clearly outlining why reinforcement learning is still not great, even if it's the best thing we can use in many ways.
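The mapping from text generation to a reinforcement learning episode can be sketched as a single rollout. Everything here is a toy stand-in and labeled as such: `toy_policy` picks tokens at random where a language model would sample them, and `toy_reward_model` is a hypothetical scoring rule in place of a trained reward model. The point is only the shape of the episode: state is prompt plus tokens so far, action is the next token, and every reward is zero until the response is complete.

```python
import random

VOCAB = ["check", "your", "charger", "<eos>"]  # tiny made-up vocabulary


def toy_policy(state, rng):
    """Stand-in for the language model: choose the next token given the state."""
    return rng.choice(VOCAB)


def toy_reward_model(prompt, response):
    """Stand-in for the trained reward model: scores a full (prompt, response) pair."""
    return 1.0 if "charger" in response else 0.0


def rollout(prompt, max_tokens=8, seed=0):
    rng = random.Random(seed)
    response, rewards = [], []
    for _ in range(max_tokens):
        state = (prompt, tuple(response))  # state: prompt plus tokens generated so far
        token = toy_policy(state, rng)     # action: the next token
        response.append(token)
        rewards.append(0.0)                # no reward for intermediate tokens
        if token == "<eos>":
            break
    # The only nonzero reward arrives once the whole response can be scored.
    rewards[-1] = toy_reward_model(prompt, response)
    return response, rewards


response, rewards = rollout("My laptop won't turn on. What should I do?")
print(response)
print(rewards)
```

Because the score only arrives at the last token, the learning algorithm has to work out which earlier token choices deserve the credit, which is the same sparse-reward difficulty as in the games above.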

Episode duration: 1:45:00

Transcript of episode 4E27qlfYw0A

Why Deep Reinforcement Learning now: from Atari to ChatGPT alignment

Why supervised learning falls short for Go and other sequential decision problems

Core RL vocabulary: agent, environment, state vs observation, reward, transitions

Toy RL problem: 'Recycling is Good' MDP and rewards design

Discounted return and solving via a Q-table (backtracking through outcomes)

Bellman optimality equation and policy extraction from Q-values

Why Q-tables don’t scale and the shift to Deep Q-Learning (DQN)

Training DQN using Bellman targets: creating labels from experience

DQN pseudocode loop: episodes, timesteps, actions, and updates

Breakout case study: defining inputs/outputs and practical preprocessing

Stabilizing and improving DQN: terminal handling and Experience Replay

Exploration vs exploitation: epsilon-greedy to avoid local traps

Beyond DQN: harder games, sparse rewards, PPO, self-play, and multi-agent RL

RLHF pipeline: from next-token pretraining to supervised fine-tuning (SFT)

Reward model and RLHF as preference optimization over full responses
