
Stanford CS230 | Autumn 2025 | Lecture 5: Deep Reinforcement Learning

For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/ai October 21, 2025 This lecture covers deep reinforcement learning. To learn more about enrolling in this course, visit: https://online.stanford.edu/courses/cs230-deep-learning To follow along with the course schedule and syllabus, visit: https://cs230.stanford.edu/syllabus/ More lectures will be published regularly. View the playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rNRRGdS0rBbXOUGA0wjdh1X Andrew Ng Founder of DeepLearning.AI Adjunct Professor, Stanford University’s Computer Science Department Kian Katanforoosh CEO and Founder of Workera Adjunct Lecturer, Stanford University’s Computer Science Department

Kian Katanforoosh (host)
Oct 31, 2025 · 1h 45m

EVERY SPOKEN WORD

  1. KK

    Welcome to our fifth lecture in person for Stanford Deep Learning CS230. Um, today's lecture is gonna be about deep reinforcement learning. I actually switched, uh, the original plan of talking about neural network interpretability and LLM visualization, uh, simply because you, you haven't had the chance to study attention maps, um, uh, you know, convolutional neural networks, and so it would have been an overkill to do that week five. So we're gonna talk about neural network interpretability and visualization in a later lecture, actually. Um, but today, uh, our focus will be on deep reinforcement learning, uh, which is probably my favorite, uh, uh, lecture of, uh, the class. I, I feel like I say that every week, but it's okay. I like it. Um, the agenda is pretty packed. We're gonna start with, uh, deep reinforcement learning, which you can think of as the marriage between deep learning and reinforcement learning. Together, the baby is called deep reinforcement learning, and we're going to see how reinforcement learning works and how neural networks can play a part in building a reinforcement learning, um, agent. Um, in the second half of the class, we will focus on a very specific, um, you know, concept called reinforcement learning from human feedback that you might have heard of. It's one of the core concept that, uh, really made the difference between what, um, you might have remembered as GPT-2 and ChatGPT. You know, that's the leap. Uh, that's really the, the, the, the technique that had, uh, that has, you know, democratized, um, access to LLM because of the performance improvements and the alignment with humans. So we're going to see what, what is this concept of RLHF, and how, um, does it work, and why does it allow us to align a language model to human preferences. Ready to go? As always, let's try to make it interactive. 
Um, so the motivation behind deep reinforcement learning, and as usual, you're gonna have all the most important papers that are covered in the class listed at the bottom of each slide. Um, reinforcement learning has grown in popularity. Um, one of, uh, uh, the, you know, very popular papers called Human-Level Control through Deep Reinforcement Learning is the work, um, from, um, DeepMind, um, has showed us that a single algorithm/training method can allow us to train, um, AI that can play many, many Atari games better than humans. Single algorithm over forty, fifty games where it exceeds human capability, which is quite impressive when you thought about the fact that, you know, machine learning used to be niche, and you would have to train a really niche algorithm to perform different tasks. Here's an algorithm that can just learn sort of every Atari game. A little later, um, you might have heard of AlphaGo. AlphaGo is, um, is a, is an algorithm that was developed to beat and exceed human performance in the game of Go. We'll talk about it a little more. The game of Go is a very complex game. Um, some would argue way more complex, uh, than chess from a decision-making standpoint and from the, uh, possibilities, uh, that can happen on the board. And so, um, it, it actually got solved, um, in 2017, again by the DeepMind, uh, DeepMind team and, and David Silver's lab. Um, later on, and again, another great paper from DeepMind had showed us that reinforcement learning can also be used for strategy game that might be a touch more complex than, uh, chess or Go. That might actually involve multiple players playing with each other or against each other. Some of you might have played StarCraft, for example. That's an example of a game where, um, it requires a lot of lo-long-term thinking, short-term thinking. Another one is, uh, you know, Dota. Some of you might have played Dota or League of Legend, where you have a team playing against another team. 
Those are examples of games that involve multiple agents playing collaboratively, and it's pretty hard to develop systems that can play with each other against multiple opponents. Um, and finally, most recently, this is 2022, so alongside the release of ChatGPT, um, this paper that introduces the concept of reinforcement learning with human feedback applied to aligning language models with, uh, human preferences, and we'll talk about that later. So all, uh, all this to say that reinforcement learning allowed, um, um, us to exceed human performance in a variety of tasks. The first one, um, I want us to think about is the, the game of Go. So let's say that you were asked to solve the game of Go with classic supervised learning, okay? Everything we've seen together so far, labeled data. How would you solve the game of Go with classic supervised learning? What data would you collect? What would be the label, et cetera? Yes.

  2. SP

    [faintly speaking]

  3. KK

    Okay, good point. Yeah, you look at the history of plenty of games, hopefully from good players. Train the game, the, the, the algorithm to work. Um, and you look at X as the input being the current state of the board, and Y as the next state of the board, and this would tell you what move was selected, and you learn the move essentially. And hopefully, if you do that across many, many games, you know, you might, you might see, uh, um, the, the agent become more attuned to the game and, and, and develop, uh, better strategies. So, you know, really hopefully it's a professional player. What, what are the disadvantages of that or the shortcomings that you can anticipate? Yes.

  4. SP

    You might miss out on the space types of moves that players use and maybe, maybe, maybe go against some other set of moves that were formerly considered.

  5. KK

    Yeah. Yeah, great point. You might not see the entire space of possible states of the board, which is what you said. So you might miss out on a lot of different strategies. So the game of Go is actually a game with two players, one player that uses the black stones and one player that uses the white stones, and iteratively, they're gonna place those stones on the grid, a thirteen by thirteen grid that you can see on screen, with the goal of surrounding their opponents. So you're constantly trying to surround the stones of the opponent, and the opponent is trying to surround your stones. And so you can imagine that for every intersection on the grid, there is multiple possibilities. Either there's a black stone or a white stone or nothing. And on a thirteen by thirteen grid, you can imagine how many possibilities of a board state there are. It's, um, impossible to capture all of that with historical moves from professional players. It will just never cover that. The same thing could be said in chess as well. You, you know that even the professional players can plan X number of steps in advance, but nobody knows where the game takes you. And in the late stages of the games or the end games, um, players always find themselves playing a different game, and that's part of the magic of being good at chess. Um, so yeah, that's a problem. What's another problem or shortcoming beyond the fact that we can't observe possibly all the states? Yes.

  6. SP

    You also can't, like, anticipate, like, what that action will lead to in the future. Like, you might not make the best decision at first, but later on.

  7. KK

    Correct. Correct. You-- If I repeat what you said... Well, first, you don't even know if this was a good move, you know. So maybe it was not even a good move, and you're learning something that was not a good move, and you're labeling it as a good move. And second, um, you're actually only getting partial information, meaning you don't have the information of what's in the person's mind and what strategy they're trying to execute. So you're sort-- you're sort of looking at a single example among a long-term strategy, and you can't expect the model to guess what's the long-term strategy because it was just trained on X and Y and matching the inputs to a possible output. So you, you don't really have any concept of a strategy at that point. It looks one-off at every decisions of the model. Okay, those are really good points. Um, the other one is the ground truth might be ill-defined. What I mean by that is, um, even the best humans in the world do not play their best game every day, and even their best game is not the ground truth. And that creates an issue because you're essentially training against a target that is off by a certain margin. You're never gonna get better than the best human, and the best human is not the best possible, um, uh, existing the best possible strategy at every point. So you could argue, what if we get a panel of experts that we're monitoring, and those are the best players in the world? Even with a panel of expert that decides every move, you still have an ill-defined ground truth, you know? So that's a big issue. Too many states in the game you mentioned, and we will likely not generalize, which is what you said, meaning we're looking at one-off situations, we're not looking at entire strategies. And so when we face a board sta-state that we've never seen before, because the model was not trained on strategy, it sort of will get stuck, you know. Okay. 
And this is an example of a perfect application for reinforcement learning, because reinforcement learning is all about delayed labels and making sequences of good decisions. So if you had to remember in one sentence what's RL, RL is making good sequences of decisions. Sequences of good decisions, sorry. And do that automatically. Another way to look at it is the difference between, you know, classic supervised learning and RL is in, uh, in classic supervised learning, you teach by example. In reinforcement learning, you teach by experience, which is also a different concept. You're not just showing cats and non-cats to a model, you're actually letting the model experience an environment until it figures out, uh, what the best decisions it made were and learns from them. Some examples of reinforcement learning applications, I'm gonna mention them. We, we have, we have gaming, of course, that we already covered. What are other applications of AI where we need good sequences of decisions? Yes. Autonomous driving? Yeah, correct. I mean, in driving, you could argue RL could work, and there's some RL going on. But what you mean, I think, is you're, you have some sort of a dynamic planning algorithm that allows you to strategize. If you see a, a red light ahead, you might start slowing down over time. But maybe it will turn green, so you might not slow down completely. This is an example of a strategy that you need, of course. Yeah.

  8. SP

    Robot controlling.

  9. KK

    Robot controlling. That's a great example, also related to autonomous driving. But imagine you, uh, wanna teach a robot to move from point A to point B. The number of good decisions that the robot needs to make in terms of moving each of their joints is tremendous. Like, it's actually super unlikely that a robot would move from A to B if it's not trained to make good sequences of decisions. What else? Actually, the biggest one nobody mentioned yet. It's not a great application. I don't like it, but it happens to be the biggest one of reinforcement learning. Yeah.

  10. SP

    Like, creating or making suggestions.

  11. KK

    Yeah. Yeah, yeah. Advertisement. Yeah, marketing. You're right. So yeah, we talked about robotics. Advertisement is another example. Um, advertisement is a long game. Like, companies are showing you multiple ads before you buy, and in fact, the reason rein- reinforcement learning is important is because, you know, they're planning a strategy that might lead a buyer to execute a purchase over time, and it requires, uh, long-term thinking. So there's a lot of reinforcement learning applied to, uh, marketing, advertisement, real-time bidding processes, et cetera. Okay. Clear on what RL is and how it differs from classic supervised learning? No? Okay. Um, so let's put, uh, some vocabulary around that concept. In reinforcement learning, you have an agent, and the agent, uh, interacts with an environment. As the agent interacts with the environment, the agent will perform certain actions that we will denote At, where t is a time step. And the environment will show you states that transition from time step t to time step t plus one. So subject to an action At, an environment may transition from H-- uh, St to St plus one. You can think of the game of Go. I take the action of putting my black stone on a certain grid, uh, intersection, and the environment has changed. It moved from, uh... The state has changed. It moved from state time step t to time step t plus one, where my stone is on the grid. After that, um, state update happens, um, there's two things that the agent observes. The, the agent observes, um, an observation that we will note Ot and a reward, Rt. Okay? So those are the vocabulary words. And of course, the goal of the agent will be to maximize the rewards. One thing to know about the observation, we'll talk about it a little more. Um, the observation sometimes is equal to the state. Can someone guess why we might need two concepts instead of a single concept? Why is it important to have a state and an observation? Yes.

  12. SP

    Different values for different outputs.

  13. KK

    Yes, correct. So in some cases, uh, the environment may not be fully, um, uh, uh, you know, transparent to the user. And so, for example, in chess or in Go, uh, the observation is actually equal to the state. You see everything on your board. All the information is available to you. If you play League of Legends or StarCraft, uh, you know the concept of, uh, you know, I think in English it's called like a cloud or a fog. I think it's the fog. You only see certain parts of the map until you have explored everything or until your friends are, uh, sort of visiting the other parts of the map. And so the observation is actually, uh, less information than the states of, uh, the environment. Okay. And then the last piece of vocabulary is a transition. When I refer to a transition, I refer to the process of getting from state t to state t plus one, which means we're in state t, the agent takes an action At. It observes Ot and a reward Rt, and it transitions to the next state, St plus one. Question.
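The vocabulary above (state, action, observation, reward, transition) can be sketched in a few lines of Python. This is only an illustration: the `FoggyCorridor` environment, the `Transition` record, and the `run_episode` helper are hypothetical names invented here, not something shown in the lecture. The corridor hides the full state from the agent, in the spirit of the fog example, by revealing only whether the goal is at most one cell away.

```python
from dataclasses import dataclass


@dataclass
class Transition:
    state: int         # S_t, the true state before acting
    action: int        # A_t, -1 for left or +1 for right
    observation: bool  # O_t, what the agent is allowed to see
    reward: float      # R_t
    next_state: int    # S_{t+1}


class FoggyCorridor:
    """Hypothetical environment: positions 0..4, goal at position 4."""

    GOAL = 4

    def __init__(self):
        self.state = 0

    def step(self, action):
        previous = self.state
        # Walls clamp the position to the corridor.
        self.state = min(self.GOAL, max(0, previous + action))
        reward = 1.0 if self.state == self.GOAL else 0.0
        # Fog: the agent only learns whether the goal is within one cell.
        observation = abs(self.GOAL - self.state) <= 1
        return Transition(previous, action, observation, reward, self.state)


def run_episode(env, policy, max_steps=20):
    """Let the agent act until it reaches the goal or runs out of steps."""
    history = []
    observation = False  # nothing is visible before the first move
    for _ in range(max_steps):
        transition = env.step(policy(observation))
        history.append(transition)
        observation = transition.observation
        if transition.next_state == env.GOAL:
            break
    return history


# A trivial policy that always moves right, regardless of the observation.
history = run_episode(FoggyCorridor(), lambda observation: +1)
for transition in history:
    print(transition)
```

In a fully observed game such as chess or Go, the `observation` field would simply carry the whole state.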

  14. SP

    Regarding competition, are there too much, uh, the statements are too large to, uh, once you've got the different training plus the

  15. KK

    Uh, wait, what do you mean? You mean is there... Are there examples of, uh, environment where the state is so large that the-

  16. SP

    You want to be the entire structure.

  17. KK

    Okay.

  18. SP

    For you to request something without alteration.

  19. KK

    Yeah, possibly. For computational reasons? Yeah, yeah. You might have games. I mean, look at open world games. Like, truly, you, you could, you could argue, uh, I don't know, there are some games where you might press start, and you see the entire environment. But who cares about what's happening, uh, twenty thousand kilometers, uh, west of you if you're in a certain location? Uh, that might not influence your strategy, so you might actually put some sort of a, you know, trust circle or, like, some sort of a circle in which you observe, which you think has ninety-nine percent of the information you need, possibly for computational reasons. That's a good point. Okay, let's get to a practical example of a reinforcement learning algorithm and develop it together. Uh, this example is called Recycling is Good because recycling is good, but also because it's a simple example illustrative of reinforcement learning. So let's say we have a, a small, um, environment, uh, with, uh, five states. There is a starting state, uh, marked in brown, which is state two. It's our s-- It's our initial state. And then on the right side... Uh, sorry. On the left side, you have state one, which is, uh, garbage. And it's great to get to the garbage because you're gonna be able to recy-- to, to put in the garbage the, um, you know, the stuff that you have in your hands. You know, you're trying to throw away some garbage, and the garbage kinda happens to be there, and so we would expect there to be a reward. On the other side, if you actually go to the right, you might pass by state three, which is empty. You might pass by state four, where there is a chocolate, uh, packaging that is left on the ground that you can pick up, and, um, it's good to pick it up. And then on stage five, uh, state five, you have the recycle bin, which is more valuable than the garbage can because you can recycle, and you should get better rewards for that. So that's our game. 
In this game, we define a reward that is associated with the type of behaviors that we want the agents to learn, um, and the reward is as follows. That's just one example. Plus two for throwing your garbage in the normal can, uh, plus one for picking up the chocolate packaging, and plus ten if you manage to make it to the recycle bin. Is it clear? Now, the goal will be, and that's the case in, uh, reinforcement learning oftentimes, to maximize the return. We'll define the return formally, but think about it as maximize the amount of rewards that you get as you go through this journey and you make your decisions. In this specific game, we have five states, and there's three types of state. In brown is the initial state. We have normal states, and we have in blue terminal states. When you get to a terminal state in reinforcement learning, it will typically end the game. It will end one episode of the game. We move to another episode. You'll get back to the starting state or initial state, and you'll redo another episode. The possible actions for our agent here are gonna be fairly simple, left and right. And we are gonna add an additional rule that is important, which is that the garbage collector, uh, comes in three minutes, and it takes a minute to get from one state to the other. Why is that an important rule to add to the game? Can you guess? Yeah. Otherwise, you just go. Yeah. Otherwise, you just go back and forth between, uh, state three and state four. You just collect a bunch of, uh, chocolate packaging, and you never make it to the bin. And so, um, it's not what we want. Yeah. Okay. So how do we define the long-term return? The long-term return is gonna be defined, uh, as capital R, which is, uh, the sum of rewards with a discount. Um, discount is a very important concept in reinforcement learning. It's also a very natural, uh, concept to think about. Can you think of what, what the discount would represent in-- for humans? 
Do you have an example of, uh, what it could be? Yeah. The value of money and time. Huh? The value of money and time. Yeah, the value of money and time. Exactly. Uh, or the energy that a robot might have, things like that. Yeah. You, you would rather get, uh, you know, a dollar now than a dollar in ten years knowing that there's some inflation, for example. Uh, that's the example of a discount, and reinforcement learning is the same. Uh, you know, let's say you have a strategy that takes so much time, you need to discount it because your robot might lose energy as you're going through it, for example. Discounts can vary, you know, but they stay between zero and one. Um, so what is the best strategy to follow if, uh, gamma, the discount, is equal to one, meaning, uh, you know, time doesn't matter here if it's longer or shorter? Just wanna maximize the return. Best strategy to follow. Anyone give it a try? Someone who hasn't spoken yet. Yes. You could just, uh, bounce around forever. Bounce around, but remember the rule of, uh, three minutes. You can't bounce around because you, you will not get to the terminal state before the time allotted is done. But that would be a good idea if this rule was not true. What else could you do? Any idea? It's an easy one, no? Not too hard. Best strategy for gamma equals one, and give me also the maximum reward you would get. People are, are sleepy today, yeah. Recycle. Go to recycle. Go to the recycle. So right, right, right. Yeah. Yeah, that's right. Thank you. Right, right, right. And then what's your... Sorry. What's your, um, what's your total reward? Eleven. Yeah, that's right, eleven. So that's where we get terminal state, and we grab our reward of eleven. Very good. Now, assuming zero point nine for gamma. We're gonna complexify things a little bit. I'm gonna walk you through a very simple algorithm that, you know, allows us to sort of determine the best strategy, and we will put our numbers in a matrix. 
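The "sum of rewards with a discount" can be made concrete with a small helper. This is a minimal sketch, assuming the usual form R = r_0 + gamma * r_1 + gamma^2 * r_2 + ..., in which the immediate reward is not discounted; the function name `discounted_return` is chosen here for illustration. The reward list `[0, 1, 10]` is the right, right, right path from state two: nothing in state three, the chocolate packaging in state four, the recycle bin in state five.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t, where rewards[0] is the immediate reward."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Rewards collected along right, right, right starting from state two.
rewards_right_right_right = [0, 1, 10]

# With gamma = 1 time does not matter, so the return is the plain sum.
print(discounted_return(rewards_right_right_right, 1.0))            # 11.0

# With gamma = 0.9 later rewards count for less: 0 + 0.9 * 1 + 0.81 * 10.
print(round(discounted_return(rewards_right_right_right, 0.9), 6))  # 9.0
```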
So for instance, um, we'll define a Q table, and Q stands, uh, you know, it's, it's a, it's a, it's a value function, um, where the, the, the name Q learning, Q star, you might have heard. Um, all of these things come from Q learning. And so let's say we have a Q table which has, uh, the size of number of states times number of actions. So five rows, two columns in our case. Every entry of the Q table is essentially representing how good it is to take action A in state S. Do you agree that if we had a table with these numbers, essentially we solved the problem? Meaning at any point, the agent can just look in the table. I am in state three. Let's look at column one. That would tell me the value of action one, and let's look at column two. It would tell me the value of action two. So I have everything I need to make my decisions. So that table is really the, the thing you wanna find in this exercise. Now, the way we will find the table is, uh, sort of using a backtracking algorithm where we might actually, uh, uh, codify the environment as a tree and traverse the tree. So here's what it looks like. I start in S2 and I have two options ahead of me. I can go to the left where I will get a reward of two. It's an immediate reward. The immediate reward is not discounted. It's an immediate reward. Remember the, the formula for R. The immediate reward R0 is not discounted. That would take me to S1. It's a terminal state, so there's nothing to do after. Second option, I go to the right and I get a reward of zero. That's my immediate reward and I end up in state three. State three is not a terminal state, so I can go and do the same exercise from state three. In state three, I have two options. I can go to the left where I would see a reward of zero and I will end up in S2, or I will go to the right and I will get an immediate reward of plus one. It's an immediate reward, we're not discounting it. I will end up in S4 and from S4, again, I have two options. 
Back to the left to S3 with zero reward or to the right with the amazing reward of plus ten and the terminal state of S5. So that's my map of immediate rewards. That's not my discounted return. So what we're gonna do now is we're gonna backtrack up the tree in order to compute the discounted returns. Actually, if I'm in S3 right here, I see that I can get an immediate reward in S4 of plus one, and I wanna compute my maximum return that I can get from when I'm in S3. My maximum return is that in S4 I could get a plus ten, right? But I need to discount that. My discount is zero point nine, so I multiply ten by zero point nine. What it tells me is that from S4 I can expect nine. Plus one, which I get as an immediate reward from moving from S3 to S4, I can update this number to ten, meaning from S3, the best you can hope for is a discounted return of ten, which is one plus zero point nine times ten. Everyone follows? Now let's do the same exercise one step before, uh, in S2. Uh, you know. Um, in S2, um, I have, um, an immediate reward of zero for going to S3 or an immediate reward of two for going to S1. Um, S1 is not gonna be worth it. We already know that because when I'm in S3, I can actually expect ten, which I have to discount. Zero point nine times ten gives me nine, plus zero immediate reward from S2 to S3. That tells me that the discounted return from state two, which is our initial state, is nine. You all follow? Just a simple backtracking. Now I can copy back this so S3, I know that when I'm in S3, um, uh, you know, I can expect a zero immediate reward to, um, uh, to... Sorry. If I, if I'm, if I'm in S2, I can expect, uh, zero immediate reward plus a discount times the plus nine that I could expect in S3. And so, uh, that gives me values that should cover everything that we have in this Q table. So I, I do that backtracking, I copy-paste all of that into my Q table all the way up here, and this is what I get. We essentially finish the game at this point. 
We, um, can look, uh, at a certain row. So let's say I'm in state number three. I look on the third row of that Q table and I see that I have two options. If I go back to S2, ultimately my discounted return will be eight point one, right? If I actually go to S4 on the right, I will get ten because I will get one plus zero point nine times ten, which is ten. So this is a toy example, but it tells you that if you were able to backtrack through the entire environment, you will be able to build a massive Q table and you will be able to give it to your agent to make its decisions. Yeah.
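The Q table described above can be reproduced with a short script. Note the swap in technique: instead of drawing the tree and backtracking by hand, this sketch repeatedly applies "immediate reward plus discount times the best value of the next state" to every entry until the numbers stop changing. Like the simplified walkthrough, it ignores the three-minute limit. The names `step`, `solve_q_table`, and `ENTER_REWARD` are illustrative choices, and modelling the reward as being paid on arrival in a state is an assumption made for this sketch.

```python
GAMMA = 0.9
STATES = [1, 2, 3, 4, 5]
TERMINAL = {1, 5}                       # garbage can and recycle bin
ACTIONS = {"left": -1, "right": +1}
ENTER_REWARD = {1: 2.0, 4: 1.0, 5: 10.0}  # reward paid on arriving in a state


def step(state, action):
    """Return (next_state, immediate_reward) for a non-terminal state."""
    next_state = state + ACTIONS[action]
    return next_state, ENTER_REWARD.get(next_state, 0.0)


def solve_q_table(gamma=GAMMA, sweeps=100):
    """Fill the table by sweeping the same backup over every entry."""
    q = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
    for _ in range(sweeps):
        for s in STATES:
            if s in TERMINAL:
                continue  # nothing to do after a terminal state
            for a in ACTIONS:
                s_next, reward = step(s, a)
                best_next = 0.0 if s_next in TERMINAL else max(q[s_next].values())
                q[s][a] = reward + gamma * best_next
    return q


q_table = solve_q_table()
for s in STATES:
    print(s, {a: round(v, 2) for a, v in q_table[s].items()})
```

The row for state three should match the values read out above: 8.1 for going left and 10 for going right, and the row for state two gives 2 for left and 9 for right.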

  20. SP

    Why is it like that but with two more solutions?

  21. KK

    Sorry, can you repeat? Yeah, here I'm simplifying. I'm not considering the time, uh, remaining. But in practice, um, you... If, if I remove the time component, so I remove the fact that there's a three-minute deadline before the garbage collector comes, then, uh, this would, uh, uh, be slightly more difficult because you would have to do a time series essentially of adding the discount times the reward that you collect. Yeah. But I'm simplifying here, and that's why I use the three-minute rule. Any question on the Q table? Super. Okay. So, uh, this was the Q table, and in fact, we can put together our strategy for gamma equals zero point nine. Uh, the best strategy is still the same. You go to the right, and, uh, you can expect a return of nine. Now, uh, one of the most important concepts in reinforcement learning is this equation on the board, uh, called, uh, the Bellman optimality equation. Oftentimes, you'll see it's noted as Q star of state S and action A equals R plus gamma times the max of that same function applied to S prime, A prime. Let me explain this equation for you because it's super important. This equation is called the optimality equation because your optimal Q table will follow this equation. If you have finished the game, this equation can be applied to any state action pair, and it will still be true. The intuition behind why, um, the Bellman equation is the optimality equation is that, um, if you're in a-- if you have the perfect Q function, Q table, um, and you're in a certain state and you perform a certain action A, you will observe a reward, and this reward will, uh, you know, you, you have taken an action, so you would be in a new state. And from that new state, you can repeat what you just did, right? And because, uh, you've done the backtracking and stuff like that, you will, uh, get this equation to be true because it's the reward plus discount times the best next action that you could be taking. Does that make sense? 
Any question on that? That's exactly the backtracking that we did, by the way. Immediate reward plus discount times the best possible action that you can take in the next state, S prime. The last concept I cover in terms of vocabulary is the policy. The policy is the function that, given your state, is gonna tell you what to do. And in Q-learning, the way this policy is defined is argmax of Q star, um, across the action. So essentially what it says is, like, look in the table and look at a certain state S. You want the policy, which is what you should do. It's the function that tells you our best strategy. You just look at the two possible actions, which one has the highest Q value, and select that action. That's it. This is a very simple example, but it, it's the core of, um, Q-learning that, you know, later on you will use policies widely. There's a lot of reinforcement learning algorithms, but this concept of understanding the policy, the function telling us our best strategy. In Q-learning, it's the argmax of the best Q value in a given state. It tells you which action to take. That's the core thing you need to understand. So remember this Bellman equation because we're gonna reuse it in a bit. The main issue, um, with this approach, um, of a Q table is that state and action spaces can be super large and having a matrix that you discover through backtracking, um, and where every time you wanna do an action, you have to look up the given states, the possible action, it becomes impossible. Like imagine you using this algorithm for the game of Go, where there's so many states, there are so many possible actions. You can put your stone anywhere on the board. You can imagine how big this matrix becomes and how impossible it, it is to use. So that's our problem, and that's the moment where deep learning, um, comes into play. So let's look at it. Um, the, the, the, uh-- Oh, actually, before I go there, I'm just gonna cover some vocabulary. 
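Written out, the Bellman optimality equation and the Q-learning policy as described in words above look as follows. The symbols r for the immediate reward, s' and a' for the next state and next action, and π* for the policy are the usual conventions and are assumed here, since the slide itself is not part of the transcript. The last two lines plug in the state-three numbers from the recycling example with gamma equal to 0.9.

```latex
\begin{align*}
Q^*(s, a) &= r + \gamma \max_{a'} Q^*(s', a') \\
\pi^*(s)  &= \arg\max_{a} Q^*(s, a) \\[4pt]
Q^*(S_3, \text{right}) &= 1 + 0.9 \times 10 = 10 \\
\pi^*(S_3) &= \arg\max \{\, Q^*(S_3, \text{left}) = 8.1,\; Q^*(S_3, \text{right}) = 10 \,\} = \text{right}
\end{align*}
```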
We said the environment, the agent, the state, the action, the reward, the total return, and the discount factor. We learned all of that. We saw that the Q table is the matrix of entries representing how good is it to take action A in state S. And the policy is the function that tells us what's the best strategy to adopt, and the Bellman equation is satisfied by the optimal Q table. So let's get to deep Q-learning, which is what I was about to say is we are gonna frame the problem slightly differently. So instead of using a Q table, we're gonna use the fact that neural networks are universal function approximators, and we're gonna define a Q function that's essentially a neural network, so that the function can take a state S and an action A and tell you how good that action is in state S. So instead of a lookup in a matrix, you just run a forward pass in a neural network, and it gives you the answer. That feels like a better solution for games where there's a lot of states and a lot of actions. So here is, uh, same problem statement. In the past, we looked for a Q table, and this time we will look for a neural network. One of the things we're gonna do is to define the output layer to have two outputs. So given a certain state as input, think about it as a one-hot vector encoding the state. So this one is the example of state two, zero, one, zero, zero, zero. If you pass state two in this Q function, uh, with multiple layers, it will give you two outputs. One output that corresponds to, uh, Q of S action right and the other one Q of S action left because it's the two actions. If we had more actions to take, we would just increase the output layer, and we might have many more neurons in the output layer. So the big question is, how the hell are we going to train that network? Because we're not in classic supervised learning. We don't have labels. So this one is a hard question, but, uh, what do you-- what would you do? 
Given we, we don't have traditional X and Y pairs, how are you going to train this neural network? 'Cause remember, at the beginning, this neural network will give you garbage. It will take a, a state S, and it might tell you, "Go to the left or to the right," but it's completely random. So how are you gonna tune it to the level where it makes really good decisions? Yes.
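The setup described above, a one-hot state going in and one Q value per action coming out, can be sketched without any deep learning library. Everything specific here is an assumption for illustration: the single hidden layer of eight units, the ReLU activation, the uniform random initialization, and the output order (right first, then left) are not specified in the lecture. The point is only that an untrained network already produces two numbers for any state, and that those numbers are arbitrary until it is trained.

```python
import random


def one_hot(state, num_states=5):
    """Encode state k (1-based) as a vector with a single 1."""
    return [1.0 if i == state - 1 else 0.0 for i in range(num_states)]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, relu=False):
    """Apply one fully connected layer, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out


def q_network(state, params):
    """Forward pass: returns [Q(s, right), Q(s, left)]."""
    hidden = dense(one_hot(state), params[0], relu=True)
    return dense(hidden, params[1])


rng = random.Random(0)  # fixed seed so the run is repeatable
params = [init_layer(5, 8, rng), init_layer(8, 2, rng)]

print(one_hot(2))  # [0.0, 1.0, 0.0, 0.0, 0.0]
q_right, q_left = q_network(2, params)
action = "right" if q_right >= q_left else "left"
print(q_right, q_left, action)  # arbitrary values before any training
```

Adding more actions would only change the size of the last layer, as noted above.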

  22. SP

    Maybe you can assume based on some prior knowledge.

  23. KK

    Assume based on some prior knowledge. Tell me more. What?

  24. SP

    Like, basically, yeah, if you go to the problem, you have some idea of [continuing]

  25. KK

    So what are the things we know about this problem right now? What are the, the, the rules of the game that we could use in order to... I'm s-I'm seeing what you say. You're saying we could estimate what good looks like, but based on what?

  26. SP

    Like for each problem,

  27. KK

    Okay. So reward structure. You're saying that's one thing we have in every game. We have a reward structure for every state. That definitely should be used in order to estimate the good-- what a good decision looks like. Uh, yeah. The problem is not in every state you will see a reward, and if you look at many games of, like, Go, you might not see a reward until fifty moves.

  28. SP

    Yeah.

  29. KK

    So what do you do in this case? Yes.

  30. SP

    Can we run through a bunch of actions and states and see what the output is and get more data to train the neural net?


Transcript of episode 4E27qlfYw0A
