
Stanford CS230 | Autumn 2025 | Lecture 5: Deep Reinforcement Learning

For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/ai

October 21, 2025

This lecture covers deep reinforcement learning.

To learn more about enrolling in this course, visit: https://online.stanford.edu/courses/cs230-deep-learning
To follow along with the course schedule and syllabus, visit: https://cs230.stanford.edu/syllabus/
More lectures will be published regularly. View the playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rNRRGdS0rBbXOUGA0wjdh1X

Andrew Ng
Founder of DeepLearning.AI
Adjunct Professor, Stanford University’s Computer Science Department

Kian Katanforoosh
CEO and Founder of Workera
Adjunct Lecturer, Stanford University’s Computer Science Department

Kian Katanforoosh (host)

Oct 31, 2025 | 1h 45m

CHAPTERS

0:05 – 5:09
Why Deep Reinforcement Learning now: from Atari to ChatGPT alignment
The lecture opens by reframing the week’s plan to focus on deep reinforcement learning (DRL) and later cover interpretability. It motivates DRL through landmark successes: Atari DQN, AlphaGo, multi-agent games, and modern RLHF for language model alignment.
- DRL as the combination of deep learning + reinforcement learning
- DQN achieving human-level control across many Atari games with one algorithm
- AlphaGo defeating top human Go players and expanding beyond supervised learning limits
- Multi-agent RL examples: StarCraft, Dota-style team play
- RLHF as a key step from GPT-2-like models to ChatGPT-like behavior
5:09 – 13:44
Why supervised learning falls short for Go and other sequential decision problems
Using Go as a case study, the lecture contrasts supervised learning with reinforcement learning. Students discuss what labeled data would look like for Go and why imitation of historical games fails to capture strategy and optimal play.
- Supervised setup: board state as X, next move as Y from expert games
- State-space coverage is impossible; many board positions are never seen
- Labels are ambiguous/ill-defined: even experts aren’t always optimal
- Supervised learning captures local moves, not long-horizon strategy
- Motivation for RL: delayed credit assignment and sequences of decisions
13:44 – 17:38
Core RL vocabulary: agent, environment, state vs observation, reward, transitions
The lecture formalizes RL terminology and explains how an agent interacts with an environment over time. It distinguishes “state” from “observation” to handle partial observability (e.g., fog-of-war).
- Agent takes action A_t; environment transitions S_t → S_{t+1}
- Agent receives observation O_t and reward R_t
- Goal: maximize cumulative reward (return)
- Observation may be a partial view of the underlying state
- A transition is the tuple (S_t, A_t, R_t, S_{t+1}); the interaction loop produces one such tuple at every time step
17:38 – 20:42
Toy RL problem: 'Recycling is Good' MDP and reward design
A small 5-state environment illustrates RL mechanics, terminal states, and reward design. The example highlights why a time/step constraint matters and sets up discounted returns.
- Five states with start state and terminal states (garbage/recycle)
- Rewards: +2 garbage, +1 pickup, +10 recycle bin
- Terminal states end an episode; new episode restarts at start
- Actions are left/right; step limit prevents reward farming loops
- Sets up return maximization with discounting
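The toy environment described above can be sketched as a small Python class. This is only an illustration under stated assumptions: the summary gives the reward values, the left/right actions, and the existence of a step limit, but the ordering of the five states, the step limit of 10, and all names (`RecyclingEnv`, `run_episode`, and so on) are hypothetical.

```python
import random

# Assumed layout (not stated in the summary): index 0 = garbage (terminal),
# 1 = start, 2 = empty, 3 = pickup, 4 = recycle bin (terminal).
REWARDS = [2, 0, 0, 1, 10]   # reward received on entering each state
TERMINAL = {0, 4}
START = 1
LEFT, RIGHT = 0, 1


class RecyclingEnv:
    """Minimal 5-state chain environment with a step limit."""

    def __init__(self, max_steps=10):
        self.max_steps = max_steps  # hypothetical limit that stops reward farming
        self.state = START
        self.steps = 0

    def reset(self):
        self.state = START
        self.steps = 0
        return self.state

    def step(self, action):
        """Apply an action and return (next_state, reward, done)."""
        move = -1 if action == LEFT else 1
        self.state = min(max(self.state + move, 0), len(REWARDS) - 1)
        self.steps += 1
        reward = REWARDS[self.state]
        done = self.state in TERMINAL or self.steps >= self.max_steps
        return self.state, reward, done


def run_episode(env, policy):
    """Roll out one episode and return its (s, a, r, s_next) transitions."""
    transitions = []
    state, done = env.reset(), False
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        transitions.append((state, action, reward, next_state))
        state = next_state
    return transitions


always_right = run_episode(RecyclingEnv(), lambda s: RIGHT)
random_walk = run_episode(RecyclingEnv(), lambda s: random.choice([LEFT, RIGHT]))
```

Under this assumed layout, always moving right collects the pickup reward and then the recycle-bin reward, while a random policy is cut off by the step limit if it never reaches a terminal state.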
20:42 – 30:39
Discounted return and solving via a Q-table (backtracking through outcomes)
The lecture defines discounted return and works through optimal behavior under different discount factors. It shows how a Q-table encodes action values for each state and can be computed by reasoning backward in the toy environment.
- Discount factor γ models time preference/limited resources
- With γ=1, best path is to reach recycle for total reward 11
- Q-table dimensions: (#states) × (#actions)
- Backtracking computes returns by combining immediate reward + discounted future value
- Once Q(s,a) is known, decision-making reduces to table lookup
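The discounted return can be computed with the same backward reasoning, as in the minimal sketch below. The reward sequences assume the pickup (+1) lies on the path from the start state to the recycle bin (+10), and γ=0.9 is an illustrative value that the summary does not specify.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one episode's reward sequence."""
    total = 0.0
    # Work backward: value now = immediate reward + gamma * value of the rest.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Reward sequences for the two candidate paths from the start state
# (assumed layout: start -> empty -> pickup -> recycle bin, or start -> garbage).
go_right = [0, 1, 10]
go_left = [2]

right_undiscounted = discounted_return(go_right, gamma=1.0)  # 0 + 1 + 10
right_discounted = discounted_return(go_right, gamma=0.9)    # 0 + 0.9*1 + 0.81*10
left_discounted = discounted_return(go_left, gamma=0.9)      # 2
```

With γ=1 the rightward path yields the total of 11 quoted above; with γ=0.9 it is still preferable to going left, but the later rewards count for less.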
30:39 – 33:42
Bellman optimality equation and policy extraction from Q-values
After building intuition from the toy example, the lecture introduces the Bellman optimality equation and the concept of a policy. It explains that optimal Q-values satisfy a self-consistency relationship that enables iterative improvement.
- Bellman optimality: Q*(s,a)=r+γ·max_{a'} Q*(s',a')
- Interprets Bellman as “one-step reward + best future continuation”
- Policy π(s)=argmax_a Q*(s,a) selects the best action in each state
- Bellman equation matches the earlier backtracking computation
- Provides the foundation for learning Q-values without enumerating the tree
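One way to see the self-consistency property is to apply the Bellman optimality update repeatedly to a Q-table until it stops changing, then read off the greedy policy. The sketch below does this for the toy environment; the state layout, γ=0.9, the number of sweeps, and the function names are assumptions made for illustration.

```python
GAMMA = 0.9                    # assumed discount factor
REWARDS = [2, 0, 0, 1, 10]     # assumed layout: garbage, start, empty, pickup, recycle
TERMINAL = {0, 4}
ACTIONS = (-1, +1)             # index 0 = left, index 1 = right


def bellman_q_iteration(sweeps=50):
    """Repeatedly apply Q(s,a) <- r + gamma * max_a' Q(s',a')."""
    q = [[0.0, 0.0] for _ in REWARDS]
    for _ in range(sweeps):
        for s in range(len(REWARDS)):
            if s in TERMINAL:
                continue  # no actions are taken from terminal states
            for a, move in enumerate(ACTIONS):
                s_next = s + move
                # A terminal next state contributes no future value.
                future = 0.0 if s_next in TERMINAL else max(q[s_next])
                q[s][a] = REWARDS[s_next] + GAMMA * future
    return q


def greedy_policy(q):
    """pi(s) = argmax_a Q(s,a) for every non-terminal state."""
    return {s: max(range(len(ACTIONS)), key=lambda a: q[s][a])
            for s in range(len(q)) if s not in TERMINAL}


q_star = bellman_q_iteration()
policy = greedy_policy(q_star)   # maps state index -> action index
```

Under these assumptions the table settles after a handful of sweeps and the greedy policy moves right from every non-terminal state, matching the backtracking result.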
33:42 – 36:46
Why Q-tables don’t scale and the shift to Deep Q-Learning (DQN)
The lecture explains the scalability limits of tabular RL in large state/action spaces like Go. It then introduces replacing the Q-table with a neural network function approximator.
- Tabular methods become intractable with huge state/action spaces
- Neural networks serve as universal function approximators for Q(s,a)
- Network outputs Q-values for each action given a state
- Moving from lookup tables to forward passes enables scaling
- Key question becomes: how to train without labeled targets
36:46 – 51:30
Training DQN using Bellman targets: creating labels from experience
This section derives the DQN training signal by using the Bellman equation to define a target. It explains the two forward passes used to compute current Q-values and a bootstrapped target estimate, plus the practical stop-gradient idea.
- No ground-truth labels; targets are bootstrapped from the model itself
- Target y = r + γ·max_{a'} Q(s',a') (one-step lookahead)
- Two forward passes: evaluate Q(s,·) for action selection and Q(s',·) for target
- For stability, treat the target as fixed during backprop (no gradient through y)
- Iterate experience → target estimate → gradient update → improved estimates
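A single training step built from these ideas might look like the sketch below. To keep it dependency-free, a plain table `q[state][action]` stands in for the neural network (one parameter per state-action pair), which is a simplification rather than the lecture's setup; `td_update`, γ=0.9, and the learning rate are hypothetical choices.

```python
def td_update(q, s, a, r, s_next, gamma=0.9, lr=0.5):
    """One squared-error gradient step on Q(s,a) toward a fixed Bellman target."""
    # Forward pass 1: current estimate for the action that was taken.
    prediction = q[s][a]
    # Forward pass 2: bootstrapped target from the next state. It is computed
    # as a plain number, so it is treated as a constant in the gradient below.
    target = r + gamma * max(q[s_next])
    # Loss = 0.5 * (target - prediction)**2, so d(loss)/d(prediction) is:
    grad = prediction - target
    q[s][a] = prediction - lr * grad
    return target, q[s][a]


# Tabular stand-in for the network: q_values[state][action].
q_values = [[0.0, 0.0], [0.0, 4.0]]
target, updated = td_update(q_values, s=0, a=1, r=1.0, s_next=1)
```

Here the target is 1 + 0.9·4 = 4.6, and the estimate for the taken action moves halfway toward it; only `q_values[0][1]` changes, while the entry used to build the target is left untouched.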
51:30 – 53:32
DQN pseudocode loop: episodes, timesteps, actions, and updates
The lecture consolidates the DQN idea into a straightforward training loop. It clarifies the role of episodes and how each step generates data to update the Q-network.
- Initialize network parameters randomly
- Loop over episodes (start state to terminal state)
- At each timestep: pick the action with the highest Q-value, execute it, observe (r, s')
- Compute target from s' and update network via gradient descent
- DQN mimics supervised learning using self-generated targets
53:32 – 1:00:50
Breakout case study: defining inputs/outputs and practical preprocessing
Using Atari Breakout, the lecture grounds DQN design choices: state representation, action space, and why raw pixels need preprocessing. It motivates cropping, grayscale conversion, and adding frame history to infer velocity.
- Input/state: (preprocessed) game frames; output: Q-values for actions (left/right/idle)
- Remove irrelevant pixels (score, borders) and reduce dimensionality
- Convert RGB to grayscale when color isn’t informative
- Add history of frames (e.g., 4) to capture motion direction
- CNN architecture for pixel inputs; final layer outputs action values
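The preprocessing steps above (crop, grayscale, frame history) can be sketched with plain Python lists standing in for image arrays. The luminance weights are a common convention rather than something the summary specifies, and the crop bounds, frame size, and names such as `FrameStack` are hypothetical.

```python
from collections import deque


def crop(frame, top, bottom, left, right):
    """Keep rows top..bottom-1 and columns left..right-1."""
    return [row[left:right] for row in frame[top:bottom]]


def to_grayscale(frame):
    """Convert rows of (r, g, b) pixels to rows of luminance values."""
    # Common luminance weights; assumed here, not taken from the lecture.
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in frame]


class FrameStack:
    """Keeps the most recent `size` preprocessed frames as the agent's state."""

    def __init__(self, size=4):
        self.frames = deque(maxlen=size)

    def push(self, frame):
        self.frames.append(frame)
        # At the start of an episode, repeat the first frame to fill the history.
        while len(self.frames) < self.frames.maxlen:
            self.frames.append(frame)
        return list(self.frames)


# Tiny 4x4 RGB frame: row 0 plays the role of the score bar to be removed.
raw = [[(255, 255, 255)] * 4] + [[(0, 0, 0)] * 4 for _ in range(3)]
stack = FrameStack(size=4)
state = stack.push(to_grayscale(crop(raw, top=1, bottom=4, left=0, right=4)))
```

The resulting `state` is a list of four 3×4 grayscale frames; in a real pipeline the stacked frames would form the input channels of the CNN.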
1:00:50 – 1:10:16
Stabilizing and improving DQN: terminal handling and Experience Replay
The lecture adds key engineering refinements to vanilla DQN training. It explains how to treat terminal states correctly and introduces experience replay to reduce correlation and increase sample efficiency.
- If next state is terminal, target is just immediate reward (no future term)
- Experience replay stores transitions (s,a,r,s') in a replay buffer D
- Train on random minibatches sampled from D rather than only latest transition
- Benefits: reduces correlation of sequential frames; reuses rare/high-value experiences
- Mentions extensions like prioritized replay (sampling important transitions more often)
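A uniform replay buffer and the terminal-aware target can be sketched as follows. The stored tuple adds a `done` flag to (s,a,r,s') so the target function knows when to drop the future term; the class and function names, the capacity, and γ=0.9 are illustrative assumptions, and prioritized replay is not implemented here.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)  # oldest transitions are dropped first

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size, rng=random):
        """Uniformly sample a minibatch without replacement."""
        return rng.sample(list(self.storage), min(batch_size, len(self.storage)))


def bellman_targets(batch, q, gamma=0.9):
    """Target is r for terminal transitions, else r + gamma * max_a' Q(s', a')."""
    return [r if done else r + gamma * max(q[s_next])
            for (_, _, r, s_next, done) in batch]


buffer = ReplayBuffer(capacity=100)
buffer.add(1, 1, 0.0, 2, False)   # ordinary step
buffer.add(3, 1, 10.0, 4, True)   # step that ends the episode
q_table = [[0.0, 0.0], [0.0, 0.0], [0.0, 5.0], [0.0, 0.0], [0.0, 0.0]]
batch = buffer.sample(2, rng=random.Random(0))
targets = bellman_targets(batch, q_table)
```

The terminal transition's target is just its reward of 10, while the ordinary transition bootstraps from the next state's best Q-value.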
1:10:16 – 1:19:58
Exploration vs exploitation: epsilon-greedy to avoid local traps
A simple three-action example shows how greedy action selection can get stuck in suboptimal behavior and never discover better outcomes. The lecture introduces epsilon-greedy exploration to periodically try random actions and discover higher rewards.
- Pure argmax-Q can lock into local optima before discovering better states
- Example: agent learns reward=1 terminal and never finds reward=1000 terminal
- Exploration-exploitation tradeoff mirrors real-world route-finding/learning behaviors
- Epsilon-greedy: with probability ε take a random action, else exploit best known action
- Modern training loop combines replay memory + ε-greedy + terminal handling
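Epsilon-greedy action selection is short enough to write out directly. In this sketch the Q-values mimic the three-action example, where the agent has so far only seen the small reward; the function name, the ε values, and the seeded generator are assumptions for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q_values = [1.0, 0.0, 0.0]   # only the small reward has been discovered so far

# Pure exploitation: always the same action, so other outcomes are never seen.
greedy_choices = {epsilon_greedy(q_values, 0.0, rng) for _ in range(100)}
# With exploration, the other actions get tried occasionally.
explore_choices = {epsilon_greedy(q_values, 0.5, rng) for _ in range(1000)}
```

In practice ε is often decayed over training so that the agent explores heavily early on and exploits more as its estimates improve; that schedule is a common design choice, not something stated in the summary.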
1:19:58 – 1:30:05
Beyond DQN: harder games, sparse rewards, PPO, self-play, and multi-agent RL
The lecture surveys why some environments (e.g., Montezuma’s Revenge) remain difficult due to sparse/delayed rewards and the need for priors or imitation. It briefly contrasts DQN with policy-gradient-style methods like PPO and shows examples of continuous control and self-play.
- Montezuma’s Revenge illustrates extremely delayed/sparse rewards and a low chance of discovering them through random exploration
- Humans use priors/intuition; motivates imitation learning and better initialization
- PPO as policy-based (learn policy directly), often better for continuous actions
- Examples: continuous control tasks and competitive self-play (Sumo)
- Mentions OpenAI Five, AlphaStar, fog-of-war partial observability, and AlphaGo insights
1:30:05 – 1:37:12
RLHF pipeline: from next-token pretraining to supervised fine-tuning (SFT)
The lecture transitions to RLHF by first summarizing language model pretraining and why it doesn’t guarantee helpfulness. It introduces supervised fine-tuning on human-written prompt-response pairs as the first alignment step and notes its cost and generalization limits.
- Pretraining objective: next-token prediction on broad internet text
- Misalignment issues: continuation vs helpful answering; lack of politeness/helpfulness objectives
- SFT: fine-tune on human prompt→response demonstrations to imitate helpful behavior
- SFT data is expensive (e.g., limited prompt-response pairs)
- SFT is imitation-based and may not capture preference nuances or generalize fully
1:37:12 – 1:45:00
Reward model and RLHF as preference optimization over full responses
The lecture explains training a reward model from human preference rankings and then using RL (commonly PPO) to optimize the language model against this learned reward. It maps RL concepts (states, actions, episodes) to token generation and emphasizes sparse end-of-sequence rewards.
- Collect preference data: rank multiple sampled SFT responses for the same prompt
- Train a reward model (RM) by adding a scalar ‘reward head’ to the SFT backbone
- RM learns to assign higher scores to preferred responses (proxy for human judgment)
- RLHF loop: LM generates; RM scores; LM updates policy to maximize expected reward
- Rewards typically applied at sequence end → sparse, episodic credit assignment
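One common way to train a reward model from rankings is a pairwise logistic loss, which is small when the preferred response scores higher than the rejected one. The summary does not say which loss the lecture uses, so the loss form below is an assumption, as are the function names and the example scores.

```python
import math


def pairwise_preference_loss(score_preferred, score_rejected):
    """-log(sigmoid(score_preferred - score_rejected)), computed stably."""
    margin = score_preferred - score_rejected
    # Equivalent to log(1 + exp(-margin)) without overflow for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def ranking_to_pairs(ranked_scores):
    """Turn scores listed best-to-worst into (preferred, rejected) pairs."""
    return [(ranked_scores[i], ranked_scores[j])
            for i in range(len(ranked_scores))
            for j in range(i + 1, len(ranked_scores))]


# Hypothetical reward-model scores for three responses to one prompt,
# ordered from the human's most preferred to least preferred.
scores = [2.0, 0.5, -1.0]
pairs = ranking_to_pairs(scores)
mean_loss = sum(pairwise_preference_loss(p, r) for p, r in pairs) / len(pairs)
```

A ranking of k responses yields k(k-1)/2 comparison pairs, which is one reason rankings are a data-efficient way to collect preferences. The trained reward model's scalar output then serves as the end-of-sequence reward in the RL stage.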