Stanford Online
Stanford CS230 | Autumn 2025 | Lecture 5: Deep Reinforcement Learning
CHAPTERS
Why CS230 switches to Deep Reinforcement Learning (and where RLHF fits)
The lecture opens by motivating the topic swap: interpretability/visualization is postponed, and deep reinforcement learning (DRL) is introduced as the “marriage” of deep learning and reinforcement learning. A preview of the second half frames Reinforcement Learning from Human Feedback (RLHF) as a key step behind the jump from GPT-2–like models to ChatGPT-style alignment.
Big wins that popularized reinforcement learning: Atari, AlphaGo, multi-agent games, RLHF
The instructor surveys landmark results that drove RL adoption: DeepMind’s Atari DQN, AlphaGo, and later complex multi-agent environments like Dota/StarCraft. The chapter connects these successes to RLHF as a more recent reinforcement-learning application in language models.
Why supervised learning struggles with Go: missing states, missing strategy, and ill-defined ground truth
Using Go as a case study, the lecture contrasts supervised learning (“teach by example”) with RL (“teach by experience”). It highlights why imitation-only approaches break down: the state space is enormous, labels are ambiguous, and single-step predictions miss long-term strategy.
Core RL vocabulary: agent, environment, state vs observation, rewards, transitions
The lecture formalizes the RL setup: an agent acts in an environment, observes rewards and observations, and induces state transitions over time. It clarifies why “observation” can differ from “state” (partial observability), using examples like fog-of-war in strategy games.
Toy RL environment: “Recycling is Good” and the idea of episodes/terminal states
A 5-state environment illustrates RL mechanics, reward shaping, and terminal states. The “garbage collector in 3 minutes” rule is added to prevent degenerate looping for small rewards and to enforce finite-horizon planning.
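A minimal sketch of such an environment may help make episodes and terminal states concrete. The layout, the reward values, the start cell, and the three-step limit below are illustrative assumptions, not figures taken from this summary; the step limit is only a stand-in for the "3 minutes" rule.

```python
from dataclasses import dataclass

# Hypothetical layout: five cells in a row, indexed 0..4.
# All numbers below are assumptions made for illustration.
REWARDS = {0: 2.0, 3: 1.0, 4: 10.0}  # reward collected on entering a cell
TERMINAL = {0, 4}                    # entering these cells ends the episode
MAX_STEPS = 3                        # stand-in for the "3 minutes" rule
START = 1


@dataclass
class StepResult:
    state: int
    reward: float
    done: bool


class RecyclingEnv:
    def __init__(self):
        self.reset()

    def reset(self) -> int:
        self.state = START
        self.steps = 0
        return self.state

    def step(self, action: int) -> StepResult:
        """Apply action -1 (move left) or +1 (move right)."""
        if action not in (-1, 1):
            raise ValueError("action must be -1 or +1")
        self.state = min(max(self.state + action, 0), 4)
        self.steps += 1
        reward = REWARDS.get(self.state, 0.0)
        # The episode ends on a terminal cell or when time runs out.
        done = self.state in TERMINAL or self.steps >= MAX_STEPS
        return StepResult(self.state, reward, done)


def run_episode(env, actions):
    """Play a fixed action sequence and return the undiscounted total reward."""
    env.reset()
    total = 0.0
    for action in actions:
        result = env.step(action)
        total += result.reward
        if result.done:
            break
    return total
```

In this sketch the +1 cell pays out every time it is entered, so without `MAX_STEPS` an agent could bounce in and out of it indefinitely; the step limit is what cuts that loop short.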
Discounted return and optimal strategy under different gamma values
The lecture defines return as a discounted sum of rewards and interprets discounting as time preference (money now vs later, robot energy). Students compute the best strategy when gamma=1 and then consider gamma=0.9, illustrating how discounting changes effective value.
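The discounted return can be sketched in a few lines. The reward sequences in the usage note are hypothetical examples, not the numbers worked through in the lecture.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t], with t starting at 0."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total
```

For an assumed reward sequence `[0.0, 1.0, 10.0]`, `discounted_return(..., 1.0)` gives 11.0, while `gamma = 0.9` gives roughly 0.9 + 8.1 = 9.0: the same rewards are worth less the later they arrive.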
Q-tables, backtracking, and the Bellman optimality equation
A Q-table is introduced as a complete solution: it stores the value of each action in each state. The lecture computes Q-values by backtracking through a tree of possible moves, then generalizes the logic into the Bellman optimality equation and defines the policy as argmax over Q.
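For a deterministic environment, the Bellman optimality equation can be written as Q*(s, a) = r + gamma * max over a' of Q*(s', a'), where s' is the state reached by taking a in s and the future term is zero when s' is terminal. The sketch below fills a Q-table by sweeping that update until it settles, which is a swapped-in technique (value-iteration-style sweeps) rather than the tree backtracking used in the lecture. The five-cell chain, its rewards, and gamma = 0.9 are assumptions for illustration.

```python
GAMMA = 0.9                          # assumed discount factor
N_STATES = 5                         # hypothetical five-cell chain, cells 0..4
ACTIONS = (-1, 1)                    # move left, move right
REWARDS = {0: 2.0, 3: 1.0, 4: 10.0}  # assumed reward on entering a cell
TERMINAL = {0, 4}


def next_state(state, action):
    return min(max(state + action, 0), N_STATES - 1)


def solve_q_table(sweeps=100):
    """Apply the Bellman optimality update repeatedly to every (state, action)."""
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(sweeps):
        for s in range(N_STATES):
            if s in TERMINAL:
                continue  # no actions are taken from terminal cells
            for a in ACTIONS:
                s2 = next_state(s, a)
                if s2 in TERMINAL:
                    future = 0.0  # nothing follows a terminal state
                else:
                    future = max(q[(s2, b)] for b in ACTIONS)
                q[(s, a)] = REWARDS.get(s2, 0.0) + GAMMA * future
    return q


def greedy_policy(q):
    """Policy = argmax over actions of Q(s, a) for each non-terminal state."""
    return {
        s: max(ACTIONS, key=lambda a: q[(s, a)])
        for s in range(N_STATES)
        if s not in TERMINAL
    }
```

Under these assumptions the greedy policy moves right from every non-terminal cell, because the discounted +10 at the far end still outweighs the immediate +2 on the left.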
Why tabular Q-learning breaks—and how Deep Q-Networks (DQN) replace the table
The lecture explains why Q-tables become infeasible in large state/action spaces (e.g., Go). Deep Q-learning replaces the table with a neural network approximator, outputting Q-values for each action given a state, enabling generalization and tractability.
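The interface change from table to network can be sketched with a tiny hand-rolled MLP: one forward pass takes a state vector and returns one Q-value per action. The layer sizes, the initialization range, and the class name `TinyQNetwork` are hypothetical; a real DQN would use a deep learning framework and a trained model.

```python
import random


def make_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(x, weights, biases):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


class TinyQNetwork:
    """Maps a state vector to a list of Q-values, one per action (untrained)."""

    def __init__(self, n_features, n_hidden, n_actions, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = make_layer(n_features, n_hidden, rng)
        self.w2, self.b2 = make_layer(n_hidden, n_actions, rng)

    def q_values(self, state):
        hidden = relu(linear(state, self.w1, self.b1))
        return linear(hidden, self.w2, self.b2)


def greedy_action(q_values):
    """Index of the largest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Because the network's size is fixed by its weights rather than by the number of states, similar states share parameters, which is where the generalization comes from.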
Training DQN using bootstrapped targets from the Bellman equation (two forward passes)
The lecture derives the DQN loss by constructing a target from observed reward plus discounted max future Q, using the network itself to estimate the future term. It emphasizes the bootstrapping nature of learning (targets improve as the network improves) and notes practical simplifications like treating the target as fixed during backprop.
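The two forward passes can be sketched as follows. `q_fn` is a hypothetical stand-in for the network (any callable that maps a state to a list of Q-values), the squared-error form of the loss is an assumption, and terminal states are ignored here for brevity.

```python
def dqn_target(reward, next_q_values, gamma):
    """Bootstrapped target: observed reward plus discounted best future Q."""
    return reward + gamma * max(next_q_values)


def dqn_loss(q_fn, state, action, reward, next_state, gamma=0.9):
    """Squared error between the target and the current estimate Q(s, a)."""
    # Forward pass 1: the network's current estimate for the action taken.
    q_sa = q_fn(state)[action]
    # Forward pass 2: the same network evaluated on the next state.
    # The result is treated as a fixed number, not differentiated through.
    target = dqn_target(reward, q_fn(next_state), gamma)
    return (target - q_sa) ** 2
```

As a usage example with a toy lookup in place of a network, `q_fn = lambda s: {0: [1.0, 2.0], 1: [0.5, 3.0]}[s]` and the transition (state 0, action 1, reward 1.0, next state 1) give a target of 1.0 + 0.9 * 3.0 = 3.7 against an estimate of 2.0.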
Breakout as a concrete DQN example: inputs/outputs, preprocessing, and CNN architecture
Breakout is used to ground the abstractions: states are game frames, actions are left/right/idle, and rewards correspond to game outcomes. The chapter covers practical input engineering—cropping score regions, grayscale, resizing—and the crucial need for frame history to infer motion; a CNN is used to process pixels.
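A rough sketch of that input pipeline, using nested lists so it runs without extra libraries. The crop bounds, the stride-based downsampling (a crude stand-in for proper resizing), and the history length of 4 are assumptions; the grayscale weights are the common luma coefficients, not something stated in this summary.

```python
from collections import deque


def to_grayscale(frame):
    """frame: rows of (r, g, b) tuples -> rows of luminance floats."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in frame]


def crop_rows(frame, top, bottom):
    """Keep rows top..bottom-1, e.g. to cut off a score display."""
    return frame[top:bottom]


def downsample(frame, factor):
    """Keep every factor-th pixel in both directions."""
    return [row[::factor] for row in frame[::factor]]


def preprocess(frame, top, bottom, factor):
    return downsample(to_grayscale(crop_rows(frame, top, bottom)), factor)


class FrameStack:
    """Holds the most recent frames so the network can infer motion."""

    def __init__(self, size=4):
        self.frames = deque(maxlen=size)

    def push(self, frame):
        if not self.frames:
            # At the start of an episode, repeat the first frame to fill the stack.
            for _ in range(self.frames.maxlen):
                self.frames.append(frame)
        else:
            self.frames.append(frame)  # the oldest frame drops out automatically
        return list(self.frames)
```

A single frame cannot show which way the ball is moving; the stacked list returned by `push` is what would be fed to the CNN as one state.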
Stabilizing and improving learning: terminal handling, experience replay, and epsilon-greedy exploration
The lecture introduces training “tricks” that made RL practical: correctly handling terminal states in targets, experience replay to reduce correlation and reuse data, and epsilon-greedy exploration to avoid local optima. These ideas address instability, inefficiency, and getting stuck when greedy behavior never discovers high-reward paths.
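The three tricks can each be sketched in a few lines. The class and function names, the buffer capacity, and the transition tuple layout are hypothetical choices for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Stores past transitions; sampling at random breaks up correlated sequences."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_target(reward, next_q_values, done, gamma):
    """Target with terminal handling: no future term once the episode has ended."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)
```

Each stored transition can be sampled many times during training, which is the data-reuse benefit; a common pattern, assumed here rather than taken from the lecture, is to start with a large `epsilon` and shrink it as the Q-estimates improve.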
Beyond DQN: sparse rewards, imitation learning intuition, PPO and self-play
The lecture briefly surveys harder settings and modern algorithms. Montezuma’s Revenge illustrates sparse/delayed rewards and the need for better priors (imitation learning/human intuition). It then contrasts value-based DQN with policy-based PPO, highlighting continuous actions and showing examples like locomotion, sumo self-play, and multi-agent games.
RLHF for language models: from next-token prediction to supervised fine-tuning, reward models, and preference optimization
The final segment reframes RL in the LLM context: pretraining optimizes next-token prediction but doesn’t ensure helpfulness or alignment. Supervised fine-tuning (SFT) teaches imitation using human prompt-response data, then a reward model is trained on human preference rankings; RL then optimizes the language model to maximize reward, yielding RLHF-aligned behavior.
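One common way to train a reward model on preference rankings is a pairwise loss that pushes the score of the preferred response above the score of the less preferred one. The summary does not say which loss the lecture uses, so the Bradley-Terry-style form below, -log sigmoid(r_preferred - r_other), is an assumption, and the scalar scores stand in for the outputs of a hypothetical reward model.

```python
import math
from itertools import combinations


def pairwise_preference_loss(reward_preferred, reward_other):
    """-log(sigmoid(reward_preferred - reward_other)), computed stably."""
    margin = reward_preferred - reward_other
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def ranking_loss(rewards_best_to_worst):
    """Average pairwise loss over every (better, worse) pair in a ranking."""
    pairs = list(combinations(rewards_best_to_worst, 2))
    if not pairs:
        return 0.0
    return sum(pairwise_preference_loss(hi, lo) for hi, lo in pairs) / len(pairs)
```

The loss is small when the model already scores responses in the order humans ranked them and grows when the order is reversed; the trained reward model then supplies the scalar reward that the RL step maximizes.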