Stanford Online
Stanford CS230 | Autumn 2025 | Lecture 5: Deep Reinforcement Learning
At a glance
WHAT IT’S REALLY ABOUT
Deep reinforcement learning fundamentals, deep Q-networks, and RLHF for LLMs
- Reinforcement learning (RL) is framed as learning good sequences of decisions from experience, especially when labels are delayed and supervised learning targets are ill-defined (e.g., Go).
- The lecture builds intuition for value functions via Q-tables, the Bellman optimality equation, and policies, then shows why tabular methods fail in large state/action spaces.
- Deep Q-Networks (DQN) replace the Q-table with a neural network trained using Bellman-consistent bootstrapped targets, enabling pixel-based control in Atari games like Breakout.
- Key DQN training stabilizers are introduced: state preprocessing (frame stacking), terminal-state handling, experience replay to reduce correlation and reuse data, and ε-greedy exploration to avoid local optima.
- RLHF is presented as a modern alignment pipeline for LLMs: pretraining (next-token prediction) → supervised fine-tuning (human demonstrations) → reward model learning from human preferences → RL optimization (often PPO) to maximize reward-model scores.
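Two of the stabilizers listed above, experience replay and ε-greedy exploration, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lecture's code: the names `ReplayBuffer` and `epsilon_greedy`, the buffer capacity, and the toy transitions are all hypothetical.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size store of past transitions (hypothetical helper)."""

    def __init__(self, capacity):
        # A deque with maxlen silently evicts the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up the correlation
        # between consecutive frames and lets old experience be reused.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))


# Toy usage: five transitions pushed into a buffer that holds only three.
buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(state=t, action=0, reward=0.0, next_state=t + 1, done=False)

batch = buffer.sample(batch_size=2)
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
random_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=1.0)
```

With `epsilon=0.0` the agent always exploits its current estimates, and with `epsilon=1.0` it always explores. Training schedules commonly decay ε between those extremes, though the exact schedule is a design choice not specified here.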
IDEAS WORTH REMEMBERING
RL is suited to delayed feedback and strategy, not one-step imitation.
For tasks like Go, supervised learning on expert moves can’t cover the state space, can’t capture long-term intent, and bakes in an ill-defined “ground truth,” while RL optimizes long-horizon returns from experience.
State and observation can differ; partial observability matters.
In fully observed games (Go, chess), the observation equals the state, but in fog-of-war settings (StarCraft, League of Legends), the agent must act on incomplete observations, which changes what it has to learn and remember.
The Bellman equation is the backbone of value-based RL.
Optimal action values satisfy “immediate reward + discounted best future value,” which motivates both tabular backtracking in toy problems and bootstrapped learning in large problems.
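Written out, the "immediate reward + discounted best future value" statement takes the following standard form. The symbols are assumptions chosen for this note: s and a are the current state and action, r the immediate reward, s' the next state, a' a candidate next action, and γ the discount factor.

```latex
Q^{*}(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \;\middle|\; s, a \,\right]
```

In a deterministic environment the expectation drops out, leaving Q*(s, a) = r + γ max over a' of Q*(s', a').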
Q-tables ‘solve’ small MDPs but collapse under combinatorics.
A lookup table over (state × action) becomes infeasible for large environments like Go or pixel-based games, motivating neural networks as function approximators (DQN).
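To make the tabular idea concrete, here is a small sketch that fills a Q-table for a made-up five-state corridor by repeatedly applying the Bellman optimality update. The environment, the reward of 1.0 for reaching the right end, and the discount of 0.9 are all assumptions for illustration, not the lecture's example.

```python
import numpy as np

N_STATES = 5          # states 0..4 on a line; 0 and 4 are terminal
ACTIONS = (-1, +1)    # action 0 moves left, action 1 moves right
GAMMA = 0.9           # discount factor (assumed value)


def step(state, action_idx):
    """Deterministic transition: returns (next_state, reward, done)."""
    next_state = state + ACTIONS[action_idx]
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state in (0, N_STATES - 1)
    return next_state, reward, done


def solve_q_table(n_sweeps=50):
    """Sweep over every (state, action) pair, applying the Bellman update."""
    q = np.zeros((N_STATES, len(ACTIONS)))
    for _ in range(n_sweeps):
        for s in range(1, N_STATES - 1):      # only non-terminal states act
            for a in range(len(ACTIONS)):
                s2, r, done = step(s, a)
                # Terminal transitions contribute no future value.
                future = 0.0 if done else GAMMA * q[s2].max()
                q[s, a] = r + future
    return q


q_table = solve_q_table()
greedy_policy = q_table.argmax(axis=1)   # best action index per state
```

The table here has only 5 × 2 entries, so exhaustive sweeps are trivial. The same loop needs one entry per (state, action) pair, which is the quantity that becomes unmanageable for board games or raw pixels.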
DQN trains without external labels by bootstrapping targets from itself.
The target for Q(s,a) is constructed from observed reward plus γ times the network’s estimate of the best next-step value, treating that target as fixed for backprop in each update.
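The target construction described above can be sketched with NumPy as follows. This is an illustrative fragment, not a full DQN: the function names `dqn_targets` and `td_loss`, the squared-error loss, and the toy batch values are assumptions. Terminal transitions drop the bootstrap term, matching the terminal-state handling mentioned earlier.

```python
import numpy as np


def dqn_targets(rewards, next_q_values, dones, gamma=0.99):
    """Bootstrapped targets: r + gamma * max_a' Q(s', a'), or just r if terminal.

    next_q_values has shape (batch, n_actions) and holds the network's own
    estimates for the next states. In an autodiff framework these targets
    would be treated as constants (no gradient flows through them).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    next_q_values = np.asarray(next_q_values, dtype=np.float64)
    not_done = 1.0 - np.asarray(dones, dtype=np.float64)
    best_next = next_q_values.max(axis=1)
    return rewards + gamma * not_done * best_next


def td_loss(q_taken, targets):
    """Mean squared error between Q(s, a) for the taken actions and the targets."""
    q_taken = np.asarray(q_taken, dtype=np.float64)
    return float(np.mean((q_taken - targets) ** 2))


# Toy batch of two transitions; the second one ends the episode.
targets = dqn_targets(
    rewards=[1.0, 0.0],
    next_q_values=[[0.5, 2.0], [3.0, 1.0]],
    dones=[False, True],
    gamma=0.9,
)
loss = td_loss(q_taken=[2.8, 1.0], targets=targets)
```

For the first transition the target is 1.0 + 0.9 × 2.0 = 2.8. For the second, the episode has ended, so the target is just the reward, 0.0.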
WORDS WORTH SAVING
If you had to remember in one sentence what's RL, RL is making good sequences of decisions.
— Kian Katanforoosh
In classic supervised learning, you teach by example. In reinforcement learning, you teach by experience.
— Kian Katanforoosh
Even with a panel of expert that decides every move, you still have an ill-defined ground truth, you know?
— Kian Katanforoosh
The main issue with this approach of a Q table is that state and action spaces can be super large and having a matrix that you discover through backtracking, and where every time you wanna do an action, you have to look up the given states, the possible action, it becomes impossible.
— Kian Katanforoosh
The problem of RLHF is to align not with human responses, but with human preferences.
— Kian Katanforoosh
High quality AI-generated summary created from speaker-labeled transcript.