Stanford Online

Stanford CS230 | Autumn 2025 | Lecture 5: Deep Reinforcement Learning

October 21, 2025

This lecture covers deep reinforcement learning.

For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/ai
To learn more about enrolling in this course, visit: https://online.stanford.edu/courses/cs230-deep-learning
To follow along with the course schedule and syllabus, visit: https://cs230.stanford.edu/syllabus/
More lectures will be published regularly. View the playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rNRRGdS0rBbXOUGA0wjdh1X

Andrew Ng
Founder of DeepLearning.AI
Adjunct Professor, Stanford University’s Computer Science Department

Kian Katanforoosh
CEO and Founder of Workera
Adjunct Lecturer, Stanford University’s Computer Science Department

Kian Katanforoosh (host)
Oct 30, 2025 · 1h 45m

At a glance

WHAT IT’S REALLY ABOUT

Deep reinforcement learning fundamentals, deep Q-networks, and RLHF for LLMs

  1. Reinforcement learning (RL) is framed as learning good sequences of decisions from experience, especially when labels are delayed and supervised learning targets are ill-defined (e.g., Go).
  2. The lecture builds intuition for value functions via Q-tables, the Bellman optimality equation, and policies, then shows why tabular methods fail in large state/action spaces.
  3. Deep Q-Networks (DQN) replace the Q-table with a neural network trained using Bellman-consistent bootstrapped targets, enabling pixel-based control in Atari games like Breakout.
  4. Key DQN training stabilizers are introduced: state preprocessing (frame stacking), terminal-state handling, experience replay to reduce correlation and reuse data, and ε-greedy exploration to avoid local optima.
  5. RLHF is presented as a modern alignment pipeline for LLMs: pretraining (next-token prediction) → supervised fine-tuning (human demonstrations) → reward model learning from human preferences → RL optimization (often PPO) to maximize reward-model scores.
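Two of the stabilizers named in item 4, experience replay and ε-greedy exploration, can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecture's code: the names `ReplayBuffer` and `epsilon_greedy`, the tuple layout of a transition, and the uniform sampling scheme are all assumptions made for the example.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions.

    Hypothetical helper for illustration; the transition layout is an assumption.
    """

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest transition once full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling mixes old and new experience, which reduces the
        # correlation between consecutive transitions and reuses past data.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

In a training loop one would typically call `add` after every environment step, draw a batch with `sample` for each gradient update, and lower `epsilon` over time so the agent explores early and exploits later; the schedule itself is a design choice the summary does not specify.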

IDEAS WORTH REMEMBERING

5 ideas

RL is suited to delayed feedback and strategy, not one-step imitation.

For tasks like Go, supervised learning on expert moves can’t cover the state space, can’t capture long-term intent, and bakes in an ill-defined “ground truth,” while RL optimizes long-horizon returns from experience.

State and observation can differ; partial observability matters.

In fully observed games (Go, chess), the observation equals the state, but in fog-of-war settings (StarCraft, League of Legends) the agent sees only part of the state and must act on incomplete observations, which changes how learning and memory must work.

The Bellman equation is the backbone of value-based RL.

Optimal action values satisfy “immediate reward + discounted best future value,” which motivates both tabular backtracking in toy problems and bootstrapped learning in large problems.
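The "immediate reward + discounted best future value" rule can be made concrete on a toy problem. The sketch below assumes a hypothetical four-state corridor (states 0 to 3, state 3 terminal, reward 1 for reaching it) and a discount of 0.9; none of these specifics come from the lecture. It repeatedly applies the backup Q(s,a) = r + γ · max over a' of Q(s',a') until the table stops changing.

```python
GAMMA = 0.9           # discount factor (assumed value)
N_STATES = 4          # states 0..3; state 3 is terminal (hypothetical corridor)
ACTIONS = (0, 1)      # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic transition: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def solve_q_table(sweeps=50):
    """Fill a Q-table by sweeping the Bellman optimality backup over all (s, a)."""
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(sweeps):
        for s in range(N_STATES - 1):      # the terminal state has no actions
            for a in ACTIONS:
                s2, r, done = step(s, a)
                future = 0.0 if done else max(q[s2])
                q[s][a] = r + GAMMA * future
    return q


q_table = solve_q_table()
# The greedy policy is the argmax over actions in each non-terminal state.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(q_table, policy)
```

Under these assumptions the values shrink by a factor of γ for every extra step away from the goal, and the greedy policy moves right in every state.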

Q-tables ‘solve’ small MDPs but collapse under combinatorics.

A lookup table over (state × action) becomes infeasible for large environments like Go or pixel-based games, motivating neural networks as function approximators (DQN).
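A rough count shows how quickly the table grows. The sketch below compares a hypothetical 5-state, 2-action toy problem with two crude upper bounds: a Go board where each of the 19 × 19 points is empty, black, or white, and a single 84 × 84 grayscale frame with 256 intensity levels. The toy size and the frame resolution are assumptions for illustration, not figures from the lecture, and both bounds overcount (many Go configurations are illegal, many pixel patterns never occur).

```python
import math


def q_table_entries(num_states, num_actions):
    """Number of cells in a lookup table with one entry per (state, action)."""
    return num_states * num_actions


def decimal_digits_of_power(base, exponent):
    """Digits in base**exponent, via logarithms so the huge integer is never built."""
    return math.floor(exponent * math.log10(base)) + 1


# Hypothetical toy problem: small enough to store and fill exhaustively.
toy_entries = q_table_entries(num_states=5, num_actions=2)

# Crude upper bound on Go board configurations: 3 choices per point, 361 points.
go_state_digits = decimal_digits_of_power(3, 19 * 19)

# ASSUMPTION: one 84x84 8-bit grayscale frame as the state (resolution not in the source).
pixel_state_digits = decimal_digits_of_power(256, 84 * 84)

print(f"toy table cells: {toy_entries}")
print(f"Go state bound has {go_state_digits} decimal digits")
print(f"pixel state bound has {pixel_state_digits} decimal digits")
```

Even before multiplying by the number of actions, the state counts alone rule out storing one row per state, which is the gap a neural network is meant to fill by generalizing across similar states.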

DQN trains without external labels by bootstrapping targets from itself.

The target for Q(s,a) is constructed from observed reward plus γ times the network’s estimate of the best next-step value, treating that target as fixed for backprop in each update.
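The target construction described above can be sketched without any deep learning library. In this illustration the lists of Q-values stand in for network outputs, and the function names, the squared-error loss, and the default γ are assumptions rather than the lecture's exact choices. Terminal transitions get no bootstrap term, since there is no next state to value.

```python
def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """Bootstrapped targets: y = r for terminal steps, else r + gamma * max_a' Q(s', a').

    next_q_values holds, per transition, the network's Q-values for the next state.
    The returned numbers are treated as constants when computing gradients.
    """
    targets = []
    for r, q_next, done in zip(rewards, next_q_values, dones):
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(r + bootstrap)
    return targets


def squared_td_loss(predicted_q, actions, targets):
    """Mean squared error between Q(s, a) for the action taken and its target."""
    errors = [(q[a] - y) ** 2 for q, a, y in zip(predicted_q, actions, targets)]
    return sum(errors) / len(errors)


# Two example transitions: the second one ends the episode.
targets = td_targets(
    rewards=[1.0, 0.0],
    next_q_values=[[0.5, 2.0], [3.0, 1.0]],
    dones=[False, True],
    gamma=0.5,
)
loss = squared_td_loss(
    predicted_q=[[0.0, 1.0], [0.0, 0.0]],
    actions=[1, 0],
    targets=targets,
)
print(targets, loss)
```

In a real implementation the same arithmetic would run on tensors, with gradient flow through the target explicitly blocked so that only the prediction Q(s,a) is updated.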

WORDS WORTH SAVING

5 quotes

If you had to remember in one sentence what's RL, RL is making good sequences of decisions.

Kian Katanforoosh

In classic supervised learning, you teach by example. In reinforcement learning, you teach by experience.

Kian Katanforoosh

Even with a panel of expert that decides every move, you still have an ill-defined ground truth, you know?

Kian Katanforoosh

The main issue, um, with this approach, um, of a Q table is that state and action spaces can be super large and having a matrix that you discover through backtracking, um, and where every time you wanna do an action, you have to look up the given states, the possible action, it becomes impossible.

Kian Katanforoosh

The problem of, um, RLHF is to align not with human responses, but with human preferences.

Kian Katanforoosh

Why supervised learning struggles for Go and long-horizon tasks
RL vocabulary: agent, environment, state vs observation, action, reward, return, discount
Q-tables, policy as argmax, Bellman optimality equation
Deep Q-Learning: bootstrapped targets and neural function approximation
Atari DQN setup: pixel inputs, CNNs, frame history
Training improvements: terminal handling, experience replay, ε-greedy exploration
RLHF pipeline: SFT, reward model from preferences, PPO-style optimization

High quality AI-generated summary created from speaker-labeled transcript.
