
Stanford CS230 | Autumn 2025 | Lecture 10: What’s Going On Inside My Model?

For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/ai

December 2, 2025

This lecture covers what's happening inside your model and provides a class wrap-up.

To learn more about enrolling in this course, visit: https://online.stanford.edu/courses/cs230-deep-learning
Please follow along with the course schedule and syllabus: https://cs230.stanford.edu/syllabus/
View the playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rNRRGdS0rBbXOUGA0wjdh1X

Andrew Ng
Founder of DeepLearning.AI
Adjunct Professor, Stanford University’s Computer Science Department

Kian Katanforoosh
CEO and Founder of Workera
Adjunct Lecturer, Stanford University’s Computer Science Department

Kian Katanforoosh (host)

Dec 15, 2025 · 1h 46m

CHAPTERS

0:05 – 2:37
Lecture roadmap: from CNN interpretability to frontier-model diagnostics
Kian frames the lecture as a broadened take on “neural network interpretability,” spanning what’s well understood for CNNs and what’s still emerging for large frontier models. He previews a packed agenda: a frontier-lab case study, deep CNN visualization methods, then modern representation/scaling/data analysis for transformers and LLMs.
- •Interpretability is mature for CNNs, less settled for frontier LLMs/vision models
- •Plan: case study → CNN deep dive (neurons/feature maps) → frontier-model analysis (scaling, benchmarks, data)
- •Goal is to build transferable intuition and research vocabulary
2:37 – 7:01
Frontier-lab case study: sudden regression in a 200B-parameter checkpoint
Students brainstorm what to inspect when a new checkpoint passes basic sanity checks but regresses on reasoning, fails safety evals, and shows tool-use latency spikes. The discussion highlights practical triage signals before retraining or deep code changes.
- •Do error analysis on failed benchmark/safety examples to find patterns
- •Inspect training/validation loss for smoothness, spikes, and divergence
- •Check recent data batches for corruption, poisoning, or distribution shift
- •Consider hardware/infra issues as a cause of latency and anomalies
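The loss-curve check in the triage list can be sketched as a simple spike detector. This is a minimal illustration, not the procedure shown in the lecture: the function name `find_loss_spikes`, the window size, and the 1.5x threshold are all assumptions chosen for the example, and the loss values are made up.

```python
from statistics import median

def find_loss_spikes(losses, window=5, ratio=1.5):
    """Return step indices whose loss exceeds `ratio` times the median
    of the previous `window` steps (hypothetical thresholds)."""
    spikes = []
    for i in range(window, len(losses)):
        baseline = median(losses[i - window:i])
        if losses[i] > ratio * baseline:
            spikes.append(i)
    return spikes

# Made-up training losses with one obvious spike at step 6.
losses = [2.0, 1.9, 1.8, 1.7, 1.6, 1.5, 4.0, 1.5, 1.4, 1.3]
spikes = find_loss_spikes(losses)
print(spikes)  # [6]
```

A median baseline is used so that one spike does not contaminate the reference for the steps that follow it.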
7:01 – 12:28
What to probe “inside” an LLM: checkpoints, attention, sensitivity, MoE routing
The class shifts from global symptoms to internal inspection ideas for language models. Suggestions include comparing multiple checkpoints, visualizing attention behavior, running sensitivity analysis on hyperparameters, and checking mixture-of-experts utilization and routing behavior.
- •Checkpoint-by-checkpoint comparisons to localize when degradation began
- •Attention map inspection as a first-pass internal visualization
- •Sensitivity analysis over optimizer/LR schedule/compute-data-model balance
- •Mixture-of-experts failure modes: collapsed routing, unused experts, reduced effective capacity
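The mixture-of-experts failure modes above (collapsed routing, unused experts) can be made measurable with a per-expert load histogram and its entropy. The sketch below assumes top-1 routing and a matrix of router logits with one row per token; `expert_utilization` and the toy logits are hypothetical, not taken from the lecture.

```python
import numpy as np

def expert_utilization(router_logits):
    """router_logits: (num_tokens, num_experts). Assumes top-1 routing.
    Returns the fraction of tokens per expert and a normalized entropy
    (1.0 = perfectly balanced, 0.0 = all tokens on one expert)."""
    num_experts = router_logits.shape[1]
    chosen = router_logits.argmax(axis=1)
    counts = np.bincount(chosen, minlength=num_experts)
    load = counts / counts.sum()
    nonzero = load[load > 0]
    entropy = -(nonzero * np.log(nonzero)).sum()
    return load, entropy / np.log(num_experts)

tokens = 8
balanced_logits = np.eye(4)[np.arange(tokens) % 4]            # tokens spread evenly
collapsed_logits = np.tile([5.0, 0.0, 0.0, 0.0], (tokens, 1))  # every token picks expert 0

balanced_load, balanced_entropy = expert_utilization(balanced_logits)
collapsed_load, collapsed_entropy = expert_utilization(collapsed_logits)
print(balanced_load, balanced_entropy)
print(collapsed_load, collapsed_entropy)
```

Tracking the normalized entropy across checkpoints is one way to see when routing began to collapse. Experts with zero load point to reduced effective capacity.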
12:28 – 15:02
Four buckets of evidence: training/scaling, representations, data, and multi-level evals
Kian consolidates the brainstorm into four diagnostic categories used in practice. He emphasizes that issues can stem from training dynamics, internal representations (attention/embeddings), data distribution/contamination, or from evaluation happening at different system levels (model vs agentic workflow).
- •Training & scaling signals: loss, gradients, LR, scaling laws, MoE routing
- •Representation/internal signals: attention heads, embeddings, neuron behaviors (hard at scale)
- •Data & distribution: mismatched train/test distributions, contaminated benchmarks
- •System-level eval: separate model capability from agentic/tool-use performance
15:02 – 19:55
Zoo CNN case study: building trust with input–output explanations
A zoo wants to trust an animal classifier but fears black-box decisions. The class discusses how to communicate CNN behavior, from softmax outputs to layerwise feature extraction and systematic evidence that the model “looks at” the right regions.
- •Explain softmax probabilities and how CNN layers build hierarchical features
- •Use examples and dataset evidence to build stakeholder intuition
- •Need more than narrative: show systematic localization of evidence in images
19:55 – 24:52
Saliency maps and integrated gradients: pixel attribution for CNN decisions
Kian introduces saliency maps by differentiating the pre-softmax class score with respect to the input image to highlight influential pixels. He explains why pre-softmax is preferred over post-softmax, then briefly motivates integrated gradients as a more stable, path-based extension.
- •Compute ∂(class score)/∂(input pixels) to highlight influential regions
- •Use pre-softmax logits to avoid confounding effects from other classes
- •Integrated gradients: integrate gradients along a baseline-to-input path for better interpretability
- •Medical example: attributions align with lesion regions in retinal imagery
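Both attribution methods can be sketched on a toy model. The example below is an illustration under stated assumptions: a small tanh network with random weights stands in for a CNN, its gradient is written out by hand instead of using autodiff, and the all-zeros baseline and 200 integration steps are arbitrary choices. The names `logit`, `grad_logit`, `saliency`, and `integrated_gradients` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.3, size=(8, 16))  # toy weights: hidden x flattened 4x4 input
b1 = rng.normal(scale=0.1, size=8)
W2 = rng.normal(scale=0.3, size=(3, 8))   # toy weights: classes x hidden

def logit(x, c):
    """Pre-softmax score of class c."""
    return W2[c] @ np.tanh(W1 @ x + b1)

def grad_logit(x, c):
    """Analytic gradient of the class-c logit with respect to the input."""
    h = np.tanh(W1 @ x + b1)
    return W1.T @ (W2[c] * (1.0 - h ** 2))

def saliency(x, c):
    """Saliency map: magnitude of d(logit_c)/d(input)."""
    return np.abs(grad_logit(x, c))

def integrated_gradients(x, c, baseline=None, steps=200):
    """Average the gradient along the straight path baseline -> x
    (midpoint Riemann sum), then scale by (x - baseline)."""
    baseline = np.zeros_like(x) if baseline is None else baseline
    alphas = (np.arange(steps) + 0.5) / steps
    grads = [grad_logit(baseline + a * (x - baseline), c) for a in alphas]
    return (x - baseline) * np.mean(grads, axis=0)

x = rng.uniform(size=16)  # toy "image", flattened
c = 1
sal = saliency(x, c).reshape(4, 4)
ig = integrated_gradients(x, c)

# Completeness: attributions should sum to logit(x) - logit(baseline).
gap = ig.sum() - (logit(x, c) - logit(np.zeros(16), c))
print(sal.round(3))
print(gap)
```

The completeness check at the end is a property that plain saliency maps do not have. It ties the attributions to the actual change in the class score between the baseline and the input.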
24:52 – 28:27
Occlusion sensitivity: masking patches to test what regions matter
A more intuitive but compute-heavy method is introduced: slide a masking square across the image and track how the target class score changes. Examples show how occluding breed-specific facial regions harms fine-grained classification, and how masking irrelevant objects can even increase confidence.
- •Mask (zero out) a patch and re-run the model across many positions
- •Plot class score changes as a heatmap over mask locations
- •Interpret drops as evidence the model relies on those regions
- •Compute cost is high due to many forward passes
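The masking procedure above can be sketched in a few lines. The example assumes a single-channel image and a stand-in `toy_score` function that only looks at the center of the image. In practice `score_fn` would be a forward pass returning the target class score. The patch size, stride, and fill value are illustrative.

```python
import numpy as np

def occlusion_map(image, score_fn, patch=2, stride=2, fill=0.0):
    """Slide a `patch` x `patch` mask over the image and record how much
    the score drops at each position (one forward pass per position)."""
    H, W = image.shape
    rows = range(0, H - patch + 1, stride)
    cols = range(0, W - patch + 1, stride)
    base = score_fn(image)
    heat = np.zeros((len(rows), len(cols)))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            occluded = image.copy()
            occluded[r:r + patch, c:c + patch] = fill
            heat[i, j] = base - score_fn(occluded)
    return heat

def toy_score(img):
    """Stand-in 'model' that relies only on the central 2x2 region."""
    return img[2:4, 2:4].sum()

image = np.ones((6, 6))
heat = occlusion_map(image, toy_score)
print(heat)  # only the center cell shows a drop
```

The nested loop makes the cost explicit: the number of forward passes grows with the number of mask positions, which is why the method is compute-heavy on real images.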
28:27 – 36:57
Real-time localization via CAM/Grad-CAM: fixing the interpretability bottleneck
Kian identifies fully connected layers as a key interpretability weakness because they mix spatial information. Replacing heavy FC stacks with global average pooling plus a final linear layer enables Class Activation Maps (CAM), producing heatmaps from weighted feature maps; Grad-CAM is noted as an enhancement.
- •Fully connected layers destroy spatial locality, making “where it looked” hard to recover
- •Use global average pooling to preserve feature-map structure while producing class scores
- •CAM: weight the last convolutional layer's feature maps by the final linear layer's class weights and sum them to get a class heatmap
- •Grad-CAM improves CAM-style localization without strict architectural constraints
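The CAM computation can be sketched as a weighted sum of feature maps. The example assumes the architecture described above (last convolutional feature maps, global average pooling, one linear layer without bias). The shapes and random values are made up for illustration.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """feature_maps: (K, H, W) from the last conv layer.
    fc_weights: (num_classes, K) from the final linear layer.
    Returns an (H, W) heatmap for the chosen class."""
    return np.tensordot(fc_weights[class_idx], feature_maps, axes=1)

rng = np.random.default_rng(0)
feature_maps = rng.uniform(size=(4, 7, 7))  # toy activations
fc_weights = rng.normal(size=(3, 4))        # toy classifier weights

pooled = feature_maps.mean(axis=(1, 2))     # global average pooling
logits = fc_weights @ pooled                # class scores
cam = class_activation_map(feature_maps, fc_weights, class_idx=2)

# With no bias, the spatial mean of the CAM equals the class logit.
print(cam.mean(), logits[2])
```

The final line shows why the heatmap is faithful in this architecture: the class score is exactly the average of the map, so each location's value is its contribution to the score.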
36:57 – 42:56
Querying what the model “thinks”: class model visualization via gradient ascent
To probe conceptual understanding, Kian shows how to synthesize an input image that maximizes a class’s pre-softmax score (with regularization for natural-looking pixels). Examples (Dalmatian, goose, flamingo) reveal dataset biases and what visual cues the model has internalized.
- •Optimize input pixels to maximize a chosen class logit (not softmax)
- •Regularize to keep images within plausible pixel ranges
- •Generated prototypes can expose dataset bias (e.g., ‘goose’ → many geese)
- •Method can target classes or intermediate neurons/activations
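Gradient ascent on the input can be sketched with the same kind of toy model. This is a hedged illustration, not the lecture's code: a small random tanh network replaces the CNN, an L2 penalty stands in for the regularizer, and the step size, step count, and penalty weight are arbitrary. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.3, size=(8, 16))  # toy weights: hidden x flattened 4x4 input
W2 = rng.normal(scale=0.3, size=(3, 8))   # toy weights: classes x hidden

def logit(x, c):
    """Pre-softmax score of class c."""
    return W2[c] @ np.tanh(W1 @ x)

def grad_logit(x, c):
    """Analytic gradient of the class-c logit with respect to the input."""
    h = np.tanh(W1 @ x)
    return W1.T @ (W2[c] * (1.0 - h ** 2))

def objective(x, c, l2):
    """Class logit minus an L2 penalty that keeps pixel values small."""
    return logit(x, c) - l2 * np.sum(x ** 2)

def visualize_class(c, steps=200, lr=0.05, l2=0.01):
    """Start from a blank input and ascend the regularized class score."""
    x = np.zeros(16)
    for _ in range(steps):
        x += lr * (grad_logit(x, c) - 2.0 * l2 * x)
    return x

c = 2
x_star = visualize_class(c)
prototype = x_star.reshape(4, 4)
print(objective(np.zeros(16), c, 0.01), objective(x_star, c, 0.01))
```

Only the input is updated; the weights stay fixed. Swapping `logit` for an intermediate activation gives the neuron-level variant mentioned in the last bullet.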
42:56 – 48:45
Dataset search for interpretability: top activating examples per filter and receptive fields
Kian presents a simple, widely used approach: for a chosen feature map, find validation images that maximize its activation to infer what the filter detects. He explains why the shown patches are cropped—each activation corresponds to a receptive field region in the original image, which grows with depth.
- •Pick a feature map and retrieve top-k dataset examples that maximize it
- •Interpret filters via consistent patterns (shirts, edges, shapes) in top activations
- •Cropping reflects receptive fields: deeper activations ‘see’ larger input regions
- •Provides empirical, human-readable evidence of learned features
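Both ideas above reduce to short computations. The sketch below ranks images by a precomputed activation for one feature map and computes how the receptive field grows through a stack of layers using the standard recurrence (size grows by (kernel - 1) times the accumulated stride). The activation values and the layer stack are made up, and the function names are hypothetical.

```python
def top_k_examples(activations, k=3):
    """activations[i] is the maximum activation of one chosen feature map
    on image i. Returns the indices of the k most activating images."""
    order = sorted(range(len(activations)),
                   key=lambda i: activations[i], reverse=True)
    return order[:k]

def receptive_field(layers):
    """layers: list of (kernel_size, stride) for conv/pool layers in order.
    Returns the receptive field size (in input pixels) after each layer."""
    rf, jump, sizes = 1, 1, []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each extra tap covers `jump` input pixels
        jump *= stride             # striding spreads later taps further apart
        sizes.append(rf)
    return sizes

acts = [0.1, 2.3, 0.7, 1.9, 0.0]  # made-up per-image activations
top = top_k_examples(acts)
print(top)  # [1, 3, 2]

# 3x3 conv, 3x3 conv, 2x2 max pool (stride 2), 3x3 conv
sizes = receptive_field([(3, 1), (3, 1), (2, 2), (3, 1)])
print(sizes)  # [3, 5, 6, 10]
```

The growing sizes explain the crops: a patch shown for a deep activation is the input region that activation can see, so deeper layers get larger crops.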
48:45 – 1:05:27
Reverse engineering CNN activations with deconvolution (transposed convolution) and unpooling
The lecture builds a reconstruction pipeline to trace a strong activation back to input pixels. Kian derives transposed convolution by viewing convolution as matrix multiplication and (approximately) reversing it via transpose, then explains practical implementation tricks, plus how to invert max pooling with stored ‘switches.’
- •Convolution can be written as a matrix–vector product; ‘deconv’ uses transpose under assumptions
- •Implementation intuition: flip filters, insert zeros between input values (the sub-pixel view), adjust stride to upsample
- •Max pooling is not invertible; store argmax ‘switches’ during forward pass to unpool
- •Reconstruct the input region that caused a specific feature-map activation
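The two building blocks above can be sketched in one dimension. This is a minimal illustration assuming a 1D signal, a stride-1 "valid" convolution, and non-overlapping pooling windows. The helper names are hypothetical, and the 2D case used in the lecture follows the same pattern.

```python
import numpy as np

def conv_matrix(kernel, n_in, stride=1):
    """Build the matrix C so that C @ x is a 'valid' sliding dot product."""
    k = len(kernel)
    n_out = (n_in - k) // stride + 1
    C = np.zeros((n_out, n_in))
    for i in range(n_out):
        C[i, i * stride:i * stride + k] = kernel
    return C

def max_pool_1d(x, size=2):
    """Max pool over non-overlapping windows, storing argmax 'switches'."""
    windows = x.reshape(-1, size)
    return windows.max(axis=1), windows.argmax(axis=1)

def max_unpool_1d(pooled, switches, size=2):
    """Place each pooled value back at its recorded position, zeros elsewhere."""
    out = np.zeros((len(pooled), size))
    out[np.arange(len(pooled)), switches] = pooled
    return out.reshape(-1)

kernel = np.array([1.0, 2.0, 3.0])
x = np.arange(6, dtype=float)
C = conv_matrix(kernel, n_in=6)  # shape (4, 6)
y = C @ x                        # forward convolution: 6 values -> 4
x_up = C.T @ y                   # transposed convolution: 4 values -> 6

a = np.array([1.0, 5.0, 2.0, 0.0, 7.0, 3.0])
pooled, switches = max_pool_1d(a)
restored = max_unpool_1d(pooled, switches)
print(y, x_up)
print(pooled, switches, restored)
```

Multiplying by the transpose maps back to the input size but does not recover `x`; it is an approximate reversal, as the summary notes. For stride 1, `C.T @ y` equals a full convolution of `y` with the flipped kernel, which is where the "flip filters" intuition comes from.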
1:05:27 – 1:12:57
Putting CNN interpretability together: Zeiler/Fergus and Yosinski visualization toolbox
Kian shows how classic results validate hierarchical feature learning: early layers detect edges, later layers capture complex parts and concepts. A short video demo illustrates optimization-based neuron visualization, dataset search, and deconvolution-based pixel attribution as a cohesive toolkit.
- •Layer 1: filters and top patches are directly interpretable (edge/color detectors)
- •Deeper layers: deconv reconstructions show increasingly abstract, compositional features
- •Toolbox unifies multiple methods: optimized images, activation browsing, dataset search, deconv views
- •CNNs can be inspected at input-output, neuron, and feature-map levels
1:12:57 – 1:18:33
From CNNs to transformers: attention patterns and embedding-space sanity checks
Kian contrasts CNN locality with transformer relational structure. He notes that attention maps and embedding visualizations (e.g., t-SNE) are the most accessible interpretability hooks for LLMs, while deeper mechanistic interpretability remains challenging beyond small models.
- •CNNs: localized textures/shapes; Transformers: relationships/meaning across tokens via attention
- •Visualize attention heads to inspect token-to-token dependencies
- •Visualize embeddings with dimensionality reduction to check semantic neighborhoods
- •Mechanistic interpretability remains limited at scale; the transformer-circuits line of work (e.g., induction heads) is a leading approach
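The attention maps mentioned above are the softmax-normalized query-key scores of a single head. The sketch below computes them for random toy queries and keys, assuming scaled dot-product attention with a causal mask. `attention_weights` is a hypothetical helper, not a library call.

```python
import numpy as np

def attention_weights(Q, K, causal=True):
    """Q, K: (seq_len, d_head). Returns a (seq_len, seq_len) matrix where
    row i holds how much token i attends to each token j."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    if causal:
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)  # block attention to later tokens
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))  # toy queries for 5 tokens
K = rng.normal(size=(5, 8))  # toy keys for 5 tokens
A = attention_weights(Q, K)
print(A.round(2))
```

Plotting `A` as a heatmap with tokens on both axes gives the usual attention visualization: each row sums to 1, and with the causal mask the upper triangle is zero.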
1:18:33 – 1:46:53
Frontier model diagnostics: telemetry, scaling laws, benchmarks, safety, and data health
The lecture closes with practical monitoring used by frontier labs: training curves, gradient norms, learning-rate schedules, hardware utilization, and scaling-law alignment (e.g., Chinchilla vs GPT-3 compute/data balance). Kian also covers capability/safety benchmarking, contamination detection, and dataset distribution diagnostics, ending with Q&A on domain data, synthetic data, and AI-generated-data feedback loops.
- •Training telemetry: loss curves (global/domain), gradients, LR schedules, hardware efficiency
- •Scaling laws guide whether to invest in more compute, more data, or bigger models (Chinchilla insight)
- •Capability & safety evals: benchmark tracking, error clustering, agentic workflow evaluation
- •Data diagnostics: domain proportions, token drift, contamination checks via n-grams/hashes/embeddings; MoE routing/load-balancing signals
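Two of the checks above lend themselves to small sketches: the compute/data balance behind the Chinchilla comparison, and an n-gram contamination check. Several things here are assumptions and not taken from the lecture summary: the commonly cited approximation of about 6 x parameters x tokens training FLOPs, the rough rule of thumb of about 20 tokens per parameter, the publicly reported GPT-3 figures (175B parameters, about 300B tokens), and the whitespace tokenization and trigram size. The helper names and example strings are hypothetical.

```python
def tokens_per_param(n_params, n_tokens):
    """Data-to-model ratio; about 20 is the often-quoted compute-optimal value."""
    return n_tokens / n_params

def train_flops(n_params, n_tokens):
    """Common approximation: about 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

def ngrams(text, n=3):
    """Set of word n-grams, using naive lowercase whitespace tokenization."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(benchmark_item, training_doc, n=3):
    """Fraction of the benchmark item's n-grams that also occur in the document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_doc, n)) / len(item_grams)

ratio_gpt3 = tokens_per_param(175e9, 300e9)  # roughly 1.7 tokens per parameter
flops_gpt3 = train_flops(175e9, 300e9)       # roughly 3.15e23 FLOPs

item = "what is the capital of france"
leaked_doc = "quiz: what is the capital of france ? paris"
clean_doc = "the weather in paris is mild"
print(ratio_gpt3, flops_gpt3)
print(contamination_rate(item, leaked_doc), contamination_rate(item, clean_doc))
```

A ratio far below the rule-of-thumb value suggests the model is undertrained for its size, which is the Chinchilla insight referenced above. For contamination, exact n-gram overlap only catches verbatim leaks. Hash and embedding comparisons, also listed above, are meant to cover near-duplicates and paraphrases.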
