Stanford Online
Stanford CS230 | Autumn 2025 | Lecture 10: What’s Going On Inside My Model?
At a glance
WHAT IT’S REALLY ABOUT
Interpreting CNNs and diagnosing frontier model training and behavior issues
- The lecture opens with a frontier-lab case study and organizes debugging evidence into four buckets: training/scaling telemetry, internal representations, data/distribution issues, and capability/safety evaluation results.
- For CNNs, it presents multiple ways to connect inputs to outputs—saliency maps, integrated gradients, and occlusion sensitivity—to verify whether predictions rely on the right image regions.
- It shows how an architectural tweak enables built-in localization through Class Activation Maps (CAM), replacing deep fully connected stacks with global average pooling plus a final linear layer, and points to Grad-CAM as a gradient-based refinement.
- It covers methods to probe what the network has learned, including activation maximization (for a class or an individual neuron) and dataset search for top-activating examples, then extends to deconvolutional “reverse engineering” with unpooling switches.
- For frontier transformers, the lecture contrasts CNN locality with attention/embedding-based meaning, notes current interpretability limits, and emphasizes modern diagnostics: scaling laws, benchmark contamination checks, safety evals, and data distribution/token drift monitoring.
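One of the diagnostics listed above, token drift monitoring, can be illustrated with a small sketch. This is not the lecture's tooling: it assumes whitespace tokenization and uses Jensen-Shannon divergence (base 2, so values fall between 0 and 1) as one possible drift score. The names `token_distribution` and `js_divergence` are hypothetical helpers introduced here for illustration.

```python
import math
from collections import Counter


def token_distribution(tokens, vocab):
    """Relative frequency of each vocab entry in a token list."""
    counts = Counter(tokens)
    total = sum(counts[t] for t in vocab)
    return [counts[t] / total for t in vocab]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 = identical, 1 = disjoint support."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the KL sum.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Assumption: whitespace splitting stands in for a real tokenizer.
reference = "the cat sat on the mat".split()
new_batch = "the dog sat on the rug".split()
disjoint = "alpha beta gamma".split()

vocab = sorted(set(reference) | set(new_batch) | set(disjoint))
p_ref = token_distribution(reference, vocab)

drift_same = js_divergence(p_ref, p_ref)
drift_new = js_divergence(p_ref, token_distribution(new_batch, vocab))
drift_disjoint = js_divergence(p_ref, token_distribution(disjoint, vocab))
```

In practice the score would be tracked over time per data source, and the threshold for raising an alert would have to be chosen empirically.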
IDEAS WORTH REMEMBERING
5 ideas
Start frontier-model investigations with structured evidence, not ad-hoc guesses.
The lecture recommends gathering signals across training/scaling telemetry (loss, gradients, LR), internal representation probes (attention/embeddings), data/distribution checks, and eval/agentic workflow regressions to quickly narrow root causes.
Use pre-softmax logits for attribution-style interpretability in classifiers.
For saliency and activation maximization, post-softmax probabilities entangle all classes; pre-softmax class scores isolate the class of interest and avoid misleading attributions caused by changes in competing classes.
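The entanglement can be seen in a toy linear classifier, sketched here under the assumption that the class scores are z = Wx. The gradient of logit z_c with respect to the input is just row c of W, while the gradient of the softmax probability p_c is p_c (W_c − Σ_k p_k W_k), which mixes in every other class. The weights and input are made up for illustration.

```python
import math


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def logits(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def logit_saliency(W, c):
    # d z_c / d x_i = W[c][i]: depends only on the class of interest.
    return list(W[c])


def prob_saliency(W, x, c):
    # d p_c / d x_i = p_c * (W[c][i] - sum_k p_k * W[k][i]): mixes all classes.
    p = softmax(logits(W, x))
    n = len(x)
    mix = [sum(p[k] * W[k][i] for k in range(len(W))) for i in range(n)]
    return [p[c] * (W[c][i] - mix[i]) for i in range(n)]


# Toy setup: class 0 reads only pixel 0, class 1 reads only pixel 1.
W = [[1.0, 0.0, 0.0],
     [0.0, 2.0, 0.0],
     [0.0, 0.0, 0.0]]
x = [0.5, 0.5, 0.5]

sal_logit = logit_saliency(W, 0)   # attribution only on pixel 0
sal_prob = prob_saliency(W, x, 0)  # pixel 1 also gets (negative) attribution
```

Pixel 1 plays no role in the class-0 score, yet it receives a nonzero attribution under the post-softmax gradient because it drives the competing class 1.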
Occlusion sensitivity provides an intuitive, model-agnostic “where is it looking?” test.
By sliding a masking patch and tracking the target-class score change, you obtain a heatmap of regions critical to the prediction, at the cost of one forward pass per patch position, which makes the method computationally expensive.
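A minimal sketch of the sliding-patch procedure follows. It assumes a grayscale image stored as a list of rows and a hypothetical `score_fn` that returns the target-class score; in a real setting `score_fn` would wrap a model forward pass. The toy scorer below simply sums the top-left 2x2 block so the expected heatmap is easy to check by hand.

```python
def occlusion_map(image, score_fn, patch=2, stride=1, fill=0.0):
    """Average score drop observed whenever each pixel is covered by the patch."""
    h, w = len(image), len(image[0])
    base = score_fn(image)  # one forward pass on the clean image
    heat = [[0.0] * w for _ in range(h)]
    counts = [[0] * w for _ in range(h)]
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            occluded = [row[:] for row in image]
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    occluded[r][c] = fill
            drop = base - score_fn(occluded)  # one forward pass per position
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    heat[r][c] += drop
                    counts[r][c] += 1
    return [[heat[r][c] / counts[r][c] if counts[r][c] else 0.0
             for c in range(w)] for r in range(h)]


def toy_score(img):
    # Stand-in for a model: the "class score" depends only on the top-left 2x2.
    return img[0][0] + img[0][1] + img[1][0] + img[1][1]


image = [[1.0] * 4 for _ in range(4)]
heat = occlusion_map(image, toy_score)
```

The heatmap peaks in the top-left corner and is zero in the bottom-right, matching the region the toy scorer actually uses. The fill value, patch size, and stride are free choices that affect both cost and resolution.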
CAM makes localization easier by preserving spatial structure until the end.
Replacing multiple fully connected layers with global average pooling + a final linear layer enables a class activation map formed by a weighted sum of last-layer feature maps, giving a real-time, interpretable localization signal (often improved by Grad-CAM).
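The weighted sum can be sketched directly. The example assumes two last-layer feature maps of size 2x2 and made-up class weights from the final linear layer. It also checks a useful identity: with global average pooling and no bias term, the class score equals the spatial mean of the class activation map.

```python
def global_average_pool(fmap):
    h, w = len(fmap), len(fmap[0])
    return sum(sum(row) for row in fmap) / (h * w)


def class_activation_map(feature_maps, class_weights):
    """CAM[r][c] = sum_k w_k * f_k[r][c] over the last conv layer's channels."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sum(wk * fm[r][c] for wk, fm in zip(class_weights, feature_maps))
             for c in range(w)] for r in range(h)]


# Illustrative values only: two channels, one class.
feature_maps = [
    [[0.0, 0.0], [0.0, 4.0]],  # channel 0 fires in the bottom-right
    [[2.0, 0.0], [0.0, 0.0]],  # channel 1 fires in the top-left
]
class_weights = [1.0, -0.5]    # final linear layer weights for the class

cam = class_activation_map(feature_maps, class_weights)

# Class score via the usual path: GAP per channel, then the linear layer.
score = sum(wk * global_average_pool(fm)
            for wk, fm in zip(class_weights, feature_maps))
cam_mean = global_average_pool(cam)
```

Because the map reuses weights the network already has, it costs little beyond the forward pass itself. In practice the low-resolution map is upsampled to the input size before being overlaid on the image.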
Activation maximization reveals what a class or neuron ‘wants to see,’ but needs regularization.
Gradient ascent on pixels can generate synthetic prototypes (e.g., a Dalmatian rendered as black dots on white), and regularization keeps images in natural pixel ranges so the visualization stays interpretable rather than dissolving into noisy artifacts.
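A toy version of the procedure, assuming a linear class score w·x so the gradient can be written by hand. The regularizer here is an L2 penalty plus clipping to the [0, 1] pixel range; both are illustrative choices, not necessarily the ones used in the lecture.

```python
def activation_maximization(w, steps=200, lr=0.1, l2=0.5, lo=0.0, hi=1.0):
    """Gradient ascent on x for the objective  w.x - l2 * ||x||^2."""
    x = [0.5] * len(w)  # start from a mid-gray "image"
    for _ in range(steps):
        # Gradient of the regularized objective with respect to each pixel.
        grad = [wi - 2.0 * l2 * xi for wi, xi in zip(w, x)]
        # Ascent step, then clip back into the valid pixel range.
        x = [min(hi, max(lo, xi + lr * gi)) for xi, gi in zip(x, grad)]
    return x


# Made-up class weights: pixel 0 helps the class, pixel 1 hurts it.
w = [0.8, -0.3, 0.2]
prototype = activation_maximization(w)
```

Without the penalty, the score of a linear model grows without bound as pixels are pushed up; with it, each pixel settles near w_i / (2 · l2), and clipping pins the pixel that hurts the class at 0. For a real network the hand-written gradient would be replaced by backpropagation to the input.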
WORDS WORTH SAVING
5 quotes
Your VP is wondering what's happening, and they ask, "What is going on?"
— Kian Katanforoosh
If you do the saliency maps and you realize that the pixels that are bright when you compute that gradient are all over the place, it's probably that the model is not even looking at the right place. It's just getting lucky.
— Kian Katanforoosh
Unfortunately, the modern transformers are so complicated that even the cutting-edge research is only able to interpret those relationships with two-layer transformers, pretty much.
— Kian Katanforoosh
The general consensus, I mean, my opinion is I, I actually don't look too much at the benchmarks when a foundation model provider publishes them.
— Kian Katanforoosh
Frontier labs rarely publish, uh, those dashboards because it's IP and because it can, uh, leak certain deep information about their IP and how their models are trained.
— Kian Katanforoosh
High quality AI-generated summary created from speaker-labeled transcript.