Stanford Online
Stanford CS230 | Autumn 2025 | Lecture 2: Supervised, Self-Supervised, & Weakly Supervised Learning
At a glance
WHAT IT’S REALLY ABOUT
Practical deep learning decisions: supervision types, embeddings, and project tradeoffs
- The instructor frames deep learning as an engineering decision process—choosing data, labels, capacity, resolution, architecture, and especially loss functions to match real-world constraints.
- Through a day-vs-night classifier, the lecture shows how task definition and edge cases drive dataset diversity, model size, and input resolution tradeoffs, often validated with quick human-proxy experiments.
- A trigger-word detection case study highlights how labeling strategy (timestamped sequence labels vs one global label) can dramatically reduce data needs, and how synthetic-data pipelines can scale training data rapidly.
- Face verification is used to introduce embeddings and metric learning, where a triplet-loss objective trains an encoder so same-identity images cluster and different identities separate, enabling verification, identification (kNN), and clustering (k-means).
- Self-supervised and weakly supervised learning are presented as scalable alternatives to manual labeling: contrastive learning via augmentations (SimCLR), next-token prediction for language models, and multimodal pairing (e.g., ImageBind) to build shared embedding spaces across modalities.
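The triplet-loss objective mentioned in the face verification bullet can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: the squared Euclidean distance, the `margin` value, the `threshold` value, and the toy 2-D embeddings are all assumptions chosen for readability.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is at least `margin` farther from the anchor
    than the positive is; positive otherwise. `margin` is an assumed value."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(gap + margin, 0.0)

def same_identity(emb_a, emb_b, threshold=0.5):
    """Verification: accept if the embeddings are close. `threshold` is assumed."""
    return squared_distance(emb_a, emb_b) < threshold

# Toy 2-D embeddings, made up for illustration.
anchor = [0.0, 1.0]      # reference image of a person
positive = [0.1, 0.9]    # another image of the same person
negative = [1.0, 0.0]    # image of a different person

well_separated = triplet_loss(anchor, positive, negative)
roles_swapped = triplet_loss(anchor, negative, positive)
```

With these toy vectors the first triplet already satisfies the margin, so its loss is zero, while swapping the positive and negative roles yields a large loss that would push the encoder to rearrange the embeddings. Identification and clustering reuse the same distance function over a gallery of stored embeddings.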
IDEAS WORTH REMEMBERING
5 ideas
Define the task boundaries before collecting data.
“Day vs night” becomes easy or extremely hard depending on whether you constrain location, indoor/outdoor, dawn/dusk definitions, weather, and geography; these choices determine needed dataset diversity and model capacity.
Use humans as fast proxies to validate modeling assumptions.
Previewing or printing images at several resolutions and checking when humans can still solve the task gives a quick estimate of the lowest resolution that still carries enough signal, which saves expensive training iterations.
Input resolution is an engineering knob that directly affects cost and iteration speed.
Too low loses critical cues (e.g., a clock or fine facial details), while too high increases compute, memory, and training time—slowing the iterate-test-improve loop that drives successful projects.
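As a rough sketch of why resolution affects cost, the number of input values per image grows with the square of the side length. The side lengths and the three-channel assumption below are hypothetical; actual compute also depends on the architecture, although in convolutional layers it tends to grow roughly in proportion to the pixel count.

```python
def input_values(side, channels=3):
    """Number of values in one square image of the given side length."""
    return side * side * channels

# Hypothetical candidate resolutions, compared against the smallest one.
base = input_values(64)
for side in (64, 128, 256, 512):
    n = input_values(side)
    print(f"{side}x{side}: {n} values, {n // base}x the 64x64 input")
```

Doubling the side length quadruples the input size, so each step up in resolution should be justified by cues that are not visible at the lower one.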
Labeling strategy can be more important than model choice.
For trigger-word detection, labeling only a single 0/1 for a 10-second clip creates a cold-start problem and extreme imbalance; labeling the time window of the keyword (sequence of 1s) makes the learning signal far denser and training much faster.
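The difference between the two labeling strategies can be illustrated as follows. This is a sketch under stated assumptions: the number of output steps per clip (`NUM_STEPS`) and the keyword window `(420, 470)` are hypothetical values, and the function names are invented for this example.

```python
def clip_label(keyword_window):
    """One global label per clip: 1 if the keyword occurs anywhere, else 0."""
    return 1 if keyword_window is not None else 0

def sequence_labels(num_steps, keyword_window=None):
    """Per-timestep labels: 1 inside the keyword's time window, 0 elsewhere."""
    labels = [0] * num_steps
    if keyword_window is not None:
        start, end = keyword_window
        for t in range(max(start, 0), min(end, num_steps)):
            labels[t] = 1
    return labels

NUM_STEPS = 1000                 # assumed: 10-second clip, 100 steps per second
window = (420, 470)              # assumed: keyword spoken from 4.2 s to 4.7 s

global_label = clip_label(window)              # a single bit for the whole clip
dense_labels = sequence_labels(NUM_STEPS, window)
positive_steps = sum(dense_labels)
```

The global label tells the model only that the keyword occurred somewhere in the clip, while the sequence labels also say where, giving one supervised target per time step.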
Synthetic data pipelines can turn small recorded datasets into millions of training examples.
By separately collecting positive words, negative words, and abundant background noise, then programmatically inserting them into 10-second clips, the script “knows” insertion timestamps and can auto-label at scale (plus augment speed/pitch).
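A minimal sketch of such a pipeline follows, assuming audio is represented as plain lists of samples and that labels are kept at sample resolution for simplicity. The function names, clip lengths, and amplitudes are hypothetical, and the speed and pitch augmentation is omitted.

```python
import random

def overlaps(start, end, taken):
    """True if [start, end) intersects any already-used interval."""
    return any(start < t_end and end > t_start for t_start, t_end in taken)

def synthesize_clip(background, positives, negatives, rng, max_tries=100):
    """Overlay word clips on a background and return (audio, labels).

    Because the script chooses each insertion point, it can label the
    positive word's window automatically."""
    audio = list(background)
    labels = [0] * len(audio)
    taken = []
    words = [(w, True) for w in positives] + [(w, False) for w in negatives]
    for word, is_positive in words:
        for _ in range(max_tries):
            start = rng.randrange(0, len(audio) - len(word) + 1)
            end = start + len(word)
            if not overlaps(start, end, taken):
                break
        else:
            continue  # no free slot found; skip this word
        for i, sample in enumerate(word):
            audio[start + i] += sample
        taken.append((start, end))
        if is_positive:
            for t in range(start, end):
                labels[t] = 1
    return audio, labels

# Toy stand-ins for recorded audio (values are made up for illustration).
background = [0.0] * 1000
positive_word = [1.0] * 50
negative_word = [0.5] * 30

audio, labels = synthesize_clip(
    background, [positive_word], [negative_word], random.Random(0)
)
```

Calling `synthesize_clip` repeatedly with different random seeds, words, and backgrounds is what lets a small set of recordings expand into a large labeled training set.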
WORDS WORTH SAVING
5 quotes
The loss function, which is what gives the feedback to the model, um, you were right or you were wrong, and what to do about it, is an art.
— Kian Katanforoosh
The number one mistake that we see in projects is that people add more data, uh, but forget to adjust the labels.
— Kian Katanforoosh
In an AI project, that's why resolution matter a lot.
— Kian Katanforoosh
So if it's easier for you, it's easier for the model, basically.
— Kian Katanforoosh
Emergent behaviors are unexpected capabilities that arise from simple training objectives at scale without being explicitly taught or labeled.
— Kian Katanforoosh
High quality AI-generated summary created from speaker-labeled transcript.