Stanford Online
Stanford CS230 | Autumn 2025 | Lecture 2: Supervised, Self-Supervised, & Weakly Supervised Learning
CHAPTERS
Course framing: industry mindset, interactive decision-making, and what’s ahead
Kian introduces himself, his industry role, and how the in-person lectures will emphasize practical decision-making in real AI projects. He previews the lecture’s main segments—supervised learning case studies, self/weak supervision and embeddings—and briefly flags later-quarter topics like adversarial robustness, RL, and RAG/agents.
Neural network recap: architecture vs. parameters, gradient descent, and what can vary
A quick refresh of the supervised learning loop: inputs flow through an architecture with parameters; loss compares predictions to ground truth; gradient descent updates parameters iteratively. Kian emphasizes how changing inputs/outputs/architectures/losses alters the problem and why loss design is a creative lever in deep learning.
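The loop described above can be sketched on a toy problem. This is a minimal illustration, not the lecture's code: it assumes a one-feature logistic model (the "architecture"), two parameters `w` and `b`, a binary cross-entropy loss, and made-up data in `XS` and `YS`.

```python
import math

# Hypothetical toy dataset: one input feature, binary ground-truth labels.
XS = [-2.0, -1.0, 1.0, 2.0]
YS = [0, 0, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(xs, ys, w, b):
    """Mean binary cross-entropy between predictions and ground truth."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)

def train(xs, ys, lr=0.5, steps=500):
    """Plain batch gradient descent on the two parameters w and b."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)   # forward pass
            grad_w += (p - y) * x    # d(loss)/dw for sigmoid + cross-entropy
            grad_b += (p - y)        # d(loss)/db
        w -= lr * grad_w / len(xs)   # parameter update
        b -= lr * grad_b / len(xs)
    return w, b

w, b = train(XS, YS)
```

Swapping the loss function or the model inside `train` changes the problem being solved while the loop itself stays the same, which is the point the recap makes.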
From binary to multi-class/multi-label: outputs, labels, and common project pitfalls
Using “cat vs. not-cat,” Kian asks how to extend to multiple animals. The class identifies key changes: expanding the output layer, collecting broader data, and—critically—changing labels from scalar to one-hot or multi-hot vectors to match the new task definition.
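The label change can be made concrete with a short sketch. The class list `CLASSES` and both helper names are assumptions chosen for illustration. A one-hot vector fits the case where each image shows exactly one animal, and a multi-hot vector fits the case where several animals can appear together.

```python
# Hypothetical class list; a real project would define its own.
CLASSES = ["cat", "dog", "giraffe"]

def one_hot(label, classes=CLASSES):
    """Multi-class label: exactly one entry is 1."""
    if label not in classes:
        raise ValueError(f"unknown class: {label}")
    return [1 if c == label else 0 for c in classes]

def multi_hot(labels, classes=CLASSES):
    """Multi-label target: any number of entries may be 1."""
    present = set(labels)
    return [1 if c in present else 0 for c in classes]

single = one_hot("dog")                 # image containing only a dog
several = multi_hot(["cat", "giraffe"])  # image containing a cat and a giraffe
```

In common practice the one-hot case is paired with a softmax output and the multi-hot case with one sigmoid per class, although the summary above does not say which the lecture uses.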
Depth, capacity, and feature learning: what layers tend to encode
Kian explains capacity and overfitting as a mismatch between model size and dataset diversity. He then builds intuition for representations: early layers learn low-level features such as edges, middle layers learn parts (eyes, noses), and deeper layers learn higher-level concepts. This sets up later discussions of embeddings and interpretability.
Case study 1 — Day vs. night classification: scoping the task and collecting data
The class designs a toy supervised project: classify images as day or night. Kian uses it to highlight that “simple” problems become hard when you expand scope (indoor scenes, dawn/dusk, geography) and that defining the task boundaries drives data needs and model complexity.
Day/night engineering decisions: resolution trade-offs, human-as-proxy, and training choices
Kian focuses on practical choices: choosing an image resolution that balances information against compute, and validating that choice with quick human experiments. He ties these to iteration speed in real projects and outlines a typical setup (CNN, sigmoid, binary cross-entropy).
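One way to run the resolution experiment is to shrink images step by step and check whether a person can still tell day from night. The sketch below assumes a grayscale image stored as a list of rows. The function name `downsample` and the block-averaging method are illustrative choices, not something the lecture specifies.

```python
def downsample(image, factor):
    """Average non-overlapping factor x factor blocks of a 2D grayscale image."""
    height, width = len(image), len(image[0])
    if height % factor or width % factor:
        raise ValueError("image sides must be divisible by factor")
    out = []
    for r in range(0, height, factor):
        row = []
        for c in range(0, width, factor):
            block = [image[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

# A 4x4 toy image reduced to 2x2: the model then sees 4 inputs instead of 16.
toy = [[0, 0, 4, 4],
       [0, 0, 4, 4],
       [8, 8, 12, 12],
       [8, 8, 12, 12]]
small = downsample(toy, 2)
```

Each halving of the side length cuts the number of input values by four, which is where the compute saving comes from.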
Case study 2 — Trigger word detection: why real systems use cascades of models
Kian frames trigger word detection as part of an energy-efficient cascade (activity detection → keyword spotting → heavier intent model). The class discusses what data is needed to detect a word (“activate”) in 10-second clips and why distributional coverage (accents, age, noise) is crucial.
Trigger word labeling experiment: why temporal labels beat clip-level labels
A human listening experiment shows that guessing a keyword from only clip-level labels is hard; providing timing/location cues makes the task far easier. Kian translates this into a labeling strategy decision: richer labels can greatly reduce the amount of data required and address cold-start learning issues.
Synthetic data generation for keyword spotting: scalable labeling via scripting and augmentation
Kian explains the practical “hack” used to scale training data: collect small libraries of positive words, negative words, and abundant background noise, then synthetically compose millions of labeled 10-second clips. Because the script inserts words at known timestamps, labels are generated automatically; real test sets still require manual labeling.
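The composition script can be sketched as follows. It makes several simplifying assumptions: audio is a plain list of floats, there is one label per sample rather than per spectrogram frame, and the label is set to 1 for `label_width` steps right after a positive word ends. The function name `make_clip` and all parameters are hypothetical.

```python
import random

def make_clip(background, words, rng, label_width=3, max_tries=100):
    """Overlay words on a background clip and derive labels from the
    known insertion points. `words` is a list of (samples, is_positive)."""
    audio = list(background)
    labels = [0] * len(audio)
    occupied = []  # (start, end) spans already used by inserted words
    for samples, is_positive in words:
        for _ in range(max_tries):
            start = rng.randrange(0, len(audio) - len(samples) + 1)
            end = start + len(samples)
            # Accept the position only if it overlaps no earlier word.
            if all(end <= s or start >= e for s, e in occupied):
                break
        else:
            continue  # no free slot found; skip this word
        occupied.append((start, end))
        for i, value in enumerate(samples):
            audio[start + i] += value
        if is_positive:
            # The script knows where the word ended, so the label is free.
            for t in range(end, min(end + label_width, len(labels))):
                labels[t] = 1
    return audio, labels

rng = random.Random(0)
background = [0.0] * 50   # stand-in for background noise
positive = [1.0] * 5      # stand-in for the trigger word
negative = [-1.0] * 5     # stand-in for some other word
audio, labels = make_clip(background, [(positive, True), (negative, False)], rng)
```

Calling `make_clip` in a loop with different words and backgrounds yields as many labeled clips as needed, with no manual annotation of the training set.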
Case study 3 — Face verification: why pixel comparison fails and embeddings solve it
For student ID verification, Kian contrasts naive pixel-wise comparison with real-world variability (lighting, pose, background, occlusions). The solution is to encode images into vectors using a deep network and compare distances in embedding space with a threshold tuned to the application’s risk tolerance (e.g., dining hall vs. airport).
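The comparison step can be sketched in a few lines. The embeddings here are made-up vectors standing in for the output of a trained encoder, and `verify` and its `threshold` parameter are illustrative names.

```python
import math

def euclidean(a, b):
    """L2 distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def verify(emb_on_file, emb_at_door, threshold):
    """Accept the person if the two embeddings are closer than threshold."""
    return euclidean(emb_on_file, emb_at_door) < threshold

# Hypothetical embeddings at distance 0.5 from each other.
on_file = [0.0, 0.0]
at_door = [0.3, 0.4]

lenient = verify(on_file, at_door, threshold=0.6)  # e.g. a dining hall
strict = verify(on_file, at_door, threshold=0.4)   # e.g. an airport
```

A smaller threshold rejects more genuine matches but admits fewer impostors, which is why the value depends on the application's risk tolerance.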
Training embeddings with triplet loss (FaceNet): designing a loss to shape representation
The class derives how to train the encoder so that same-identity embeddings are close and different identities are far. Kian introduces triplets (anchor, positive, negative) and the triplet loss objective, emphasizing that this is “loss design” as a key deep learning skill and that negatives are a training-time construct.
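A minimal sketch of the triplet loss for a single triplet, written in the form used in the FaceNet paper: squared L2 distances and a margin. The vectors and the margin value of 0.2 are illustrative, and a real system would compute this over batches of encoder outputs.

```python
def sq_dist(a, b):
    """Squared L2 distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(d(a, p) - d(a, n) + margin, 0) with squared L2 distances."""
    return max(sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin, 0.0)

anchor = [0.0, 0.0]
positive = [1.0, 0.0]        # same identity as the anchor
easy_negative = [3.0, 0.0]   # different identity, already far away
hard_negative = [0.5, 0.0]   # different identity, closer than the positive

easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only triplets that violate this contribute a gradient.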
Beyond verification: identification and clustering via nearest neighbors and k-means
Kian shows how the same face embedding model powers multiple products. Identification becomes nearest-neighbor search over stored embeddings, while clustering groups photos by identity using unsupervised methods like k-means—illustrating embeddings as reusable infrastructure for downstream tasks.
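Both downstream uses can be sketched on top of precomputed embeddings. Everything here is an assumption made for illustration: the `gallery` dictionary, the rejection threshold, and a deliberately naive k-means that initializes centroids from the first `k` points.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def identify(query, gallery, threshold):
    """Nearest-neighbor search over stored embeddings.
    Returns the closest name, or None if nobody is within threshold."""
    name = min(gallery, key=lambda n: sq_dist(query, gallery[n]))
    return name if sq_dist(query, gallery[name]) < threshold ** 2 else None

def kmeans(points, k, steps=10):
    """Naive k-means; returns one cluster index per point."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    for _ in range(steps):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(c) / len(members) for c in zip(*members)]
    return [min(range(k), key=lambda i: sq_dist(p, centroids[i])) for p in points]

gallery = {"alice": [0.0, 0.0], "bob": [1.0, 1.0]}  # hypothetical identities
who = identify([0.1, 0.0], gallery, threshold=0.5)

photos = [[0.0, 0.0], [5.0, 5.0], [0.1, 0.0], [5.0, 5.1]]
groups = kmeans(photos, k=2)
```

Neither function retrains the encoder. Both only consume its output vectors, which is what makes the embedding model reusable across products.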
Self-supervised learning: contrastive pairs from augmentations (SimCLR)
Kian motivates self-supervision as a response to expensive labeling and explains contrastive learning for images. By treating two augmented views of the same image as a positive pair, the model learns invariances and semantic structure without manual labels—enabling training on massive unlabeled datasets.
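The contrastive objective can be sketched with the normalized temperature-scaled cross-entropy (NT-Xent) loss from the SimCLR paper. The sketch assumes the encoder has already produced non-zero embeddings for two augmented views of each image, with `views_a[i]` and `views_b[i]` coming from the same image. The temperature of 0.5 is an arbitrary illustrative value.

```python
import math

def cosine(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nt_xent(views_a, views_b, temperature=0.5):
    """Mean NT-Xent loss over all 2N embeddings in the batch."""
    z = views_a + views_b
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the other view of the same image
        positive = cosine(z[i], z[j]) / temperature
        # Denominator runs over every other embedding in the batch.
        others = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        total += math.log(sum(math.exp(s) for s in others)) - positive
    return total / (2 * n)

# Views that agree per image score a lower loss than mismatched views.
aligned = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

No label appears anywhere in the loss. The only supervision is the knowledge of which two views came from the same image.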
Weak supervision & multimodal embeddings: next-token prediction, emergent behaviors, and ImageBind
Kian links self-supervision in text (next-token prediction) to emergent capabilities like factual recall and conditional reasoning. He then introduces weak supervision for multimodality—using naturally paired data (captions, subtitles, audio-video)—and highlights shared embedding spaces (often pivoting around text) as the “connective tissue” enabling cross-modal retrieval (e.g., ImageBind).
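The sense in which next-token prediction needs no manual labels can be shown with a short sketch: the training targets are cut directly from the raw text. The whitespace tokenizer and the function name `next_token_pairs` are simplifying assumptions, since real systems use subword tokenizers and much longer contexts.

```python
def next_token_pairs(tokens, context_size):
    """Turn a token sequence into (context, target) training pairs."""
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs

# Hypothetical sentence, split on whitespace for simplicity.
tokens = "the cat sat down".split()
pairs = next_token_pairs(tokens, context_size=2)
```

Every position in the corpus yields one training example, so the amount of supervision grows with the amount of raw text available.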