This video isn’t embeddableWatch on YouTube →

Stanford CS230 | Autumn 2025 | Lecture 4: Adversarial Robustness and Generative Models

For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/ai October 14, 2025 This lecture covers adversarial robustness and generative models. To learn more about enrolling in this course, visit: https://online.stanford.edu/courses/cs230-deep-learning To follow along with the course schedule and syllabus, visit: https://cs230.stanford.edu/syllabus/ More lectures will be published regularly. View the playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rNRRGdS0rBbXOUGA0wjdh1X Andrew Ng Founder of DeepLearning.AI Adjunct Professor, Stanford University’s Computer Science Department Kian Katanforoosh CEO and Founder of Workera Adjunct Lecturer, Stanford University’s Computer Science Department

Kian Katanforooshhost

Oct 21, 20251h 47mWatch on YouTube ↗

CHAPTERS

0:05 – 2:37
Lecture roadmap: adversarial robustness + modern generative modeling
Kian frames the lecture around two fast-moving areas: adversarial robustness (attacks/defenses for deployed AI) and generative modeling (how today’s image/video models work). He previews a hands-on, visual style and flags GANs and diffusion as the main generative families covered.
- •Two main themes: adversarial attacks/defenses and generative models
- •Why robustness matters as models proliferate in real products
- •Generative modeling relevance: images, video, text, code
- •Plan: robustness first, then GANs, then diffusion models
2:37 – 5:00
Real-world attack surface: prompt injection, data poisoning/backdoors, model inversion risks
Students propose concrete threats (prompt injection, data poisoning like Nightshade, sensitive data leakage). Kian expands to high-risk scenarios such as autonomous driving and outlines why these issues are urgent in practice.
- •Prompt injection as instruction override to exfiltrate secrets/PII
- •Data poisoning/backdoors via polluted training data
- •Memorization/training-data leakage risks for LLMs
- •Safety-critical examples (e.g., stop-sign recognition)
5:00 – 8:33
Three “waves” of adversarial ML: perturbations → backdoors → prompt injections
Kian gives a historical view of adversarial ML research and how the attack surface expanded. He emphasizes the arms-race dynamic where new defenses provoke new attacks, and notes modern agents/RAG introduce additional entry points.
- •2013 era: imperceptible perturbations (adversarial examples)
- •Later: backdoor/data poisoning as web-scraped training scaled
- •Recent: prompt injection/jailbreak attacks for instruction-following systems
- •Arms race: same researchers often develop both attacks and defenses
- •RAG/agent tooling increases vulnerability surfaces
8:33 – 16:14
Forging adversarial examples by optimizing the input pixels (targeted misclassification)
The class constructs an attack as an optimization problem: find an input image that forces an ImageNet model to output “iguana.” The key twist is taking gradients with respect to the input (pixels) instead of model parameters.
- •Define goal: choose X such that ŷ(X)=iguana
- •Use a loss (e.g., L2 between ŷ and target label) to drive optimization
- •Backprop through fixed network to compute ∂L/∂X
- •Gradient descent in pixel space produces a targeted adversarial input
16:14 – 19:55
Adding a “looks real” constraint: adversarial examples that still appear like a cat
To make attacks practical (e.g., tampered stop signs), Kian adds a second objective: keep the modified image close to a natural starting image. The class discusses starting from a real cat image and regularizing changes to preserve human realism while flipping the model prediction.
- •Practical attacks must be human-plausible, not random noise
- •Start from X_cat and optimize toward iguana output
- •Add regularization term to keep X close to X_cat
- •Result: image lies in region ‘looks real’ but model misclassifies
19:55 – 22:43
Physical-world examples: adversarial patches, misclassification on devices, invisibility cloaks
Kian shows real demonstrations: small perturbations and patches can flip mobile vision predictions or hide a person from detection. He highlights how researchers craft loss functions with constraints like printability and smoothness to make attacks work in the physical world.
- •Examples: library→prison, washer→doormat misclassifications
- •Adversarial patch can cause a detector to miss a person
- •Patch optimization includes constraints: printable colors, smoothness
- •Physical-world robustness is harder due to intra-class variability
22:43 – 24:31
Transferability & black-box attacks: attacking a model you can’t inspect
A question about YOLOv2 leads into transfer attacks: patches built on one model often affect others trained on similar data/tasks. Kian explains rate-limiting and query restrictions, and how attackers train surrogate models to craft black-box attacks.
- •White-box vs black-box threat models
- •Transferability across models with similar features/training
- •Query limits as a defense (restrict ‘pings’)
- •Surrogate-model strategy to craft attacks without target access
24:31 – 31:23
Why adversarial perturbations work: high dimensionality and linear behavior
Kian challenges the common intuition that nonlinearity causes adversarial sensitivity. Using a logistic regression example, he shows how small aligned perturbations accumulate across many dimensions, making high-dimensional models extremely sensitive to tiny, coordinated changes.
- •Nonlinearity is not the main culprit; networks behave locally linear end-to-end
- •High-dimensional input spaces enable perturbations to compound
- •Logistic regression example: X* = X + εW flips prediction dramatically
- •Tiny per-feature changes add up via ε·WᵀW
31:23 – 33:57
Fast Gradient Sign Method (FGSM): one-shot adversarial example generation
Kian introduces FGSM as a practical, one-step method to generate adversarial examples. It perturbs every pixel slightly in the sign direction of the input gradient, producing images that remain visually similar but change model outputs.
- •FGSM: X* = X + ε·sign(∂J/∂X)
- •One-shot attack using gradient sign across all pixels
- •Imperceptible changes can still induce large output shifts
- •Links to broader literature and attack taxonomies
33:57 – 38:56
Defenses toolbox: sanitization, adversarial training, red teaming, RLHF
The class brainstorms defenses and Kian organizes them into common categories. He emphasizes adversarial training as a dominant technique for vision models, while modern LLM defenses include safety layers, red teaming, and alignment/post-training methods.
- •Input sanitization / safety nets to detect suspicious inputs
- •Output filtering or masking information to hinder exploitation
- •Adversarial training using generated adversarial examples with same labels
- •Red teaming as systematic internal attack evaluation
- •RLHF/constitutional AI as post-training alignment tools
38:56 – 44:38
Backdoor/data poisoning attacks: triggers embedded in training data
Kian explains backdoors by injecting a trigger pattern (e.g., a small patch) into a subset of training images and flipping labels. The model learns to key on the trigger, enabling attackers to control behavior at deployment; parallels to text/data scraping are discussed.
- •Trigger + mislabeled examples teach model a hidden rule
- •At deployment, trigger causes systematic misclassification
- •Works across modalities (images and text corpora)
- •Defending is hard: combine sanitization, auditing, red teaming, alignment
44:38 – 50:00
Prompt injection (direct & indirect): overriding instruction hierarchy in LLM apps
Kian formalizes prompt injection via prompt templates (“yellow bricks”) combined with malicious user text that tells the model to ignore prior instructions. He contrasts direct attacks with indirect attacks embedded in retrieved web pages/documents that agents ingest.
- •Prompt template + user input concatenation creates attack channel
- •Direct injections: ‘ignore previous instructions’ style overrides
- •Indirect injections: malicious text hidden in retrieved content for agents
- •Classic jailbreak-style examples and evolving mitigations
50:00 – 56:04
Generative modeling foundations: discriminative vs generative + key industry use cases
The lecture pivots to generative AI, focusing on learning data distributions rather than labels. Kian surveys use cases (text-to-image, privacy-preserving synthetic data, super-resolution, inpainting) and motivates self-supervision at scale.
- •Generative models aim to match the real data distribution
- •Use cases: text-to-image/video, synthetic medical data, super-resolution
- •Inpainting for privacy (remove people, fill scene realistically)
- •Self-supervised learning via massive data; model capacity vs dataset size
56:04 – 1:11:15
GANs: generator–discriminator game, losses, and training stability issues
Kian introduces GANs as a two-network minimax game where the discriminator classifies real vs fake and the generator learns to fool it. He derives the standard discriminator BCE-style objective, then discusses practical training instabilities and the non-saturating loss trick.
- •Generator maps latent z to an image; discriminator predicts real (1) vs fake (0)
- •Losses: discriminator on real and generated; generator tries to fool D
- •Cold-start: D learns fast, G gradients can vanish (saturating loss)
- •Non-saturating generator loss increases early gradients and speeds learning
1:11:15 – 1:15:35
GAN limitations and heuristics: mode collapse, update ratios, and latent-space arithmetic
Kian highlights mode collapse as a core GAN failure: generating a narrow set of outputs that fool D without covering the full distribution. He mentions heuristics like training D more frequently than G and shows linear latent manipulations (e.g., ‘woman + sunglasses’) as a notable GAN property.
- •Mode collapse: G ‘cheats’ by generating limited high-fidelity modes
- •Training heuristics: adjust D/G update frequency for stability
- •Stopping criteria and evaluation challenges (often qualitative)
- •Latent arithmetic enables controllable attribute edits in generated images
1:15:35 – 1:18:39
Diffusion models: motivation vs GANs and evidence of improved diversity
Diffusion is presented as a single-model alternative that avoids adversarial games and reduces mode collapse. Kian uses examples (flamingos, burgers) to show diffusion’s stronger coverage and variety compared to GAN outputs.
- •Single-model training is more stable than GAN minimax dynamics
- •Better distribution coverage reduces mode collapse
- •Qualitative comparisons show more diverse samples
- •Sets up diffusion as denoising-based generative modeling
1:18:39 – 1:34:02
Diffusion training: forward noising, reverse denoising, and noise-prediction loss
Kian explains diffusion as learning to reverse a progressive noising process. The forward process adds (often Gaussian) noise over T steps; training teaches a neural network to predict the cumulative noise so it can be subtracted to recover the clean image—self-supervised via known injected noise.
- •Forward diffusion: x_{t+1}=x_t + ε_t (conceptually), ε_t ~ Gaussian
- •Reverse model predicts ε̂ given noisy x_t and timestep t
- •Reconstruction/L2 loss between true ε and predicted ε̂
- •Self-supervision: labels come from the known noise you added
- •Noise schedule and practical tweaks (β_t) matter
1:34:02 – 1:37:44
Sampling/inference with diffusion: iterative denoising from random noise (and conditioning)
At inference, diffusion starts from random noise and repeatedly predicts and subtracts noise to form an image—computationally expensive due to many model calls. Kian notes real systems guide generation via conditioning (e.g., text embeddings) rather than relying on unconstrained randomness.
- •Iterative loop: noise → predict ε̂ → subtract → less noisy sample
- •Many steps per image explains slow early generations in tools
- •Unconditioned sampling yields arbitrary outputs; conditioning steers results
- •Conditioning via text/image embeddings enables prompt-based generation
1:37:44 – 1:47:16
Latent diffusion + video diffusion (Sora/Veo): efficiency and temporal consistency
To reduce compute, latent diffusion performs the noising/denoising in a compressed autoencoder latent space, then decodes back to pixels. For video, the model operates over spatiotemporal ‘cubes’/tokens to enforce frame-to-frame consistency, enabling modern text-to-video systems.
- •Autoencoder compresses images to latents; diffusion runs in latent space
- •Decoder reconstructs final image from denoised latent z_0
- •Video adds temporal dimension; model must maintain consistency across frames
- •Tokenizing video into spatiotemporal patches/cubes supports diffusion training
- •Practical systems combine latent diffusion, conditioning, and optimization tricks

Get more out of YouTube videos.

High quality summaries for YouTube videos. Accurate transcripts to search & find moments. Powered by ChatGPT & Claude AI.

iOS

Android

Claude

Chrome

Lecture roadmap: adversarial robustness + modern generative modeling

Real-world attack surface: prompt injection, data poisoning/backdoors, model inversion risks

Three “waves” of adversarial ML: perturbations → backdoors → prompt injections

Forging adversarial examples by optimizing the input pixels (targeted misclassification)

Adding a “looks real” constraint: adversarial examples that still appear like a cat

Physical-world examples: adversarial patches, misclassification on devices, invisibility cloaks

Transferability & black-box attacks: attacking a model you can’t inspect

Why adversarial perturbations work: high dimensionality and linear behavior

Fast Gradient Sign Method (FGSM): one-shot adversarial example generation

Defenses toolbox: sanitization, adversarial training, red teaming, RLHF

Backdoor/data poisoning attacks: triggers embedded in training data

Prompt injection (direct & indirect): overriding instruction hierarchy in LLM apps

Generative modeling foundations: discriminative vs generative + key industry use cases

GANs: generator–discriminator game, losses, and training stability issues

GAN limitations and heuristics: mode collapse, update ratios, and latent-space arithmetic

Diffusion models: motivation vs GANs and evidence of improved diversity

Diffusion training: forward noising, reverse denoising, and noise-prediction loss

Sampling/inference with diffusion: iterative denoising from random noise (and conditioning)

Latent diffusion + video diffusion (Sora/Veo): efficiency and temporal consistency

Get more out of YouTube videos.