Stanford Online

Stanford CS230 | Autumn 2025 | Lecture 4: Adversarial Robustness and Generative Models

For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/ai

October 14, 2025

This lecture covers adversarial robustness and generative models.

To learn more about enrolling in this course, visit: https://online.stanford.edu/courses/cs230-deep-learning
To follow along with the course schedule and syllabus, visit: https://cs230.stanford.edu/syllabus/
More lectures will be published regularly. View the playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rNRRGdS0rBbXOUGA0wjdh1X

Andrew Ng
Founder of DeepLearning.AI
Adjunct Professor, Stanford University’s Computer Science Department

Kian Katanforoosh
CEO and Founder of Workera
Adjunct Lecturer, Stanford University’s Computer Science Department

Host: Kian Katanforoosh
Oct 20, 2025 · 1h 47m

At a glance

WHAT IT’S REALLY ABOUT

Adversarial attacks and modern generative models: GANs to diffusion explained

  1. The lecture frames adversarial robustness as an arms race, outlining three “waves” of attacks: imperceptible input perturbations, data poisoning/backdoors via scraped training data, and prompt-injection/jailbreaks targeting LLM instruction hierarchies and tool-using agents.
  2. It demonstrates how to construct adversarial examples by optimizing pixels (holding model weights fixed) and explains why high-dimensional inputs make tiny, coordinated perturbations compound into large prediction shifts, motivating fast one-step attacks like FGSM.
  3. It reviews practical defenses—input/output filtering, adversarial training, red teaming, and alignment techniques like RLHF/constitutional AI—while emphasizing that no single defense is complete, especially against backdoors and indirect prompt injection.
  4. For generative modeling, it contrasts discriminative vs. generative goals (matching distributions) and explains GANs as a two-network minimax game, including key training pathologies (saturation, instability, mode collapse) and common loss/optimization tricks.
  5. It presents diffusion models as a more stable, single-model alternative that learns to reverse a forward noising process (predicting noise with an L2 reconstruction objective), then extends to latent diffusion for efficiency and to video diffusion via spatiotemporal “cube/token” representations and conditioning on prompts.
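
The forward noising process and the noise-prediction objective from point 5 can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's code: the linear schedule, the 1000 steps, the 8x8 "image", and the function names `forward_noise` and `noise_prediction_loss` are all assumptions chosen for the example, and a zero-valued dummy predictor stands in for the neural network that would normally predict the noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed linear noise schedule over T steps (illustrative values).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)  # fraction of signal variance kept at step t

def forward_noise(x0, t, rng):
    """Sample a noised x_t directly from x_0; also return the noise that was added."""
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

def noise_prediction_loss(eps_pred, eps):
    """L2 objective: mean squared error between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

x0 = rng.uniform(-1.0, 1.0, size=(8, 8))      # stand-in for a real image
x_t, eps = forward_noise(x0, t=500, rng=rng)

# A trained model would predict eps from (x_t, t); this dummy always predicts zero.
dummy_pred = np.zeros_like(eps)
loss = noise_prediction_loss(dummy_pred, eps)
print("signal kept at t=500:", float(alpha_bars[500]), "dummy loss:", loss)
```

Training would minimize this loss over random images and random steps t; sampling then runs the learned denoiser in reverse, starting from pure noise.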

IDEAS WORTH REMEMBERING

5 ideas

The attack surface expands as models gain more ‘entry points.’

Early attacks mainly perturbed inputs, but modern systems with prompts, retrieval, and tools introduce new attack surfaces—especially indirect prompt injection through documents/webpages an agent reads.

You can ‘attack’ a fixed classifier by optimizing the input, not the weights.

Define a loss that pushes the model output toward a target label (e.g., iguana) and run gradient descent on pixels; adding a realism constraint keeps the result visually similar to an original image (e.g., cat that the model calls iguana).
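
As a rough sketch of this idea, the snippet below runs gradient descent on the input of a fixed classifier. Everything here is a stand-in: a random linear softmax model replaces a trained image network, `targeted_attack`, `lam`, `lr`, and `steps` are hypothetical names and values, and the realism constraint is assumed to be a squared distance to the original input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained classifier: fixed linear weights + softmax.
n_pixels, n_classes = 64, 3
W = rng.normal(size=(n_classes, n_pixels))
b = np.zeros(n_classes)

def predict_proba(x):
    logits = W @ x + b
    logits = logits - logits.max()  # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def targeted_attack(x_orig, target, lam=0.05, lr=0.01, steps=1000):
    """Gradient descent on the pixels only; W and b are never updated."""
    x = x_orig.copy()
    onehot = np.eye(n_classes)[target]
    for _ in range(steps):
        p = predict_proba(x)
        # Gradient of cross-entropy toward the target label: W^T (p - onehot).
        # Gradient of the realism term lam * ||x - x_orig||^2: 2 * lam * (x - x_orig).
        grad = W.T @ (p - onehot) + 2.0 * lam * (x - x_orig)
        x = np.clip(x - lr * grad, 0.0, 1.0)  # keep pixel values in [0, 1]
    return x

x_orig = rng.uniform(0.0, 1.0, size=n_pixels)
original_label = int(np.argmax(predict_proba(x_orig)))
target = (original_label + 1) % n_classes      # any label other than the original
x_adv = targeted_attack(x_orig, target)

print("original label:", original_label, "target:", target)
print("label after attack:", int(np.argmax(predict_proba(x_adv))))
print("max pixel change:", float(np.max(np.abs(x_adv - x_orig))))
```

A larger `lam` keeps the result closer to the original image at the cost of a weaker push toward the target label.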

High dimensionality—not just nonlinearity—drives adversarial sensitivity.

Even near-linear models can be brittle because tiny per-dimension shifts add up. In the logistic regression example, adding a small ε·w to the input shifts the logit by ε·wᵀw, which grows with the number of input dimensions and can flip the prediction.
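
The arithmetic behind this is that wᵀ(x + ε·w) + b = wᵀx + b + ε·wᵀw, so the logit moves by ε times the squared norm of w. The sketch below checks this numerically for a few input sizes. The random weights, the value of ε, and the helper name `clean_and_shifted` are assumptions made for illustration; the bias is chosen so the clean logit is always -2.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def clean_and_shifted(n, eps, rng):
    """Return the prediction before and after adding eps * w to an n-dimensional input."""
    w = rng.normal(size=n)            # hypothetical fixed weights
    x = rng.normal(size=n)
    b = -2.0 - w @ x                  # force the clean logit to -2 for every n
    clean = sigmoid(w @ x + b)
    shifted = sigmoid(w @ (x + eps * w) + b)   # logit moves by eps * (w @ w)
    return float(clean), float(shifted)

rng = np.random.default_rng(0)
eps = 0.02                            # same small per-feature scale for every n
results = {n: clean_and_shifted(n, eps, rng) for n in (10, 1_000, 100_000)}

for n, (clean, shifted) in results.items():
    print(f"n={n:>6}  clean={clean:.3f}  shifted={shifted:.3f}")
```

With the same ε, the low-dimensional input barely moves the prediction, while the high-dimensional ones push it across the 0.5 decision threshold.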

FGSM is a practical ‘one-shot’ way to generate adversarial examples.

Instead of iterative optimization, add ε·sign(∂J/∂x) to the input x, nudging every pixel in the direction that increases the loss J. This often preserves human-perceived similarity while changing the model’s decision.
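
A minimal FGSM sketch for a logistic regression model follows, where the input gradient of the binary cross-entropy loss has the closed form (σ(wᵀx + b) − y)·w. The model, the 784-dimensional input, and ε = 0.02 are illustrative assumptions, and the result is not clipped back to a valid pixel range, which a real image attack would typically do.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """One-step FGSM for logistic regression with binary cross-entropy loss J."""
    grad_x = (sigmoid(w @ x + b) - y) * w      # dJ/dx in closed form
    return x + eps * np.sign(grad_x)           # every feature moves by exactly eps

rng = np.random.default_rng(1)
n = 784                                        # e.g. a flattened 28x28 image (assumed)
w = rng.normal(size=n)                         # hypothetical fixed weights
x = rng.uniform(0.0, 1.0, size=n)
b = 3.0 - w @ x                                # clean logit is +3, so the model says class 1
y = 1.0                                        # true label

x_adv = fgsm(x, y, w, b, eps=0.02)

print("clean prediction:", float(sigmoid(w @ x + b)))
print("adversarial prediction:", float(sigmoid(w @ x_adv + b)))
print("max per-feature change:", float(np.max(np.abs(x_adv - x))))
```

No feature changes by more than ε, yet the logit drops by ε times the sum of |w|, which is large when there are many input dimensions.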

Defenses are layered and imperfect; backdoors are especially hard.

Sanitization and output filtering can catch some attacks, adversarial training improves robustness to perturbations, and red teaming/RLHF help find jailbreaks—but poisoned training data can hide triggers that only appear at deployment.
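
Of these defenses, adversarial training is the easiest to sketch in code: at each step, regenerate adversarial examples against the current weights and train on them alongside the clean data. The example below is one simple variant under stated assumptions (synthetic two-class data, logistic regression, FGSM as the attack, an equal mix of clean and adversarial points); the function names and hyperparameters are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_batch(X, y, w, b, eps):
    """FGSM applied row by row for logistic regression."""
    grad_X = (sigmoid(X @ w + b) - y)[:, None] * w[None, :]
    return X + eps * np.sign(grad_X)

def train(X, y, eps=0.0, lr=0.1, epochs=200):
    """Gradient descent; if eps > 0, mix in fresh FGSM examples every epoch."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        if eps > 0:
            X_adv = fgsm_batch(X, y, w, b, eps)   # attack the current weights
            X_used, y_used = np.vstack([X, X_adv]), np.concatenate([y, y])
        else:
            X_used, y_used = X, y
        err = sigmoid(X_used @ w + b) - y_used
        w = w - lr * (X_used.T @ err) / len(y_used)
        b = b - lr * err.mean()
    return w, b

def accuracy(X, y, w, b):
    return float(np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1)))

# Synthetic data: two Gaussian blobs in 20 dimensions (illustrative only).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400).astype(float)
X = rng.normal(size=(400, 20)) + (2 * y - 1)[:, None] * 0.5

eps = 0.1
w_std, b_std = train(X, y)
w_adv, b_adv = train(X, y, eps=eps)

for name, (w, b) in {"standard": (w_std, b_std), "adversarial": (w_adv, b_adv)}.items():
    X_attacked = fgsm_batch(X, y, w, b, eps)
    print(name, "clean acc:", accuracy(X, y, w, b),
          "acc under FGSM:", accuracy(X_attacked, y, w, b))
```

This only addresses input perturbations. It does nothing for poisoned training data, where the trigger is already inside the training set.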

WORDS WORTH SAVING

5 quotes

One thing that is special about, you know, this space, I would say, is that, you know, for every new defense there's a new attack, and for every new attack there's a new defense.

Kian Katanforoosh

Those are called adversarial attacks and, or adversarial examples, and you can think of them as, uh, optical illusions for neural networks.

Kian Katanforoosh

It turns out this space is way bigger than the space that us human look at.

Kian Katanforoosh

You might actually hide certain things online and you know that a large foundation model provider would at some point send a bot that's gonna read that data, collect it, put it in a training set. You essentially created an entry point for your attack later on when that model will be in production.

Kian Katanforoosh

GANs might actually find a way to fool D without actually looking at the entire data distribution. So it might actually create a set of cats that are so good, so impossible to tell from reality that D is always getting it wrong, and it would look like the GAN game is done when actually G has not learned the full data distribution. It has only partially learned it, and that is a problem.

Kian Katanforoosh

Three waves of adversarial attacks (perturbations, backdoors, prompt injection)
White-box vs. black-box attacks and transferability
Adversarial example optimization in pixel space
Fast Gradient Sign Method (FGSM) intuition
Defenses: sanitization, adversarial training, red teaming, RLHF
GANs: generator–discriminator minimax losses, saturation fix
Mode collapse and training instability in GANs
Diffusion: forward noising, reverse denoising, noise schedules
Latent diffusion via autoencoders for compute reduction
Prompt-conditioned image/video generation and temporal consistency

High quality AI-generated summary created from speaker-labeled transcript.
