Stanford Online
Stanford CS230 | Autumn 2025 | Lecture 4: Adversarial Robustness and Generative Models
- Kian Katanforoosh
Welcome to CS230 Lecture 4. Thank you for coming in person or joining online. Today's lecture is one of my favorites. It's a fun one. There are a lot of visuals to look at, and we cover a lot of modern methods as well. A lot of the content is brand new. We're going to focus on two topics today: adversarial robustness and generative modeling.
Adversarial robustness is an important topic today because there are more and more AI models in the wild. You're using dozens of them on a daily basis, and the more algorithms are being used, the more they're prone to attacks, and the more we have to be careful and build defenses proactively. That's what makes this research field of adversarial attacks and defenses very prolific.
The other topic we'll cover is generative models, which, as you may have seen in the news, is really hot right now. Video generation is now becoming a reality, along with image generation, which you're all already used to, and of course text generation and code generation, which we all use regularly. There's a lot of heat in that space, so we're gonna try to break down the types of algorithms that power products like Sora or Veo. We're excited for this, so let's keep it interactive as always.
We'll start with adversarial robustness. It should take us thirty to forty-five minutes, and then we'll keep the latter part focused on generative models, starting with GANs, generative adversarial networks. Even though they're called adversarial, they're not really connected to adversarial attacks. It's a different problem. And then diffusion models, which are, I would say, the most popular family of algorithms for today's image and video generation products.
So let's start with adversarial robustness with an open question for you all. Can you tell me examples of attacks on AI models?
Are you worried about anything when you use AI? Yes.
- Speaker
Prompt injection.
- Kian Katanforoosh
Prompt injection. What is that?
- Speaker
Like, you sneak a sentence into a prompt, say through copy-paste, but one that's malicious.
- Kian Katanforoosh
Yeah. We'll talk about prompt injections, but you essentially try to fool the LLM, let's say, by giving it an instruction that might bypass another instruction that the builder of the model or the user of the model wanted it to follow in the first place. It might create dangerous situations where someone might steal information such as passwords or PII. What else? Yeah.
- Speaker
Nightshade.
- Kian Katanforoosh
Huh?
- Speaker
Nightshade.
- Kian Katanforoosh
Night what?
- Speaker
Nightshade.
- Kian Katanforoosh
Oh, Night-- What is that?
- Speaker
It's like data poisoning for AI models. I believe it takes some image, for example an image of a cat, but it gives the image some features of a dog. So it tries to trick the AI model into learning the features of a dog, and then attacks it.
- Kian Katanforoosh
I see. Great one. A type of data poisoning attack where you're trying to fool the model by inserting certain pixels or certain traits that might confuse the model and in turn allow someone to bypass the algorithm, for example. Yeah, you're right. What else? What are use cases where a model being attacked can be very high risk? Yeah.
- Speaker
Very well saying, observe, right?
- Speaker
Yeah.
- Speaker
Uh, if reasons to, uh, fake account numbers, like about the, the training data.
- Kian Katanforoosh
Yeah. So, LLMs are trained on data from the wild. There's a lot of data online. A model might actually be trained on bank account numbers or Social Security numbers. If someone can reverse engineer the training data and find this information, it puts the company that's building that LLM at risk, for sure, and the users as well. Okay. Does anyone want to add anything else? There are a lot of other examples as well. If you think of autonomous driving, a car is trained to detect stop signs, and if someone maliciously tries to modify the algorithm so that it doesn't see the stop sign, it may create a crash and potentially harm someone. Those are a lot of examples. We're gonna cover that.
I would say that in the space of adversarial attacks, we've had three waves over the last ten years or so. In two thousand thirteen, Christian Szegedy and his co-authors, with a great paper called "Intriguing properties of neural networks," essentially told us that small perturbations to, let's say, an image can fool a computer vision model. You might not actually see the perturbation, but the model, which looks at pixels as numbers, sees it, and even an imperceptible perturbation can wildly change the output of the model. This is very dangerous. Those are called adversarial attacks or adversarial examples, and you can think of them as optical illusions for neural networks.
A few years later, training models became more common, more people were training models, and most importantly, a lot of scraping happened online. So models were being trained on data scraped from the web. Another type of attack, which you mentioned, became prominent: backdoor attacks or data poisoning attacks. As an attacker, you might hide certain things online, knowing that a large foundation model provider will at some point send a bot that's gonna read that data, collect it, and put it in a training set.
You've essentially created an entry point for your attack later on, when that model is in production.
And then more recently, prompt injections. We all use prompts very commonly, and there are a lot of malicious prompt injection or jailbreaking attacks that can override what the model was originally intended to do. We'll also talk about these attacks. All of them are relevant, and it's an active research area, but it's important to know at a high level how those attacks work.
One thing that is special about this space, I would say, is that for every new defense there's a new attack, and for every new attack there's a new defense. So defenses and attacks are competing with each other. And you'll find, frankly, that in the AI space, including in the Gates department here at Stanford, a lot of the people who are coming up with attacks are the same people who are coming up with defenses. But it matters.
One thing to note about the progression of these attacks is that originally, if you look at the two thousand fourteen to two thousand eighteen period, a lot of the attacks went through the inputs. Now that AI agents work with instructions, with context, with retrieval pipelines, there are a lot more entry points to perform an attack, and so models are more vulnerable. We'll talk about retrieval-augmented generation in a lecture in two to three weeks, and you'll see that when you connect an agent to a database that you might not know, there are a lot of risks involved. It might be reading a document that can maliciously attack your agent.
Okay. So let's try to come up with a first attack, an adversarial example in the image space. Here's my problem for you, and we're gonna do it like last week, but more interactive, like two weeks ago. You're given a network that is pre-trained on ImageNet.
Remember, ImageNet has a bunch of classes and a lot of images, so the network can detect pretty much all the common objects and people that you can imagine would be in a picture. Can you find an input image that will be classified as an iguana? So what I'm asking you is this: you have that neural network, it's pre-trained, and of course if you take an image of a cat and give it to the model, it's gonna say, "Hey, I think it's a cat." What I'm asking is, how do you find an image such that the output is iguana? So how do you do that? Yes.
Take a picture of an iguana, give it to the model, and it's likely to find an iguana. That's a fair solution. What else? Although you wouldn't even be guaranteed that it finds the iguana. Probably it would, but it depends on the model's performance. How can you be guaranteed that it's gonna predict an iguana? Yeah, you wanna try again?
The training set on the computer with iguana. Okay. So assuming you have access to the training set of the model, you can find pictures labeled as iguanas, and it's likely that, because the model has been trained on that dataset, it will in fact predict an iguana. That's also true.
Now let's say you don't have access to the model parameters. Yeah. You have access to the output. Yeah. [inaudible] Okay. Put a higher probability on it. I see. So you send a bunch of pictures and you keep hitting the model until you find that the prediction is iguana, and then you say that's the picture. Yeah, correct. That's sort of an optimization problem you're posing, which is what we're gonna do.
Remember, two weeks ago I told you that designing loss functions is an important skill, maybe an art, in neural networks. Here's an example of you coming up with a loss function that would allow you to forge an attack on pretty much any model. So here's what we're gonna do. We're gonna rephrase what we want in simple words.
We wanna find X, the input, such that Y-hat of X is equal to the label for iguana. So the prediction is as close as possible to Y iguana. If you had to write that as a loss function, what would it look like? A loss function you wanna minimize, let's say. Yeah. Mean squared error. Hmm? Mean squared error. Mean squared error between what and what? Between Y iguana and Y-hat. Yeah, Y-hat and Y iguana. Good. Yeah, I agree. You could put an L2 distance between Y-hat, given the weights and biases, and Y iguana. And if you minimize that with respect to X, you would get an X that leads to a Y-hat equal to Y iguana, or as close as possible to it.
There is one difference here with what we've seen in the past, which is that we are not touching the parameters of the network. We're starting from an image X. We're sending that image through the network. We're computing the loss function we defined. And then we're computing the gradients of L with respect to the input pixels. In gradient descent, you're used to the training process where you push the parameters to the right or to the left. Here, you're doing the same thing in pixel space. The model is completely fixed. It's already pre-trained. And if you do that many times with gradient descent, you should end up with an image that is going to be predicted as iguana. Does that make sense to everyone? Yeah.
So now the question is: will the forged image X look like an iguana or not? Who thinks it will look like an iguana? Who thinks it will not? It was you. Does someone want to say why you think it will not look like an iguana? Yeah.
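A minimal sketch of the procedure described above (freeze the network, take an L2 loss between the prediction and the iguana label, and run gradient descent on the input pixels). Everything specific here is an assumption for illustration and not from the lecture: a tiny fixed linear-softmax classifier stands in for the pre-trained ImageNet network, and the names forge_input, W, b, and y_iguana, the 4x4x3 input size, the learning rate, and the step count are all hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D vector of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def forge_input(W, b, y_target, x0, lr=1.0, steps=2000):
    """Gradient descent on the input x. The parameters W and b stay frozen."""
    x = x0.copy()
    for _ in range(steps):
        p = softmax(W @ x + b)        # forward pass: y_hat(x)
        g = 2.0 * (p - y_target)      # dL/dy_hat for L = ||y_hat - y_target||^2
        dz = p * (g - p @ g)          # backpropagate through the softmax
        x -= lr * (W.T @ dz)          # step on the pixels, not on the weights
    return x


# Stand-in "pre-trained" model (assumption): 3 classes, 4x4x3 = 48 input values.
rng = np.random.default_rng(0)
classes = ["cat", "dog", "iguana"]
n_pixels = 4 * 4 * 3
W = 0.1 * rng.normal(size=(len(classes), n_pixels))
b = np.zeros(len(classes))

y_iguana = np.array([0.0, 0.0, 1.0])          # one-hot label for "iguana"
x0 = rng.uniform(0.0, 1.0, size=n_pixels)     # start from a random image

x_forged = forge_input(W, b, y_iguana, x0)
p_final = softmax(W @ x_forged + b)
print("predicted class:", classes[int(np.argmax(p_final))])
print("probability assigned to iguana:", float(p_final[2]))
```

Note that this sketch does not constrain the pixel values to a valid range, so the forged input is just a vector of numbers that the stand-in model labels as iguana.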
- Speaker
It's gradient descent on the pixels, so the chance of landing on an iguana would be low.
- Kian Katanforoosh
You think the chance is low?
- Speaker
Yes.
- Kian Katanforoosh
You're not convinced that pushing pixels in a certain direction will lead to a continuous set of colors that would look like an iguana. Okay. That's a good intuition.
- Speaker
And one would think that the possible range would start getting more and more impossible. And of the space of all possible images, by defining that
- Kian Katanforoosh
I see.
- Speaker
Right.
- Kian Katanforoosh
So you're saying there are more images that are classified as iguana by the model than there are real iguana images possible? Yeah. Yeah, that's also a good intuition. Exactly. Yeah.
- Speaker
I would think the model would be picking up on certain features, not necessarily that the whole image is an iguana. I remember, for example, for the assignment on detecting cats, I took a picture of a sheepskin, and it was like, "Yeah, that's definitely a cat."
- Kian Katanforoosh
[laughs] I see. I see. Yeah. Okay, so you're saying we might see some patterns that resemble an iguana, but it's unlikely the picture will look like an iguana as a whole. Yeah. It's a good example. For example, possibly the picture we're gonna see is more green than not, let's say. Maybe. That's possible.
So you're right. It is highly unlikely that the forged image will look like an iguana, and the reason is all of what you mentioned. Let's imagine the space of possible input images to the network. It turns out this space is way bigger than the space that we humans look at. We never look at random images in the wild. We actually look at a fairly small distribution of patterns with our eyes. So let's say this is the space of possible input images. This space is very large. The space of real images, what we come across as humans when we look at the world, is much smaller than that. The blue space is this size because the model can take anything as an input. Two hundred and fifty-six possible values for each of the thirty-two-by-thirty-two-by-three pixel values gives two hundred and fifty-six to the power of three thousand seventy-two possible images, which is gigantic. It's way more than the number of atoms in the universe.
And so it is very likely, because of the way we defined our optimization problem, that our image will fall in the green space, the space of images that are classified as iguana. And yes, there is an overlap between the green and the red space. Those are the iguanas that follow the real distribution. But the green space is much bigger, as you were saying, and that's why it's unlikely that we'll end up in the overlap. Okay? So this is more likely what we'll see. It does not look at all like an iguana. Okay? Does that make sense?
So now we're gonna go one step further, because it's nice to be able to forge an attack, but if it looks random, it looks random to humans. Say you're looking at a stop sign that's been forged. It doesn't look at all like a stop sign.
Someone will just take it down, right? So a smarter attacker is gonna try to come up with an image that still looks like something to a human, and that might be more problematic. Let's say a stop sign still looks like a stop sign, but it's not predicted as a stop sign. That becomes way more dangerous. So how do we modify the previous setup in order to do that? Given a network pre-trained on ImageNet, find an input image that displays a cat, but instead of predicting it as a cat, the model now predicts it as an iguana because the image has been tampered with. So how do we change our initial pipeline? Someone-- Yeah, in the back.
- Speaker
Try swapping some pixels of the cat image.
- Kian Katanforoosh
Okay.
- Speaker
And see if that looks like an iguana.
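As a quick check on the input-space size discussed earlier in the lecture, the number of possible 32x32x3 images with 256 values per entry can be compared against a commonly quoted rough estimate of about 10^80 atoms in the observable universe. That atom estimate and the variable names below are assumptions added for illustration, not figures from the lecture.

```python
import math

values_per_entry = 256            # possible values of one 8-bit pixel channel
n_entries = 32 * 32 * 3           # height x width x color channels
n_images = values_per_entry ** n_entries   # exact count as a Python big integer

# Work with the base-10 logarithm so the size is readable.
log10_images = n_entries * math.log10(values_per_entry)

atoms_estimate = 10 ** 80         # assumed rough order of magnitude

print(f"entries per image: {n_entries}")
print(f"possible images: about 10^{log10_images:.0f}")
print(f"exceeds atom estimate: {n_images > atoms_estimate}")
```

The real images, and the images a human would recognize as an iguana, occupy only a tiny part of a space this large, which is the point of the blue, red, and green regions in the lecture.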
Episode duration: 1:47:16
Transcript of episode aWlRtOlacYM