Stanford Online

Stanford CS230 | Autumn 2025 | Lecture 2: Supervised, Self-Supervised, & Weakly Supervised Learning

For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/ai September 30, 2025 This lecture covers key AI concepts through case studies. To learn more about enrolling in this course, visit: https://online.stanford.edu/courses/cs230-deep-learning To follow along with the course schedule and syllabus, visit: https://cs230.stanford.edu/syllabus/ More lectures will be published regularly. View the playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rNRRGdS0rBbXOUGA0wjdh1X Andrew Ng Founder of DeepLearning.AI Adjunct Professor, Stanford University’s Computer Science Department Kian Katanforoosh CEO and Founder of Workera Adjunct Lecturer, Stanford University’s Computer Science Department

Kian Katanforoosh (host)
Oct 7, 2025 · 1h 39m

EVERY SPOKEN WORD

  1. KK

    I'm, uh, Kian Katanforoosh, and I am, uh, the co-creator and co-lecturer with Andrew, uh, for this class, CS230. Um, and I will teach about half of the in-person lectures this quarter. Um, outside of, uh, Stanford, I, I work in industry. I, I lead a company called Workera, which uses AI to measure skills. Um, and with the history of CS230 students that have started AI startups and companies, what I try to do usually is to bring a lot of examples from industry. So what you should, uh, expect from these in-class lectures is not as much of the academic side of things which we learn anyway in the online videos, uh, but also the, um, industry, um, uh, specific input. And some of the topics that we cover this, um, year together include, um, decision-making in AI projects, which we're gonna see today. You know, I want you to come out of today's lecture feeling like you had some fun, it was interactive, and also you have a better way to make decisions in AI projects because you've seen how deep learning researchers, engineers, and scientists make their decisions solving problems in industry. Um, other topics later in the quarter for the in-classroom time include things like, you know, adversarial attacks and defenses. We might have some time to cover it today. Uh, deep reinforcement learning, which is really hot in the market right now, and I think it's very important to know about it. Um, and then all the stuff that is very practical, like, uh, retrieval, uh, augmented generation, AI agents, multi-agent system. As we go deeper, um, into the class and you get the baggage of neural networks, we'll be able to cover even more fun topics. Okay. So today's, uh, lecture is gonna be, uh, structured in three parts, uh, maybe four, depending on whether we have time. We'll start with a little recap of the week, um, what you've learned online about neurons and layers and deep neural networks. 
Then we get into a set of supervised learning projects, um, including a day and night simple, you know, vanilla classification, the trigger word detection, which, which is actually a project you're gonna build at the end of the class yourself, um, and then face verification, which we'll see also variation of how face verification algorithms work. In the third section, we'll focus on self-supervised learning and weakly supervised learning. Don't worry if you don't know these terms. We're gonna learn them together. Um, and we talk a lot about embeddings, because embeddings are, um, the connective tissue of many AI systems online today, and it's important to know about them. And finally, if we have time, we'll also talk about adversarial attacks and defenses. With more and more AI systems in the wild, knowing how to defend them is very important, and knowing how to attack them can also teach you how to defend them. So, uh, we'll cover that as well. Sounds good? Uh, please interrupt me as we go through the lecture. Uh, we want this to be very, um, conversational as much as possible. So recap of the week, um, is the, the core way that, um, AI learns from data in a traditional supervised learning setup. You can think of it as an input, such as this little image of the confused cat, um, and an output, in this case, a number between zero and one that represents the chance that there might be a cat on the picture, one, or there's no cat on the picture, zero. Uh, what the model is, and oftentimes you'll see me refer to the model as two things. There's an architecture, which is essentially the blueprint of the model, the skeleton, and parameters. It might be a few parameters, it might be billions of parameters, like the models that OpenAI, DeepMind, and others work on. 
Um, outside of that, uh, a-a-and so when you think about AI models being deployed in the wild, like when you think about what's happening with ChatGPT, uh, what-- You know, you can, you can really come down to there's two files somewhere on the cloud, one that describes the architecture of the model, one that describes the parameters that are part of this architecture, and you keep calling that-- to those two files, and you get your inference or your outputs. That's really what's happening behind the scenes. Much more complicated than that obviously, but those are the two critical components of a neural network: architecture and its parameters that are trained. How does the model learn is through a gradient descent optimization. Meaning I send the picture of the cat through the model, and the model at the beginning is not trained, so it's probably wrong. It tells me, "I think there's no cat. I think it's zero." And then I use something called the loss function to compare the ground truth, there is a cat on the picture, with the prediction from the model at this point in time. Those two numbers are far from each other. That should be a penalty, which the loss function describes. And then in order to give feedback to the parameters, we use this gradient descent update. We do that many, many times. What it means is that we take our parameters and we tell them, "Hey, you should go a little bit more to the right or a little bit more to the left," um, until that number that is the prediction for the cat is closer to the ground truth. We do that with batches of data, millions of images of cats and images of anything else, and we give that feedback repetitively to the model until the parameters are calibrated and the model is in fact finding the cat on this picture. Nothing new here. You've seen it in the videos. Any question on that learning setup? No. Okay. Easy so far. There's many things that can change in this setup, and you'll see in the class. First thing is the input.
The input does not have to be an image. It can be text, like when you chat, you know, with ChatGPT. It can be audio, it can be video, it can be structured data, it can be spreadsheets and numbers. Those, we'll see a variety of examples in the class and how it influences the architecture. Um, the output, again, doesn't have to be zero and one. This is an example of a classification. Um, you could turn this problem into a regression. Uh, for example, if I was asking you, what's the age of the cat? Estimate the age of the cat. That would be a regression task, not a classification task anymore. Later in the class, we'll also see generative task. In fact, lecture four is gonna focus on diffusion models, generative adversarial networks, where the output actually is much bigger than the input, typically. You know, so you can have a low resolution of a cat as input, and the output is a high resolution of the same cat. The, the output is bigger than the input, which can be counterintuitive to people. Um, other things that can change include the architecture. Um, you've learned about the vanilla multilayer perceptron or the fully connected neural network. That's what we're learning right now online together. Um, by the end of the class, you'll have many architectures that you'll be familiar with, from, uh, RNNs and convolutional neural networks, uh, transformer models. All of these, at the end of the day, use the basis neural network that you're learning right now. They're just stacked on top of each other differently. Uh, the loss function is actually a big focus of today's class and, uh, of, of the class in general. The loss function, which is what gives the feedback to the model, um, you were right or you were wrong, and what to do about it, is an art. Designing good loss functions, you know, great deep learning researchers are very creative when it comes to designing loss functions. 
And in fact, when we built, uh, the algorithm called YOLO, it's-- it is called YOLO, um, uh, for You Only Look Once, not you only live once. But YOLO, uh, has a very, you know, difficult to understand at first loss function, and there's a reason why the loss function was designed like that. So by the end of this class, you'll also have a better intuition on how do we design great loss functions. Uh, other things I'm not gonna cover right now, the activation functions in your neural network, the optimizer that you use for your gradient descent loop, and then the hyperparameters that might come in when you train your algorithms. Okay. Nothing new here. Um, this is the basic setup. You've also learned this week about neurons. Uh, the easiest way to think about a neuron is the classic, uh, logistic regression, um, algorithm, where I'm taking the image of the cat. So an image i-in computer science, uh, the way the machine reads it is three channels, RGB, for the three colors, red, green, blue. Um, and we take all these numbers, we put them in a vector. The vector is then fed into a neuron, and the neuron has two components: the linear part, W transpose X plus B, W being the weight and B being the bias, and then an activation function, in this case, the sigmoid function, which is very handy because it takes any number, and it puts it between zero and one so that the output can look like a probability. Classic setup. And here the probability is zero point seventy-three, which is above zero point five, which tells me the model thinks there's a cat on the picture, because one is a cat, zero is no cat. So question for you to get started, how would you modify this binary classification that detects cats in an algorithm that would be able to detect multiple animals, such as a cat, a dog, and a giraffe? What do you need to change about this neural network? Yeah.
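Editor's sketch: the single neuron and the gradient descent update described in this turn can be illustrated roughly as follows. This is a minimal sketch, not the course's code. The binary cross-entropy loss, the learning rate, and the four-number toy "image" are assumptions made for illustration; the lecture only says that a loss function compares the prediction with the ground truth.

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(w, b, x):
    """Single neuron: linear part w^T x + b, then a sigmoid activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

def bce_loss(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy (assumed loss; the lecture does not name one)."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)  # avoid log(0)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

def gradient_step(w, b, x, y_true, lr=0.1):
    """One gradient descent update on a single example."""
    error = neuron(w, b, x) - y_true  # dLoss/dz for sigmoid + cross-entropy
    new_w = [wi - lr * error * xi for wi, xi in zip(w, x)]
    new_b = b - lr * error
    return new_w, new_b

# Toy input: four numbers standing in for a flattened RGB image (hypothetical).
x = [0.2, 0.9, 0.4, 0.7]
y = 1  # ground truth: there is a cat
w, b = [0.0, 0.0, 0.0, 0.0], 0.0  # untrained parameters

before = neuron(w, b, x)  # untrained model outputs exactly 0.5
for _ in range(200):
    w, b = gradient_step(w, b, x, y)
after = neuron(w, b, x)

print(f"prediction before training: {before:.2f}, after: {after:.2f}")
print(f"loss before: {bce_loss(y, before):.3f}, after: {bce_loss(y, after):.3f}")
```

Each step nudges the weights and bias in the direction that lowers the loss, so the prediction moves toward the label 1. As a side note, `sigmoid(1.0)` is roughly 0.73, the value used in the lecture's example.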

  2. SP

    Instead of the sigmoid, you would need a different function that [inaudible] .

  3. KK

    Okay, so you would change the output layer to match to the number of animals you wanna detect. Yeah, correct. Anyone wants to add anything else? Yeah.

  4. SP

    The data that goes in.

  5. KK

    The data that goes in there, how would you change the data?

  6. SP

    It has to be for the animal that you want to classify.

  7. KK

    Okay. Very good. Yeah. You need, uh, you need data from dogs and giraffes and also maybe nature in general. What else do we need not to forget? Yeah.

  8. SP

    Um, maybe you could add a neuron for each of the animals, and then your prediction would be whichever output is the highest.

  9. KK

    Yeah. Okay. Add one neuron per animal. Those neurons will be independent from each other, and each neuron would focus on one animal. Yeah, good point. It's actually what we're gonna do. So yes, I think your, your suggestions were the right one. We could multiply this output layer to have three neurons instead of one. All of them, because it's a fully connected neural net, see the entire pixels flattened in the vector, and then each of them will be focused on an animal. The number one mistake that we see in projects is that people add more data, uh, but forget to adjust the labels. So how do the labels need to be adjusted here? It's not anymore zero and one, right? What type of lab-labels do we need to train this? Yeah.
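Editor's sketch: the three-neuron output layer described in this turn might look like the following. It is a minimal illustration under stated assumptions: the class ordering, the weights, the 0.5 threshold, and the four-number toy input are all hypothetical, chosen only to show that each neuron reads the same flattened input and decides independently.

```python
import math

ANIMALS = ["dog", "cat", "giraffe"]  # assumed ordering of the output neurons

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def output_layer(weights, biases, x):
    """One independent sigmoid neuron per animal, all seeing the full input."""
    return [
        sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        for w, b in zip(weights, biases)
    ]

def detected(probs, threshold=0.5):
    """Names of the animals whose neuron fires above the threshold."""
    return [name for name, p in zip(ANIMALS, probs) if p > threshold]

# Toy flattened input and hand-picked (not trained) parameters.
x = [0.2, 0.9, 0.4, 0.7]
weights = [
    [-1.0, -1.0, -1.0, -1.0],  # dog neuron
    [1.0, 1.0, 1.0, 1.0],      # cat neuron
    [-2.0, 0.0, 0.0, 0.0],     # giraffe neuron
]
biases = [0.0, 0.0, 0.0]

probs = output_layer(weights, biases, x)
print(detected(probs))
```

Because each neuron has its own sigmoid, several animals can be detected in the same picture, or none at all.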

  10. SP

    You gotta have the number for each animal.

  11. KK

    Yeah. Okay. Do you know how we call that? Or no? Okay. So [laughs] yeah. You-- Yeah.

  12. SP

    Vector? Vectors?

  13. KK

    Yeah, vectors. Yeah. I think you're saying the same thing. But the, uh... Yeah. So here you would, you would use a one-hot vector or a multi-hot. You know, one-hot means we'll have, um-- You'll have a vector of size three, and if there is a cat on the picture, the label is gonna be zero, one, zero because the second neuron is gonna be responsible to detect cats. Um, in fact, that would be called the one-hot vector. Uh, oftentimes, you'll have multiple animals on the picture because cats and dogs can appear together. Cats and giraffes, less so, and dogs and giraffes, I've never seen one in the same picture. But anyway, um, you'll have a multi-hot vector. If you have a cat and a dog on the picture, you'll probably label it as one, one, zero. And the reason I'm mentioning that, it may sound silly, but in a lot of projects, people change their data, they forget to change their labels, and then they wonder why it doesn't work. Yeah. Okay, cool. Now, in the class, we use a specific notation with, uh, superscript and subscript. So when you'll see me refer to something like, um, A one one, the superscript in square brackets indicates the layer that you're in. So you're in the first layer. Uh, the subscript refers to the index of the neuron, okay? And A is for activation. So A subscript three square bracket one is A-- is the ne-- the, the output of the third neuron of the first layer. Okay? Again, if I continue, second layer would be written like that, and then you'll get your probability. The deeper the network is, the more capacity it has. This is the word we use, capacity. What it means is that if you send a million pictures of cats and a million pictures of non-cats to a shallow network, it might not have the capacity to learn what's in the dataset. It's just not flexible enough. The deeper the network, the more capacity it has.
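Editor's sketch: the one-hot and multi-hot labels described earlier in this turn can be built as shown below. The helper names are hypothetical. The class ordering (dog, cat, giraffe) is inferred from the lecture's examples, where a cat is labeled zero, one, zero and a cat with a dog is labeled one, one, zero.

```python
CLASSES = ["dog", "cat", "giraffe"]  # ordering inferred from the lecture's labels

def multi_hot(present, classes=CLASSES):
    """Return a 0/1 vector with a 1 for every class present in the picture."""
    unknown = set(present) - set(classes)
    if unknown:
        raise ValueError(f"unknown classes: {sorted(unknown)}")
    return [1 if c in present else 0 for c in classes]

def one_hot(animal, classes=CLASSES):
    """Special case of multi-hot: exactly one class is present."""
    return multi_hot([animal], classes)

print(one_hot("cat"))             # [0, 1, 0]
print(multi_hot(["cat", "dog"]))  # [1, 1, 0]
print(multi_hot([]))              # [0, 0, 0], a picture with none of the animals
```

Raising an error on unknown class names is one way to catch the mistake the lecture warns about, where the data changes but the labels do not.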
So in fact, a network that is super deep, imagine a billion parameters transformer model with one million pictures of cat and non-cats, it will just overfit to those pictures, meaning it's not gonna learn what a cat is. It's just gonna learn by heart those million pictures because its capacity is way bigger than the dataset it's fed. So it's very important to understand the amount of data you're gonna feed and the complexity, diversity of that data will probably dictate the capacity of the models you need to use. Okay. Now, just to give you a little bit more intuition on what happens inside those neural networks, we take this relatively shallow network, but, you know, call it three layers, and we train it on a dataset of facial images. Uh, ignore the task. In, in, in face datasets, there is a lot of tasks you could do. You could do a face verification, you could do a face recognition, you could do, um, you know, face, uh, uh, clustering, things like that. We, we'll talk about that later. But let's say it's been trained really well on understanding faces. If you now unpack this network and you sort of query each neuron and look what's going on inside, what you'll notice is that the first layers are gonna be better at encoding low-complexity features, while the deeper networks are gonna be better at encoding higher complexity features. So here's how it goes. Nothing too complicated for now. Um, the neuron in the first layers, they're looking at pixels because you're giving them directly the pixels. So they're gonna be good at stitching those pixels together. And maybe the first neuron will be good at detecting a diagonal edge. The second neuron will be good at vertical edges, and the third one at horizontal edges because they're just looking at pixels and trying to make sense out of them. Now you go one layer deeper in the middle of the network. Those are not seeing pixels. They're seeing the output of the first layer, which is already slightly more complex. 
So what you can expect the layers in the middle of the network, uh, to give you or to activate for is higher level features like an eye or a nose or an ear, because it turns out if you have a few edges, you can start detecting circles. And so you would see a neuron that is really good at detecting circles. Eyes. The deeper you go in the network, the more you'll get closer to the task itself, which in this case, facial analysis, let's say. You would see the last few neurons detect larger features of the face because, again, they're seeing higher complexity information. Does that make sense? This concept, you know, we call it encoding. We'll also talk about embeddings. It's very important because when you train neural networks, you wanna make sure first that they're understanding what they're doing, and that's one way to see it. We'll have an entire lecture on interpreting and visualizing neural networks later this quarter. Um, and on top of that, um, you probably can make use of some of those encodings and embeddings. We'll, we'll see that later, but why in the vector space, the distances between those concepts are important. You can already imagine that for co-- for tasks like search, searching in a database, having a neural network able to encode information in a very meaningful way can allow you to find concepts that are close to each other and associate them with each other and concepts that are far from each other and dissociate them from each other. Okay, so this was the warm-up for today. How much time we spent? Okay, fifteen minutes on the warm-up. That's good. So we learned a few new words: model, architecture, parameters. I didn't talk about it, but feature engineering versus feature learning, that's the core, uh, uh, concept in, in deep learning is feature engineering, is what we used to do before deep learning, which is, uh, you might actually build an algorithm that is good at detecting eyes. It's just, uh, good at scanning eyes. 
You manually build it, you know. And then you build another one that is good at detecting a mouth, a mouth, and then you put them together to fill-- detect faces. We don't do that anymore. We do end-to-end learning, meaning we let the data speak for itself and train the model. This is called feature learning. It's automatic, and that's how the neural network actually learns those features without you needing to tell it eyes are important to detect faces. You don't need to do that. Um, encoding and embedding the, the, the real difference is encoding is any, any vector representation and, um, an embedding though-- Sorry, it should have been, uh, e- an embedding is when an encoding has meaning. Meaning the distance between two encodings has, has meaning. You know, they might be close or far from each other, and it tells you something. There is a logic to it. And then we talked about one-hot and multi-hot vectors. Okay. So end of the recap, now let's go into supervised learning projects, and we're gonna make decisions together and walk through it. The first case study is day and night classification. So here's my problem for you. Given an image that I give you, classify it as whether it's, uh, the day or whether it's the night. Okay, open-ended problem. Ignore that foundation models exist. You don't have access to ChatGPT or Claude or whatever. The point is to get under the hood and have those discussions because obviously it's a toy example. So, uh, what do you do? Like, what, what, what data do you wanna collect to solve this problem? Start. Yes.

  14. SP

    Check how different pixels in the same row are.

  15. KK

    So tell me more. Check how pixels in the same row are.

  16. SP

    Uh, different rows just, uh, when, uh, see how they're different and if they are very different, there's a possibility that there is the day.

  17. KK

    What does it tell you that if you look at the row of pixel and the row next to it, if they're very different, what does it tell you?

  18. SP

    It has more colors than just-

  19. KK

    Okay, so you say if the delta between pixels that are close geographically to each other, um, is high, then there's probably colors, color changes, so it's day, most likely. Is that it? Okay. So th-that's an example of a, of feature engineering. It's like you're, you're going for it in gray. It's like you're going for it, and you're trying to understand what's a pattern that tells me that a picture is day or night. What else can you do in the world of neural networks? Any other ideas? Yeah.
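Editor's sketch: the hand-engineered feature discussed in this exchange (a large difference between neighboring rows of pixels suggests daytime) could be written roughly like this. It is only an illustration of feature engineering, not a method from the lecture. The grayscale representation, the threshold value, and the two tiny example images are assumptions.

```python
def mean_row_delta(image):
    """Average absolute difference between vertically adjacent pixels.

    `image` is a list of rows, each a list of grayscale values in 0..255.
    """
    total, count = 0, 0
    for upper, lower in zip(image, image[1:]):
        for a, b in zip(upper, lower):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0

def guess_day(image, threshold=20.0):
    """Hand-written rule: high row-to-row variation is taken to mean day."""
    return mean_row_delta(image) > threshold

# Two made-up 3x3 images: a dark, uniform one and a bright, varied one.
night = [[5, 6, 5], [6, 5, 6], [5, 5, 6]]
day = [[200, 210, 190], [90, 120, 100], [30, 160, 220]]

print(mean_row_delta(night), guess_day(night))
print(mean_row_delta(day), guess_day(day))
```

A rule like this has to be designed and tuned by hand, and it would likely fail on cases such as a brightly lit street at night, which is the contrast with letting a network learn the features from labeled pictures.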

  20. SP

    Just feed in a bunch of pictures, like half that are during the day and half that are-

  21. KK

    Okay, good. Yeah. So yeah, I agree. Uh, I said ten thousand images, but how, how do you even determine how many pictures you need to get started with this project?

  22. SP

    Check for differences.

  23. KK

    Mm?

  24. SP

    Feed some data.

  25. KK

    Feed some data. So you start with like ten pictures, and then you continue to go. Yeah, you could do that. Might, you know, take some time. I think the question is how easy is it to collect that data? Like, how would, how would you collect that data?

  26. SP

    Probably for the same location. You get pictures for day and pictures for-

  27. KK

    Okay. Yeah. We could put our phone out there and then record the day and the night and have a stream of pictures and add it to the data set. Same location, but different, uh, lightings. Yeah.

  28. SP

    If you mentioned that this was a very hard problem because, um, in my opinion also, it depends on how, like, what kind of model you want to have, right? For example, if you were building something for one location to sense whether it's day or night, then that one location would be really nice. But then if you were trying to look at, like, anywhere in the world, like in any kind of climate, whether it's day or night, then you would have to have like an extremely diverse set of images. And because the problem is so broad, you would have also need like so many, like very-- a huge amount of data.

  29. KK

    That's a great point. So just to repeat for the, the, the people online, uh, you have to define the task first because the task can be easy. You can, you can be in a park in a very specific location and say, "Just detect if it's day or night." Or you can have a problem which is your camera can be anywhere, and that makes it more complicated. And also the amount of data you'll need is probably much more for that second problem. That's what you said, right? Uh, now, actually, that's a great thread. Tell me about cases where this problem would be really hard to solve. Yeah.

  30. SP

    Like pictures of places like inside buildings.

Episode duration: 1:39:47


Transcript of episode DNCn1BpCAUY
