
Stanford CS230 | Autumn 2025 | Lecture 2: Supervised, Self-Supervised, & Weakly Supervised Learning

For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/ai
September 30, 2025
This lecture covers key AI concepts through case studies.
To learn more about enrolling in this course, visit: https://online.stanford.edu/courses/cs230-deep-learning
To follow along with the course schedule and syllabus, visit: https://cs230.stanford.edu/syllabus/
More lectures will be published regularly. View the playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rNRRGdS0rBbXOUGA0wjdh1X
Andrew Ng, Founder of DeepLearning.AI; Adjunct Professor, Stanford University’s Computer Science Department
Kian Katanforoosh, CEO and Founder of Workera; Adjunct Lecturer, Stanford University’s Computer Science Department

Kian Katanforoosh, host

Oct 7, 2025 · 1h 39m

EVERY SPOKEN WORD

85 min read · 16,718 words

0:05 – 3:07
Course goals and why this lecture is industry-driven
1. Kian Katanforoosh
  I'm, uh, Kian Katanforoosh, and I am, uh, the co-creator and co-lecturer with Andrew, uh, for this class, CS230. Um, and I will teach about half of the in-person lectures this quarter. Um, outside of, uh, Stanford, I, I work in industry. I, I lead a company called Workera, which uses AI to measure skills. Um, and with the history of CS230 students that have started AI startups and companies, what I try to do usually is to bring a lot of examples from industry. So what you should, uh, expect from these in-class lectures is not as much of the academic side of things which we learn anyway in the online videos, uh, but also the, um, industry, um, uh, specific input. And some of the topics that we cover this, um, year together include, um, decision-making in AI projects, which we're gonna see today. You know, I want you to come out of today's lecture feeling like you had some fun, it was interactive, and also you have a better way to make decisions in AI projects because you've seen how deep learning researchers, engineers, and scientists make their decisions solving problems in industry. Um, other topics later in the quarter for the in-classroom time include things like, you know, adversarial attacks and defenses. We might have some time to cover it today. Uh, deep reinforcement learning, which is really hot in the market right now, and I think it's very important to know about it. Um, and then all the stuff that is very practical, like, uh, retrieval, uh, augmented generation, AI agents, multi-agent system. As we go deeper, um, into the class and you get the baggage of neural networks, we'll be able to cover even more fun topics. Okay. So today's, uh, lecture is gonna be, uh, structured in three parts, uh, maybe four, depending on whether we have time. We'll start with a little recap of the week, um, what you've learned online about neurons and layers and deep neural networks. 
Then we get into a set of supervised learning projects, um, including a day and night simple, you know, vanilla classification, the trigger word detection, which, which is actually a project you're gonna build at the end of the class yourself, um, and then face verification, which we'll see also variation of how face verification algorithms work. In the third section, we'll focus on self-supervised learning and weakly supervised learning. Don't worry if you don't know these terms. We're gonna learn them together. Um, and we talk a lot about embeddings, because embeddings are, um, the connective tissue of many AI systems online today, and it's important to know about them. And finally, if we have time, we'll also talk about adversarial attacks and defenses. With more and more AI systems in the wild, knowing
3:07 – 8:40
Deep learning recap: supervised learning loop, model components, and what can vary
1. Kian Katanforoosh
  how to defend them is very important, and knowing how to attack them can also teach you how to defend them. So, uh, we'll cover that as well. Sounds good? Uh, please interrupt me as we go through the lecture. Uh, we want this to be very, um, conversational as much as possible. So recap of the week, um, is the, the core way that, um, AI learns from data in a traditional supervised learning setup. You can think of it as an input, such as this little image of the confused cat, um, and an output, in this case, a number between zero and one that represents the chance that there might be a cat on the picture, one, or there's no cat on the picture, zero. Uh, what the model is, and oftentimes you'll see me refer to the model as two things. There's an architecture, which is essentially the blueprint of the model, the skeleton, and parameters. It might be a few parameters, it might be billions of parameters, like the models that OpenAI, DeepMind, and others work on. Um, outside of that, uh, a-a-and so when you think about AI models being deployed in the wild, like when you think about what's happening with ChatGPT, uh, what-- You know, you can, you can really come down to there's two files somewhere on the cloud, one that describes the architecture of the model, one that describes the parameters that are part of this architecture, and you keep calling that-- to those two files, and you get your inference or your outputs. That's really what's happening behind the scene. Much more complicated than that obviously, but those are the two critical components of a neural network architecture and its parameters that are trained. How does the model learn is through a gradient descent optimization. Meaning I send the picture of the cat through the model, and the model at the beginning is not trained, so it's probably wrong. It tells me, "I think there's no cat. I think it's zero." 
And then I use something called the loss function to compare the ground truth, there is a cat on the picture, with the prediction from the model at this point in time. Those two numbers are far from each other. That should be a penalty, which the loss function describes. And then in order to give feedback to the parameters, we use this gradient descent update. We do that many, many times. What it means is that we take our parameters and we tell them, "Hey, you should go a little bit more to the right or a little bit more to the left," um, until that number that is the prediction for the cat is closer to the ground truth. We do that with batches of data, millions of images of cats and images of anything else, and we give that feedback repetitively to the model until the parameters are calibrated and the model is in fact finding the cat on this picture.
Nothing new here. You've seen it in the videos. Any question on that learning setup? No. Okay. Easy so far. There's many things that can change in this setup, and you'll see in the class. First thing is the input. The input does not have to be an image. It can be text, like when you chat, you know, with ChatGPT. It can be audio, it can be video, it can be structured data, it can be spreadsheets and numbers. Those, we'll see a variety of examples in the class and how it influences the architecture. Um, the output, again, doesn't have to be zero and one. This is an example of a classification. Um, you could turn this problem into a regression. Uh, for example, if I was asking you, what's the age of the cat? Estimate the age of the cat. That would be a regression task, not a classification task anymore. Later in the class, we'll also see generative task. In fact, lecture four is gonna focus on diffusion models, generative adversarial networks, where the output actually is much bigger than the input, typically.
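The loss and update described earlier in this turn can be written out as a rough sketch. The lecture does not name a specific loss at this point, so binary cross-entropy is an assumption here, and the learning rate symbol $\alpha$ and the generic parameter symbol $\theta$ are notation introduced only for this illustration.

```latex
\mathcal{L}(\hat{y}, y) = -\bigl[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\bigr]
\qquad
\theta \leftarrow \theta - \alpha \, \frac{\partial \mathcal{L}}{\partial \theta}
```

With ground truth $y = 1$ (there is a cat) and an untrained prediction $\hat{y}$ close to 0, the loss is large, and each update moves $\theta$ a small step in the direction that lowers it.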
You know, so you can have a low resolution of a cat as input, and the output is a high resolution of the same cat. The, the output is bigger than the input, which can be counterintuitive to people. Um, other things that can change include the architecture. Um, you've learned about the vanilla multilayer perceptron or the fully connected neural network. That's what we're learning right now online together. Um, by the end of the class, you'll have many architectures that you'll be familiar with, from, uh, RNNs and convolutional neural networks, uh, transformer models. All of these, at the end of the day, use the basis neural network that you're learning right now. They're just stacked on top of each other differently. Uh, the loss function is actually a big focus of today's class and, uh, of, of the class in general. The loss function, which is what gives the feedback to the model, um, you were right or you were wrong, and what to do about it, is an art. Designing good loss functions, you know, great deep learning researchers are very creative when it comes to designing loss functions. And in fact, when we built, uh, the algorithm called YOLO, it's-- it is called YOLO, um, uh, for You Only Look Once, not you only live once. But YOLO, uh, has a very, you know, difficult to understand at first loss function, and there's a reason why the loss function was designed like that. So by the end of this class, you'll also have a better intuition on how do we design great loss functions. Uh, other
8:40 – 14:32
Neurons to networks: multi-class labeling, one-hot vs multi-hot, and capacity/overfitting intuition
1. Kian Katanforoosh
  things I'm not gonna cover right now, the activation functions in your neural network, the optimizer that you use for your gradient descent loop, and then the hyperparameters that might come in when you train your algorithms. Okay. Nothing new here. Um, this is the basic setup. You've also learned this week about neurons. Uh, the easiest way to think about a neuron is the classic, uh, logistic regression, um, algorithm, where I'm taking the image of the cat. So an image i-in computer science, uh, the way the machine reads it is three channels, RGB, for the three colors, red, green, blue. Um, and we take all these numbers, we put them in a vector. The vector is then fed into a neuron, and the neuron has two components: the linear part, W transpose X plus B, W being the weight and B being the bias, and then an activation function, in this case, the sigmoid function, which is very handy because it takes any number, and it puts it between zero and one so that the output can look like a probability. Classic setup. And here the probability is zero point seventy-three, which is above zero point five, which tells me the model thinks there's a cat on the picture, because one is a cat, zero is no cat. So question for you to get started, how would you modify this binary classification that detects cats in an algorithm that would be able to detect multiple animals, such as a cat, a dog, and a giraffe? What do you need to change about this neural network? Yeah.
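The single neuron described in this turn (linear part, then sigmoid) can be sketched in a few lines of plain Python. The four-pixel input and the weight and bias values are made up so that the linear part comes out to 1.0; they are not from the lecture.

```python
import math


def sigmoid(z: float) -> float:
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x: list[float], w: list[float], b: float) -> float:
    """Linear part w^T x + b followed by the sigmoid activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)


# Hypothetical flattened "image" with four pixel values and made-up parameters.
x = [0.2, 0.4, 0.6, 0.8]
w = [0.5, 0.5, 0.5, 0.5]
b = 0.0

p_cat = neuron(x, w, b)  # w^T x + b is 1.0 here, and sigmoid(1.0) is about 0.73
is_cat = p_cat > 0.5     # above 0.5 is read as "cat"
```

A real image would be flattened into a much longer vector (one entry per pixel per RGB channel), but the computation per neuron has the same shape.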
2. Speaker
  Instead of the sigmoid, you would need a different function that [inaudible].
3. Kian Katanforoosh
  Okay, so you would change the output layer to match to the number of animals you wanna detect. Yeah, correct. Anyone wants to add anything else? Yeah.
4. Speaker
  The data that goes in.
5. Kian Katanforoosh
  The data that goes in there, how would you change the data?
6. Speaker
  It has to be for the animal that you want to classify.
7. Kian Katanforoosh
  Okay. Very good. Yeah. You need, uh, you need data from dogs and giraffes and also maybe nature in general. What else do we need not to forget? Yeah.
8. Speaker
  Um, maybe you could add a neuron for each of the animals, and then your prediction would be whichever output is the highest.
9. Kian Katanforoosh
  Yeah. Okay. Add one neuron per animal. Those neurons will be independent from each other, and each neuron would focus on one animal. Yeah, good point. It's actually what we're gonna do. So yes, I think your, your suggestions were the right one. We could multiply this output layer to have three neurons instead of one. All of them, because it's a fully connected neural net, see the entire pixels flattened in the vector, and then each of them will be focused on an animal. The number one mistake that we see in projects is that people add more data, uh, but forget to adjust the labels. So how do the labels need to be adjusted here? It's not anymore zero and one, right? What type of lab-labels do we need to train this? Yeah.
10. Speaker
  You gotta have the number for each animal.
11. Kian Katanforoosh
  Yeah. Okay. Do you know how we call that? Or no? Okay. So [laughs] yeah. You-- Yeah.
12. Speaker
  Vector? Vectors?
13. Kian Katanforoosh
  Yeah, vectors. Yeah. I think you're saying the same thing. But the, uh... Yeah. So here you would, you would use a one-hot vector or a multi-hot. You know, one-hot means we'll have, um- You'll have a vector of size three, and if there is a cat on the picture, the label is gonna be zero, one, zero because the second neuron is gonna be responsible to detect cats. Um, in fact, that would be called the one-hot vector. Uh, oftentimes, you'll have multiple animals on the picture because cats and dogs can appear together. Cats and giraffes, less so, and dogs and giraffes, I've never seen one in the same picture. But anyway, um, you'll have a multi-hot vector. If you have a cat and a dog on the picture, you'll probably label it as one, one, zero. And the reason I'm mentioning that, it may sound silly, but in a lot of projects, people change their data, they forget to change their labels, and then they wonder why it doesn't work. Yeah. Okay, cool. Now, in the class, we use a specific notation with, uh, superscript and subscript. So when you'll see me refer to something like, um, A one one, the superscript in square brackets indicates the layer that you're in. So you're in the first layer. Uh, the subscript refers to the index of the neuron, okay? And A is for activation. So A subscript three square bracket one is A-- is the ne-- the, the output of the third neuron of the first layer. Okay? Again, if I continue, second layer would be written like that, and then you'll get your probability. The deeper the network is, the more capacity it has. This is the word we use, capacity. What it means is that if you send a million pictures of cats and a million pictures of non-cats to a shallow network, it might not have the capacity to learn what's in the dataset. It's just not flexible enough. The deeper the network, the more capacity it has.
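The one-hot and multi-hot labels from earlier in this turn can be sketched as a small helper. The class order (dog, cat, giraffe) is an assumption chosen only because it reproduces the two labels quoted in the lecture; the function name `multi_hot` is hypothetical.

```python
# Assumed neuron order, chosen to match the labels quoted in the lecture
# (cat -> 0, 1, 0 and cat + dog -> 1, 1, 0). The order itself is a guess.
CLASSES = ("dog", "cat", "giraffe")


def multi_hot(animals_in_picture: set[str]) -> list[int]:
    """Return one 0/1 entry per class; a single animal gives a one-hot vector."""
    unknown = animals_in_picture - set(CLASSES)
    if unknown:
        raise ValueError(f"no output neuron for: {sorted(unknown)}")
    return [1 if name in animals_in_picture else 0 for name in CLASSES]


cat_only = multi_hot({"cat"})            # [0, 1, 0]
cat_and_dog = multi_hot({"cat", "dog"})  # [1, 1, 0]
no_animal = multi_hot(set())             # [0, 0, 0]
```

Raising an error for an unknown animal is one way to catch the mistake mentioned in the lecture, where new data is added but the label format is not updated to match.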
So in fact, a network that is super deep, imagine a billion parameters transformer model with one million pictures of cat and non-cats, it will just overfit to those pictures, meaning it's not gonna learn what a cat is. It's just gonna learn by heart those million pictures because its capacity is way bigger than the dataset it's fed. So it's very important to understand the amount of data you're gonna feed and the complexity, diversity of that data will probably dictate the capacity of the models you need to use. Okay. Now, just to give you a little bit more intuition on what happens inside those neural
14:32 – 18:33
What networks learn inside: feature learning, encodings vs embeddings, and why distance matters
1. Kian Katanforoosh
  networks, we take this relatively shallow network, but, you know, call it three layers, and we train it on a dataset of facial images. Uh, ignore the task. In, in, in face datasets, there is a lot of tasks you could do. You could do a face verification, you could do a face recognition, you could do, um, you know, face, uh, uh, clustering, things like that. We, we'll talk about that later. But let's say it's been trained really well on understanding faces. If you now unpack this network and you sort of query each neuron and look what's going on inside, what you'll notice is that the first layers are gonna be better at encoding low-complexity features, while the deeper networks are gonna be better at encoding higher complexity features. So here's how it goes. Nothing too complicated for now. Um, the neuron in the first layers, they're looking at pixels because you're giving them directly the pixels. So they're gonna be good at stitching those pixels together. And maybe the first neuron will be good at detecting a diagonal edge. The second neuron will be good at vertical edges, and the third one at horizontal edges because they're just looking at pixels and trying to make sense out of them. Now you go one layer deeper in the middle of the network. Those are not seeing pixels. They're seeing the output of the first layer, which is already slightly more complex. So what you can expect the layers in the middle of the network, uh, to give you or to activate for is higher level features like an eye or a nose or an ear, because it turns out if you have a few edges, you can start detecting circles. And so you would see a neuron that is really good at detecting circles. Eyes. The deeper you go in the network, the more you'll get closer to the task itself, which in this case, facial analysis, let's say. You would see the last few neurons detect larger features of the face because, again, they're seeing higher complexity information. Does that make sense? 
This concept, you know, we call it encoding. We'll also talk about embeddings. It's very important because when you train neural networks, you wanna make sure first that they're understanding what they're doing, and that's one way to see it. We'll have an entire lecture on interpreting and visualizing neural networks later this quarter. Um, and on top of that, um, you probably can make use of some of those encodings and embeddings. We'll, we'll see that later, but why in the vector space, the distances between those concepts are important. You can already imagine that for co-- for tasks like search, searching in a database, having a neural network able to encode information in a very meaningful way can allow you to find concepts that are close to each other and associate them with each other and concepts that are far from each other and dissociate them from each other. Okay, so this was the warm-up for today. How much time we spent? Okay, fifteen minutes on the warm-up. That's good. So we learned a few new words: model, architecture, parameters. I didn't talk about it, but feature engineering versus feature learning, that's the core, uh, uh, concept in, in deep learning is feature engineering, is what we used to do before deep learning, which is, uh, you might actually build an algorithm that is good at detecting eyes. It's just, uh, good at scanning eyes. You manually build it, you know. And then you build another one that is good at detecting a mouth, a mouth, and then you put them together to fill-- detect faces. We don't do that anymore. We do end-to-end learning, meaning we let the data speaks for itself and train the model. This is called feature learning. It's automatic, and that's how the neural network actually learns those features without you needing to tell it eyes are important to detect faces. 
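The search use case mentioned in this turn can be sketched with a tiny nearest-neighbor lookup. The lecture does not specify a distance measure here, so cosine similarity is an assumption (it is one common choice), and the three-dimensional vectors are made-up stand-ins for what a trained network would produce.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query: list[float], database: dict[str, list[float]]) -> str:
    """Return the key whose stored vector is most similar to the query."""
    return max(database, key=lambda name: cosine_similarity(query, database[name]))


# Made-up 3-dimensional vectors standing in for a network's learned encodings.
database = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.3, 0.1],
    "giraffe": [0.1, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # hypothetical encoding of a new cat-like image
best_match = nearest(query, database)
```

The lookup is only useful if nearby vectors really correspond to related concepts, which is the property the lecture attributes to a well-trained network.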
You don't need to do that. Um, encoding and embedding the, the, the real difference is encoding is any, any vector representation and, um, an embedding though-- Sorry, it should have been, uh, e- an embedding is when
18:33 – 34:18
Case study 1 — Day vs night classification: defining scope, collecting data, and choosing resolution
1. Kian Katanforoosh
  an encoding has meaning. Meaning the distance between two encodings has, has meaning. You know, they might be close or far from each other, and it tells you something. There is a logic to it. And then we talked about one-hot and multi-hot vectors. Okay. So end of the recap, now let's go into supervised learning projects, and we're gonna make decisions together and walk through it. The first case study is day and night classification. So here's my problem for you. Given an image that I give you, classify it as whether it's, uh, the day or whether it's the night. Okay, open-ended problem. Ignore that foundation models exist. You don't have access to ChatGPT or Claude or whatever. The point is to get under the hood and have those discussions because obviously it's a toy example. So, uh, what do you do? Like, what, what, what data do you wanna collect to solve this problem? Start. Yes.
2. Speaker
  Check how different pixels in the same row are.
3. Kian Katanforoosh
  So tell me more. Check how pixels in the same row are.
4. Speaker
  Uh, different rows just, uh, when, uh, see how they're different and if they are very different, there's a possibility that there is the day.
5. Kian Katanforoosh
  What does it tell you that if you look at the row of pixel and the row next to it, if they're very different, what does it tell you?
6. Speaker
  It has more colors than just-
7. Kian Katanforoosh
  Okay, so you say if the delta between pixels that are close geographically to each other, um, is high, then there's probably colors, color changes, so it's day, most likely. Is that it? Okay. So th-that's an example of a, of feature engineering. It's like you're, you're going for it in gray. It's like you're going for it, and you're trying to understand what's a pattern that tells me that a picture is day or night. What else can you do in the world of neural networks? Any other ideas? Yeah.
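The hand-engineered feature discussed in this exchange can be sketched as follows. This is only an illustration of the student's idea under simplifying assumptions: the image is grayscale, given as rows of intensities in [0, 1], and the 0.1 threshold is arbitrary rather than tuned on any data.

```python
def row_delta_score(image: list[list[float]]) -> float:
    """Mean absolute difference between vertically adjacent pixels.

    `image` is a grayscale picture given as rows of intensities in [0, 1].
    """
    total, count = 0.0, 0
    for upper, lower in zip(image, image[1:]):
        for a, b in zip(upper, lower):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0


def looks_like_day(image: list[list[float]], threshold: float = 0.1) -> bool:
    """Hand-written rule: lots of variation between rows suggests daylight."""
    return row_delta_score(image) > threshold


# Tiny made-up pictures: a flat dark frame and a frame with strong contrast.
dark_frame = [[0.05, 0.05, 0.05], [0.05, 0.05, 0.05], [0.05, 0.05, 0.05]]
varied_frame = [[0.9, 0.8, 0.9], [0.2, 0.3, 0.1], [0.7, 0.9, 0.6]]
```

A rule like this is written by a person rather than learned, which is what makes it feature engineering; a brightly lit night scene or an overcast day could easily fool it.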
8. Speaker
  Just feed in a bunch of pictures, like half that are during the day and half that are-
9. Kian Katanforoosh
  Okay, good. Yeah. So yeah, I agree. Uh, I said ten thousand images, but how, how do you even determine how many pictures you need to get started with this project?
10. Speaker
  Check for differences.
11. Kian Katanforoosh
  Mm?
12. Speaker
  Feed some data.
13. Kian Katanforoosh
  Feed some data. So you start with like ten pictures, and then you continue to go. Yeah, you could do that. Might, you know, take some time. I think the question is how easy is it to collect that data? Like, how would, how would you collect that data?
14. Speaker
  Probably for the same location. You get pictures for day and pictures for-
15. Kian Katanforoosh
  Okay. Yeah. We could put our phone out there and then record the day and the night and have a stream of pictures and add it to the data set. Same location, but different, uh, lightings. Yeah.
16. Speaker
  If you mentioned that this was a very hard problem because, um, in my opinion also, it depends on how, like, what kind of model you want to have, right? For example, if you were building something for one location to sense whether it's day or night, then that one location would be really nice. But then if you were trying to look at, like, anywhere in the world, like in any kind of climate, whether it's day or night, then you would have to have like an extremely diverse set of images. And because the problem is so broad, you would have also need like so many, like very-- a huge amount of data.
17. Kian Katanforoosh
  That's a great point. So just to repeat for the, the, the people online, uh, you have to define the task first because the task can be easy. You can, you can be in a park in a very specific location and say, "Just detect if it's day or night." Or you can have a problem which is your camera can be anywhere, and that makes it more complicated. And also the amount of data you'll need is probably much more for that second problem. That's what you said, right? Uh, now, actually, that's a great thread. Tell me about cases where this problem would be really hard to solve. Yeah.
18. Speaker
  Like pictures of places like inside buildings.
19. Kian Katanforoosh
  Yeah, indoor. Indoor pictures. Actually, can you-- How could you tell if you took a picture of me with the screen here, you know, what time it was? You couldn't? You actually can. Yeah. There is a clock here. Um, so that, that's very interesting because, um, because if that was part of the task, then our problem would be hard, and ten thousand images is not gonna cut it to understand time, you know? And so it's very important to define the task very well. What else can be hard other than indoor pictures?
20. Speaker
  What-- The clock could be AM or PM.
21. Kian Katanforoosh
  Also.
22. Speaker
  Um-
23. Kian Katanforoosh
  Also, the clock can be AM or PM, but you can probably take additional information, which is how people are dressed, and think that it might be warmer outside than colder. You, you could-- Again, it can be very complicated at the end of the day, but a human would say, someone is teaching, students are in class, it's probably not, um, twelve AM, you know, or... So I mean, like, it gets complicated. Yeah. What, what else can be hard?
24. Speaker
  Sunny versus cloudy.
25. Kian Katanforoosh
  Okay. Sunny, cloudy. Yeah.
26. Speaker
  If you're in a part of the world where the sun doesn't set during the day, then-
27. Kian Katanforoosh
  Great point. If you're in the north of Norway right now or Sweden, uh, even the clock can't tell you probably if it's day or night. Yeah.
28. Speaker
  Dawn and dusk.
29. Kian Katanforoosh
  Dawn and dusk. Yeah, exactly. Those are great examples. Actually, this is a good semantic one because you'd need also to define exactly what's the definition of day and night. So long story short, um, the problem can seem easy at first. It can be very complicated. And trust me, if you wanted to do this really well, even the foundation models today couldn't do it in certain cases. You know? Um, okay, let's say we have ten thousand images, and you talked about the split of images earlier, and I agree with you. You want a mix of different situations in order to be able to cover all of them. Um, and going back to our discussion on model capacity, if it's you-- if it's just a simple problem, you probably need just a small capacity model. If you wanna add all these edge cases, you probably are looking for bigger capacity models and more data. Uh, what's the input to our model? I think someone said it already. So let's say a picture of, um, a picture of a, you know, day or night or whatever. What's the resolution we're gonna work with? How do you determine resolution when you build a data set, and why does it matter? Yes.
30. Speaker
  The number of pixels, the width-
34:18 – 42:02
Case study 2 — Trigger word detection: cascaded assistants and why labeling strategy matters
1. Kian Katanforoosh
  starting with the next example, and those are usually helpful to make quick decisions in your project when you're in the industry. So second project, trigger word detection. Let me give some context on this. Uh, the general problem, um-- Okay, you're familiar with, like, Alexa and, um, all these, let's say, Siri and things like that, that you might have in your kitchen listening to you. Everybody knows? So, um, the way these network typically work, th-these models, is it's not a single model. It's a cascade of models for energy and efficiency purposes. So for example, if you have a virtual assistant in your kitchen, the first model is activity detection. It just detects if there is any volume, you know. Because you don't wanna listen with the heavy model at all time. It just uses a lot of energy, right? You want a very lightweight model that understands when volume is playing. And so let's say this, uh, network detects volume. It calls another network that is focused on the activation word, the trigger word, "Alexa," S-- "Hey, Siri," "Okay, Google." That's usually the second layer. Um, and that one is only listening for a specific keyword. If the keyword comes in, it would typically call a better model that's slightly slower, that might be heavier and more energy consumption, and that might understand, uh, your-- what you're trying to do, you know. And then I'm not gonna go into the details, but you have architectures that get very complicated. Back in the day, some of these companies were doing one model to set up a timer, one model to buy something online. One model was very complicated. Today, it's slightly simpler and more end-to-end, but I just want you to know the cascade of models that are being called because this case study is about the second model. It's about the trigger word. So here's my problem for you. Given a ten-second audio speech, uh, detect the word "activate." How would you build that off the shelf like that, starting from no-- zero? 
What data would you collect? Yep.
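The cascade of models described in this turn can be sketched as a chain of checks where each stage runs only if the cheaper one before it fired. Everything named here is hypothetical: `mean_energy` is a crude stand-in for activity detection, the two later stages are passed in as plain callables, and the 0.01 energy threshold is arbitrary. This is not how any particular commercial assistant is implemented.

```python
from typing import Callable, Optional, Sequence

Clip = Sequence[float]  # audio samples; this representation is an assumption


def mean_energy(clip: Clip) -> float:
    """Average squared amplitude, a crude stand-in for 'is there any volume'."""
    return sum(s * s for s in clip) / len(clip) if clip else 0.0


def run_cascade(
    clip: Clip,
    detects_trigger: Callable[[Clip], bool],
    heavy_model: Callable[[Clip], str],
    energy_threshold: float = 0.01,
) -> Optional[str]:
    """Call each stage only if the cheaper stage before it fired."""
    if mean_energy(clip) < energy_threshold:  # stage 1: activity detection
        return None
    if not detects_trigger(clip):             # stage 2: trigger word
        return None
    return heavy_model(clip)                  # stage 3: larger, slower model
```

The ordering is the design point: the expensive model is reached only for the small fraction of audio that passes both lightweight checks, which is the energy argument made in the lecture. The case study that follows concerns the second stage only.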
2. Speaker
  So are we looking at frequencies that, uh, like, our mic receive? And from those frequency, we want to, uh, figure out, like, what was the word, and, uh, we feel-- we, we also want to look into the lengths.
3. Kian Katanforoosh
  Okay.
4. Speaker
  Say the word and then we will run some kind of algorithm will translate our frequency to-
5. Kian Katanforoosh
  Like a Fourier transform or-- Okay, yeah, yeah. What you described is a Fourier transform or like plus the pre-processing. You're right. So you're saying audio is a bunch of frequencies with, you know, values, and we wanna first pre-process that then to give it to an algorithm, and the length of the sequence matters as well. Because if you wanna detect the word activate, you know that the length needs to be, I mean, there's a minimum length. You can't say activate in less than ten milliseconds, right? So the, the length matter as well. Okay. That-that's good insight. What else? What data? How, how would you collect that data? Yes. Microphone. Microphone, like with your phone? Okay. Like, you would go around campus and record people. How would you-- What would you ask them to say? Uh, the word activate. Okay. You would ask-- You would record a bunch of people saying the word activate. You would ask them anything else? Other words. Other words? Yeah. Say a sentence that- Say a sentence. Yeah. It turns out you have website that are random generators, and you just say, "Say this and say this," and you record everything. Yeah. Say the word deactivate. Ah, good one. So you're saying you wanna find negative words that are close to the positive word just to make sure that the model learns that. That's a great one, yeah. Actually, it turns out activate is a really bad word to choose. [chuckles] The reason Alexa-- And actually, there was a lot of discussions at Amazon back in the days around what would the word be, and it turns out it's very important what you choose because you wanna choose a word that is not used in common language, otherwise your assistant is always, uh, turning on, right? And Alexa is not ideal either. It's not bad, but it's not ideal either. Um, okay. So let me just narrow down the problem. Let's say we've gone around campus, and we've collected a bunch of ten-second audio clips. Okay? Um, do we need to think about the distribution of the data? 
Like, why does it matter? Like, wh-why campus only might be limited, let's say. Yeah. Maybe you'll get a lot of accents or- Okay, accents. Turns out the first version of this model, um, that, you know, I, I built with Andrew, um, my German friends could not make it work. None of my German friends would make it work, and we had to collect more data from-- And I had two German roommates at the time, so, uh, we had to collect more data from them because there's just a certain way that, you know, people would say words. Um, okay, good insight. What else other than, um, accents? Yes. The age of the campus is probably younger. Good point. Average age on campus is probably younger than if you actually cross campus and go to Palo Alto. Um, and in fact, the frequencies are gonna be different that younger people use. That's correct. Yeah. Just the cadence, how fast they're saying words. How fast. Some people speak fast, some people-- And it has to do with the language of origin. Some people just speak faster. Um, correct. And l-l-look at this, like, when you hear someone who speaks fast versus someone who speaks slow, it doesn't make a big difference to you as a human. But if you actually just had access to the numbers and the frequencies, it would look completely different. So the model actually struggles a lot with that problem. Yeah. I-I don't think this is very important for the ratio of male to female voices. Yeah, for sure. Ratio of male to female. Anything that would modify your frequencies. And on average, yes, there's different frequencies or distribution male to female. Yeah. Data with some noise or people like talking in the background. Background noise, very important. Turns out on Stanford campus, you don't hear the metro. So it's very likely that your algorithm will not work for people in New York that are taking the subway all the time because of the background noise behind it. Yeah. 
  Okay, I think we get a sense of like the-- again, the complexity of the task ahead of us. Uh, let's say the input is a ten-second audio clip I'm gonna call X, and this audio clip has a few things, uh, that are special to it. So one of the things is, uh, negative words, which are in purple. Positive words, which are, um, in green, and then the background is in orange. Okay, so this is, for example, someone saying, "Hi, activate yourself," whatever. You know? Y-you see what I mean? Activate is the positive word. Uh, what's the resolution we'd want? Okay, I'm not gonna ask you this question because we don't have a speech expert. A speech expert would know it. Um, what you can do, though, uh, without being a speech expert, is to go on GitHub and find another speech project, and you actually search for the hyperparameters they're using. And if you're using human audio, you'll find that the same numbers will work for your project. Okay? So you do that little search, and you'll find that there is just a certain sample rate that works well with human voice. Uh, what's the output? Zero or one. Zero or one. Okay, let's try something. So
42:02 – 48:54
Human experiment: weak labels vs time-localized labels (and the cold-start problem)
1. KKKian Katanforoosh
  let's say the output is zero or one. Zero meaning there is no positive word. The word activate is not there. One meaning there is a positive word in that ten-second audio clip. So we're gonna do a little human experiment. I've selected three, um... Let me turn the volume on. I've selected three, uh, uh, audio samples of around ten seconds. Okay? Um, I'm not gonna tell you what the language is, okay, because the model doesn't know language when we start training it. So you're acting like the model. That's the experiment. I'm just telling you that the first and the third sample have the word that we're looking for. And I'm not telling you what the word is, again, because the model doesn't know what the word is at the beginning of training. Okay? So now up to you to guess what the word is. And I hope one of you will save us and find the word. Okay, let's try.
2. SPSpeaker
  [foreign language]
3. KKKian Katanforoosh
  Too loud? Second sample. Wait, let me see if I can put the microphone here.
4. SPSpeaker
  [foreign language]
5. KKKian Katanforoosh
  Anybody has it or no? No, no. [laughs] Impossible. Third one.
6. SPSpeaker
  [foreign language]
7. KKKian Katanforoosh
  Okay, who has the word? [foreign language] Okay, maybe. That's not it, but maybe. [laughs] Yeah, in the back. It sounds like [foreign language] You're Italian? No, Italian- You speak-- Okay. [laughs] Uh, yeah. It's funny. Nobody finds it usually in the first try, but, uh, you, you do-- you, you did find it. Yeah, that's correct. Um, okay, let's try again. Um, and, uh, do it with a different labeling scheme this time. Okay? Let's try again. I'm gonna play it again, but the labeling scheme has changed. Okay.
8. SPSpeaker
  [foreign language]
9. KKKian Katanforoosh
  That's the first one.
10. SPSpeaker
  [foreign language]
11. KKKian Katanforoosh
  Third one.
12. SPSpeaker
  [foreign language]
13. KKKian Katanforoosh
  Okay. What's the word? Someone who's not Italian speaker. Hmm? I'm not sure people heard, so I wanna try, uh, someone else. Yeah, you, you've heard it? Something like promo digio. Okay. Not far. Promo digio. It's pomeriggio, but you're close, you know. Um, was it easier the second time or the first time? Second. Way more-- way easier. So if it's easier for you, it's easier for the model, basically. Um, and so what is the question I'm posing here is, we could go with the first labeling scheme, which is easier for us to label, frankly, right? You don't need to indicate the location of the word. Um, but how much more data do you think we need in order for the model to figure it out? It's probably a thousand x data, right? And so the question is: Is the second labeling scheme a thousand times harder for us to label than the first one? And the answer is no. Um, so the answer is very clear. You would rather have the second labeling scheme, and your model is gonna learn way faster than the first case. All right? So that's the type of human experiment you can do. Now, we're gonna, we're gonna use that labeling scheme. Yeah, question. So for the labeling, uh, this is manual, right? Yeah, we'll talk about it. Yeah. Question: Is it manual labeling or not? Yes, right now it's manual, but I explain some tricks we can use. Uh, yeah.
14. SPSpeaker
  Um, is there a trade-off where like you do-- like you spend a lot more time to label something, but then it does do better on the model? Like how do you pick or choose?
15. KKKian Katanforoosh
  Yeah, you, you, you-- How do you pick the trade-off between the different labeling strategies and not? In fact, you, you, you know, today you could have, you could have a pre-training and a post-training that have different labeling schemes. Um, the, the question is, for pre-training, you want the model to get really good, and you don't wanna-- you wanna avoid a cold start problem. The problem of the cold start is, with the first labeling scheme, maybe even with a thousand data points that you dete-- that you collect and label manually, it's not even gonna understand anything. So you need-- The second labeling scheme can be a great way to, um, work around the cold starts. You might need less data, but it will start understanding what you mean, and then the rest of the data might be labeled differently, essentially. Um, okay, let me talk a little bit about the labeling scheme. We're, we're actually gonna use a slightly different labeling scheme, and the reason is, the first, the, the one with one one and only zeros, um, the risk is you can actually have a model that performs ninety-nine point nine percent accurate that is all zeros, just predict zero all the time. It's very accurate, [laughs] but the thing is, there's just one one to find, and it's very hard to find it. And so the network is gonna lean toward zeros, meaning the data is-- the labels are so skewed towards zero that it's gonna be hard to find any signal in it. And so the trick that deep learning researchers use generally is, can we actually do a little more balanced, um, between the positive and the negative labels? Just a pure engineering hack. Not much science behind it. Uh, the last activation, we're gonna use a sigmoid, but because it's a sequential problem, we're gonna use sigmoid in sequence. At every time step, there's gonna be a sigmoid activation. And then the architecture, we're gonna learn it later in the class. You don't need to worry. This would be likely an RNN. 
You'll learn later in the class what that means. And then the loss function, can someone guess what loss function we would be using? Yes. Cross entropy. Yeah. Binary cross entropy, but we're gonna use it sequentially, meaning at every step, we're gonna compute that with the sigmoid output. Yes. Sorry. On the output, so you just label the activate word as multiple ones? Correct. You take your, your input, ten-second input. You look at where the positive word activate is, and you put ones in there when it plays, and you force the model to predict, "Hey, activate has been played." There is a lot of nuances you'll see because you actually are gonna build this project with like, do we want the ones to start exactly when the words start? Do we wanna have a delay? There's a lot of technical questions around that. Yeah. You had a question?
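Editor's sketch (not from the lecture slides): the per-time-step labeling and the sequential binary cross-entropy described in this turn can be illustrated in a few lines of plain Python. The function names, the 1,000-step clip, and the choice of 50 ones are assumptions made for illustration; as the speaker notes, where the ones start and how many there are is a design decision.

```python
import math

def make_step_labels(num_steps, start_step, num_ones):
    """One 0/1 label per time step, with a run of ones starting at start_step."""
    labels = [0] * num_steps
    for t in range(start_step, min(start_step + num_ones, num_steps)):
        labels[t] = 1
    return labels

def sequence_bce(probs, labels, eps=1e-7):
    """Binary cross-entropy averaged over every time step of the sequence."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def accuracy(probs, labels, threshold=0.5):
    """Fraction of time steps where the thresholded prediction matches the label."""
    correct = sum(1 for p, y in zip(probs, labels) if int(p >= threshold) == y)
    return correct / len(labels)

NUM_STEPS = 1000  # assumed number of output time steps for one clip
single_one = make_step_labels(NUM_STEPS, start_step=400, num_ones=1)
widened = make_step_labels(NUM_STEPS, start_step=400, num_ones=50)
always_zero = [0.0] * NUM_STEPS  # a "model" that predicts zero at every step

print(accuracy(always_zero, single_one))  # 0.999
print(accuracy(always_zero, widened))     # 0.95
print(sequence_bce(always_zero, widened) > sequence_bce(always_zero, single_one))  # True
```

With a single one per clip, the all-zeros predictor is already 99.9 percent accurate, which is the skew the speaker describes. Widening the run of ones makes that shortcut cost more, both in accuracy and in loss.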
16. SPSpeaker
  I'm wondering, like, what do you mean by steps? I was like, uh, how do you predict steps?
48:54 – 55:21
Synthetic data pipeline for keyword spotting: auto-labeling at scale + expert architecture guidance
1. KKKian Katanforoosh
  What I mean by steps is audio is, uh, sequence data. Um, and so it's time step by time step. And so what I mean is that at every time step, whatever your sample rate is, you're gonna have a prediction at every step, potentially. Okay. Um, so what, what is critical to the success of this project is really the labeling strategy. Here is how we did it. Um, uh, and you know, it's a te-- Y-you just have to know it or not know it. It actually takes a long time to figure something like that out. And so thankfully, at-- w-when I was a grad student, one of my, uh, senior, senior PhDs helped me with these methods, and he was able to guide me and saved me probably months, because I probably would not have figured this out myself. So here's what we did. We took three databases. We created three databases, one that would have positive words. So as you were saying earlier, we record people for the word activate. One that would have negative words, other words including deactivate, but also kitchen and lion and dog and whatever. And then we have background noise. Um, turns out background noise in audio data is almost free. There's just a ton of background noise online. You can go on many platforms online, video platforms, you can just take the audio, it will be background noise. So background noise is free. You don't need to go and collect it, most likely. Uh, the other two, though, are harder. Yeah, question.
2. SPSpeaker
  Did you record those first two yourself?
3. KKKian Katanforoosh
  Yeah.
4. SPSpeaker
  How many?
5. KKKian Katanforoosh
  So I'll tell you we, we-- I'll tell you how we do it in a second. But yes, we recorded everything manually. Um, what we did is we took-- We went online, we scraped, uh, freely licensed, um, data. Actually, it's a skill to know the licensing models. You're gonna learn that in the project mentorship with the TAs. What licensing allows you to do what with data. It's good to know forever. You learn it once and then you know. What is CC BY, what is CC BY-NC, what is the MIT license, what is the Apache license, et cetera. Um, and so we take ten-second audio clips. We clip the background noise, and we went around campus, and we literally recorded people saying activate and other words, and we cut it when they said it. So the word was contained exactly to the amount that it was needed. Then we created a Python script that randomly inserts words, randomly, non-overlapping words. So for example, this would be created synthetically. I would have ten seconds of background noise. I would insert two negative words, and I would insert one positive word. Okay. The trick is that because the Python script did that, the Python script knows where activate was put, so it can label automatically. And so it turns out that we went around campus, we used an app even to get help. So we hired a couple of people to come with us. Each brought their phone, and we went all around campus recording people to say activate and other words, and we created those datasets. Within three hours, we had millions of data points. Um, because think about it, let's say you have a thousand activates across campus, ten thousand other words, infinite background noise. Imagine how much data you can create with that. When you actually write the Python script, you can also add some data augmentation. You can reduce some frequencies, you can augment some frequencies, you can accelerate it, you can decelerate it. So you can actually create a pretty meaningful data set for this problem in three hours. 
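Editor's sketch (a minimal illustration, not the script used in the project): the synthetic pipeline described above overlays short word clips on background noise at random, non-overlapping positions and, because the script chose those positions, writes the labels itself. Clips are modeled here as plain lists of sample values. The helper names `insert_clip` and `synthesize`, the clip lengths, and the choice to put ones exactly where the positive word plays are all assumptions.

```python
import random

def insert_clip(audio, clip, taken, rng, max_tries=100):
    """Overlay clip at a random position that overlaps no earlier clip.

    Returns the (start, end) span used, or None if no free slot was found.
    """
    n = len(clip)
    for _ in range(max_tries):
        start = rng.randrange(0, len(audio) - n + 1)
        end = start + n
        if all(end <= s or start >= e for s, e in taken):
            for i, value in enumerate(clip):
                audio[start + i] += value
            taken.append((start, end))
            return start, end
    return None

def synthesize(background, positives, negatives, rng):
    """Build one training example and its per-sample labels automatically."""
    audio = list(background)   # copy so the background clip can be reused
    labels = [0] * len(audio)
    taken = []
    for clip in positives:
        span = insert_clip(audio, clip, taken, rng)
        if span is not None:
            for t in range(span[0], span[1]):
                labels[t] = 1  # the script knows where the keyword went
    for clip in negatives:
        insert_clip(audio, clip, taken, rng)  # negatives leave labels at 0
    return audio, labels

rng = random.Random(0)
background = [0.0] * 1000           # stand-in for ten seconds of noise
activate = [1.0] * 100              # stand-in for one recorded "activate"
others = [[0.5] * 80, [0.5] * 80]   # stand-ins for two negative words

audio, labels = synthesize(background, [activate], others, rng)
print(len(audio), sum(labels))  # 1000 100
```

Calling `synthesize` repeatedly with different clips and seeds is what turns a few thousand recordings into a very large training set. The test set should still come from real, manually labeled audio, as the speaker goes on to say.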
  Now, I'm talking about training sets 'cause you don't wanna use that data for test sets. For test sets, you want data to be as real as possible. You're familiar with the concept of training test set, right, everyone? So for the test set, we had to manually label data, but it was a much smaller set than the training set, right? So it's much more convenient. Okay. The second, um, important part was architecture search. W-- I'm not gonna talk about it too much here, but there are architectures that just work better for these types of problems. And, um, this is an example of an architecture on the right that works way better. Um, and my learning from that project was just go to the expert and ask them what they've tried and try to learn from their mistakes. And in fact, I remember Awni Hannun, who was on the first, first level at the Gates Computer Science, uh, building, and he knew that this architecture was gonna fail, and he knew why, because he's done so many speech projects, and he just knows what works and what doesn't work. So that's what your TAs are here for, actually, in your project. So you should give them a call and say, "Hey, is what I'm doing good?" Or, "Give me a pointer for what I might spend my next week doing." Uh, okay. So learnings from this section, this case study. Uh, data collection strategy is extremely important, including the data labeling strategy. Using human experiments matters as well, and then, uh, referring to expert advice. That's the type of thing you wanna do in a project. Okay? Do you understand conceptually how such a project is built now at a high level? Okay, good. Yeah, question.
6. SPSpeaker
  At the application layer, how often do you have to create your own architecture or how often it's just, like, off-the-shelf? You know
7. KKKian Katanforoosh
  So how often do you need to do an architecture search, um, nowadays? Um, less often than before. But this class is all about understanding what's going on under the hood, so we're gonna walk you through that. In practice, um, it depends on your problem. Like, I'll give you an example: in the industry, you might have a-- You might be building a company that requires the model to be running in the browser, and so you have additional constraints that push you to create your own architecture, collect your own data, fine-tune the model the way you want. For many startups out there and companies out there, you're gonna start from a foundation model. You're gonna start from a foundation model, and then you might actually quantize it or prune it or modify it to meet your needs in terms of latency, in terms of memory, in terms of hardware capacity. And knowing what's going on like we did is important in those cases. Yeah. Okay. Uh, super. So we're one hour in. We are about halfway through the class, a little bit ahead, and I have a few more case studies to cover with you. Okay.
55:21 – 1:04:55
Case study 3 — Face verification: why pixel comparison fails and how embeddings solve it
1. KKKian Katanforoosh
  By the end, we, we'll have a full f- set of proxy projects to work with, okay? So this one is cool. It's face verification. Um, a school wants to use face verification to validate student IDs in facilities like dining halls, gym, and pool. So let me explain what, what, what it is. It's like you arrive at the Arrillaga Gym and instead of, uh, uh, just, you know, normally you would swipe your student ID, right? And your picture will show up on the screen, and there's someone sitting there who's gonna compare the picture they're seeing on the screen to the picture they're seeing with their eyes. And if it's the same, they're gonna say, "You can go ahead," right? It's slightly different. Here, um, you're gonna, you're gonna be, uh, verified by a camera. So there's actually a camera ahead, and you're walking in, and you're swiping your card, and we're gonna compare the picture in the database to the picture that is being taken by the camera to make sure that it's the same person. Okay? How do you get started?
2. SPSpeaker
  You just put in the mobile students because we have some of them.
3. KKKian Katanforoosh
  Okay. So everybody who got admitted uploaded pictures, at least one, so we have that already in the database, correct? Yeah. What else?
4. SPSpeaker
  When we use camera, that's probably going to be used for detecting things that-
5. KKKian Katanforoosh
  You're saying the camera matters?
6. SPSpeaker
  Yeah.
7. KKKian Katanforoosh
  Yeah, for sure. The camera matters. In fact, we talked about resolution. You think the resolution is lower or higher than the day and night project?
8. SPSpeaker
  Higher.
9. KKKian Katanforoosh
  Probably higher. In fact, it's gonna be higher. And again, I would go back to, uh, literally doing a human experiment and showing pictures of twins and asking people if they can differentiate the twins. Um, and you'll see that the resolution matters, actually. Okay. Yeah. You wanted to add something?
10. SPSpeaker
  Oh, yeah. Um, like you also need like noise, like other people that are not your student.
11. KKKian Katanforoosh
  Okay. Also data from outside the university.
12. SPSpeaker
  Um, just features.
13. KKKian Katanforoosh
  Which features?
14. SPSpeaker
  Uh, the person's human features that people look for.
15. KKKian Katanforoosh
  Understand the person's human feature, but they're already in the picture, right? So the picture would have the feature or you would add anything on top of that? So you-- Typically, you would not actually wanna get to the feature level. You would just wanna say, "We need to make sure it's in the data." If it's in the data, the neural network will learn it.
16. SPSpeaker
  Would likely be like the same person multiple times, you know, after we-
17. KKKian Katanforoosh
  Absolutely. Same person multiple times because the angle matters, the time matters, the, you know, et cetera. Yeah.
18. SPSpeaker
  Kind of same.
19. KKKian Katanforoosh
  Okay, same thing. Okay. Yeah.
20. SPSpeaker
  Try to crop the image so that the, the person is centered.
21. KKKian Katanforoosh
  Okay. So you're saying we might do some pre-processing to crop all the images so that it's centered, or at least that the image that we train the model with looks like the image that the camera is gonna take because the model will run on the camera. Okay. All these are good. So let's say our data set is a picture of every student labeled with their name. So this is one of my friends, Bertrand, and he has his picture, which is his picture from the student ID. Um, and then the input is, okay, he shows up in front of the build-- uh, the building. He's a little bit confused, [chuckles] but he showed up, and a picture was taken. Uh, the resolution, we talked about it. What we use here is four twelve by four twelve by three. It's much higher than before, okay, as we were expecting because we need small details. Even eye color is identifiable, right? So these things we cannot find without a higher resolution. In fact, if you actually go through, um, airport security and you use some of these fast tracks which take a picture of you, trust me, the resolution is gonna be even higher than that, much higher than that because they're actually getting into the iris at that level, right? So what's the output? The output is zero or one. Yeah. It's Bertrand, or it's not Bertrand. Okay? We're, we're good so far? Um, the architecture, um, let, let me actually, um, let me actually ask you, uh, how would you, uh, do this comparison without neural networks, let's say a very basic way, if you had to start with the first method? Yeah.
22. SPSpeaker
  You have to have like an average list of characteristics that
23. KKKian Katanforoosh
  So you would feature engineer. You would say, like, for example, you would define ten features that are good for identifying people, and you would have a filter for each of them, and you would run it on the picture and say, "Yes, do we have this feature or not?" Essentially. Okay. Yeah. It's a good one. Even more basic than that would be, uh, pixel comparison. You just compare the two pixels. What's the problem with doing, um, a pixel comparison? So the idea is I take the two pictures, I compare them, and if they're close enough in pixel-wise comparison, then it's the same person. If they're far, it's not the same person. What can go wrong?
24. SPSpeaker
  Difference in lighting.
25. KKKian Katanforoosh
  Difference in what?
26. SPSpeaker
  Lighting.
27. KKKian Katanforoosh
  Lighting, yeah. Actually, what's interesting with the lighting is if you look here, so in this one, you take the top left pixel right here, okay? It's bright. You take the top left pixel on this one, it's dark or at least dark green. The difference between these two pixels is massive. It's close to two fifty-five, yet the pixel doesn't even matter. So why would you use that? It would penalize the comparison without actually mattering at all. So that's a good point. Yeah.
28. SPSpeaker
  [background noise]
29. KKKian Katanforoosh
  Yeah, absolutely. Background difference. Translation invariance. Like imagine the same picture, but the person is like three pixels to the right. The co-comparison will be completely different because it's a pixel comparison rather than a semantically meaningful comparison. Okay. What are other things that can go wrong?
30. SPSpeaker
  Distance from the camera
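Editor's sketch (a toy example, not lecture material): the failure modes raised in this discussion, lighting and a shift of a few pixels, can be reproduced on a tiny synthetic grayscale image. The striped 4-by-12 "image", the helper names, and the brightness offset of 60 are assumptions chosen so the arithmetic is easy to follow.

```python
def mean_abs_diff(img_a, img_b):
    """Average absolute pixel difference between two same-sized images."""
    total = 0
    count = 0
    for row_a, row_b in zip(img_a, img_b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count

def shift_right(img, k, fill=0):
    """Translate the image k pixels to the right, padding with fill."""
    return [[fill] * k + row[:-k] for row in img]

def brighten(img, delta):
    """Simulate brighter lighting by adding delta to every pixel (capped at 255)."""
    return [[min(255, p + delta) for p in row] for row in img]

# Toy 4x12 image: stripes of three bright pixels, then three dark pixels.
row = [200, 200, 200, 0, 0, 0, 200, 200, 200, 0, 0, 0]
face = [list(row) for _ in range(4)]

print(mean_abs_diff(face, face))                  # 0.0
print(mean_abs_diff(face, shift_right(face, 3)))  # 200.0
print(mean_abs_diff(face, brighten(face, 60)))    # 57.5
```

The shifted copy shows exactly the same content, yet its pixel distance is as large as this toy image allows. That is why the lecture moves from pixel comparison to comparing learned encodings.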
1:04:55 – 1:12:46
Training the encoder with triplet loss: anchor/positive/negative and decision-driven loss design
1. KKKian Katanforoosh
  certain types of features. But right now I can't tell you, you know, unless I do that study. No. Um, okay, so question for you, what-- how would you build a training and a loss function to make that possible to train that network? Do you have ideas? It's not an easy question. Okay. Try. Where to start? Yes.
2. SPSpeaker
  So when two vectors are like in mean squared error, it's probably better.
3. KKKian Katanforoosh
  So mean squared error between what? You're right, actually. Two vectors, mean squared error because it's a, you know-- Yeah, but-
4. SPSpeaker
  The, um, the loss function, probably. The, the cost function, probably. Cost of the game-
5. KKKian Katanforoosh
  So are you saying we would take pairs of pictures, we will run it through the network, we will then take the two vectors that we get and do lo-- apply the loss function, some distance, L1 distance, L2 distance, and then trace it back and say, "These were the same people. You should have been closer." That's what you mean?
6. SPSpeaker
  Yeah.
7. KKKian Katanforoosh
  Yeah, it's a good idea. Yeah. Someone else wanted to say something.
8. SPSpeaker
  I was gonna say we could use, like, the cosine similarity vector right at a different time.
9. KKKian Katanforoosh
  Yeah. That's another one, cosine similarity. That could also be our loss function.
10. SPSpeaker
  Um, I think I would start some data manipulation.
11. KKKian Katanforoosh
  Okay, like what?
12. SPSpeaker
  Like create a, like a data set from the original picture and play around with things to see what happens.
13. KKKian Katanforoosh
  Great idea. Data augmentation. So you would-- you say, I can take the picture of Bertrand and probably, uh, mirror it, flip it, rotate it, crop it, and I would use more data that way. Yeah, absolutely. That, that would help a lot, actually. Okay, so all of these are good ones. Um, if I summarize your point, though, because that's really the key to, uh, the-- designing a good loss function, what we really want is that pictures of the same person end up with similar vectors, and pictures of different people end up with different vectors, if we rephrase it in plain English. So what we'll do is that we'll build a data set of triplets. Each triplet includes a picture that is the anchor, a picture that is the positive. The reason it's called positive is because it's the same person as the anchor, and a picture that is called the negative because it's a different person than the anchor, and by definition, also a different person than the positive. One picture of a person? Great question. Mm, if you have one picture of a person, then you can't do that. We'll actually see another method that would allow us to do it even with one pic- one picture of a person. Yeah.
14. SPSpeaker
  So you can kind of rotate it?
15. KKKian Katanforoosh
  You can rotate it. That's true. You, you could actually do some data augmentation, as he was mentioning, um, and build a data set starting with one picture, but this approach will not be the best one. We'll see another approach right after that would work better. Yeah.
16. SPSpeaker
  Why are we comparing, like, a vector from the model from-- with the vector from the model instead of just comparing the output to the output?
17. KKKian Katanforoosh
  Uh, a good question. Why do we compare a vector rather than the output of the model? Right? Um, so what's the output of the model? We actually haven't talked about the architecture, but I'm assuming you're saying it's a binary number. It's between zero and one. Because it's a single dimension, it cannot hold meaningful information. So you probably want to have a vector that is big enough where you believe it has enough flexibility to hold information that can allow, uh, us to verify if the same person is on the picture. Yeah, essentially. Okay, I'm gonna move on and, and if there's... So what we want is to minimize the encoding distance, uh, between the anchor and the positive, and we wanna maximize the encoding distance between the anchor and the negative. So question for you. What I'm gonna ask you is to take ten, fifteen seconds, look at the slide, and you're gonna start voting for A, B, or C. By the way, Enc is encoding. It's just how I call the, the vector that we get out of the network. A is the anchor, N is the negative, and P is the positive.
18. SPSpeaker
  What is the
19. KKKian Katanforoosh
  So A is the anchor picture, uh, N is the negative picture, which is different from the anchor, the different person, and P is the positive picture, which is the same as the anchor. And Enc of A is when you run A through the network, you get the vector Enc of A. Okay, let's look at the results. Well. A. Forty-seven for A, twenty-three for B, three for C. So someone who said, "Good job" first, that is correct. Um, someone who selected A wants to tell us why? Yeah.
20. SPSpeaker
  Um, because we're mi-- to minimize the encoding distance between the anchor and the positive, we just minimize the distance right there. But maximizing the encoding distance between the anchor and the negative, this is saying it's, like, minimizing the negative.
21. KKKian Katanforoosh
  Correct. Correct. Great. So actually the keyword here is minimize. If I had said maximize, the answer indeed, as you say, would have been different because here we're looking at minimizing the distance between the anchor and the positive, and in fact, minimizing this or maximizing the opposite of it. That's why the answer is A. Okay, good stuff. Let's keep going. Um, so going back to the initial setup, um, we had a cat, and we were predicting a binary number. Here instead, we have three pictures going through the network in parallel, so you can imagine it's batch processing. It's like the three are going in the same network at the same time. And then you're getting three vectors. You're computing the loss function. Okay? You're doing this loss function we talked about. I'm not gonna talk here about the alpha number, but you're gonna learn when you build it why the alpha number matters. Hint is maybe zero would have been a correct answer if you didn't have the alpha number, so it would have created instability in the model. Uh, but you do that many, many times. You push the parameters to the right or the left, and because of the way you created your loss function and your data labeling, um, your-- the way you structured your data and the loss function, um, essentially the model is gonna learn by itself to create similar encoding for pictures that are of the same person and separate encodings for pictures that are not from the same person. And you didn't need to do feature engineering. You didn't need to talk about eyes and ears and whatever because it will figure it out. You know that you created the learning environment to allow that to happen. So congratulations. You designed your first loss function, and we're gonna design many more in this course. Um, this by the way is from FaceNet. It's a, it's a paper from two thousand and fifteen from Schroff et al. 
And you'll see in the slides I used, I always put the, the reference to the papers in case you wanna go back and study the actual paper.
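Editor's sketch (assuming the standard FaceNet-style formulation with squared L2 distance; the margin value 0.2 and the two-dimensional toy encodings are illustrative only): the triplet loss discussed above takes the anchor-to-positive distance, subtracts the anchor-to-negative distance, adds the margin alpha, and clips the result at zero.

```python
def squared_distance(u, v):
    """Squared L2 distance between two encoding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(enc_a, enc_p, enc_n, alpha=0.2):
    """max(||Enc(A) - Enc(P)||^2 - ||Enc(A) - Enc(N)||^2 + alpha, 0)."""
    gap = squared_distance(enc_a, enc_p) - squared_distance(enc_a, enc_n)
    return max(gap + alpha, 0.0)

# A well-separated triplet: positive close to the anchor, negative far away.
print(triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 0.0]))  # 0.0

# A collapsed encoder that maps every picture to the same vector.
collapsed = [0.0, 0.0]
print(triplet_loss(collapsed, collapsed, collapsed, alpha=0.0))  # 0.0
print(triplet_loss(collapsed, collapsed, collapsed, alpha=0.2))  # 0.2
```

The last two lines illustrate the hint about alpha: without the margin, an encoder that outputs the same vector for everyone reaches zero loss without learning anything, while a positive margin penalizes that solution.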
1:12:46 – 1:20:22
From verification to identification and clustering: k-NN search and k-means on embeddings
1. KKKian Katanforoosh
  Many students do it for their projects. This is a great one, great, great paper to, to look at. A lot of citations as well. Um, let me make it slightly complicated-- more complicated. We learned face verification, and now we wanna do face identification. How is that different? Identification is a school wants to use, um, uh, to recognize students in facilities. So imagine face verification is you swiped your card, and then that picture was compared to the picture of the camera. That's verification. The two, are they the same or not? Identification is you have this picture in the database somewhere. The person enters, immediately you can identify them. You know? So the difference, uh, for, for those of you who fly in the US is when you, when you go through Global Entry. Uh, many people don't even need to put their passport or anything. They just watch-- look at the camera, and they move on. That's identification. Okay? But actually, wh-when you're in Europe, for example, you put your passport in, then you walk in, then it takes a picture. That's verification. You see the difference or no? Yeah.
2. SPSpeaker
  You were saying, uh, verification, the negative or the output, how would you create those triplets on each end?
3. KKKian Katanforoosh
  The negative or how do you create those triplets, essentially?
4. SPSpeaker
  When you're actually doing it in real time.
5. KKKian Katanforoosh
  Uh, no, in real time. That's a great question I didn't talk about. So at train time, you, you have databases, and you, you create the triplets automatically. Like, you pick pictures from the same person, or you use data augmentation, and you add a random picture from someone else. You create millions of triplets like that or billions of triplets. At test time, you only take the picture from the camera, run it. You don't use the negative. You just take the picture from the camera. You run it through the network. The person swipes. You take the picture from the swipe, run it through the network. You do the comparison. You let them in or not. So there's no more negative at test time in practice. It's just a trick to train the model. Okay. So how would you do face identification using what we learned for face verification? Is there any small tweak you can make that would make this network work for identification? Yes.
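Editor's sketch (hypothetical helper names and toy data, not the course's starter code): the split described in this answer is that triplets are assembled automatically at train time, while at test time only two encodings are compared against a distance threshold. The photo identifiers, the threshold of 0.5, and the functions `build_triplets` and `verify` are assumptions for illustration.

```python
import random

def build_triplets(photos_by_person, num_triplets, rng):
    """Sample (anchor, positive, negative) photo triplets for training."""
    # Only people with at least two photos can supply an anchor and a positive.
    eligible = [p for p, photos in photos_by_person.items() if len(photos) >= 2]
    triplets = []
    for _ in range(num_triplets):
        person = rng.choice(eligible)
        anchor, positive = rng.sample(photos_by_person[person], 2)
        other = rng.choice([p for p in photos_by_person if p != person])
        negative = rng.choice(photos_by_person[other])
        triplets.append((anchor, positive, negative))
    return triplets

def verify(enc_camera, enc_card, threshold=0.5):
    """Test time: no negative, just compare two encodings to a threshold."""
    dist = sum((a - b) ** 2 for a, b in zip(enc_camera, enc_card))
    return dist < threshold

photos_by_person = {
    "bertrand": ["bertrand_id", "bertrand_gym"],
    "alice": ["alice_id", "alice_pool"],
    "chen": ["chen_id"],  # one photo: usable as a negative only
}
rng = random.Random(0)
triplets = build_triplets(photos_by_person, num_triplets=5, rng=rng)
print(triplets[0])
print(verify([0.0, 0.0], [0.1, 0.0]), verify([0.0, 0.0], [1.0, 1.0]))  # True False
```

A person with a single photo cannot provide an anchor-positive pair here, which matches the earlier remark that one picture per person is not enough for this training scheme without augmentation.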
6. SPSpeaker
  Maybe you compare it to records of previous faces in the database.
7. KKKian Katanforoosh
  Correct. Correct. What is it called in machine learning? There's a machine learning algorithm that we can stack on top of what we just did.
8. SPSpeaker
  Can you repeat what he said?
9. KKKian Katanforoosh
  He said, you can compare-- So because we don't have two pictures anymore, we just have one from the camera. You just compare the vector of the-- You run this by the network, you get the vector, and then you compare it to the database.
10. SPSpeaker
  Clustering.
11. KKKian Katanforoosh
  No. Good, good, good try. You have a database of all the student pictures. You run everything through the network. Instead of storing the image, you store the vectors. And then someone shows up, and you're looking in the database. Is there any vector that is super close to this one? That's identification. What is this algorithm called in machine learning? It's pretty simple algorithm.
12. SPSpeaker
  [on phone] K-nearest neighbor?
13. KKKian Katanforoosh
  No. Okay, I'm gonna make it easier. What if instead of having one picture of a student in the database, you have three of each student? You have three vectors for each person, and then you're trying to find the nearest vectors in the database from the one that the camera takes. I used a keyword.
14. SPSpeaker
  [on phone] Closest neighbors.
15. KKKian Katanforoosh
  No.
16. SPSpeaker
  K-nearest neighbors?
17. KKKian Katanforoosh
  Yeah. [chuckles] K-nearest neighbors. That's a K-nearest neighbor algorithm. It's, it's essentially-- You wa-- You wanna explain, uh, what you, what you meant? Wh-why is it K-nearest neighbor?
18. SPSpeaker
  Um, well, I was just thinking about, like, what's the nearest vector-
19. KKKian Katanforoosh
  Yeah
20. SPSpeaker
  ... closest to the vector that you have in the database.
21. KKKian Katanforoosh
  Yeah. It's K-nearest neighbor in high di-- for high dimensional vectors. So here is a simple example of K-nearest neighbor for two dimensions. In practice, it's one hundred twenty-eight dimensions, so I can't put it on a slide, of course. But let's say in green you have the query point. The query point is the camera picture. Okay? And then you, you run a nearest neighbor algorithm and you say, are there three vectors in the database that are close to this vector? And you can add additional checks. Are these three vectors from the same person? If they are, then it's very likely the person is correct because you just showed that the three closest, uh, uh, vectors in the database are all from the, the same person. So it's higher likelihood. You could even do it for ten-nearest neighbor if you wanna be really secure. Let's say you go to the airport every time, and every time they take a picture of you, and now they can do a ten-nearest neighbor, um, on, on that search. Does that make sense? Now, slightly more complicated, you wanna do face clustering. So, you know in your phone sometimes it says, uh, um, it automatically put all the pictures from your mom in one folder and from your dad in another folder, right? Uh, how does it do it? How could you make a tweak to, again, what we created? Our encoding network, how can you use that to create that?
22. SPSpeaker
  Try K-means.
23. KKKian Katanforoosh
  K-means. Yeah, exactly. The K-means algorithm, which is an unsupervised clustering algorithm. So you have a bunch of pictures. You have vectorized all of them with the network you trained, and normally, the vectors that are from the same person should be clustered around the same place, and that's very simply how big companies do it on your phone. Yes.
24. SPSpeaker
  What happens if the person's not in the database?
25. KKKian Katanforoosh
  Yeah. So if, if the person is not in the database, then you shouldn't find any vector that is close to the vector of the picture you're taking. The distance to the closest vector would be above your threshold, and you wouldn't let that person in. Yeah.
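A minimal sketch of the identification step discussed here, assuming the encoder has already turned every enrolled picture into a vector. The function names, the toy 2-D vectors, and the threshold value of 0.6 are illustrative assumptions, not values from the lecture; the real embeddings are described as one hundred twenty-eight dimensional.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(query, database, k=3, threshold=0.6):
    """Return the matched name, or None if nobody in the database matches.

    database is a list of (name, embedding) pairs, with several
    embeddings per person. k and threshold are hypothetical settings.
    """
    # Sort stored vectors by distance to the camera vector, keep the k closest.
    neighbors = sorted(database, key=lambda item: euclidean(query, item[1]))[:k]

    # Unknown person: even the closest stored vector is beyond the threshold.
    if not neighbors or euclidean(query, neighbors[0][1]) > threshold:
        return None

    # Additional check: all k nearest vectors must belong to the same person.
    name, votes = Counter(n for n, _ in neighbors).most_common(1)[0]
    return name if votes == len(neighbors) else None


# Toy 2-D database with three vectors per person (made-up numbers).
database = [
    ("alice", (0.0, 0.0)), ("alice", (0.1, 0.0)), ("alice", (0.0, 0.1)),
    ("bob", (1.0, 1.0)), ("bob", (1.1, 1.0)), ("bob", (1.0, 1.1)),
]

print(identify((0.05, 0.05), database))  # alice
print(identify((5.0, 5.0), database))    # None, nobody is close enough
```

Requiring all k neighbors to agree is the stricter reading of the "additional check"; a majority vote would be a looser alternative. A linear scan like this is only practical for small databases.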
26. SPSpeaker
  I think that's why they make you, like, sign up for global entry.
27. KKKian Katanforoosh
  Yeah, for sure. Yeah. They make you sign up there. So it, it's interesting because companies know that they need to build these net-- these algorithms, and then some, like, the admission process, the sign-up process might include certain data points, and now you're starting to understand how it's used in the background, right? Okay. Uh, let's move on. Uh, yeah, one question.
28. SPSpeaker
  Are you comparing every image after, uh, each record or with a centroid?
29. KKKian Katanforoosh
  Oh, uh, good question. So are you comparing each new picture? So I take a picture, um, of my mom with my phone. What's gonna happen? If you're doing clustering, this picture is likely gonna be compared to the centroid of my mom. So the phone probably keeps a centroid of my mom, and if the picture is closer to that centroid than to any other centroid, it's probably gonna put it in that folder, essentially. No. Uh, yeah, one more question, and then we move on.
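The clustering and centroid comparison described in the last few answers can be sketched as plain K-means over embedding vectors. This is a simplified illustration under stated assumptions: the number of clusters k is given by hand, the initial centroids are simply the first k points (real implementations usually use random or smarter initialization), and the 2-D vectors are made up.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with a naive, deterministic initialization."""
    centroids = list(points[:k])  # assumption: first k points as starting centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids


# Made-up embeddings: three photos of one person, three of another.
photos = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (5.0, 5.0), (5.2, 5.0), (5.0, 5.2)]
centroids = kmeans(photos, k=2)

# A new photo only needs to be compared with the stored centroids.
new_photo = (0.1, 0.1)
print("new photo goes to folder", nearest(new_photo, centroids))
```

Comparing a new picture against a handful of centroids is much cheaper than comparing it against every stored picture, which is the point of keeping a centroid per person.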
30. SPSpeaker
  Algorithm to be used to determine how many centroids you want
1:20:22 – 1:26:33
Self-supervised learning: contrastive pairs via augmentation (SimCLR) to avoid manual labels
1. KKKian Katanforoosh
  we're gonna get to an interesting section, which is brand new, um, around self-supervised learning. So note that everything we did so far, the day and night classification, um, the trigger word detection, and the triplet loss were supervised learning. We had labels, essentially. Day and night is very classic supervised learning. You label data with zero and one. Same for trigger word detection. Face verification, you can, can, can debate. It can be different. But anyway, we, we focused on supervised learning. Now we're gonna talk about self-supervised learning, and my question for you is the following. Um, labeling is expensive. We know that. So how would you redo what we did with a different approach that does not require labels? Meaning, you remember even in face verification, we sort of had the name of the students in the database with their face, and we might have multiple pictures of them. Let's say you don't even have that. You just have faces in the wild, unlabeled. How would you do things differently? Any idea? Yeah.
2. SPSpeaker
  Ask the neural network to decide, like, to point out which images are close to each other and
3. KKKian Katanforoosh
  Okay. Let the neural network find the pictures that are close to each other. But how would you train that network? Like, you're starting with a network that doesn't do anything. You give it an image, it gives you a random vector at first. So how would you train it? Yeah.
4. SPSpeaker
  Um, do some kind of clustering offline, and then use that as-- like, use those centroids as, like, your labels.
5. KKKian Katanforoosh
  Do some clustering offline. But again, my question is the clustering algorithm, how is it trained? How do you cluster if you don't have any encoder network? 'Cause the clustering came after we trained the encoder network. The clustering only worked because we had a good encoder network.
6. SPSpeaker
  But if you have, like, for example, a bunch of pictures of the same person, and then you run that through the network, right?
7. KKKian Katanforoosh
  But you don't know if it's the same person. That's what I'm saying is like-
8. SPSpeaker
  The vectors would be similar.
9. KKKian Katanforoosh
  No, because that's the network you're training. The vectors are not similar because that's, that's the network we wanna train. Right now, you-- I gave you a network. It's completely random. You give it my picture on Saturday and on Sunday, the vectors are completely off. So how do you start? Yeah. Uh, yeah. Go on in.
10. SPSpeaker
  You could use autoencoder.
11. KKKian Katanforoosh
  Okay, tell me more.
12. SPSpeaker
  Uh, you just train the model to give an image, um, first creating a latent representation of that, and then give it-- so encoding to, like, to a latent representation, and then another model that gives it a latent representation decode-
13. KKKian Katanforoosh
  Okay, okay. Yeah, you're ahead, but, but, uh, we, we'll study that in two weeks, actually. So we'll do autoencoders, uh, two weeks from now in class. Um, anyone else has an idea? Yeah.
14. SPSpeaker
  You could use an encoder to try to, um, translate it to a high dimension and then you can do global search.
15. KKKian Katanforoosh
  Okay. It's actually similar to what he was saying, to reconstruct the original image. Yeah, that's what we learn. It's, it's, you know, there's a lot of methods. Diffusion models work like that. Autoencoders, uh, we learn about that in two weeks when we'll focus on generative AI, generative modeling. Um, here, um, I wanna present, uh, also a sort of a generative method, but it's really interesting because it will be your foray into self-supervised learning. Here is the idea. If we have pictures in the wild, going with the methods you had mentioned around data augmentation, you can actually force the network to learn from the data itself. So let me give you an example. I take-- You, you look at the picture of this dog, and you rotate it by ninety degrees. It's still the same dog, right? A human would say it's the same dog. What are we using in our brain? We're using the ability to understand rotation invariance and to understand the semantics of the dog. And so technically, if you gave those two images to the network, um, you could create a loss function that compares those two images and requires their vectors to be close to each other. Does that make sense? Another thing you could do, you can do a patch. You can literally take an image of a face and put a patch on half of the image, you know. And then you, you say, um-- and then you do the same thing on the other image. You put the other patch, the other half. And now you tell the network these two should have the same vector, pretty much. So you use your data augmentation scheme on massive data sets online to force the network to learn from the data itself. Okay? No need to forge triplets per se. You just take a picture, you make a variant of it with noise, with rotation, with cropping, with translation, with whatever you want, and then you put these two in the data set, and you say, "These two are the same person. They should have the same vector." Does that make sense? 
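A rough sketch of a loss over augmented pairs, in the style of the NT-Xent loss used by SimCLR. It assumes the encoder and the augmentations already exist and only scores their output vectors; the function names, the temperature of 0.5, and the toy vectors are illustrative assumptions. Besides pulling the two views of the same image together, this loss also pushes views of different images apart, which keeps the network from mapping everything to the same vector.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def nt_xent(views, temperature=0.5):
    """Average contrastive loss over a batch of augmented views.

    views[2 * i] and views[2 * i + 1] are the embeddings of two
    augmentations (rotation, crop, patch, ...) of the same image i.
    """
    n = len(views)
    total = 0.0
    for i in range(n):
        partner = i + 1 if i % 2 == 0 else i - 1  # the other view of the same image
        # Similarity to every other view in the batch (self excluded).
        logits = [cosine(views[i], views[k]) / temperature for k in range(n) if k != i]
        positive = cosine(views[i], views[partner]) / temperature
        # Cross-entropy where the "correct class" is the partner view.
        total += math.log(sum(math.exp(l) for l in logits)) - positive
    return total / n


# Encoder that maps both views of each image close together (low loss).
good = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
# Encoder that maps the two views of each image far apart (high loss).
bad = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (0.1, 0.9)]

print(nt_xent(good), nt_xent(bad))
```

No labels appear anywhere: the only supervision is knowing which two vectors came from the same source picture.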
That's why it's called self-supervised learning, because you don't have labels. You just create a learning environment that makes the, um, the network learn from the patterns of the data directly itself. Okay. So this is an example called SimCLR. Again, the paper is, is right there. And this shift from supervised triplets, FaceNet, which was a paper from two thousand and fifteen, to self-supervised pairs, that is why modern models, uh, are trained on billions of unlabeled images. Okay? That's how we create... It's much simpler when you think about it. You can literally write a script and scrape, and it will label-- auto-label the images and put them in pairs, do variations, and then you'll end up with a very powerful pre-trained model. Much simpler than people think. It's not that hard, you know, at the end of the day. Most of the complexities are gonna come from
1:26:33 – 1:34:07
Self-supervision in language: next-word prediction and emergent behaviors
1. KKKian Katanforoosh
  compute, right? Um, okay, so this method is called contrastive learning. Okay? We're gonna talk about it a little more in two weeks. Um, self-supervision is not only an image thing, it's also used in other modalities. For example, in text. The, the, the principle is the same. You predict what belongs together, and you push away what doesn't. That's for images. And here, um, the core of GPT, some of you have probably heard of that, um, is a method called next token prediction. We're gonna learn later about tokens in the class, but today, forget about tokens, just think of words. We're trying to look at a sentence and predict the next word. Why is this self-supervised learning? Because you don't label data. You just literally grab data from online, and you create a scheme that forces the model to learn from the patterns of the data. The self comes from the fact that you didn't label it manually yourself. So let's do a few examples. And the, the reason I wanna do the examples is because we wanna talk about emergent behaviors, um, that stem from the tasks we defined. So give me the word you're thinking of. I poured myself a cup of? Coffee. Some people said tea, coffee. Anybody said anything else? Water. Water. A cup of water. [laughs] Healthy people said water. Okay. Um, yeah. So the-- what's the emergent behavior that you can expect the model is gonna learn based on just that example? It's gonna be a drink. Mm-hmm. A drink. Yeah, good point, because you know that whatever is here first fits in a cup. So it understands that. Uh, the second reason is poured, so it's a liquid. So just this sentence, without even labeling, is going to generate emergent behaviors that we've never trained the model for. That's what's interesting about modern AI, if you will. You don't need to define the tasks, you know. Um, the-- Frankly, the same way, think about face verification. 
Back in the days, we used to do what I showed you, where we would create triplets, and we would be very specific about this is for face verification. You could actually scrape all the images online and do the contrastive learning that I showed you next, and it will still be good at detecting faces without you having even defined that task in the first place, just by doing the contrastive prediction. Um, second ex-- O-okay, first example also, again, I was trying to predict, but people usually say different things. I think the majority of people think coffee. It's very cultural. You go in another country, it's gonna be tea, for sure. Uh, and that forces the model, um, to really think, uh, about everyday co-occurrence patterns, like, you know, being liquid, being of a certain size, occurring together. So, for example, there's probably a lot of sentences online that say, "Pouring a cup of tea," and there's a lot saying, "Pouring a cup of coffee." Because of that, the model should understand that these two things are probably close to each other because their context is similar. Uh, second example, uh, the capital of France is? [laughs] Mm. It will- Paris. Huh? [laughs] Paris. Okay. Uh, what's the emergent behavior you can expect the model to learn? Has to do something with facts. Yeah, learn about facts. Exactly. So really, predicting the next token forces the model to encode real-world facts such as Paris being the capital of France. Oops, sorry. Uh, what about the third example? She unlocked her phone using her? Body parts. Body parts. [laughs] I don't know what your phone-- [laughs] type of phone you have, but, uh, wait, what would you say? Face. Face. Password. Password. Fingerprint. Yeah, all of them are possible. So again, the network will learn probably that password, fingerprint, and face can be used to unlock stuff. You know. Um, yeah. 
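To make the "no manual labels" point concrete, here is a toy sketch of next-word prediction in which the training targets are cut directly out of raw sentences. It is a counting model, not a neural network, and it works on whole words rather than tokens, so it only stands in for how GPT-style models get their supervision. The corpus, the two-word context, and all function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def training_pairs(sentence, context_size=2):
    """Slice a raw sentence into (context words, next word) examples."""
    words = sentence.lower().split()
    return [
        (tuple(words[i - context_size:i]), words[i])
        for i in range(context_size, len(words))
    ]


def fit(corpus, context_size=2):
    """Count which word follows each context across the whole corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for context, target in training_pairs(sentence, context_size):
            counts[context][target] += 1
    return counts


def next_word_distribution(counts, context):
    """Turn the counts for one context into probabilities."""
    seen = counts.get(tuple(context), Counter())
    total = sum(seen.values())
    return {word: n / total for word, n in seen.items()}


# Unlabeled text, as if scraped from the web (made-up sentences).
corpus = [
    "I poured myself a cup of coffee",
    "I poured myself a cup of tea",
    "I poured myself a cup of coffee",
    "She drank a cup of water",
]

counts = fit(corpus)
print(next_word_distribution(counts, ("cup", "of")))
# {'coffee': 0.5, 'tea': 0.25, 'water': 0.25}
```

Coffee, tea, and water end up sharing the same context, which is the co-occurrence signal the lecture describes. A neural model trained on the same objective would additionally place those words close together in its embedding space.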
And in fact, here, probably fingerprint or face might nowadays be the more common because of how the world has evolved. Uh, but in-- back in the days, it would be password, for sure. Um, and so this forces a semantic understanding that these things are probably all meant to unlock information. Um, the next one. The cat chased the? Dog or mouse or ball. And again, uh, the model will learn probabilistic reasoning, meaning because in the dataset it will find variations of this sentence with different, um, conclusions, it would say that there's a lot of things that the cat can chase, you know. And so that's probabilistic reasoning. What about the last one? If it's raining, I should bring an? Umbrella. Umbrella. Uh, what's the emergent behavior? It's reasoning and inference. Um, the model will learn to connect conditions. So, for example, raining requires you to be protecting yourself from the rain with an umbrella. That's reasoning. Okay? So long story short, emergent behaviors are unexpected capabilities that arise from simple training objectives at scale without being explicitly taught or labeled. Later in this class, we're gonna have a full lecture on deep reinforcement learning, where we're gonna talk about emergent behaviors in robotics or in gaming, where it turns out the agent you're training learns to do certain strategies that you didn't expect they would do. AlphaGo is a good example if you've watched the documentary. Okay. Um, self-supervision is not just about text and images. We've seen the next token prediction for GPT, and we've also seen contrastive learning for images. My question here is the following: What other examples of modalities can you think of? And tell me the task that you would define. Audio. So for audio, what would you do? Audio. How would you do self-supervision in audio? Mask out portions of audio and- E-exactly. Mask out portions. 
So mask out twenty time steps, and because you know what the data was, you have a label. You knew what the truth was. You can do a self-supervision task. It would work great. Again, the, the only limitation is compute and scale. Uh, what other modalities? Maybe self-driving and changing conditions of the- Yeah. So self-driving is a good example. It's very multimodal. Um, there is a lot of different things happening in self-driving. We'll, we'll talk about it in a future lecture. What, what else? What other modalities can you think of? Videos. Videos. What would you do with video? Take frames out. Take frames out. You can take some frames out. Same principle as audio. Um, biology. Some people work in healthcare biology here. Yeah, a couple. Uh, well, you, you know about amino acids and, uh, and protein structure. You can actually mask portions of the input, such as a protein structure or DNA and, and then, uh, complete
1:34:07 – 1:39:41
Weak supervision and multimodal embeddings: naturally paired data and shared representation spaces
1. KKKian Katanforoosh
  it, and it will force the model to understand those patterns. So great stuff in there. Um, but the world is very multimodal. We experience words, images, sounds, and actions together. How can we connect them? When you think about multimodality, you wanna connect texts and images, let's say. What do you need to do that? You actually need labeled data. You need image captions. So, for example, you have a bunch of pictures online captioned "the cat is looking at the camera." So there is a picture, and there's a label underneath, just like on Instagram. Let's say people put captions, right? Um, and the reason you can connect those modalities is because of that data set, because you have a lot of that. Now, this is not typically called supervised learning. It will be called weakly supervised learning because you're not actually labeling images with captions. You are benefiting from naturally occurring pairings in the world. There are naturally occurring pairings of images and texts. Okay? So now what I want you to do is to find other examples that are not just captioned images, but naturally occurring examples of different modalities that appear in the wild together that we could use to connect modalities. The whole point of connecting modalities is that our vectors now can represent, uh, different modalities close to each other in space. Okay? So think about that. Please continue. I'm gonna read some of the answers, but we keep going. Okay, so stock price sequence is, um, is a single modality. You would, you would look at stock price, and you can mask and then predict, but y- I think maybe what you mean is you would put additional data points in there as well. Uh, let me see. Audio paired with video. Audio and video is a great one. Audio and video is naturally paired. 
You know, you take a YouTube video, it has the audio and the video, and so when a dog is barking, uh, you have the audio of the dog barking and the video of the dog barking, and so you can create a pairing between those two modalities. Uh, transcription. So a lot of movies have subtitles, and so by definition, a video stream or a stream of images will be naturally connected to text, which will also be naturally connected to audio. Music and song title, again, that's a great one. Audio and text are connected. Genotype and phenotype, good one as well. Medical imaging with ultrasound, that's a great one. Naturally occurring. You usually, if you go do an ultrasound, you'll have the different types of images that occur together naturally. Game footage and keyboard action, again, another great one. So, you know, price and area code, good one. The, uh, yeah, great, great examples. Uh, facial expression. So TLDR is we have ways to connect modalities. Um, oftentimes, some modalities are gonna connect very naturally. Most things connect to text. So that's what you wanna use as your shared space, typically. But here is an example of a paper called ImageBind. And the interesting thing about ImageBind is it says that, um, most, um, most things connect through a single modality. So, for example, thermal data connects to imaging. Imaging connects to text. So text is gonna connect through images with thermal data. And what's the consequence of that? If I may show you a little example, is that you can-- Uh, this is a demo from, uh, Meta, uh, called ImageBind. Uh, it's a cool one. You can actually see things occurring together. So, for example, you put a text, drums, and of course, you can get an audio of a drum. [drum beating] But you can also see what the closest image in the vector space is to that concept. So all the spaces are now bound together. Um, you can also do, um, audio and image. So you, you, you give it a dog barking, [dog barking] and a picture. What can you expect? 
A dog on the beach. That's the multimodal embedding, the connecting tissue between those different modalities, and that's probably one of the biggest, um, innovations of the last few years, connecting those shared spaces. Okay? I'm not gonna cover the full paper, but the core insight is that there are shared spaces. There are spaces like text and image that connect to most modalities that can allow us to connect those modalities together. Okay. We learned a lot of things here, embeddings, self-supervised learning, contrastive learning, data augmentation, next token prediction, weakly supervised learning, and then the shared embedding space, with the central pivot usually being text. Okay? Uh, that's all for today. We're not gonna have time to cover the adversarial example, but we're gonna cover it in two weeks together. Uh, you're gonna have more neural network baggage.
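A small sketch of what a shared embedding space makes possible, assuming separate text, audio, and image encoders have already been trained to map into the same space. The vectors and file names below are invented for illustration and do not come from ImageBind; only the lookup logic is the point.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def closest(query_vector, candidates):
    """Name of the candidate whose embedding is most similar to the query."""
    return max(candidates, key=lambda name: cosine(query_vector, candidates[name]))


# Hypothetical outputs of three different encoders, all in one shared 3-D space.
text_embeddings = {"drums": (0.9, 0.1, 0.0)}
audio_embeddings = {"dog_bark.wav": (0.0, 1.0, 0.2)}
image_embeddings = {
    "drum_kit.jpg": (0.8, 0.2, 0.1),
    "dog_on_beach.jpg": (0.1, 0.9, 0.3),
}

# Text query, image answer.
print(closest(text_embeddings["drums"], image_embeddings))          # drum_kit.jpg
# Audio query, image answer.
print(closest(audio_embeddings["dog_bark.wav"], image_embeddings))  # dog_on_beach.jpg
```

Because every modality lands in the same space, the same nearest-neighbor lookup used for face identification works across modalities without any modality-specific matching code.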

Episode duration: 1:39:47

Transcript of episode DNCn1BpCAUY
