Ishan Misra: Self-Supervised Deep Learning in Computer Vision | Lex Fridman Podcast #206
- 0:00 – 2:27
Introduction
- LFLex Fridman
The following is a conversation with Ishan Misra, research scientist at Facebook AI Research who works on self-supervised machine learning in the domain of computer vision, or in other words, making AI systems understand the visual world with minimal help from us humans. Transformers and self-attention have been successfully used by OpenAI's GPT-3 and other language models to do self-supervised learning in the domain of language. Ishan, together with Yann LeCun and others, is trying to achieve the same success in the domain of images and video. The goal is to leave a robot watching YouTube videos all night, and in the morning come back to a much smarter robot. I read the blog post, Self-Supervised Learning: The Dark Matter of Intelligence by Ishan and Yann LeCun, and then listened to Ishan's appearance on the excellent Machine Learning Street Talk podcast, and I knew I had to talk to him. By the way, if you're interested in machine learning and AI, I cannot recommend the ML Street Talk podcast highly enough. Those guys are great. Quick mention of our sponsors: Onnit, The Information, Grammarly, and Athletic Greens. Check them out in the description to support this podcast. As a side note, let me say that for those of you who may have been listening for quite a while, this podcast used to be called the Artificial Intelligence Podcast because my life passion has always been, will always be artificial intelligence, both narrowly and broadly defined. My goal with this podcast is still to have many conversations with world-class researchers in AI, math, physics, biology, and all the other sciences, but I also want to talk to historians, musicians, athletes, and of course, occasionally comedians. In fact, I'm trying out doing this podcast three times a week now to give me more freedom with guest selection, and maybe, uh, get a chance to have a bit more fun.
Speaking of fun, in this conversation, I challenge the listener to count the number of times the word "banana" is mentioned. Ishan and I use the word "banana" as the canonical example at the core of the hard problem of computer vision, and maybe the hard problem of consciousness. This is the Lex Fridman podcast, and here is my conversation with Ishan Misra.
- 2:27 – 11:02
Self-supervised learning
- LFLex Fridman
What is self-supervised learning? And maybe even give the, the bigger basics of what is supervised and semi-supervised learning, and maybe why is self-supervised learning a better term than unsupervised learning?
- IMIshan Misra
Uh, let's start with supervised learning. So, typically for machine learning systems, the way they're trained is you get a bunch of humans. The humans point out particular concepts in it. In the case of images, you want the humans to come and tell you what is pos ... like, what is present in the image, draw boxes around them, draw masks of, like, things, pixels which are of particular categories or not. Uh, for NLP, again there are, like, lots of these particular tasks, say, about sentiment analysis, about entailment and so on. So, typically for supervised learning we get a big corpus of such annotated or labeled data, and then we feed that to a system, and the system is really trying to mimic. So, it's taking this input of the data and then trying to mimic the output. So, it looks at an image, and the human has tagged that this image contains a banana, and now the system is basically trying to mimic that. So, that's its learning signal. And so for supervised learning we try to gather lots of such data, and we train these machine learning models to imitate the input/output. And the hope is basically by doing so, now on unseen or, like, new kinds of data, this model can automatically learn to predict these concepts. So, this is a standard sort of supervised setting. For semi-supervised setting, uh, the idea typically is that you have of course all of the supervised data, but you have lots of other data which is unsupervised or which is, like, not labeled. Now, the problem basically with supervised learning and why you actually have all of these alternate sort of learning paradigms is supervised learning does ... just does not scale. So, if you look at for computer vision, the sort of largest, one of the most popular datasets is ImageNet. Right? So, the entire ImageNet dataset has about 22,000 concepts and about 14 million images. So, these concepts are j- basically just nouns, and they're annotated on images.
And this entire dataset was a mammoth data collection effort. It actually, uh, gave rise to a lot of powerful learning algorithms. It's credited with, like, sort of the rise of deep learning as well. But this dataset took about 22 human years to collect, to annotate, and it's not even that many concepts, right? It's not even that many images. 14 million is nothing, really. Um, like, you have about, I think, 400 million images or so, or even more than that uploaded to most of the popular sort of social media websites today. So, now supervised learning just doesn't scale. If I want to now annotate more concepts, if I want to have this ... various types of fine-grained concepts, then it won't really scale. So, y- now you come to these sort of different learning paradigms. For example, semi-supervised learning, where the idea is y- of course, you have this annotated corpus of supervised data, and you have lots of these unlabeled images, and the idea is that the algorithm should basically try to measure some kind of consistency or really try to measure some kind of, uh, signal on this sort of unlabeled data to make itself more confident about what it's really trying to predict. So, by access to lots of this unlabeled data, the idea is that the algorithm actually learns to be more confident and actually gets better at predicting these concepts. And now, we come to the other extreme, which is, like, self-supervised learning. The idea basically is that the machine or the algorithm should really discover concepts or discover things about the world or learn representations about the world which are useful without access to explicit human supervision.
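The supervised recipe Ishan describes, a system trained to mimic human-provided input/output pairs, can be sketched in a few lines. Everything here is a made-up stand-in: 2-D points instead of images, a labeling rule instead of human annotators, and logistic regression instead of a deep network.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Human-annotated" data: points labeled 1 if x + y > 0, else 0.
X = rng.normal(size=(200, 2))
y = (X.sum(axis=1) > 0).astype(float)

w, b = np.zeros(2), 0.0
for _ in range(500):                       # gradient descent on the log loss
    p = 1 / (1 + np.exp(-(X @ w + b)))     # model's predicted probabilities
    grad = p - y                           # error against the human labels
    w -= 0.1 * (X.T @ grad) / len(y)
    b -= 0.1 * grad.mean()

# Fraction of training labels the model now reproduces (near 1.0 on this
# separable toy set).
acc = ((1 / (1 + np.exp(-(X @ w + b))) > 0.5) == y).mean()
```

The scaling problem he raises is visible even here: every training signal comes from the label array `y`, which in the real setting a human had to produce.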
- LFLex Fridman
So, the word "supervision" is still in the term "self-supervised." So, what is the supervision signal? And maybe that perhaps is when Yann LeCun and you argue that unsupervised is the incorrect terminology here.
- IMIshan Misra
Right.
- LFLex Fridman
So, what is the supervision signal when the humans aren't part of the picture, or not, uh, a big part of the picture?
- IMIshan Misra
Right. So, self-supervised, the reason it- it has the term supervised in it is because you're using the data itself as supervision. So, because the data serves as its own source of supervision, it's self-supervised in that way. Now, the reason a lot of people ... I mean, we did it in that blog post with Yann, but a lot of other people have also argued for using this term self-supervised. So, starting from like '94 from Virginia de Sa's group, uh, at I think UCSD, and now, now she's at UCSD. Uh, Jitendra Malik has sa- said this a bunch of times as well. So, you have supervised, and then unsupervised basically means everything which is not supervised, but that includes stuff like semi-supervised, that includes other ... Like transductive learning, lots of other sort of settings. So, that's the reason like now people are preferring this term self-supervised because it explicitly says what's happening. The data itself is the source of supervision, and any sort of learning algorithm which tries to extract, just, sort of, supervision signals from the data itself is a self-supervised learning algorithm.
- LFLex Fridman
But there is within the data a set of tricks which unlock the supervision.
- IMIshan Misra
Right.
- LFLex Fridman
So, can you give maybe some examples? And th- there's a- there's innovation, ingenuity required to unlock that supervision.
- IMIshan Misra
Right.
- LFLex Fridman
The data doesn't just speak to you some ground truth. You have to do some kind of trick.
- IMIshan Misra
Right.
- LFLex Fridman
Uh, so I don't know what your favorite domain is. So, you specifically s- specialize in visual learning, but is there favorite examples maybe in language or other domains?
- IMIshan Misra
Perhaps the most successful applications have been in, uh, NLP, natural language processing. So, the idea basically being that you can train models that can ... Uh, you have a sentence and you mask out certain words, and now these models learn to predict the masked out words. So, if you have like, "The cat jumped over the dog." So, you can basically mask out cat and now you're essentially asking the model to predict, "What was missing? What did I mask out?" So, the model is going to predict basically a distribution over all the possible words that it knows. And probably it has, like if it's a- a well-trained model, it has a h- sort of higher probability density for this word cat. For vision, I would say the sort of more, uh, I mean, the easier example which is not as widely used these days, uh, is basically say for example video prediction.
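The masked-word trick Ishan describes can be sketched with a toy corpus: every word in every sentence becomes a free training label by hiding it and asking a model to recover it from the context. The three sentences and the count-based "predictor" below are invented for illustration; they stand in for a real corpus and a real language model.

```python
from collections import Counter

corpus = [
    "the cat jumped over the dog",
    "the dog jumped over the fence",
    "the sheep jumped over the fence",
]

def masked_examples(sentence):
    """Turn one sentence into (context, target) pairs by masking each word."""
    words = sentence.split()
    return [
        (" ".join(words[:i] + ["[MASK]"] + words[i + 1:]), words[i])
        for i in range(len(words))
    ]

# A stand-in "model": count which words fill each masked context in the corpus.
context_counts = {}
for sentence in corpus:
    for context, target in masked_examples(sentence):
        context_counts.setdefault(context, Counter())[target] += 1

def predict(context):
    """Return the most frequent filler word for a masked context."""
    counts = context_counts.get(context, Counter())
    return counts.most_common(1)[0][0] if counts else None

guess = predict("the [MASK] jumped over the fence")  # "dog" or "sheep"
```

No human labeled anything: the supervision is generated mechanically from the raw sentences, which is exactly what makes the objective scale.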
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
So, video is again a sequence of things. So, you can ask the model, so if you have a video of say 10 seconds, you can feed in the first nine seconds through a model and then ask it, "Hey, what happens e- basically in the 10th second? Can you predict what's going to happen?" And the idea basically is because the model is predicting something about the data itself, of course you didn't n- didn't need any human to tell you what was happening because the 10-second video was naturally captured. Because the model is predicting what's happening there, it's g- uh, going to automatically learn something about the structure of the world. How objects move, object permanence, and these kinds of things. Uh, so like if I have something at the edge of the table, it'll fall down. Uh, things like these which you really don't have to sit and annotate. In a supervised learning setting, I would have to sit and annotate, "This is a cup. Now I move this cup. This is still a cup. And now I move this cup, it's still a cup and then it falls down, and this is a fallen down cup." So, I won't have to annotate all of these things in a self-supervised setting.
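The nine-seconds-in, tenth-second-out setup can be sketched like this. The 1-D "frames" (positions of an object moving at constant velocity) and the extrapolating "predictor" are toy stand-ins for real video frames and a learned model; the point is only that the training pair comes from the clip itself.

```python
import numpy as np

def make_clip(x0, v, length=10):
    """A toy 'video': the 1-D position of an object at each time step."""
    return np.array([x0 + v * t for t in range(length)], dtype=float)

def split_clip(clip):
    """Self-supervised pair: (first frames, last frame) from raw video."""
    return clip[:-1], clip[-1]

def predict_next(frames):
    # Stand-in predictor: assume the last observed velocity continues.
    return frames[-1] + (frames[-1] - frames[-2])

clip = make_clip(x0=0.0, v=2.0)          # positions 0, 2, 4, ..., 18
inputs, target = split_clip(clip)
assert predict_next(inputs) == target    # the clip supplied its own label
```

A learned predictor that does well on millions of such pairs has to pick up regularities like the ones Ishan lists: how objects move, that things at the edge of a table fall, and so on.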
- LFLex Fridman
Isn't that kind of a brilliant little trick of taking a series of data that is consistent and removing one element in that series, and then, uh, teaching the algorithm to predict that element? Isn't that ... First of all, that's quite brilliant. Um, it seems to be applicable in anything that, um, has the constraint of being a- a- a sequence that is consistent with the physical reality.
- IMIshan Misra
Right.
- LFLex Fridman
Um, the question is, are there other tricks like this that can generate the, uh, self-supervision signal?
- IMIshan Misra
So, sequence is possibly the most widely used one in NLP. For vision, the one that is actually used for like images which is very popular these days, is basically taking an image and now taking different crops of that image. So, you can basically decide to crop say the top left corner and you crop say the bottom right corner, and asking the network to basically, uh, present it with a choice saying that, "Okay, now you have, uh, this image, you have this image. Are these the same or not?" And so the idea basically is that because different cro- like in an image, different parts of the image are going to be related. So, for example, if you have a chair and a table, uh, basically these things are going to be close by. Uh, versus if you take, uh, again, if you have like a zoomed in picture of a chair, if you're taking different crops, it's going to be different parts of the chair. So, the, uh, idea basically is that different crops of the image are related and so the features or the representations that you get from these different crops should also be related. So, this is possibly the most like widely used trick, uh, these days for, uh, self-supervised learning in computer vision.
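The cropping trick can be sketched as follows: two random crops of the same image form a "positive" pair whose representations should match, while crops of different images should not. The 8×8 "images", the mean-pixel "encoder", and the similarity function are all toy stand-ins for real images, a trained network, and a contrastive objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(image, size=4):
    """Take a random size x size patch of a 2-D 'image'."""
    h, w = image.shape
    y, x = rng.integers(0, h - size), rng.integers(0, w - size)
    return image[y:y + size, x:x + size]

def encode(crop):
    return crop.mean()          # stand-in for a learned feature extractor

def similarity(a, b):
    return -abs(a - b)          # higher means more similar

# Two toy "images" with distinct appearance, plus a little pixel noise.
bright = np.full((8, 8), 0.9) + rng.normal(0, 0.01, (8, 8))
dark = np.full((8, 8), 0.1) + rng.normal(0, 0.01, (8, 8))

pos = similarity(encode(random_crop(bright)), encode(random_crop(bright)))
neg = similarity(encode(random_crop(bright)), encode(random_crop(dark)))
assert pos > neg  # same-image crops should look more alike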
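```
In a real system the encoder is the thing being trained: the loss pushes `pos` up and `neg` down, so the network learns features under which related crops of one scene agree.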
- LFLex Fridman
So again, using the, uh, consistency that's inherent to physical reality in- in visual domain that's, you know, parts of an image are consistent. And then in the, uh, language domain or s- anything that has sequences like language or something that's like a time series that you can chop off parts in time.
- IMIshan Misra
Right.
- LFLex Fridman
It's similar to the story of, uh, RNNs and CNNs. Um, of RNNs and ConvNets.
- 11:02 – 14:54
Self-supervised learning is the dark matter of intelligence
- LFLex Fridman
You and Yann LeCun wrote the blog post in March 2021 titled Self-Supervised Learning: The Dark Matter of Intelligence. Can you summarize this blog post and maybe explain the main idea or set of ideas?
- IMIshan Misra
The blog post was mainly about sort of just telling ... I mean, this is really a, uh, accepted fact I would say for a lot of people now that self-supervised learning is something that is going to be ... Uh, play an important role for machine learning algorithms that come in the future, and even now.
- LFLex Fridman
Uh, let me just comment that, uh, we don't yet have a good understanding what dark matter is.
- IMIshan Misra
(laughs) That's true.
- LFLex Fridman
So- (laughs)
- IMIshan Misra
(laughs) So, the idea basically being-
- LFLex Fridman
So, maybe the metaphor doesn't exactly transfer but maybe it- maybe this actually perfectly transfers that we don't know. We have a- we have an inkling that it'll be a big part of whatever solving intelligence looks like.
- IMIshan Misra
Right. So, I think self-supervised learning the way it's done right now is I would say like the first step towards what it probably should end up like learning or what it should enable us to do.
- LFLex Fridman
Yeah.
- IMIshan Misra
So, the idea for, uh, that particular piece was ... Self-supervised learning is going to be a very powerful way to learn common sense about the world, or like stuff that is really hard to label. For example, like is this, uh, piece over here heavier than the cup? Now, for all these kinds of things you'll have to sit and label these things, so supervised learning is clearly not going to scale. So, what is the thing that's actually going to scale? It's probably going to be an agent that can either actually interact with it, so lift it up, or, uh, observe me doing it. So, if I'm basically lifting these things up it can probably reason about, "Hey, this is taking him more time to lift up," or, "The velocity's different." Where- Whereas the velocity for this is different, probably this one is heavier. So, essentially by observations of the data you should be able to infer a lot of things about the world without someone explicitly telling you, "This is heavy. This is not. Uh, this is something that can pour. This is something that cannot pour. This is somewhere that you can sit. This is not somewhere that you can sit."
- LFLex Fridman
But you just mentioned the ability to interact with the world. There's so many questions that are yet to be... that are still open which is, how do you select a set of data over which the self-supervised, uh, learning process works? How much interactivity, like in the active learning or the machine teaching context is there? What are the reward signals? Like how much actual interaction there is with the physical world, that kind of thing.
- IMIshan Misra
Right.
- LFLex Fridman
Uh, so that- that's a... that could be a huge que- and then on top of that, which I have a million questions about, which we don't know the answers to but it's worth talking about, is how much reasoning is involved? How much accumulation of knowledge versus something that's more akin to learning, or whether that's the- the same thing. But, so we're like... it is truly dark matter.
- IMIshan Misra
We don't know how exactly to do it-
- LFLex Fridman
Yeah.
- IMIshan Misra
... but we are... I mean a lot of us are actually convinced that it's going to be a sort of major thing in machine learning and-
- LFLex Fridman
So let me reframe it then, that- that human supervision cannot be at large scale, the source of the solution to intelligence.
- IMIshan Misra
Right.
- LFLex Fridman
So there has... we... the machines have to discover the supervision in the natural signal of the world.
- IMIshan Misra
Right. I mean the other thing is also that humans are not particularly good labelers, they're not very consistent. Uh, for example, like what's the difference between a dining table and a table? Is it just the fact that one... like if you just look at a particular table, what makes us say one is a dining table and the other is not? Uh, humans are not particularly consistent, they're not like very good sources of supervision for a lot of these kind of edge cases. So, it may be also the fact that if we want a... like want an algorithm or want a machine to solve a particular task for us, we can maybe just specify the end goal. Uh, and like the stuff in between, uh, we really probably should not be specifying because we're not... maybe you're going to confuse it a lot actually.
- LFLex Fridman
Well, humans can't even answer the meaning of life so we don't... I'm not sure if we're good supervisors of
- 14:54 – 23:28
Categorization
- LFLex Fridman
the end goal either. So let me ask you about categories. Humans are not very good at telling the difference between what is and isn't a table, like you mentioned.
- IMIshan Misra
Right.
- LFLex Fridman
Um, do you think it's possible... let me, let me ask you like a... pretend you're Plato. Um, is, is it possible to create a pretty good taxonomy of objects in the world? It seems like a lot of approaches in machine learning kind of assume a hopeful vision that it's possible to construct a perfect taxonomy. Or it exists, perhaps out of our reach, but we can always get closer and closer to it. Or is that a hopeless pursuit?
- IMIshan Misra
I think it's hopeless in some way. So the thing is, for any particular categorization that you create, if you have a discrete sort of categorization, I can always take the nearest two concepts or I can take a third concept and I can blend it in, and I can create a new category.
- LFLex Fridman
Yeah.
- IMIshan Misra
So if you were to enumerate N categories, I will always find an N plus one category for you, that's not going to be in the N categories. And I can actually create not just N plus one, I can very easily create far more than N categories. The thing is, uh, a lot of things we talk about are actually compositional, so it's really hard for us to come and sit- sit and enumerate all of these out. And they come across in various weird ways, right? Like you have... like a croissant and a donut come together to form a cronut.
- LFLex Fridman
Yeah.
- IMIshan Misra
So if you were to like enumerate all the foods up until, I don't know, whenever the cronut was, about 10 years ago or 15 years ago, then this entire thing called cronut would not exist.
- LFLex Fridman
Yeah, I remember there was a... the most awesome video of a cat wearing a monkey costume.
- IMIshan Misra
(laughs) Yeah. Yes.
- LFLex Fridman
(laughs) People should look it up. It's great. So is that a monkey, is, uh, or is that a cat? It's a very difficult philosophical question. So there is a concept of similarity between objects. So you think that can take us very far, just kind of getting a good function, a good way to tell which parts of things are similar and which parts of things are very different?
- IMIshan Misra
I think so, yeah. So you don't necessarily need to name everything or assign a name to everything to be able to use it, right? So a lot... there are like lots of-
- LFLex Fridman
Shakespeare said that. "What's in a name?"
- IMIshan Misra
What's in a name? Yeah.
- LFLex Fridman
Yeah, okay.
- IMIshan Misra
And I mean, a lot- lots of like, for example, animals, right? They don't have necessarily a well-formed like syntactic language but they're able to go about their day perfectly. The same thing happens for us. So I mean, we probably look at things and we figure out, oh, this is similar to something else that I've seen before and then I can probably learn how to use it. So I haven't seen all the possible doorknobs in the world.
- LFLex Fridman
Yes.
- IMIshan Misra
But if you show me... like I was able to get into this particular place fairly easily. I've never seen that particular doorknob. So I of course relate it to all the doorknobs that I've seen and I know exactly how it's going to open, or I have a pretty good idea of how it's going to open. And I think this kind of translation between experiences only happens because of similarity, because I'm able to relate it to a doorknob. If I related it to a hairdryer, I would probably be stuck still outside, not able to get in. (laughs)
- LFLex Fridman
(laughs) Again, a bit of a philosophical question, but is... can similarity take us all the way to understanding a thing? Can having a good function that compares objects get us to understand something profound about singular objects?
- IMIshan Misra
I think I'll ask you a question back. What does it mean to understand objects?
- LFLex Fridman
Well, let me tell you what that's similar to. No. (laughs)
- IMIshan Misra
(laughs)
- LFLex Fridman
Uh, (laughs) I, so there is, there is an idea of sort of reasoning by analogy kind of thing. I think understanding is the process of placing that thing in some kind of network of knowledge that you have, that it- it perhaps is fundamentally related to other concepts. So, it's like understanding is fundamentally, uh, a composition of other concepts and maybe in relation to other concepts. Um, and maybe, like, deeper and deeper understanding is maybe just adding more edges to that, uh, to that graph somehow. Uh, so maybe it is a composition of similarities. I mean, ul- ultimately, it is ... I suppose it is a kind of embedding in that wisdom space. (laughs)
- IMIshan Misra
(laughs) Yeah. All right. Okay. Wisdom space is good. Uh-
- LFLex Fridman
(laughs)
- IMIshan Misra
... I think, I do think, right, so similarity does get you very, very far.
- LFLex Fridman
Yeah.
- IMIshan Misra
Is it the answer to everything? I mean, I don't even know what everything is, but it's going to take us really far, um, and I think ... The thing is, things are similar in very different contexts, right? So, an elephant is similar to, I don't know, another sort of wild animal, let's just pick, I don't know, lion, in s- in a different way because they're both four-legged creatures, uh, they're also land animals, but of course they're very different in a lot of different ways. So, elephants are like herbivores, um, lions are not. So, similarity does, uh, similarity and particularly dissimilarity also sort- uh, actually helps us understand a lot about things. And so that's actually why I think discrete categorization is very hard, just like forming this particular category of elephant and a particular category of lion, maybe it's good for like, just like taxonomy, biological taxonomies. But when it comes to like other things which are not as, maybe, uh, for example, like grilled cheese, right?
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
I have a grilled cheese, I dip it in tomato and I keep it outside. Now, is that still a grilled cheese or is that something else?
- 23:28 – 27:12
Is computer vision still really hard?
- LFLex Fridman
Can I ask you just an, uh, out there question? I remember, I think, uh, I think Andrej Karpathy had a blog post about computer vision, uh, like being really hard.
- IMIshan Misra
Mm-hmm.
- LFLex Fridman
I forgot what the title was, but it was many, many years ago. And he had, I think President Obama stepping on a scale and there was humor, and there was a bunch of people laughing and whatever. And, uh, the interesting ... There's a lot of interesting things about that image, and I think Andrej highlighted a bunch of things about the image that us humans are able to immediately understand, like the idea, I think, of gravity and-
- IMIshan Misra
Yeah.
- LFLex Fridman
... that you can, you have the concept of a weight, you have a, you immediately project, uh, because of our knowledge of pose and how human bodies are constructed, you understand how the forces are being applied with the human body. Uh, the really interesting other thing that you're able to understand, there's multiple people looking at each other in the image.
- IMIshan Misra
(laughs)
- LFLex Fridman
Uh, you're able to have a mental model of what the people are thinking about. You're able to infer like, "Oh, this person is probably thinks, like is laughing at how humorous the situation is, and this person is confused about what the situation is because they're looking this way." We're able to infer all of that. So, that's human vision. How difficult is computer vision? Like, in order to achieve that level of understanding. And maybe how big of a part does self-supervised learning play in that, do you think? And do you still, you know, back, that was like over a decade ago, I think Andrej and I think a lot of people agreed is, (laughs) computer vision is really hard.
- IMIshan Misra
Yeah.
- LFLex Fridman
Do you still think computer vision is really hard?
- IMIshan Misra
I think it is, yes. And getting to that kind of understanding, uh, I mean, it's really out there. So, if you ask me to solve just that particular problem, I can do it the supervised learning route. I can always construct a data set and basically predict, "Oh, is there humor in this or not?"
- LFLex Fridman
(laughs)
- IMIshan Misra
(laughs) Of course, I can do it.
- LFLex Fridman
Actually, that's a good question. Do you think you can... Okay, okay, do you think you can do human-supervised annotation of humor?
- IMIshan Misra
To some extent, yes. I'm sure it'll work. I mean, it won't be, it won't be as bad as like randomly guessing. I'm sure it can still predict whether it's humorous or not, in some way.
- LFLex Fridman
Yeah, maybe like Reddit upvotes is the signal. I don't know.
- IMIshan Misra
Right.
- LFLex Fridman
Okay.
- IMIshan Misra
I mean, it won't do a great job, but it'll do something.
- LFLex Fridman
Right.
- IMIshan Misra
It may actually be like it may find certain things which are not humorous, humorous as well, which is going to be bad for us, but, I mean, it'll do a, it wouldn't be random.
- LFLex Fridman
Yeah, kind of like my sense of humor.
- IMIshan Misra
(laughs)
- LFLex Fridman
Okay, so, uh, fine. So, you can solve that particular problem, yes. But the general problem you're saying is hard.
- IMIshan Misra
The general problem is hard. And, I mean, self-supervised learning is not the answer to everything. Of course, it's not. I think, uh, if you have machines that are going to communicate with humans at the end of it, you want to understand what the algorithm is doing, right? You want it to be able to like produce an output that you can decipher, that you can understand, or it's actually useful for something else, which again, is a human... So, at, at some point in this sort of entire loop, a human steps in.
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
And now, this human needs to understand what, what's going on. So at, and at that point, this entire notion of language or semantics really comes in. If the machine just spits out something and if we can't understand it, then it's not really that useful for us. So, self-supervised learning is probably going to be useful for a lot of the things before that part.
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
Before the machine really needs to communicate a particular kind of output with a human. Um, because, I mean, otherwise, how, how is it going to do that without language?
- LFLex Fridman
Or some kind of communication. But you're saying that it's possible to build a big base of understanding or whatever, of, um, what's a better w-
- IMIshan Misra
Concepts.
- 27:12 – 36:51
Understanding Language
- LFLex Fridman
on today. Can we take a, a little bit of a step back and look at language? Can you summarize the history of success of self-supervised learning in natural language processing, language modeling? What are transformers?
- IMIshan Misra
(laughs)
- LFLex Fridman
What is, uh, the masking, the sentence completion that you mentioned before? Um, how does it lead us to understand anything? Semantic meaning of words, syntactic role of words and sentences?
- IMIshan Misra
So, I'm, of course, not the expert in NLP. Uh, I kind of follow it, um, a little bit from the sides. So, the main sort of, uh, reason why all of this masking stuff works is, I think it's called the distributional hypothesis in NLP. The idea basically being that words that occur in the same context should have similar meaning. So, if you have the blank jumped over the blank, it basically, uh, whatever is like in the first blank is basically an object that can actually jump, is going to be something that can jump. So, a cat or a dog or, I don't know, sheep, something, all of these things can basically be in that particular context. And now, so essentially, the idea is that if you have words that are in the same con- context and you predict them, you're going to learn, uh, a lot of useful things about how words are related because you're predicting by looking at their context what the word is going to be. So, in this particular case, the blank jumped over the fence. So now, if it's a sheep, the sheep jumped over the fence, the dog jumped over the fence. So essentially, the ya- algorithm or the representation basically puts together these two concepts together. So, it says, "Okay, dogs are going to be kind of related to sheep because both of them occur in the same context." Of course, now you can decide depending on your particular application downstream. You can say that dogs are absolutely not related to sheep because, well, I don't, I really care about, you know, dog food, for example. I'm a dog food person and I really want to give this dog food to this particular animal. So, depending on what your downstream application is, of course, this notion of, uh, similarity, or this common sense that you've learned, may not be applicable. But the point is basically that this, um, just predicting what the blanks are is going to take you really, really far.
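The distributional hypothesis Ishan invokes can be sketched directly: represent each word by the contexts it appears in, and words that fill the same blanks ("dog", "sheep") come out similar. The tiny hand-made corpus and the crude overlap score are purely illustrative stand-ins for real text and learned embeddings.

```python
from collections import Counter

corpus = [
    "the dog jumped over the fence",
    "the sheep jumped over the fence",
    "the dog chased the ball",
    "the sheep ate the grass",
]

def context_vector(word):
    """Count the (left, right) neighbor pairs the word occurs between."""
    vec = Counter()
    for sentence in corpus:
        words = sentence.split()
        for i, w in enumerate(words):
            if w == word:
                left = words[i - 1] if i > 0 else "<s>"
                right = words[i + 1] if i < len(words) - 1 else "</s>"
                vec[(left, right)] += 1
    return vec

def overlap(a, b):
    """Shared-context count: a crude similarity between two words."""
    return sum(min(a[k], b[k]) for k in a)

dog, sheep, fence = map(context_vector, ["dog", "sheep", "fence"])
assert overlap(dog, sheep) > overlap(dog, fence)
```

This is also where his caveat bites: "dog" and "sheep" end up related here whether or not that relation suits a given downstream task.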
- LFLex Fridman
And so, the- there's a nice feature of language that the number of words in a particular language is very large, but it's finite and it's actually not that large in the grand scheme of things. I, I still gotta... because we take it for granted. So first of all, when you say masking, you're talking about this very process of the blank, of removing words from a sentence and then having the knowledge of what word went there in the ini- initial data set. That's the ground truth that you're training on and then you're asking the neural network to predict what goes there. That, that's, that's like a little trick.
- IMIshan Misra
Yeah.
- LFLex Fridman
It's a really powerful trick.
- IMIshan Misra
Yeah.
- LFLex Fridman
The question is how far that takes us, and the other question is, is there other tricks? 'Cause it, to me, it's very possible there's other very fascinating tricks. I'll give you an example in, um ... in autonomous driving, there's a bunch of tricks-
- IMIshan Misra
Right.
- LFLex Fridman
... that give you the self-supervised signal back. For example, I mean, very similar to sentences, but not really, which is you have signals from humans driving the car-
- IMIshan Misra
Mm-hmm.
- LFLex Fridman
... because a lot of us drive cars to places, and so you can ask the neural network to predict what's going to happen in the next two seconds for a safe navigation through the environment, and the, and the signal is, comes from the fact that you also have knowledge of what happened in the next two seconds because you have video of the data.
- IMIshan Misra
Right.
- LFLex Fridman
The question in autonomous driving, as it is in language, can we learn how to drive autonomously based on that kind of self-supervision? Probably the answer is no. The question is, how good can we get?
- IMIshan Misra
Right.
- LFLex Fridman
And the same with language, how good can we get? And are there other tricks? Like, we get sometimes super excited by this trick that works really well, but I wonder... It's almost like mining for gold. I wonder how many signals there are in the data that could be leveraged-
- IMIshan Misra
Right.
- LFLex Fridman
... that are, like, there, right? Is, is that, uh... I just wanted to kind of linger on that because sometimes it's easy to think that maybe this masking process is self-supervised learning.
- IMIshan Misra
Right.
- LFLex Fridman
No, it's only-
- IMIshan Misra
One.
- LFLex Fridman
... one method. Uh, so there could be many, many other methods, many tricky, uh, methods, maybe interesting ways to leverage human computation in very interesting ways that might actually border on semi-supervised learning, something like that. Uh, obviously, the internet is generated by humans-
- IMIshan Misra
Yeah.
- LFLex Fridman
... at the end of the day. So, all that to say is, what's your sense, in this particular context of language, how far can that, uh, masking process take us?
- IMIshan Misra
So, it has stood the test of time, right? I mean, so Word2vec, uh, the initial sort of, uh, NLP technique that was using this, to now, for example, like all the BERT and all these, uh, big models that we get, um, BERT and RoBERTa, for example. All of them are still sort of based on the same principle of masking. It's taken us really far. I mean, you can actually do things like, oh, these two sentences are similar or not, whether this particular sentence follows this other sentence in terms of logic, so entailment. You can do a lot of these things with this, just this masking trick.
- LFLex Fridman
Yeah.
- IMIshan Misra
Um, so I'm not sure if I can predict how m- how far it can take us, because, like, when it first came out, when like Word2vec was out, uh, I don't think a lot of us would have imagined that this would actually help us do some kind of, like, entailment problems and really that well. And so just the fact that by just scaling up the amount of data that we're training on and, like, using better and more powerful neural network architectures has taken us from that to this is just showing you how maybe poor predictors we are. Like, uh, h- as humans, how poor we are at predicting how successful a particular technique is going to be. So, I think I can say something now, but, like, 10 years from now, I'll look completely stupid (laughs) basically predicting this.
- LFLex Fridman
In the language domain, is there something in your work that you find useful and insightful and, and, um, transferable to computer vision, but also just, I don't know, beautiful and profound that I think carries through to the vision domain?
- IMIshan Misra
I mean, the idea of masking has been very powerful. It has been used in vision as well for predicting, like you say, the next, uh, sort of, if you have N sort of frames, then you predict what going, what's going to happen in the next frame. So, that's been very powerful. In terms of modeling, like, in just terms, in terms of architecture, I think you had asked about transformers-
- 36:51 – 43:36
Harder to solve: vision or language
- LFLex Fridman
What is harder to solve, vision or language? Visual intelligence or linguistic intelligence?
- IMIshan Misra
So, I'm going to say computer vision is harder. My reason for this is basically that, uh, language of course has a big structure to it because we developed it, uh, whereas vision is something that is common in a lot of animals. Everyone is able to get by... a lot of these animals on Earth are actually able to get by without language.
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
And we... a lot of these animals we also deem to be intelligent. So, clearly intelligence, uh, does have like a visual component to it. And yes, of course in the case of humans it of course also has a linguistic component, but it means that there is something far more fundamental about vision than there is about language. And I'm sorry to anyone who disagrees, but yes, this is what I feel. (laughs)
- LFLex Fridman
So, that's being a little bit reflected in, um, the challenges that have to do with the, the progress of self-supervised learning, would you say? Or, is that just a peculiar accident of the progress of the AI community that we focused on lang- or we discovered self-attention and transformers in the context of language first?
- IMIshan Misra
So, like, the self-supervised learning success was actually, uh, for vision has not much to do with the transformers part. I would say it's actually been independent-
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
... a little bit. I think it's just that the signal was a little bit different for, uh, vision than there was for like NLP and probably NLP ac- yeah, yeah, folks discovered it before. So for vision, the main success has basically been this like crops so far-
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
... like taking different crops of images. Uh, whereas for NLP it was this masking thing.
- LFLex Fridman
But also the level of success is still much higher for language.
- IMIshan Misra
Yes, it has. Uh, so that has a lot to do with... I mean, I can get into a lot of details.
- LFLex Fridman
Let's go.
- IMIshan Misra
For this particular question, let's go for it. Okay. So, the first thing is language is very structured, so you are going to produce a distribution over a finite vocabulary.
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
English has a finite number of words. It's actually not that large. Uh, and you'll need to produce basically f- when you're doing this masking thing, all you need to do is basically tell me which one of these like 50,000 words it is.
- LFLex Fridman
Yeah.
- IMIshan Misra
That's it. Now for vision, let's imagine doing the same thing, okay? We're basically going to blank out a particular part of the image and we ask the network, or this neural network, to predict what is present in this missing patch. It's combinatorially large, right? You have 256 pixel values. If you're even producing basically a seven cross seven or a 14 cross 14 like window of pixels, at each of these 49 or each of these 196 locations, you have 256 values to predict.
- LFLex Fridman
Yeah.
- IMIshan Misra
And so it's really, really large. And very quickly, the kind of like, uh, prediction problems that we are setting up are going to be extremely like intractable for us. And so the thing is, for NLP, it has been really successful because we are very good at predicting, like doing this like distribution over a finite set. And the problem is when this set becomes really large, we're, we're going to become really, really bad at making these predictions and at solving basically this particular set of problems.
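The back-of-the-envelope arithmetic behind this blow-up is easy to check, using the 50,000-word vocabulary and the 7×7 patch from the discussion:

```python
# A language model's softmax covers a ~50,000-word vocabulary; a naive
# softmax over all 7x7 grayscale patches, with 256 intensity values per
# pixel, would need to distinguish 256**49 outcomes.
vocab_size = 50_000
patch_outcomes = 256 ** (7 * 7)

print(patch_outcomes > vocab_size)  # True
print(len(str(patch_outcomes)))     # 119 -- a number with 119 digits
```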
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
So, if you were to do it exactly in the s- uh, same way as NLP for vision, there is very limited success. The way stuff is working right now is actually not by predicting these masks. It's basically by saying that you take these two like crops from the image, you get a feature representation from it, and just saying that these two features... so they're like vectors, just saying that the distance between these vectors should be small. And so it's a very different way of learning, uh, from the visual signal than there is from NLP. Okay, the other reason is the distributional hypothesis that we talked about for NLP, right? So, a word given its context, basically, the context actually supplies a lot of meaning to the word.
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
Now, because there are just finite number, finite number of words and there is a finite way in like which we compose them, of course, uh, the same thing holds for pixels, but in language there's a lot of structure, right? So I always say whatever, "The dash jumped over the fence," for example. There are lots of these sentences that you'll get, and from this you can actually look at this particular sentence might occur in a lot of different contexts as well. This exact same sentence might occur in a different context. So, "The sheep jumped over the fence." "The cat jumped over the fence." "The dog jumped over the fence." So you immediately get a lot of these words which are... because this particular token itself has so much meaning, you get a lot of these tokens or these words which are actually going to have a, have sort of this related meaning across, given this context. Whereas for vision it's much harder because just by like pure, like the way we capture images, lighting can be different. Um, there might be like different noise in the sensor. So the thing is you're capturing a physical phenomenon and then you're basically g- going through a very complicated pipeline-
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
... of like image processing and then you're translating that into some kind of like digital signal.
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
Whereas with language, you write it down and you transfer it to a digital signal almost like it's a lossless like transfer.
- LFLex Fridman
Yeah.
- IMIshan Misra
And each of these tokens are very, very well defined.
- 43:36 – 47:37
Contrastive learning & energy-based models
- LFLex Fridman
- IMIshan Misra
Right.
- LFLex Fridman
This might be a good place to, uh, you already mentioned it, but what is contrastive learning and what are energy-based models?
- IMIshan Misra
Contrastive learning is sort of the, a paradigm of learning where the idea is that you are learning this embedding space, or so you're learning this sort of vector space of all your concepts, and the way you learn that is basically by contrasting. So, the idea is that you have a sample, you have another sample that's related to it, so that's, uh, called a positive, and you have another sample that's not related to it, so that's negative. So, let's skip an NLP one and take a simple example in, uh, computer vision. So, you have an image of a cat, you have an image of a dog, and for whatever application that you're doing, say you're trying to figure out what, uh, pets are, you're saying that these two images are related. So, image of a cat and dog are related, but now you have another third image of a banana, uh, because you don't like that word.
- LFLex Fridman
Thank you.
- IMIshan Misra
(laughs) So now you basically have this banana-
- LFLex Fridman
Thank you for speaking to the crowd.
- IMIshan Misra
And so you take both of these images and you take the image from the cat, the image from the dog, you get a feature from both of them.
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
And now what you're training the network to do is basically, uh, pull both of these features together while pushing them away from the feature of a banana.
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
So this is the contrastive part. So, you're contrasting against the banana. So, there's always this notion of a negative and a positive.
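This pull-together, push-apart objective can be sketched with an InfoNCE-style loss. The three feature vectors below are hypothetical stand-ins for network outputs, and the loss is a minimal illustration rather than any particular paper's implementation:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def contrastive_loss(anchor, positive, negative, temp=0.1):
    """InfoNCE-style loss: low when the anchor is close to the positive
    and far from the negative."""
    pos = math.exp(cosine(anchor, positive) / temp)
    neg = math.exp(cosine(anchor, negative) / temp)
    return -math.log(pos / (pos + neg))

cat = [1.0, 0.9, 0.1]     # hypothetical features, not real network outputs
dog = [0.9, 1.0, 0.2]     # the positive: related to cat
banana = [0.1, 0.2, 1.0]  # the negative: unrelated

# Treating dog as the positive gives a small loss; swapping the roles
# of positive and negative gives a large one.
print(contrastive_loss(cat, dog, banana) < contrastive_loss(cat, banana, dog))  # True
```

Minimizing this loss over many triples is what pulls related features together while pushing unrelated ones apart.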
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
Now, energy-based models are, like, like one way that, uh, Yan sort of explains a lot of these methods. So, uh, Yan basically, I think a couple of years or more than that, like when I joined Facebook, uh, Yan used to keep mentioning this word, energy-based models. And of course, I had no idea what he was talking about.
- LFLex Fridman
Yeah.
- IMIshan Misra
So then one day, I caught him in one of the conference rooms and I'm like, "Can you please tell me what this is?" So then, like very patiently, he sat down with, like a marker and a whiteboard. And his idea basically is that rather than talking about probability distributions, you can talk about energies of models. So, models are trying to minimize certain energies in certain space, or they're trying to maximize a certain kind of, uh, energy. And the idea basically is that you can explain a lot of the contrastive models, GANs, for example, which are like generative adversarial networks. Uh, a lot of these modern learning methods, or VAEs, which are variational autoencoders, you can really explain them very nicely in terms of an energy function that they're trying to minimize or maximize. And so by putting this common sort of language for all of these models, what looks very different in machine learning that, oh, VAEs are very different from what GANs are-
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
... or very, very different from what contrastive models are, you actually get a sense of like, oh, these are actually very, very related.
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
It's just that the v- way or the mechanism in which they're, uh, sort of maximizing or minimizing this energy function is slightly different.
- LFLex Fridman
So, revealing the, the, the commonalities between all these approaches-
- IMIshan Misra
Right.
- LFLex Fridman
... and putting a sexy word on top of it, like energy. And so similarities, so- two things that are similar have low energy, like low energy signifying similarity.
- IMIshan Misra
Right. Exactly. So, basically the idea is that if you were to imagine, like the embedding as a manifold, a 2D manifold, you would get a hill or like a high sort of peak in the energy manifold wherever two things are not related, and basically you would have like a dip where two things are, are related. So, you'd get a dip in the energy.
- LFLex Fridman
And, uh, in the self-supervised context, how do you know two things are related and two things are not related?
- IMIshan Misra
Right. So, this is where all the sort of ingenuity or tricks comes in, right? So, for example, like, uh, you can take the fill-in-the-blank problem or you can take in, like the context problem and you, what you can say is two words that are in the same context are related, two words that are in different contexts are not related. For images, basically two crops from the same image are related and, whereas a third image is not related at all. Or for a video, it can be two frames from that video are related because they're likely to c- contain the same sort of concepts in them. Whereas a third frame from a different video is not related. So, it basically is, it's a very general term. Contrastive learning has nothing really to do with self-supervised learning. It actually is very popular in, for u- for example, like any kind of metric learning or any kind of embedding learning. So, it's also used in supervised learning. It's als-... And the thing is because we are not really using labels to get these positive or negative pairs, uh, it can basically also be used for self-supervised learning.
- 47:37 – 51:57
Data augmentation
- IMIshan Misra
- LFLex Fridman
So, you mentioned one of the ideas in the vision context to, uh, that works is to have different crops. So, you could think of that as a way to sort of, uh, manipulating the data-
- IMIshan Misra
Right.
- LFLex Fridman
... to generate, uh, examples that are similar. Obviously, uh, there's a bunch of other techniques. You mentioned lighting as a very, you know, i- in images, lighting is something that varies a lot, and you can artificially, uh, change those kinds of things. There's the whole broad field of data augmentation which, uh, manipulates images in order to increase arbitrarily the size of the dataset. First of all, what is data augmentation? And second of all, what's the role of data augmentation in self-supervised learning and contrastive learning?
- IMIshan Misra
So, data augmentation is just a way, like you said, it's basically a way to augment the data. So, you have say, N samples, and what you do is you basically define some kind of transforms for the sample. So, you take your, say, image, and then you define a transform where you can just increase, say, the colors, like the colors or the brightness of the image, or increase or decrease the contrast of the image, for example.
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
Or take different crops of it. Uh, so data augmentation is just a process to, like, basically perturb the data, or, like, augment the data, right? And so, it has played a fundamental role for computer vision, uh, for self-supervised learning especially. The way most of the current methods work, contrastive or otherwise, is by taking an image, uh, in the case of images, uh, is by taking an image and then computing basically two perturbations of it. So, these can be two different crops of the image, uh, with, like, different types of lighting or different contrasts or different colors, so you j- jitter the colors a little bit, and so on. And now, the idea is basically because it's the same object or because it's, like, related concepts in both of these perturbations, you want the features from both of- both of these perturbations to be similar. So, now you can use a variety of different ways to enforce this constraint, like these features being similar. You can do this by contrastive learning. So, basically both of these things are positives. A third sort of image is negative. You can do this basically by, like, clustering, for example. You can say that both of these images should, uh, the features from both of these images should belong in the same cluster because they're related, whereas image ... like, another image should belong to a different cluster. So, there's a variety of different ways to basically enforce this particular constraint.
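The two-perturbation recipe can be sketched with a toy "image" represented as a plain list of pixel rows. Real pipelines use library transforms, but the shape of the computation is the same; the crop size and brightness range here are arbitrary:

```python
import random

def random_crop(img, size):
    """Take a random size x size crop from a 2D image (list of pixel rows)."""
    h, w = len(img), len(img[0])
    top = random.randrange(h - size + 1)
    left = random.randrange(w - size + 1)
    return [row[left:left + size] for row in img[top:top + size]]

def jitter_brightness(img, max_shift=20):
    """Shift every pixel by one random offset, clamped to [0, 255]."""
    shift = random.randint(-max_shift, max_shift)
    return [[min(255, max(0, p + shift)) for p in row] for row in img]

# A fake 8x8 grayscale image; two independently augmented views of the
# same image become the positive pair for contrastive (or clustering,
# or distillation) training.
image = [[(r * 8 + c) % 256 for c in range(8)] for r in range(8)]
view1 = jitter_brightness(random_crop(image, 5))
view2 = jitter_brightness(random_crop(image, 5))
print(len(view1), len(view2[0]))  # 5 5
```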
- LFLex Fridman
By the way, when you say features, it means there's a very large neural network that's extracting patterns from the image and the kind of patterns it extracts should be v- ... either identical or very similar.
- IMIshan Misra
Right.
- LFLex Fridman
That's what that means.
- IMIshan Misra
Right. So, the neural network basically takes in the image and then outputs a- a set of, like, basically a vector of, like-
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
... numbers, and that's the feature.
- LFLex Fridman
Right.
- IMIshan Misra
And you want this feature for both of these, like, different crops that you computed to be similar.
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
So, you want this vector to be identical, uh, in its, like, entries, for example.
- LFLex Fridman
Be, like, literally close in-
- IMIshan Misra
Yeah.
- LFLex Fridman
... this multi-dimensional space to each other.
- IMIshan Misra
Space. Right.
- LFLex Fridman
And like you said, close can mean part of the same cluster or something like that-
- IMIshan Misra
Right.
- LFLex Fridman
... in the- in this large space. First of all, that ... I wonder if there is connection to the way humans learn to this, almost like maybe subconsciously in order to understand a thing, you kind of have to see it from two, three, multiple angles. I wonder. There's a l- ... I have a lot of friends who are neuroscientists maybe or ... and, and cognitive scientists. I wo- I wonder if that's in there somewhere. Like, in order for us to place a concept in its proper place, we have to basically crop it in (laughs) all kinds of ways, uh, do basic data augmentation on it in whatever very clever ways that the brain likes to do.
- IMIshan Misra
Right.
- LFLex Fridman
Um, like spin it around in our minds somehow. That, that is very effective.
- IMIshan Misra
So, I think for some of them we, like, need to do it. So, like, babies for example, pick up objects, like move them, put them close to their face and whatnot.
- LFLex Fridman
Yeah.
- IMIshan Misra
But for certain other things actually, we are good at imagining it as well, right?
- LFLex Fridman
Yes.
- 51:57 – 1:00:10
- LFLex Fridman
that are occluded or not there, but not just, like, normal things, like wild things, but they're nevertheless physically consistent.
- IMIshan Misra
So, s- ... I mean, people do kind of, like, o- occlusion-based augmentation as well.
- LFLex Fridman
Yeah.
- IMIshan Misra
So, you place in, like, a random, like, box, gray box to sort of mask out a certain part of the image, and the thing is basically you're kind of occluding it. For example, you place it, say, on half of a person's face, so basically saying that, you know, something below their nose is occluded-
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
... 'cause it's grayed out.
- LFLex Fridman
No.
- IMIshan Misra
Uh, so this is kind of-
- LFLex Fridman
No. I meant, like, you have, like, what is it? A table and you can't see behind the table and, and you imagine there's a bunch of, uh, elves with bananas behind the table.
- IMIshan Misra
(laughs)
- LFLex Fridman
Like, I wonder if it's useful to have, uh, a wild imagination for the network, because that's possible. Well, maybe not elves, but, like, puppies and kittens or something like that. Just have a wild imagination and, like, constantly be generating that wild imagination 'cause in, in terms of data augmentation as currently applied, it's super, ultra, very boring. It's very basic data augmentation. I wonder if, I wonder if there's a benefit to being wildly imaginative while trying to be, uh, consistent with physical reality.
- IMIshan Misra
I think it's a kind of a chicken and egg problem, right? Because to have, like, amazing data augmentation, you need to understand what the scene is.
- LFLex Fridman
Right.
- IMIshan Misra
And what we're trying to do data augmentation to learn (laughs) what a scene is anyway.
- LFLex Fridman
Just-
- IMIshan Misra
So, it's basically-
- LFLex Fridman
(laughs)
- IMIshan Misra
... just keeps going on and on.
- LFLex Fridman
Before you understand it, just put elves with bananas until, until you know it's not to be true. (laughs)
- IMIshan Misra
(laughs)
- LFLex Fridman
(laughs) Just like children have a wild imagination until the adults ruin it all. Okay. So, what are the different kinds of data augmentation that you've seen to be effective in, uh, visual intelligence?
- IMIshan Misra
For, like, vision, it's a lot of these image filtering operations, so, like, blurring the image, uh, you know, all the kind of Instagram filters that you can think of.
- LFLex Fridman
(laughs)
- IMIshan Misra
(laughs) So, like, arbitrarily, like, make the red super red, make the green super green, like saturate the image, uh-
- LFLex Fridman
Rotation, cropping-
- IMIshan Misra
Rotation, cropping, exactly. All of these kind of things. Uh-
- LFLex Fridman
Like you said, light- lighting is a really interesting one-
- IMIshan Misra
Yes.
- LFLex Fridman
... to me. Like, that feels, like, really complicated to do.
- IMIshan Misra
So, I mean, they don't ... The augmentations that we work on aren't, like, that involved 'cause they're not going to be, like, physically realistic versions of lighting. It's not that we are assuming there's a light source up and then you're moving it to the right and then seeing what the thing looks like.
- 1:00:10 – 1:03:54
Real data vs. augmented data
- IMIshan Misra
at this side.
- LFLex Fridman
Let me ask you, uh, a ridiculous question. If I were to give you, like a black box, like a choice to have an arbitrary large dataset of real natural data versus really good data augmentation algorithms, which would you like to train in a self-supervised way on?... so natural data from the internet, ar- arbitrarily large, so unlimited data, or it's like more controlled, good data augmentation on the finite dataset.
- IMIshan Misra
The thing is like because our learning algorithms for vision right now really rely on data augmentation, even if you were to give me like an infinite source of like image data, I still need a good data augmentation algorithm to learn from it.
- LFLex Fridman
You need something that tells you that two things are similar.
- IMIshan Misra
Right. And so something... because you've given me an arbitrarily large dataset, I still need to use data augmentation to take that image, construct like these two perturbations of it, and then learn from it. So, the thing is our learning paradigm is very primitive right now.
- LFLex Fridman
Yeah.
- IMIshan Misra
Even if you were to give me lots of images, it's still not really useful. A good data augmentation algorithm is actually going to be more useful. So, you can like reduce down the amount of da- data that you give me by like 10 times, but if you were to give me a good data augmentation algorithm, that would probably do better than giving me like 10 times the size of that data but me having to rely on like a very primitive data augmentation algorithm.
- LFLex Fridman
Like through tagging and all those kinds of things, is there a way to discover things that are semantically similar on the internet? Obviously there is, but they might be extremely noisy.
- IMIshan Misra
Right.
- LFLex Fridman
And the difference might be farther away than you would be comfortable with.
- IMIshan Misra
So, I mean, yes, tagging will help you a lot. It'll actually go a very long way in figuring out what images are related or not. Um, and then, so, but then the purists would argue that when you're using human tags, because these tags are like supervision, is it really, really self-supervised learning now?
- LFLex Fridman
Yeah.
- IMIshan Misra
Because you're using human tags to figure out which images are like similar. Perhaps-
- LFLex Fridman
Hashtag no filter means a lot of things.
- IMIshan Misra
Yes.
- LFLex Fridman
(laughs)
- IMIshan Misra
I mean, there are certain tags which are going to be applicable pretty much to anything.
- LFLex Fridman
Yeah. (laughs)
- IMIshan Misra
(laughs) So they're pretty useless for learning.
- LFLex Fridman
Yeah.
- IMIshan Misra
Uh, but I mean, f- certain tags are actually like the Eiffel Tower, for example, or the Taj Mahal, for example. These tags are like very indicative of what's going on.
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
And they are... I mean, they are human supervision.
- LFLex Fridman
Yeah. Well, this is one of the tasks of discovering from human-generated data strong signals that could be leveraged, uh, for self-supervision. Like humans are doing so much work already. Like many years ago, there was something that was called I guess human computation back in the day. Humans are doing so much work. It's- it'd be exciting to discover ways to leverage the work they're doing to teach machines without any extra effort from them. An example could be, like we said, driving, humans driving and machines can learn from their driving. I always hope that there could be some supervision signal discovered in video games because there's so many people-
- IMIshan Misra
(laughs)
- LFLex Fridman
... that play video games that it feels like so much effort is put into video games, uh, into playing video games, and you can design video games somewhat cheaply-
- IMIshan Misra
Right.
- LFLex Fridman
... and, and to, to include whatever signals you want. It feels like, uh, that could be leveraged somehow.
- IMIshan Misra
So, people are using that.
- LFLex Fridman
Yeah.
- 1:03:54 – 1:07:32
Non-contrastive learning energy based self supervised learning methods
- IMIshan Misra
- LFLex Fridman
But that said, there's non-contrastive methods.
- IMIshan Misra
Right.
- LFLex Fridman
What do non-contrastive energy-based self-supervised learning methods look like? And why are they promising?
- IMIshan Misra
So, like I said about contrastive learning, you have this notion of a positive and a negative. Now, the thing is this entire learning paradigm really requires access to a lot of negatives, uh, to learn a good sort of feature space. The idea is if I tell you, um, okay, so a cat and a dog are similar and they're very different from a banana. The thing is this is a fairly simple analogy, right? Because, well, bananas look visually very different from what cats and dogs do. So very quickly, if this is the only source of supervision that I'm giving you, your learning is not going to be like... after a point, the neural network is really not going to learn a lot, uh, because the negative that you're getting is going to be so random. So, it can be, oh, a cat and a dog are very- are similar, but they're very different from a Volkswagen Beetle. Now, like this car looks very different from these animals again. So, the thing is in contrastive learning, the quality of the negative sample really matters a lot. And so what has happened is basically that typically these methods that are contrastive really require access to lots of negatives, which becomes harder and harder to sort of scale when designing a learning algorithm. So, that's w- been one of the reasons why non-contrastive methods have become like popular and why people think that they're going to be more useful. So, a non-contrastive method, for example, like clustering is one non-contrastive method. The idea basically being that you have two of these, uh, two of these, uh, samples, and so the cat and dog are two crops of this image, they belong to the same cluster. Uh, and so essentially you're basically doing clustering online when you're lea- learning this network, and which is very different from having access to a lot of negatives explicitly.
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
The other way which has become really popular is something called self-distillation. So, the idea basically is that you have a teacher network and a student network, and the teacher network produces a feature, so it takes in the image and it... basically the neural network figures out the patterns, gets the feature out. And there's another, uh, neural network which is the student neural network, and that also produces a feature. And now all you're doing is basically saying that the, uh, features produced by the teacher network and the student network should be very similar. That's it. There is no notion of a, a negative anymore.
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
And that's it. So, it's all about similarity maximization between these two features. And so all I need to now do is figure out how to have these two sorts of parallel networks, a student network and a teacher network. And, uh, basically researchers have figured out very cheap methods to do this. So, you can actually have for free really two types of neural networks. Uh, they're kind of related, but they're different enough that you can actually basically have a learning problem set up.
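One common version of this cheap trick, used by self-distillation methods like BYOL and DINO (not necessarily the only one Ishan has in mind), makes the teacher an exponential moving average of the student. A toy sketch on flattened weights, with made-up numbers:

```python
def ema_update(teacher, student, momentum=0.99):
    """Move the teacher's weights a small step toward the student's.

    This exponential-moving-average update keeps the two networks
    related but never exactly identical.
    """
    return [momentum * t + (1 - momentum) * s for t, s in zip(teacher, student)]

student_weights = [0.5, -1.2, 3.0]  # hypothetical flattened weights
teacher_weights = [0.0, 0.0, 0.0]

for _ in range(10):  # each training step nudges the teacher slightly
    teacher_weights = ema_update(teacher_weights, student_weights)

# The teacher has drifted toward the student without matching it exactly.
print(all(0 < abs(t) < abs(s) for t, s in zip(teacher_weights, student_weights)))  # True
```

Because the teacher lags behind the student, the two features never trivially coincide, which is part of what keeps the similarity-maximization problem non-degenerate.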
- LFLex Fridman
So, you can ensure that they always remain different enough? So, the thing doesn't collapse into something boring.
- IMIshan Misra
Exactly. So, the main sort of enemy of self-supervised learning a- any kind of similarity maximization technique is collapse. And so, collapse means that you learn the same feature representation for all the images in the world.
- LFLex Fridman
(laughs)
- IMIshan Misra
Which is completely useless. It-
- LFLex Fridman
Everything is a banana.
- IMIshan Misra
Everything is a banana, everything is a cat, everything is a car.
- LFLex Fridman
Yeah.
- IMIshan Misra
Uh, and so all we need to do is basically come up with ways to prevent collapse, contrastive learning is one way of doing it. And then, for example, like clustering or self-distillation are other ways of doing it. We also had a recent paper where we used like decorrelation, um, between like two sets of features to prevent collapse. So, that's inspired a little bit by like Horace Barlow's neuroscience principles.
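The decorrelation idea, in the spirit of the Barlow Twins objective referenced here, can be sketched as a loss on the cross-correlation matrix between the two views' features. This toy version uses tiny hand-built feature batches rather than real network outputs:

```python
def cross_correlation(za, zb):
    """Cross-correlation matrix between two batches of feature vectors
    (assumed already normalized per dimension)."""
    n, d = len(za), len(za[0])
    return [[sum(za[k][i] * zb[k][j] for k in range(n)) / n
             for j in range(d)] for i in range(d)]

def decorrelation_loss(c, lam=0.005):
    """Push diagonal entries toward 1 (the two views agree) and
    off-diagonal entries toward 0 (feature dimensions are decorrelated,
    which is what prevents collapse)."""
    d = len(c)
    on_diag = sum((c[i][i] - 1.0) ** 2 for i in range(d))
    off_diag = sum(c[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return on_diag + lam * off_diag

# Independent, agreeing dimensions -> identity correlation matrix (optimal).
good = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
# Both dimensions are copies of each other -> collapsed representation.
bad = [[1.0, 1.0], [-1.0, -1.0]]
print(decorrelation_loss(cross_correlation(good, good)) <
      decorrelation_loss(cross_correlation(bad, bad)))  # True
```

Note how collapse is penalized without ever needing a negative sample: redundant feature dimensions alone raise the off-diagonal term.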
- LFLex Fridman
By the way, I should comment that whoever counts the number of times the n- the word banana, apple, cat, and dog were used in this conversation wins the internet. I wish you luck.
- IMIshan Misra
(laughs)
- 1:07:32 – 1:10:14
Unsupervised learning (SwAV)
- IMIshan Misra
- LFLex Fridman
What, uh, what is SwAV and- and the main improvement proposed in, uh, the paper, Unsupervised Learning of Visual Features by Contrasting Cluster Assignments?
- IMIshan Misra
SwAV basically is a clustering-based technique, uh, which is for, again, the same thing, for self-supervised learning in vision where we have two crops. And the idea basically is that you want the features from these two crops of an image to lie in the same cluster. Uh, and basically, uh, crops that are coming from different images to be in different clusters.
- LFLex Fridman
Okay.
- IMIshan Misra
Now, typically, you know, sort of if you were to do this clustering, you would perform clustering offline. What that means is, you would, if you have a data set of N examples, you would run over all of these N e- N examples, get features for them, perform clustering, so basically get some clusters, and then repeat the process again.
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
So, this is offline basically because I need to do one pass through the data to compute its clusters.
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
SwAV is basically just a simple way of doing this online. So, as you're going through the data, you're actually computing these clusters online.
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
And so, of course, there is like a lot of tricks involved in how to do this in a robust manner without collapsing, but this is the s- sort of key idea to it.
- LFLex Fridman
Is there a nice way to say what is the key methodology of the clustering that enables that?
- IMIshan Misra
Right. So, the idea basically is that, um, when you have N samples, we assume that we- we have access to like, there are always K clusters in a data set. K is a fixed number. So, for example, K is 3,000. And so if you have any n- when you look at any sort of small number of examples, all of them must belong to one of these K clusters, and we impose this equipartition constraint. What this means is, that, uh, basically, uh, your entire set of N samples should be equally partitioned into K clusters. So, all your K clusters are basically equal, will have equal contribution to these N samples. And this ensures that we never collapse. So, collapse can be viewed as a way in which all samples belong to one cluster, right?
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
So, all this, if all features become the same, then you have basically just one mega cluster, you don't even have like 10 clusters or 3,000 clusters. So, SwAV basically ensures that at each point, all these 3,000 clusters are being used in the clustering process, and that's it. Basically just, uh, figure out how to do this online, and-
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
... um, again, basically just make sure two crops from the same image belong to the same cluster, uh, and others don't.
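[Editor's note: the SwAV paper enforces this equipartition constraint online with a Sinkhorn-Knopp-style normalization. A rough NumPy sketch of that step, illustrative and not the paper's code:]

```python
import numpy as np

def balanced_soft_assignments(scores, n_iters=3):
    """Sinkhorn-style normalization: turn raw sample-to-prototype scores
    (N samples x K clusters) into soft cluster assignments whose columns
    are approximately equally used -- the equipartition constraint that
    prevents all samples collapsing into a single cluster.
    """
    q = np.exp(scores)
    for _ in range(n_iters):
        q /= q.sum(axis=0, keepdims=True)  # each cluster gets equal total mass
        q /= q.sum(axis=1, keepdims=True)  # each sample's assignment sums to 1
    return q
```

Each row of the result is a soft assignment (e.g. 0.2 to cluster one, 0.8 to cluster two), and the column sums come out roughly equal to N/K, so no cluster can absorb everything.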
- LFLex Fridman
And the fact that you have a fixed K makes things simpler.
- IMIshan Misra
Fixed K makes things simpler. Our clustering is not like really hard clustering, it's soft clustering.
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
So, basically you can be point two to cluster number one and point eight to cluster number two, so it's not really hard. Uh, so essentially even though we have like 3,000 clusters, we can actually represent a lot of clusters.
- 1:10:14 – 1:15:21
Self-supervised Pretraining (SEER)
- LFLex Fridman
What is SEER? S-E-E-R. And, uh, what are the key results and insights in the paper Self-Supervised Pre-training of Visual Features in the Wild? What is this big, beautiful SEER system?
- IMIshan Misra
SEER... So, I'll first go to SwAV because SwAV is actually like, uh, one of the key components for SEER.
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
So, SwAV was, when we used SwAV, it was demonstrated on ImageNet.
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
So, typically, like, uh, self-supervised methods, uh, the way we sort of operate is, uh, like in the research community, we kind of cheat. So, we take ImageNet, which of course I talked about as having lots of labels, and then we throw away the labels, like throw away all the hard work that went behind basically the labeling process, and we pretend that it is self- like unsupervised. But the problem here is that, like, when we collected these images, uh, the ImageNet data set has a particular distribution of concepts, right?
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
So, these images are very curated. And what that means is, these images, uh, of course belong to a certain set of known concepts. And also, ImageNet has this bias that all images contain an object which is like very big and it's typically in the center.
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
So, when you're talking about a dog, it's a well-framed dog, it's towards the center of the image. So, a lot of the data augmentation, a lot of the sort of hidden assumptions in self-supervised learning, uh, actually really, uh, exploit this bias of ImageNet.
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
And so, I mean, a lot of my work, a lot of work from other people always uses ImageNet sort of as the benchmark to show s- the success of self-supervised learning.
- LFLex Fridman
So, you're implying that there's particular limitations to this kind of data set?
- IMIshan Misra
Yes. I mean, it's basically because our data augmentations that we designed, uh, like all, uh, data augmentations that we designed for self-supervised learning in vision are kind of overfit to ImageNet.
- LFLex Fridman
But you're saying it's a little bit hardcoded, like the cropping.
- IMIshan Misra
Exactly. The cropping parameters, the kind of lighting that we're using, the kind of blurring that we're using.
- LFLex Fridman
Yeah. But you would, uh, for a more in-the-wild data set, you would need to be, uh, more clever or more careful in setting the range of parameters and those kinds of things.
- IMIshan Misra
Right. So, for SEER, our main goal was two-fold. One, basically to move away from ImageNet for training. Uh, so the images that we used were like uncurated images. Now, there's a lot of debate whether they're actually curated or not, but I'll talk about that later. Uh, but the idea was basically these are going to be random internet images, uh, that we are not going to filter out based on like ca- particular categories.
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
So, we did not say that, "Oh, images that belong to dogs and cats should be the only images that come in this data set."... banana. Uh, and basically, other images should be thrown out, so we didn't do any of that. So, these are random internet images and of course, uh, it also goes back to like, the problem of scale that you talked about.
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
So, these were basically about a billion or so images and for context, uh, the ImageNet version that we used earlier was one million images, so this is basically going like three orders of magnitude more. The idea was basically to see if we can train a very large convolutional model in a self-supervised way on this uncurated, but really large set of images.
- LFLex Fridman
Mm-hmm.
- IMIshan Misra
And how well would this model do? So, is self-supervised learning really overfit to ImageNet? Uh, or, or can it actually work in the wild? And it was also out of curiosity, what kind of things will this model learn? Will it actually be able to still figure out, you know, different types of objects and so on? Would there be particular kinds of tasks it would actually, uh, do better than, uh, an ImageNet, uh, trained model? And so for SEER, one of our main findings was that we can actually train very large models in a completely self-supervised way on lots of internet images without really necessarily filtering them out. Which was in itself a good thing, because it's a fairly simple process, right? So, you get images which are uploaded, and you basically can immediately use them to train a model in an unsupervised way. You don't really need to sit and filter them out. These images can be cartoons, these can be memes, these can be actual pictures uploaded by people, and you don't really care about what these images are. You don't even care about what concepts they contain. So, this was a very sort of simple setup.
- LFLex Fridman
What image selection mechanism would you say is there, like, um, inherent in some aspect of the process? So, you're kind of implying that there's almost none, but, uh, what, what is there would you say if you were to introspect?
- IMIshan Misra
So... Right. So, it's not like fully uncurated, basically. One way of imagining uncurated is basically you have, like, cameras that can take pictures at random viewpoints. When people upload pictures to the internet, they are typically going to care about the framing of it. They're not going to upload, say, the picture of a zoomed-in wall, for example.
Episode duration: 2:30:29
Transcript of episode FUS6ceIvUnI