
Ishan Misra: Self-Supervised Deep Learning in Computer Vision | Lex Fridman Podcast #206
Lex Fridman (host), Ishan Misra (guest)
Self-Supervised Vision: Teaching Machines to See Without Human Labels
Lex Fridman and Ishan Misra dive into self-supervised learning, focusing on how machines can learn visual representations from raw data without human annotations. They contrast supervised, semi-supervised, and self-supervised paradigms, explain key tricks like masking, cropping, contrastive learning, and transformers, and explore why language has progressed faster than vision. The conversation covers large-scale systems like SwAV and SEER, multimodal audio-visual learning, data augmentation, and active learning as crucial ingredients for scalable intelligence. They close by zooming out to the limits of deep learning, the role of embodiment and interaction, and philosophical questions about categories, reasoning, and the nature of intelligence.
Key Takeaways
Self-supervised learning scales where supervised learning cannot.
Manually labeling massive visual datasets is prohibitively expensive and inconsistent, while self-supervised methods use structure in the data itself as the source of supervision.
Data augmentation is a central, underappreciated form of supervision.
For vision SSL, carefully designed augmentations (cropping, color jitter, blurring, etc.) effectively define the supervision signal, determining which invariances the model learns.
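The augmentation recipe described above can be sketched in a few lines. This is a minimal illustrative example, not the pipeline used in the systems discussed: real SSL pipelines layer on flips, stronger color jitter, and blurring, and the image here is a random array standing in for a real photo.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, crop=24):
    """Produce one randomly augmented 'view' of an image array.

    Sketch of two of the augmentations mentioned in the episode:
    random crop plus color jitter (brightness scaling).
    """
    h, w, _ = image.shape
    # Random crop: pick a crop-sized window at a random offset.
    y = rng.integers(0, h - crop + 1)
    x = rng.integers(0, w - crop + 1)
    view = image[y:y + crop, x:x + crop].astype(np.float32)
    # Color jitter: scale brightness by a random factor.
    view = view * rng.uniform(0.6, 1.4)
    return np.clip(view, 0, 255)

image = rng.integers(0, 256, size=(32, 32, 3))  # stand-in for a real photo
view_a, view_b = augment(image), augment(image)  # two views of one image
print(view_a.shape, view_b.shape)
```

An SSL model is then trained to map `view_a` and `view_b` to similar embeddings, so the choice of augmentations decides exactly which differences the representation learns to ignore.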
Contrastive and non-contrastive methods both aim to avoid collapse.
Contrastive learning pulls together positive pairs while pushing away negatives, but needs many good negatives; newer non-contrastive approaches (clustering, self-distillation, decorrelation) remove explicit negatives and control collapse through other constraints.
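The contrastive objective described above can be sketched as an InfoNCE-style loss over two batches of embeddings. This is a minimal numpy sketch under stated assumptions (random vectors as stand-in embeddings), not the exact loss used in any particular system from the episode.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Contrastive (InfoNCE-style) loss on two batches of embeddings.

    z_a[i] and z_b[i] embed two augmented views of image i (a positive
    pair); every z_b[j] with j != i serves as a negative.
    """
    # L2-normalize so the dot product is cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature           # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Mean negative log-probability of the diagonal (the positive pairs).
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                    # stand-in embeddings
noise = rng.normal(scale=0.05, size=(8, 16))    # small view-to-view change
# Correctly matched views should score a lower loss than shuffled ones.
print(info_nce(z, z + noise) < info_nce(z, np.roll(z + noise, 1, axis=0)))
```

Minimizing this loss pulls each positive pair together relative to all in-batch negatives, which is why contrastive methods need large batches of good negatives; the non-contrastive methods mentioned above avoid collapse without that denominator.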
Language has structural advantages over vision for self-supervision.
Text operates over a finite vocabulary and clear tokens, making masked prediction tractable and informative, while predicting missing pixels in images is combinatorially harder and more sensitive to noise and capture conditions.
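The point about finite vocabularies can be made concrete with a toy masked-prediction example: filling a blank in text is a classification over known tokens, whereas filling missing pixels is a vastly larger regression problem. The corpus and the count-based predictor below are illustrative assumptions, not anything from the episode.

```python
from collections import Counter

# Toy corpus: masked prediction over a finite vocabulary of words.
corpus = "the cat sat on the mat the dog sat on the rug".split()

def predict_masked(left, right):
    """Predict a [MASK]ed token by counting tokens seen between the
    same left and right neighbors elsewhere in the corpus."""
    candidates = Counter(
        corpus[i]
        for i in range(1, len(corpus) - 1)
        if corpus[i - 1] == left and corpus[i + 1] == right
    )
    return candidates.most_common(1)[0][0]

# "sat [MASK] the": the only token seen in that context is "on".
print(predict_masked("sat", "the"))  # on
```

Even this crude counter gives a well-posed target because the answer space is a short list of tokens; the analogous image task would mean scoring every possible patch of pixel values, which is why masked prediction transferred to vision only with extra tricks.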
Large-scale, uncurated training can produce highly transferable visual features.
Systems like SEER show that billion-parameter ConvNets trained on billions of random internet images, without labels, can match or outperform ImageNet-pretrained models on downstream tasks, suggesting SSL can exploit truly “in the wild” data.
Multimodal learning leverages cross-modal consistency as a powerful signal.
Aligning audio and video embeddings drawn from the same clip gives the model a strong, label-free learning signal grounded in cross-modal consistency.
Active and continual learning are likely necessary for robust intelligence.
Selecting the most informative examples (edge cases, high-uncertainty situations) and avoiding catastrophic forgetting are critical for systems operating in the open world, as illustrated by Tesla-style data engines and “learning by asking” paradigms.
Notable Quotes
“Supervised learning just does not scale.”
— Ishan Misra
“The data itself is the source of supervision, so it’s self‑supervised.”
— Ishan Misra
“You shouldn’t want it, you should need it.”
— Ishan Misra
“We are super dumb about how we set up the self-supervised problem, and despite that the models learn to find objects.”
— Ishan Misra
“Any sufficiently advanced technology is indistinguishable from magic.”
— Arthur C. Clarke (quoted by Lex Fridman)
Questions Answered in This Episode
How far can current self-supervised techniques go toward learning genuine common sense about the physical world without any labels or explicit symbolic structure?
What would a learned, data-dependent data augmentation pipeline look like, and could it become more important than the neural network architecture itself?
In practice, how do we decide where to draw the line between acceptable opacity of self-supervised representations and the need for human interpretability?
Can multimodal self-supervision (vision, audio, language, possibly proprioception) bridge the gap between pattern recognition and reasoning, or is an additional architectural breakthrough needed?
What forms of active learning and continual learning are most promising for handling rare, long-tail edge cases in safety-critical domains like autonomous driving?
Transcript Preview
The following is a conversation with Ishan Misra, research scientist at Facebook AI Research who works on self-supervised machine learning in the domain of computer vision, or in other words, making AI systems understand the visual world with minimal help from us humans. Transformers and self-attention have been successfully used by OpenAI's GPT-3 and other language models to do self-supervised learning in the domain of language. Ishan, together with Yann LeCun and others, is trying to achieve the same success in the domain of images and video. The goal is to leave a robot watching YouTube videos all night, and in the morning come back to a much smarter robot. I read the blog post, Self-Supervised Learning: The Dark Matter of Intelligence by Ishan and Yann LeCun, and then listened to Ishan's appearance on the excellent Machine Learning Street Talk podcast, and I knew I had to talk to him. By the way, if you're interested in machine learning and AI, I cannot recommend the ML Street Talk podcast highly enough. Those guys are great. Quick mention of our sponsors: Onnit, The Information, Grammarly, and Athletic Greens. Check them out in the description to support this podcast. As a side note, let me say that for those of you who may have been listening for quite a while, this podcast used to be called Artificial Intelligence podcast because my life passion has always been, will always be artificial intelligence, both narrowly and broadly defined. My goal with this podcast is still to have many conversations with world-class researchers in AI, math, physics, biology, and all the other sciences, but I also want to talk to historians, musicians, athletes, and of course, occasionally comedians. In fact, I'm trying out doing this podcast three times a week now to give me more freedom with guest selection, and maybe, uh, get a chance to have a bit more fun. 
Speaking of fun, in this conversation, I challenge the listener to count the number of times the word "banana" is mentioned. Ishan and I use the word "banana" as the canonical example at the core of the hard problem of computer vision, and maybe the hard problem of consciousness. This is the Lex Fridman podcast, and here is my conversation with Ishan Misra. What is self-supervised learning? And maybe even give the, the bigger basics of what is supervised and semi-supervised learning, and maybe why is self-supervised learning a better term than unsupervised learning?
Uh, let's start with supervised learning. So, typically for machine learning systems, the way they're trained is you get a bunch of humans. The humans point out particular concepts with it. In the case of images, you want the humans to come and tell you what is pos ... like, what is present in the image, draw boxes around them, draw masks of, like, things, pixels which are of particular categories or not. Uh, for NLP, again there are, like, lots of these particular tasks, say, about sentiment analysis, about entailment and so on. So, typically for supervised learning we get a big corpus of such annotated or labeled data, and then we feed that to a system, and the system is really trying to mimic. So, it's taking this input of the data and then trying to mimic the output. So, it looks at an image, and the human has tagged that this image contains a banana, and now the system is basically trying to mimic that. So, that's its learning signal. And so for supervised learning we try to gather lots of such data, and we train these machine learning models to imitate the input/output. And the hope is basically by doing so, now on unseen or, like, new kinds of data, this model can automatically learn to predict these concepts. So, this is a standard sort of supervised setting. For semi-supervised setting, uh, the idea typically is that you have of course all of the supervised data, but you have lots of other data which is unsupervised or which is, like, not labeled. Now, the problem basically with supervised learning and why you actually have all of these alternate sort of learning paradigms is supervised learning does ... just does not scale. So, if you look at for computer vision, the sort of largest, one of the most popular datasets is ImageNet. Right? So, the entire ImageNet dataset has about 22,000 concepts and about 14 million images. So, these concepts are j- basically just nouns, and they're annotated on images. 
And this entire dataset was a mammoth data collection effort. It actually, uh, gave rise to a lot of powerful learning algorithms. It's credited with, like, sort of the rise of deep learning as well. But this dataset took about 22 human years to collect, to annotate, and it's not even that many concepts, right? It's not even that many images. 14 million is nothing, really. Um, like, you have about, I think, 400 million images or so, or even more than that uploaded to most of the popular sort of social media websites today. So, now supervised learning just doesn't scale. If I want to now annotate more concepts, if I want to have this ... various types of fine-grained concepts, then it won't really scale. So, y- now you come after these sort of different learning paradigms. For example, semi-supervised learning, where the idea is y- of course, you have this annotated corpus of supervised data, and you have lots of these unlabeled images, and the idea is that the algorithm should basically try to measure some kind of consistency or really try to measure some kind of, uh, signal on this sort of unlabeled data to make itself more confident about what it's really trying to predict. So, by access to this lots of unlabeled data, the idea is that the algorithm actually learns to be more confident and actually gets better at predicting these concepts. And now, we come to the other extreme, which is, like, self-supervised learning. The idea basically is that the machine or the algorithm should really discover concepts or discover things about the world or learn representations about the world which are useful without access to explicit human supervision.