Lex Fridman Podcast

Ishan Misra: Self-Supervised Deep Learning in Computer Vision | Lex Fridman Podcast #206

Ishan Misra is a research scientist at FAIR working on self-supervised visual learning.

Please support this podcast by checking out our sponsors:
- Onnit: https://lexfridman.com/onnit to get up to 10% off
- The Information: https://theinformation.com/lex to get 75% off first month
- Grammarly: https://grammarly.com/lex to get 20% off premium
- Athletic Greens: https://athleticgreens.com/lex and use code LEX to get 1 month of fish oil

EPISODE LINKS:
Ishan's Twitter: https://twitter.com/imisra_
Ishan's website: https://imisra.github.io
Ishan's FAIR page: https://ai.facebook.com/people/ishan-misra/

PODCAST INFO:
Podcast website: https://lexfridman.com/podcast
Apple Podcasts: https://apple.co/2lwqZIr
Spotify: https://spoti.fi/2nEwCF8
RSS: https://lexfridman.com/feed/podcast/
Full episodes playlist: https://www.youtube.com/playlist?list=PLrAXtmErZgOdP_8GztsuKi9nrraNbKKp4
Clips playlist: https://www.youtube.com/playlist?list=PLrAXtmErZgOeciFP3CBCIEElOJeitOr41

OUTLINE:
0:00 - Introduction
2:27 - Self-supervised learning
11:02 - Self-supervised learning is the dark matter of intelligence
14:54 - Categorization
23:28 - Is computer vision still really hard?
27:12 - Understanding language
36:51 - Harder to solve: vision or language
43:36 - Contrastive learning & energy-based models
47:37 - Data augmentation
51:57 - Fixed audio spike by lowering sound with pen tool
1:00:10 - Real data vs. augmented data
1:03:54 - Non-contrastive energy-based self-supervised learning methods
1:07:32 - Unsupervised learning (SwAV)
1:10:14 - Self-supervised pretraining (SEER)
1:15:21 - Self-supervised learning (SSL) architectures
1:21:21 - VISSL PyTorch-based SSL library
1:24:15 - Multi-modal
1:31:43 - Active learning
1:37:22 - Autonomous driving
1:48:49 - Limits of deep learning
1:52:57 - Difference between learning and reasoning
1:58:03 - Building super-human AI
2:05:51 - Most beautiful idea in self-supervised learning
2:09:40 - Simulation for training AI
2:13:04 - Video games replacing reality
2:14:18 - How to write a good research paper
2:18:45 - Best programming language for beginners
2:19:39 - PyTorch vs TensorFlow
2:23:03 - Advice for getting into machine learning
2:25:09 - Advice for young people
2:27:35 - Meaning of life

SOCIAL:
- Twitter: https://twitter.com/lexfridman
- LinkedIn: https://www.linkedin.com/in/lexfridman
- Facebook: https://www.facebook.com/lexfridman
- Instagram: https://www.instagram.com/lexfridman
- Medium: https://medium.com/@lexfridman
- Reddit: https://reddit.com/r/lexfridman
- Support on Patreon: https://www.patreon.com/lexfridman

Lex Fridman (host) · Ishan Misra (guest)
Jul 31, 2021 · 2h 30m

At a glance

WHAT IT’S REALLY ABOUT

Self-Supervised Vision: Teaching Machines to See Without Human Labels

Lex Fridman and Ishan Misra dive into self-supervised learning, focusing on how machines can learn visual representations from raw data without human annotations. They contrast supervised, semi-supervised, and self-supervised paradigms, explain key techniques like masking, cropping, contrastive learning, and transformer architectures, and explore why language has progressed faster than vision. The conversation covers large-scale systems like SwAV and SEER, multimodal audio-visual learning, data augmentation, and active learning as crucial ingredients for scalable intelligence. They close by zooming out to the limits of deep learning, the role of embodiment and interaction, and philosophical questions about categories, reasoning, and the nature of intelligence.

IDEAS WORTH REMEMBERING

5 ideas

Self-supervised learning scales where supervised learning cannot.

Manually labeling massive visual datasets is prohibitively expensive and inconsistent, while self-supervised methods use structure in the data itself (e.g., predicting missing parts, aligning crops) to learn powerful representations at internet scale.
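A minimal sketch of one such pretext task, in PyTorch (the stack the episode itself discusses): mask out a patch of each image and train a network to fill it back in, so the data supervises itself. The tiny network, tensor shapes, and masking scheme below are illustrative assumptions, not any specific system's recipe.

```python
import torch
import torch.nn as nn

def mask_random_patch(images, patch=32):
    """Zero out one random square patch per image; the original image
    itself is the prediction target, so no human labels are needed."""
    masked = images.clone()
    b, _, h, w = images.shape
    for i in range(b):
        y = int(torch.randint(0, h - patch, (1,)))
        x = int(torch.randint(0, w - patch, (1,)))
        masked[i, :, y:y + patch, x:x + patch] = 0.0
    return masked, images

# Stand-in for a real encoder-decoder / inpainting network (illustrative).
net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1),
)

images = torch.rand(8, 3, 224, 224)             # fake unlabeled batch
masked, target = mask_random_patch(images)
loss = nn.functional.mse_loss(net(masked), target)
loss.backward()                                 # supervision came from the data
```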

Data augmentation is a central, underappreciated form of supervision.

For vision SSL, carefully designed augmentations (cropping, color jitter, blurring, etc.) are effectively where human prior knowledge lives; choosing or learning better, more realistic augmentations can matter more than changing architectures.
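For concreteness, here is what such an augmentation pipeline typically looks like in torchvision. The particular transforms and parameter values are illustrative assumptions, loosely in the spirit of recipes like SimCLR's, not the exact settings of any system discussed in the episode.

```python
from torchvision import transforms

# Two independent passes through this pipeline over the same image
# produce the two "views" that SSL methods treat as a positive pair.
ssl_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),   # aggressive cropping
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply(
        [transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),               # blurring
    transforms.ToTensor(),
])

# view1, view2 = ssl_augment(img), ssl_augment(img)
```

The design choice worth noticing: every line of this pipeline encodes a human prior about which changes should leave an image's identity intact, which is exactly the sense in which augmentation is a form of supervision.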

Contrastive and non-contrastive methods both aim to avoid collapse.

Contrastive learning pulls together positive pairs while pushing away negatives, but needs many good negatives; newer non-contrastive approaches (clustering, self-distillation, decorrelation) remove explicit negatives and control collapse through other constraints.
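A minimal sketch of the contrastive side of this, as an InfoNCE-style loss in PyTorch; the batch size, embedding dimension, and temperature below are illustrative assumptions, and real methods add projection heads and more careful negative handling.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1[i] and z2[i] are embeddings of two views of the same image.
    Each row's positive sits on the diagonal; every other item in the
    batch acts as a negative, which is why large batches help."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # pairwise similarities
    labels = torch.arange(z1.size(0))           # positives on the diagonal
    return F.cross_entropy(logits, labels)      # pull positives, push negatives

z1, z2 = torch.randn(256, 128), torch.randn(256, 128)  # fake embeddings
print(info_nce(z1, z2))
```

Non-contrastive methods keep the "pull positives together" term but drop the explicit negatives, preventing the trivial constant-embedding solution through clustering constraints, asymmetric self-distillation, or decorrelation penalties instead.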

Language has structural advantages over vision for self-supervision.

Text operates over a finite vocabulary and clear tokens, making masked prediction tractable and informative, while predicting missing pixels in images is combinatorially harder and more sensitive to noise and capture conditions.
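A minimal sketch of why the finite vocabulary helps: masked prediction in language reduces to an ordinary classification over |V| tokens, whereas there is no comparably small, discrete target space for missing pixels. The toy vocabulary and model below are illustrative assumptions.

```python
import torch
import torch.nn as nn

vocab = ["the", "cat", "sat", "on", "mat", "[MASK]"]
token_ids = torch.tensor([[0, 1, 5, 3, 4]])    # "the cat [MASK] on mat"

embed = nn.Embedding(len(vocab), 32)           # toy token embeddings
head = nn.Linear(32, len(vocab))               # scores one of |V| tokens

logits = head(embed(token_ids))                # shape (1, 5, |V|)
target = torch.tensor([2])                     # true token at the mask: "sat"
loss = nn.functional.cross_entropy(logits[0, 2:3], target)
loss.backward()                                # a tractable |V|-way problem
```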

Large-scale, uncurated training can produce highly transferable visual features.

Systems like SEER show that billion-parameter ConvNets trained on roughly a billion random, uncurated internet images, without labels, can match or outperform ImageNet-pretrained models on downstream tasks, suggesting SSL can exploit truly “in the wild” data.

WORDS WORTH SAVING

5 quotes

Supervised learning just does not scale.

Ishan Misra

The data itself is the source of supervision, so it’s self-supervised.

Ishan Misra

You shouldn’t want it, you should need it.

Ishan Misra

We are super dumb about how we set up the self-supervised problem, and despite that the models learn to find objects.

Ishan Misra

Any sufficiently advanced technology is indistinguishable from magic.

Arthur C. Clarke (quoted by Lex Fridman)

- Supervised, semi-supervised, and self-supervised learning paradigms
- Self-supervised learning tricks in language (masking) and vision (cropping, prediction)
- Contrastive learning, clustering, energy-based models, and collapse
- Large-scale self-supervised vision systems: SwAV, SEER, DINO, RegNets, VISSL
- Data augmentation: role, design, and future (learned, realistic augmentations)
- Multimodal learning with audio and video and cross-modal agreement
- Limits of deep learning, active learning, object categories, and general intelligence

High-quality AI-generated summary created from a speaker-labeled transcript.
