Ishan Misra: Self-Supervised Deep Learning in Computer Vision | Lex Fridman Podcast #206

Ishan Misra is a research scientist at FAIR working on self-supervised visual learning.

Please support this podcast by checking out our sponsors:
- Onnit: https://lexfridman.com/onnit to get up to 10% off
- The Information: https://theinformation.com/lex to get 75% off first month
- Grammarly: https://grammarly.com/lex to get 20% off premium
- Athletic Greens: https://athleticgreens.com/lex and use code LEX to get 1 month of fish oil

EPISODE LINKS:
Ishan's twitter: https://twitter.com/imisra_
Ishan's website: https://imisra.github.io
Ishan's FAIR page: https://ai.facebook.com/people/ishan-misra/

PODCAST INFO:
Podcast website: https://lexfridman.com/podcast
Apple Podcasts: https://apple.co/2lwqZIr
Spotify: https://spoti.fi/2nEwCF8
RSS: https://lexfridman.com/feed/podcast/
Full episodes playlist: https://www.youtube.com/playlist?list=PLrAXtmErZgOdP_8GztsuKi9nrraNbKKp4
Clips playlist: https://www.youtube.com/playlist?list=PLrAXtmErZgOeciFP3CBCIEElOJeitOr41

OUTLINE:
0:00 - Introduction
2:27 - Self-supervised learning
11:02 - Self-supervised learning is the dark matter of intelligence
14:54 - Categorization
23:28 - Is computer vision still really hard?
27:12 - Understanding Language
36:51 - Harder to solve: vision or language
43:36 - Contrastive learning & energy-based models
47:37 - Data augmentation
51:57 - Fixed audio spike by lowering sound with pen tool
1:00:10 - Real data vs. augmented data
1:03:54 - Non-contrastive learning energy based self supervised learning methods
1:07:32 - Unsupervised learning (SwAV)
1:10:14 - Self-supervised Pretraining (SEER)
1:15:21 - Self-supervised learning (SSL) architectures
1:21:21 - VISSL pytorch-based SSL library
1:24:15 - Multi-modal
1:31:43 - Active learning
1:37:22 - Autonomous driving
1:48:49 - Limits of deep learning
1:52:57 - Difference between learning and reasoning
1:58:03 - Building super-human AI
2:05:51 - Most beautiful idea in self-supervised learning
2:09:40 - Simulation for training AI
2:13:04 - Video games replacing reality
2:14:18 - How to write a good research paper
2:18:45 - Best programming language for beginners
2:19:39 - PyTorch vs TensorFlow
2:23:03 - Advice for getting into machine learning
2:25:09 - Advice for young people
2:27:35 - Meaning of life

SOCIAL:
- Twitter: https://twitter.com/lexfridman
- LinkedIn: https://www.linkedin.com/in/lexfridman
- Facebook: https://www.facebook.com/lexfridman
- Instagram: https://www.instagram.com/lexfridman
- Medium: https://medium.com/@lexfridman
- Reddit: https://reddit.com/r/lexfridman
- Support on Patreon: https://www.patreon.com/lexfridman

Lex Fridman (host), Ishan Misra (guest)

Jul 31, 2021 | 2h 30m

CHAPTERS

0:00 – 2:31
Why self-supervised learning matters: scaling beyond human labels
Lex introduces Ishan Misra’s work at FAIR and frames the core motivation: letting models learn from raw images/video the way language models learn from text. The discussion sets up self-supervision as a path around the cost, inconsistency, and limited scope of human annotation.
- Goal: learn visual representations from passive observation (e.g., “watch YouTube all night”)
- Supervision doesn’t scale: labeling like ImageNet is hugely expensive
- Self-supervision as a central ingredient for visual intelligence
2:31 – 7:26
Supervised vs. semi-supervised vs. self-supervised: where the learning signal comes from
Ishan defines supervised learning as learning from human-provided labels, semi-supervised as mixing labeled and unlabeled data, and self-supervised as extracting supervision from the data itself. They clarify why “self-supervised” is more precise than the catch-all term “unsupervised.”
- Supervised learning imitates human-provided input/output pairs (labels, boxes, masks)
- Semi-supervised uses unlabeled data to enforce consistency and improve confidence
- Self-supervised uses the data as its own supervision source
- Why ‘unsupervised’ is too broad a category
7:26 – 10:55
Core self-supervision tricks: masking in language, prediction and crops in vision
They explore practical mechanisms that turn raw data into a training signal: mask tokens in text, predict future frames in video, and match representations from different crops of the same image. The unifying theme is leveraging natural consistency in sequences and scenes.
- NLP: masked-word prediction as a powerful self-supervised objective
- Video: next-step prediction can teach motion, physics, and object permanence
- Images: different crops/views of the same image should yield similar features
- Self-supervision often relies on cleverly constructed prediction tasks
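To make the masking idea concrete, here is a minimal sketch of how raw text can be turned into a training example with no human labels. The `MASK` token name, the `make_masked_example` helper, and the masking rate are illustrative assumptions, not details from the episode.

```python
import random

MASK = "[MASK]"  # placeholder token; the name is an assumption for this sketch


def make_masked_example(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens and return (masked_input, targets).

    `targets` maps each hidden position to the original token. That original
    token is the label the model would be trained to predict, so the data
    supplies its own supervision.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets


sentence = "the cat sat on the mat".split()
masked, targets = make_masked_example(sentence, mask_prob=0.5, seed=1)
print(masked)   # input the model sees
print(targets)  # positions it must fill in, with the hidden words
```

The same recipe carries over to the other bullets in spirit: hide or transform part of the input, then ask the model to recover or match what was hidden.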
10:55 – 14:49
“Dark matter of intelligence”: common sense from observation (and the limits of labeling)
Ishan summarizes the ‘dark matter’ argument: most useful knowledge about the world is hard to label but abundant in raw experience. They discuss learning properties like weight or affordances by watching interactions rather than annotating every concept explicitly.
- Self-supervision as a scalable route to ‘common sense’ knowledge
- Many concepts (heaviness, pour-ability, sit-ability) are impractical to label
- Observation and interaction can implicitly reveal hidden properties
- Humans are inconsistent supervisors, especially at category boundaries
14:49 – 23:01
Why taxonomies break: categorization, compositional concepts, and similarity as the substrate
The conversation turns philosophical: perfect object taxonomies are likely hopeless because new categories can always be invented via composition and context. Instead, similarity (and dissimilarity) may be the more fundamental organizing principle for representations.
- Discrete taxonomies are brittle: you can always create an ‘N+1’ category
- Compositionality creates endless edge cases (e.g., cronut)
- Similarity enables transfer: recognizing novel instances via related experience
- Classification as ‘icing’; representations as the ‘cake’
23:01 – 27:02
Is computer vision still really hard? From pixels to humor, intent, and social context
Lex uses vivid examples (humor, gaze, gravity, social inference) to argue that human-level vision involves deep world models. Ishan agrees vision remains hard and notes self-supervision helps build pre-semantic concepts, but communication to humans ultimately demands semantics and language.
- Human vision infers physics, intent, and social context from a single image
- You can label narrow tasks (e.g., humor) but general understanding is far harder
- Self-supervised learning isn’t ‘the answer to everything’
- Semantics/language become necessary when systems must communicate with humans
27:02 – 36:32
Self-supervised NLP success and what transfers to vision: distributional hypothesis and transformers
Ishan explains why masked-language modeling works so well: words in similar contexts share meaning, and the vocabulary is finite. They introduce transformers via self-attention, a mechanism for modeling global context, and then connect those ideas to vision’s need for context (e.g., local patches can be ambiguous).
- Distributional hypothesis: context drives meaning in language representations
- Masked prediction objectives have scaled from word2vec to BERT/RoBERTa
- Transformers/self-attention: each element attends to all others (global context)
- Vision also benefits from global context because local pixel patches are ambiguous
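The “each element attends to all others” point can be sketched in a few lines. This is a simplified single-head version with identity projections; real transformers use learned query/key/value projections and multiple heads, which are omitted here. The function names and the toy inputs are assumptions for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention without learned projections.

    x: list of n vectors (each a list of d floats).
    Returns (outputs, weights); weights[i][j] is how much position i
    attends to position j, and each row of weights sums to 1.
    """
    d = len(x[0])
    scale = math.sqrt(d)
    weights = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in x]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[c] for w, v in zip(row, x)) for c in range(d)]
        for row in weights
    ]
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Every weight is strictly positive, so each output vector mixes information from the whole sequence. That is the global-context property the discussion contrasts with ambiguous local patches.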
36:32 – 43:36
Vision vs. language: why vision self-supervision is harder to scale naively
They debate which domain is harder and settle on vision, partly because language is a human-constructed, discretized signal. Vision prediction (e.g., reconstructing pixels) explodes combinatorially, and real-world imaging injects noise and variability absent from text tokens.
- Language: finite vocabulary makes prediction tractable
- Vision: pixel prediction is combinatorially large and harder to optimize
- Images are a noisy measurement pipeline (lighting, sensors, compression)
- Current vision SSL often avoids pixel prediction and instead matches embeddings
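A quick back-of-the-envelope calculation shows the scale of the gap. The numbers below are assumptions chosen for illustration (a 30,000-word vocabulary, an 8x8 RGB patch, 8 bits per channel); they do not come from the episode.

```python
import math

vocab_size = 30_000   # assumed vocabulary size, for illustration only
patch_side = 8        # assumed 8x8 image patch
channels = 3          # RGB
levels = 256          # 8-bit intensity per channel

# A masked word has `vocab_size` possible answers.
log10_text_outcomes = math.log10(vocab_size)              # about 4.5

# A masked patch has levels ** (number of values in the patch) possible answers.
values_per_patch = patch_side * patch_side * channels     # 192 numbers
log10_patch_outcomes = values_per_patch * math.log10(levels)  # about 462.4

print(f"text:  ~10^{log10_text_outcomes:.1f} possible outputs")
print(f"patch: ~10^{log10_patch_outcomes:.1f} possible outputs")
```

Under these assumptions, even a tiny patch has roughly 10^462 possible exact values versus tens of thousands of possible words. That gap is one way to see why matching embeddings is attractive compared with predicting pixels directly.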
43:36 – 47:49
Contrastive learning and energy-based models: positives, negatives, and a unifying lens
Ishan defines contrastive learning as shaping an embedding space by pulling positives together and pushing negatives apart. He then describes how Yann LeCun’s energy-based view provides a common language for relating contrastive methods, GANs, and VAEs.
- Contrastive learning: learn representations via positive/negative comparisons
- Negatives matter: their quality and quantity affect learning and scalability
- Energy-based models: interpret learning as minimizing/maximizing an energy function
- Unifies seemingly different methods (contrastive, GANs, VAEs) under one frame
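One common way to write “pull positives together, push negatives apart” as a loss is the InfoNCE objective. The summary does not name a specific loss, so treat this as one illustrative formulation; the temperature value and the toy vectors are assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for a single anchor embedding.

    The loss is the negative log-probability of picking the positive out of
    {positive + negatives} by cosine similarity. It is small when the anchor
    is much closer to the positive than to any negative.
    """
    a = normalize(anchor)
    logits = [dot(a, normalize(positive)) / temperature]
    logits += [dot(a, normalize(n)) / temperature for n in negatives]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


anchor = [1.0, 0.0]
# Positive is nearly aligned with the anchor: low loss.
good_loss = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
# Positive is orthogonal while a negative is aligned: high loss.
bad_loss = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(good_loss, bad_loss)
```

Adding more negatives adds more terms to the denominator, which is one way to see why the number and quality of negatives matter.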
47:49 – 1:03:54
Data augmentation as the secret sauce (and the hidden human bias in ‘self-supervision’)
They dig into augmentation as the key mechanism for creating positive pairs in vision SSL—crops, color jitter, blur, and more. Ishan highlights an irony: augmentations encode strong human priors (e.g., color changes shouldn’t change object identity), and more realistic/learned augmentations could be a major breakthrough.
- Augmentation creates multiple ‘views’ of the same image for representation matching
- Common augmentations: crops, rotations, blur, brightness/contrast/color changes
- Human priors leak into SSL through augmentation design
- Future direction: parameterized/learned, content-aware, physically realistic augmentations
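A rough sketch of how two ‘views’ of one image might be produced. The image is a plain nested list so the example has no dependencies; the crop size, flip probability, and brightness range are hand-picked assumptions. Those hand-picked choices are exactly the kind of human prior the discussion points to.

```python
import random


def random_view(image, crop_size, rng):
    """Return one augmented view: random crop, optional flip, brightness jitter.

    image: H x W x 3 nested lists with values in [0, 1].
    """
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - crop_size)    # randint is inclusive on both ends
    left = rng.randint(0, w - crop_size)
    crop = [row[left:left + crop_size] for row in image[top:top + crop_size]]
    if rng.random() < 0.5:
        crop = [row[::-1] for row in crop]  # horizontal flip
    gain = rng.uniform(0.6, 1.4)            # brightness factor (assumed range)
    return [
        [[min(1.0, max(0.0, c * gain)) for c in px] for px in row]
        for row in crop
    ]


def two_views(image, crop_size=4, seed=0):
    """Two independently augmented views of the same image (a positive pair)."""
    rng = random.Random(seed)
    return random_view(image, crop_size, rng), random_view(image, crop_size, rng)


# Toy 8x8 RGB gradient image.
image = [[[(r + c) / 14.0, r / 7.0, c / 7.0] for c in range(8)] for r in range(8)]
view_a, view_b = two_views(image)
```

In a training loop, both views would be encoded and their embeddings pushed to agree. Libraries such as torchvision provide production versions of these transforms.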
1:03:54 – 1:10:15
Beyond contrastive: non-contrastive methods, collapse avoidance, and SwAV clustering
Ishan explains why non-contrastive approaches became popular: contrastive learning can require many negatives and careful sampling. They discuss collapse (all inputs map to the same representation) and how clustering/self-distillation methods prevent it, then detail SwAV’s online clustering with equipartition constraints.
- Non-contrastive motivation: reduce dependence on large numbers of negatives
- Collapse: the central failure mode for similarity-maximization objectives
- Self-distillation (teacher/student) and clustering as alternatives to contrastive
- SwAV: online clustering with fixed K and equipartition to avoid collapse
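The equipartition idea can be illustrated with a small Sinkhorn-Knopp normalization, the kind of procedure SwAV uses to balance cluster assignments. This is a simplified sketch, not the SwAV implementation: the `sinkhorn` helper, the `epsilon` value, the iteration count, and the toy scores are all assumptions.

```python
import math


def sinkhorn(scores, epsilon=0.05, n_iters=50):
    """Turn sample-to-cluster scores into balanced soft assignments.

    scores: n x k nested list (higher means sample i prefers cluster j).
    Returns q (n x k) where each row sums to 1 and, after convergence,
    each column carries about n / k total mass (equipartition).
    """
    n, k = len(scores), len(scores[0])
    m = max(max(row) for row in scores)
    q = [[math.exp((s - m) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Give every cluster (column) the same total mass n / k.
        col_sums = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / col_sums[j] * (n / k) for j in range(k)] for i in range(n)]
        # Make every sample (row) a probability distribution.
        row_sums = [sum(row) for row in q]
        q = [[v / rs for v in row] for row, rs in zip(q, row_sums)]
    return q


# Every sample's raw score prefers cluster 0: a recipe for collapse.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
q = sinkhorn(scores)
column_mass = [sum(row[j] for row in q) for j in range(2)]
```

Even though all four samples score cluster 0 highest, the balanced assignment sends the two weakest preferences to cluster 1. The model therefore cannot satisfy the objective by mapping everything to a single cluster.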
1:10:15 – 1:15:21
SEER: self-supervised learning ‘in the wild’ at billion-scale and what it reveals
They move from ImageNet-centric benchmarking to SEER’s billion-image pretraining on less-curated internet data. Ishan discusses dataset biases (even “uncurated” internet photos have framing and demographic skews), and shares the headline result: large-scale SSL can work robustly beyond curated benchmarks.
- ImageNet ‘cheat’: throw away labels but keep curated distribution biases
- SEER trains on ~1B internet images to test SSL in the wild
- Findings: large-scale self-supervised pretraining works without heavy filtering
- Reality check: internet data still contains photographer/user-base biases
1:15:21 – 1:21:06
Architectures and scaling: ConvNets vs transformers, RegNets, and distributed training realities
Ishan contrasts architecture choices for SSL (ConvNets and ViTs) and describes why SEER used RegNets for compute/memory efficiency. They also touch on the practicalities of training enormous models: lots of GPUs and systems constraints like synchronization and communication overhead.
- Both ConvNets and transformers can work well for SSL (task-dependent tradeoffs)
- RegNets: optimize not just FLOPs but activation/memory efficiency
- Data + augmentation + algorithm matter more than architecture choice alone
- Scaling challenges: distributed training, synchronization costs, communication minimization
1:21:06 – 1:24:15
VISSL: a practical PyTorch toolbox for SSL research and benchmarking
Ishan presents VISSL as an internal-to-open framework to standardize implementations and evaluations of vision SSL methods. They discuss the difficulty of creating small-scale ‘hello world’ setups that reliably predict large-scale behavior.
- VISSL: shared library of SSL methods + standardized evaluation tasks
- Solves reproducibility issues across inconsistent experimental setups
- Includes benchmarking work to make comparisons more meaningful
- Small-scale experiments often fail to translate to ImageNet or web-scale regimes
1:24:15 – 1:31:43
Multimodal self-supervision: aligning audio and video representations
They discuss learning from multiple modalities through cross-modal agreement: train separate audio and video networks and bring their embeddings together for synchronized pairs. The approach can yield strong video representations useful for action recognition and even localizing sound sources without labels.
- Multimodal setup: match embeddings from paired audio and video tracks
- Contrastive pairing: corresponding A/V are positives; mismatched are negatives
- Downstream gains: action recognition (e.g., Kinetics) and sound understanding
- Emergent behavior: localizing the sound source (guitar, mouth/voice) without supervision
1:31:43 – 1:48:24
Active learning and autonomy: asking the right questions and closing the loop in driving
Ishan argues active learning is powerful but inherently chicken-and-egg: you need some understanding to ask good questions. They connect this to autonomous driving ‘data engines’ that harvest edge cases via uncertainty or disagreement between model predictions and human actions, then retrain on the most informative failures.
- Learning by Asking Questions: agents choose questions to maximize learning value
- Key challenge: modeling what the system knows vs. what it doesn’t know
- Autonomous driving loop: collect edge cases via surprise/uncertainty and retrain
- Vision-only driving outlook: optimism tempered by domain complexity and safety requirements
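One simple way to “harvest” informative examples is to rank unlabeled samples by the entropy of the model’s predicted class distribution and send the most uncertain ones for labeling. Entropy is an assumed uncertainty measure here; the episode speaks of uncertainty and disagreement in general terms, and the helper names and toy predictions are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(predictions, budget):
    """Return indices of the `budget` samples with the highest predictive entropy."""
    ranked = sorted(
        range(len(predictions)),
        key=lambda i: entropy(predictions[i]),
        reverse=True,
    )
    return ranked[:budget]


predictions = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # nearly uniform: very uncertain
    [0.70, 0.20, 0.10],
    [0.50, 0.49, 0.01],
]
to_label = select_most_uncertain(predictions, budget=2)
print(to_label)
```

The chicken-and-egg problem shows up directly: the ranking is only as good as the model’s own probability estimates, so a poorly calibrated model will ask for the wrong examples.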
1:48:24 – 2:05:52
Limits of deep learning: data efficiency, guarantees, and the gap between learning and reasoning
They zoom out to foundational limitations: deep models are often data-hungry, struggle with one-shot generalization, and lack crisp correctness guarantees typical in classical algorithms. Ishan distinguishes snap recognition from deliberate reasoning and highlights ongoing challenges like continual learning and catastrophic forgetting.
- Data efficiency remains a core bottleneck (especially for rare edge cases)
- ML correctness is ‘nebulous’: failures aren’t treated like conventional bugs
- Learning vs reasoning: recognition is strong; compositional reasoning is weak
- Continual learning is underdeveloped; catastrophic forgetting persists
2:05:52 – 2:09:50
Emergence and beauty in SSL: objectness from DINO and what else might emerge
Ishan identifies a striking outcome: object boundaries and ‘objectness’ can emerge from simple SSL objectives without explicit segmentation labels. They speculate about other potentially emergent concepts (object permanence, rotation, counting) and what that implies about the richness of pixel-level signal.
- DINO-style SSL can yield attention maps aligned with object boundaries
- Surprise: simple crop-based objectives can produce high-level structure
- Potential emergent concepts: object permanence (from video), rotation, counting
- Implication: there is abundant usable structure in raw pixels
2:09:50 – 2:30:29
Skepticism of simulation, VR futures, and practical career advice (papers, tools, learning)
Ishan explains why he’s skeptical of simulation-first approaches: high cost, imperfect realism, and shifting real-world behavior (especially with humans in the loop). The conversation closes with advice on writing research papers (focus on one clear idea, write early), tooling choices (Python, PyTorch), perseverance in debugging, and reflections on meaning-of-life questions.
- Simulation is expensive and incomplete, especially for behavior and edge cases
- Worlds change over time (e.g., mixed human/AV traffic), breaking fixed simulators
- Writing advice: pick solvable problems, focus on one idea, start writing early
- Beginner advice: learn Python, embrace debugging/struggle, and stay hungry
