Jitendra Malik: Computer Vision | Lex Fridman Podcast #110
At a glance
WHAT IT’S REALLY ABOUT
Jitendra Malik explains why real computer vision is still hard
- Jitendra Malik and Lex Fridman explore why computer vision is fundamentally more difficult than it appears from human experience, and why the field repeatedly underestimates that difficulty. Malik argues that vision is deeply tied to cognition, prediction, and action, and that current deep learning systems solve only parts of the problem, often with unrealistic amounts of supervision and data. They discuss autonomous driving, 3D understanding, video and long-form activity recognition, and child-like learning as core open challenges. Malik also reflects on differences in compute between brains and machines, multimodal and embodied learning, the limits of end-to-end supervised learning, what makes a good research problem, and realistic pathways to human-level intelligence.
- They touch on ethical and societal implications of AI already deployed today and why fears of near-term AGI are misplaced relative to concrete risks from current systems.
IDEAS WORTH REMEMBERING
5 ideas
Human vision feels easy because it is mostly unconscious, masking extreme computational complexity.
Large portions of the primate cortex are devoted to vision, yet we experience perception as effortless, leading early AI researchers (and some today) to underestimate the difficulty of replicating it in machines.
The “fallacy of the successful first step” misleads researchers about progress in vision.
Many vision tasks allow quick progress to 50–90% performance, but pushing toward 99.9%—the level needed for safety-critical applications like driving—can take orders of magnitude more time and may expose qualitatively new challenges.
True visual intelligence requires tight coupling with cognition, prediction, and action.
Malik stresses that perception evolved to guide action; systems must not only label scenes but also build predictive models of agents (e.g., pedestrians vs. skateboarders) and reason about what will happen next in order to act safely.
Current deep learning is overly reliant on supervised, tabula rasa learning and unrealistic labels.
Humans arrive at driving, 3D understanding, and object recognition with years of general visual knowledge acquired through self-directed exploration, multimodal experience, and sparse linguistic cues—far richer than image–label pairs.
Child-like, embodied, multimodal learning is a promising direction for future vision systems.
Children actively manipulate objects, link touch, sight, and sound, and perform informal experiments that build causal models; Malik argues robotics and high-fidelity simulation should be used to emulate this developmental process in AI.
WORDS WORTH SAVING
5 quotes
There are many problems in vision where getting 50% of the solution you can get in one minute, getting to 90% can take you a day, getting to 99% may take you five years, and 99.99% may be not in your lifetime.
— Jitendra Malik
Perception always has to not tell us what is now, but what will happen, because what’s now is boring. It’s done, it’s over with.
— Jitendra Malik
At the age of 16, when they go into driver ed, they are already visual geniuses.
— Jitendra Malik
I don’t think we should create a single test of intelligence... I would rather have a list of ten different tasks.
— Jitendra Malik
The history of AI is when we have made progress at a slower rate than we expected.
— Jitendra Malik
High quality AI-generated summary created from speaker-labeled transcript.