Lex Fridman Podcast

Jitendra Malik: Computer Vision | Lex Fridman Podcast #110

Jitendra Malik is a professor at Berkeley and one of the seminal figures in the field of computer vision, the kind before the deep learning revolution, and the kind after. He has been cited over 180,000 times and has mentored many world-class researchers in computer science.

Support this podcast by supporting our sponsors:
- BetterHelp: http://betterhelp.com/lex
- ExpressVPN: https://www.expressvpn.com/lexpod

EPISODE LINKS:
Jitendra's website: https://people.eecs.berkeley.edu/~malik/
Jitendra's wiki: https://en.wikipedia.org/wiki/Jitendra_Malik

PODCAST INFO:
Podcast website: https://lexfridman.com/podcast
Apple Podcasts: https://apple.co/2lwqZIr
Spotify: https://spoti.fi/2nEwCF8
RSS: https://lexfridman.com/feed/podcast/
Full episodes playlist: https://www.youtube.com/playlist?list=PLrAXtmErZgOdP_8GztsuKi9nrraNbKKp4
Clips playlist: https://www.youtube.com/playlist?list=PLrAXtmErZgOeciFP3CBCIEElOJeitOr41

OUTLINE:
0:00 - Introduction
3:17 - Computer vision is hard
10:05 - Tesla Autopilot
21:20 - Human brain vs computers
23:14 - The general problem of computer vision
29:09 - Images vs video in computer vision
37:47 - Benchmarks in computer vision
40:06 - Active learning
45:34 - From pixels to semantics
52:47 - Semantic segmentation
57:05 - The three R's of computer vision
1:02:52 - End-to-end learning in computer vision
1:04:24 - 6 lessons we can learn from children
1:08:36 - Vision and language
1:12:30 - Turing test
1:16:17 - Open problems in computer vision
1:24:49 - AGI
1:35:47 - Pick the right problem

CONNECT:
- Subscribe to this YouTube channel
- Twitter: https://twitter.com/lexfridman
- LinkedIn: https://www.linkedin.com/in/lexfridman
- Facebook: https://www.facebook.com/LexFridmanPage
- Instagram: https://www.instagram.com/lexfridman
- Medium: https://medium.com/@lexfridman
- Support on Patreon: https://www.patreon.com/lexfridman

Lex Fridman (host) · Jitendra Malik (guest)
Jul 20, 2020 · 1h 41m

At a glance

WHAT IT’S REALLY ABOUT

Jitendra Malik explains why real computer vision is still hard

  1. Jitendra Malik and Lex Fridman explore why computer vision is fundamentally more difficult than it appears from human experience, and why the field repeatedly underestimates that difficulty. Malik argues that vision is deeply tied to cognition, prediction, and action, and that current deep learning systems solve only parts of the problem, often with unrealistic amounts of supervision and data. They discuss autonomous driving, 3D understanding, video and long-form activity recognition, and child-like learning as core open challenges. Malik also reflects on the differences in computational capacity between brains and computers, multimodal and embodied learning, the limits of end-to-end supervised learning, and what constitutes good research problems and realistic pathways to human-level intelligence.
  2. They touch on the ethical and societal implications of AI systems already deployed today, and why fears of near-term AGI are misplaced relative to the concrete risks posed by current systems.

IDEAS WORTH REMEMBERING

5 ideas

Human vision feels easy because it is mostly unconscious, masking extreme computational complexity.

Large portions of the primate cortex are devoted to vision, yet we experience perception as effortless, leading early AI researchers (and some today) to underestimate the difficulty of replicating it in machines.

The “fallacy of the successful first step” misleads researchers about progress in vision.

Many vision tasks allow quick progress to 50–90% performance, but pushing toward 99.9%—the level needed for safety-critical applications like driving—can take orders of magnitude more time and may expose qualitatively new challenges.

True visual intelligence requires tight coupling with cognition, prediction, and action.

Malik stresses that perception evolved to guide action; systems must not only label scenes but also build predictive models of agents (e.g., pedestrians vs. skateboarders) and reason about what will happen next in order to act safely.

Current deep learning is overly reliant on supervised, tabula rasa learning and unrealistic labels.

Humans arrive at driving, 3D understanding, and object recognition with years of general visual knowledge acquired through self-directed exploration, multimodal experience, and sparse linguistic cues—far richer than image–label pairs.

Child-like, embodied, multimodal learning is a promising direction for future vision systems.

Children actively manipulate objects, link touch, sight, and sound, and perform informal experiments that build causal models; Malik argues robotics and high-fidelity simulation should be used to emulate this developmental process in AI.

WORDS WORTH SAVING

5 quotes

There are many problems in vision where getting 50% of the solution you can get in one minute, getting to 90% can take you a day, getting to 99% may take you five years, and 99.99% may be not in your lifetime.

Jitendra Malik

Perception always has to not tell us what is now, but what will happen, because what’s now is boring. It’s done, it’s over with.

Jitendra Malik

At the age of 16, when they go into driver ed, they are already visual geniuses.

Jitendra Malik

I don’t think we should create a single test of intelligence... I would rather have a list of ten different tasks.

Jitendra Malik

The history of AI is when we have made progress at a slower rate than we expected.

Jitendra Malik

- Why computer vision is harder than it appears from human perception
- Autonomous driving, edge cases, and prediction of agents’ behavior
- Limits of current deep learning: supervision, data hunger, and architectures
- Child-like learning: multimodal, embodied, exploratory, and incremental
- 3D understanding, segmentation, and the three Rs: recognition, reconstruction, reorganization
- Video and long-form activity understanding versus static image benchmarks
- AI safety today, explainability, and realistic timelines for AGI
- Research philosophy: choosing good problems and mentoring in computer vision

High quality AI-generated summary created from speaker-labeled transcript.
