Jitendra Malik: Computer Vision | Lex Fridman Podcast #110

Lex Fridman Podcast · Jul 21, 2020 · 1h 41m

Lex Fridman (host), Jitendra Malik (guest)

Why computer vision is harder than it appears from human perception
Autonomous driving, edge cases, and prediction of agents’ behavior
Limits of current deep learning: supervision, data hunger, and architectures
Child-like learning: multimodal, embodied, exploratory, and incremental
3D understanding, segmentation, and the three Rs: recognition, reconstruction, reorganization
Video and long-form activity understanding versus static image benchmarks
AI safety today, explainability, and realistic timelines for AGI
Research philosophy: choosing good problems and mentoring in computer vision

Jitendra Malik explains why real computer vision is still hard

Jitendra Malik and Lex Fridman explore why computer vision is fundamentally more difficult than it appears from human experience, and why the field repeatedly underestimates that difficulty. Malik argues that vision is deeply tied to cognition, prediction, and action, and that current deep learning systems solve only parts of the problem, often with unrealistic amounts of supervision and data. They discuss autonomous driving, 3D understanding, video and long-form activity recognition, and child-like learning as core open challenges. Malik also reflects on brain–computer compute differences, multimodal and embodied learning, the limits of end-to-end supervised learning, and what constitutes good research problems and realistic pathways to human-level intelligence.

They touch on ethical and societal implications of AI already deployed today and why fears of near-term AGI are misplaced relative to concrete risks from current systems.

Key Takeaways

Human vision feels easy because it is mostly unconscious, masking extreme computational complexity.

Large portions of the primate cortex are devoted to vision, yet we experience perception as effortless, leading early AI researchers (and some today) to underestimate the difficulty of replicating it in machines.

The “fallacy of the successful first step” misleads researchers about progress in vision.

Many vision tasks allow quick progress to 50–90% performance, but pushing toward 99% can take years, and 99.99% may be out of reach in a lifetime.

True visual intelligence requires tight coupling with cognition, prediction, and action.

Malik stresses that perception evolved to guide action; systems must not only label scenes but also build predictive models of other agents and of what will happen next.

Current deep learning is overly reliant on supervised, tabula rasa learning and unrealistic labels.

Humans arrive at driving, 3D understanding, and object recognition with years of general visual knowledge acquired through self-directed exploration, multimodal experience, and sparse linguistic cues—far richer than image–label pairs.

Child-like, embodied, multimodal learning is a promising direction for future vision systems.

Children actively manipulate objects, link touch, sight, and sound, and perform informal experiments that build causal models; Malik argues robotics and high-fidelity simulation should be used to emulate this developmental process in AI.

Key unsolved Hilbert-style problems include long-form video understanding and natural 3D learning.

We lack systems that can understand extended activities with goals and intentions, or infer robust 3D structure from normal life experience (moving around, seeing objects from many views) rather than from engineered CAD supervision.

AI risk is already real in deployed systems, regardless of distant AGI scenarios.

Malik views current harms—from biased decision systems to unsafe autonomous vehicles and large-scale recommendation algorithms—as more urgent than speculative AGI threats, arguing for continuous attention to safety, error bounds, and fairness.

Notable Quotes

There are many problems in vision where getting 50% of the solution you can get in one minute, getting to 90% can take you a day, getting to 99% may take you five years, and 99.99% may be not in your lifetime.

Jitendra Malik

Perception always has to not tell us what is now, but what will happen, because what’s now is boring. It’s done, it’s over with.

Jitendra Malik

At the age of 16, when they go into driver ed, they are already visual geniuses.

Jitendra Malik

I don’t think we should create a single test of intelligence... I would rather have a list of ten different tasks.

Jitendra Malik

The history of AI is when we have made progress at a slower rate than we expected.

Jitendra Malik

Questions Answered in This Episode

How could we practically build AI systems that learn like children—through embodied, multimodal exploration—at scale and with current hardware?

What kinds of new learning paradigms beyond supervised deep learning are most promising for achieving robust 3D and long-form video understanding?

Given Malik’s skepticism about short-term fully autonomous driving, what incremental deployment strategies might balance safety, usefulness, and realistic capabilities?

How should we design benchmarks that better capture real-world visual intelligence, such as assisting a blind person or understanding complex narratives in video?

What concrete steps can researchers and companies take today to reduce bias and harm in deployed AI systems, while still pushing for ambitious advances in vision and cognition?

Transcript Preview

Lex Fridman

The following is a conversation with Jitendra Malik, a professor at Berkeley and one of the seminal figures in the field of computer vision, the kind before the deep learning revolution and the kind after. He has been cited over 180,000 times and has mentored many world-class researchers in computer science. Quick summary of the ads. Two sponsors, one new one, which is BetterHelp, and an old goodie, ExpressVPN. Please consider supporting this podcast by going to betterhelp.com/lex and signing up at expressvpn.com/lexpod. Click the links, buy the stuff. It really is the best way to support this podcast and the journey I'm on. If you enjoy this thing, subscribe on YouTube, review it with five stars in Apple Podcasts, support it on Patreon, or connect with me on Twitter @LexFridman, however the heck you spell that. As usual, I'll do a few minutes of ads now and never any ads in the middle that can break the flow of the conversation. This show is sponsored by BetterHelp, spelled H-E-L-P, help. Check it out at betterhelp.com/lex. They figure out what you need and match you with a licensed professional therapist in under 48 hours. It's not a crisis line, it's not self-help, it's professional counseling done securely online. I'm a bit from the David Goggins line of creatures, as you may know, and so have some demons to contend with, usually on long runs or all-nights working, forever impossibly full of self-doubt. It may be because I'm Russian, but I think suffering is essential for creation, but I also think you can suffer beautifully in a way that doesn't destroy you. For most people, I think a good therapist can help in this, so it's at least worth a try. Check out their reviews. They're good. It's easy, private, affordable, available worldwide. You can communicate by text any time and schedule weekly audio and video sessions. I highly recommend that you check them out at betterhelp.com/lex. This show is also sponsored by ExpressVPN. 
Get it at expressvpn.com/lexpod to support this podcast and to get an extra three months free on a one-year package. I've been using ExpressVPN for many years. I love it. I think ExpressVPN is the best VPN out there. They told me to say it, but it happens to be true. It doesn't log your data, it's crazy fast, and is easy to use. Literally just one big, sexy power on button. Again, for obvious reasons, it's really important that they don't log your data. It works on Linux and everywhere else too, but really, why use anything else? Shout out to my favorite flavor of Linux, Ubuntu Mate 20.04. Once again, get it at expressvpn.com/lexpod to support this podcast and to get an extra three months free on a one-year package. And now, here's my conversation with Jitendra Malik. In 1966, Seymour Papert at MIT wrote up a proposal called the Summer Vision Project to be given, as far as we know, to 10 students to work on and solve that summer. So that proposal outlined many of the computer vision tasks we still work on today. Why do you think we underestimate, and perhaps we did underestimate and perhaps still underestimate, how hard computer vision is?
