Jitendra Malik: Computer Vision | Lex Fridman Podcast #110
EVERY SPOKEN WORD
150 min read · 30,044 words
- 0:00 – 3:17
Introduction
- LFLex Fridman
The following is a conversation with Jitendra Malik, a professor at Berkeley and one of the seminal figures in the field of computer vision, the kind before the deep learning revolution and the kind after. He has been cited over 180,000 times and has mentored many world-class researchers in computer science. Quick summary of the ads. Two sponsors, one new one, which is BetterHelp, and an old goodie, ExpressVPN. Please consider supporting this podcast by going to betterhelp.com/lex and signing up at expressvpn.com/lexpod. Click the links, buy the stuff. It really is the best way to support this podcast and the journey I'm on. If you enjoy this thing, subscribe on YouTube, review it with five stars in Apple Podcasts, support it on Patreon, or connect with me on Twitter @LexFridman, however the heck you spell that. As usual, I'll do a few minutes of ads now and never any ads in the middle that can break the flow of the conversation. This show is sponsored by BetterHelp, spelled H-E-L-P, help. Check it out at betterhelp.com/lex. They figure out what you need and match you with a licensed professional therapist in under 48 hours. It's not a crisis line, it's not self-help, it's professional counseling done securely online. I'm a bit from the David Goggins line of creatures, as you may know, and so have some demons to contend with, usually on long runs or all-nights working, forever impossibly full of self-doubt. It may be because I'm Russian, but I think suffering is essential for creation, but I also think you can suffer beautifully in a way that doesn't destroy you. For most people, I think a good therapist can help in this, so it's at least worth a try. Check out their reviews. They're good. It's easy, private, affordable, available worldwide. You can communicate by text any time and schedule weekly audio and video sessions. I highly recommend that you check them out at betterhelp.com/lex. This show is also sponsored by ExpressVPN. 
Get it at expressvpn.com/lexpod to support this podcast and to get an extra three months free on a one-year package. I've been using ExpressVPN for many years. I love it. I think ExpressVPN is the best VPN out there. They told me to say it, but it happens to be true. It doesn't log your data, it's crazy fast, and is easy to use. Literally just one big, sexy power on button. Again, for obvious reasons, it's really important that they don't log your data. It works on Linux and everywhere else too, but really, why use anything else? Shout out to my favorite flavor of Linux, Ubuntu Mate 20.04. Once again, get it at expressvpn.com/lexpod to support this podcast and to get an extra three months free on a one-year package. And now, here's my conversation with Jitendra Malik.
- 3:17 – 10:05
Computer vision is hard
- LFLex Fridman
In 1966, Seymour Papert at MIT wrote up a proposal called the Summer Vision Project, to be given, as far as we know, to 10 students to work on and solve that summer. That proposal outlined many of the computer vision tasks we still work on today. Why do you think we underestimated, and perhaps still underestimate, how hard computer vision is?
- JMJitendra Malik
Because most of what we do in vision we do unconsciously or subconsciously.
- LFLex Fridman
In human vision.
- JMJitendra Malik
In human vision. That effortlessness gives us the sense that this must be very easy to implement in a computer. This is why the early researchers in AI got it so wrong. However, if you go into the neuroscience or psychology of human vision, the complexity becomes very clear. The fact is that a very large part of the cerebral cortex is devoted to visual processing, and this is true in other primates as well. So once we looked at it from a neuroscience or psychology perspective, it became quite clear that the problem is very challenging and will take some time.
- LFLex Fridman
You said the higher level parts are the harder parts?
- JMJitendra Malik
I think vision appears to be easy because most of visual processing is subconscious or unconscious.
- LFLex Fridman
Right, right.
- JMJitendra Malik
So we underestimate the difficulty. Whereas when you are proving a mathematical theorem or playing chess, the difficulty is much more evident, because it is your conscious brain which is processing the various aspects of the problem-solving behavior. Whereas in vision, all this is happening, but it's not in your awareness; it's operating below that.
- LFLex Fridman
But it still seems strange. Yes, that's true, but it seems strange that computer vision researchers, the community broadly, time and time again make the mistake of thinking the problem is easier than it is. Or maybe it's not a mistake. We'll talk a little bit about autonomous driving, for example.
- JMJitendra Malik
Mm-hmm.
- LFLex Fridman
How hard of a vision task that is. Do you think... I mean, is it just human nature, or is there something fundamental to the vision problem that we underestimate? Are we still not able to be cognizant of how hard the problem is?
- JMJitendra Malik
Yeah, I think in the early days it could have been excused, because in the early days all aspects of AI were regarded-
- LFLex Fridman
Right.
- JMJitendra Malik
... as too easy. But I think today it is much less excusable. And I think why people fall for this is because of what I call the fallacy of the successful first step.
- LFLex Fridman
(laughs) Yeah.
- JMJitendra Malik
There are many problems in vision where getting 50% of the solution takes one minute, getting to 90% can take you a day, getting to 99% may take you five years, and 99.99% may be not in your lifetime.
- LFLex Fridman
I wonder if that's unique to vision. It seems that with language people are not so confident; in natural language processing, people are a little bit more cautious about our ability to solve that problem. I think for language, people intuit that we have to be able to do natural language understanding. For vision, it seems that we're not cognizant of, or we don't think about, how much understanding is required. It's probably still an open problem. But in your sense, how much understanding is required to solve vision? Put another way, how much of something called common sense reasoning is required to really be able to interpret even static scenes?
- JMJitendra Malik
Yeah. So vision operates at all levels, and there are parts which can be solved with what we could call peripheral processing. In the human vision literature, there used to be these terms sensation, perception, and cognition, which roughly speaking referred to the front end of processing, the middle stages of processing, and the higher levels of processing. And they made a big deal out of this: they wanted to study only perception and dismiss certain problems as being, quote, "cognitive." But really, I think these are artificial divides. The problem is continuous at all levels, and there are challenges at all levels. The techniques that we have today work better at the lower and mid levels of the problem. The higher, quote, "cognitive" levels of the problem are there, and in many real applications we have to confront them. Now, how much of that is necessary will depend on the application. For some problems it doesn't matter; for some problems it matters a lot. So I am, for example, a pessimist on fully autonomous driving in the near future. And the reason is because I think there will be that 0.01% of cases where quite sophisticated cognitive reasoning is called for. However, there are tasks which are much more robust, in the sense that error is not so much of a problem. For example, let's say you're doing image search: you're trying to get images based on some visual description. We are very tolerant of errors there, right? When Google Image Search gives you some images back and a few of them are wrong, it's okay. It doesn't hurt anybody; it's not a matter of life and death.
But making mistakes when you are driving at 60 miles per hour, where you could potentially kill somebody, is much more serious.
- 10:05 – 21:20
Tesla Autopilot
- LFLex Fridman
So just for the fun of it, since you mentioned it, let's go there briefly: autonomous vehicles. One of the companies in the space, Tesla, with Andrej Karpathy and Elon Musk, is working on a system called Autopilot, which is primarily a vision-based system with eight cameras and basically a single multitask neural network. They call it HydraNet: multiple heads, so it does multiple tasks, but it's forming the same representation at the core. Do you think driving can be converted in this way to purely a vision problem and then solved with learning? Or, even more specifically, in the current approach, what do you think about what the Tesla Autopilot team is doing?
- JMJitendra Malik
So the way I think about it is that there are certainly subsets of the vision-based driving problem which are quite solvable. So for example, driving in freeway conditions is a quite solvable problem. There were demonstrations of that going back to the 1980s by Ernst Dickmanns in Munich. In the '90s, there were approaches from Carnegie Mellon, and there were approaches from a team at Berkeley. In the 2000s, there were approaches from Stanford and so on. So autonomous driving in certain settings is very doable. The challenge is to have an autopilot work under all kinds of driving conditions. At that point, it's not just a question of vision or perception, but really also of control and dealing with all the edge cases.
- LFLex Fridman
So where do you think most of the difficult cases are? To me, even the highway driving is an open problem, because it obeys the same 50, 90, 95, 99 rule, the fallacy of the first step, I forget how you put it-
- JMJitendra Malik
Yeah.
- LFLex Fridman
... which it will fall victim to. I think even highway driving has a lot of elements, because to solve autonomous driving you have to completely relinquish the help of a human being. The system is always in control, so you're really going to feel the edge cases. So I think even highway driving is really difficult. But in terms of the general driving task, do you think vision is the fundamental problem? Or is it also the action, the interaction with the environment? And then there's the middle ground, I don't know if you put that under vision, which is trying to predict the behavior of others, which is a little bit in the world of understanding the scene.
- JMJitendra Malik
Mm-hmm.
- LFLex Fridman
But it's also trying to form a model of the actors in the scene and predict their behavior.
- JMJitendra Malik
Yeah, I include that in vision, because to me, perception blends into cognition, and building predictive models of other agents in the world is part of the task of perception. Other agents could be people; other agents could be other cars. Because perception always has to tell us not just what is now, but what will happen, because what's now is boring. It's done, it's over with.
- LFLex Fridman
(laughs)
- JMJitendra Malik
Okay?
- LFLex Fridman
Yeah.
- JMJitendra Malik
We care about the future because we act in the future.
- LFLex Fridman
And we care about the past inasmuch as it informs what's going to happen in the future.
- JMJitendra Malik
Yeah. So I think we have to build predictive models of the behaviors of people, and those can get quite complicated. I've seen examples of this. Actually, I own a Tesla, and it has various safety features built in. And what I see are these examples where, let's say, there is some skateboarder. I don't want to be too critical, because these systems are always being improved, and for any specific criticism I have, maybe the system six months from now will not have that particular failure mode. But it had the wrong response, and it's because it couldn't predict what this skateboarder was going to do, okay? Because it really required that higher-level cognitive understanding of what skateboarders typically do, as opposed to a normal pedestrian. So what might have been the correct response for a typical pedestrian was not the right response for a skateboarder, right?
- LFLex Fridman
Yeah.
- JMJitendra Malik
So therefore, to do a good job there, you need to have enough data where you have pedestrians and you also have skateboarders. You've seen enough skateboarders to see what kinds of patterns of behavior they have.
- LFLex Fridman
Yeah.
- JMJitendra Malik
So in principle, with enough data, that problem could be solved. But I think our current computer vision systems need far, far more data than humans do for learning those same capabilities.
- LFLex Fridman
So, say that there is going to be a system that solves autonomous driving. Do you think it will look similar to what we have today, but with a lot more data and perhaps more compute, while the fundamental architecture stays the same, like neural... Well, in the case of Tesla Autopilot, it's neural networks. Do you think it will look similar in that regard and just have more data?
- JMJitendra Malik
That's a scientific hypothesis as to which way it's going to go. I will tell you what I would bet on, and this is my general philosophical position on these learning systems. What we have found currently very effective in computer vision, in the deep learning paradigm, is sort of tabula rasa learning: tabula rasa learning in a supervised way with lots and lots of examples.
- LFLex Fridman
What's tabula rasa learning?
- JMJitendra Malik
Tabula rasa in the sense of a blank slate: we just have a system which is given a series of experiences in this setting, and then it learns there. Now, let's think about human driving. It is not tabula rasa learning. At the age of 16, in high school, a teenager goes into driver ed class, right? And at that point they learn, but at the age of 16 they are already visual geniuses, because from zero to 16 they have built up a certain repertoire of vision. In fact, most of it has probably been achieved by age two. In this period up to age two, they learn that the world is three-dimensional, what objects look like from different perspectives, about occlusion, and about common dynamics of humans and other bodies; they have some notion of intuitive physics. They have built that up from their observations and interactions in early childhood, and of course reinforced it through growing up to age 16. So then at age 16, when they go into driver ed, what are they learning? They're not learning the visual world afresh; they have a mastery of the visual world. What they are learning is control, okay? They're learning how to be smooth about control, about steering and brakes and so forth.
- LFLex Fridman
Okay.
- JMJitendra Malik
They're learning a sense of typical traffic situations. Now, that education process can be quite short because they are coming in as visual geniuses. And of course, in the future they're going to encounter situations which are very novel, right? During my driver ed class, I may not have had to deal with a skateboarder; I may not have had to deal with a truck driving in front of me whose back opens up and some junk gets dropped from the truck-
- LFLex Fridman
Mm-hmm.
- JMJitendra Malik
... and I have to deal with it, right? But I can deal with this as a driver even though I did not encounter it in my driver ed class. And the reason I can deal with it is because I have all this general visual knowledge and expertise.
- LFLex Fridman
And do you think the learning mechanisms we have today can do that kind of long-term accumulation of knowledge? Or do we have to do something else? You know, the work that led up to expert systems, with knowledge representation, the broader field of artificial intelligence, worked on this kind of accumulation of knowledge. Do you think neural networks can do the same?
- JMJitendra Malik
I don't see any in-principle problem with neural networks doing it, but I think the learning techniques would need to evolve significantly. The current learning techniques that we have are supervised learning: you're given lots of examples, (x_i, y_i) pairs, and you learn the functional mapping between them. I think human learning is far richer than that. It includes many different components. A child explores the world. For example, a child takes an object and manipulates it in his or her hand, and therefore gets to see the object from different points of view, and the child has commanded the movement. So that's a kind of learning data, but the learning data has been arranged by the child, and this is a very rich kind of data. The child can do various experiments with the world. So there are many aspects of human learning, and these have been studied in child development by psychologists. And what they tell us is that supervised learning is a very small part of it. There are many different aspects of learning. What we would need to do is develop models of all of these, and then train our systems with that kind of protocol.
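The supervised, tabula rasa setup Malik contrasts with human learning, fitting a mapping from labeled pairs (x_i, y_i) with nothing else built in, can be sketched in a few lines. The data, the hidden rule, and the least-squares model below are all invented for illustration:

```python
import numpy as np

# Minimal sketch of tabula-rasa supervised learning: the learner starts
# with no prior knowledge and sees only labeled pairs (x_i, y_i).
rng = np.random.default_rng(0)

# Synthetic "experiences": inputs x_i and labels y_i from a hidden rule.
X = rng.normal(size=(200, 3))          # 200 examples, 3 features each
true_w = np.array([1.5, -2.0, 0.5])    # the rule the learner never sees
y = X @ true_w + rng.normal(scale=0.1, size=200)

# Learn the functional mapping x -> y by least squares: nothing about
# the task is built in except the model family itself.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
# w_hat recovers approximately [1.5, -2.0, 0.5], but only because the
# learner was handed enough labeled examples.
```

The point of the contrast in the conversation is that everything this learner knows comes from the labeled pairs, whereas a child arranges much of its own training data.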
- LFLex Fridman
So, new methods of learning?
- 21:20 – 23:14
Human brain vs computers
- LFLex Fridman
Do you think there's something interesting, valuable to consider about the difference in the computational power of the human brain versus the computers of today, in terms of instructions per second?
- JMJitendra Malik
Yes. So this is a point I've been making for 20 years now.
- LFLex Fridman
Yeah.
- JMJitendra Malik
And I think once upon a time, the way I used to argue this was that we just didn't have the computing power of the human brain; our computers were not quite there. There is a well-known trade-off: we know that neurons are slow compared to transistors, but we have a lot of them and they have very high connectivity. Whereas in silicon you have much faster devices, transistors which switch on the order of nanoseconds, but the connectivity is usually smaller.
- LFLex Fridman
Right.
- JMJitendra Malik
At this point in time, and we are now talking about 2020, we do have, if you consider the latest GPUs and so on, amazing computing power. And if we look back at Hans Moravec's type of calculations, which he did in the 1990s, we may be there today in terms of computing power comparable to the brain, but it's not of the same style. It's of a very different style. For example, the style of computing that we have in our GPUs is far, far more power-hungry than the style of computing in the human brain or other biological entities.
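Moravec-style comparisons like the one above can be reproduced as back-of-envelope arithmetic. Every figure below is a rough, commonly cited order of magnitude, not a number from the conversation:

```python
# Back-of-envelope comparison of brain vs. GPU compute, in the spirit
# of Moravec's 1990s estimates. All constants are rough textbook
# orders of magnitude, not measurements.
NEURONS = 1e11             # ~10^11 neurons in the human brain
SYNAPSES_PER_NEURON = 1e4  # ~10^4 connections each (high connectivity)
FIRING_RATE_HZ = 1e2       # neurons are slow: ~100 Hz vs. GHz transistors

brain_ops = NEURONS * SYNAPSES_PER_NEURON * FIRING_RATE_HZ  # ~1e17 synaptic ops/s
brain_watts = 20.0         # the brain runs on roughly 20 W

gpu_ops = 1e14             # ~100 TFLOPS for a 2020-era datacenter GPU
gpu_watts = 300.0          # typical board power

print(f"brain: {brain_ops / brain_watts:.1e} ops/J")
print(f"gpu:   {gpu_ops / gpu_watts:.1e} ops/J")
# The brain comes out several orders of magnitude more energy-efficient,
# which is the "different style of computing" point being made.
```

On these rough numbers the raw throughput is comparable, but the energy per operation differs by roughly four orders of magnitude.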
- LFLex Fridman
Yeah, and the efficiency part is something we're going to have to solve in order to build actual real-world systems at large scale.
- 23:14 – 29:09
The general problem of computer vision
- LFLex Fridman
Let me ask sort of a high-level question, taking a step back. How would you articulate the general problem of computer vision? Does such a thing exist? If you look at the computer vision conferences and the work that's been going on, it's often separated into different little segments, breaking the problem of vision apart into segmentation, or 3D reconstruction, object detection, image captioning, whatever, and there are benchmarks for each. But if you were to sort of philosophically say, "What is the big problem of computer vision?" Does such a thing exist?
- JMJitendra Malik
Yes, but not in isolation. For all intelligence tasks, I always go back to biology, to humans. And if you think about vision or perception in that setting, we realize that perception is always there to guide action. Perception, for a biological system, does not give any benefit unless it is coupled with action. We can go back and think about the first multicellular animals which arose in the Cambrian era, you know, 500 million years ago. These animals could move and they could see in some way, and the two activities helped each other. Well, how does movement help? Movement helps because you can get food in different places.
- LFLex Fridman
Right.
- JMJitendra Malik
But you need to know where to go, and that's really about perception, or seeing. Vision is perhaps the single most important perceptual sense, but the others are also important. So perception and action kind of go together. Earlier, this was in very simple feedback loops, which were about finding food, or avoiding becoming food if there's a predator trying to, you know, eat you up, and so forth. So we must, at the fundamental level, connect perception to action. Then, as we evolved, perception became more and more sophisticated because it served many more purposes. So today we have what seems like a fairly general-purpose capability which can look at the external world and build a model of the external world inside the head. We do have that capability. That model is not perfect, and psychologists have great fun pointing out the ways in which the model in your head is not a perfect model of the external world; they create various illusions to show the ways in which it is imperfect. But it's amazing how far it has come from the very simple perception-action loop that exists in, you know, an animal 500 million years ago. Once we have these very sophisticated visual systems, we can then impose a structure on them. It's we, as scientists, who are imposing that structure when we choose to characterize this part of the system as the, quote, "module of object detection," or the, quote, "module of 3D reconstruction." What's going on is really that all of these processes are running simultaneously, and they are running simultaneously because originally their purpose was in fact to help guide action.
- LFLex Fridman
So as a guiding general statement of the problem: you said that in humans vision is tied to action. Do you think we should also say that ultimately the goal, the problem of computer vision, is to sense the world in a way that helps you act in the world?
- JMJitendra Malik
Yes, I think that's the most fundamental purpose. We have by now hyper-evolved, so we have this visual system which can be used for other things, for example, judging the aesthetic value of a painting.
- LFLex Fridman
Right.
- JMJitendra Malik
And this is not guiding action. Maybe it's guiding action in terms of how much money you will put in your auction bid, but that's a bit of a stretch. The basics are in fact in terms of action; we have just hyper-evolved our visual system.
- LFLex Fridman
Actually, sorry to interrupt, but perhaps it is fundamentally about action. You kind of jokingly said it's about spending, but perhaps the capitalistic drive that drives a lot of the development in this world (laughs) is about the exchange of money, and the fundamental action is spending money. If you watch Netflix, if you enjoy watching movies, you're using your perception system to interpret the movie. Ultimately, your enjoyment of that movie means you'll subscribe to Netflix. So the action is this extra layer that we've developed in modern society; perhaps it is (laughs) fundamentally tied to the action of spending money.
- JMJitendra Malik
Well, certainly with respect to, you know, interactions with firms.
- LFLex Fridman
Right.
- JMJitendra Malik
So in this homo economicus role, when you're interacting with firms, it does become that. That's-
- LFLex Fridman
What else is there? (laughs)
- JMJitendra Malik
Oh... (laughs)
- LFLex Fridman
Uh, no, it was a rhetorical question.
- JMJitendra Malik
Yeah.
- LFLex Fridman
Okay. So
- 29:09 – 37:47
Images vs video in computer vision
- LFLex Fridman
To linger on the division between the static and the dynamic: so much of the work in computer vision, so many of the breakthroughs that you've been a part of, have been in the static world, in looking at static images. And then you've also worked on, though to a much smaller degree, and the community is looking at, the dynamic: video, dynamic scenes. And then there is robotic vision, which is dynamic, but also where you actually have a robot in the physical world interacting based on that vision. Which problem is harder? The sort of trivial first answer is, well, of course one image is harder. But if you look at a deeper question there: are we, what's the term, cutting ourselves off at the knees, making the problem harder, by focusing on images?
- JMJitendra Malik
That's a fair question. I think sometimes we can simplify a problem so much that we essentially lose part of the juice that could enable us to solve the problem. And one could reasonably argue that, to some extent, this happens when we go from video to single images. Now, historically, you have to consider the limits imposed by the computational capabilities we had. Many of the choices made in the computer vision community through the '70s, '80s, and '90s can be understood as choices which were forced upon us by the fact that we just didn't have access to enough compute.
- LFLex Fridman
Not enough memory, not enough hard drives.
- JMJitendra Malik
Exactly. Not enough compute, not enough storage. So think of these choices. One of the choices is focusing on single images rather than video; the clear reason is storage and compute. We used to detect the edges and throw away the image. So you have an image which is, say, 256 by 256 pixels, and instead of keeping around a grayscale value at every pixel, what we did was detect edges, find the places where the brightness changes a lot, and then throw away the rest. This was a major compression device, and the hope was that you can still work with it; the logic was that humans can interpret a line drawing. And yes, this saves on computation. So many of the choices were dictated by that. Today we are no longer detecting edges; we process images with ConvNets, because we don't have those compute restrictions anymore. Now, video is still understudied, because video compute is still quite challenging if you are a university researcher. I think video computing is not so challenging if you are at Google or Facebook or Amazon.
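The detect-edges-and-discard pipeline described here can be sketched with simple finite differences. The toy image and threshold below are invented for illustration, standing in for classical detectors like Sobel or Canny:

```python
import numpy as np

# Sketch of the old compress-to-edges pipeline: take a 256x256
# grayscale image, find where brightness changes sharply, and keep
# only those edge locations, discarding the rest of the image.
rng = np.random.default_rng(0)
img = np.zeros((256, 256), dtype=float)
img[:, 128:] = 1.0                       # toy image: dark left, bright right
img += rng.normal(scale=0.01, size=img.shape)  # a little sensor noise

# Crude finite-difference gradients (a stand-in for Sobel/Canny).
gx = np.abs(np.diff(img, axis=1))        # horizontal brightness changes
gy = np.abs(np.diff(img, axis=0))        # vertical brightness changes
edges = (gx[:-1, :] + gy[:, :-1]) > 0.5  # threshold on gradient magnitude

# The "compression": store only edge coordinates instead of all pixels.
edge_coords = np.argwhere(edges)
print(edges.mean())  # about 0.004: only the boundary column survives
```

A line drawing kept a fraction of a percent of the raw pixels, which is exactly why it was attractive when compute and storage were scarce.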
- LFLex Fridman
Still super challenging.
- JMJitendra Malik
Yeah.
- LFLex Fridman
I just spoke with a VP of engineering at Google, the head of YouTube search and discovery, and they still struggle doing stuff on video. It's very difficult, except when using techniques that are essentially the techniques you used in the '90s.
- JMJitendra Malik
Yeah.
- LFLex Fridman
Some very basic computer vision techniques.
- JMJitendra Malik
Yeah, no, that's when you want to do things at scale. So if y-
- LFLex Fridman
Right.
- JMJitendra Malik
... if you want to operate at the scale of all the content of YouTube, it's very challenging, and there are similar issues with Facebook. But as a researcher, you-
- LFLex Fridman
Right.
- JMJitendra Malik
... you have more, you know, opportunities.
- LFLex Fridman
You can train large networks with-
- JMJitendra Malik
Yeah.
- LFLex Fridman
... relatively large video datasets, yeah.
- JMJitendra Malik
Yes. So I think this is part of the reason why we have emphasized static images. I think this is changing, and over the next few years I see a lot more progress happening in video. I have this generic statement that, to me, video recognition feels like it's 10 years behind object recognition. And you can quantify that, because on some of the challenging video datasets, performance on action classification is, say, 30%, which is kind of what we used to have around 2009 in object detection. It's about 10 years behind. And whether it'll take 10 years to catch up is a different question; hopefully it will take less than that.
- LFLex Fridman
Let me ask a similar question I've already asked, but once again: for dynamic scenes, do you think some kind of injection of knowledge bases and reasoning is required to help improve, say, action recognition? If we solved the general action recognition problem, what do you think the solution would look like? That's another way of putting it.
- JMJitendra Malik
Yeah. I completely agree that knowledge is called for, and that knowledge can be quite sophisticated. The way I would say it is that perception blends into cognition.
- LFLex Fridman
Mm-hmm.
- JMJitendra Malik
And cognition brings in issues of memory and this notion of a schema from psychology. Let me use the classic example: you go to a restaurant, right? Now, there are things that happen in a certain order. You walk in, somebody takes you to a table, the waiter comes and gives you a menu, takes the order, the food arrives, eventually the bill arrives, et cetera. There's a classic example of AI from the 1970s: there were these terms frames and scripts and schemas, which are all quite similar ideas. And in the '70s, the way the AI of the time dealt with this was by hand-coding it.
- LFLex Fridman
Right.
- JMJitendra Malik
So they hand-coded in this notion of a script, with the various stages and the actors and so on, and used that to interpret, for example, language. If there's a description of a story involving some people eating at a restaurant, there are all these inferences you can make, because you know what happens typically at a restaurant. So I think this kind of knowledge is absolutely essential. When we are going to do long-form video understanding, we are going to need this. The kinds of technology that we have right now, 3D convolutions over a couple of seconds of video clip, are very much tailored towards short-term video understanding, not long-term understanding. Long-term understanding requires this notion of a schema that I talked about, perhaps some notions of goals, intentionality, functionality, and so on. Now, how will we bring that in? We could either revert back to the '70s and say, "Okay, I'm going to hand-code in a script," or we might try to learn it. I tend to believe that we have to find learning ways of doing this, because I think learning ways will end up being more robust. And there must be a learning version of this story, because children acquire a lot of this knowledge just by observation. At no moment in a child's life, typically, does a mother coach the child through all the stages of what happens in a restaurant. It's possible, but it's not typical. They just go as a family: they go to the restaurant, they eat, they come back, and after the child goes through 10 such experiences, the child has got a schema of what happens when you go to a restaurant. So we somehow need to provide that capability to our
- 37:47 – 40:06
Benchmarks in computer vision
- JMJitendra Malik
systems.
- LFLex Fridman
You mentioned the following line from the end of the Alan Turing paper, Computing Machinery and Intelligence, that, like you said, many people know and very few have read-
- JMJitendra Malik
Okay.
- LFLex Fridman
... where (laughs) he proposes the Turing test. This is how you know, because it's towards the end of the paper.
- JMJitendra Malik
Yeah.
- LFLex Fridman
"Instead of trying to produce a program to simulate the adult mind, why not rather try to produce one which simulates the child's?" So that's a really interesting point. And if I think about the benchmarks we have before us, the tests of our computer vision systems, they're often kind of trying to get to the adult. So what kind of benchmarks should we have? What kind of tests for computer vision do you think we should have that mimic the child's?
- JMJitendra Malik
Yeah. I think we should have those, and we don't have those today. And I think part of the challenge is that we should really be collecting data of the type that a child experiences.
- LFLex Fridman
Right.
- JMJitendra Malik
Right? So that gets into issues of, you know, privacy and so on and so forth. But there are attempts in this direction, to sort of try to collect the kind of data that a child encounters in growing up. So what's the child's linguistic environment? What's the child's visual environment? If we could collect that kind of data and then develop learning schemes based on that data, that would be one way to do it. I think that's a very promising direction myself. There might be people who would argue that we could just short-circuit this in some way, and sometimes we have had success by not imitating nature in detail. So if we-
- LFLex Fridman
Right.
- JMJitendra Malik
... take the usual example of airplanes, right? We don't build flapping wings. So, yes, that's one of the points of debate. In my mind, I would bet on this learning-like-a-child approach.
- LFLex Fridman
So one
- 40:06 – 45:34
Active learning
- LFLex Fridman
of the fundamental aspects of learning like a child is the interactivity, so the child gets to play with the dataset it's learning from. (laughs)
- JMJitendra Malik
Yes.
- LFLex Fridman
So it gets to select. I mean, you can call that active learning; in the machine learning world, you can call it a lot of terms. What are your thoughts about this whole space of being able to play with the dataset and select what you're learning?
- JMJitendra Malik
Yeah. So I believe in that, and I think we could achieve it in two ways, and I think we should use both. One is actual real robotics, right? Real physical embodiments of agents who are interacting with the world. They have a physical body with dynamics and mass and moment of inertia and friction and all the rest, and the robot learns its body by doing a series of actions.
- LFLex Fridman
Mm-hmm.
- JMJitendra Malik
The second is simulation environments. I think simulation environments are getting much, much better. In my life at Facebook AI Research, our group has worked on something called Habitat, which is a simulation environment, a visually photorealistic environment of places like houses or interiors of various urban spaces and so forth. And as you move, you get a picture which is a pretty accurate picture. You can imagine that subsequent generations of these simulators will be accurate not just visually, but with respect to forces and masses and haptic interactions and so on, and then we have that environment to play with. Let me state one reason why I think this being able to act in the world is important. I think that this is one way to break the correlation-versus-causation barrier. This is something which is of a great deal of interest these days. People like Judea Pearl have talked a lot about how we are neglecting causality, and he describes the entire set of successes of deep learning as just curve fitting. Right? I don't quite agree, but, um-
- LFLex Fridman
He's a troublemaker, he is.
- JMJitendra Malik
But causality is important. Causality is not a single silver bullet, though; it's not one single principle. There are many different aspects here. One of our most reliable ways of establishing causal links, and the way, for example, the medical community does this, is randomized controlled trials. So you pick some situation, and in some situations you perform an action and for certain others you don't, right? So you have a controlled experiment. And the child is in fact performing controlled experiments all the time. Right?
- LFLex Fridman
Right, right.
- JMJitendra Malik
Okay?
- LFLex Fridman
Small scale, yeah. (laughs)
- JMJitendra Malik
In a small scale, but that is a way that the child gets to build and refine its causal models of the world. And my colleague Alison Gopnik, together with a couple of co-authors, has this book called The Scientist in the Crib, referring to children.
- LFLex Fridman
Mm-hmm.
- JMJitendra Malik
So the part that I like about that is: the scientist wants to build causal models, and the scientist does controlled experiments. And I think the child is doing that. So to enable that, we will need to have these active experiments. And I think this could be done some in the real world and some in simulation.
- LFLex Fridman
So you have hope for simulation?
- JMJitendra Malik
I have hope for simulation.
- LFLex Fridman
So that's an exciting possibility, if we can get to not just photorealistic but, what's that called, life-realistic-
- JMJitendra Malik
Yeah.
- LFLex Fridman
... simulation. So you don't see any fundamental blocks to why we can't eventually simulate the principles of what it means to exist in the world as a physical entity?
- JMJitendra Malik
No, I don't see any fundamental problems there. And look, the computer graphics community has come a long way.
- LFLex Fridman
Right.
- JMJitendra Malik
So in the early days, going back to the '80s and '90s, they were focusing on visual realism, right? They could do the easy stuff, but they couldn't do stuff like hair or fur and so on. Okay, well, they managed to do that. Then they couldn't do physical actions, right? Like there's a glass bowl and it falls down and it shatters.
- LFLex Fridman
Right.
- JMJitendra Malik
But then they could start to do pretty realistic models of that, and so on and so forth. So the graphics people have shown that they can do this forward direction, not just for optical interactions but also for physical interactions. Of course some of that is very compute-intensive, but I think by and by we will find ways of making our models ever more realistic.
- 45:34 – 52:47
From pixels to semantics
- LFLex Fridman
In one of your presentations, you break vision apart into early vision, static scene understanding, and dynamic scene understanding, and raise a few interesting questions. I thought I could just throw some at you, just to see if you wanna talk about them. So, early vision. What is it that you said? Sensation, perception, and cognition. So this is sensation.
- JMJitendra Malik
Yes.
- LFLex Fridman
What can we learn from image statistics that we don't already know? So at the lowest level, what can we make from just the statistics, the basics, the variations in the raw pixels, the textures, and so on?
- JMJitendra Malik
Mm-hmm. Yeah, so what we seem to have learned is that there's a lot of redundancy in these images, and as a result, we are able to do a lot of compression. And this compression is very important in biological settings. So you might have ten to the eight photoreceptors and only ten to the six fibers in the optic nerve, so you have to do this compression by a factor of 100 to 1. And there are analogs of that happening in our artificial neural networks.
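That 100-to-1 figure can be made concrete with a toy sketch. Everything below is illustrative, not a model of the optic nerve: the sizes, the average-pooling "code," and the perfectly redundant input are all invented to show how redundancy is what makes aggressive compression lossless.

```python
import numpy as np

# Toy "retinal image": many receptors, but locally redundant, so a
# 100x smaller code (standing in for the optic nerve) carries everything.
n_receptors, n_fibers = 10**4, 10**2   # same 100:1 ratio as 10^8 -> 10^6
rng = np.random.default_rng(0)
signal = np.repeat(rng.standard_normal(n_fibers), n_receptors // n_fibers)

# Compress: average each block of 100 identical receptors into one value.
code = signal.reshape(n_fibers, -1).mean(axis=1)
# Decompress: expand each code value back over its block.
restored = np.repeat(code, n_receptors // n_fibers)

print(signal.size // code.size)        # -> 100
print(np.allclose(restored, signal))   # -> True: the input was pure redundancy
```

With a real image the blocks are only approximately constant, so the same scheme becomes lossy; the point is that the achievable ratio tracks the redundancy in the signal.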
- LFLex Fridman
That's the early layers. So you think-
- JMJitendra Malik
At the early layers.
- LFLex Fridman
... there's a lot of compression that can be done in the beginning?
- JMJitendra Malik
Yeah.
- LFLex Fridman
Just, just the statistics?
- JMJitendra Malik
Yeah.
- LFLex Fridman
Um, how much? How much (laughs)
- JMJitendra Malik
Well, the way to think about it is just: how successful is image compression, right? And we-
- LFLex Fridman
Right.
- JMJitendra Malik
That's been done with older technologies, but there are several companies which are trying to use these more advanced neural-network-type techniques for compression, both for static images as well as for video. One of my former students has a company which is trying to do stuff like this, and I think they are showing quite interesting results. And that success is really about image statistics and video statistics.
- LFLex Fridman
But that's still not doing compression of the kind when I see a picture of a cat, all I have to say is it's, it's a cat.
- JMJitendra Malik
Yeah.
- LFLex Fridman
That's a different, semantic kind of compression.
- JMJitendra Malik
Yeah, yeah. So this is at the lower level, right? As I said, that's focusing on low-level statistics.
- LFLex Fridman
So to linger on that for a little bit: you mention, how far can bottom-up image segmentation go? And you mentioned that the central question for scene understanding is the interplay of bottom-up and top-down information. Maybe this is a good time to elaborate on that. Maybe define what is bottom-up and what is top-down in the context of computer vision.
- JMJitendra Malik
Yeah, right. So today what we have are very interesting systems, because they work completely bottom-up. However, they're trained-
- LFLex Fridman
What does bottom-up mean? Sorry.
- JMJitendra Malik
So bottom-up, in this case, means a feedforward neural network. So our current-
- LFLex Fridman
So starting from the raw pixels, trying to-
- JMJitendra Malik
Yeah. They start from the raw pixels and they end up with something like cat or not a cat, right? So our systems are running totally feedforward. They're trained in a very top-down way: they're trained by saying, "Okay, this is a cat. This is a cat. This is a dog. This is a zebra," et cetera. And I'm not happy with either of these choices fully, because we have completely separated these processes. So what do we know compared to biology? In biology, what we know is that at test time, at run time, those processes are not purely feedforward but involve feedback. And they involve much shallower neural networks. The kinds of neural networks we are using in computer vision, say a ResNet-50, have 50 layers. Well, in the brain, in the visual cortex, going from the retina to IT, maybe we have like seven.
- LFLex Fridman
Mm-hmm.
- JMJitendra Malik
Right? So they are far shallower, but we have the possibility of feedback. There are backward connections, and this might enable us to deal with more ambiguous stimuli, for example. So the biological solution seems to involve feedback. The solution in artificial vision seems to be just feedforward but with a much deeper network, and the two are functionally equivalent, because if you have a feedback network which just has, like, three rounds of feedback, you can unroll it, make it three times the depth, and create it in a totally feedforward way. I mean, we have written some papers on this theme, but I really feel that this theme should be pursued further and-
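That unrolling equivalence can be sketched in a few lines of numpy. The weights, sizes, and layer structure here are made up for illustration: a weight-tied feedback loop run for k rounds computes exactly what a k-times-deeper feedforward stack of tied layers computes.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

# Hypothetical weights for one "cortical" layer: a feedforward projection
# and a feedback (recurrent) connection that re-enters the same layer.
W_ff = 0.1 * rng.standard_normal((16, 8))
W_fb = 0.1 * rng.standard_normal((16, 16))

def feedback_net(x, rounds=3):
    """Shallow network that refines its state via feedback connections."""
    h = relu(W_ff @ x)
    for _ in range(rounds):
        h = relu(W_ff @ x + W_fb @ h)   # state loops back into the layer
    return h

def unrolled_net(x, rounds=3):
    """Purely feedforward: the loop is replaced by a stack of layers that
    all share (are tied to) the same weights: 3 rounds -> 3x the depth."""
    layers = [(W_ff, W_fb)] * rounds
    h = relu(W_ff @ x)
    for ff, fb in layers:               # one pass through the stack, no feedback
        h = relu(ff @ x + fb @ h)
    return h

x = rng.standard_normal(8)
print(np.allclose(feedback_net(x), unrolled_net(x)))  # -> True
```

The unrolled version trades depth for time: same computation, same parameters, but expressed as a deeper network that runs in a single forward pass.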
- LFLex Fridman
Have some kind of recurrence mechanisms?
- JMJitendra Malik
Yeah. Okay. So I want to have a little bit more top-down at test time. Then, at training time, we make use of a lot of top-down knowledge right now. Basically, to learn to segment an object, we have to have all these examples: this is the boundary of a cat, this is the boundary of a chair, this is the boundary of a horse, and so on. And this is too much top-down knowledge. How do humans do this? We manage with far less supervision, and we do it in a sort of bottom-up way. For example, we're looking at a video stream and the horse moves, and that enables me to say that all these pixels are together.
- LFLex Fridman
Yeah.
- 52:47 – 57:05
Semantic segmentation
- LFLex Fridman
(inhales deeply) Okay. So then maybe taking a step into segmentation and static scene understanding: what is the interaction between segmentation and recognition? You mentioned the movement of objects. So for people who don't know computer vision, segmentation is this weird activity that computer vision folks have all agreed is very important, of drawing outlines around objects-
- JMJitendra Malik
Yeah.
- LFLex Fridman
... versus a bounding box, and then classifying that object. What's the value of segmentation? What is it as a problem in computer vision? How is it fundamentally different from detection and recognition and the other problems?
- JMJitendra Malik
Yeah. So segmentation enables us to say that some set of pixels are an object, without necessarily even being able to name that object or knowing properties of that object.
- LFLex Fridman
Oh, so you mean segmentation purely as, uh, as, as the act of separating an object fr-
- JMJitendra Malik
From its background.
- LFLex Fridman
... a blob that's united in some way-
- JMJitendra Malik
Yeah.
- LFLex Fridman
... from its background. Yeah.
- JMJitendra Malik
Yeah. So entitification, if you will.
- LFLex Fridman
Mm-hmm. Got it.
- JMJitendra Malik
Making an entity out of it.
- LFLex Fridman
Entitification.
- JMJitendra Malik
Yeah.
- LFLex Fridman
Beautifully... beautiful term. (laughs)
- JMJitendra Malik
So, yeah, I think that we have that capability, and that enables us, as we are growing up, to acquire names of objects with very little supervision. Let's posit that the child has this ability to separate out objects in the world. Then when the mother says, "Pick up your bottle," or, "The cat's behaving funny today," the word cat suggests some object, and then the child sort of does a mapping.
- LFLex Fridman
Right.
- JMJitendra Malik
Right? The mother doesn't have to teach specific object labels by pointing to them. Weak supervision works in the context that you have the ability to create objects. So to me, that's a very fundamental capability. There are applications where this is very important, for example, medical diagnosis. In medical diagnosis, you have some brain scan. This is some work that we did in my group, where you have CT scans of people who have had traumatic brain injury, and what the radiologist needs to do is to precisely delineate various places where there might be bleeds, for example.
- LFLex Fridman
Right.
- JMJitendra Malik
And- and there's- there are clear needs like that. So there are certainly very practical applications of computer vision where segmentation is necessary. But philosophically, segmentation enables the task of recognition to proceed with much weaker supervision than we require today.
- LFLex Fridman
And you think of segmentation as this kinda task that takes on a visual scene and breaks it apart into- into interesting entities-
- JMJitendra Malik
Yeah.
- LFLex Fridman
... that might be useful for whatever the task is.
- JMJitendra Malik
Yeah. And it is not semantics-free. It blends into, it involves, perception and cognition. I think the mistake that we used to make in the early days of computer vision was to treat it as a purely bottom-up perceptual task. It is not just that, because we do revise our notion of segmentation with more experience, right? For example, there are objects which are non-rigid, like animals or humans, and I think understanding that all the pixels of a human are one entity is actually quite a challenge.
- LFLex Fridman
Mm-hmm.
- JMJitendra Malik
Because the parts of the human, they can move independently, and the human wears clothes so they might be differently colored. So it's all sort of a challenge.
- 57:05 – 1:02:52
The three R's of computer vision
- LFLex Fridman
You mention the three Rs of computer vision: recognition, reconstruction, and reorganization. Can you describe these three Rs-
- JMJitendra Malik
Sure.
- LFLex Fridman
... and how they interact?
- JMJitendra Malik
Yeah. So recognition is the easiest one, because that's what I think people generally think of computer vision as achieving these days, which is labels. So is this a cat? Is this a dog? Is this a Chihuahua? It could be very fine-grained, like a specific breed of dog or a specific species of bird, or it could be very abstract, like animal.
- LFLex Fridman
But given a part of an image or a whole image, say, put a label on that.
- JMJitendra Malik
Yeah.
- LFLex Fridman
That's recognition.
- JMJitendra Malik
So that's recognition. Reconstruction you can essentially think of as inverse graphics; that's one way to think about it. So graphics is: you have some internal computer representation.
- LFLex Fridman
Mm-hmm.
- JMJitendra Malik
You have a computer representation of some objects arranged in a scene, and what you do is you produce a picture, the pixels corresponding to a rendering of that scene. So let's do the inverse of this. We are given an image, and we say, "Oh, this image arises from some objects in a scene, looked at with a camera from this viewpoint." And we might have more information about the objects, like their shape, maybe their textures, maybe color, et cetera, et cetera. So that's the reconstruction problem. In a way, you are, in your head, creating a model of the external world.
- LFLex Fridman
Right.
- JMJitendra Malik
Okay, reorganization is to do with essentially finding these entities. The word organization implies structure. In perception, in psychology, we use the term perceptual organization: an image is not internally represented as just a collection of pixels, but we make these entities. We create these entities, objects, whatever you wanna call them.
- LFLex Fridman
And the relationship between the entities as well or is it purely about the entities?
- JMJitendra Malik
It could be about the relationships, but mainly we focus on the fact that there are entities. Okay.
- LFLex Fridman
So I'm trying to- I'm trying to pinpoint what the organization means.
- JMJitendra Malik
So organization is that instead of like a uniform grid, we have the structure of objects.
- LFLex Fridman
So s- segmentation is the small part of that?
- JMJitendra Malik
So segmentation-
- LFLex Fridman
It's-
- JMJitendra Malik
... gets us going towards that.
- LFLex Fridman
Yeah. And you kinda have this triangle where they all interact together.
- JMJitendra Malik
Yes.
- LFLex Fridman
So how do you see that interaction? Reorganization is, yes, defining the entities in the world, recognition is labeling those entities, and then reconstruction is what, filling in the gaps?
- JMJitendra Malik
Well, for example, to impute some 3D objects corresponding to each of these entities; that would be part of it as well.
- LFLex Fridman
So adding more information that's not there in the raw data.
- JMJitendra Malik
Correct. I mean, I started pushing this kind of a view around 2010 or something like that, because at that time in computer vision, people were just working on many different problems, but they treated each of them as a separate, isolated problem-
- LFLex Fridman
Yeah. Right.
- JMJitendra Malik
... each with its own dataset, and then you try to solve that and get good numbers on it. I didn't like that approach, because I wanted to see the connection between these problems. And if people divided up vision into various modules, the way they would do it is as low-level, mid-level, and high-level vision, corresponding roughly to the psychologist's notions of sensation, perception, and cognition. And that didn't map to tasks that people cared about. So therefore, I tried to promote this particular framework as a way of considering the problems that people in computer vision were actually working on, and trying to be more explicit about the fact that they actually are connected to each other. At that time, I was just doing this on the basis of information flow. Now, it turns out, in the last five years or so, in the post-deep-learning revolution, that this architecture has turned out to be very conducive to that, because basically, in these neural networks, we are trying to build multiple representations. There can be multiple output heads sharing common representations. So in a certain sense, today, given the reality of what solutions people have to this, I do not need to preach this anymore.
- LFLex Fridman
(laughs)
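A rough sketch of that shared-representation, multiple-output-heads idea, with the three Rs as the heads. The layer sizes and head names below are invented for illustration, not taken from any real system:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

# One shared "trunk" computes a common representation of the input image.
W_trunk = 0.1 * rng.standard_normal((32, 64))

# Task-specific output heads, one per "R", all reading the same features.
heads = {
    "recognition":    0.1 * rng.standard_normal((10, 32)),  # class scores
    "reconstruction": 0.1 * rng.standard_normal((5, 32)),   # e.g. a pose/depth code
    "reorganization": 0.1 * rng.standard_normal((64, 32)),  # per-pixel grouping map
}

def forward(image_vec):
    shared = relu(W_trunk @ image_vec)   # computed once, shared by every head
    return {task: W @ shared for task, W in heads.items()}

out = forward(rng.standard_normal(64))
print(sorted(out))  # -> ['recognition', 'reconstruction', 'reorganization']
```

The design choice the sketch embodies: the expensive representation is computed once, and the three tasks stay connected because improving the trunk for one head improves the input to all of them.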
- 1:02:52 – 1:04:24
End-to-end learning in computer vision
- LFLex Fridman
So speaking of neural networks: how much of this problem of computer vision, of reorganization, recognition, reconstruction, can be learned end-to-end, do you think? Sort of (laughs) set it and forget it, just plug and play, have a giant dataset, perhaps multiple, multimodal, and then just learn the entirety of it?
- JMJitendra Malik
Well, I think that what end-to-end learning currently means is end-to-end supervised learning.
- LFLex Fridman
Right.
- JMJitendra Malik
And that, I would argue, is too narrow a view of the problem. I like this child-development view, this lifelong-learning view, where there are certain capabilities that are built up, and then there are certain capabilities which are built up on top of those. That's what I believe in. So end-to-end learning in the supervised setting, for a very precise task, to me is sort of a limited view of the learning process.
- LFLex Fridman
Got it. So if we think about beyond purely supervised, look at, back to children,
- 1:04:24 – 1:08:36
6 lessons we can learn from children
- LFLex Fridman
you mentioned six lessons that we can learn from children: be multimodal, be incremental, be physical, explore, be social, use language. Can you speak to these? Perhaps picking one that you find most fundamental to our-
- JMJitendra Malik
Yeah.
- LFLex Fridman
... time today?
- JMJitendra Malik
Yeah, so, to give due credit, this is from a paper by Smith and Gasser, and it reflects essentially, I would say, common wisdom among child-development people. It's just that this is not common wisdom among people in-
- LFLex Fridman
(laughs)
- JMJitendra Malik
... computer vision and AI and machine learning.
- LFLex Fridman
Yeah.
- JMJitendra Malik
So I view my role as trying to spread-
- LFLex Fridman
Bridge the two worlds?
- JMJitendra Malik
Bridge the two worlds.
- LFLex Fridman
(laughs)
- JMJitendra Malik
So let's take the example of multimodal. I like that. A canonical example is a child interacting with an object: the child holds a ball and plays with it. At that point, it's getting a touch signal. The touch signal is giving it a notion of 3D shape, but it is sparse. And then the child is also seeing a visual signal, right? And these two are in totally different spaces. One is the space of receptors on the skin of the fingers and the thumb and the palm; these map onto neuronal fibers getting activated somewhere, leading to some activation in somatosensory cortex. A similar thing would happen if we have a robot hand. And then we have the pixels corresponding to the visual view. But we know that they correspond to the same object, right? So that's a very, very strong cross-calibration signal. And it is self-supervisory, which is beautiful, right?
- LFLex Fridman
Mm-hmm.
- JMJitendra Malik
There's nobody assigning a label. The mother doesn't have to come and assign a label. The child doesn't even have to know that this object is called a ball. But the child is learning something about the three-dimensional world from this signal. On tactile and visual, there is some work. There is a lot of work currently on audio and visual.
- LFLex Fridman
... right?
- JMJitendra Malik
Yeah. Okay. And audiovisual: so there is some event that happens in the world, and that event has a visual signature and it has an auditory signature. So there is this glass bowl on the table, and it falls and breaks, and I hear the smashing sound and I see the pieces of glass. Okay, I build that connection between the two, right? I mean, this has become a hot topic in computer vision in the last couple of years. There are problems like separating out multiple speakers, right?
- LFLex Fridman
Right.
- JMJitendra Malik
Which was a classic problem in audition. They call this the problem of source separation, or the cocktail party effect, and so on. But try to do it when you also have the visual signal: it becomes so much easier and so much more useful.
- LFLex Fridman
(laughs) So the multimodal... I mean, there's so much more signal with multimodal, and you can use that for some kind of weak supervision as well.
- JMJitendra Malik
Yes. Because they are occurring at the same moment in time.
- LFLex Fridman
Yeah.
- JMJitendra Malik
So you have time which links the two, right? So at a certain moment, T1, you got a certain signal in the auditory domain and a certain signal in the visual domain, but they must be causally related.
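That time-based linkage is what cross-modal self-supervised methods exploit. A toy numpy sketch, where the synthetic "event" streams and the cosine scoring are invented purely for illustration: features from the two modalities at the same moment agree far more than features from different moments, with no labels anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 200, 8

# Synthetic synchronized streams: a shared underlying event drives both
# modalities, plus modality-specific noise. No labels are used.
event = rng.standard_normal((T, d))
audio = event + 0.2 * rng.standard_normal((T, d))
visual = event + 0.2 * rng.standard_normal((T, d))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Same-time (positive) pairs vs. shifted-time (negative) pairs.
positives = np.mean([cosine(audio[t], visual[t]) for t in range(T)])
negatives = np.mean([cosine(audio[t], visual[(t + 17) % T]) for t in range(T)])

print(positives > negatives)  # -> True: co-occurrence itself is the supervision
```

A contrastive learner turns exactly this gap into a training objective: pull same-time audio/visual features together, push different-time pairs apart.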
- LFLex Fridman
That's an exciting area, not well-studied yet, not that-
- JMJitendra Malik
Yeah. I mean, we have a-
- LFLex Fridman
... if you look at the s-
- JMJitendra Malik
... little bit of work on this, but so much more needs to be done.
- LFLex Fridman
Yeah.
- JMJitendra Malik
So this is a good example. Be physical, that's to do with something we talked about-
- LFLex Fridman
Yes.
- JMJitendra Malik
... earlier that there's a, there's a embodied world.
- 1:08:36 – 1:12:30
Vision and language
- LFLex Fridman
You mention use language. So, Noam Chomsky believes that language may be at the core of cognition, at the core of everything in the human mind. What is the connection between language and vision to you? Like, what's more fundamental? Are they neighbors? Is one the parent and the other the child, the chicken and the egg?
- JMJitendra Malik
Oh, it's very clear. It is vision which is the parent.
- LFLex Fridman
The fundament- the parent. (laughs)
- JMJitendra Malik
Vision's the fundamental ability. Okay.
- LFLex Fridman
Wait, wait, wait. (laughs)
- JMJitendra Malik
So, so, uh-
- LFLex Fridman
It comes before... You think vision is more fundamental than language?
- JMJitendra Malik
Correct. And you can think of it either in phylogeny or in ontogeny. Phylogeny means looking in evolutionary time. So we have vision that developed 500 million years ago. Then, when we get to maybe five million years ago, you have the first bipedal primate. When we started to walk, the hand became free, and so then manipulation, the ability to manipulate objects and build tools, and so on and so forth, so I'm not-
- LFLex Fridman
Wait, you said 500,000 years ago?
- JMJitendra Malik
No, no, sorry.
- LFLex Fridman
Oh.
- JMJitendra Malik
The first multicellular animals which you could say had some intelligence arose 500 million years ago.
- LFLex Fridman
Million.
- JMJitendra Malik
Okay. And now let's fast-forward to, say, the last seven million years, which is the development of the hominid line, right? Where-
- LFLex Fridman
Mm-hmm.
- JMJitendra Malik
... from the other primates we have the branch which leads on to modern humans. Now, there were many of these hominids, but the one people talk about is Lucy, because that's a skeleton from three million years ago, and we know that Lucy walked. So at this stage, the hand is free for manipulating objects, and then the ability to manipulate objects, build tools... and the brain size grew in this era. So now you have manipulation. Now, we don't know exactly when language arose.
- LFLex Fridman
But after that.
- JMJitendra Malik
But after that.
- LFLex Fridman
(inhales deeply)
- JMJitendra Malik
Because no apes have it. I mean, Chomsky is correct in that it is a uniquely human capability-
- LFLex Fridman
Right.
- JMJitendra Malik
... and other primates don't have that. So it developed somewhere in this era, but I would argue that it probably developed after we had the stage of-
- LFLex Fridman
Mm-hmm.
- JMJitendra Malik
... the human species already able to manipulate, with hands free and a much bigger brain size.
- LFLex Fridman
And for that, a lot of vision had to have already developed.
- JMJitendra Malik
Yeah.
- LFLex Fridman
So the sensation and the perception, maybe some of the cognition.
- JMJitendra Malik
Yeah. So these ancestors of ours, three or four million years ago, had spatial intelligence. They knew that the world consists of objects, they knew that the objects were in certain relationships to each other, they had observed causal interactions among objects, and they could move in space, so they had space and time and all of that. So language builds on that substrate. All human languages have constructs which depend on a notion of space and time. Where did that notion of space and time come from? It had to come from perception and action in the world
- 1:12:30 – 1:16:17
Turing test
- JMJitendra Malik
we live in.
- LFLex Fridman
Yeah. What you refer to as the spatial intelligence.
- JMJitendra Malik
Yeah.
- LFLex Fridman
Yeah. So to linger a little bit: we mentioned Turing and his suggestion that we should learn from children. Nevertheless, language is the fundamental piece of the test of intelligence that Turing proposed.
- JMJitendra Malik
Yes.
- LFLex Fridman
What do you think is a good test of intelligence? What would impress the heck out of you? Is it fundamentally natural language, or is there something in vision?
- JMJitendra Malik
I don't think we should create a single test of intelligence. Just as I don't believe in IQ as a single number, I think there can be many capabilities, which are perhaps correlated. So there will be accomplishments which are visual accomplishments, accomplishments in manipulation or robotics, and accomplishments in language. I do believe that language will be the hardest nut to crack.
- LFLex Fridman
Really?
- JMJitendra Malik
Yeah.
- LFLex Fridman
So what's harder: passing the spirit of the Turing test — whatever formulation makes it convincingly natural language, like somebody you would wanna have a beer with, hang out and have a chat with — or general natural scene understanding? You think language is-
- JMJitendra Malik
But I think-
- LFLex Fridman
... is the tougher problem?
- JMJitendra Malik
... I'm not a fan of that. I think Turing, as he proposed the test in 1950, was trying to solve a certain problem.
- LFLex Fridman
Yeah, imitation was the thing.
- JMJitendra Malik
Yeah. And I think it made a lot of sense then. Where we are today, 70 years later, I think we should not worry about that. The Turing test is no longer the right way to channel research in AI, because it takes us down the path of a chatbot which can fool us for five minutes or whatever, okay? I would rather have a list of 10 different tasks: tasks in the manipulation domain, tasks in navigation, tasks in visual scene understanding, tasks in reading a story and answering questions based on it. My favorite language understanding task would be, you know, reading a novel and being able to answer arbitrary questions from it, okay?
- LFLex Fridman
Right.
- JMJitendra Malik
I think that's where we need to be going — and this is not an exhaustive list by any means. And on each of these axes there's a fair amount of work to be done.
- LFLex Fridman
So on the visual understanding side, in this intelligence Olympics that we've set up-
- JMJitendra Malik
Yeah.
- LFLex Fridman
... what's a good test — one of many — of visual scene understanding?
- JMJitendra Malik
I think-
- LFLex Fridman
Do you think such benchmarks exist? Sorry to interrupt.
- JMJitendra Malik
No, there aren't any. I think, essentially, to me: a really good aid to the blind. So suppose there was a blind person-
- LFLex Fridman
Mm-hmm.
- JMJitendra Malik
... and I needed to assist the blind person.
- LFLex Fridman
So ultimately, like we said, vision that aids in action and survival in this world.
- JMJitendra Malik
Yeah.
- LFLex Fridman
Maybe in a simulated world. (laughs)
- JMJitendra Malik
Maybe easier to measure-
- LFLex Fridman
Right.
- 1:16:17 – 1:24:49
Open problems in computer vision
- LFLex Fridman
So David Hilbert in 1900 proposed 23 open problems in mathematics, some of which are still unsolved, the most important and famous of which is probably the Riemann hypothesis. You've thought about and presented the Hilbert problems of computer vision, so let me ask — what do you today... I don't know when you last presented it-
- JMJitendra Malik
Yeah.
- LFLex Fridman
... 2015, but versions of it.
- JMJitendra Malik
Yeah.
- LFLex Fridman
You're kind of the face and the spokesperson for computer vision, so (laughs) it's your job to state what the open problems are for the field. So what today are the Hilbert problems of computer vision, do you think?
- JMJitendra Malik
Let me pick one which I regard as clearly unsolved, which is what I would call long-form video understanding. So we have a video clip and we want to understand the behavior in there in terms of agents, their goals, intentionality, and make predictions about what might happen. That kind of understanding goes beyond atomic visual actions. In the short range, the question is: are you sitting, are you standing, are you catching a ball, right? That we can do now. Even if we can't do it fully accurately — if we can do it at 50%, maybe next year we'll do it at 65 and so forth. But long-range video understanding I don't think we can do well today.
- LFLex Fridman
And that means, so long-
- JMJitendra Malik
And it blends into cognition, that's the reason-
- LFLex Fridman
Right.
- JMJitendra Malik
... why it's challenging.
- LFLex Fridman
And so you have to understand the entities, you have to track them, and you have to have some kind of model of their behavior.
- JMJitendra Malik
Correct. And then their behavior — these are agents, not just passive objects, so they would exhibit goal-directed behavior. Okay, so this is one area. Then I will talk about, say, understanding the world in 3D. Now, this may seem paradoxical, because in a way we were able to do 3D understanding even 30 years ago, right? But I don't think we currently have the richness of 3D understanding in our computer vision systems that we would like. So let me elaborate on that a bit. Currently we have two kinds of techniques which are not fully unified. There are the techniques from multi-view geometry: you have multiple pictures of a scene and you do a reconstruction using stereoscopic vision or structure from motion. But these techniques totally fail if you just have a single view-
- LFLex Fridman
Right.
- JMJitendra Malik
... because they are relying on multiple-view geometry. Okay, then we have some techniques, developed in the computer vision community, which try to guess 3D from single views. These techniques are based on supervised learning — on having 3D models of objects available at training time.
- LFLex Fridman
Right.
- JMJitendra Malik
And this is completely unnatural supervision, right? CAD models are not injected into your brain.
- LFLex Fridman
(laughs) Yes.
- JMJitendra Malik
Okay, so what would I like? What I would like is a kind of learning-as-you-move-around-the-world notion of 3D.
- LFLex Fridman
Wow. Yeah.
- JMJitendra Malik
So we have a succession of visual experiences, and as part of that I might see a chair from different viewpoints, or a table from different viewpoints, and so on. That enables me to build some internal representation. Then next time I see just a single photograph — and it may not even be of that chair, it's of some other chair — and I have a guess of what its 3D shape is like.
- LFLex Fridman
So you're almost learning the CAD model, kind of, uh-
- JMJitendra Malik
Yeah, implicitly. I mean-
- LFLex Fridman
Implicitly.
- JMJitendra Malik
I mean, the CAD model need not be in the same form as used by computer graphics programs.
- LFLex Fridman
It's hidden, hidden in the representation somehow.
- JMJitendra Malik
It's hidden in the representation, the ability to predict new views, and what I would see if I went to such and such position.
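The classical multi-view technique Malik contrasts with learned single-view 3D can be sketched in a few lines. This is a minimal illustration (not from the episode; the cameras and point are made up): given two known 3x4 camera matrices, linear (DLT) triangulation recovers a 3D point from its two image projections by solving a homogeneous linear system.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation: recover a 3D point X from its
    projections x1, x2 in two views with known camera matrices P1, P2.
    Each projection contributes two linear constraints on homogeneous X;
    stack them and take the null vector of A via SVD."""
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenize

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Two toy cameras: one at the origin, one translated along x (a stereo pair).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.3, -0.2, 4.0])

X_hat = triangulate(P1, P2, project(P1, X_true), project(P2, X_true))
print(np.allclose(X_hat, X_true))  # noise-free data recovers the point
```

Dropping the two rows contributed by the second camera leaves a two-dimensional null space: a single view pins down only the ray through the point, not its depth — which is exactly why single-view reconstruction needs learned priors instead.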
- LFLex Fridman
By the way, on a small tangent on that: are you okay or comfortable with neural networks that do achieve visual understanding — that do, for example, achieve this kind of 3D understanding — when you don't know how, when you're not able to introspect, visualize, or interact with the representation? The fact that they are not, or may not be, explainable?
- JMJitendra Malik
Yeah, I think that's fine.
- LFLex Fridman
(laughs)
- JMJitendra Malik
To me that is... so let me put some caveats on that. It depends on the setting. First of all, I think we humans are not explainable. So-
- 1:24:49 – 1:35:47
AGI
- LFLex Fridman
Do you think we will ever build a system of human-level or superhuman-level intelligence? We've kind of defined what it takes to try to approach that, but do you think that's within our reach — the thing that Turing thought we could actually do by the year 2000? Do you think we'll ever be able to do it?
- JMJitendra Malik
Yeah. So I think there are two answers here. One answer is: in principle, can we do this at some time? And my answer is yes. The second answer is a pragmatic one: do I think we will be able to do it in the next 20 years or whatever? And to that, my answer is no. And of course that's a wild guess.
- LFLex Fridman
Yeah, of course.
- JMJitendra Malik
I think that, you know, Donald Rumsfeld is not a favorite person of mine, but one of his lines was very good — the one about known knowns, known unknowns, and unknown unknowns. In the business we are in, there are known unknowns and there are unknown unknowns. With respect to a lot of what's the case in vision and robotics, I feel we have known unknowns: I have a sense of where we need to go and of what the problems that need to be solved are. With respect to natural language understanding and high-level cognition, it's not just known unknowns but also unknown unknowns, so it is very difficult to put any kind of timeframe to that.
- LFLex Fridman
(laughs) Do you think some of the unknown unknowns might be positive, in that they'll surprise us and make the job much easier — fundamental breakthroughs?
- JMJitendra Malik
Yeah, I think that is possible, because I have certainly been very positively surprised by how effective these deep learning systems have been — I certainly would not have believed that in 2010. What we knew from the mathematical theory was that convex optimization works: when there's a single global optimum, these gradient descent techniques work. Now, these are nonlinear, non-convex systems.
- LFLex Fridman
Huge number of variables, so over-parameterized, yeah.
- JMJitendra Malik
And over-parameterized. And the people who used to play with them a lot — the ones who were totally immersed in the lore and the black magic — they knew that they worked well, even though they were-
- LFLex Fridman
Really? I thought, like, everybody w-
- JMJitendra Malik
No, the claim that I hear from my friends like Yann LeCun and so forth is-
- LFLex Fridman
Oh, now, yeah. (laughs)
- JMJitendra Malik
... that they feel that they were comfortable with them.
- LFLex Fridman
Well, he says that now, yeah.
- JMJitendra Malik
But the community as a whole was certainly not. To me, that was the surprise: that they actually worked robustly for a wide range of problems, from a wide range of initializations, and so on. That was certainly more rapid progress than we expected. But there are certainly lots of times — in fact, most of the history of AI — when we have made progress at a slower rate than we expected. So we just keep going. What I regard as really unwarranted are these fears of, you know, AGI in 10 years or 20 years and that kind of stuff, because that's based on completely unrealistic models of how rapidly we will make progress in this field.
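The optimization surprise described above can be illustrated on a toy non-convex loss. This is a hypothetical sketch, not anything from the episode: the theory guarantees convergence to the global optimum only for convex problems, yet plain gradient descent from many random initializations still reliably finds a global minimum of this simple non-convex function.

```python
import numpy as np

# A simple non-convex loss: f(w) = (w^2 - 1)^2 has two global minima at
# w = +/-1 and a local maximum at w = 0. Plain gradient descent from random
# starts still lands in a global minimum -- a toy analogue of the empirical
# observation that gradient methods handle non-convex deep-learning losses.
def f(w):
    return (w**2 - 1)**2

def grad(w):
    return 4 * w * (w**2 - 1)

rng = np.random.default_rng(0)
final_losses = []
for _ in range(20):
    w = rng.uniform(-2, 2)      # random initialization
    for _ in range(500):
        w -= 0.01 * grad(w)     # gradient descent step
    final_losses.append(f(w))

print(max(final_losses))  # every run ends near zero loss
```

The one initialization that fails is w = 0 exactly (a stationary point), which random starts essentially never hit; high-dimensional deep-learning losses are far more complicated, but the empirical story Malik describes is similar in spirit.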
- LFLex Fridman
So I agree with you, but I've also gotten the chance to interact with very smart people who really worry about the existential threats of AI, and as an open-minded person I'm sort of taking it in. Do you think AI systems — in some way we don't quite understand, the unknown unknowns, not superintelligent AI — will have a detrimental effect on society? Do you think this is something we should be worried about? Or do we need to first allow the unknown unknowns to become known unknowns?
- JMJitendra Malik
I think we need to be worried about AI today. It is not just a worry we need to have when we get to AGI. AI is being used in many systems today, and there might be settings, for example, where it causes biases or makes decisions which could be harmful — decisions which could be unfair to some people — or it could be a self-driving car which kills a pedestrian. AI systems are being deployed today, right? And they are being deployed in many different settings: maybe in medical diagnosis, maybe in a self-driving car, maybe in selecting applicants for an interview.
- LFLex Fridman
Mm-hmm.
- JMJitendra Malik
So I would argue that when these systems make mistakes, there are consequences, and we are in a certain sense responsible for those consequences. I would argue that this is a continuous effort.
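The kind of continuous checking Malik describes can start with something as simple as auditing decision rates across groups. The sketch below is entirely hypothetical (synthetic data, an invented score-threshold decision rule) and shows one common fairness metric, the demographic-parity difference; real auditing of deployed systems involves far more than this.

```python
import numpy as np

# Hypothetical audit: does an automated decision rule accept the two
# groups at noticeably different rates?
rng = np.random.default_rng(1)
group = rng.integers(0, 2, size=1000)            # 0/1 protected attribute
score = rng.normal(loc=0.1 * group, size=1000)   # model scores, shifted by group
accepted = score > 0.0                           # decision rule under audit

rate_0 = accepted[group == 0].mean()             # acceptance rate, group 0
rate_1 = accepted[group == 1].mean()             # acceptance rate, group 1
disparity = abs(rate_1 - rate_0)                 # demographic parity difference
print(round(disparity, 3))
```

Run periodically on production decisions, a metric like this turns "we need to be continuously on the lookout" into a concrete monitoring signal, though which fairness criterion is appropriate is itself a judgment call.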
- LFLex Fridman
Hmm.
- JMJitendra Malik
And this is something that in a way is not so surprising. It's true of all engineering and scientific progress: with great power comes great responsibility. So as these systems are deployed, we have to worry about them, and it's a continuous problem. I don't think of it as something which will suddenly happen on some day in 2079, for which I need to design some clever trick. I am saying that these problems exist today-
- LFLex Fridman
Yeah.
- JMJitendra Malik
... and we need to be continuously on the lookout for safety, biases, risks, right? I mean, if a self-driving car kills a pedestrian — and they have, right?
- LFLex Fridman
Mm-hmm.
- JMJitendra Malik
I mean, this Uber incident in Arizona.
- LFLex Fridman
Yeah.
- JMJitendra Malik
Right? It has happened, right? This is not about AGI. In fact, it's about a very dumb intelligence-
- LFLex Fridman
Well-
- JMJitendra Malik
... which is still killing people.
- LFLex Fridman
... the worry people have with AGI is the scale. But I think you're 100% right: the thing that worries me about AI today, and it's happening at a huge scale, is recommender systems. If you look at Twitter or Facebook or YouTube, they're controlling the ideas that we have access to, the news and so on, and there's fundamentally a machine learning algorithm behind each of these recommendations.
- JMJitendra Malik
Mm-hmm.
- 1:35:47 – 1:37:41
Pick the right problem
- JMJitendra Malik
yeah.
- LFLex Fridman
So, lest I forget to mention: you've also mentored some of the biggest names of computer vision-
- JMJitendra Malik
Yeah.
- LFLex Fridman
... computer science, and AI today. There are so many questions I could ask, but really: how did you do it? What does it take to be a good mentor? What does it take to be a good guide?
- JMJitendra Malik
Yeah, I think I've been lucky to have had very smart, hardworking, and creative students. I think some part of the credit just belongs to being at Berkeley.
- LFLex Fridman
(laughs)
- JMJitendra Malik
I think those of us who are at top universities are blessed, because we have very smart and capable students coming and knocking on our door. So I have to be humble enough to acknowledge that. But what have I added? I think I have added something. What I have added — what I've always tried to teach them — is a sense of picking the right problems.
- LFLex Fridman
Mm.
- JMJitendra Malik
I think that in science, in the short run, success is always based on technical competence — you're quick with math, or whatever; there are certain technical capabilities which make for short-range progress. Long-range progress is really determined by asking the right questions and focusing on the right problems, and I feel that what I've been able to bring to the table in advising these students is some sense of taste: what are good problems, what are problems worth attacking now as opposed to waiting 10 years.
Episode duration: 1:41:36
Transcript of episode LRYkH-fAVGE