Yann LeCun: Meta AI, Open Source, Limits of LLMs, AGI & the Future of AI | Lex Fridman Podcast #416
EVERY SPOKEN WORD
150 min read · 30,009 words
- 0:00 – 2:18
Introduction
- YLYann LeCun
I see the danger of this concentration of power through, uh, through proprietary AI systems as a much bigger danger than everything else. What works against this is people who think that for reasons of security, we should keep AI systems under lock and key, because it's too dangerous to put it in the hands of- of everybody. That would lead to a very bad future (laughs) in which all of our information diet is controlled by a small number of, uh, uh, companies' proprietary systems.
- LFLex Fridman
I believe that people are fundamentally good, and so if AI, especially open source AI, can, um, make them smarter, it just empowers the goodness in humans.
- YLYann LeCun
So I sh- I share that feeling, okay? I think people are fundamentally good. (laughs) Uh, and in fact a lot of doomers are doomers because they don't think that people are fundamentally good.
- LFLex Fridman
The following is a conversation with Yann LeCun, his third time on this podcast. He is the chief AI scientist at Meta, professor at NYU, Turing Award winner, and one of the seminal figures in the history of artificial intelligence. He and Meta AI have been big proponents of open sourcing AI development, and have been walking the walk by open sourcing many of their biggest models, including LLaMA 2 and eventually LLaMA 3. Also, Yann has been an outspoken critic of those people in the AI community who warn about the looming danger and existential threat of AGI. He believes that AGI will be created one day, but it will be good, it will not escape human control, nor will it dominate and kill all humans. At this moment of rapid AI development, this happens to be somewhat a controversial position. And so it's been fun seeing Yann get into a lot of intense and fascinating discussions online, as we do in this very conversation. This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Yann LeCun.
- 2:18 – 13:54
Limits of LLMs
- LFLex Fridman
You've had some strong statements, technical statements, about the future of artificial intelligence recently, throughout your career actually, but recently as well. Uh, you've said that autoregressive LLMs are, uh, not the way we're going to make progress towards superhuman intelligence. These are the large language models like GPT-4, like LLaMA 2 and 3 soon, and so on. How do they work and why are they not going to take us all the way?
- YLYann LeCun
For a number of reasons. The first is that there is a number of characteristics of intelligent behavior. For example, the capacity to understand the world, understand the physical world. The ability to remember and retrieve things. Um, persistent memory. The ability to reason and the ability to plan. Those are four essential characteristics of intelligent, um, systems or entities. Humans, animals. LLMs can do none of those, or they can only do them in a very primitive way, and, uh, they don't really understand the physical world, they don't really have persistent memory, they can't really reason, and they certainly can't plan. And so, eh, you know, if- if- if you expect a system to become intelligent just, eh, you know, without having the possibility of doing those things, uh, you're making a mistake. That is not to say that autoregressive LLMs are not useful, they're certainly useful. Um, that they are not interesting, that we can't build a whole ecosystem of, uh, applications around them. Of course we can. But as a path towards human level intelligence, they're missing essential competence. And then there is another tidbit or- or fact that I think is very interesting. Those LLMs are trained on enormous amounts of text. Basically the entirety of all publicly available texts on the internet, right? That's typically on the order of, uh, 10 to the 13th tokens. Each token is typically two bytes, so that's two times 10 to the 13th bytes as training data. It would take you or me 170,000 years to just read through this at eight hours a day. (laughs) Uh, so it seems like an enormous amount of knowledge, right, that those systems can accumulate. Um, but then you realize it's really not that much data. If you- you talk to developmental psychologists and they tell you a four-year-old has been awake for 16,000 hours in his or her life, um, and the amount of information that has, uh, reached the visual cortex of that child in four years, um, is about 10 to the 15th bytes, and you can compute this by estimating that the, uh, optic nerve carries about 20 megabit- megabytes per second roughly. And so 10 to the 15th bytes for a four-year-old versus two times 10 to the 13th bytes for 170,000 years worth of reading, what that tells you is that, uh, through sensory input we see a lot more information than we- than we do through language, and that despite our intuition, most of what we learn and most of our knowledge is through our observation and interaction with the real world, not through language. Everything that we learn in the first few years of life, and, uh, certainly everything that animals learn, has nothing to do with language.
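A quick back-of-the-envelope check of the numbers in this argument, as a minimal sketch: the token count, bytes per token, waking hours, and optic-nerve bandwidth are the figures quoted in the conversation, while the words-per-token ratio and reading speed are outside assumptions.

```python
# Rough check of the data-volume comparison, using the figures quoted above.
tokens = 1e13                          # ~10^13 training tokens
bytes_per_token = 2
text_bytes = tokens * bytes_per_token  # 2e13 bytes of text

words = tokens * 0.75                  # assumption: ~0.75 words per token
words_per_year = 250 * 60 * 8 * 365    # assumption: 250 wpm, 8 hours/day
reading_years = words / words_per_year # ~1.7e5 -> the "170,000 years" figure

awake_seconds = 16_000 * 3600          # a four-year-old's waking life
visual_bytes = awake_seconds * 20e6    # ~20 MB/s through the optic nerve
print(f"text: {text_bytes:.1e} B, reading: {reading_years:,.0f} years, "
      f"vision: {visual_bytes:.1e} B") # vision is ~50x the text corpus
```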
- LFLex Fridman
So it would be good to, um, maybe push against some of the intuition behind what you're saying. So... It is true there's several orders of magnitude more data coming into the human mind, uh, much faster, and the human mind is able to learn very quickly from that, filter the data very quickly. You know, somebody might argue with your comparison between sensory data versus language, that language is already very compressed. It already contains a lot more information than the bytes it takes to store them if you compare it to visual data. So, there's a lot of wisdom in language. There's words and the way we stitch them together, it already contains a lot of information. So, is it possible that language alone already has enough wisdom and knowledge in there to be able to, from that language, construct a, a world model, an understanding of the world, an understanding of the physical world that you're saying LLMs lack?
- YLYann LeCun
So it's a big debate among, uh, philosophers-
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
... and also cognitive scientists, like whether intelligence needs to be grounded in reality. Uh, I'm clearly in the camp that, uh, yes, uh, intelligence cannot appear without some grounding in, uh, some reality. It doesn't need to be, you know, a physical reality. It could be simulated. But, um, but the environment is just much richer than what you can express in language. Language is a very approximate representation of our percepts and our mental models, right? I mean, the- there's a lot of tasks that we accomplish where we manipulate, uh, a mental model of, uh, of the situation at hand. And that has nothing to do with language. Everything that's physical, mechanical, whatever, when we build something, when we accomplish a task, a motor task of, you know, grabbing something, et cetera, we plan our action sequences. And we do this by essentially imagining the result of... the outcome of a sequence of actions that we might imagine. And that requires mental models that don't have much to do with language. And that's, uh, I would argue, most of our knowledge is derived from that interaction with the physical world. And so a lot of, a lot of my, my colleagues who are more, uh, interested in things like computer vision are really in that camp that, uh, AI needs to be embodied essentially. And then other people coming from the NLP side or maybe, uh, you know, some, some other, uh, motivation don't necessarily agree with that. Um, and philosophers are split as well. Uh, and the, um, the complexity of the world is hard to, um, it's hard to imagine. It- it- i- i- uh, you know, it's hard to represent, uh, all the complexities that we take completely for granted in the real world that we don't even imagine require intelligence, right? This is the old Moravec paradox from the pioneer of robotics, Hans Moravec, who said, you know, "How is it that with computers, it seems to be easy to do high level complex tasks like playing chess and solving integrals and doing things like that, whereas the thing we take for granted that we do every day, um," like, I don't know, learning to drive a car or, you know, grabbing an object, "we can't do with computers" (laughs)? Um, and y- y- you know, we have LLMs that can pass, pass the bar exam, so they must be smart. But then they can't learn to drive in 20 hours like any 17 year old. They can't learn to clear out the dinner table and fill up the dishwasher like any 10 year old can learn in one shot. Um, why is that? Like, you know, what, what are we missing? What, what type of learning or, or reasoning architecture or whatever are we missing that, um, um, basically prevents us from, from, you know, having level five self-driving cars and domestic robots?
- LFLex Fridman
Can a large language model construct a world model that does know how to drive and does know how to fill a dishwasher, but just doesn't know how to deal with visual data at this time? So, it, it can, hmm, operate in a space of concepts.
- YLYann LeCun
So yeah, that's what a lot of people are working on. Uh, so the answer, the short answer is no.
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
And the more complex answer is you can use all kind of tricks to get, uh, uh, an LLM to basically digest, um, visual representations of, representations of images, uh, or video or audio for that matter. Um, and, uh, a classical way of doing this is, uh, you train a vision system in some way, and we have a number of ways to train vision systems, either supervised, semi-supervised, self-supervised, all kinds of different ways, uh, that will turn any image into a high level representation, basically a list of tokens that are really similar to the kind of tokens that, uh, typical LLM takes as an input. And then you just feed that to the LLM in addition to the text, and you just expect the LLM to kind of, uh, you know, during training to kind of be able to, uh, use those representations to help, uh, make decisions. I mean, there's been work a- along those lines for, for quite a long time. Um, and now you see those systems, right? I mean, there are LLMs that can... that have some vision extension. But they're basically hacks in the sense that, um, those things are not, like, trained end to end to, to handle, to really understand the world. They're not trained with video, for example. Uh, they don't really understand intuitive physics, at least not at the moment.
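As a rough illustration of the kind of "vision extension" hack being described, here is a minimal sketch: a pretrained image encoder produces patch embeddings, a learned projection maps them into the LLM's token-embedding space, and they are simply prepended to the text tokens. All class and parameter names here are hypothetical, not any particular system's API.

```python
import torch
import torch.nn as nn

class VisionLanguageBridge(nn.Module):
    """Hypothetical sketch: feed image features to an LLM as extra tokens."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder           # e.g. a pretrained ViT
        self.llm = llm                                 # a decoder-only LLM
        self.project = nn.Linear(vision_dim, llm_dim)  # learned adapter

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor):
        # (batch, n_patches, vision_dim) -> (batch, n_patches, llm_dim)
        vision_tokens = self.project(self.vision_encoder(image))
        # Prepend the image "tokens" to the text embeddings; the LLM attends
        # over both, but nothing here is trained end to end on video or
        # grounded in intuitive physics -- which is LeCun's point.
        return self.llm(torch.cat([vision_tokens, text_embeds], dim=1))
```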
- LFLex Fridman
So you don't think there's something special to you about intuitive physics, about sort of common sense reasoning about the physical space, about physical reality? That's, that to you is a giant leap that LLMs are just not able to do.
- YLYann LeCun
We're not gonna be able to do this with the type of LLMs that we are, uh, working with today. And there's a number of reasons for this. But, uh, the main reason is-The way LLM are ch- LLMs are trained is that you, you take a piece of text, you remove some of the words in that text, you mask them, you replace by- replace them by blank markers, and you train a gigantic neural net to predict the words that are missing. Uh, and if you build, uh, this neural net in a particular way so that it can only look at, um, words that are to the left of the one it's trying to predict, then what you have is a system that, uh, basically is trying to predict the next word in a text, right? So then you can feed it, um, a text, a prompt, and you can ask it to predict the next word. It can never predict the next word exactly. And so what it's gonna do is, uh, produce a probability distribution of all the possible words in your dictionary. In fact, it doesn't predict words, it predicts tokens that are kind of sub-word units. And so it's easy to handle the uncertainty in the prediction there, because there is only a finite number of possible words in the dictionary, and you can just compute a distribution over them. Um, then what you- what the system does is that it- it picks a word from that distribution. Of course, there's a higher chance of picking words that have a higher probability within that distribution, so you sample from that distribution to actually produce a word. And then you shift that word into the input. And so that allows the system now to predict the second word, right? And once you do this, you shift it into the input, et cetera. That's called autoregressive prediction, um, which is why those LLMs should be called autoregressive LLMs. Uh-
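The loop being described can be written down in a few lines. A minimal sketch, where `model` stands in for any trained next-token predictor returning scores over the vocabulary (a hypothetical placeholder, not a specific library API):

```python
import torch

def autoregressive_generate(model, prompt_tokens, n_new_tokens):
    """Sample one token at a time, shifting each prediction into the input."""
    tokens = list(prompt_tokens)
    for _ in range(n_new_tokens):
        logits = model(torch.tensor([tokens]))[0, -1]    # scores for the next token
        probs = torch.softmax(logits, dim=-1)            # distribution over a finite vocabulary
        next_token = torch.multinomial(probs, 1).item()  # sample from it (not argmax)
        tokens.append(next_token)                        # shift into the input, repeat
    return tokens
```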
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
But we just call them LLMs. And there is a difference between this kind of process and a process by which before producing a word,
- 13:54 – 17:46
Bilingualism and thinking
- YLYann LeCun
when you talk, when you and I talk, you and I are bilingual.
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
We think about what we're gonna say, and it's relatively independent of the language in which we're gonna say it. When we, when we talk about, like, uh, I don't know, let's say a mathematical concept or something-
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
... the kind of thinking that we're doing and the answer that we're planning to, uh, produce is not linked to whether we're gonna say it in French or Russian or English.
- LFLex Fridman
Chomsky just rolled his eyes, but I understand. So you're saying that there's a, a bigger abstraction that repre- that's, uh, that goes before language.
- YLYann LeCun
Yeah.
- LFLex Fridman
It maps onto language.
- YLYann LeCun
Right. It's certainly true for a lot of thinking that we, that we do.
- LFLex Fridman
Is that obvious that we don't... Like, you're saying your thinking is s- same in French as it is in English?
- YLYann LeCun
Yeah, pretty much. Yeah.
- LFLex Fridman
Pretty much? Or is this like... How, how flexible are you, like if, if there's a probability distribution?
- YLYann LeCun
(laughs) Well, it, it depends what kind of thinking, right? If it's just, uh, if it's like producing puns, I get much better in French than English about that. (laughs)
- LFLex Fridman
Mm-hmm. No, but so-
- YLYann LeCun
Or much worse. It de-
- LFLex Fridman
Is there an abstract representation of puns, like is your humor an abstract rep- Like when you tweet, uh, and your tweets are sometimes a little bit spicy, uh, what's- is there an abstract representation in your brain of a tweet before it maps onto English?
- YLYann LeCun
There is an abstract representation of, uh, imagining the reaction of a reader-
- LFLex Fridman
Right.
- YLYann LeCun
... to that, uh, text.
- LFLex Fridman
Where you start with laughter and then figure out how to make that happen?
- YLYann LeCun
Or so... No, uh, yeah, figure out, uh, like a reaction you wanna cause-
- LFLex Fridman
Right.
- YLYann LeCun
... and then, and then figure out how to say it, right?
- LFLex Fridman
Okay.
- YLYann LeCun
So that it causes that reaction. But that's like really close to language, but think about like a matema- mathematical concept, uh, or, um, you know, imagining, you know, something you want to build out of wood or something like this, right? The kind of thinking you're doing has absolutely nothing to do with language really. Like it's not like you have necessarily like an internal monologue in any particular language. You're, you're, you know, imagining mental models of, of the thing, right? I mean, if I, if I asked you to, like, imagine what this, uh, water bottle will look like if I rotate it-
- LFLex Fridman
Huh.
- YLYann LeCun
... 90 degrees, um, that has nothing to do with language. And so, uh, so clearly there is, you know, a more abstract level of representation, uh, in which we, we do most of our thinking and we plan what we're gonna say if the output is, is, you know, uttered words as opposed to an output being, uh, you know, muscle actions-
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
... right? Um, we, we plan our answer before we produce it. And LLMs don't do that. They just produce one word after the other. Instinctively, if you want. It's like, it's a bit like the, you know, subconscious, uh, actions where you don't... Like you're distracted, you're doing something, you're completely concentrated and someone comes to you and, uh, you know, asks you a question and you kind of answer the question. You don't have time to think about the answer, but the answer is easy so you don't need to pay attention and you sort of respond automatically. That's kind of what an LLM does, right? It doesn't think about its answer really. Uh, it retrieves it because it's accumulated a lot of, uh, knowledge, so it can retrieve some, some things, but it's going to just spit out one token after the other without planning the answer.
- LFLex Fridman
But you're making it sound just one token after the other, one token-at-a-time generation is, uh, bound to be simplistic. But if the world model is sufficiently sophisticated, that one token at a time, the, the most likely thing it generates as a sequence of tokens is going to be a deeply profound thing.
- 17:46 – 25:07
Video prediction
- LFLex Fridman
I, I think the fundamental question is can you build a, a really complete world model? Not complete, but a, uh, one that has a deep understanding of the world?
- YLYann LeCun
Yeah. So can you build this, first of all, by prediction?
- LFLex Fridman
Right.
- YLYann LeCun
And the answer is probably yes. Can you predi- can you build it by predicting words? And the answer is most probably no... because language is very poor, in terms of weak or low bandwidth, if you want. There's just not enough information there. So building world models means observing the world and, uh, understanding why the world is evolving the way, the way it is. And then, uh, the e- the extra component of a world model is something that can predict how the world is going to evolve as a consequence of an action you might take, right? So a world model really is: here is my idea of the state of the world at time T, here is an action I might take. What is the predicted state of the world at time T plus one? Now, that state of the world doesn't, does not need to represent everything about the world. It just needs to represent enough that's relevant for this planning of, of the action, but not necessarily all the details. Now, here is the problem. Um, you're not going to be able to do this with generative models. So a generative model is trained on video, and we've tried to do this for 10 years. You take a video, show a system a piece of video, and then ask it to predict the remainder of the video. Basically, predict what's gonna happen.
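In code, the world model just defined is a learned transition function. A minimal, purely illustrative sketch; the architecture and names are assumptions, not an actual system:

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Predicts an abstract state s(t+1) from state s(t) and action a(t)."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.dynamics = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # The state is an abstract representation: it only carries what is
        # relevant for planning, not every detail of the world.
        return self.dynamics(torch.cat([state, action], dim=-1))
```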
- LFLex Fridman
One frame at a time, do the same thing as sort of, uh, the autoregressive LLMs do, but for video.
- YLYann LeCun
Right. Either one frame at a time or a group of frames at a time.
- LFLex Fridman
LVM.
- YLYann LeCun
Um, but yeah, uh, a large video model if you want. Uh, the idea of, of doing this has been floating around for a long time. And at, uh, at FAIR, uh, some of my co- colleagues and I have been trying to do this for about 10 years. Um, and y- you can't, you can't really do the same trick as with LLMs because, uh, you know, LLMs as I said, you can't predict, uh, exactly which word is gonna follow a sequence of words, but you can predict the distribution of the words. Now, if you go to video, what you would have to do is predict the distribution over all possible frames in a video. And we don't really know how to do that properly. Uh, we d- we, we do not know how to represent distributions over high dimensional continuous spaces in ways that are useful. A- and, and there, there lies the main issue. And the reason we can't do this is because the world is incredibly more complicated and richer in terms of information than, than text. Text is discrete. Uh, video is high dimensional and continuous. A lot of details in this. Um, so if I take a, a video of this room, uh, and the video is, you know, a camera panning around-
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
... um, there is no way I can predict everything that's gonna be in the room as I pan around. The system cannot predict what's gonna be in the room as the camera is panning. Maybe it's gonna predict this is, this is a room where there is a light and there is a wall and things like that. It can't predict what the painting on the wall looks like, or what the texture of the couch looks like. Certainly not the texture of the carpet. So th- there's no way it can predict all those details. So the, the way to handle this, or one way to possibly handle this, which we've been working on for a long time, is to have a model that has what's called a latent variable. And the latent variable is fed to a neural net, and it's supposed to represent all the information about the world that you don't perceive yet, and, uh, that you need to augment, uh, the, the system with for the prediction to do a good job at predicting pixels. Including the, you know, fine texture of the, of the carpet and the, and the couch, and, and the painting on the wall. Um, uh, that has been a complete failure, essentially. And we've tried lots of things. We tried, um, just straight neural nets, we tried GANs, we tried, um, you know, VAEs, uh, all kinds of regularized autoencoders. We tried, um, many things. We also tried those kind of methods to learn, uh, good representations of images or video, um, that could then be used as input to, for example, an image classification system.
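The latent-variable idea can be sketched like this: the predictor receives an extra input z that is supposed to carry the unpredictable details. This is an illustrative caricature of the family of models being described (the ones that largely failed for video), with hypothetical names and shapes:

```python
import torch
import torch.nn as nn

class LatentVideoPredictor(nn.Module):
    """Predict future pixels from past frames plus a latent z meant to encode
    what cannot be inferred from the past (textures, unseen content, ...)."""

    def __init__(self, frame_dim: int, latent_dim: int, hidden: int = 512):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(frame_dim + latent_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, frame_dim),
        )

    def forward(self, past_frames: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # Training requires inferring or regularizing z (VAE, GAN, ...),
        # which is exactly the part that never worked well for video.
        return self.decoder(torch.cat([past_frames, z], dim=-1))
```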
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
And that also has basically failed. Like, all the systems that attempt to predict missing parts of an image or video, um, fr- you know, uh, uh, from a corrupted version of it, basically. So I take an image or a video, corrupt it or transform it in some way, and then try to reconstruct the complete video or image from the corrupted version, and then hope that internally, the system will develop good representations of images that you can use for object recognition, segmentation, whatever it is. That has been essentially a complete failure. And it works really well for text. That's the principle that is used for LLMs, right?
- LFLex Fridman
So w- wha- where is the failure exactly? Is it that it's very difficult to form a good representation of an image, a good, in a, like a good embedding of all, all the important information in the image? Is it in terms of the consistency of image to image to image to image that forms the video? Like where, what are the, if we do a highlight reel of all the ways you failed, uh, what, what's that look like?
- YLYann LeCun
Okay. So the reason this doesn't work, uh, is first of all I have to tell you exactly what doesn't work, because there is something else that does work.
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
Uh, so the thing that does not work is, uh, training the system to learn representations of images by training it to reconstruct, uh, a good image from a corrupted version of it. Okay. That's what doesn't work. And we have a whole slew of techniques for this, uh, that are, you know, a variant of denoising autoencoders. Something called MAE, developed by, uh, some of my colleagues at FAIR, masked autoencoder. So it's basically like the, you know, LLMs or, or, or, or things like this where you train the system by corrupting text, except you corrupt images. You remove patches from it and you train a gigantic neural net to reconstruct.... the features you get are not good. And you know they're not good because if you now train the same architecture, but you train it supervise-
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
... with, uh, labeled data, with text- textual descriptions of images, et cetera, you do get good representations. And the performance on recognition tasks is much better than if you do this self-supervised pre-training.
- LFLex Fridman
So the architecture is good.
- YLYann LeCun
The architecture is good. The architecture of the encoder is good.
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
Okay? But the fact that you train the system to reconstruct images does not lead it to produce, to learn good generic features of images.
- LFLex Fridman
When you train in a self-supervised way?
- YLYann LeCun
Self-supervised by reconstruction.
- LFLex Fridman
Yeah, by reconstruction.
- YLYann LeCun
Okay. So what's the alternative? (laughs)
- LFLex Fridman
(laughs)
- YLYann LeCun
The alternative-
- LFLex Fridman
Yes.
- YLYann LeCun
... is, uh, joint embedding.
- 25:07 – 28:15
JEPA (Joint-Embedding Predictive Architecture)
- LFLex Fridman
What is joint embedding? What are- what are these architectures that you're so excited about?
- YLYann LeCun
Okay. So now instead of training a system to encode the image and then training it to reconstruct the- the full image from a corrupted version, you take the full image, you take the corrupted or transformed version, you run them both through encoders-
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
... which- which in general are identical, but not necessarily. And then you- you train a predictor on top of those, uh, encoders, um, to predict the representation of the full input from the representation of the corrupted one.
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
Okay?
- LFLex Fridman
Great.
- YLYann LeCun
So joint embedding, because you're- you're taking the- the full input and the corrupted version, or transformed version, run them both through encoders, so you get a joint embedding, and then you s- and then you're- you're saying, "Can I predict the representation of the full one from the representation of the corrupted one?" Okay? Um, and I call this a JEPA. So that means joint embedding predictive architecture, because there's joint embedding, and there is this predictor that predicts the representation of the good guy from- from the bad guy. Um, and the big question is, how do you train something like this? Uh, and until five years ago or six years ago, we didn't have particularly good answers for how you train those things, except for one, um, called contrastive tr- uh, contrastive learning, where... Uh, and the idea of contrastive learning is, you- you take a pair of images that are, again, an image and a corrupted version or degraded version somehow, or transformed version, of the original one, and you train the predicted representation to be the same as- as that. If you only do this, the system collapses. It basically completely ignores the input and produces representations that are constant.
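A minimal sketch of that architecture, with hypothetical encoder and predictor modules: the loss is computed between representations, never between pixels. As just stated, minimizing this loss alone lets the encoders collapse to a constant output, which is what the contrastive and non-contrastive tricks are for:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def jepa_loss(encoder: nn.Module, predictor: nn.Module,
              x_full: torch.Tensor, x_corrupted: torch.Tensor) -> torch.Tensor:
    """Predict the representation of the full input from the corrupted one."""
    s_full = encoder(x_full)       # representation of the "good guy"
    s_corr = encoder(x_corrupted)  # representation of the "bad guy"
    s_pred = predictor(s_corr)     # predicted representation
    # The loss lives entirely in embedding space -- no pixel reconstruction.
    # On its own this collapses: a constant encoder output gives zero loss.
    return F.mse_loss(s_pred, s_full)
```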
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
So the contrastive methods avoid this. And- and those things have been around since the early '90s. I had a paper on this in 1993. Um, the idea is you also show pairs of images that you know are different, and then you push away the representations from each other. So you say, not only do representations of things that we know are the same should be the same, or should be similar, but representations of things that we know are different should be different. And that prevents the collapse, but it has some limitation. And there's a whole bunch of, uh, techniques that have appeared over the last six, seven years, um, that can revive this- this type of method, um, some of them from FAIR, some of them from- from Google and other places. Um, but there are limitations to those contrastive methods. What has changed in the last, uh, you know, three, four years is now- now we have methods that are non-contrastive. So they don't require those negative contrastive samples of images that are- that we know are different. You can only... You train them only with images that are, you know, different versions or different views of the same thing. Uh, and you rely on some other tweaks to prevent the system from collapsing. And we have half a dozen different methods for this now.
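The contrastive fix, in the spirit of the early-'90s Siamese-network work mentioned here, can be sketched as a margin-based loss: pull embeddings of matched pairs together, push known-different pairs apart. A simplified illustration, not the exact 1993 formulation:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(s_a: torch.Tensor, s_b: torch.Tensor,
                     same: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """same = 1 if the two inputs are views of the same thing, 0 otherwise."""
    d = F.pairwise_distance(s_a, s_b)
    pull = same * d.pow(2)                         # attract positive pairs
    push = (1 - same) * F.relu(margin - d).pow(2)  # repel negatives, up to a margin
    return (pull + push).mean()                    # the push term prevents collapse
```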
- 28:15 – 37:31
JEPA vs LLMs
- LFLex Fridman
So what is the fundamental difference between joint embedding architectures and LLMs? So can, uh, can, uh, JEPA take us to AGI? Whereby I should say that you don't like, uh, the term AGI, and we'll probably argue, I think every single time I've talked to you, we've argued about the G in AGI.
- YLYann LeCun
Yes.
- LFLex Fridman
Like... (laughs)
- YLYann LeCun
(laughs)
- LFLex Fridman
I get- I get it. (laughs) I get it. Well, we'll probably continue to argue about it. It's great. Uh, you- you like, uh, AMI. Uh, this, 'cause- 'cause you like French, and, um, AMI is- is- is, uh, I guess, friend in French.
- YLYann LeCun
Yes.
- LFLex Fridman
And AMI stands for advanced machine intelligence.
- YLYann LeCun
Right.
- LFLex Fridman
Um, but either way, can JEPA take us to that, towards that advanced machine intelligence?
- YLYann LeCun
Well, so it's a- it's a first step. Okay? So first of all, uh, what- what's the difference with generative architectures like LLMs? Um, so LLMs, um, or vision systems that are trained by reconstruction generate the inputs, right? They generate the original input that is non-corrupted, non-transformed, right? So you have to predict all the pixels. And there is a huge amount of resources spent in the system to actually predict all those pixels, all the details. Uh, in a JEPA, you're not trying to predict all the pixels. You're only trying to predict an abstract representation of- of the inputs, right? And that's much easier in many ways. So what the JEPA system, when it's being trained, is trying to do is extract as much information as possible from the input, but yet only extract information that is relatively easily predictable. Okay.
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
So there's a lot of things in the world that we cannot predict. Like for example, if you have a self-driving car driving down the street or road, uh, there may be, uh, trees around the- around the road, and it could be a windy day. So the- the leaves on the tree are kind of moving in kind of semi-chaotic, random ways that you- ... you can't predict and you don't care, you don't wanna predict. So what you want is your encoder to basically eliminate all those details. It will tell you there is moving leaves, but it's not gonna keep the details of exactly what's going on. Um, and so when you do the prediction in representation space, you're not going to have to predict every single pixel of every leaf. And that, you know, um, not only is a lot simpler, but also it allows the system to essentially learn an abstract representation of, of the world, where, you know, what can be modeled and predicted is preserved and the rest is viewed as noise and eliminated by the encoder. So it kind of lifts the level of abstraction of the representation.
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
If you think about this, this is something we do absolutely all the time. Whenever we describe a phenomenon, we describe it at a particular level of abstraction. And we don't always describe every natural phenomenon in terms of quantum field theory, right? (laughs) That would be impossible, right? So we have multiple levels of, of abstraction to describe what happens in the world, you know, starting from quantum field theory to, like, atomic theory and molecules, you know, and chemistry, materials and, you know, all the way up to, you know, kind of concrete objects in the real world, and things like that. So we, we can't just only model everything at the lowest level. And that, that's what the idea of JEPA is really about: learn abstract representations in a self-supervised, uh, manner. And, you know, you can do it hierarchically as well. So that I think is an essential component of an intelligent system. And in language, we can get away with this, because language is already to some level abstract, and has already eliminated a lot of information that is not predictable. And, um, so we can get away without doing the joint embedding, without, you know, lifting the abstraction level, and by directly predicting words.
- LFLex Fridman
So joint embedding, it's still generative, but it's generative in this abstract representation space.
- YLYann LeCun
Yeah.
- LFLex Fridman
And you're saying language, we were lazy with language, 'cause we already got the abstract representation for free, and now we have to zoom out and actually think about generally intelligent systems, we have to (sighs) deal with the full mess of physical reality, of reality, and you can't, you j- you do have to do this step of jumping from, uh, the full, rich, detailed reality to a, uh, abstract representation of that reality, based on which you can then reason and all that kind of stuff.
- YLYann LeCun
Right. And the thing is, those self-supervised algorithms that, that learn by prediction, even in representation space, uh, they learn more, uh, concepts if the input data you feed them is more redundant. The, the more redundancy there is in the data, the more they're able to capture some internal structure of it. And so there, there is way more redundancy and structure in perceptual, uh, inputs, sensory input like, like, like vision than there is in, uh, text, which is not nearly as redundant. This is back to the question you were asking a few minutes ago. Language might represent more information really, because it's already compressed, you're, you're right about that. But that means it's also less redundant, and so self-supervised learning will not work as well.
- LFLex Fridman
Is it possible to join the self-supervised training on visual data and self-supervised training on language data? There is a huge amount of knowledge, even though you talked, uh, about those 10 to the 13th tokens. Those 10 to the 13th tokens represent the entirety, a large fraction of what us humans have figured out, both the shit talk on Reddit and the contents of all the books and the articles and the full spectrum of human, uh, intellectual creation. So is it possible to join those two together?
- YLYann LeCun
Well, eventually, yes. But I think, uh, if we do this too early, we run the risk of being tempted to cheat. And in fact that's what people are doing at the moment with vision-language models. We're basically cheating. We're, uh, using, uh, language as a crutch to help the deficiencies o- of our, uh, vision systems to kind of l- learn good representations from, uh, images and video. And, uh, the problem with this is that we might, you know, improve our, uh, vision-language systems a bit, I mean, our language models by, you know, feeding them images, but we're not gonna get to the level of even the intelligence or level of understanding of the world of a cat or a dog, which doesn't have language. You know, they don't have language, and they understand the world much better than any LLM. They can plan really complex actions and sort of imagine the result of a bunch of actions. How do we get machines to learn that before we combine that with language? Obviously if we combine this with language, this is gonna be a, a winner.
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
Um, but, but before that, we have to focus on, like, how do we get systems to learn how the world works?
- LFLex Fridman
So this kind of joint embedding predictive architecture, for you that's going to be able to learn something like common sense, something like what a cat uses to predict how to mess with its owner most optimally by knocking over a thing?
- YLYann LeCun
That's, that's the hope. Uh, in fact, the techniques we're using are non-contrastive.
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
Uh, so not only is the architecture non-generative, the learning procedures we're using are non-contrastive. So we have two, two sets of techniques, uh, o- one set is based on distillation, and there's a number of, uh, methods that use this principle. Uh, one by DeepMind called BYOL, uh, uh, a couple by, by FAIR, one, one called, uh, VICReg, and another one called I-JEPA. And VICReg, I should say, is not a distillation method actually, but I-JEPA and BYOL certainly are. And there's another one also called DINO, uh, also produced at FAIR. And the idea of those things is that you take the full input, let's say an image, uh, you run it through an encoder that, uh, produces a representation. And then you corrupt that input or transform it, run it through, essentially, what amounts to the same encoder with some minor differences. And then train, uh, a predictor. Sometimes the predictor is very simple, sometimes it doesn't exist. But train a predictor to predict a representation of the first, uh, uncorrupted input from the corrupted input. Um, but you only train the- the second branch. Um, you only train the part of the network that is fed with the corrupted input. The other network you don't- you don't train. But since they share the same weights, when you modify the first one, it also modifies the second one. Uh, and with various tweaks, you can prevent this system from collapsing, uh, with the collapse of the type I was explaining before, where the system basically ignores the input. Um, so that works very well.
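A minimal sketch of the distillation-style recipe: only the branch fed the corrupted input receives gradients, while the target branch is a copy whose weights track the student. The transcript describes plain weight sharing; the exponential-moving-average update below is one common variant (as in BYOL or I-JEPA), and all names here are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, predictor, optimizer,
                      x_full, x_corrupted, ema: float = 0.99):
    """One non-contrastive step: match the teacher's embedding of the full
    input from the student's embedding of the corrupted one."""
    with torch.no_grad():
        target = teacher(x_full)            # no gradients into this branch
    pred = predictor(student(x_corrupted))  # only this branch is trained
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # The teacher's weights follow the student instead of receiving gradients.
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(ema).add_(p_s, alpha=1 - ema)
    return loss.item()
```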
- 37:31 – 38:51
DINO and I-JEPA
- YLYann LeCun
The- the techniques, uh, we've... The two techniques, which are open-sourced from FAIR, uh, DINO and, uh, and I-JEPA, work really well for that.
- LFLex Fridman
So what kind of data are we talking about here?
- YLYann LeCun
So there's- there's several scenarios. One- uh, one scenario is you take an image. You corrupt it by, um, changing the cropping, for example, changing the size a little bit, maybe changing the orientation, blurring it, changing the colors, doing all kinds of horrible things to it.
- LFLex Fridman
But basic horrible things.
- YLYann LeCun
Basic horrible things, that sort of degrade the quality a little bit and change the framing, uh, you know, crop the image. Um, or ... And in some cases, in the case of I-JEPA, you don't need to do any of this. You just- you just mask some parts of it, right? You just basically remove some regions, like a big block essentially. And- and then, you know, run through the encoders, um, and train the entire system, encoder and predictor, to predict the representation of the good one from the representation of the corrupted one. Um, so that's the I-JEPA. Doesn't need to know that it's an image, for example, because the only thing it needs to know is how to do this masking.
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
Um, whereas with DINO, you need to know it's an image, because you need to do things like, you know, geometry transformation and blurring and things like that, that are really image specific.
- 38:51 – 44:22
V-JEPA
- YLYann LeCun
Uh, a more recent version of- of this that we have is called V-JEPA. So it's basically the same idea as I-JEPA, except, um, it's applied to video. So now you take a whole video, and you mask a whole chunk of it. And what we mask is actually kind of a temporal tube, so a whole, uh, segment of each frame in the video over the entire video.
- LFLex Fridman
Mm-hmm. And that tube is, like, statically positioned throughout the frames?
- YLYann LeCun
Throughout- throughout the-
- LFLex Fridman
Just literally a straight tube?
- YLYann LeCun
... the- the- the tube, yeah. Typically, it's 16 frames or something, and we mask the same region over the entire 16 frames. It's a different one for every video obviously, and, um, and then again, uh, train that system so as to predict the representation of the full video from the partially masked video. Uh, that works really well. It's the first system that we have that learns good representations of video, so that when you feed those representations to a supervised, uh, classifier head, it can- it can tell you what action is taking place in the video with, you know, pretty good accuracy. Um, so that- that's the first time we get something of that, uh, of that quality.
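The "tube" masking is simple to state in code: choose one rectangular region and blank it out at the same position in every frame of a clip. The 16-frame clip length comes from the description above; the resolution and block size are arbitrary assumptions:

```python
import torch

def tube_mask(video: torch.Tensor, block_h: int, block_w: int) -> torch.Tensor:
    """video: (frames, height, width, channels). Masks the same rectangle
    in every frame -- a static 'temporal tube' through the clip."""
    _, height, width, _ = video.shape
    top = torch.randint(0, height - block_h + 1, (1,)).item()
    left = torch.randint(0, width - block_w + 1, (1,)).item()
    masked = video.clone()
    masked[:, top:top + block_h, left:left + block_w, :] = 0.0
    return masked

clip = torch.rand(16, 224, 224, 3)     # one 16-frame clip (sizes assumed)
masked_clip = tube_mask(clip, 64, 64)  # same region masked across all frames
```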
- LFLex Fridman
So th- that's a good test that a good representation is formed-
- YLYann LeCun
Yeah.
- LFLex Fridman
... that means there's something to this.
- YLYann LeCun
Yeah. Um, we also have preliminary results that, uh, seem to indicate that the representation allows us... allows our system to tell whether the video is physically possible or completely impossible, because some object disappeared, or an object, you know, suddenly jumped from one location to another, or- or changed shape or something.
- LFLex Fridman
So it's able to capture some physical cons-, some physics-based constraints about the reality represented in the video-
- YLYann LeCun
Yeah.
- LFLex Fridman
... about the appearance and the disappearance of objects?
- YLYann LeCun
Yeah. That's really new.
- LFLex Fridman
Okay. But c- can this actually get us to this kind of, uh, world model that understands enough about the world to be able to drive a car?
- YLYann LeCun
Uh, possibly. Um, this is gonna take a while before we get to that point, but, um, um... And there are systems already, you know, robotic systems, that are based on this, uh, idea. Uh, and, uh, what you need for this is a slightly modified version of this, where, um, imagine that you have, uh, a video, a- a complete video, and what you're doing to this video is that you are either translating it in time towards the future, so you only see the beginning of the video, but you don't see the latter part of it that is in the original one. Or you just mask the second half of the video, for example. Um, and then you- you train a s- a JEPA system, of the type I described, to predict the representation of the full video from the- the shifted one. But you also feed the predictor with an action, for example, you know, the wheel is turned 10 degrees to the le- to the right or something, right? So if it's a, you know, a dash cam in a car, and you know the angle of the wheel, you should be able to predict to some extent what's go- what's gonna go, what's going to happen to what you see. Uh, you're not gonna be able to predict all the details of, you know, objects that appear in the view obviously, but at an abstract representation level, you can- you can probably predict what's gonna happen.
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
So now what you have is an internal model that says, "Here is my idea of the state of the world at time T. Here is an action I'm taking. Here is a prediction of the state of the world at time T plus one, T plus delta T, T plus two seconds," whatever it is. If you have a model of this type, you can use it for planning. So now you can do what LLMs cannot do, which is planning what you're gonna do so as to arrive at a particular, uh, outcome or satisfy a particular objective... right? So you can have a number of objectives, um, right? If, you know, I can, I can predict that, uh, if I have, uh, an object like this, right? And I open my hand, it's gonna fall, right? (laughs) And, uh, and if I push it with a particular force on the table, it's gonna move. If I push the table itself, it's probably not gonna move, uh, with the same force. Um, so we have, we have this internal model of the world in our, in our mind, uh, which allows us to plan sequences of actions to arrive at a particular goal. Um, and so, um, so now if you have this world model, we can imagine a sequence of actions, predict what the outcome of the sequence of actions is going to be, measure to what extent the final state satisfies a particular objective, like, you know, moving the bottle to the (laughs) left of the table-
- LFLex Fridman
Uh-huh.
- YLYann LeCun
... um, and then plan a sequence of actions that will minimize this objective at runtime. We're not talking about learning, we're talking about inference time, right? So this is planning really. And in optimal control, this is a very classical thing, it's called, uh, model predictive control. You have a model of the system you want to control that, you know, can predict the sequence of states corresponding to a sequence of commands, and you are planning a sequence of commands so that according to your world model, the, the, the end state of the system will, uh, satisfy, uh, an objective that you fix. This is the way, uh, you know, rocket trajectories have been planned since computers have been around, so since the early '60s essentially.
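Model predictive control as described here reduces to an inference-time search. A minimal random-shooting sketch (one of the simplest MPC variants), reusing the kind of hypothetical world-model interface sketched earlier; nothing is learned inside this loop:

```python
import torch

def plan(world_model, objective, state, horizon=10, n_candidates=256, action_dim=2):
    """Pick the action sequence whose predicted end state best meets the objective."""
    best_cost, best_actions = float("inf"), None
    for _ in range(n_candidates):
        actions = torch.randn(horizon, action_dim)  # candidate command sequence
        s = state
        for a in actions:                           # roll the world model forward
            s = world_model(s, a)
        cost = float(objective(s))                  # distance from the goal state
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions  # in practice: execute the first action, then re-plan
```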
- LFLex Fridman
So yes, for model predictive control, but
- 44:22 – 50:40
Hierarchical planning
- LFLex Fridman
you also often talk about hierarchical planning.
- YLYann LeCun
Yeah.
- LFLex Fridman
Can hierarchical planning emerge from this somehow?
- YLYann LeCun
Well, so no, you, you will have to build a specific architecture to allow for hierarchical planning. So hierarchical planning is absolutely necessary if you want to plan complex actions. Uh, if I wanna go from, let's say, from New York to Paris, that's the example I use all the time, (laughs) and I'm sitting, uh, in my office at NYU, my objective that I need to minimize is my distance to Paris. At a high level, a very abstract representation of my, uh, my location, I will have to decompose this into two sub-goals. First one is, um, go to the airport, second one is catch a plane to Paris.
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
Okay, so my sub-goal is now, uh, going to the airport. My objective function is my distance to the airport. How do I go to the airport? Well, I have to go in the street and hail a taxi-
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
... which you can do in New York. Um, okay, now I have another sub-goal, go down on the street. Uh, what that means, uh, going to the elevator, going down the elevator, walk out the street. How do I go to the elevator? I have to, uh, stand up from my chair, open the door of my office, go to the elevator, push, push the button. How do I get up from my chair? Like, you know, you can imagine going down, all the way down to basically what amounts to millisecond by millisecond muscle control.
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
Okay? And obviously you're not going to plan your entire trip from New York to Paris in terms of millisecond by millisecond muscle control. First, that would be incredibly expensive, but it will also be completely impossible because you don't know all the conditions of what's gonna happen. Uh, you know, how long it's gonna take to catch a taxi, um, or to go to the airport with traffic, you know, uh, I mean, you, you would have to know exactly the condition of everything to be able to do this planning, and you don't have the information. So you, you have to do this hierarchical planning so that you can start acting and then sort of re-planning as you go. And nobody really knows how to do this in AI. Um, nobody knows how to train a system to learn the appropriate multiple levels of representation so that hierarchical planning works.
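Nobody knows how to learn these levels, but a hand-coded toy makes the shape of the problem concrete: each abstract goal expands into sub-goals, recursively, bottoming out well before anything like muscle control. Everything below is a hypothetical illustration of the New York-to-Paris example, not a planning algorithm:

```python
# Hand-coded goal decomposition; the open problem is *learning* these levels.
PLANS = {
    "go to Paris": ["go to the airport", "catch a plane to Paris"],
    "go to the airport": ["go down to the street", "hail a taxi"],
    "go down to the street": ["go to the elevator", "ride it down", "walk out"],
    "go to the elevator": ["stand up", "open the office door", "walk to the elevator"],
    # ...and so on; we never plan down to millisecond muscle control.
}

def expand(goal: str, depth: int = 0) -> None:
    """Recursively expand a goal into sub-goals until none are known."""
    print("  " * depth + goal)
    for sub_goal in PLANS.get(goal, []):  # primitives have no entry
        expand(sub_goal, depth + 1)

expand("go to Paris")
```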
- LFLex Fridman
Does something like that already emerge? So like can you use an LLM, state of the art LLM to get you from New York to Paris by doing exactly the kind of detailed set of questions that you just did? Which is, can you give me a high- a list of 10 steps I need to do to get from New York to Paris? And then for each of those steps, can you give me a list of 10 steps how I make that step happen? And for each of those steps, can you give me a list of 10 steps to make each one of those until you're moving your mus- individual muscles? Uh, maybe not. Whatever you can actually act upon using your own mind.
- YLYann LeCun
Right. So there's a lot of questions that are sort of implied by this, right? So the first thing is, uh, LLMs will be able to answer some of those questions down to some level of abstraction-
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
... under the condition that they've been trained with similar scenarios in their training set.
- LFLex Fridman
They would be able to answer all those questions, but some of them may be hallucinated, meaning non-factual.
- YLYann LeCun
Yeah, true. I mean, they will probably produce some answer, except they're not gonna be able to really kind of produce millisecond by millisecond muscle control of how you, how you stand up from your chair.
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
Right? So, but down to some level of abstraction where you can describe things by words, they might be able to give you a plan, but only under the condition that they've been trained to produce those kind of plans.
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
Right? They're not gonna be able to plan for situations that, that they never encountered before. They basically are going to have to regurgitate the template that they've been trained on.
- LFLex Fridman
But where, like just for the example of New York to Paris, is, is it gonna start getting into trouble? Like at which layer, layer of abstraction do you think you'll start? Because like I can imagine almost every single part of that an LLM would be able to answer somewhat accurately, especially when you're talking about New York and Paris, major cities.
- YLYann LeCun
So I mean, certainly, uh, an LLM would be able to solve that problem if you fine-tune it for it, uh, you know-
- LFLex Fridman
Sure.
- YLYann LeCun
... just, uh, and, and so, uh, I can't say that an LLM cannot do this. It can do this if you train it for it, there's no question. Uh, down to a certain level-... where things can be formulated in terms of words. But, like, if you want to go down to, like, how you, you know, climb down the stairs or just stand up from your chair in terms of, uh, words, like, you- you can't- you can't do it. Um, y- you- you need... That's one of the reasons you need experience of the physical world, which is much higher bandwidth than what you can express in words, in human language.
- LFLex Fridman
So everything we've been talking about on the joint embedding space, is it possible that that's what we need for, like, the interaction with physical reality for- on the robotics front? And then just the LLMs are the thing that sits on top of it for the bigger reasoning, about like-
- YLYann LeCun
Yeah.
- LFLex Fridman
... the fact that I need to book a plane ticket and I need to know- I know how to go to the websites and so on.
- YLYann LeCun
Sure. And, you know, a lot of plans that people know about, um, that are relatively high level are actually learned. They're not... People- mo- most people don't invent the, you know, plans, um, uh, they- they- by themselves. They, uh... You know, we have some ability to do this, of course, uh, obviously, but, um, but most plans that people use are plans that they've been trained on. Like, they've seen other people use those plans or they've been told how to do things, right? Um, like, you can't invent how you... Like, take a person who's never heard of airplanes and tell them, like, "How do you go from New York to Paris?" And they're probably not going to be able to kind of (laughs) you know, deconstruct the whole plan, uh, unless they've seen examples of that before. Um, so certainly LLMs are going to be able to do this. But- but then, um, how you link this from the- the low level of- of- of actions, uh, that needs to be done with things li- like JEPAs that basically lift the abstraction level of the representation without attempting to reconstruct every detail of the situation. That's what you will need JEPAs for.
- 50:40 – 1:06:06
Autoregressive LLMs
- LFLex Fridman
I would love to sort of linger on your skepticism around, uh, autoregressive LLMs. So one way I- I would like to test that skepticism is... Everything you say makes a lot of sense. But if I apply everything you said today and in general to like, I don't know, 10 years ago, maybe a little bit less... No, let's say three years ago. I wouldn't be able to predict the, uh, success of LLMs. So does- does it make sense to you that autoregressive LLMs are able to be so damn good?
- YLYann LeCun
Yes.
- LFLex Fridman
Can you explain your intuition? Because if I were to take your wisdom and intuition at face value, I would say there's no way autoregressive LLMs, one token at a time, would be able to do the kind of things they're doing.
- YLYann LeCun
No, there's one thing that, uh, autoregressive LLMs, uh, or that LLMs in general, not just the autoregressive ones but including the BERT-style bidirectional ones-
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
... uh, are exploiting, and it's self-supervised learning. And I've been a very, very strong advocate of self-supervised learning for many years. So those things are an incredibly impressive demonstration that self-supervised learning actually works. Uh, the idea that, you know, started... Uh, it didn't start with- with, uh, with BERT, but it was really kind of a good demonstration with this. So the- the- the idea that, you know, you take a piece of text, you corrupt it, and then you train some gigantic neural net to reconstruct the parts that are missing, um, that has been an enormous, uh, uh... produced an enormous amount of benefits. Uh, it allowed us- allowed us to create systems that understand- understand language, uh, systems that can translate, um, hundreds of languages in any direction, systems that are multilingual. So they're not... It's a single system that can be trained to understand hundreds of languages and translate in any direction, um, and produce, uh, summaries, um, and then answer questions and produce text. And then there's a special case of it where, you know, you... Which is the autoregressive, uh, trick where you constrain the system to not elaborate a representation of the text from looking at the entire text but only predicting a word from the words that come before, right? And you do this by constraining the architecture of the network, and that's what you can build an autoregressive LLM from. So there was a surprise, uh, many years ago with wha- what's called decoder-only LLMs. So, you know, systems of this type that are just trying to produce, uh, words from the, from the previous ones, and- and the fact that when you scale them up, they- they tend to really kind of understand more about the, uh, about language when you train them on lots of data and you make them really big. That was kind of a surprise, and that surprise occurred quite a while back like, you know, uh, with, uh, work from, uh, you know, Google, Meta, OpenAI, et cetera, you know, going back to, you know, the GPT kind of work, uh, generative pre-trained transformers.
- LFLex Fridman
You mean like GPT-2? Like, there's a certain place where you start to realize scaling might actually keep giving us, uh, an emergent benefit.
- YLYann LeCun
Yeah, I mean, there were, there were work from, from various places but, uh, uh, if, if you want to kind of, you know, place it in the, in the GPT, uh, timeline, that would be around GPT-2, yeah.
- LFLex Fridman
Well, (sighs) I just... 'Cause you said it, you- you're so charismatic and you said so many words but self-supervised learning, yes. But again, the same intuition you're applying to saying that autoregressive LLMs cannot have a deep understanding of the world, if we just apply that same intuition, does it make sense to you that they're able to form enough of a representation of the world to be damn convincing, essentially passing the original Turing test with flying colors?
- YLYann LeCun
Well, we're fooled by their fluency, right? We just assume that if a system is fluent... in, in manipulating language, then it has all the characteristics of human intelligence, but that impression is false. We- we're, we're really fooled by it. Um...
- LFLex Fridman
Wha- what do you think Alan Turing would say? I- without understanding anything, just hanging out with it?
- YLYann LeCun
Alan Turing would decide that a Turing test is a really bad test.
- LFLex Fridman
(laughs)
- YLYann LeCun
Okay? This is what the AI community has decided many years ago, that the Turing test was a really bad test of intelligence.
- LFLex Fridman
What would Hans Moravec say about the, uh, about the large language models?
- YLYann LeCun
Hans Moravec would say the Moravec paradox still applies.
- LFLex Fridman
Okay.
- YLYann LeCun
Okay? Okay. We can pass th-
- LFLex Fridman
You don't think he would be really impressed?
- YLYann LeCun
No, of course. Everybody would be impressed, but...
- LFLex Fridman
Yeah.
- YLYann LeCun
(laughs)
- LFLex Fridman
(laughs)
- YLYann LeCun
You know, uh, it's not a question of being impressed or not. The, it's a question of knowing the limits of what those systems can do. Like, they're, uh, again, they are impressive. They can do a lot of useful things. There's a whole industry that is being built around them. They're gonna make progress. Uh, but there is a lot of things they cannot do, and we have to realize what they cannot do, and, uh, and then figure out, you know, how we get there. And, you know, and, and I'm not saying this ... I'm saying this from basically, you know, 10 years of, of research, uh, on, on the idea of self-supervised learning. Actually, that's going back more than 10 years, but the idea of self-supervised learning, so basically capturing the internal structure of a piece of, uh, of a, of a set of inputs without training the system for any particular task, right? Learning representations. Um, you know, the, the conference I co-founded 14 years ago is called inter- is International Conference on Learning Representations. That's the entire issue that deep learning is d- is dealing with, right? And it's been my obsession for, you know, almost 40 years now, so, um, so learning representations is really the thing. Uh, for the longest time, we could only do this with supervised learning, and then we started working on, uh, you know, what we used to call unsupervised learning, uh, uh, uh, and sort of revived the idea of unsupervised learning, uh, in the early 2000s with, uh, Yoshua Bengio and Geoff Hinton, then discovered that supervised learning actually works pretty well-
- LFLex Fridman
Right.
- YLYann LeCun
... if you can collect enough data. And so, the whole idea of, you know, unsupervised, self-supervised learning kind of took a, a backseat for, for a bit, and then I kind of tried to revive it, um, uh, in a big way, you know, starting in 2014, basically when we started FAIR, and, uh, and really pushing for, like, finding new, new methods to do self-supervised learning, both for text and for images and for video and audio. And some of that work has been incredibly successful. Um, I mean, the reason why we have multilingual translation systems, you know, things to do content moderation on, on Meta, for example, on Facebook, uh, that are multilingual, that understand whether a piece of text is hate speech or not or something, is due to that progress using self-supervised learning for NLP, combining this with, you know, transformer architectures and, and blah blah blah. But that's the big success of self-supervised learning. We had similar success in speech recognition, a system called wav2vec, which is also a joint embedding architecture, by the way, trained with contrastive learning. And, and that, that system also can produce, um, speech recognition systems that are multilingual with mostly unlabeled data, and only need a few minutes of labeled data to actually do speech recognition. That's, that's amazing. Um, we have systems now, based on those combinations of ideas, that can do real-time translation of hundreds of languages into each other. Uh, speech to speech.
- LFLex Fridman
Speech to speech, even including, which is fascinating, languages that, uh, don't have written forms.
- YLYann LeCun
That's right.
- LFLex Fridman
They're spoken only.
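As an aside on what "trained with contrastive learning" means in the wav2vec example above: the objective, very roughly, pulls the embedding of a span of speech toward the embedding of its true target and pushes it away from the other targets in the batch. A toy InfoNCE-style sketch, with every shape and tensor here invented for illustration:

```python
import torch
import torch.nn.functional as F

batch, dim = 8, 32
anchor   = torch.randn(batch, dim)      # stand-in: embeddings of masked speech spans
positive = torch.randn(batch, dim)      # stand-in: embeddings of their true targets

a = F.normalize(anchor, dim=-1)
p = F.normalize(positive, dim=-1)
logits = a @ p.T / 0.1                  # similarity of every anchor to every target
labels = torch.arange(batch)            # row i's match is target i; the rest of
loss = F.cross_entropy(logits, labels)  # the batch serves as negatives
```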
- 1:06:06 – 1:11:30
AI hallucination
- LFLex Fridman
I think in one of your slides, you have this nice plot that is one of the ways you show that LLMs are limited. I wonder if you could talk about hallucinations from your perspective: why do hallucinations happen from large language models, and to what degree is that a fundamental flaw of large language models?
- YLYann LeCun
Right. So because of the autoregressive prediction, every time an LLM produces a token or a word, uh, there is some level of probability for that word to take you out of the set of reasonable answers. Uh, and if you assume, which is a very strong assumption, that the probability of such error, um, is... That those errors are independent across a, a sequence of tokens being produced.
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
What that means is that every time you produce a token, the probability that you rest- you, you stay within the, the set of correct answers decreases, and it decreases exponentially.
- LFLex Fridman
So there's a strong, like you said, assumption there that, if, uh, there's a non-zero probability of making a mistake, which there appears to be, then there is going to be a kind of drift.
- YLYann LeCun
Yeah. And that drift is exponential. It's like errors accumulate, right? So- so the probability that an answer would be nonsensical increases exponentially with the number of tokens.
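For concreteness, here is the back-of-the-envelope version of that argument, with a purely hypothetical per-token error rate; the independence assumption is, as he says, a very strong one.

```python
# If each token independently has probability e of stepping outside the set of
# acceptable continuations, the whole answer stays acceptable with probability
# (1 - e)^n, which decays exponentially in the answer length n.
e = 0.01                        # hypothetical 1% per-token error rate
for n in (10, 100, 500):
    print(n, (1 - e) ** n)      # ~0.90, ~0.37, ~0.007
```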
- LFLex Fridman
Is that obvious to you, by the way? Like, uh, well, so mathematically speaking, maybe, but, like, isn't there a kind of gravitational pull towards the truth? Because on- on average, hopefully, the truth is well-represented in the, uh, training set?
- YLYann LeCun
No. It's basically a struggle against the- the curse of dimensionality. So the way you can correct for this is that you fine-tune the system by having it produce answers for all kinds of questions that people might come up with.
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
And people are people, so they... a lot of the questions that they have are very similar to each other, so you can probably cover, you know, 80% or whatever, of, uh, questions that people will- will ask, um, by, you know, collecting data, and then, um, and then you fine-tune the system to produce good answers for all of those things. And it's probably gonna be able to learn that, because it's got a lot of capacity to- to learn, uh, but then there is, you know, the enormous set of prompts that you have not covered during training, and that set is enormous. Like, within the set of all possible prompts, the proportion of prompts that have been, uh, used for training is absolutely tiny.
- LFLex Fridman
Hm.
- YLYann LeCun
Um, it's a, it's a tiny, tiny, tiny subset of all possible prompts, and so the system will behave properly on the prompts that it has been either trained, pre-trained, or fine-tuned on, um, but then there is an entire space of things that it cannot possibly have been trained on because it's just... the- the number is gigantic. So, um, so whatever training the system, uh, has been subject to, to produce appropriate answers, you can break it by finding a prompt that will be outside of the- the- the set of prompts it's been trained on, o- or things that are similar, and then it will just, you know, spew complete nonsense.
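The size mismatch he is pointing at is easy to check with rough numbers; both figures below are illustrative guesses, not measurements.

```python
# The space of prompts grows exponentially with length, so any finite
# fine-tuning set covers a vanishing fraction of it.
vocab = 32_000                  # order-of-magnitude LLM vocabulary size
length = 20                     # a short, 20-token prompt
possible = vocab ** length      # ~1.3e90 distinct 20-token sequences
seen = 10 ** 7                  # generous guess at fine-tuning prompts
print(seen / possible)          # ~8e-84: effectively zero coverage
```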
- LFLex Fridman
Do you... when you say prompt, do you mean that exact prompt? Or do you mean a prompt that's, like, in many parts very different than... like, i- is it that easy to ask a question or to say a thing that hasn't been said before on the internet?
- YLYann LeCun
I mean, people have come up with, uh, things where, like, you- you put a... essentially a random sequence of characters in the prompt, and that's enough to kind of throw the system, uh, into a mode where, you know, it- it's gonna answer something completely different than it would n- have answered without this.
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
So that's a way to jailbreak the system, basically, get it, you know, go outside of its, uh, of its conditioning, right?
- LFLex Fridman
So that- that's a very clear demonstration of it, but, of course, uh, (laughs) you know, that's, uh, that goes outside of what it's designed to do, right? If you actually stitched together reasonably grammatical sentences, is that the e- is it that easy to break it?
- YLYann LeCun
Yeah, some people have done things like... you- you- you write a sentence in English, right? It has an... or you ask a question in English, and it- it produces a perfectly fine answer, and then you just substitute a few words-
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
... by the same word in another language, and all of a sudden the answer is complete nonsense.
- LFLex Fridman
Yeah, so- so I guess what I'm saying is, like, which fraction of prompts that humans are likely to generate are going to break the system?
- YLYann LeCun
So the- the problem is that there is a long tail.
- LFLex Fridman
Yes.
- YLYann LeCun
Uh, this is, uh, an issue that (laughs) a lot of people have realized, you know, in social networks and stuff like that, which is, uh... there's a very, very long tail of- of things that people will ask, and you can fine-tune the system for the 80% or whatever of, uh, of the things that most people will- will ask, and then this long tail is- is so large that you're not gonna be able to fine-tune the system for all the conditions. And in the end, the system ends up being kind of a giant lookup table, right? (laughs) Essentially. Which is not really what you want. You want systems that can reason, certainly they
- 1:11:30 – 1:29:02
Reasoning in AI
- YLYann LeCun
can plan. So the type of reasoning that takes place in, uh, LLMs is very, very primitive, and the reason you can tell it's primitive is because the amount of computation that is spent per token produced is constant. So if you ask a question and that question has an answer in a given number of tokens, the amount of computation devoted to computing that answer can be exactly estimated. It's like, you know... it's h- it's the- the size of the prediction network, you know, with its 36 layers or 92 layers or whatever it is, uh, multiplied by the number of tokens. That's it. And so essentially, it doesn't matter if the question being asked is- is simple to answer, complicated to answer, impossible to answer because it's undecidable or something. Um, the amount of computation the system will be able to devote to that, to the answer, is constant, or is proportional to the number of tokens produced in the answer, right? This is not the way we work. The way we reason is that w- when we're faced with a complex problem or a complex question, we spend more time trying to solve it-
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
... and answer it, right? Because it's more difficult.
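One rough way to see the fixed-compute point, using the common back-of-the-envelope rule that a transformer forward pass costs about two FLOPs per parameter per generated token (ignoring the attention term, which is likewise fixed by the architecture and sequence lengths; the model size and answer length here are made up):

```python
# Per-token cost is set by the architecture, not by how hard the question is.
params = 7e9                    # e.g. a 7-billion-parameter model
tokens = 50                     # length of the answer in tokens
flops = 2 * params * tokens     # same budget whether the question is trivial
print(f"{flops:.1e}")           # or undecidable: ~7.0e+11 FLOPs
```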
- LFLex Fridman
There's a prediction element. There's an iterative element where you're, like, uh, adjusting your understanding of a thing by going o- over and over and over, and there's a hierarchical element and so on. Does this mean it's a fundamental flaw of LLMs?
- YLYann LeCun
Yeah.
- LFLex Fridman
Or does it mean that... (laughs) there's more parts to that question. (laughs) Now you're just behaving like an LLM. (laughs)
- YLYann LeCun
(laughs)
- LFLex Fridman
Immediately answering. No, that- that it's just the low-level world model... on top of which we can then build some of these kinds of mechanisms, like you said, persistent long-term memory or, uh, reasoning, and so on. But we need that world model that comes from language. Maybe it is not so difficult to build this kind of, uh, reasoning system on top of a well-constructed world model.
- YLYann LeCun
Okay. Whether it's difficult or not, the n- near future will, will tell, because-
- LFLex Fridman
Yeah.
- YLYann LeCun
... a lot of people are working on-
- LFLex Fridman
Yes.
- YLYann LeCun
... reasoning and planning abilities for, for dialogue systems. Um, I mean, if we're, even if we restrict ourselves to language, uh, just having the ability to plan your answer before you answer-
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
... uh, in terms that are not necessarily linked with the language you're gonna use to produce the answer, right? So, the, this idea of this mental model that allows you to plan what you're gonna say before you say it.
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
Um, that is very important. I think there's going to be a lot of systems over the next few years that are going to have this capability. But the blueprint of those systems would be extremely different from autoregressive LLMs. So, um, uh, it's the same difference as the difference between what psychologists call system one and system two in humans, right? So system one is the type of task that you can accomplish without, like, deliberately, consciously thinking about how you do them. You just do them. You've done them enough that you can just do it subconsciously, right? Without thinking about them. If you're an experienced driver, you can drive without really thinking about it, and you can talk to someone at the same time, or listen to the radio, right?
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
Um, if you are a very experienced chess player, you can play against a non-experienced chess player without really thinking either. You just recognize the pattern and you play.
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
Right? That's s- system one. Um, so all the things that you do instinctively without really having to deliberately plan and think about it. And then there is other tasks where you need to plan. So if you are a not-so-experienced, uh, chess player, or you are experienced but you play against another experienced chess player, you think about all kinds of options, right? You, uh, you think about it for a while, right? And you, you're, you're much better if you have time to think about it than you are if you are, if you play blitz, uh, with, uh, limited time. So, and, um, so this type of deliberate, uh, planning, which uses your internal world model, um, that's system two. This is what LLMs currently cannot do.
- LFLex Fridman
Well-
- YLYann LeCun
So how, how do we get them to do this, right?
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
How, how, how do we build a system that can do this kind of, uh, planning that, uh, or reasoning, that devotes more resources to complex problems than to simple problems? And it's not going to be autoregressive prediction of tokens. It's going to be something more akin to inference of latent variables in, um, you know, what used to be called, uh, probabilistic models, or graphical models and things of that type. So basically, the principle is like this. You, you know, the prompt is like observed, uh, variables.
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
And what you're, what the model does is that it's basically a measure of... it can measure to what extent an answer is a good answer for a prompt.
- LFLex Fridman
Mm-hmm.
- YLYann LeCun
Okay? So, think of it as some gigantic neural net, but it's got only one output. And that output is a scalar number, which is, let's say, zero if the answer is a good answer for the question, and a large number if the answer is not a good answer for the question.
- LFLex Fridman
Mm-hmm.
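A minimal sketch of that idea, with toy sizes and stand-in representations rather than anything anyone has actually built: the network produces a single scalar energy, and inference becomes gradient-based search over a latent answer representation instead of token-by-token decoding.

```python
import torch
import torch.nn as nn

dim = 64                                   # toy representation size

class EnergyModel(nn.Module):
    """Scores (prompt, answer) pairs: low energy = good answer, high = bad."""
    def __init__(self):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, prompt_repr, answer_repr):
        return self.score(torch.cat([prompt_repr, answer_repr], dim=-1))

ebm = EnergyModel()                        # pretend this has already been trained
for p in ebm.parameters():                 # freeze the model's weights
    p.requires_grad_(False)

prompt = torch.randn(1, dim)               # the observed variables
answer = torch.randn(1, dim, requires_grad=True)  # latent to be inferred

# Inference as optimization: descend the energy landscape in answer space.
opt = torch.optim.SGD([answer], lr=0.1)
for _ in range(100):
    opt.zero_grad()
    ebm(prompt, answer).sum().backward()
    opt.step()
```

The design choice to note is that computation now scales with how long you run the search, which is exactly the system-two property that fixed-cost autoregressive decoding lacks.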