Lex Fridman Podcast

Oriol Vinyals: Deep Learning and Artificial General Intelligence | Lex Fridman Podcast #306

Oriol Vinyals is the Research Director and Deep Learning Lead at DeepMind.

Please support this podcast by checking out our sponsors:
- Shopify: https://shopify.com/lex to get 14-day free trial
- Weights & Biases: https://lexfridman.com/wnb
- Magic Spoon: https://magicspoon.com/lex and use code LEX to get $5 off
- Blinkist: https://blinkist.com/lex and use code LEX to get 25% off premium

EPISODE LINKS:
Oriol's Twitter: https://twitter.com/oriolvinyalsml
Oriol's publications: https://scholar.google.com/citations?user=NkzyCvUAAAAJ
DeepMind's Twitter: https://twitter.com/DeepMind
DeepMind's Instagram: https://instagram.com/deepmind
DeepMind's Website: https://deepmind.com

PAPERS:
1. Gato: https://deepmind.com/publications/a-generalist-agent
2. Flamingo: https://deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model
3. Language Models are Few-Shot Learners: https://arxiv.org/abs/2005.14165
4. Emergent Abilities of Large Language Models: https://arxiv.org/abs/2206.07682
5. Attention Is All You Need: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

PODCAST INFO:
Podcast website: https://lexfridman.com/podcast
Apple Podcasts: https://apple.co/2lwqZIr
Spotify: https://spoti.fi/2nEwCF8
RSS: https://lexfridman.com/feed/podcast/
Full episodes playlist: https://www.youtube.com/playlist?list=PLrAXtmErZgOdP_8GztsuKi9nrraNbKKp4
Clips playlist: https://www.youtube.com/playlist?list=PLrAXtmErZgOeciFP3CBCIEElOJeitOr41

OUTLINE:
0:00 - Introduction
0:34 - AI
15:31 - Weights
21:50 - Gato
56:38 - Meta learning
1:10:37 - Neural networks
1:33:02 - Emergence
1:39:47 - AI sentience
2:03:43 - AGI

SOCIAL:
- Twitter: https://twitter.com/lexfridman
- LinkedIn: https://www.linkedin.com/in/lexfridman
- Facebook: https://www.facebook.com/lexfridman
- Instagram: https://www.instagram.com/lexfridman
- Medium: https://medium.com/@lexfridman
- Reddit: https://reddit.com/r/lexfridman
- Support on Patreon: https://www.patreon.com/lexfridman

Lex Fridman (host), Oriol Vinyals (guest)
Jul 26, 2022 · 2h 10m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00 - 0:34

    Introduction

    1. LF

      At which point is a neural network a being versus a tool? The following is a conversation with Oriol Vinyals, his second time on the podcast. Oriol is the research director and deep learning lead at DeepMind, and one of the most brilliant thinkers and researchers in the history of artificial intelligence. This is the Lex Fridman podcast. To support it, please check out our sponsors in the description, and now, dear friends, here's Oriol Vinyals.

  2. 0:34 - 15:31

    AI

    1. LF

      You are one of the most brilliant researchers in the history of AI, working across all kinds of modalities. Probably the one common theme is, it's always sequences of data. Uh, so that we're talking about languages, images, even biology, and, uh, games as we talked about last time. So, you're a good person to ask this. In your lifetime, will we be able to build an AI system that's able to replace me as the interviewer in this conversation, in terms of ability to ask questions that are compelling to somebody listening? And then, further question is, are we close... Will we be able to build a system that replaces you as the interviewee, in order to create a compelling conversation? How far away are we, do you think?

    2. OV

      It's a good question. Um, I think partly I would say, do we want that? I- I really like when we start now with very powerful models, interacting with them, and thinking of them more closer to us. The question is, if you remove the human side of the conversation, is that an interesting, you know, is that an interesting artifact? And I would say probably not. I've seen, for instance, um, last time we spoke, Lex, was, we were talking about StarCraft, um, and creating, you know, agents that play games involves self-play. But ultimately, what people care about was, well, how does this agent behave when the opposite side is- is a human? So without a doubt, we will probably be more empowered by AI. Um, maybe you can source some questions from an AI system. I mean, that, even today, I would say it's quite plausible that with your creativity, you might actually find very interesting questions that you can filter. Um, we call this cherry-picking sometimes in the field of language. Um, and likewise, if I had now the tools on my side, I could say, "Look, you're asking this interesting question. From this answer, I like the words chosen by this particular system that created a few words." Completely replacing it feels not exactly exciting to me. Um, although in my lifetime, I think way... I mean, given the trajectory, I think it's possible that perhaps there could be interesting, um, maybe self-play interviews as you- you're suggesting that would look g- look or sound k- quite interesting and probably would educate. Or, you could learn a topic through listening to one of these interviews at- at a basic level, at least.

    3. LF

      So you said it doesn't seem exciting to you, but what if exciting is part of the objective function the thing is optimized over? So you can... There's probably a huge amount of data of humans, if you look correctly, of humans communicating online, and there's probably ways to measure the degree of, you know, as they talk about engagement. So you can probably optimize the question that's most created an engaging conversation in the past. So actually, if you strictly use the word "exciting," uh, there is probably a way to create optimally exciting conversations that, uh, involve AI systems. At least one side is AI.

    4. OV

      Yeah. That makes sense. I think maybe looping back a bit to- to games and the game industry. When you design algorithms, um, you're thinking about winning as the objective, right? Or the reward function. But in fact, when we discussed this with Blizzard, the creators of StarCraft in this case, I think what's exciting, fun, um, if you could measure that and optimize for that, that's probably why we play video games or why we interact or listen or look at cat videos or whatever on the internet. So it's true that modeling rewards beyond the obvious reward functions we've used to in reinforcement learning is definitely very exciting. And again, there is some progress actually into, um, a particular aspect of AI which is quite critical, which is, um, for instance, is a conversation that... O- or is the information truthful, right? So you could start trying to evaluate, um, these from, um, excerpts from the internet, right? That has lots of information. And then, if you can learn a function, automated ideally, so you can also optimize it more easily, um, then you could actually have conversations that optimize for non-obvious things such as excitement.
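The loop Oriol sketches here, learning a reward function for something non-obvious like excitement and then optimizing against it, can be sketched minimally as best-of-n selection against a learned reward model. Everything below is a toy stand-in: the keyword scorer merely plays the role of a reward model that would in practice be a neural network trained on human engagement labels.

```python
# Minimal sketch of optimizing a conversation for a learned, non-obvious
# reward (e.g. "excitement") instead of an explicit win/loss signal.
# The reward model is a toy stand-in: a keyword scorer with hypothetical
# keywords; a real one would be trained on human labels.

def excitement_reward(response: str) -> float:
    """Toy stand-in for a learned reward model: scores one response."""
    exciting_words = {"amazing", "surprising", "wow", "unexpected"}
    words = [w.strip(".,!?").lower() for w in response.split()]
    if not words:
        return 0.0
    return sum(w in exciting_words for w in words) / len(words)

def best_of_n(candidates: list[str]) -> str:
    """Best-of-n selection: score every candidate, keep the highest-reward one."""
    return max(candidates, key=excitement_reward)

candidates = [
    "The result was as expected.",
    "Wow, the result was completely unexpected and surprising!",
]
print(best_of_n(candidates))  # picks the higher-"excitement" candidate
```

Best-of-n is the simplest way to "optimize" against a reward model; fine-tuning the generator itself against the learned reward is the heavier alternative.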

    5. LF

      Mm-hmm.

    6. OV

      Um, so yeah. That's quite possible. And then I would say, in that case, it would definitely be fun, a fun exercise and quite unique to have at least one side that is fully driven by, um, an excitement reward function. Um, but obviously there would be still quite a lot of humanity in the system both from who are, who is building the system, of course, and also, um, ultimately we think of labeling for excitement. That, those labels must come from us because it's just hard to have a computational measure of excitement. As far as I understand, there's no such thing.

    7. LF

      Well, oof.

    8. OV

      (laughs)

    9. LF

      You mentioned truth also. I would actually venture to say that excitement is easier to label than truth. Or is perhaps, h- uh, has lower consequences of- of failure. Um, but there is perhaps the- the humanness that you mentioned, that's perhaps part of a thing that could be labeled. And that could mean an AI system that's doing dialogue, that's doing conversations should be... flawed, for example. Like, that's the thing you optimize for, which is, uh, have inherent contradictions by design, have flaws by design. Maybe it also needs to have a strong sense of identity, so it has a backstory it told itself that it sticks to. It has memories, not in terms of the, how the system is designed, but it's able to tell stories about its past. It's able to have, um, mortality and fear of mortality in the following way, that it has an identity. And, like, if it says something stupid and gets canceled on Twitter, that's the end of that system. So it's not like you get to rebrand yourself. That system is, that's it. So maybe that, the, the high stakes nature of it, because, like, you can't say anything stupid now, O- Oriol, because you'll be canceled (laughs) on Twitter.

    10. OV

      (laughs)

    11. LF

      And that there's, there's stakes to that, and that I think part of the reason that makes it, uh, interesting. And then you have a perspective, like, you've built up over time that you stick with, and then people can disagree with you, so h- holding that perspective strongly, holding sort of a, maybe a controversial, at least a, a strong opinion, all of those elements, it feels like they can be learned because it feels like there's a lot of data on the internet of people having an op- opinion. (laughs) And then combine that with a metric of excitement, you can start to create something that, as opposed to trying to optimize for, uh, sort of grammatical clarity and truthfulness, the, the factual, uh, consistency over many sentences, you optimize for, uh, the humanness. And there's obviously data for humanness on the internet. So I wonder, I wonder if there's a future where that's part... Or, I mean, I, I, I sometimes wonder that about myself. I'm a huge fan of podcasts, and I listen to pod- some podcasts and I think, like, "What is interesting about this? What is compelling?" Uh, the same way you watch other games, like you said, watch me play StarCraft or, uh, have Magnus Carlsen play chess. So I'm not a chess player, so, but it's still interesting to me. What is that? That's the, uh, the stakes of it, maybe, um, the end of a domination of a series of wins. I don't know. There's all those elements somehow connect to a compelling conversation, and I wonder how hard is that to replace, 'cause ultimately, all of that connects to the initial proposition of how to test whether an AI is intelligent or not with the Turing test, uh, which I guess the... my question comes from a place of the spirit of that test.

    12. OV

      Yes. Um, I actually recall, I was just listening to our, uh, first podcast, where we discussed Turing test. Um, so I would say from a neural network, you know, AI builder, um, perspective, um, there's, you know, usually you try to map many of these interesting topics you discuss to, to benchmarks and then also to actual architectures on the how these systems are currently built, how they learn, what data they learn from, what are they learning, right? We're talking about weights of a mathematical function. And then looking at the current state, uh, of the game, maybe what do we need leaps forward to get to the ultimate stage of all these experiences, um, lifetime experience, uh, fears, like, words that currently barely we're, we're seeing, um, progress just because what's happening today is you take, um, all these human interactions, um, it's a large vast, uh, variety of, of human interactions online, and then you're distilling these sequences, right? Going back to my passion, like sequences of words, letters, um, images, sound. There's more modalities here to be, to be at play. And then you're trying to just learn a function that will be happy, that maximizes the, the likelihood of seeing all these, um, through a neural network. Um, now, I think there's a few places where the way currently we train these models would clearly lack to be able to develop the kinds of capabilities you say. I'll tell you maybe a couple. One is the lifetime of an agent or a model. Uh, so you, you learn from this data offline, right? So you're just passively observing and maximizing these, you know, it's almost like a mountains, like a sca- a landscape of mountains and then everywhere there's data that humans interacted in this way, you're trying to make that higher and then, you know, lower where there's no data. And then these models generally don't then experience themselves these, they just are observers, right? They're passive observers of the data. 
And then we're putting them to then generate data when we interact with them, but that's very limiting the experience they actually experience, um, when they could maybe be optimizing or further optimizing the weights. We're not even doing that. So to be clear, and again, mapping to AlphaGo, AlphaStar, we train the model and when we deploy it, um, to play against humans or in this case interact with humans, um, uh, like language models, they don't even keep training, right? They're not learning in the sense of the weights that you've learned from the data, they don't keep changing. Now there's something a bit more... feels magical, but it's understandable if you're into neural net, which is, well, they might not learn in the strict sense of the words, the weights changing, maybe that's mapping to how neurons interconnect and how we learn over our lifetime. But it's true that the context of the conversation that they, they, that takes, t- takes place with when you talk to these systems, it's held in their working memory, right? It's almost like, um, you start a computer, it has a hard drive that has a lot of information, you have access to the internet, which has probably all the information, but, uh, there's also a working memory where the, these agents as we call them or start calling them build upon. Now, this memory is very limited. Um, I mean, right now we're talking, to be concrete, about 2,000 words that we hold, and then beyond that, we start forgetting what we've seen. So you can see that there's some short-term coherence already, right? With when you said, I mean, it's a very interesting topic, um, having sort of a mapping, um, an agent to, like, have consistency. Then, you know, if, if, if you say, "Oh, what's your name?" Um, it could remember that, but then it might forget beyond 2,000 words, which is not that long of context. If we think even of these podcast, um, books, uh, are much longer. So technically speaking, there's a limitation there.
Super exciting from people that work on deep learning to be working on. Uh, but I would say we lack, maybe benchmarks and the technology to have this lifetime-like experience of memory that keeps building up. Um, however, the way it learns offline is clearly very powerful, right? So I, you know, you asked me three years ago, I would say, "Oh, we're very far." I think we've seen the power of this imitation again, uh, on the internet scale that has enabled this, um, to feel like at least the knowledge, the basic knowledge about the world now is incorporated into the weights. Uh, but then this experience is lacking. And in fact, as I said, we don't even train them when, you know, when we're talking to them, other than their working memory, of course, is affected. So that's the dynamic part, but they don't learn in the same way that you and I have learned, right, when, from basically when we are born and probably before. Uh, so lots of fascinating, interesting questions you asked there. I think, um, the one I mentioned is this idea of memory and experience versus just kind of observe the world and learn its knowledge, which I think for that, I would argue, lots of recent advancements that make me very excited about the field. And then the second, maybe, issue that I see is all these models, we train them from scratch. That's something I would have complained three years ago, or six years ago, or 10 years ago. And it feels, if we take inspiration from how we got here, how the universe evolved us, um, and we keep evolving, it feels that is a missing piece, that we should not be training models from scratch, um, every few months. That there should be some sort of way in which we can grow models, um, much like as a species and many other elements in the universe is building from the previous sort of iterations. 
And that, from a just purely neural network perspective, even though we, we, we would like to make it work, it's proven very hard to not, you know, throw away the previous weights, right? This landscape we learn from the data and, you know, refresh it with a brand new set of weights, um, given maybe a recent snapshot of this data set we train on, et cetera, or even a new game we're learning. So that's, that feels like something is missing fundamentally. We might find it, but it's not very clear how it will look like. There's many ideas and it's super exciting as well.
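The working-memory limit Oriol describes above, roughly 2,000 words of context beyond which the model "forgets", can be pictured as a rolling window over the conversation: only the most recent tokens are visible, and earlier turns silently drop out. A minimal sketch, with whitespace splitting standing in for a real subword tokenizer:

```python
# Sketch of a fixed working memory: the model only "sees" the most recent
# N tokens of the conversation, so older turns fall out of context.
# Whitespace tokenization is a simplification of real subword tokenizers.

MAX_CONTEXT_TOKENS = 2000  # roughly the limit discussed in the conversation

def visible_context(conversation_turns: list[str],
                    max_tokens: int = MAX_CONTEXT_TOKENS) -> list[str]:
    """Return the most recent turns that still fit in the context window."""
    kept, used = [], 0
    for turn in reversed(conversation_turns):
        n = len(turn.split())
        if used + n > max_tokens:
            break  # everything earlier than this turn is "forgotten"
        kept.append(turn)
        used += n
    return list(reversed(kept))

# An early introduction followed by ~2,400 words of filler pushes the
# introduction out of the window before the final question arrives.
turns = ["My name is Lex."] + ["filler words " * 400] * 3 + ["What is my name?"]
print("My name is Lex." in visible_context(turns))  # → False: already forgotten
```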

    13. LF

      Yes.

  3. 15:31 - 21:50

    Weights

    1. LF

      Just for people who don't know, s- when you approach a new problem in machine learning, you're going to come up with an architecture that has, uh, a, a bunch of weights and then you ini- initialize them somehow, which f- in most cases is some version of random. So that's what you mean by starting from scratch, and it seems like it's a, it's a waste every time you solve, uh, the game of Go and chess, StarCraft, uh, protein folding. Like, surely there's some way to reuse the weights as we grow this giant database of, um, of neu- of neural networks that have solved some of the toughest problems in the world. And so, some of that is, um, what is that? Methods how to reuse weights. How to learn, extract what's generalizable, or at least has a chance to be, and throw away the other stuff. Uh, and maybe e- the neural network itself should be able to tell you that. Like, what, um... Yeah, how do you... What, what ideas do you have for better initialization of weights?
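The from-scratch-versus-reuse distinction in the question above can be sketched as warm-starting: copy pretrained weights wherever the shapes still match, and randomly initialize only the new, task-specific parts. The layer names and shapes below are hypothetical:

```python
import random

# Sketch of warm-starting instead of training from scratch: reuse saved
# weights for layers that carry over, and randomly initialize only the
# new task-specific layers. Layer names and shapes are hypothetical.

def random_init(shape):
    """The usual "from scratch" case: small random weights."""
    rows, cols = shape
    return [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

def warm_start(new_shapes: dict, pretrained: dict) -> dict:
    """Build weights for a new model, reusing pretrained ones where they fit."""
    weights = {}
    for name, shape in new_shapes.items():
        old = pretrained.get(name)
        if old is not None and (len(old), len(old[0])) == shape:
            weights[name] = old                  # reused: knowledge carried over
        else:
            weights[name] = random_init(shape)   # only the new head starts fresh
    return weights

pretrained = {"encoder": [[1.0, 2.0], [3.0, 4.0]]}
new_shapes = {"encoder": (2, 2), "task_head": (2, 3)}
model = warm_start(new_shapes, pretrained)
print(model["encoder"] is pretrained["encoder"])  # → True: reused, not reinitialized
```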

    2. OV

      Maybe stepping back. If we look at the field of machine learning, but especially deep learning, right? The, at the core of deep learning, there's this beautiful idea that is, um, a single algorithm can solve any task, right? So it's been proven over and over with more increasing set of benchmarks and things that were thought impossible that are being cracked by this basic principle. That is, you take a neural network of uninitialized weights, so, like, a blank computational brain. Um, then you give it, in the case of supervised learning, a lot, ideally, of examples of, "Hey, here is what the input looks like and the desired output should look like this." I mean, image classification is very clear example. Images to maybe one of a thousand categories. That's what ImageNet is like. But many, many, if not all problems can be mapped this way. And then there's a generic recipe, right, that you can use. Um, and this recipe with very little change, and I think that's the core of deep learning research, right? That what is the recipe that is universal? That for any new given task I'll be able to use without thinking, without having to work very hard on the problem at stake. Um, we have not found this recipe, but I think the field is excited to find, um-

    3. LF

      Mm-hmm.

    4. OV

      ... less tweaks or tricks that people find when they work on important problems specific to those, and more of a general algorithm, right? So at an algorithmic level, I would say we have something general ready, which is this formula of training a very powerful model, a neural network, on a lot of data, and in many cases...... you need some specificity to the actual problem you're solving. Um, protein folding being such an important problem, has some basic recipe that is learned from beyo- be- before, right? Like transformer models, graph neural networks, um, ideas coming from NLP, like, uh, you know, something called BERT that is a kind of loss that you can emplace to help the model. Uh, mod- knowledge distillation is another technique, right? So this is the formula. We still had to find some particular things that were specific to AlphaFold, right? That's very important because protein folding is such a high value problem that as humans we should solve it no matter if we need to be a bit specific. And it's possible that some of these learnings will apply then to the next iteration of this recipe that deep learners are about. But it is true that so far the recipe's what's common, but the weights you generally throw away, which feels very sad. Um, although maybe in the last, especially in the last two, three years, um, and when we last spoke, I mentioned this area of meta-learning, which is the idea of learning to learn. Um, that idea and some progress has been had starting, I would say, mostly from GPT-3 on the language domain only, in which you could conceive a model that is trained once, and then this model is not narrow in that it only knows how to translate a pair of languages or it n- that only knows how to assign sentiment to a sentence. 
These, these actually, you could teach it by a prompting, it's called, and this prompting is essentially just showing it a few more examples, um, almost like you do show examples, input-output examples, algorithmically speaking to the process of creating this model. But now you're doing it through language, which is very natural way for us to learn from one another. I tell you, "Hey, you should do this new task." I'll tell you a bit more. Maybe you ask me some questions, and now you know the task, right? You didn't need to retrain it from scratch. And we've seen these magical moments almost, um, in this way to do few-shot prompting through language on language-only domain, and then in the last two years, we've seen these expanded to beyond language, um, adding vision, adding actions and games. Uh, lots of progress to be had, but this is maybe, if you ask me, like, about how are we going to crack this problem? This is perhaps one way in which you have a single model. Um, the problem of this model is it, it's hard to grow in weights or capacity, but the model is certainly so powerful that you can teach it some tasks, right? In this way that I teach you, I could teach you a new task now if we were, oh, let's... uh, a text, a text-based task or a classification, a vision, um, style task. Uh, but it still feels like more breakthroughs should be had. But it's a great beginning, right? We have a good baseline. We have an idea that this maybe is the way we want to benchmark progress towards AGI, and I think in my view, that's critical to always have a way to benchmark the community sort of converging to this overall, which is good to see. And then this is actually what excites me in terms of also next steps, um, for deep learning, is how to make these models more powerful, how do you train them, how to grow them if they must grow, um, should they change their ways as you teach a task or not? There's some interesting questions, many to be answered.
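The few-shot prompting Oriol describes, where the weights stay frozen and the "training examples" live in the prompt itself, can be sketched as simple prompt construction. The format and examples below are illustrative, not any particular model's:

```python
# Sketch of few-shot prompting: a frozen model is "taught" a new task
# purely by placing a few input-output examples in the prompt, instead of
# retraining from scratch. The prompt format here is illustrative.

def build_few_shot_prompt(task_description: str,
                          examples: list[tuple[str, str]],
                          query: str) -> str:
    lines = [task_description, ""]
    for inp, out in examples:           # the in-context "training set"
        lines.append(f"Input: {inp}")
        lines.append(f"Output: {out}")
    lines.append(f"Input: {query}")     # the model would complete the next Output:
    lines.append("Output:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to Spanish.",
    [("cat", "gato"), ("dog", "perro")],
    "house",
)
print(prompt)
```

Nothing about the model changes between tasks; swapping the examples and description redefines the task on the fly, which is exactly the "learning to learn" behavior described above.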

    5. LF

      Yeah, you've opened the door ab- about...

  4. 21:50 - 56:38

    Gato

    1. LF

      to a bunch of questions I want to ask. But let's first return to the, uh, to your tweet and read it like a Shakespeare. You wrote, "GATO is not the end. It's the beginning." And then you wrote, "Meow," and then an emoji of a cat. Uh, so first, two questions. First, can you explain the "meow" and the cat emoji? And second, can you explain what GATO is and how it works?

    2. OV

      Right. Indeed, I mean, thanks, thanks for reminding me that, uh, we're all exposing t- on Twitter... (laughs)

    3. LF

      Permanently there.

    4. OV

      Yes, permanently there.

    5. LF

      One of the greatest AI researchers of all time, "Meow" and cat emoji.

    6. OV

      (laughs) Yes.

    7. LF

      There you go.

    8. OV

      Right. So-

    9. LF

      Can you imagine, like, Turing tweeting "meow" and cat e- probably he would. Probably would.

    10. OV

      Probably. So yeah, the tweet is important actually. Um, you know, I put thought on the tweets. I hope people do as well.

    11. LF

      Which part do you think... Okay. The, so, so there's three sentences. "GATO is not the end. GATO is the beginning. Meow, cat emoji." Okay, which, which is the important part? (laughs)

    12. OV

      The meow. No, no. Uh, definitely that it is the beginning. I mean, I, I probably was just explaining, um, a bit where the field is going. But, um, let me tell you about GATO. So first, the name GATO comes from maybe a sequence of releases that DeepMind had that named... uh, like used animal names to name some of their models that are based on this idea of large sequence models. Um, initially, they're only language, but we're expanding to other modalities. So we had to, you know, we had Gopher, Chinchilla, these were language only, and then more recently we released Flamingo, which adds vision to the equation, and then GATO, which adds vision and then also actions in the mix, right? Um, as we discussed, actually, actions, um, especially discrete actions like up, down, left, right. I just told you the actions, but they're words, so you can kind of see how actions naturally map to sequence modeling of words, which these models are very powerful. So GATO was named, um, after, I believe... I can only, from memory, right, these, you know, these things always happen with a- an amazing team of researchers behind. So before the release-

    13. LF

      Yeah.

    14. OV

      ... um, we had a discussion about which animal would we pick, right? And I think because of the word general agent, right, and, and this is a property quite unique to GATO, um, we, we kind of were playing with the GA words and then, you know, GATO is-

    15. LF

      Kind of rhymes with cat.

    16. OV

      Yes. Um-

    17. LF

      Gato.

    18. OV

      And gato is obviously a Spanish version of cat. I had nothing to do with it, although I'm from Spain.

    19. LF

      Oh, how do you s- wait, sorry. How do you say cat in Spanish?

    20. OV

      Gato.

    21. LF

      Oh, gato. Okay.

    22. OV

      Yeah.

    23. LF

      Now it all makes sense.

    24. OV

      Okay, okay. I see, I see, I see. You d-

    25. LF

      Now it all makes sense.

    26. OV

      Okay. So-

    27. LF

      How do you say meow in Spanish? No, that's probably the same.

    28. OV

      Um, I think you, you say it the same way. (laughs) Uh, but you write it, uh, as M-I-A-U, um-

    29. LF

      Okay. It's universal.

    30. OV

      Yeah.

  5. 56:38 - 1:10:37

    Meta learning

    2. LF

      You've mentioned meta-learning. So given this promise of, of GATO, can we try to redefine this term that's almost akin to consciousness? Because it means different things to different people throughout the history of artificial intelligence. But what do you think meta-learning is and looks like now, in the five years, 10 years? Will it look like a system like GATO but scaled? What's your sense of... what is, what, what does meta-learning look like, do you think-

    3. OV

      Great.

    4. LF

      ... with all the wisdom we've learned so far?

    5. OV

      Yeah, great, great question. Maybe it's good to give a- another data point looking backwards rather than forwards. So when, when we talk, um, in 2019, uh, meta-learning meant something that has changed mostly through the revolution of GPT-3 and beyond. So what meta-learning meant, meant at the time, um, was driven by what benchmarks people care about in meta-learning. And the benchmarks were about a capability to learn about object identities. So it was very much overfitted to vision and object classification. And the part that was meta about that was that, oh, we're not just learning a thousand categories that ImageNet tells us to learn. We're gonna learn object categories that can be defined when we interact with the model. So it's interesting to see the evolution, right? The way, the way this started was, we have a special language that was a dataset, a small dataset, that we prompted the model with, saying, "Hey, here is a new classification task. Um, I'll, I'll give you one image and the name," which was an integer at the time of the image and a different image and so on. So you have a small prompt in the form of a dataset, a ma- machine learning dataset. And then you got then a system that could then predict or classify these objects that you just defined kind of on the fly. So fast-forward, it was revealed that language models are few-shot learners. That's the title of, of the paper. So very good title. Sometimes titles are really good, so this one is really, really good. Because that's, that's the point of GPT-3, that showed that, look, sure, we can, we can focus on object classification and how... what meta-learning means within the space of learning object categories. This goes beyond, uh, or before rather to also Omniglot before ImageNet and so on. So there's a few benchmarks. To now all of a sudden, we're a bit unlocked from benchmarks, and through language we can define tasks, right? 
So we, we're literally telling the model some logical task or little thing that we want it to do. Uh, we prompt it much like we did before but now we prompt it through natural language. And then, not perfectly, I mean, they're... these models have failure modes and that's fine. But th- but these models then are now doing a new task, right? So they meta-learn, um, this new capability. Now, that, that's where we are now. Uh, FLAMINGO expanded this to visual and language, but it basically has the same abilities. You can teach it, for instance, um, an emergent property was that you can take pictures of numbers and then do, do arithmetic with the numbers just by teaching it, oh, that's... I mean, uh, when, when I show you three plus six, you know, I want you to output nine. And, and you show it a few examples, and now it does that. So it went way beyond the, uh, this ImageNet sort of catego- categorization of images that we were a bit stuck maybe before, um, this revelation mo- moment that happened, uh, in 2000, I believe it was 19, but it was after we checked.
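The pre-GPT-3 meta-learning setup described above, a tiny dataset as the prompt, with class labels as integers assigned on the fly, can be sketched as episode construction. The nearest-neighbor classifier is a stand-in for the actual meta-learned model, and the toy feature vectors stand in for images:

```python
# Sketch of the older few-shot classification benchmarks Oriol describes:
# each task is a tiny "episode" whose class labels are integers assigned
# on the fly, and the model must classify a query from the support set
# alone. Toy 2-D feature vectors stand in for images; the 1-nearest-
# neighbor rule stands in for a meta-learned model.

def build_episode(classes: dict[str, list]) -> tuple[list, dict[str, int]]:
    """Assign integer labels on the fly and build the support set."""
    label_of = {name: i for i, name in enumerate(sorted(classes))}
    support = [(x, label_of[name]) for name, xs in classes.items() for x in xs]
    return support, label_of

def nearest_neighbor_classify(support, query) -> int:
    """Stand-in 'model': 1-nearest-neighbor over this episode's support set."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(support, key=lambda pair: dist(pair[0], query))[1]

classes = {"cat": [(0.0, 0.1)], "dog": [(1.0, 0.9)]}
support, label_of = build_episode(classes)
print(nearest_neighbor_classify(support, (0.1, 0.0)))  # → 0, i.e. label_of["cat"]
```

The point of the contrast in the conversation: here the "task definition" is a dataset with arbitrary integer labels, whereas language-model prompting lets you define the task in words instead.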

    6. LF

      So in that way, it has solved meta-learning as was previously defined?

    7. OV

      Yes. It expanded what it meant, so that's what you say, "What does it mean?" So it's an evolving term. Um, but here is maybe now looking forward, looking at what's happening, um, you know, obviously in the community with more modalities, um, what we can expect, and I would certainly hope to see the following. And this is a pretty drastic hope, but in five years, maybe we chat again, and we have a system, right, a set of weights that we can teach it to play StarCraft. Maybe not at the level of AlphaStar but... play StarCraft, a complex game. We teach it through interactions, to prompting. You can certainly prompt the system, that what Gato shows, to play some simple Atari games. So imagine if you start talking to a system, teaching it a new game, showing it examples of, you know, in this, in this particular game, um, this user did something good. Maybe the system can even play and ask you questions. Say, "Hey, I played this game. I just played this game. Did I do w- well? Can you teach me more?" So five, maybe to 10 years, these capabilities or what meta-learning means will be much more interactive, much more rich, and through, through domains that we were specializing, right? So you see the difference, right? We built AlphaStar specialized to play StarCraft. The algorithms were general, but the weights were specialized. And what, what we're hoping is that we can teach a network to play games, to play any game, just using games as an example, through interacting with it, teaching it, uploading the Wikipedia page of, of StarCraft. Like, this is in the horizon, and obviously there are details (laughs) need to be, to be filled and research need to be done. But that's how I see meta-learning evolve, which is gonna be beyond prompting. It's gonna be a bit more interactive. It's gonna, you know, the system might tell us to give it feedback after it maybe makes mistakes or it loses a game.
Um, but it's nonetheless very exciting, because if you think about it this way, the benchmarks are already there. We just repurpose the benchmarks, right? So in a way, I like to map the space of what maybe AGI means to say, "Okay, we got to 101% performance in Go, in chess, in StarCraft." The next iteration might be 20% performance across, quote unquote, "all tasks," right? And even if it's not as good, it's fine. We actually have ways to also measure progress, because we have those specialized agents and so on. So this is, to me, very exciting, and these next-iteration models are definitely hinting at that direction of progress, which hopefully we can have. There are obviously some things that could go wrong, in the sense that we might not have the tools. Maybe transformers are not enough. Then there are some breakthroughs to come, which makes the field more exciting to people like me as well, of course. Um, but if you ask me about 5 to 10 years, you might see these models that start to look more like weights that are already trained, and then it's more about teaching, or, you know, they meta-learn what you're trying to induce in terms of tasks and so on, well beyond the simple tasks we're now starting to see emerge, like, you know, small arithmetic tasks and so on.
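The kind of in-context teaching Oriol describes, conditioning an already-trained system on demonstrations instead of retraining it, can be sketched with a toy stand-in. This is purely illustrative: it is not how Gato works internally, and the observation and action names below are made up.

```python
# Sketch of "teaching via prompting" rather than retraining: the "model"
# here is a stand-in that imitates what a trained sequence model can do
# in context -- find where the current situation appeared earlier in the
# prompt and copy the action that followed it (an induction-head-style
# behavior). No weights are updated; all task knowledge is in the prompt.

def predict_next_action(prompt, current_obs):
    """Scan demonstration (observation, action) pairs in the prompt and
    return the action that followed the most recent matching observation."""
    for obs, action in reversed(prompt):
        if obs == current_obs:
            return action
    return None  # the prompt taught us nothing about this observation

# Demonstrations for a new toy game, supplied purely as context:
demonstrations = [("enemy_left", "move_right"),
                  ("enemy_right", "move_left"),
                  ("enemy_left", "move_right")]

print(predict_next_action(demonstrations, "enemy_right"))  # → move_left
print(predict_next_action(demonstrations, "enemy_above"))  # → None (ask the teacher)
```

The `None` branch is where the interactive part Oriol imagines would come in: the system notices its context is insufficient and asks the human for more demonstrations.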

    8. LF

      So a few questions around that. (laughs)

    9. OV

      Yeah.

    10. LF

      This is fascinating. Uh, so that kind of teaching, interactive, it's beyond prompting, it's interacting with the neural network. That's different than the training process, different than the optimization over differentiable functions. This is already trained, and now you're teaching... I mean, it's almost akin to the brain. The neurons are already set with their connections. On top of that, you're now using that infrastructure to build up further knowledge. Okay. (sighs) So that's a really interesting distinction that's actually not obvious from a software engineering perspective, that there's a line to be drawn. 'Cause you always think for a neural network to learn, it has to be trained and retrained.

    11. OV

      Yep.

    12. LF

      But maybe... And prompting is a way of teaching a neural network a little bit of context about whatever the heck you're trying to get it to do. So you can maybe expand this prompting capability by making it interactive. That's really, really, really interesting.

    13. OV

      Yeah. By the way, this is not new... If you look way back at different ways to tackle even classification tasks, this comes from longstanding literature in machine learning. Um, what I'm suggesting could sound to some a bit like nearest neighbor. Nearest neighbor is almost the simplest algorithm there is that does not require learning, so it has this interesting property that you don't need to compute gradients. What nearest neighbor does is you, quote unquote, "upload a data set," and then all you need is a way to measure distance between points. Then to classify a new point, you simply compute the closest point in this massive amount of data, and that's your answer. So you can think of prompting in a way as uploading, not just simple points, and the metric is not the distance between images or something simple, it's something you compute that's much more advanced. But in a way, it's very similar, right? You're simply uploading some knowledge to this pretrained system. In nearest neighbor, maybe the metric is learned or not, but you don't need to further train it, and then you immediately get a classifier out of it, right? Now it's just an evolution of that very classical concept in machine learning, which is, um, yeah, just learning through what's the closest point, closest by some distance, and that's it.
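The nearest-neighbor algorithm described above, no gradients, just a distance metric over an uploaded dataset, fits in a few lines. A minimal 1-nearest-neighbor sketch with made-up points:

```python
# 1-nearest-neighbor classification: no training step at all.
# "Upload" a labeled dataset, pick a distance metric, and classify a
# query point by copying the label of its closest stored point.

import math

def classify(query, dataset):
    """dataset: list of (point, label) pairs; point: tuple of floats.
    Returns the label of the stored point closest to the query."""
    def distance(a, b):
        return math.dist(a, b)  # Euclidean; any metric would do
    nearest_point, nearest_label = min(
        dataset, key=lambda item: distance(item[0], query))
    return nearest_label

data = [((0.0, 0.0), "cat"), ((5.0, 5.0), "dog")]
print(classify((1.0, 0.5), data))  # → cat
print(classify((4.0, 4.5), data))  # → dog
```

The analogy to prompting is that `data` plays the role of the prompt, and the distance function plays the role of the much richer comparison a pretrained network computes.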

    14. LF

      Yeah.

    15. OV

      It's an evolution of that. And I will say, how I saw meta-learning when we worked on a few ideas in 2016 was precisely through the lens of nearest neighbor, which is very common in the computer vision community, right? There's a very active area of research about how you compute the distance between two images. But if you have a good distance metric, you also have a good classifier, right? All I'm saying is, now these distances and the points are not just images. They're, like, words, or sequences of words, and images, and actions that teach you something new. But it might be that, technique-wise, those come back. And I will say that it's not necessarily true that you might not ever train the weights a bit further. Some techniques in meta-learning do actually do a bit of fine-tuning, as it's called, right? They train the weights a little bit when they get a new task. So as for the how, how we're gonna achieve this, as a deep learner I'm very skeptical. We're gonna try a few things, whether it's a bit of training, adding a few parameters, thinking of these as nearest neighbor, or simply thinking there's a sequence of words, it's a prefix, and that's the new classifier. We'll see, right? That's the beauty of research. But what's important is that this is a good goal in itself, one I see as very worthwhile pursuing for the next stages of not only meta-learning. I think this is basically what's exciting about machine learning, period. To me, at least.
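The "train the weights a little bit" variant Oriol mentions can be sketched as freezing a pretrained feature extractor and taking a few gradient steps on a small linear head for the new task. Everything below is a toy illustration under invented features and data, not any particular meta-learning method:

```python
# Fine-tuning sketch: the feature extractor is frozen (stands in for a
# pretrained network); only a tiny logistic-regression head is trained,
# for a few SGD steps, on the handful of examples that define a new task.

import math

def frozen_features(x):
    # Stand-in for a pretrained network's representation (never updated).
    return [x, x * x]

def fine_tune(examples, steps=200, lr=0.5):
    """examples: list of (x, label) with label in {0, 1}.
    Returns head weights (w, b) after a few gradient steps."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(steps):
        for x, y in examples:
            f = frozen_features(x)
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            g = p - y                        # gradient of the log-loss w.r.t. z
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b

def predict(x, w, b):
    f = frozen_features(x)
    return int(sum(wi * fi for wi, fi in zip(w, f)) + b > 0)

# A new task, defined by just four labeled examples (positive vs. negative x):
task = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b = fine_tune(task)
print(predict(1.5, w, b), predict(-1.5, w, b))  # → 1 0
```

The contrast with the nearest-neighbor and prompting views is that here a few parameters really do change, but only a tiny fraction of the full system.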

    16. LF

      Well, the... and the, the interactive aspect of that is also very interesting.

    17. OV

      Yes.

    18. LF

      The interactive version of nearest neighbor. (laughs)

    19. OV

      Yes.

    20. LF

      Uh, to help you pull out the classifier from this giant thing. Okay. Is this the way we can go, in five, 10-plus years, from many tasks to any task? And what does that mean? Like, what does it need to be actually trained on? At which point has the network had enough? So what does a network need to learn about this world in order to be able to perform any task? Is it as simple as language, image, and action? Or do you need some set of representative images? Like, if you only see land images, will you know anything about underwater? Is that somehow fundamentally different? I don't know.

    21. OV

      Those, I mean, those are open questions, I would say. The way you put it... let me maybe further your example, right? If all you see is land images, but you're reading all about land and water worlds in books, right? Imagine.

    22. LF

      Yes.

    23. OV

      Like, would that be enough? I mean-

    24. LF

      Yeah.

    25. OV

      ... good question. We don't know, but I guess maybe you can join us, if you want, in our quest to find out.

    26. LF

      (laughs)

    27. OV

      That's, that's precisely-

    28. LF

      Water, water worlds, yeah.

    29. OV

      Yes, that's precisely, I mean, the beauty of research, and that's the research business we're in here, I guess: to figure this out, ask the right questions, and then iterate with the whole community, publishing findings and so on. Uh, but yeah, this is a question. It's not the only question, but it's certainly, as you ask, on my mind constantly, right? And so we'll need to wait for maybe, let's say, five years, let's hope it's not ten, to see what the answers are. Um, some people largely believe in unsupervised or self-supervised learning of single modalities and then crossing them. Some people might think end-to-end learning is the answer. Modularity is maybe the answer. So we don't know, but we're definitely excited to find out.

    30. LF

      But it feels like this is the right time and we're at the beginning of this, uh, of this journey.

  6. 1:10:37 - 1:33:02

    Neural networks

    1. LF

      What sort of specific technical thing about Gato, Flamingo, Chinchilla, Gopher, any of these, do you find especially beautiful? That was surprising, maybe? Is there something that just jumps out at you? Uh, of course, there's the general thing of, like, you didn't think it was possible and then you (laughs) realize it's possible, in terms of the generalizability across modalities and all that kind of stuff, or maybe how small a network, relatively speaking, Gato is, all that kind of stuff. But are there some weird little things that were surprising?

    2. OV

      Look, I... I'll give you an answer that's very important because maybe people don't quite realize this, but the teams behind these efforts, the actual humans-

    3. LF

      Yeah.

    4. OV

      ... that's maybe the surprising part, in an obviously positive way. So anytime you see these breakthroughs, I mean, it's easy to map them to a few people. There are people that are great at explaining things and so on. That's very nice, but maybe the learning, or the meta-learning (laughs), that I get as a human from this is: sure, we can move forward, but the surprising bit is how important all the pieces of these projects are, and how they come together. So I'll give you maybe some of the ingredients of success that are common across these, but not the obvious machine learning ones. I can always also give you those. But basically, engineering is critical. So, very good engineering, because ultimately we're collecting datasets, right? So the engineering of data, and then of deploying the models at scale onto some compute cluster, cannot be overstated. That is a huge factor of success. Um, and it's hard to believe that details matter so much. We would like to believe it's true that there is more and more of a standard formula, as I was saying, right, this recipe that works for everything. But then when you zoom into each of these projects, you realize the devil is indeed in the details.

    5. LF

      Yeah.

    6. OV

      And then the teams have to work together towards these goals. Um, so engineering of data, and obviously clusters and large scale, is very important. And then one that is often... maybe nowadays it is more clear, is benchmark progress, right? So we are talking here about multiple months of, you know, tens of researchers, and people trying to organize the research and so on, working together. And you don't know that you can get there. I mean, this is the beauty. Like, if you're not risking trying to do something that feels impossible, you're not gonna get there. Um, but you need a way to measure progress, so the benchmarks that you build are critical. I've seen this beautifully pay off in many projects. Maybe the one where I've seen it most consistently, which means we established a metric, actually the community did, and then we leveraged that massively, is AlphaFold. This is a project where the data and the metrics were all there, and all it took was, and it's easier said than done, an amazing team working not to try to find some incremental improvement and publish, which is one valid way to do research, but to aim very high and work literally for years to iterate over that process. And working for years with a team, I mean, it is tricky. That also happened to happen partly during a pandemic and so on. Um, so I think my meta-learning (laughs) from all this is: the teams are critical to the success. And then, going to the machine learning, the part that's surprising is... So we like architectures, like neural networks, um, and I would say this was a very rapidly evolving field until the transformer came. So attention might indeed be all you need, which is the title of the paper. Also a good title, although only in hindsight; I don't think at the time I thought, "This is a great (laughs) title for a paper."
But that architecture is proving that in the dream of modeling sequences of any bytes, there is something there that will stick. And I think with these advances in architectures, in kind of how neural networks are structured to do what they do, it's been hard to find one that has been so stable and has changed so relatively little since it was invented five or so years ago. So that is a surprise that keeps recurring, like, in other projects.

    7. LF

      Can you try, on a philosophical or technical level, to introspect on what the magic of attention is?

    8. OV

      Yeah.

    9. LF

      What is attention? 'Cause for attention in people that study cognition, so human attention, I think there are giant wars over what attention means and how it works in the human mind. So what... Uh, there's a very simple view of what attention is in a neural network from the days of Attention Is All You Need. But broadly... Do you think there's a general principle that's really powerful here?

    10. OV

      Yeah. So, a distinction between transformers and LSTMs, which were what came before. And, you know, there was a transitional period where you could use both. In fact, when we talked about AlphaStar, we used transformers and LSTMs, so it was still the beginning of transformers. They were very powerful, but LSTMs were also still very powerful sequence models. So the power of the transformer is that it has built in what we call an inductive bias of attention. When you think of a sequence of integers, right, like we discussed before, this is a sequence of words. Um, when you have to do very hard tasks over these words, this could be we're gonna translate a whole paragraph, or we're gonna predict the next paragraph given the 10 paragraphs before. There's some loose intuition from how we do it as humans that is very nicely mimicked and replicated, structurally speaking, in the transformer, which is this idea that you're looking for something, right? So when you've just read a piece of text and now you're thinking about what comes next, you might want to re-look at the text, or look at it from scratch. Literally, it's because there's no recurrence. You're just thinking what comes next, and it's almost hypothesis-driven, right? So if I'm thinking the next word I'll write is "cat" or "dog," okay? The way the transformer works, almost philosophically, is it has these two hypotheses: is it gonna be "cat," or is it gonna be "dog"? And then it thinks, "Okay, if it's 'cat,' I'm gonna look for certain words." Not necessarily "cat," although "cat" is an obvious word you would look for in the past to see whether it makes more sense to output "cat" or "dog." And then it does some very deep computation over the words and beyond, right? So it combines the words, but it has the query, as we call it, that is "cat."
And then similarly for "dog," right? And so it's a very computational way to think about it: if I'm thinking deeply about text, I need to go back to look at all of the text, attend over it. But it's not just attention. Like, what is guiding the attention? And that was the key insight from an earlier paper: it's not how far away something is. I mean, how far away it is matters. What did I just write about? That's critical. But what you wrote about 10 pages ago might also be critical. So you're looking not positionally but content-wise, right? And transformers have this beautiful way to query for certain content and pull it out in a compressed way, so then you can make a more informed decision. I mean, that's one way to explain transformers. Um, but I think it's a very powerful inductive bias. There might be some details that change over time, but I think that is what makes transformers so much more powerful than the recurrent networks, which were more recency-bias based, which obviously works in some tasks but has major flaws there. The transformer itself has flaws, and I think the main challenge is that these prompts we were just talking about can be a thousand words long. But if I'm teaching you StarCraft, I mean, I'll have to show you videos. I'll have to point you to whole Wikipedia articles about the game. Um, we'll have to interact; probably as you play, you'll ask me questions. The context required for me to be a good teacher to you on the game, as you would want to do it with a model, goes well beyond the current capabilities, I think. Um, so the question is how do we benchmark this, and then how do we change the structure of the architectures? I think there are ideas on both sides, but we'll have to see empirically, right, obviously, what ends up working in the...
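The "query for certain content and pull it out in a compressed way" mechanism described here is scaled dot-product attention, from Attention Is All You Need. A minimal, stdlib-only sketch over one query and a few key/value pairs; the vectors below are invented for illustration:

```python
# Scaled dot-product attention over a single query: score each key
# against the query by dot product, softmax the scores into weights,
# and return the weighted average of the values. Note that position
# plays no role here -- only content similarity does.

import math

def attention(query, keys, values):
    d = len(query)
    # Content-based scores, scaled by sqrt(dimension) as in the paper.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Numerically stable softmax.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of the values: a "compressed" summary of the past.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# A "cat?" query attends over three remembered words, wherever they
# occurred in the sequence; only their content matters.
keys   = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]   # e.g. "cat", "kitten", "dog"
values = [[1.0], [1.0], [0.0]]                  # what each word contributes
out = attention([1.0, 0.0], keys, values)
print(round(out[0], 2))  # → 0.8 (the cat-like entries dominate the dog entry)
```

In a real transformer, queries, keys, and values are all learned projections of the token embeddings, and this lookup runs in parallel for every position and every head.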

    11. LF

      And as you talked about, some of the ideas could be, you know, keeping that length constraint in place, but then forming, like, hierarchical representations, where you can start being much cleverer in how you use those thousand tokens.

    12. OV

      Indeed.

    13. LF

      Yeah, that's really interesting. But it's also possible that this attention mechanism, where you basically don't have a recency bias but you look more generally, you make it learnable. The mechanism by which you look back into the past, you make that learnable. It's also possible we're at the very beginning of that, because you might become smarter and smarter in the way you query the past. Uh, so the recent past, the distant past, and maybe the very, very distant past. So almost like the attention mechanism will have to improve and evolve, as will the tokenization mechanism, so you can represent long-term memory somehow.

    14. OV

      Yes. And I mean, hierarchies are very... I mean, it's a very nice word that sounds appealing. Um, there's lots of work adding hierarchy to the memories. In practice, it does seem like we keep coming back to the main formula or main architecture. That sometimes tells us something. There's a saying a friend of mine told me, about whether an idea wants to work or not. The transformer was clearly an idea that wanted to work.

    15. LF

      Mm-hmm.

    16. OV

      And then I think there are some principles we believe will be needed, but finding the exact details (details matter so much, right?) is gonna be tricky.

    17. LF

      I love the idea that there's, like, you as a human being, you want some ideas to work, and then there's the model that wants some ideas to work, and you get to have a conversation to see which... (laughs)

    18. OV

      Yes.

    19. LF

      ... more likely, the model will win in the end. Because it's the one... You don't have to do any work; the model is the one that has to do the work. So you should listen to the model. And I really love this idea that you talked about, the humans in this picture, if I can just briefly ask... One is what you're saying about the benchmarks, and the humans working on this. The benchmarks provide a sturdy ground for a wish to do these things that seem impossible. In the darkest of times, they give you hope, because of the little signs of improvement. You could-

    20. OV

      Yes.

    21. LF

      ... like, you're not... (laughs) Somehow you're not lost if you have metrics to measure your improvement. And then there's another aspect you've said elsewhere, and here today: titles matter.

    22. OV

      (laughs)

    23. LF

      I wonder how much humans matter in the evolution of all this. Meaning individual humans. You know, something about their interactions, something about their ideas, how much they change the direction of all of this. Like, if you changed the humans in this picture, is it that the model is sitting there and it wants some idea to work? Or is it the humans? Or maybe the model is providing 20 ideas that could work, and depending on the humans you pick, they're going to be able to hear some of those ideas. Because you're now directing all of deep learning at DeepMind, you get to track a lot of projects, a lot of brilliant researchers. Um, how much variability is created by the humans in all of this?

    24. OV

      Yeah, I mean, I do believe humans matter a lot, at the very least at the time scale of years, in when things happen and what the sequencing is, right? So you get to interact with people that, I mean, you mentioned this, some people really want some idea to work, and they'll persist. Um, and then some other people might be more practical, like, "I don't care what idea works. I care about, you know, cracking protein folding."

    25. LF

      Yes.

    26. OV

      Um, and these at least seem like two opposite ends. We need both. And we've clearly had both historically, and that made certain things happen earlier or later. So definitely the humans involved in all of this endeavor have had, I would say, an effect of years on the ordering of how things have happened, which breakthroughs came before which other breakthroughs, and so on. So certainly that does happen. And maybe one other axis of distinction is what's most commonly called, in reinforcement learning, the exploration-exploitation trade-off. It's not exactly what I meant, although it's quite related. So when you start trying to help others, right? Like when you become a bit more of a mentor to a large group of people, be it a project or the deep learning team or something, or even in the community when you interact with people at conferences and so on, you're quickly identifying some things that are explorative and some that are exploitative, and it's tempting to try to guide people, obviously. I mean, that's what our experience is for. We bring it and we try to shape things, sometimes wrongly, and there are many times that I've been wrong in the past. That's great. But it would be wrong to dismiss any of the research styles that I'm observing. Um, and I often get asked, "Well, you're in industry, right?" So we do have access to large compute scale and so on, so there are certain kinds of research I almost feel like we need to do, responsibly and so on, because we have the particle accelerator here, so to speak, as in physics. So we need to use it; we need to answer the questions that we should be answering right now for scientific progress. But then, at the same time, I look at many advancements, including attention, which was discovered in Montreal initially because of a lack of compute, right?
So we were working on sequence-to-sequence with my friends over at Google Brain at the time, and we were using, I think, eight GPUs, which was somehow a lot at the time. (laughs)

    27. LF

      (laughs)

    28. OV

      And then I think Montreal was a bit more limited in scale, but then they discovered this content-based attention concept, which then obviously triggered things like the transformer. Not everything obviously starts with the transformer; there's always a history that is important to recognize, because then you can make sure that those who might now feel, "Well, we don't have so much compute," you can then help them optimize the kind of research that might actually produce amazing change. Perhaps it's not as short-term as some of these advancements, or perhaps it's on a different time scale, but the people and the diversity of the field are quite critical, and we need to maintain that. And at times, especially mixed a bit with hype or other things, it's a bit tricky to be observing maybe too much of the same thinking across the board. Um, but the humans definitely are critical, and I can think of, yeah, quite a few personal examples where someone told me something that had a huge effect on some idea. And that's why I'm saying, at least in terms of years, probably some things do happen.

    29. LF

      Yeah, it's fascinating.

    30. OV

      Slightly different, yeah.

Episode duration: 2:10:08


Transcript of episode aGBLRlLe7X8
