
Jürgen Schmidhuber: Gödel Machines, Meta-Learning, and LSTMs | Lex Fridman Podcast #11

Lex Fridman and Jürgen Schmidhuber on self-improving AI, curiosity, and universal intelligence.

Host: Lex Fridman · Guest: Jürgen Schmidhuber
Dec 23, 2018 · 1h 19m

TRANSCRIPT

  1. 0:00–15:00

    1. LF

      The following is a conversation with Jürgen Schmidhuber. He's the co-director of the Swiss AI lab IDSIA and the co-creator of long short-term memory networks. LSTMs are used in billions of devices today for speech recognition, translation, and much more. Over 30 years, he has proposed a lot of interesting, out-of-the-box ideas on meta-learning, adversarial networks, computer vision, and even a formal theory of, quote, "creativity, curiosity, and fun." This conversation is part of the MIT course on Artificial General Intelligence and the Artificial Intelligence Podcast. If you enjoy it, subscribe on YouTube, iTunes, or simply connect with me on Twitter @lexfrid, spelled F-R-I-D. And now, here's my conversation with Jürgen Schmidhuber. Early on, you dreamed of AI systems that self-improve recursively. When was that dream born?

    2. JS

      When I was a baby. No, that's not true.

    3. LF

      (laughs)

    4. JS

      When I was a teenager.

    5. LF

      And what was the catalyst for that birth? What was the thing that first inspired you?

    6. JS

      When I was a boy, I was thinking about what to do in my life, and I thought the most exciting thing is to solve the riddles of the universe, and that means you have to become a physicist. However, then I realized that there's something even grander: you can try to build a machine that isn't really a machine any longer, one that learns to become a much better physicist than I could ever hope to be. And that's how I thought maybe I can multiply my tiny little bit of creativity into infinity.

    7. LF

      But ultimately, that creativity will be multiplied to understand the universe around us? That's the curiosity for that mystery that drove you?

    8. JS

      Yes. So if you can build a machine that learns to solve more and more complex problems, a more and more general problem solver, then you basically have solved all the problems. At least all the solvable problems.

    9. LF

      So what do you think the mechanism for that kind of general solver looks like? Because obviously we don't quite yet have one, or know how to build one, but we have ideas, and you have had several ideas about it throughout your career. So how do you think about that mechanism?

    10. JS

      So in the '80s, I thought about how to build this machine that learns to solve all these problems that I cannot solve myself, and I thought it was clear that it has to be a machine that not only learns to solve this problem here and that problem there, but also has to learn to improve the learning algorithm itself.

    11. LF

      Right.

    12. JS

      So it has to have the learning algorithm in a representation that allows it to inspect it and modify it, such that it can come up with a better learning algorithm. So I call that meta-learning, learning to learn, and recursive self-improvement, which is really the pinnacle of that, where you not only learn how to improve on this problem and on that one, but you also improve the way the machine improves, and you also improve the way it improves the way it improves itself. And that was my 1987 diploma thesis, which was all about that: a hierarchy of meta-learners that have no computational limits except for the well-known limits that Gödel identified in 1931, and the limits of physics.
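
      A deliberately tiny sketch of that idea in Python: the learner's own learning algorithm is an object the meta-level can inspect, modify, and keep only if the modification empirically improves learning. Everything here (the toy objective, the single meta-level, the multiplicative learning-rate proposals) is my illustration, not the 1987 thesis; the real proposal recurses, so the meta-level itself would also be open to modification.

      ```python
      import random

      random.seed(0)

      def make_learner(lr):
          """A one-parameter stand-in for a 'learning algorithm': gradient
          descent on f(w) = w**2 with step size lr. Returns the error left
          after learning."""
          def learn(steps=50):
              w = 5.0
              for _ in range(steps):
                  w -= lr * 2 * w              # gradient of w**2 is 2w
              return abs(w)
          return learn

      learn, lr = make_learner(0.01), 0.01
      for _ in range(20):                      # the meta-level loop
          new_lr = lr * random.choice([0.5, 2.0])  # propose a change to the learner itself
          new_learn = make_learner(new_lr)
          if new_learn() < learn():            # keep the change only if learning improves
              learn, lr = new_learn, new_lr
      print("self-selected learning rate:", lr)
      ```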

    13. LF

      Mm-hmm. In recent years, meta-learning has gained popularity in a specific kind of form. You've talked about how that's not really meta-learning with neural networks, that's more basic transfer learning. Can you talk about the difference between the big, general meta-learning-

    14. JS

      Mm-hmm.

    15. LF

      ... and a more narrow sense of meta-learning the way it's used today, the way it's talked about today?

    16. JS

      Let's take the example of a deep neural network that has, uh, learned to classify images and maybe you have trained that, um, network on 100 different databases of images.

    17. LF

      Mm-hmm.

    18. JS

      And now a new database comes along and you want to quickly learn the new thing as well. So one simple way of doing that is you take the network which already knows 100 types of databases and then you just take the top layer of that and you retrain that, uh, using the new label data that you have in the new image database. And then it turns out that it really, really quickly can learn that too.

    19. LF

      Mm-hmm.

    20. JS

      One shot basically.

    21. LF

      Mm-hmm.

    22. JS

      Because from the first 100 datasets, it already has learned so much about computer vision that it can reuse that, and that is then almost good enough to solve the new task, except you need a little bit of adjustment on the top. So that is transfer learning, and it has been done in principle for many decades; people have done similar things for decades. Meta-learning, true meta-learning, is about having the learning algorithm itself open to introspection by the system that is using it, and also open to modification, such that the learning system has an opportunity to modify any part of the learning algorithm, and then evaluate the consequences of that modification, and then learn from that to create a better learning algorithm, and so on recursively. So that's a very different animal, where you are opening the space of possible learning algorithms to the learning system itself.
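
      A minimal sketch of the transfer-learning recipe just described, assuming PyTorch. The backbone below is a randomly initialized stand-in for the network that has already seen the 100 databases (in practice you would load real pretrained weights), and all sizes and data are placeholders.

      ```python
      import torch
      import torch.nn as nn

      backbone = nn.Sequential(                # stand-in for the net trained on 100 databases
          nn.Flatten(),
          nn.Linear(3 * 32 * 32, 256),
          nn.ReLU(),
      )
      for p in backbone.parameters():          # freeze everything it already knows
          p.requires_grad = False

      head = nn.Linear(256, 10)                # fresh top layer for the new 10-class database
      opt = torch.optim.Adam(head.parameters(), lr=1e-3)
      loss_fn = nn.CrossEntropyLoss()

      x = torch.randn(64, 3, 32, 32)           # placeholder batch from the new database
      y = torch.randint(0, 10, (64,))

      for _ in range(100):                     # only the head is trained, so this is fast
          opt.zero_grad()
          loss = loss_fn(head(backbone(x)), y)
          loss.backward()
          opt.step()
      ```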

    23. LF

      Right. So, like in the 2004 paper, you described Gödel machines, programs that rewrite themselves.

    24. JS

      Yeah.

    25. LF

      Right? Philosophically, and even in your paper mathematically, these are really compelling ideas. But practically, do you see these self-referential programs being successful in the near term, having an impact that demonstrates to the world that this direction is a good one to pursue?

    26. JS

      Yes. We had these two different types of fundamental research on how to build a universal problem solver: one basically exploiting proof search and things like that, which you need to come up with asymptotically optimal, theoretically optimal self-improvers and problem solvers. However, one has to admit that through this proof search comes an additive constant, an overhead, an additive overhead that vanishes in comparison to what you have to do to solve large problems.

    27. LF

      Mm-hmm.

    28. JS

      However, for many of the small problems that we want to solve in our everyday life, we cannot ignore this constant overhead and that's why we also have been, um, doing other things, non-universal things such as recurrent neural networks which are trained by gradient descent-

    29. LF

      Mm-hmm.

    30. JS

      ... and local search techniques, which aren't universal at all, which aren't provably optimal at all, unlike the other stuff that we did, but which are much more practical as long as we only want to solve the small problems that we are typically trying to solve in this environment here. Yeah, so the universal problem solvers-
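
      To make the contrast concrete, here is a toy, brute-force program search in Python, loosely in the spirit of the universal solvers discussed above: enumerate programs shortest-first until one fits the data. The postfix expression language and the task are my own invention; the point is that the enumeration cost explodes with program length, which is exactly the kind of constant overhead that makes such methods impractical for small everyday problems.

      ```python
      import itertools

      TOKENS = ["x", "1", "2", "+", "*"]   # a tiny postfix (RPN) expression language

      def run(prog, x):
          """Evaluate a postfix program; return None if it is malformed."""
          stack = []
          for tok in prog:
              if tok == "x":
                  stack.append(x)
              elif tok.isdigit():
                  stack.append(int(tok))
              else:
                  if len(stack) < 2:
                      return None
                  b, a = stack.pop(), stack.pop()
                  stack.append(a + b if tok == "+" else a * b)
          return stack[0] if len(stack) == 1 else None

      def universal_search(examples):
          """Shortest-first enumeration; assumes some program in the language fits."""
          for length in itertools.count(1):
              for prog in itertools.product(TOKENS, repeat=length):
                  if all(run(prog, x) == y for x, y in examples):
                      return prog

      # Find a program computing f(x) = 2*x + 1 from input/output examples.
      print(universal_search([(0, 1), (1, 3), (5, 11)]))   # e.g. ('x', '2', '*', '1', '+')
      ```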

  2. 15:00–30:00

    2. LF

      You said that an AGI system will ultimately be a simple one. A general intelligence system will ultimately be a simple one; maybe a pseudocode of a few lines will be able to describe it. Can you talk through your intuition behind this idea, why you feel that, at its core, intelligence is a simple algorithm?

    3. JS

      Experience tells us that the stuff that works best is really simple. So the asymptotically optimal ways of solving problems, if you look at them, they're just a few lines of code. It's really true. Although they have these amazing properties, just a few lines of code. Then the most promising and most useful practical things maybe don't have this proof of optimality associated with them. However, they are also just a few lines of code. The most successful recurrent neural networks, you can write them down in five lines of pseudocode.
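
      For instance, a vanilla recurrent network's forward pass really does fit in a few lines; here is a numpy sketch (an LSTM adds input, forget, and output gates, but stays nearly as short). The sizes and random weights are arbitrary placeholders.

      ```python
      import numpy as np

      def rnn_forward(xs, W_h, W_x, b):
          """xs: sequence of input vectors; returns the hidden state after each step."""
          h = np.zeros(W_h.shape[0])
          states = []
          for x in xs:
              h = np.tanh(W_h @ h + W_x @ x + b)   # the whole recurrence is this line
              states.append(h)
          return states

      # Tiny usage example: random weights, a 5-step sequence of 3-dim inputs.
      rng = np.random.default_rng(0)
      W_h, W_x, b = rng.normal(size=(8, 8)), rng.normal(size=(8, 3)), np.zeros(8)
      states = rnn_forward([rng.normal(size=3) for _ in range(5)], W_h, W_x, b)
      print(len(states), states[-1].shape)         # 5 (8,)
      ```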

    4. LF

      That's a beautiful, almost poetic idea, but what you're describing there is that the lines of pseudocode are sitting on top of layers and layers of abstractions, in a sense.

    5. JS

      Mm-hmm.

    6. LF

      So you're saying at the very top-

    7. JS

      Mm-hmm.

    8. LF

      ... it will be a beautifully written sort of, uh, algorithm.

    9. JS

      Mm-hmm.

    10. LF

      But do you think that there's many layers of abstractions we have to first learn to construct?

    11. JS

      Yeah. Of course, we are building on all these great abstractions that people have invented over the millennia, such as matrix multiplications and real numbers and basic arithmetic and calculus and derivatives of error functions and stuff like that. So without that language, which greatly simplifies our way of thinking about these problems, we couldn't do anything. So in that sense, as always, we are standing on the shoulders of the giants who, in the past, simplified the problem of problem-solving so much that now we have a chance to do the final step.

    12. LF

      (laughs) So the final step will be a simple one. If we take a step back through all of human civilization and just the universe in general (laughs), how do you think about evolution, and what if creating a universe is required to achieve this final step?

    13. JS

      Mm-hmm.

    14. LF

      What if going through the very painful and inefficient process of evolution is needed-

    15. JS

      Mm-hmm.

    16. LF

      ... to come up with this set of abstractions that ultimately lead to intelligence? Do you think there's a shortcut or do you think we have to create something like our universe in order to create something like human level intelligence?

    17. JS

      Hmm. So far the only example we have, uh-

    18. LF

      (laughs)

    19. JS

      ... is this one.

    20. LF

      Yeah.

    21. JS

      This universe, um, in which we are living.

    22. LF

      Do you think it can do better?

    23. JS

      Maybe not.

    24. LF

      (laughs)

    25. JS

      But, um, we are part of this whole process.

    26. LF

      Right.

    27. JS

      So... apparently, it might be the case that the code that runs the universe is really, really simple. Everything points to that possibility, because gravity and other basic forces are really simple laws that can be easily described in just a few lines of code, basically. And then there are these other events, the apparently random events in the history of the universe, which, as far as we know at the moment, don't have a compact code. But who knows? Maybe somebody in the near future is going to figure out the pseudo-random generator which is computing whether the measurement of that spin-up-or-down thing here is going to be positive or negative.

    28. LF

      Underlying quantum mechanics?

    29. JS

      Yes. So-

    30. LF

      Do you ultimately think quantum mechanics is a pseudo-random number generator, so it's all deterministic? There's no randomness in our universe? Does God play dice?

  3. 30:00–45:00

    2. LF

      Mm-hmm.

    3. JS

      And since they are trying to maximize the rewards they get, they are suddenly motivated to come up with new action sequences, with new experiments, that have the property that the data coming in as a consequence of these experiments lets them learn something: see a pattern in there which they hadn't seen before.

    4. LF

      There's an idea of PowerPlay that you've described: training a general problem solver in this kind of way of looking for the unsolved problems.

    5. JS

      Yeah.

    6. LF

      Can you describe that idea a little further?

    7. JS

      Yeah. It's another very simple idea. So normally, what you do in computer science is, you have some guy who gives you a problem, and then there is a huge search space of potential solution candidates. And you somehow try them out, and you have more or less sophisticated ways of moving around in that search space until you finally find a solution which you consider satisfactory.

    8. LF

      Mm-hmm.

    9. JS

      That's what most of computer science is about. PowerPlay just goes one little step further and says: let's not only search for solutions to a given problem, but let's search through pairs of problems and their solutions, where the system itself has the opportunity to phrase its own problem. So we are suddenly looking at pairs of problems and their solutions, or modifications of the problem solver that is supposed to generate a solution to that new problem. And this additional degree of freedom allows us to build curious systems that are like scientists, in the sense that they not only try to find answers to existing questions; they are also free to pose their own questions. So if you want to build an artificial scientist, we have to give it that freedom, and PowerPlay is exactly doing that.

    10. LF

      So that's a dimension of freedom that's important to have, but how hard do you think it is? How multi-dimensional and difficult is the space of coming up with your own questions?

    11. JS

      Yeah.

    12. LF

      It's one of the things that we as human beings consider to be what makes us special-

    13. JS

      Mm.

    14. LF

      ... the intelligence that makes us special is that brilliant insight-

    15. JS

      Yeah.

    16. LF

      ... that can create something totally new.

    17. JS

      Yes. So now let's look at the extreme case. Let's look at the set of all possible problems that you can formally describe-

    18. LF

      Mm-hmm.

    19. JS

      ... which is infinite. Which should be the next problem that a scientist or PowerPlay is going to solve? Well, it should be the easiest problem that goes beyond what you already know. So it should be the simplest problem that the current problem solver, which can already solve 100 problems, cannot solve yet just by generalizing. So it has to be new; it has to require a modification of the problem solver such that the new problem solver can solve this new thing, but the old problem solver cannot. And in addition to that, we have to make sure that the problem solver doesn't forget any of the previous solutions.

    20. LF

      Right.

    21. JS

      And so by definition, PowerPlay is now always trying to search-

    22. LF

      Mm-hmm.

    23. JS

      ... in the set of pairs of problems and problem-solver modifications for a combination that minimizes the time-

    24. LF

      Mm-hmm.

    25. JS

      ... to achieve these criteria. So it's always trying to find the problem which is easiest to add to the repertoire.
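
      A hedged toy version of that loop, in Python; this is my illustration of the search over (new task, solver modification) pairs, not the original PowerPlay system. The solver is a small quadratic model, tasks are self-posed points of sin(x) to fit, a candidate must be genuinely new (not already solved by generalization), the "modification" is the retrained weight vector, the cost is the number of SGD steps needed while keeping every earlier task solved, and the cheapest valid pair wins.

      ```python
      import numpy as np

      rng = np.random.default_rng(1)
      w = np.zeros(3)          # the problem solver: y = w0 + w1*x + w2*x**2
      solved = []              # repertoire of (x, y) tasks already mastered

      def feats(x):
          return np.array([1.0, x, x * x])

      def cost_to_add(w, tasks, budget=500, lr=0.01):
          """SGD steps until every task in `tasks` is solved; None if over budget."""
          w = w.copy()
          for step in range(budget):
              if all(abs(feats(x) @ w - y) < 0.1 for x, y in tasks):
                  return step, w
              x, y = tasks[rng.integers(len(tasks))]
              w -= lr * (feats(x) @ w - y) * feats(x)
          return None

      for _ in range(5):
          # Self-posed candidates; keep only tasks the current solver cannot
          # already do by generalizing -- the "it has to be new" requirement.
          candidates = [(x, np.sin(x)) for x in rng.uniform(-2, 2, size=20)]
          candidates = [(x, y) for x, y in candidates if abs(feats(x) @ w - y) >= 0.1]
          best = None
          for task in candidates:          # choose the task that is cheapest to add
              result = cost_to_add(w, solved + [task])
              if result is not None and (best is None or result[0] < best[0]):
                  best = (result[0], result[1], task)
          if best is not None:
              _, w, task = best            # adopt the solver modification
              solved.append(task)          # no previous solution may be forgotten
              print(f"added task x={task[0]:+.2f}; repertoire size {len(solved)}")
      ```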

    26. LF

      So just like grad students and academics and researchers can spend their whole career in a local minimum-

    27. JS

      Mm-hmm.

    28. LF

      ... stuck trying to, uh, come up with interesting questions-

    29. JS

      Mm-hmm.

    30. LF

      ... but ultimately doing very little. Do you think it's easy, in this approach of looking for the simplest unsolvable problem, to get stuck in a local minimum and never really discover something new, never jump outside of the 100 problems that you've already solved-

  4. 45:00–1:00:00

    1. JS

      use this model of the world, this predictive model of the world, to plan ahead and say-

    2. LF

      Mm-hmm.

    3. JS

      ... "Let's not do this action sequence. Let's do this action sequence instead because it leads to more predicted rewards."

    4. LF

      Mm-hmm.

    5. JS

      And whenever it's waking up these little subnetworks that stand for itself, then it's thinking about itself. It's exploring mentally the consequences of its own actions. And now you tell me what is still missing.

    6. LF

      Missing the next step, the gap to consciousness.

    7. JS

      Yeah.

    8. LF

      There isn't. That's a really beautiful idea: if life is a collection of data, and life is a process of compressing that data to act efficiently in that data, you yourself appear very often in it. (laughs) So it's useful to form compressions of yourself. It's a really beautiful formulation of what consciousness is: a necessary side effect. It's actually quite compelling to me. You've described RNNs and developed LSTMs, long short-term memory networks, which are a type of recurrent neural network that has seen a lot of success recently. These are networks that model the temporal aspects of the data, temporal patterns in the data, and you've called them the deepest of the neural networks, right? So what do you think is the value of depth in the models that we use to learn?

    9. JS

      Yeah. Since you mentioned the long short-term memory and the LSTM, um, I have to mention the names of the brilliant students who made that possible.

    10. LF

      Yes, of course, of course.

    11. JS

      First of all, my first student ever, Sepp Hochreiter, who had fundamental insights already in his diploma thesis. Then Felix Gers, who made additional important contributions. Alex Graves, a guy from Scotland, is mostly responsible for the CTC algorithm, which is now often used to train the LSTM to do the speech recognition on all the Google Android phones, and Siri, and so on. So without these guys-

    12. LF

      Yeah.

    13. JS

      ... I would be nothing.

    14. LF

      It's a lot of incredible work.

    15. JS

      What is now the depth? What is the importance of depth? Well, most problems in the real world are deep in the sense that the current input doesn't tell you all you need to know about the environment.

    16. LF

      Mm-hmm.

    17. JS

      So instead, um, you have to have a memory of what happened in the past. And often important parts of that memory are dated. They are pretty old.

    18. LF

      Mm-hmm.

    19. JS

      So, um, when you're doing speech recognition, for example, and somebody says, "11,"-

    20. LF

      Mm-hmm.

    21. JS

      ... then that's about half a second or something like that-

    22. LF

      Mm-hmm.

    23. JS

      ... which means it's already 50 timesteps.

    24. LF

      Mm-hmm.

    25. JS

      And another guy, or the same guy, says, "Seven," so the ending is the same: "-even."

    26. LF

      Mm-hmm.

    27. JS

      But now the system has to see the distinction between "seven" and "eleven," and the only way it can see the difference is that it has to store that 50 steps ago, there was an "s" or an "el," a seven or an eleven. So there you already have a problem of depth 50, because for each timestep you have something like a virtual layer in the expanded, unrolled version of this recurrent network which is doing the speech recognition. So these long time lags translate into problem depth, and most problems in this world are such that you really have to look far back in time to understand what the problem is and to solve it.

    28. LF

      But just like with LSTMs, when you look back in time, you don't necessarily need to remember every aspect. You just need to remember the important aspects.

    29. JS

      That's right. The network has to learn to put the important stuff into memory and to ignore the unimportant noise.
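
      A toy version of the seven/eleven example, assuming PyTorch: the class-determining cue appears only at the first of 50 timesteps, and the label is read off at the last, so the LSTM must learn to carry that one important bit across the whole sequence while ignoring the noise in between. All sizes and the synthetic data are placeholders.

      ```python
      import torch
      import torch.nn as nn

      T, B = 50, 32                                   # 50 timesteps, batch of 32
      lstm = nn.LSTM(input_size=4, hidden_size=16, batch_first=True)
      head = nn.Linear(16, 2)
      opt = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()), lr=1e-2)

      def make_batch():
          x = torch.randn(B, T, 4)                    # noise the network should ignore
          y = torch.randint(0, 2, (B,))
          x[:, 0, 0] = y.float() * 2 - 1              # the distinguishing cue, 50 steps early
          return x, y

      for _ in range(300):
          x, y = make_batch()
          out, _ = lstm(x)
          loss = nn.functional.cross_entropy(head(out[:, -1]), y)  # decide at the last step
          opt.zero_grad()
          loss.backward()
          opt.step()

      x, y = make_batch()                             # fresh data as a quick check
      pred = head(lstm(x)[0][:, -1]).argmax(dim=1)
      print("held-out accuracy:", (pred == y).float().mean().item())
      ```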

    30. LF

      So, but in that sense, is deeper and deeper better, or is there a limitation? I mean, LSTM is one of the great examples of architectures that do something beyond just deeper and deeper networks: there are clever mechanisms for filtering data, for remembering and forgetting. Do you think that kind of thinking is necessary? If you think about LSTMs as a big leap forward over traditional vanilla RNNs, what do you think is the next leap-

  5. 1:00:00–1:15:00

    2. JS

      So I became aware of all of that in the '80s, and back then logic programming was a huge thing.

    3. LF

      Was it inspiring to you yourself? Did you find it compelling? Because a lot of your work was not so much in that realm, right? It was more in the learning systems.

    4. JS

      Yes and no, but we did all of that. Uh-

    5. LF

      You did.

    6. JS

      ... so we... My first publication ever, actually, in 1987, was the implementation of a genetic programming system in Prolog. Prolog, that's what you learned back then; it's a logic programming language. And the Japanese had this huge Fifth Generation AI project, which was mostly about logic programming back then, although neural networks existed and were well known back then, and deep learning has existed since 1965, since this guy in Ukraine, Ivakhnenko, started it. But the Japanese and many other people really focused on this logic programming, and I was influenced to the extent that I said, "Okay, let's take these biologically inspired algorithms like evolution-"

    7. LF

      Mm-hmm.

    8. JS

      "... uh, programs, uh, and, um, and, and, mm, and implement that in the language which I know, which was Prolog, for example, back then."

    9. LF

      Yeah.

    10. JS

      And then, in many ways, this came back later, because the Gödel machine, for example, has a proof searcher on board, and without that, it would not be optimal. Also, Marcus Hutter's universal algorithm for solving all well-defined problems has a proof search on board. So that's very much logic programming.

    11. LF

      Mm-hmm.

    12. JS

      Without that, it would not be asymptotically optimal. But then, on the other hand, because we are also very pragmatic guys, we focused on recurrent neural networks and suboptimal stuff such as gradient-based search in program space, rather than provably optimal things.

    13. LF

      Logic programming certainly has a usefulness when you're trying to construct something provably optimal, or provably good, or something like that. But is it useful for practical problems?

    14. JS

      It's really useful for theorem proving.

    15. LF

      Theorem. (laughs)

    16. JS

      The best theorem provers today are not, uh, neural networks.

    17. LF

      Right.

    18. JS

      No. They are, uh, logic programming systems, and they are much better theorem provers than most, uh, math students in their first or second semester.

    19. LF

      Mm-hmm. But for reasoning, for playing games of Go or chess, or for robots, autonomous vehicles that operate in the real world-

    20. JS

      Yeah.

    21. LF

      ... or, uh, object manipulation-

    22. JS

      Yeah.

    23. LF

      ... you think learning...

    24. JS

      Yeah, as long as the problems have little to do with theorem proving-

    25. LF

      Yeah.

    26. JS

      ... themselves, then, as long as that is not the case, you just want to have better pattern recognition. So to build a self-driving car, you want better pattern recognition, and pedestrian recognition, and all these things, and you want to minimize the number of false positives, which is currently slowing down self-driving cars in many ways. And all of that has very little to do with logic programming.

    27. LF

      What are you most excited about in terms of directions of artificial intelligence at this moment in the next few years in your own research and in the broader community?

    28. JS

      So, I think in the not-so-distant future, we will have for the first time little robots that learn like kids. And I will be able to say to the robot, um, "Look here, robot, we are going to assemble a smartphone."

    29. LF

      Mm-hmm.

    30. JS

      "Let's take this slab of plastic, um, and the screwdriver, and let's screw in the screw like that," you know? No, not like that, like that.

  6. 1:15:00–1:20:04

    1. JS

      Now we realize that the universe is still young. It's only 13.8 billion years old, and it's going to be a thousand times older than that. So there's plenty of time to conquer the entire universe and to fill it with intelligence, and senders and receivers, such that AIs can travel the way they are traveling in our labs today, which is by radio, from sender to receiver. Let's call the current age of the universe one eon. It will take just a few eons from now, and the entire visible universe is going to be full of that stuff. And let's look ahead to a time when the universe is going to be 1,000 times older than it is now. They will look back and they will say, "Look, almost immediately after the Big Bang, only a few eons later, the entire universe started to become intelligent." Now, to your question: how do we see whether anything like that has already happened, or is already in a more advanced stage, in some other part of the visible universe? We are trying to look out there, and nothing like that has happened so far. Or is that true?

    2. LF

      Do you think we would recognize it? Well, how do we know it's not among us?

    3. JS

      Yeah.

    4. LF

      How do we know planets aren't, in themselves, intelligent beings?

    5. JS

      Yeah.

    6. LF

      How do we know that ants, seen as a collective, are not a much greater intelligence-

    7. JS

      Yeah.

    8. LF

      ... than our own? These kinds of ideas.

    9. JS

      Yeah. When I was a boy, I was thinking about these things, and I thought, hmm, maybe it has already happened. Because back then I learned from popular physics books that the large-scale structure of the universe is not homogeneous: you have these clusters of galaxies, and then in between there are these huge empty spaces. And I thought, hmm, maybe they aren't really empty. It's just that in the middle of that, some AI civilization already has expanded and has covered a bubble of a billion light-years' diameter, and is using all the energy of all the stars within that bubble for its own unfathomable purposes. So it already has happened, and we just fail to interpret the signs. But then I learned that gravity by itself explains the large-scale structure of the universe, and that this is not a convincing explanation. And then I thought, maybe it's the dark matter, because as far as we know today, 80% of the measurable matter is invisible, and we know that because otherwise our galaxy and other galaxies would fall apart; they are rotating too quickly. And then the idea was, maybe all these AI civilizations that are already out there are just invisible, because they are really efficient in using the energies of their own local systems, and that's why they appear dark to us. But this is also not a convincing explanation, because then the question becomes: why are there still any visible stars left in our own galaxy, which also must have a lot of dark matter? So that is also not convincing. And today, (laughs) I like to think it's quite plausible that maybe we are the first, at least in our local light cone, within the few hundreds of millions of light-years that we can reliably observe.

    10. LF

      Observe. Is that exciting to you, that we might be the first?

    11. JS

      And it would make us much more important, because if we mess it up through a nuclear war, then maybe this will have an effect on the development of the entire universe.

    12. LF

      So let's not mess it up.

    13. JS

      Let's not mess it up.

    14. LF

      Jürgen, thank you so much for talking today. I really appreciate it.

    15. JS

      It's my pleasure.
