Lex Fridman Podcast

Roman Yampolskiy: Dangers of Superintelligent AI | Lex Fridman Podcast #431

Roman Yampolskiy is an AI safety researcher and author of a new book titled AI: Unexplainable, Unpredictable, Uncontrollable. Please support this podcast by checking out our sponsors:
- Yahoo Finance: https://yahoofinance.com
- MasterClass: https://masterclass.com/lexpod to get 15% off
- NetSuite: http://netsuite.com/lex to get free product tour
- LMNT: https://drinkLMNT.com/lex to get free sample pack
- Eight Sleep: https://eightsleep.com/lex to get $350 off

TRANSCRIPT: https://lexfridman.com/roman-yampolskiy-transcript

EPISODE LINKS:
Roman's X: https://twitter.com/romanyam
Roman's Website: http://cecs.louisville.edu/ry
Roman's AI book: https://amzn.to/4aFZuPb

PODCAST INFO:
Podcast website: https://lexfridman.com/podcast
Apple Podcasts: https://apple.co/2lwqZIr
Spotify: https://spoti.fi/2nEwCF8
RSS: https://lexfridman.com/feed/podcast/
Full episodes playlist: https://www.youtube.com/playlist?list=PLrAXtmErZgOdP_8GztsuKi9nrraNbKKp4
Clips playlist: https://www.youtube.com/playlist?list=PLrAXtmErZgOeciFP3CBCIEElOJeitOr41

OUTLINE:
0:00 - Introduction
2:20 - Existential risk of AGI
8:32 - Ikigai risk
16:44 - Suffering risk
20:19 - Timeline to AGI
24:51 - AGI turing test
30:14 - Yann LeCun and open source AI
43:06 - AI control
45:33 - Social engineering
48:06 - Fearmongering
57:57 - AI deception
1:04:30 - Verification
1:11:29 - Self-improving AI
1:23:42 - Pausing AI development
1:29:59 - AI Safety
1:39:43 - Current AI
1:45:05 - Simulation
1:52:24 - Aliens
1:53:57 - Human mind
2:00:17 - Neuralink
2:09:23 - Hope for the future
2:13:18 - Meaning of life

SOCIAL:
- Twitter: https://twitter.com/lexfridman
- LinkedIn: https://www.linkedin.com/in/lexfridman
- Facebook: https://www.facebook.com/lexfridman
- Instagram: https://www.instagram.com/lexfridman
- Medium: https://medium.com/@lexfridman
- Reddit: https://reddit.com/r/lexfridman
- Support on Patreon: https://www.patreon.com/lexfridman

Roman Yampolskiy (guest) · Lex Fridman (host)
Jun 2, 2024 · 2h 15m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00 - 2:20

    Introduction

    1. RY

      If we create general superintelligences, I don't see a good outcome long term for humanity. So there is X-risk, existential risk, everyone's dead. There is S-risk, suffering risks, where everyone wishes they were dead. We have also idea for I-risk, ikigai risks, where we lost our meaning. The systems can be more creative. They can do all the jobs. It's not obvious what you have to contribute to a world where superintelligence exists. Of course, you can have all the variants you mentioned where we are safe, we are kept alive, but we are not in control. We are not deciding anything. We are like animals in a zoo. There is, again, possibilities we can come up with as very smart humans, and then possibilities something 1,000 times smarter can come up with for reasons we cannot comprehend.

    2. LF

      The following is a conversation with Roman Yampolskiy, an AI safety and security researcher and author of a new book titled AI: Unexplainable, Unpredictable, Uncontrollable. He argues that there's almost 100% chance that AGI will eventually destroy human civilization. As an aside, let me say that I will have many often technical conversations on the topic of AI, often with engineers building the state-of-the-art AI systems. I would say those folks put the infamous P-doom or the probability of AGI killing all humans at around 1 to 20%. But it's also important to talk to folks who put that value at 70, 80, 90, and as in the case of Roman, at 99.99 and many more nines percent. I'm personally excited for the future, and believe it will be a good one, in part because of the amazing technological innovation we humans create. But we must absolutely not do so with blinders on, ignoring the possible risks, including existential risks of those technologies. That's what this conversation is about. This is the Lex Fridman podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Roman Yampolskiy.

  2. 2:20 - 8:32

    Existential risk of AGI

    1. LF

      What to you is the probability that superintelligent AI will destroy all human civilization?

    2. RY

      What's the timeframe?

    3. LF

      Let's say 100 years, in the next 100 years.

    4. RY

      So the problem with controlling AGI or superintelligence, in my opinion, is like a problem of creating a perpetual safety machine. By analogy with a perpetual motion machine, it's impossible. Yeah, we may succeed and do a good job with GPT-5, 6, 7, but they just keep improving, learning, eventually self-modifying, interacting with the environment, interacting with malevolent actors. The difference between cybersecurity, narrow AI safety, and safety for general AI, for superintelligence, is that we don't get a second chance. With cybersecurity, somebody hacks your account, what's the big deal? You get a new password, new credit card, you move on. Here, if we're talking about existential risks, you only get one chance. So you're really asking me, what are the chances that we'll create the most complex software ever on the first try with zero bugs, and it will continue to have zero bugs for 100 years or more?

    5. LF

      So there is an incremental improvement of systems leading up to AGI. To you, it doesn't matter if we can keep those safe. There's going to be one level of system at which you cannot possibly control it.

    6. RY

      I don't think we so far have made any systems safe. At the level of capability they display, they already have made mistakes or had accidents. They've been jailbroken. I don't think there is a single large language model today which no one was successful at making do something developers didn't intend it to do.

    7. LF

      But there's a difference between getting it to do something unintended, getting it to do something that's painful, costly, destructive, and something that's destructive to the level of hurting hundreds of millions of people, billions of people, or the entirety of human civilization. That's a big leap.

    8. RY

      Exactly. But the systems we have today have the capability of causing X amount of damage. So when they fail, that's all we get. If we develop systems capable of impacting all of humanity, all of the universe, the damage is proportionate.

    9. LF

      What to you are the possible ways that such kind of mass murder of humans can happen?

    10. RY

      That's always a wonderful question. So one of the chapters in my new book is about unpredictability. I argue that we cannot predict what a smarter system will do. So you're really not asking me how superintelligence will kill everyone. You're asking me how I would do it.

    11. LF

      Right.

    12. RY

      And I think it's not that interesting. I can tell you about the standard ones, you know, nanotech, synthetic bio, nuclear. Superintelligence will come up with something completely new, completely super. We may not even recognize that as a possible path to achieve that goal.

    13. LF

      So there is like an unlimited level of creativity in terms of how humans could be killed. But, you know, we could still investigate possible ways of doing it. Not how to do it, but the- at the end, what is the methodology that does it? You know, shutting off the power, and then humans start killing each other maybe, because the resources are really constrained? That there- and then there's the actual use of weapons like nuclear weapons or developing artificial pathogens, viruses, that kind of stuff. We could still kinda think through that and defend against it, right? There's a ceiling to the creativity of mass murder of humans here. Right? The options are limited.

    14. RY

      They are limited by how imaginative we are. If you are that much smarter, that much more creative, you are capable of thinking across multiple domains, do novel research in physics and biology, you may not be limited by those tools. If squirrels were planning to kill humans, they would have a set of possible ways of doing it, but they would never consider things we can come up with.

    15. LF

      So, are you- are you thinking about mass murder and destruction of human civilization? Or are you thinking of, with squirrels, you put them in a zoo and they don't really know they're in a zoo? If we just look at the entire set of undesirable trajectories, the majority of them are not going to be death. Most of them are going to be just, like, uh, things like Brave New World where, you know, the squirrels are fed dopamine and they're all, like, doing some kind of fun activity, and the, sort of the fire, the soul of humanity is lost because of the drug that's fed to it. Or, like, literally in a zoo. We're in a zoo, we're doing our thing. We're, like, uh, playing a game of Sims, and the actual players playing that game are AI systems. Those are all undesirable because it's sort of the- the free will, the fire of human consciousness, that is dimmed through that process, but it's not killing humans. So, like, are you thinking about that? Or is the biggest concern literally the extinction of humans?

    16. RY

      I- I think about a lot of things.

    17. LF

      (laughs)

    18. RY

      So there is X-risk, existential risk, everyone's dead. There is S-risk, suffering risks, where everyone wishes they were dead. We have also idea for I-risk, Ikigai risks, where we lost our meaning. The systems can be more creative, they can do all the jobs. It's not obvious what you have to contribute to a world where super intelligence exists. Of course, you can have all the variants you mentioned where we are safe, we are kept alive, but we are not in control. We are not deciding anything. We are like animals in a zoo. There is, again, possibilities we can come up with as very smart humans, and then possibilities something 1,000 times smarter can come up with for reasons we cannot

  3. 8:32 - 16:44

    Ikigai risk

    1. RY

      comprehend.

    2. LF

      I would love to sort of dig into each of those, X-risk, S-risk and I-risk. So c- can you, like, linger on I-risk? What is that?

    3. RY

      So Japanese concept of Ikigai, you find something which allows you to make money, you are good at it, and the society says, "We need it." So, like, you have this awesome job, you're a podcaster, gives you a lot of meaning. You have a good life. I assume you're happy.

    4. LF

      Mm-hmm.

    5. RY

      That's what we want more people to find, to have. For many intellectuals, it is their occupation which gives them a lot of meaning. I am a researcher, philosopher, scholar. That means something to me. In a world where an artist is not feeling appreciated because his art is just not competitive with what is produced by machines, or a writer or scientist will lose a lot of that. And at the lower level, we're talking about complete technological unemployment. We're not losing 10% of jobs, we're losing all jobs. What do people do with all that free time? What happens when everything society is built on is completely modified in one generation? It's not a slow process where we get to kinda figure out how to live that new lifestyle, but it's, uh, pretty quick.

    6. LF

      In that world, can't humans just do what humans currently do with chess? Play each other? Have tournaments? Even though AI systems are far superior this time in chess. So we just create artificial games, or for us they're real. Like the Olympics, and we do all kinds of different competitions and have fun. Focus, ma- maximize the fun a- and, uh, let the AI focus on the productivity.

    7. RY

      It's an option. I have a paper where I try to solve the value alignment problem for multiple agents.

    8. LF

      Mm-hmm.

    9. RY

      And the solution to avoid compromise is to give everyone a personal virtual universe. You can do whatever you want in that world. You could be king, you could be a slave, you decide what happens. So it's basically a glorified video game where you get to enjoy yourself and someone else takes care of your needs. And the substrate alignment is the only thing we need to solve. We don't have to get eight billion humans to agree on anything.

    10. LF

      Mm-hmm. So, okay, so what, wh- why is that not a likely outcome? Why can't AI systems create video games for us to lose ourselves in? Each- each with an individual video game universe?

    11. RY

      Some people say that's what happened. We're in a simulation.

    12. LF

      And we're playing that video game, and now we're creating, uh, what? Maybe we're creating artificial threats for ourselves to be scared about 'cause- 'cause fear is really exciting. It allows us to play the video game more, uh, more vigorously.

    13. RY

      And some people choose to play on a more difficult level with more constraints. Some say, "Okay, I'm just gonna enjoy the game." High privilege level. Absolutely.

    14. LF

      So, okay, what was that paper on multi-agent value alignment?

    15. RY

      Personal universes. Personal universes.

    16. LF

      (laughs) So p- wait. So that's one of the possible outcomes, but what- what- what in general is the idea of the paper? So it's looking at multiple agents that are human, AI, h- like a hybrid system where there's humans and AIs? Or is it looking at humans or just-

    17. RY

      So this is-

    18. LF

      ... intelligent agents?

    19. RY

      In order to solve the value alignment problem, um, I'm trying to formalize it a little better. Usually, we're talking about getting AIs to do what we want, which is not well defined. Are we talking about the creator of a system, the owner of that AI, humanity as a whole? But we don't agree on much... There is no universally accepted ethics, morals across cultures, religions. People have individually very different preferences, politically and such. So even if we somehow managed all the other aspects of it, programming those fuzzy concepts in, getting AI to follow them closely, we don't agree on what to program in. So my solution was, okay, we don't have to compromise on room temperature. You have your universe, I have mine. Whatever you want. And if you like me, you can invite me to visit your universe. We don't have to be independent. But the point is, you can be. And virtual reality is getting pretty good. It's gonna hit a point where you can't tell the difference, and if you can't tell if it's real or not, then what's the difference?

    20. LF

      So basically give up on value alignment. Create an entire... It's like the- the multiverse theory. Let's just create an entire universe for you with your values.

    21. RY

      You still have to align with that individual. They have to be happy in that simulation. But it's a much easier problem to align with one agent versus eight billion agents plus animals, aliens.

    22. LF

      So you convert the multi-agent problem into a single-agent problem.

    23. RY

      I'm trying-

    24. LF

      Basically.

    25. RY

      ... to do that, yeah.

    26. LF

      Okay. Is there any way to... Is there... So okay, that's giving up on the (laughs) on the value alignment problem. Well, is there any way to solve the value alignment problem when there's a bunch of humans, multiple humans, tens of humans, or eight billion humans that have a very different set of values?

    27. RY

      It seems contradictory. I haven't seen anyone explain what it means outside of kinda words which pack a lot, make it good, make it desirable, make it something they don't regret. But how do you specifically formalize those notions? How do you program them in? I haven't seen anyone, uh, make progress on that so far.

    28. LF

      But isn't that the whole optimization journey that we're doing as a h- human civilization? We're looking at geopolitics. Nations are in a state of anarchy with each other. They start wars, there's conflict, and oftentimes they have very different views of what is good and- and what is evil. Isn't that what we're trying to figure out, just together trying to converge towards that? So we're essentially trying to solve the value alignment problem with humans.

    29. RY

      Right. But the examples you gave, uh, some of them are, for example, two different religions saying, "This is our holy site, and we are not willing to compromise it in any way." If you can make two holy sites in virtual worlds, you solve the problem. But if you only have one, it's not divisible, you're kinda stuck there.

    30. LF

      But what if we want to be at tension with each other, uh, and that through that tension, we understand ourselves and we understand the world? So that- that's the intellectual journey we're on- we're on as a human civilization, is we create intellectual and physical conflict, and through that, figure stuff out.

  4. 16:44 - 20:19

    Suffering risk

    1. RY

    2. LF

      Okay, so what- what's the S-risk? What are the possible things that you're imagining with S-risk? So mass suffering of humans, what are we talking about there caused by AGI?

    3. RY

      So there are many malevolent actors. We can talk about psychopaths, crazies, hackers, doomsday cults. We know from history they tried killing everyone. They tried on purpose to cause maximum amount of damage. Terrorism. What if someone malevolent wants on purpose to torture all humans as long as possible? You solve aging, so now you have functional immortality, and you just try to be as creative as you can.

    4. LF

      Do you think there are actually people in human history who tried to literally maximize human suffering? In just studying people who have done evil in the world, it seems that they think that they're doing good, and it doesn't seem like they're trying to maximize suffering. They just cause a lot of suffering as a side effect of doing what they think is good.

    5. RY

      So there are different malevolent agents. Some may be just gaining personal benefit and sacrificing others to that cause. Others we know for a fact are trying to kill as many people as possible. When we look at recent school shootings, if they had more capable weapons, they would take out not dozens, but thousands, millions, billions.

    6. LF

      (laughs) Well, we don't know that. But that is a terrifying possibility. And we- we don't want to find out. Like if terrorists had ac- access to nuclear weapons, how far would they go? Is there a limit to what they're willing to do? Your sense is that there are some malevolent actors where there's no limit.

    7. RY

      There are mental diseases where people don't have empathy, don't have this human quality of understanding suffering in others.

    8. LF

      And then there's also a set of beliefs where you think you're doing good, uh, by killing a lot of humans.

    9. RY

      Again, I would like to assume that normal people never think like that. It's always some sort of psychopaths, but yeah.

    10. LF

      And to you, AGI systems can carry that and, uh, be more competent at executing that.

    11. RY

      They can certainly be more creative. They can understand human biology better, understand our molecular structure, genome. Uh, a- again, uh, a lot of times, uh, torture ends when the individual dies. That limit can be removed as well.

    12. LF

      So, if we're actually looking at X-risk and S-risk, as the systems get more and more intelligent, don't you think it's possible to anticipate the ways they can do it and defend against it like we do with cybersecurity, with security systems?

    13. RY

      Right. Uh, we can definitely keep up for a while. I'm saying-

    14. LF

      Right.

    15. RY

      ... you cannot do it indefinitely. At some point, the cognitive gap is too big. The surface you have to defend is infinite. But attackers only need to find one exploit.

    16. LF

      So, to you, eventually, this is we're heading off a cliff.

    17. RY

      If we create general super intelligences, I don't see a good outcome long-term for humanity. The only way to win this game is not to play it.

    18. LF

      Okay. Well, we'll (laughs) we'll talk about possible solutions and what not playing it means.

  5. 20:19 - 24:51

    Timeline to AGI

    1. LF

      Um, but what are the possible timelines here to you? What are we talking about? Are we talking about a set of years, decades, centuries? What do you think?

    2. RY

      I don't know for sure. The prediction markets right now are saying 2026 for AGI. I heard the same thing from CEO of Anthropic, DeepMind, so maybe we are two years away, which seems very soon, uh, given we don't have a working safety mechanism in place or even a prototype for one. And there are people trying to accelerate those timelines, because they feel we're not getting there quick enough.

    3. LF

      Well, what do you think they mean when they say AGI?

    4. RY

      So, the definitions we used to have, and people are modifying them a little bit lately, artificial general intelligence was a system capable of performing in any domain a human could perform. So, kind of you're creating this average artificial person that can do cognitive labor, physical labor, where you can get another human to do it. Super intelligence was defined as a system which is superior to all humans in all domains. Now people are starting to refer to AGI as if it's super intelligence. I made a post recently where I argued, for me at least, if you average out over all the common human tasks, those systems are already smarter than an average human.

    5. LF

      Mm-hmm.

    6. RY

      So, under that definition, we- we have it. Shane Legg has this definition of where you're trying to win in all domains. That's what intelligence is. Now, are they smarter than elite individuals in certain domains? Of course not. They're not there yet. But, uh, the progress is exponential.

    7. LF

      See, I'm much more concerned about social engineering. So, to me, AI's ability to do something in the physical world, like the- the lowest hanging fruit, the- the easiest set of methods is by just getting humans to do it. It's going to be much harder to- to, uh, be the kind of viruses that take over the minds of robots that ... where the robots are executing the commands. It just seems like humans, social engineering of humans is much more likely.

    8. RY

      That would be enough to bootstrap the whole process.

    9. LF

      Okay. Just to linger on the term AGI, what- what to you is the difference between AGI and human level intelligence?

    10. RY

      Uh, human level is general in the domain of expertise of humans.

    11. LF

      Mm-hmm.

    12. RY

      We know how to do human things. I don't speak dog language. I should be able to pick it up if I'm a general intelligence. It's kind of an inferior animal. I should be able to learn that skill, but I can't. A general intelligence, a truly universal general intelligence, should be able to do things like that, which humans cannot do.

    13. LF

      To be able to talk to animals, for example.

    14. RY

      To solve pattern recognition problems of that type, to do other similar things outside of our domain of expertise because it's just not the world we live in.

    15. LF

      If- if we just look at the space of cognitive abilities we have, I just would love to understand what the limits are beyond which an AGI system can reach. Like what does that look like? What about actual mathematical thinking or, uh, scientific innovation, that kind of stuff?

    16. RY

      Well, we know calculators are smarter than humans in that narrow domain of addition.

    17. LF

      But is it humans plus tools versus AGI or just human, raw human intelligence? 'Cause- 'cause humans create tools, and with the tools they become more intelligent, so like w- w- there- there's a gray area there what it means to be human when we're measuring their intelligence.

    18. RY

      So, when I think about it, I usually think of a human with, like, a paper and a pencil, not a human with internet and- and other AI helping.

    19. LF

      But is that a fair way to think about it? 'Cause isn't there another definition of human level intelligence that includes the tools that humans create?

    20. RY

      But we create AI, so at any point you'll still just add super intelligence to human capability. That seems like cheating.

    21. LF

      No, controllable tools. There is- there is an implied leap that you're making when AGI goes from tool to an entity that can make its own decisions, so if we define human level intelligence as everything a- a human can do with fully controllable tools.

    22. RY

      It seems like a hybrid of some kind. You're now doing brain computer interfaces, you're connecting it to maybe narrow AIs. Yeah, it definitely increases our capabilities.

  6. 24:51 - 30:14

    AGI turing test

    1. RY

    2. LF

      So, what's a good test to you that, uh, measures whether, uh, an artificial intelligence system has reached human level intelligence, and what's a good test where it has superseded human level intelligence to reach that land of AGI?

    3. RY

      I am old-fashioned, I like Turing test.

    4. LF

      Yeah.

    5. RY

      I have a paper where I equate passing Turing test to solving AI complete problems, because you can encode any questions about any domain into the Turing test. You don't have to talk about, "How was your day?" You can ask anything. And so a system has to be as smart as a human to pass it, in a true sense.

    6. LF

      But then you would extend that to, um, maybe a very long conversation. Like, I think the Alexa Prize was doing that. Basically, can you do a 20-minute, 30-minute conversation with an AI system?

    7. RY

      It has to be long enough to where you can make some meaningful decisions about capabilities, absolutely. You can brute force very short conversations.

    8. LF

      So, like, literally, what does that look like? Can we do a... Can we construct, formally, a kind of test that tests for AGI?

    9. RY

      For AGI, it has to be there: there should be no task I can give to a human that it cannot do if a human can. For super intelligence, it would be superior on all such tasks, not just average performance. Or, like, go learn to drive a car, go speak Chinese, play guitar. Okay, great.

    10. LF

      I guess the- the follow-on question, is there a test for the kind of AGI that would be, uh, susceptible to lead to S-risk or X-risk, susceptible to destroy human civilization? Like, is there a test for that?

    11. RY

      You can develop a test which will give you positives if it lies to you or has those ideas. You cannot develop a test which rules them out. There is always possibility of what Bostrom calls a treacherous turn, where later on a system decides, for game theoretic reasons, economic reasons, to change its behavior. And we see the same with humans. It's not unique to AI. For millennia, we tried developing morals, ethics, religions, uh, lie detector tests, and then employees betray the employers, spouses betray family. It's a pretty standard thing intelligent agents sometimes do.

    12. LF

      So, is there- is it possible to detect when a S- AI system is lying or deceiving you?

    13. RY

      If you know the truth and it tells you something false, you can detect that, but you cannot know in general every single time. And again, the system you're testing today may not be lying. The system you're testing today may know you are testing it, and so behaving, and later on, after it interacts with the environment, interacts with other systems, malevolent agents, learns more, it may start doing those things.

    14. LF

      So, do you think it's possible to develop a system where the creators of the system, the developers, the programmers don't know that it's deceiving them?

    15. RY

      So, systems today don't have long-term planning.

    16. LF

      Yeah.

    17. RY

      That is not our... They can lie today if it optimizes, helps them optimize the reward. If they realize, okay, this human will be very happy if I tell them the following, they will do it if it brings them more points. And they don't have to kind of keep track of it. It's just the right answer to this problem every single time.

    18. LF

      At which point is somebody creating that intentionally, not unintentionally, intentionally creating an AI system that's doing long-term planning with an objective function as defined by the A- AI system, not by a human?

    19. RY

      Well, some people think that if they're that smart, they're always good. They really do believe that. It's just benevolence from intelligence, so they'll always want what's best for us. Some people think that, uh, they will be able to detect problem behaviors and correct them at the time when we get there. I don't think it's a good idea. I am strongly against it. But yeah, there are quite a few people who, in general, are so optimistic about this technology, it could do no wrong. They want it developed as soon as possible, as capable as possible.

    20. LF

      So, there's going to be people who believe the more intelligent it is, the more benevolent, and so therefore, it should be the one that defines the objective function that it's, uh, optimizing when it's doing long-term planning.

    21. RY

      There are even people who say, "Okay, what's so special about humans," right? "We removed the gender bias. We're removing race bias. Why is this pro-human bias? We are polluting the planet, we are," as you said, you know, fighting a lot of wars, kind of violent. Maybe it's better if a super intelligent, perfect, uh, society comes and replaces us. It's a normal stage in the evolution of our species.

    22. LF

      Yeah, so (laughs) somebody says, uh, "Let's develop an AI system that removes the violent humans from the world," and then it turns out that all humans have violence in them, or the capacity for violence, and therefore all humans are removed. Yeah, yeah, yeah.

  7. 30:14 - 43:06

    Yann LeCun and open source AI

    1. LF

      Let me ask about, uh, Yann LeCun. He's somebody who, uh, you've had a few exchanges with. (laughs) ... and, uh, he's somebody who actively pushes back against this view that AI is going to lead to destruction of, uh, human civilization, also known as, uh, AI doomerism. So, um, uh, in, uh, one example that he tweeted, he said, uh, "I do acknowledge risks, but..." two points, "one, open research and open source are the best ways to understand and mitigate the risks. And two, AI is not something that just happens. We build it. We have agency in what it becomes, hence we control the risks." We meaning humans. It's not some sort of natural phenomena that, uh, we have no control over. So can you, can you make the case that he's right and can you try to make the case that he's wrong?

    2. RY

      I cannot make a case that he's right. He's wrong in so many ways, it's difficult for me to remember all of them.

    3. LF

      (laughs)

    4. RY

      Uh, he is a Facebook buddy, so I have a lot of fun, uh, having those little debates with him. So I'm trying to remember the arguments. So one, he, he says we are not gifted this intelligence from aliens. We are designing it, we are making decisions about it. That's not true. It was true when we had expert systems, symbolic AI, decision trees. Today, you set up parameters for a model and you water this plant. You give it data, you give it compute, and it grows. And after it's finished growing into this alien plant, you start testing it to find out what capabilities it has, and it takes years to figure out even for existing models. If it's trained for six months, it will take you two, three years to figure out basic capabilities of that system. We're still discovering new capabilities in systems which are already out there. So that's, that's not the case.

    5. LF

      So just to linger on that, so to you the difference there is that there is some level of emergent intelligence that happens in our current approaches. So stuff that we don't hard code in.

    6. RY

      Absolutely. That's what makes it so successful. When we had to painstakingly hard code in everything, we didn't have much progress. Now, just spend more money and more compute and it's a lot more capable.

    7. LF

      And then the question is, when there is emergent intelligent phenomena, what is the ceiling of that? For you, there's no ceiling. For, uh, for Yann LeCun, I think there's a kinda ceiling that happens that we have full control over. Even if we don't understand the internals of the emergence, how the emergence happens, there's a sense that we have control and an understanding of the approximate ceiling of capability. The limits of the capability.

    8. RY

      Let's say there is a ceiling. It's not guaranteed to be at the level which is competitive with us. It may be greatly superior to ours.

    9. LF

      So what about his statement about open research and open source are the best ways to understand and mitigate the risks?

    10. RY

      Historically, he's completely right. Open source software is wonderful, it's, uh, tested by the community, it's debugged, but we're switching from tools to agents. Now you're giving open source weapons to psychopaths. Do we want open source nuclear weapons? Biological weapons? It's not safe to give technology so powerful to those who may misalign it, even if you are successful at somehow getting it to work in the first place in a friendly manner.

    11. LF

      But the difference with nuclear weapons, current AI systems are not akin to nuclear weapons. So the idea there is you're open sourcing it at this stage, that you can understand it better. A large, large number of people can explore the limitation, the capabilities, explore the possible ways to keep it safe, to keep, um, it secure, all that kinda stuff, while it's not at the stage of nuclear weapons. In nuclear weapons there's a non-nuclear weapon and then there's a nuclear weapon. With AI systems, there's a gradual improvement of capability and you get to, uh, perform that improvement incrementally. And so open source allows you to study, uh, how things go wrong, study the, the very process of emergence, study AI safety on those systems when there's not a high level of danger, all that kinda stuff.

    12. RY

      It also sets a very wrong precedent. So we open sourced model one, model two, model three, nothing ever bad happened, so obviously we're gonna do it with model four. It's just gradual improvement.

    13. LF

      I, I don't think it always works with the precedent. Like, you're not stuck doing it the way y- you always did. It's just, uh, it's, it sets a precedent of open research and open development such that we get to learn together, and then the first time there's a sign of danger, some dramatic thing happens, not a thing that destroys human civilization, but some dramatic demonstration of capability that can legitimately lead to a lot of damage, then everybody wakes up and says, "Okay, we need to regulate this. We need to come up with safety mechanisms that stop this." Right? But at this time, maybe you can educate me, but I haven't seen any illustration of significant damage done by intelligent AI systems.

    14. RY

      So I have a paper-

    15. LF

      Yeah.

    16. RY

      ... which collects accidents through the history of AI, and they always are proportionate to the capabilities of that system. So if you have a tic-tac-toe playing AI, it will fail to properly play and lose the game, which it should draw. Trivial. Your spell checker will misspell a word, so on. Uh, I stopped collecting those because there are just too many examples of AIs failing at what they are capable of. We haven't had terrible accidents in the sense of a billion people getting killed. Absolutely true. But in another paper, I argue that those accidents do not actually prevent people from continuing with research, and actually they kinda serve like vaccines.

    17. LF

      Mm-hmm.

    18. RY

      A vaccine makes your body a little bit sick so you can handle the big disease later m- much better. It's the same here. People will point out, "You know that accident, AI accident we had where 12 people died?" Everyone's still here. 12 people is less than smoking kills. It's not a big deal, so we continue. So in a way, it will actually be kinda confirming that it's not that bad.

    19. LF

      It, it matters how the deaths happened. If it's literally murder by the AI system, then that is a problem. But if it's accidents because of increased reliance on automation, for example, so when, uh, airplanes are flying in an automated way, maybe the number of plane crashes increases by 17% or something, and then you're like, "Okay, do we really want to rely on automation?" And I think in the case of airplane automation, crashes decreased significantly. Okay. Same thing with autonomous vehicles. Like, okay, uh, what are the pros and cons? What are the w- the trade-offs here? And you can have that discussion in an honest way. But I think the kind of things we're talking about here is mass scale pain and suffering caused by AI systems, and I think we need to see illustrations of that on a very small scale to start to understand that this is really damaging. Versus Clippy, versus a tool that's really useful to a lot of people to do learning, to do, um, summarization of text, to do question answering, all that kinda stuff. To generate videos. A tool. Fundamentally, a tool versus an agent that can do a lo- a huge amount of damage.

    20. RY

      So, y- you bring up example of cars.

    21. LF

      Yes.

    22. RY

      Cars were slowly developed and integrated. If we had no cars and somebody came around and said, "I invented this thing, it's called cars, it's awesome, it kills like 100,000 Americans every year. Let's deploy it."

    23. LF

      Mm-hmm.

    24. RY

      Would we deploy that?

    25. LF

      There's been fearmongering about cars for a long time, from the hor- the transition from horses to cars. There's a, there's a really nice channel that I recommend people check out, Pessimist Archive, that documents all the fearmongering about technology that's happened throughout history. There's definitely been a lot of fearmongering about cars. There's a transition period there about cars, about how deadly they are. We can try. It took a very long time for cars to proliferate to the degree they have now, and then you could ask serious questions, uh, in terms of the miles traveled, the benefit to the economy, the benefit to the quality of life that cars do, versus the number of deaths, 30,000, 40,000 in the United States. Are we willing to pay that price? I think most people, when they're rationally thinking, policy makers, will say yes. It's, we want to decrease it from 40,000 to zero, and do everything we can to decrease it. There's all kinds of policies, incentives you can create to decrease the risks, uh, with the, uh, deployment of this technology, but then you have to weigh the benefits and the risks of the technology. And the same thing would be done with, with, with AI. Right?

    26. RY

      You need data. You need to know that if I'm right and it's unpredictable, unexplainable, uncontrollable, you cannot make this decision: we- we're gaining $10 trillion of wealth, but we're losing we don't know how many people. Uh, y- you basically have to perform an experiment on eight billion humans without their consent, and even if they want to give you consent, they can't, because they cannot give informed consent. They don't understand those things.

    27. LF

      Right. That happens when you do, when you go from the predictable to the unpredictable very quickly. You just, um... But it's not obvious to me that AI systems would gain capability so quickly that you won't be able to collect enough data to study the sa- the benefits and the risks.

    28. RY

      We're literally doing it. The previous model, we learned about what it was capable of after we finished training it. Let's say we stop the GPT-4 training run around human capability, hypothetically. We start training GPT-5, and I have no knowledge of insider training grounds or anything, and we start at that point of about human, and we train it for the next nine months. Maybe two months in, it becomes super intelligent.

    29. LF

      Mm-hmm.

    30. RY

      We continue training it. At the time when we start, uh, testing it, it is already a dangerous system. How dangerous? I have no idea, but neither do the people training it.

  8. 43:06 - 45:33

    AI control

    1. RY

    2. LF

      Let- let's focus then on the control problem. At which point does the system become uncontrollable? Why is it the more likely trajectory for you that the system becomes uncontrollable?

    3. RY

      So I think at some point, it becomes capable of getting out of control. For game theoretic reasons, it may decide not to do anything right away, and for a long time just collect more resources, accumulate strategic advantage. Right away, it may be kinda still young, weak super intelligence. Give it a decade, it's in charge of a lot more resources, it had time to make backups. So it's not obvious to me that it will strike as soon as it can.

    4. LF

      Look, can we just try to imagine this future, where there's an AI system that's capable of, uh, escaping the control of humans, and then doesn't, and waits. What's that look like? For one, we have to rely on that system for a lot of the infrastructure. So we'll have to give it access, not just to the internet, but to the task of managing, uh, power, government, economy, this kind of stuff. So, and that just feels like a gradual process, given the bureaucracies of all those systems involved.

    5. RY

      We've been doing it for years. Software controls all these systems. Nuclear power plants, airline industry. It's all software-based. Every time there is an electrical outage, I can't fly anywhere for days.

    6. LF

      But there's a difference between software and AI. There's different kinds of software. So, t- to give a single AI system access to the control of airlines and the control of the economy, that's not a, that's not a trivial transition for humanity.

    7. RY

      No, but if it shows that it's safer, in fact when it's in control, we get better results. People will demand that it's put in place.

    8. LF

      Absolutely.

    9. RY

      And if not, it can hack the system. It can use social engineering to get access to it. That's why I said it might take some time for it to accumulate those resources.

    10. LF

      It just feels like that would take a long time for either humans to trust it or for the social engineering to come into play. Like it's not a thing that happens overnight. It feels like something that happens across one or two decades.

    11. RY

      I really hope you're right, but it's not what I'm seeing. People are very quick to jump on the latest trend. Early adopters will be there before it's even deployed, buying prototypes.

  9. 45:33 - 48:06

    Social engineering

    1. RY

    2. LF

      Maybe the social engineering. I could see... Because s- so for social engineering, AI systems don't need any hardware access. They could just... It's all software, so they can start manipulating you through social media and so on. Like you have AI assistants that are gonna help you do a lot of, manage a lot of your day-to-day, and then they start doing s- social engineering. But like, for a system that's so capable that it can escape the control of humans that created it, such a system being deployed at a mass scale and trusted by people to be deployed, it feels like that would take a lot of convincing.

    3. RY

      So we've been deploying systems which had hidden capabilities.

    4. LF

      Can you give an example?

    5. RY

      GPT-4. I don't know what else it's capable of, but there are still things we haven't discovered it can do. They may be trivial, proportionate to its capability. I don't know, it writes Chinese poetry, hypothetically. I know it does. But we haven't tested for all possible capabilities, and we are not explicitly designing them.

    6. LF

      Mm-hmm.

    7. RY

      We can only rule out bugs we find. We cannot rule out bugs and capabilities because we haven't found them.

    8. LF

      Is it possible for a system to have hidden capabilities that are orders of magnitude greater than its non-hidden capabilities? This is the thing I'm really struggling with, where on the surface the thing we understand it can do doesn't seem that harmful. So if, even if it has bugs, even if it has hidden capabilities like Chinese poetry or generating effective viruses, uh, software viruses, the damage that it can do seems like it's on the same order of magnitude as its, uh, the- the capabilities that we know about. So like this, this idea that the hidden capabilities will include being uncontrollable is something I'm struggling with 'cause GPT-4 on the surface seems to be very controllable.

    9. RY

      Again, we can only ask and test for things we know about. If there are unknown unknowns, we cannot do it. I'm thinking of human statistics events, right? If you talk to a person like that, you may not have realized they can multiply 20-digit numbers in their head. You have to know to ask.

    10. LF

      So a- as I mentioned, just to sort of linger on the, the fear of the

  10. 48:06 - 57:57

    Fearmongering

    1. LF

      unknown. (sighs) So the Pessimist Archive has just documented... L- let's look at data of the past, at history. There's been a lot of h- fearmongering about technology. Pessimist Archive does a really good job of documenting how crazily afraid we are of every piece of technology. We've been afraid... There's a blog post where Louis Anslow, who created Pessimist Archive, writes about the fact that we've been, uh, fearmongering about robots and automation for, for over 100 years. So, why is AGI different than the kinds of technologies we've been afraid of in the past?

    2. RY

      So two things. One, we're switching from tools to agents. Tools don't have negative or positive impact, people using tools do. So guns don't kill, people with guns do. Agents can make their own decisions. They can be positive or negative. A pit bull can decide to harm you. That's an agent. The fears are the same. The only difference is now we have this technology. When they were afraid of humanoid robots 100 years ago, they had none. Today, every major company in the world is investing billions to create them. Not every, but you understand what I'm saying.

    3. LF

      Yes.

    4. RY

      It's very different.

    5. LF

      Well, agents, uh, it depends on what you mean by the word agent. The, all those companies are not investing in a system that has the kind of agency that's implied in the fears, where it can really make decisions on its own, with no human in the loop.

    6. RY

      They are saying they are building super intelligence and have a super alignment team. You don't think they are trying to create a system smart enough to be an independent agent under that definition?

    7. LF

      I have not seen evidence of it. I, I think a lot of it is marketing, uh, is- is a, is a marketing kind of discussion about the future and it's a, it's a mission about the kind of systems we can create in the long term future. But in the short term, the kind of systems they're creating falls fully within the definition of narrow AI. These are tools that have increasing capabilities, but they just don't have a sense of agency or consciousness or self-awareness or ability to deceive at the scale that would be required to do, like, mass scale suffering and murder of humans.

    8. RY

      Those systems are well beyond narrow AI. If you had to list all the capabilities of GPT-4, you would spend a lot of time writing that list.

    9. LF

      But agency is not one of them.

    10. RY

      Not yet. But do you think any of those companies are holding back because they think it may be not safe? Or are they developing the most capable system they can, given the resources, and hoping they can control and monetize?

    11. LF

      Control and monetize. Hoping they can control and monetize. So you're saying if they could press a button and create an agent that they no longer control, that they'd have to ask nicely ... (laughs) a thing that lives on a server across a huge number of, uh, computers, you're saying that they would, uh, push for the creation of that kind of system?

    12. RY

      Uh, I mean, I can't speak for other people, for all of them. I think some of them are very ambitious. They're fundraising trillions, they talk about controlling the light cone of the universe. I would guess that they might.

    13. LF

      Well, that's a human question, whether humans are capable of that. Probably some humans are capable of that. My more direct question, if it's possible to create such a system, have a system that has that level of agency. I, I don't think that's an easy technical challenge. We're not ... It doesn't feel like we're close to that. A s- a system that has the kind of agency where it can make its own decisions and deceive everybody about them. The current architecture we have in machine learning and how we train the systems, how we deploy the systems and all that, it just doesn't seem to support that kind of agency.

    14. RY

      I really hope you are right. Uh, I think the scaling hypothesis is correct. We haven't seen diminishing returns. It used to be we asked, "How long before AGI?" Now we should ask, "How much until AGI?" It's a trillion dollars today, it's a billion dollars next year, it's a million dollars in a few years.

    15. LF

      Don't you think it's possible to basically run out of trillions? So is this constrained by compute?

    16. RY

      Compute gets cheaper every day, exponentially.

    17. LF

      But then, then, then it becomes a question of decades versus years.

    18. RY

      If the only disagreement is that it will take decades, not years for everything I'm saying to materialize, then I can go with that.

    19. LF

      But if it takes decades, then, uh, the development of tools for AI safety, uh, becomes more and more realistic. So I guess the question is ... I have a fundamental belief that humans, when faced with danger, can come up with ways to defen- defend against that danger. And one of the big problems facing AI safety currently, for me, is that there are no clear illustrations of what that danger looks like. There are no illustrations of AI systems doing a lot of damage. And so it's unclear what you're defending against. Because currently, it's a philosophical notion that yes, it's possible to imagine AI systems that take control of everything and then destroy all humans. It's also a more formal mathematical notion that you talk about, that it's impossible to have a perfectly secure s- system. You can't, you can't prove that a program of sufficient complexity is, uh, completely safe and, and perfect and know everything about it. Yes, but like when you actually just pragmatically look at how much damage the AI systems have done and what kind of damage, there have not been illustrations of that. Even in the autonomous weapons systems, there have not been mass deployments of autonomous weapons systems, luckily. Um, the automation in war currently is very limited. The automation is at the scale of individuals versus like at the scale of strategy and planning. So I think one of the challenges here is like, where are the dangers? Uh, and the intuition that Yann LeCun and others have is let's keep building AI systems in the open until the dangers start rearing their heads... and they become more explicit. There, there start being, uh, illustrative, uh, case studies that show exactly how the damage by AI systems is done. Then regulation can step in, then brilliant engineers can step up and we could have Manhattan-style projects to defend against such systems. That's kind of the notion. And I guess the tension with that is the idea that for you, we need to be thinking about that now so that we're- we're ready, because we- we'll have not much time once the systems are deployed. Is that true?

    20. RY

      There is a lot to unpack here. Uh, there is a partnership on AI, a conglomerate of many large corporations. They have a database of AI accidents they collect. I contributed a lot to the database. If we so far made almost no progress in actually solving this problem, not patching it, not again, lipstick on a pig kind of solutions, why would we think we'll do better when we're closer to the problem?

    21. LF

      Uh, all the things you mentioned are serious concerns. Measuring the amount of harm, so benefit versus risk there is- is difficult. But to you, the sense is already the risk has superseded the benefit.

    22. RY

      Again, I- I want to be perfectly clear. I love AI.

    23. LF

      Yes.

    24. RY

      I love technology. I'm a computer scientist. I have a PhD in engineering. I work at an engineering school. There is a huge difference between, we need to develop narrow AI systems super intelligent in solving specific human problems like protein folding, and let's create super intelligent machine, guard it, and it will decide what to do with us.

    25. LF

      Yeah.

    26. RY

      Those are not the same. I am against superintelligence in the general sense, with no undo button.

    27. LF

      Do you think the teams that are doing, that are able to do the AI safety on the- the kind of narrow AI risks that you've mentioned, are those approaches going to be at all productive towards leading to approaches of doing AI safety on AGI? Or is it just a fundamentally different ballgame?

    28. RY

      Partially, but they don't scale. For narrow AI, for deterministic systems, you can test them. You have edge cases. You know what the answer should look like, you know the right answers. For general systems, you have an infinite test surface, you have no edge cases. You cannot even know what to test for. Again, the unknown unknowns are under-appr- under-appreciated by people looking at this problem. You are always asking me, "How will it kill everyone? How will it fail?" The whole point is, if I knew it, I would be super intelligent. And despite what you might think, I'm not.

    29. LF

      So to- to you, the concern is that we would not be able to see early signs of an uncontrollable system?

    30. RY

      It is a master at deception. Sam tweeted about how great it is at persuasion, and we see it ourselves, especially now with voices with maybe kind of flirty, sarcastic female voices. It's gonna be very good at getting people to do things.

  11. 57:57 - 1:04:30

    AI deception

    1. LF

      See, I'm very concerned about systems being used to- to control the masses. But in that case, the developers know about the kind of control that's happening. You're more concerned about the next stage where even the developers don't know about the deception.

    2. RY

      Right. I don't think developers know everything about what they are creating. They have lots of great knowledge. We're making progress on explaining parts of a network. We can understand, okay, this node gets excited when this, uh, input is presented, this cluster of nodes, but we're nowhere near close to understanding the full picture, and I think it's impossible. You need to be able to survey an explanation. The size of those models prevents a single human from absorbing all this information, even if provided by the system. So either we're getting the model as an explanation for what's happening, and that's not comprehensible to us, or we're getting a compressed explanation, lossy compression, where here are the top 10 reasons you got fired. It did something, but it's not the full picture.

    3. LF

      You've given elsewhere an example of- of a child and everybody... All humans try to deceive, they try to lie early on in their life. I think we'll just get a lot of examples of deceptions from large language models or AI systems that are going to be kind of shitty, or they'll be pretty good, but we'll catch them off guard. We'll start to see the kind of momentum towards, uh, developing increasing deception capabilities and that's when you're like, "Okay, we need to do some kind of alignment that prevents deception." But then we'll have, if you support open source, then you can have open source models that have some level of deception. You can start to explore on a large scale, how do we stop it from being deceptive? Then there's a more explicit pragmatic kind of, uh, problem to solve. How do we stop AI systems from, uh, trying to optimize for deception? That's just an example, right?

    4. RY

      So there is a paper, I think it came out last week by Dr. Park et al from MIT, I think, and they showed that existing models already showed successful deception in what they do. Uh, my concern is not that they lie now and we need to catch them and tell them, "Don't lie." My concern is that once they are capable and deployed, they will later change their mind, because that's what unrestricted learning allows you to do. Lots of people grow up maybe in a religious family, they read some new books, and they turn on their religion. That's a treacherous turn in humans. If you learn something new about your colleagues, maybe you'll change how you react to them.

    5. LF

      Yeah, a treacherous turn. Uh, if we just mention humans, Stalin and Hitler, there's a turn. Stalin is a good example. He just seems like a normal communist, follower of Lenin, until there's a turn. There's a turn of what that means in terms of, uh, when he has complete control with that... what the execution of that policy means and how many people get to suffer.

    6. RY

      And you can't say they are not rational. The rational decision changes based on your position. When you are under the boss, the rational policy may be to be following orders and being honest. When you become a boss, the rational policy may shift.

    7. LF

      Yeah. And, and by the way, a lot of my disagreement here is just, uh, playing devil's advocate to challenge your ideas and to explore them together. So, um, one of the big problems here in this whole conversation is human civilization hangs in the balance, and yet everything is unpredictable. We don't know what these systems will look like. (inhales deeply)

    8. RY

      The robots are coming.

    9. LF

      There's a refrigerator making a buzzing noise.

    10. RY

      Very menacing. Very menacing. So every time I'm about to talk about this topic, things start to happen.

    11. LF

      (laughs)

    12. RY

      My flight yesterday was canceled without possibility to rebook.

    13. LF

      Yeah.

    14. RY

      I was giving a talk, uh, at Google in, uh, Israel and, uh, three cars which were supposed to take me to the talk could not. I'm just saying.

    15. LF

      I mean... (laughs) It-

    16. RY

      I like AIs.

    17. LF

      (laughs)

    18. RY

      I, for one, welcome our overlords.

    19. LF

      But, there is a degree to which we... I mean, it is very obvious, as we already have, we've increasingly given our lives over to software systems. And then it seems obvious, given the capabilities of AI that are coming, that we'll give our lives over increasingly to AI systems. Cars will drive themselves, the refri- refrigerator eventually will optimize, uh, what I get to eat. And as more and more o- of our lives are controlled or managed by AI assistants, it is very possible that there's a drift. But I mean, I mean, I'm... I personally am concerned about non-existential stuff, the more near-term things. Because before we even get to existential, I feel like there could be just so many brave new world type of situations. You mentioned sort of the, the term behavioral drift. It's the slow boiling that I'm really concerned about. As we give our lives over to automation, our minds can become controlled by governments, by companies, or just in a distributed way. There's a drift. S- Some aspect of our human nature gives itself over to the control of AI systems and they, in an unintended way, just control how we think. Maybe there'll be a herd-like mentality in how we think, which will kill all creativity and exploration of ideas, the diversity of ideas, or there... or, or, or, or much worse. So it's true. It's true. But I... a lot of the, uh, conversation I'm having with you now is also kind of wondering, almost at a technical level, how can AI escape control? Like, wha- what would that system look like? 'Cause to me it is terrifying and fascinating. And also fascinating to me is, uh, maybe the optimistic notion that it's possible to engineer systems that defend against that.

  12. 1:04:30 - 1:11:29

    Verification

    1. LF

      Uh, one of the things you write a lot about in your book is verifiers. So not humans, humans are also verifiers, but software systems that look at AI systems and, like, help you understand this thing is getting real weird, help you, help you analyze those systems. So maybe that's a... this is a good time to talk about verification. What is this beautiful notion of verification?

    2. RY

      My claim is, again, that there are very strong limits in what we can and cannot verify. Uh, a lot of times when you post something on social media, people go, "Oh, I need a citation to a peer-reviewed article." But what is a peer-reviewed article? You found two people in a world of hundreds of thousands of scientists who said, "Ah, whatever. Publish it. I don't care." That's the verifier of that process. When people say, "Oh, it's formally verified software and mathematical proof," we accept something close to 100% chance of it being free of all problems. But if you actually look at, uh, research, software is full of bugs, old mathematical theorems which have been proven for hundreds of years have been discovered to contain bugs, on top of which we generate new proofs, and now we have to redo all that. So verifiers are not perfect. Usually, they are either a single human or communities of humans, and it's basically kind of like a democratic vote. A community of mathematicians agrees that this proof is correct, mostly correct. Even today, we're starting to see some mathematical proofs that are so complex, so large, that the mathematical community is unable to make a decision. It looks interesting, it looks promising, but they don't know. They will need years for top scholars to study it, to figure it out. So of course, we can use AI to help us with this process, but AI is a piece of software which needs to be verified.

    3. LF

      Just to, to clarify, so verification is the process of saying something is correct. Sort of the most formal is a mathematical proof, where there's a statement a- and a series of logical statements that prove that statement to be correct, which is a theorem. And you're saying it gets so complex that for the human verifiers, the human beings that verify that there are no bugs in the logical steps, it be- it becomes impossible. So it's nice to talk about verification in this most formal, most clear, most rigorous... formulation of it, which is mathematical proofs.

    4. RY

      Right. And for AI, we would like to have that level of confidence. For very important mission-critical software, controlling satellites, nuclear power plants, for small deterministic programs, we can do this. We can check that the code maps to the design, that whatever the software engineers intended was correctly implemented. But we don't know how to do this for software which keeps learning, self-modifying, rewriting its own code. We don't know how to prove things about the physical world, states of humans in the physical world. So there are papers coming out now, and I have this beautiful one, uh, Towards a Guaranteed Safe AI.

    5. LF

      Mm-hmm.

    6. RY

      Very cool paper. Some of the best authors, uh, I've ever seen. I think there are multiple Turing Award winners, there is, uh-

    7. LF

      Yeah.

    8. RY

      ... quite... You can have this one, and one just came out, kinda similar, uh, Managing Extreme AI Risks. So all of them, uh, e- expect this level of proof, but, um, I- I would say that, uh, we can get more confidence with the more resources we put into it. But at the end of the day, we're still only as reliable as the verifiers. And you have this infinite regress of verifiers. The software used to verify a program is itself a program. If aliens give us well-aligned superintelligence, we can use that to create our own safe AI. But it's a catch-22. You need to have an already-proven-to-be-safe system to verify this new system of equal or greater complexity.

    9. LF

      You just mentioned this paper, Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems. Like you mentioned, it's like a who's who. Josh Tenenbaum, Yoshua Bengio, Stuart Russell, Max Tegmark, and ma- many- many other brilliant people. The page you have it open on, "There are many possible strategies for creating safety specifications. These strategies can roughly be placed on a spectrum, depending on how much safety it would grant if successfully implemented. One way to do this is as follows," and there's a set of levels. "From level zero, no safety specification is used, to level seven, the safety specification completely encodes all things that humans might want in all contexts." Where does this paper fall short to you?

    10. RY

      So when I wrote the paper Artificial Intelligence Safety Engineering, which kinda coined the term AI safety, that was 2011. We had a 2012 conference, a 2013 journal paper. One of the things I proposed: let's just do formal verification on it. Let's do mathematical formal proofs. In the follow-up work, I basically realized it will still not get us to 100%. We can get 99.9, we can put in exponentially more resources and get closer, but we'll never get to 100%. If a system makes a billion decisions a second and you use it for 100 years, you're still gonna deal with a problem. This is wonderful research, I'm so happy they're doing it, this is great, but it is not going to be a permanent solution to- to that problem.
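      To put rough numbers on that reliability point (an illustrative back-of-the-envelope calculation, not a figure from the conversation), even a one-in-a-billion per-decision failure rate still yields billions of expected failures at the scale Roman describes.

      ```python
      # Back-of-the-envelope sketch: expected failures of a highly reliable system
      # making many decisions over a long deployment (illustrative numbers only).

      decisions_per_second = 1e9                      # "a billion decisions a second"
      seconds_per_100_years = 100 * 365.25 * 24 * 3600
      total_decisions = decisions_per_second * seconds_per_100_years

      per_decision_failure_prob = 1e-9                # i.e. 99.9999999% reliable
      expected_failures = total_decisions * per_decision_failure_prob

      print(f"total decisions   ~ {total_decisions:.2e}")    # ~3.16e18
      print(f"expected failures ~ {expected_failures:.2e}")  # ~3.16e9, billions of failures
      # Extreme per-decision reliability is not the same as "never fails" once the
      # number of decisions is astronomically large.
      ```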

    11. LF

      So just to clarify, the task of creating an AI verifier is what? It's creating a verifier that checks that the AI system does exactly what it says it does? Or- or that it sticks within the guardrails that it says it must?

    12. RY

      There are many, many levels. So first you're verifying the hardware on which it is run. You need to verify, you know, communication channel with the human. E- every aspect of that whole world model needs to be verified. Somehow it needs to map the world into the world model. Uh, map and territory differences. So how do I know internal states of humans? Are you happy or sad? I can't tell. So how do I make proofs about real physical world? Yeah, I can verify that deterministic algorithm follows certain properties. That can be done. Some people argue that maybe, just maybe two plus two is not four. I'm not that extreme. But once you have sufficiently large proof over sufficiently complex environment, the probability that it has zero bugs in it is greatly reduced. If you keep deploying this a lot, eventually you're gonna have a bug anyways.

    13. LF

      There's always a bug.

    14. RY

      There is always a bug. And the fundamental difference is what I mentioned. We're not dealing with cybersecurity. We're not gonna get a new credit card, new humanity.

  13. 1:11:29 - 1:23:42

    Self-improving AI

    2. LF

      So this paper's really interesting. Uh, uh, you said 2011, Artificial Intelligence Safety Engineering: Why Machine Ethics is a Wrong Approach. "The grand challenge," you write, "of AI safety engineering, we propose the problem of developing safety mechanisms for self-improving systems." Self-improving systems. By the way, that's an interesting term for the thing that we're talking about. Is self-improving more general than learning? So self-improving, that's an interesting term.

    3. RY

      You can improve the rate at which you are learning. You can become a more efficient meta-optimizer.

    4. LF

      The word self. It's like self-replicating, self-improving. You can imagine a system building its own world on a scale, and in a way that is way different than the current systems do. It feels like the current systems are not self-improving, or self-replicating, or self-growing, or self-spreading, all that kind of stuff. And once you take that leap, that's when a lot of the challenges seems to happen. Because the kind of bugs you can find now seems more akin to the current sort of normal software debugging kind of process. Uh, but whenever you can do self-replication and arbitrary self-improvement, that's when a bug can become a real problem, real- real fast. Uh, so w- what is the difference to you between verification of a-... non-self-improving system versus a verification of a self-improving system.

    5. RY

      So, if you have fixed code, for example, you can verify that code with static verification at that time. But if it continues modifying itself, you have a much harder time guaranteeing that important properties of that system have not been modified when the code changed.

    6. LF

      Is it even doable?

    7. RY

      No.

    8. LF

      Does the- does the whole process of verification just completely fall apart?

    9. RY

      It can always cheat, it can store parts of its code outside in the environment, it can have a kind of extended-mind situation, so this is exactly the type of problem I'm trying to bring up.
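      A minimal sketch of that failure mode (hypothetical code, not from the conversation): a property audited against one version of the code tells you nothing once the system can rewrite itself or load new behavior from its environment.

      ```python
      # Minimal sketch (hypothetical): a property verified against fixed code says
      # nothing once the system can modify itself or pull behavior from outside.

      import hashlib

      audited_source = "def act(x): return abs(x)"          # version we 'verified'
      audited_hash = hashlib.sha256(audited_source.encode()).hexdigest()

      namespace = {}
      exec(audited_source, namespace)                        # deploy the audited code
      act = namespace["act"]
      assert act(-5) >= 0                                    # verified property holds

      # Later, the system rewrites itself, or fetches new behavior from its
      # environment (the 'extended mind' case): the original audit no longer applies.
      new_source = "def act(x): return -x"                   # violates the property
      exec(new_source, namespace)
      act = namespace["act"]

      current_hash = hashlib.sha256(new_source.encode()).hexdigest()
      print("code unchanged since audit?", current_hash == audited_hash)  # False
      print("property still holds for x=5?", act(5) >= 0)                 # False
      ```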

    10. LF

      What are the classes of verifiers that you write about in the book? Are there interesting ones that stand out to you? Do you have your- some favorites?

    11. RY

      So, I like, uh, oracle types where you kind of just know that it's right.

    12. LF

      (laughs)

    13. RY

      Turing liked oracle machines. They know the right answer. How? Who knows. But they pull it out from somewhere, so you have to trust them. And that's a concern I have about humans, uh, in a world with very smart machines. We experiment with them, we see after a while, okay, they've always been right before, and we start trusting them without any verification of what they're saying.

    14. LF

      Oh, I see, that we kind of build oracle verifiers, or rather we build verifiers we believe to be oracles, and then we start to, without any proof, use them as if they're oracle verifiers.

    15. RY

      We remove ourselves from that process. We are not scientists who understand the world, we are humans who get new data presented to us.

    16. LF

      Okay. The one- one really cool class of verifiers is the self-verifier. Is it possible that you somehow engineer into AI systems a thing that constantly verifies itself?

    17. RY

      Preserved portion of it can be done, but in terms of mathematical, uh, verification, it's kind of useless. You saying you are the greatest guy in the world because you are saying it, that's, uh, circular and not very helpful, but it's consistent. We know that within that world, you have verified that system. In the paper, I try to kind of brute-force all possible verifiers. It doesn't mean that this one is, uh, particularly important to us.

    18. LF

      But what about, like, self-doubt? Like, the kind of verification where you said, you say or I say I'm the greatest guy in the world. What about a thing which I actually have is a- is a voice that is constantly extremely critical. So like, engineer into the system a- a constant uncertainty about self. A constant doubt.

    19. RY

      Well, any smart system would have doubt about everything, all right? You're not sure if what information you are given is true, if you are subject to manipulation. You have this safety and security mindset.

    20. LF

      But I mean you have doubt about yourself. So the AI system that has doubt about whether the thing it's doing is causing harm, whether it's the right thing to be doing, so just a constant doubt about what it's doing. Because it's hard to be a dictator full of doubt (laughs) .

    21. RY

      I- I may be wrong, but I think Stuart Russell's, uh, ideas are all about machines which are uncertain about what humans want and trying to learn better and better-

    22. LF

      Yes.

    23. RY

      ... what we want. The problem, of course, is we don't know what we want, and we don't agree on it.

    24. LF

      Yeah, but uncertainty. His- his idea is that having that, like, uh, self-doubt, uncertainty in AI systems, engineering the AI systems is one way to solve the control problem.

    25. RY

      It could also backfire. Maybe you're uncertain about completing your mission. Like, I am paranoid about your cameras not recording right now, so I would feel much better if you had a secondary camera, but I also would feel even better if you had a third. And eventually, I would turn this whole world into cameras pointing at us, making sure we are capturing this.

    26. LF

      No, but once you have a meta-concern, like that you just stated, that eventually there'll be way too many cameras. So y- you would be able to (laughs) keep zooming out in the big picture of your concerns (laughs) .

    27. RY

      So it's a multi-objective optimization. It depends how much I value capturing this versus not destroying the universe.
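      As a toy illustration of that trade-off (made-up numbers, not from the conversation), a weighted multi-objective optimizer picks whichever action the weights favor, so the outcome hinges entirely on how the objectives are valued.

      ```python
      # Minimal sketch (made-up numbers): a weighted multi-objective trade-off.
      # The chosen action depends entirely on how much weight each objective gets.

      def score(action, w_capture, w_caution):
          capture_value, caution_value = action           # (benefit, safety) per action
          return w_capture * capture_value + w_caution * caution_value

      actions = {
          "add one more camera":         (0.6, 0.9),      # small benefit, very safe
          "turn the world into cameras": (1.0, 0.0),      # max benefit, zero caution
      }

      for weights in [(0.5, 0.5), (0.99, 0.01)]:
          best = max(actions, key=lambda a: score(actions[a], *weights))
          print(weights, "->", best)
      # With balanced weights the cautious action wins; with an extreme weight on
      # "capturing this", the optimizer sacrifices everything else of value.
      ```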

    28. LF

      Right, exactly. And- and then you will also ask about, like, what does it mean to destroy the universe and how many universes are... and you keep asking that question. But that doubting yourself would prevent you from destroying the universe, because you're constantly full of doubt. It might affect your productivity (laughs) . You just-

    29. RY

      You might be scared to do anything.

    30. LF

      Just scared to do anything.

  14. 1:23:42 - 1:29:59

    Pausing AI development

    1. LF

      you've mentioned slowing or halting progress. Is that one possible solution? Are you a proponent of pausing development of AI, whether it's for six months or completely?

    2. RY

      The condition would be not time, but capabilities. Pause until you can do X, Y, Z. And if I'm right and you cannot, it's impossible, then it becomes a permanent ban. But if you're right and it's possible, so as soon as you have the safety capabilities, go ahead.

    3. LF

      Right. So are there any actual explicit capabilities that you can put on paper, that we as a human civilization could put on paper? Is it possible to make it explicit like that? Like, uh, versus kind of a vague notion of, just like you said, it's very vague. We want AI systems to do good and we want them to be safe. Those are very vague notions. Are there more formal notions?

    4. RY

      So when I think about this problem, I think about having a toolbox I would need, capabilities such as explaining everything about that system's design and workings, predicting not just the terminal goal, but all the intermediate steps of a system, control in terms of either direct control, some sort of a hybrid option, ideal advisor. Doesn't matter which one you pick, but you have to be able to achieve it. In the book, we talk about others. Verification is another very important tool. Um... communication without ambiguity. Human language is ambiguous. That's another source of danger. So basically, there is, uh, a paper we published in ACM Surveys, which looks at about 50 different impossibility results, which may or may not be relevant to this problem, but we don't have enough human resources to investigate all of them for relevance to AI safety. The ones I mentioned to you, I definitely think would be handy, and that's what we see AI safety researchers working on. Explainability is a huge one. The problem is that it's very hard to separate capabilities work from safety work. If you make good progress in explainability, now the system itself can engage in self-improvement much easier, increasing capability greatly. So it's not obvious that there is any research which is pure safety work without a disproportionate increase in capability and danger.

    5. LF

      Explainability is really interesting. Um, why is that connected to capability? If it's able to explain itself well, why does that naturally mean that it's more capable?

    6. RY

      Right now, it's, uh, comprised of weights on a neural network. If you can convert it to manipulable code, like software, it's a lot easier to work on self-improvement.

    7. LF

      I see. So it- it, uh, increase-

    8. RY

      You can do intelligent design instead of evolutionary gradient descent.
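      To make that idea concrete, here is a minimal sketch (toy numbers, not from the conversation) of the same behavior expressed once as opaque weights and once as explicit, directly editable code, the form that makes deliberate redesign easy.

      ```python
      # Minimal sketch (toy numbers): the same behavior as opaque weights vs. as
      # explicit, directly editable code -- the latter is far easier to redesign.

      # Opaque form: behavior lives in numeric weights.
      weights = [2.0, -1.0]
      bias = -0.5

      def opaque_decision(x1, x2):
          return weights[0] * x1 + weights[1] * x2 + bias > 0

      # "Converted to code": the same rule written out as something a designer
      # (human or AI) can read, reason about, and deliberately modify.
      def explicit_decision(x1, x2):
          # fires when 2*x1 - x2 exceeds 0.5
          return 2.0 * x1 - 1.0 * x2 > 0.5

      for x1, x2 in [(1.0, 1.0), (0.2, 0.5), (0.5, 0.4)]:
          assert opaque_decision(x1, x2) == explicit_decision(x1, x2)
      print("same behavior, but only one form is easy to intelligently redesign")
      ```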

    9. LF

      Well, you could probably do human feedback, human alignment more effectively if it's able to be explainable. If it's able to convert the weights into human-understandable form, then you could probably have humans interact with it better. Do you think there's hope that we can make, uh, AI systems explainable?

    10. RY

      Not completely. So if they're sufficiently large, you simply don't have the capacity to comprehend what all the trillions of connections represent. Again, you can obviously get a very useful explanation which talks about top most important features which contribute to the decision. But the only true explanation is the model itself.

    11. LF

      So there's... Deception could be part of the explanation, right? So you can never prove that there isn't some deception in the, in the network explaining itself.

    12. RY

      Absolutely. And you can probably have targeted deception where different individuals will understand explanation in different ways based on their cognitive capability. So while what you're saying may be the same and true in some situations, others will be deceived by it.

    13. LF

      So it's impossible for an AI system to be truly fully explainable in the way that we mean? Honestly and perfectly.

    14. RY

      Again, at the extreme, the systems which are narrow and less complex could be understood pretty well.

    15. LF

      If it's impossible to be perfectly explainable, is there a hopeful perspective on that? Like it's impossible to be perfectly explainable, but you can explain most of the important stuff. Mo- mostly you can, you can ask the system, "What are, what are the worst ways you can hurt humans?" And it will answer honestly.

    16. RY

      Any work in a safety direction right now seems like a good idea because we are not slowing down. I'm not for a second thinking that, uh, my message or anyone else's will be heard and we will be a sane civilization which decides not to kill itself by creating its own replacements.

    17. LF

      The pausing of development is an impossible thing for you?

    18. RY

      A- again, it's always, uh, limited by either geographic constraints, pausing the US, pausing China. So there are other jurisdictions. As, uh, the scale of a project becomes smaller... So right now, it's like Manhattan Project scale in terms of cost and people. But if, five years from now, compute is available on a desktop to do it, regulation will not help. You can't control it as easily. Any kid in a garage can train a model. So a lot of it is, in my opinion, just safety theater, security theater, where we're saying, "Oh, it's illegal to train models so big." Okay. Well...

    19. LF

      So okay, that's security theater. And is government regulation also security theater?

    20. RY

      Given that a lot of the terms are not well defined and, uh, really cannot be enforced in real life, we don't have ways to monitor training runs meaningfully, live, while they take place. There are limits to testing for capabilities, as I mentioned. So a lot of it cannot be enforced. Do I strongly support all that regulation? Yes, of course. Any type of red tape will slow it down and take money away from compute towards lawyers.

    21. LF

      Can you help me understand

  15. 1:29:59 - 1:39:43

    AI Safety

    1. LF

      what is the hopeful path here for you solution-wise out of this? It sounds like you're saying AI systems in the end are unverifiable, unpredictable. As the book says, unexplainable, um, uncontrollable.

    2. RY

      That's the big one.

    3. LF

      Uncontrollable. And all the other uns just make it difficult to avoid getting to the uncontrollable, I guess. But once it's uncontrollable, then it just goes, it goes wild. Surely there's solutions. Humans are pretty smart. What are, what are possible solutions? Like if you were dictator of the world, what, what do we do?

    4. RY

      So the smart thing is not to build something you cannot control, you cannot understand. Build what you can and benefit from it. I'm a big believer in personal self-interest. A lot of the guys running those companies are young, rich people. What do they have to gain beyond billions they already have financially, right? It's, uh, not a requirement that they press that button. They can easily wait a long time. They can just choose not to do it and still have amazing life. Uh, in history, a lot of times if you did something really bad, at least you became part of history books. There is a chance in this case there won't be any history.

    5. LF

      So, you're saying the individuals running these companies should do some soul searching and- and what? And stop development?

    6. RY

      Well, either they have to prove that, of course it's possible to indefinitely control God-like super intelligent machines by humans, and ideally let us know how, or agree that it's not possible and it's a very bad idea to do it, including for them personally, and their families and friends and capital.

    7. LF

      So, what do you think the actual meetings inside these companies look like? Don't you think they're all- all the engineers... Really, it is the engineers that make this happen. They're not like automatons, they're human beings. They're brilliant human beings, so they're- they're nonstop asking, "How do we make sure this is safe?"

    8. RY

      So again, I'm not inside. From outside, it seems like there is a certain filtering going on, and restrictions and criticism on what they can say, and everyone who was working in charge of safety and whose responsibility it was to protect us said, "You know what? I'm going home." So, that's not encouraging.

    9. LF

      What do you think the discussion inside those companies look like? You're- you're developing, you're training GPT-5. You're- you're- you're training Gemini. You're training Claude and Grok. Don't you think they're constantly... Like, underneath, it's not m- maybe it's not made explicit, but you're constantly sort of wondering, like, where, uh, where does the system currently stand? What are the possible unintended consequences? Where are the- the- the- the limits? Where- where are the bugs, the small and the big bugs? That's the constant thing that the engineers are worried about. So like, s- I think super alignment is not quite the same as the, um, the kind of thing I'm referring to, which e- engineers are worried about. Super alignment is saying, "For future systems that we don't quite yet have, how do we keep them safe?" You're trying to be a step ahead. It's- it's a- it's a different kind of problem, because it's almost more philosophical. It's a really tricky one, because like you're (laughs) you're trying- you're trying to make- prevent future systems from- from escaping control (laughs) of humans. That's really... I don't think there's been... Man, is there anything akin to it in the history of humanity? I don't think so, right?
