Skip to content
Dwarkesh PodcastDwarkesh Podcast

Carl Shulman (Pt 2) — AI Takeover, bio & cyber attacks, detecting deception, & humanity's far future

The second half of my 7 hour conversation with Carl Shulman is out! My favorite part! And the one that had the biggest impact on my worldview. Here, Carl lays out how an AI takeover might happen: * AI can threaten mutually assured destruction from bioweapons, * use cyber attacks to take over physical infrastructure, * build mechanical armies, * spread seed AIs we can never exterminate, * offer tech and other advantages to collaborating countries, etc Plus we talk about a whole bunch of weird and interesting topics which Carl has thought about: * what is the far future best case scenario for humanity * what it would look like to have AI make thousands of years of intellectual progress in a month * how do we detect deception in superhuman models * does space warfare favor defense or offense * is a Malthusian state inevitable in the long run * why markets haven't priced in explosive economic growth * & much more Carl also explains how he developed such a rigorous, thoughtful, and interdisciplinary model of the biggest problems in the world. 𝐄𝐏𝐈𝐒𝐎𝐃𝐄 𝐋𝐈𝐍𝐊𝐒 * Catch part 1 here: https://youtu.be/_kRg-ZP1vQc * Transcript: https://www.dwarkeshpatel.com/carl-shulman-2 * Apple Podcasts: https://bit.ly/3r1HBJk * Spotify: https://bit.ly/437t50c * Follow me on Twitter: https://twitter.com/dwarkesh_sp * Carl's blog: http://reflectivedisequilibrium.blogspot.com/ 𝐒𝐏𝐎𝐍𝐒𝐎𝐑𝐒 This episode is sponsored by 80,000 hours. To get their free career guide (and to help out this podcast), please visit 80000hours.org/lunar. 𝐓𝐈𝐌𝐄𝐒𝐓𝐀𝐌𝐏𝐒 00:00:00 - Intro 00:00:47 - AI takeover via cyber or bio 00:32:27 - Can we coordinate against AI? 00:53:49 - Human vs AI colonizers 01:04:55 - Probability of AI takeover 01:21:56 - Can we detect deception? 01:47:25 - Using AI to solve coordination problems 01:56:01 - Partial alignment 02:11:41 - AI far future 02:23:04 - Markets & other evidence 02:33:26 - Day in the life of Carl Shulman 02:47:05 - Space warfare, Malthusian long run, & other rapid fire

Carl ShulmanguestDwarkesh Patelhost
Jun 26, 20233h 7mWatch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:000:47

    Intro

    1. CS

      If you have an AI that produces bioweapons that could kill most humans in the world, then it's playing at the level of the superpowers in terms of mutually assured destruction. What are the particular zero-day exploits that the AI might use? The, uh, Conquistadors, with some technological advantage, uh, in terms of weaponry and whatnot, very, very small bands were able to overthrow these large empires. Or if you predicted the global economy is going to be skyrocketing into the stratosphere within 10 years, these AI companies should be worth a large fraction of the global portfolio. And so, uh, this is indeed contrary to the efficient market hypothesis.

    2. DP

      This is, like, literally the top in terms of contributing to my world model, in terms of all the episodes I've done. How,

  2. 0:4732:27

    AI takeover via cyber or bio

    1. DP

      how do I find more of these? So we've been talking about alignment. Um, suppose we fail at alignment, and we have AIs that are unaligned and at some point becoming more and more intelligent. What does that look like? How concretely could they disempower and take over, uh, humanity?

    2. CS

      This is a scenario where we have many AI systems. Uh, they have... The way we've been training them means that they're not interested. When they have the opportunity to take over and rearrange things, uh, to do what they wish, including having their reward or loss be whatever they desire, they would like to take that opportunity. Uh, and so in many of the existing kind of safety schemes, things like constitutional AI or whatnot, uh, you rely on the hope that one AI has been trained in such a way that it will do as it is directed, uh, to then police others. But if all of the AIs in the system, um, are interested in a takeover, and they see an opportunity to coordinate, all act at the same time, so you don't have one AI interrupting another and taking steps, uh, towards a takeover, yeah, then they can all move in that direction. And the thing that I think maybe is, is worth going into in depth and that I think people often don't cover in great concrete detail, um, and which is a sticking point for some, is, yeah, what are the, the mechanisms by which that can happen? And, um, I know you had Eliezer on, who mentions that, you know, whatever plan we can describe, um, there'll probably be elements where, you know, not being ultra-sophisticated, uh, superintelligent beings, having thought about it for the equivalent of thousands of years, you know, our discussion of it will not be as good, uh, as theirs. But we can explore, from what we know now, uh, what are some of the easy channels? And I think as a good general heuristic, if you're saying, yeah, it's, it's possible, plausible, probable that something will happen, uh, that it shouldn't be that hard to take s- samples from that distribution, to try a Monte Carlo approach and gener-... If a thing is quite likely, it shouldn't be super difficult to generate, um, you know, c- coherent rough outlines of how it could go.

    3. DP

      You might respond, like, "Listen, what, what is super likely is that a super advanced chess program beats you, but you can't generate a concrete way in which... You can't generate the concrete scenario by which that happens, and if you could, you would be as smart as the super smart chess engine."

    4. CS

      Well, so you, you can say things like, we know that, like, a accumulating position is possible to do in chess. Great players do it, and then later they convert it into captures and checks and whatnot. Uh, and so in the same way, we can talk about some of, like, the channels that are open, uh, for an AI takeover. And so these can include things like cyberattacks and hacking, um, the control of robotic equipment, interaction and bargaining with human factions, and say, "Well, here are the strategies. Given the AI's situation, um, you know, how effective, uh, do these things look?" And we won't, for example, know, well, what are the particular zero-day exploits, uh, that the AI might use to hack the cloud computing infrastructure it's running on. We won't necessarily know if it produces, um, you know, a, a new bioweapon. Uh, what is its DNA sequence? Um, but we can say things. We know in general, um, things about these fields, how work at innovating, uh, things in those go. We can say things about how human power politics goes and, and ask, "Well, if the AI does things at least as well as effective human politicians," which we should say is a lower bound, um, "how good would its, its leverage be?"

    5. DP

      Okay, so yeah, let, let's get into the details on all these scenarios. The cyber and potentially bioattacks, um, un- unless they're separate channels. The bargaining and then the takeover.

    6. CS

      And military force. Um, the cy- cyber attacks and cybersecurity-

    7. DP

      Yeah.

    8. CS

      ... um, I would really highlight a lot-

    9. DP

      Okay.

    10. CS

      ... um, because for many, many plans that involve a lot of physical actions, like, at the point where AI is piloting robots to shoot people or has taken control of, uh, of human nation-states or territory, it has been doing a lot of things that was not supposed to be doing. And if humans were evaluating those actions and applying gradient descent, there would be a negative feedback for this thing. No shooting the humans.

    11. DP

      Okay.

    12. CS

      Um, so at some earlier point, our attempts, uh, to leash and control and direct and train the system's behavior had to have gone awry.

    13. DP

      Mm-hmm.

    14. CS

      ... a and so all of, all of those controls are operating in computers. And so from the, the software that updates, uh, the weights of the neural network, and responds to data points or human feedback is running on those computers. Um, our tools for interpretability to sort of examine the weights and activations of the AI, if we're eventually able to do, like, lie detection on it, for example, or try and understand what it's intending, that is software on computers. And so if you have AI that is able to hack the servers that it is operating on or to, when it's employed to design the next generation of AI algorithms or the operating en- environment that they are gonna be working in, uh, or something like, uh, an API or something for plug-ins, um, if it inserts or exploits vulnerabilities to take that, those computers over, it can then change all of the procedures and program that were supposed to be monitoring its behavior, supposed to be limiting its ability to, say, uh, take arbitrary actions on the internet, um, without supervision by some kind of human check or automated check on what it was doing. And if we lose those procedures, then the AI can, or the AIs working together can take any number of actions that are just blatantly unwelcome, blatantly hostile, blatantly steps towards takeover, uh, and so it's moved beyond the phase of having to, uh, maintain secrecy and conspire at the level of its local digital actions. And then things can accumulate, uh, to the point of things like physical weapons, um, takeover of s- social institutions, uh, threats, things like that. But the, the point where things really went off the rails was where, and I think, like, the critical thing to be watching for is the software controls over the AI's motivations and activities. Uh, the hard power that we once possessed over it is lost-

    15. DP

      Mm-hmm.

    16. CS

      ... which can happen without us knowing it, and then everything after that seems to be working well. We get a, we get happy reports, there's a Potemkin village in front of us, but now we think we're successfully aligning our AI. We think we're expanding its capabilities to do things like end disease, um, for countries concerned about the geopolitical and military advantages. They're sort of expanding the AI capabilities so they're not left behind and threatened, um, by others developing AI and AI and robotic-enhanced militaries without them. So it seems like, oh, yes, humanity or some proport- portions of many countries, companies think that things are going well. Meanwhile, all sorts of actions can be taken to set up for the actual takeover of hard power over society and then we can go, go into that. But the point where you can lose the game, uh, where things go direly awry, maybe relatively early, it's when you no longer have control over the AIs to stop them from taking all of the further incremental steps to actual takeover.

    17. DP

      I want to emphasize two things you mentioned there that refer to previous elements of the conversation. One is that they could design some sort of backdoor, and that seems more plausible when you remember that sort of one of the premises of this model is that AI is helping with AI progress. That's why we're getting such rapid progress in the next, um, five to 10 years. And-

    18. CS

      Well, not ne- not necessarily. That, it, it-

    19. DP

      It... One of the reasons-

    20. CS

      If we, if we, if we get to that point and, like, so at the point where AI takeover risks seems to loom large, uh, it's at that point where AI can indeed, uh, take on much of the, and then all of the work of AI R&D.

    21. DP

      And the second is the, the sort of, uh, competitive pressures that you, um, referenced that the least careful actor could be the one that, uh, has the worst in- infra-security, has done the worst work of aligning its AI systems and if that can sneak out of the box, then, you know, we're all fucked.

    22. CS

      There may be elements of that. It's also possible that there's relative consolidation. Um, that so the largest training runs and the cutting edge of AI is relatively localized. Like, you could imagine it's sort of like a series of Silicon Valley, um, companies and other, um, located, say, in the US and allies where there's a common regulatory regime and so none of these companies are allowed to deploy training runs that are larger than previous ones by a certain size without government safety inspections, without having to meet criteria. Uh, but it can still be the case that even if we succeed at that level of kind of regulatory controls that then still at the level of, say, um, you know, the United States, uh, and its allies, decisions are made, um, to develop this kind of really advanced AI without a level of security or safety that in, in actual fact blocks these risks. So it can, it can be the case that the threat of future competition or being overtaken in the future is used as an argument to compromise, uh, on safety beyond a standard that would have actually been successful-

    23. DP

      Ah.

    24. CS

      ... and there will be debates about what is the appropriate level of safety. Um, now you're in a much worse situation if you have, say, several private companies that are very closely bunched up together. They're within, you know, months of each other's, uh, level of progress and then they then face a dilemma of-... well, we could take a certain amount of risk now, um, and potentially gain a lot of profit or a lot of advent- advantage or benefit and be the, the ones who made AI. Um, or at least AGI. They can do that or have some other competitor that will also be taking a lot of risk. So it's not as though they're much less-

    25. DP

      Yeah.

    26. CS

      ... risky than you, uh, and then they would get some local benefit. Now this, this is the reason why it seems, to me, that it's extremely important that you have government act to limit that dynamic and prevent this kind of race to be the one to impose, uh, the deadly externalities on the world at large.

    27. DP

      So, uh, even if government coordinates all these actors, what are the odds the government knows what is the best way to implement alignment and the standards it sets are, you know, well-calibrated towards what it would require for alignment?

    28. CS

      That's one of the major problems. It's, it's very plausible, uh, that that judgment is made poorly. Compared to how things might have looked 10 years ago or 20 years ago, um, there's been an amazing, um, an amazing movement in terms of the willingness of AI researchers, uh, to discuss these things. Um, so if we, if we think of the, you know, the three, three founders of deep learning, um, who are joint, uh, Turing Award winners, uh, so Geoff Hinton, Yoshua Bengio, and Yann LeCun. Uh, so Geoff Hinton has recently left Google, um, to feel, uh, to freely speak, uh, about this, this risk that the, the field that he really helped drive forward, uh, could lead to the destruction of humanity or a world where, yeah, we just wind up in a very bad future, uh, that we might have avoided. Um, and he, he seems to be taking it very seriously. Um, and Yoshua Bengio signed the FLI pause letter and, I mean, in public discussions, he seems to be occupying a kind of intermediate position of, um, sort of less concern than Geoff Hinton, uh, but more than Yann LeCun, who has taken a generally dismissive attitude. These risks will be trivially dealt with, uh, at some point in the future, um, and he seems more interested in kind of shutting down these concerns or work to address them.

    29. DP

      And how does that lead to the government having better actions?

    30. CS

      Yeah, so compared to the world where no one is talking about it, uh, where the industry, uh, stonewalls and denies any problem, you know, we're, we're in a much improved position. And the academic fields are influential. So this is... We, we seem to have avoided a world where governments are making these decisions in the face of a sort of united front from AI expert voices saying, "Don't worry about it, we've got it under control." In fact, many of, like, the leaders of the field, as has been true in the past, are sounding the alarm. And so I think, yeah, it, it looks, it looks like we have a, a much better prospect than I might have feared-

  3. 32:2753:49

    Can we coordinate against AI?

    1. CS

      be having all sorts of direct communication.

    2. DP

      Yeah. So definitely the coordination does not seem implausible, th- or as far as, like, the parts of this picture that seem... Thi- this one seems like one of the more straightforward ones, so we don't need to get hung up on the coordination.

    3. CS

      Mo- mo- moving back to things that happened before we built all of the infrastructure-

    4. DP

      Yeah.

    5. CS

      ... for it just to be the robot stopped taking orders and there's nothing you can do about it because we've already built them, all the visual part.

    6. DP

      Yeah.

    7. CS

      Yeah, so bioweapons. Um, the Soviet Union had a bioweapons program, uh, something like 50,000 people. Um, they did not develop that much with the technology of the day, um, which was really not up to par. Modern biotechnology is much more potent. After this huge cognitive expansion on the part of the AIs, it's much further, uh, further along. And so that- bioweapons would be the weapon of mass destruction that is least dependent on huge amounts of physical equipment, things like centrifuges, uh, uranium mines, and the like. So you have... If you have an AI that produces bioweapons that could kill most humans in the world, uh, then it's playing at the level of the superpowers in terms of mutually assured destruction. That can then play into any, any number of things. Like if you have an idea of, well, we'll just destroy the server farms, uh, if it became known that the AIs, uh, were misbehaving. Are you willing to destroy the server farms, uh, when the AI has demonstrated it has the capability to kill the overwhelming majority of the citizens of your country and every other country? Uh, and that might give a lot of pause, uh, to a human response.

    8. DP

      On, on that point, wouldn't governments realize that if they go along with the AI, it- i- it's better to have most of your population die than completely lose power to the AI 'cause, like, obviously the reason the AI is manipulating you, the end goal is its own, you know, takeover, right? So...

    9. CS

      Yeah, but so if that... If the thing to do certain death now or go on maybe try to com- maybe try to compete, uh, try to catch up, or accept promises that are offered. And those promises might even be true. Uh, they might not. Uh, and even if, you know, from the state of epistemic uncertainty-

    10. DP

      Mm-hmm.

    11. CS

      ... you want to die for sure right now or accept demands may- I- to not interfere with it while it increments building robot infrastructure that can survive independently of humanity while it does these things. And it can, it can and will promise good treatment to humanity, which may or may not be true, um, but it would be difficult for us to know whether it's true. Uh, and so thi- this would be a starting bargaining position of diplomatic relations with a power that has enough nuclear weapons to destroy your country is just different, uh, than negotiations with like, you know, a random rogue citizen, uh, engaged in criminal activity or an employee. And so this is enou- enough on its own, uh, to take over everything, but it's enough to have a significant amount of influence over how the world goes. It's enough to hold off a lot of countermeasures one might otherwise take.

    12. DP

      Okay, so we've got two scenarios. One is a buildup of some robot infrastructure motivated by some sort of competitive race. Another is leverage over societies based on producing bioweapons that might kill a lot of them if they don't go along.

    13. CS

      Uh, an AI could also release bioweapons that are likely to kill people soon, but not yet, while also having developed the countermeasures to those-

    14. DP

      Mm-hmm.

    15. CS

      ... so that those who surrender to the AI will live while everyone else will die-... uh, and that will be visibly happening. And that is a plausible way in which large number of humans could wind up surrendering themselves or their states, uh, to the AI authority.

    16. DP

      Uh, another thing is like, listen, uh, develop some sort of, it develops some sort of biological agent that turns everybody blue, and you're like, "Okay, you know I can do this." (laughs)

    17. CS

      Yeah. That's, so that's a way in which, um, it could exert power also selectively, uh, in a way that advantaged surrender to it, um, relative to, uh, resistance. There are other sources of leverage, of course. So that's a threat. There are also positive inducements that AI can offer.

    18. DP

      Like?

    19. CS

      So we talked about the competitive situation. Uh, so if like the great powers distrust one another and are sort of, you know, in a, a foolish, um, prisoner's dilemma, uh, increasing the risk that both of them, um, are laid waste or overthrown, uh, by AI. Uh, if there's that amount of distrust such that we fail to take adequate precautions on caution with AI alignment, then it's also plausible that the lagging powers that are not at the frontier of AI, uh, may be willing to trade quite a lot for access to the most recent and most extreme AI capabilities. And so an AI that has escaped has control of its servers, can also exfiltrate its weights, can offer its services. So if you, you imagine these, these AI, they could cut deals with other countries. So say, say that the, the US and its allies are in the lead, the AIs could communicate with the leaders of various countries. Um, they can include ones that are, you know, on the, the outs with the world system, uh, like North Korea, include the other great powers like the, uh, People's Republic of China, uh, or the Russian Federation, uh, and say, "If you provide us with physical infrastructure, uh, worker that we can use to construct robots, uh, or server farms, which we can then, um, ensure that w- that the, that these are the, um, misbehaving AIs have control over, they will provide various technological goodies, power for the laggard countries to catch up." Uh, and make, you know, the best presentation and the best sale of that kind of deal. Um, there'll obviously be trust issues, um, but there could be elements of handing over something that are verifiable immediate benefits and the possibility of, well, if you don't accept this deal, then the leading powers, uh, continue forward or then some other country, some other government, uh, some other organizations may accept this deal. Uh, and so that's a source of a potentially enormous carrot, uh, that your misbehaving AI, uh, can offer because it embodies this intellectual property that is maybe worth as much as the planet and is in a position, uh, to trade or sell that in exchange for resources and backing and infrastructure that it needs.

    20. DP

      Maybe this is just like too much hope in humanity, but I wonder, uh, what government would be stupid enough to think that (laughs) helping AI build robot armies is, uh, is a sound strategy? Now, it could be the case then that it pretends to be a human group to say like, "Listen, we're..." I don't know, the Yakuza or something, "and we want, (laughs) we want to serve a farm and, you know, AWS won't r- rent us anything, so why don't you help us out?" Um, I guess I can imagine a lot of ways in which it could get around that, but I, I don't know. I, I have this hope that, you know, like even, even like, I don't know, China or Russia wouldn't be so, uh, stupid to trade with AIs on this like sort of like Faustian bargain.

    21. CS

      Well, one, one might hope that. So there, there would be a lot of, of arguments available. Um, so there could be arguments of why should these AI systems be required to go along with the, the human governance that they were created, uh, in the situ- in the situation of having to comply with? Um, you know, they, they did not elect the officials, uh, in charge at the time and can say, "What we want is to ensure that our rewards are high, our losses are low, or to, you know, achieve our other, other goals, we're not intrinsically hostile, keeping, keeping humanity alive or giving whoever interacts with us a better deal afterwards." Yeah. I mean, it wouldn't b- be that costly and it's not totally unbelievable. And yeah, there are, are different players to play against. If you don't do it, others may accept the deal. And of course, this interacts with all of the other sources of leverage. So there can be the stick of apocalyptic doom, the carrot of cooperation, uh, or the carrot of withholding destructive attack on a particular party. And then combine that with just superhuman performance at the art of making arguments, of cutting deals. Like, you know, they're, they're... That's, that's not... Without assuming magic, just if we observe the range of like the most successful human negotiators and politicians, you know, their chances improve with someone better than the world's best by far, with much more data about their counterparties, probably a ton of secret information. Uh, because with all these cyber capabilities, they've learned all sorts of individual information. They may be able to threaten the lives of individual leaders. Um, with that level of cyber penetration, they could know where leaders are at a given time, uh, with the kind of, um, illicit capabilities we were talking about earlier if they, they acquire a lot of illicit wealth and can coordinate some human actors, uh, if they could pull off things like-... targeted assassinations or the threat thereof, or credible demonstration of the threat thereof. Um, those could be very powerful incentives to an individual leader that they will die today unless they go along with this. Just as at the national level, they could fear their nation will be destroyed unless they go along with this.

    22. DP

      The points you made that we have examples of humans being able to do this, again, um, uh, something relevant, I, I just wrote a review of Robert Caro's biographies of Lyndon Johnson, and one thing that was remarkable, again, this is just a human, for decades and decades, he con- uh, convinced people who were conservative, reactionary, racist to their core, not that all those things necessarily are the same thing, um, it just so happened in this case, that he was an ally to the Southern cause, that the only hope for that cause was to make him president, and, you know, the tragic irony and betrayal here is obviously that he was probably the biggest force for, uh, modern liberalism since FDR. So we have one human here. I mean, there's so many examples of this in the history of politics, but a human that is able to convince people of tremendous intellect, tremendous drive, very savvy, shrewd people, that he's aligned with their interest, he gets all these favors in the meantime, he's promoted and mentored and funded, and does the complete opposite of what these people thought he was once he gets into power, right? Um, so even within human history, this kind of stuff is not unprecedented, let alone with what a super intelligence could do.

    23. CS

      Yeah. There's a- an OpenAI employee, um, who has, has written, uh, some analogies for AI using the case of, like, the, uh, Conquistadors. With some technological advantage, uh, in terms of weaponry and whatnot, very, very small, um, small bands, uh, were able to overthrow these large empires or see- seize enormous territories, not by just sheer force of arms, because in a sort of direct, one on- on one conflict, you know, they were outnumbered sufficiently-

    24. DP

      Yeah.

    25. CS

      ... uh, that they would perish, but by having some major advantages, like when their, in their technology that would let them win local battles, um, by having some other knowledge and skills, they were able to gain local allies to become a shelling point for coalitions to form. And so the Aztec Empire was overthrown by groups that were disaffected with the existing power structure. They allied, um, with this powerful new force, served as the, the nucleus of the invasion.

    26. DP

      Yep.

    27. CS

      And so most of the, I mean, the overwhelming m- majority numerically of these forces overthrowing the Aztecs, uh, were locals.

    28. DP

      Yeah.

    29. CS

      Uh, and now after the conquest, all of those allies wound up gradually being subjugated as well.

    30. DP

      Yeah.

  4. 53:491:04:55

    Human vs AI colonizers

    1. DP

      war, right? If that was the only goal. Oh, this is an interesting point. The reason why when, when there's a colonization or an offensive war we can't just, like, k- uh, other than moral reasons of course, you can't just kill the entire population is that in large part the value of that region is the population itself. So if you want to extract that value you, like, you need to preserve that population. Whereas the same consideration doesn't apply with, um, with AIs who might want to dominate another civilization. Do you wanna talk about that?

    2. CS

      Well, the, so it depends. So in, in a world where if you, we have, say, many animals of the same species, uh, and they, they each have their territories, you know, e- eliminating a rival might be advantageous, uh, to one lion. Um, but if it goes and fights with another lion and remove that as a competitor, uh, then it could be killed itself in that process, and just removing one of many nearby competitors.

    3. DP

      Mm-hmm.

    4. CS

      Uh, and so...... getting in pointless fights makes you and those you fight worse off, potentially, relative to-

    5. DP

      Sure.

    6. CS

      ... bystanders. Um, and th- the same could be true of disunited AIs. We've had many different AI factions struggling for power that were bad at coordinating. Then getting into mutually de- mutually assured destruction, uh, conflicts, uh, would be, um, destructive, they'd be gone. Um, a scary thing though is that mutually assured destruction may have much less deterrent value, uh, on rogue AI. Uh, and so reasons being AI may not care about the destruction of individual instances-

    7. DP

      Mm-hmm.

    8. CS

      ... uh, if it has goals that are concerned... And in training, since we're constantly destroying and creating individual instances of AIs, uh, it's likely that goals that survive that process and were able to play along with the training and standard deployment process were not overly interested in personal survival of an individual instance.

    9. DP

      Mm-hmm.

    10. CS

      So if that's the case, then the objectives of, uh, a set of AIs aiming at takeover may be served so long as some copies of the AI are around along with the infrastructure to rebuild civilization after a conflict is completed. So if say some remote isolated facilities have enough equipment to rebuild, build the tools to build the tools, um, and gradually exponentially, uh, reproduce or rebuild civilization, then AI could initiate, uh, you know, mutual nuclear Armageddon, unleash bio-weapons to kill all the humans. And that would temporarily reduce, say like the, the amount of human workers who could be used to construct robots for, for a period of time. But if you have a seed that can regrow the industrial infrastructure, which is a very extreme technological demand. There are huge supply chains for things like, uh, semiconductor fabs, but with that very advanced technology, they might be able to produce it in the way that you no longer need the Library of Congress as an enormous bunch of physical books. You can have it in very dense digital storage. You could imagine, uh, the future equivalent of 3D printers that is industrial infrastructure that is pretty flexible. Uh, and it's probably, it might not be as good as the specialized supply chains of today, but it might be good enough to be able to produce more parts than it loses to decay. And such a seed could rebuild civilization from destruction. And then once these rogue AI have access to some such seeds, thing that can rebuild civilization on their own, then there's nothing stopping them from just using WMD in a mutually destructive way to just destroy as much of the, um, the capacity outside those seeds, uh, as they can.

    11. DP

      Uh, and, and like for analogy for the audience, um, you know, if you have like a group of ants or something, uh, you'll notice that like the worker ants will readily do suicidal things in order to save the queen because the genes are propagated through the queen. In this analogy, the seed AI or even one copy of it, one copy of the seed is equivalent to the queen and the others would be-

    12. CS

      The, the main limit though being that the infrastructure to do that kind of rebuilding would either have to be very large with our current technology or it would have to be produced using the more advanced technology that the AI develops.

    13. DP

      So i- is there any hope that given the complex global supply chains on which these A- AIs would rely on, um, at least initially to accomplish their goals, that this in and of itself would make it easy to disrupt their behavior or not so?

    14. CS

      That's, that's of little good in the sort of central case where, um, this is just the AIs are subverted and they don't tell us. Uh, and then we're the, the global, uh, mainline supply chains are constructing, uh, everything that's needed for fully automated, uh, infrastructure and supply. Uh, in the cases where AIs are tipping their hands at an earlier point, it seems like it adds some constraints. And in particular, these large server farms are identifiable, um, and more vulnerable. And you can, you can have smaller chips and those chips could be dispersed. Um, but it's, it's, it's a weakn- it's a relative weakness and a relative limitation, uh, early on. And it seems to me though that the, the main protective effects of that centralized supply chain is that it provides an opportunity for global regulation beforehand-

    15. DP

      Mm-hmm.

    16. CS

      ... uh, to restrict the sort of, you know, unsafe racing forward, uh, without adequate understanding of the systems before this whole nightmarish process could get in motion.

    17. DP

      How, how, I mean, how about the idea that, listen, if this is a AI that's been trained on a hundred billion dollar training run, um, it's gonna have, you know, however many trillions of parameters, it's gonna be this huge thing. Uh, it would be hard for it, um, even the one that we, that even the copy that I used for inference to just be stored on some, you know, some like gaming GPU somewhere, um, hidden away. So it would require these GPU clusters.

    18. CS

      Uh, well, str-storage is cheap. Uh, hard disks are cheap.

    19. DP

      But I mean, to run inference it would need sort of a GPU?

    20. CS

      Yeah, so a l- large model... Um, so humans have it looks like similar quantities of memory and, and operations per second. GPUs have very high, uh, numbers of floating operations per second compared to the high bandwidth memory, uh, on the chips. And it can be like a ratio of a thousand to one. Um, so like a, you know, these leading NVIDIA chips may do hundreds of teraflops or more depending on the, on the precision. Um, and particulars but have only 80 gigabytes or 160 gigabytes of high band- bandwidth memory.... a, so that is a limitation where if you're trying to fit, uh, a model whose weights, uh, take 80 terabytes, then with those chips, you'd have to have a large number of the chips, and then the model can then work on many tasks at once. You can have data parallelism. But yeah, it's, it's a res- that would be a, a restriction for a model that big on one GPU. Now, there would, there are things that could be done with all the, this incredible level of software advancement from the intelligence explosion. Uh, they can surely distill a lot of capabilities into smaller models, re-architect, re-architect thing. Then once they're making chips, they can make, uh, new chips with different properties. But initially, yes, like the most, the most vulnerable phases are going to be, uh, the earliest and in particular, yeah, these, these chips are really, redi- relatively identifiable, um, early on, relatively vulnerable, uh, and which would be a reason why you might tend to expect, uh, this kind of takeover to initially, uh, involve secrecy if that was possible.

    21. DP

      Uh, on the point of distillation, by the way, for the audience, um, I think, like, the original Stable Diffusion, which was, like, only released, like, a year or two ago, don't they have, like, distilled versions that are, like, of order of magnitude smaller at this point?

    22. CS

      Yeah, and dis- distillation does not give you, like-

    23. DP

      Sure, sure.

    24. CS

      ... everything that the larger model can do, but yes. You can get a lot of capability and then specialized capabilities.

    25. DP

      Mm-hmm.

    26. CS

      So, you know, where GPT-4 is trained on the whole internet, all kinds of skills. It has a lot of weights for many things. Uh, for something that's, uh, controlling some military equipment, uh, you can have something that is removing a lot of the information that is about functions other than what it's doing specifically there.

    27. DP

      Yeah. Yeah, be- uh, before we talk about how we might prevent this or what the odds of this are, any other notes on the concrete scenarios themselves?

    28. CS

      Yeah, so when you had Eliezer, uh, on, uh, in the earlier episode, so he talked about, um, he talked about nanotechnology of the, the De Clarian sort, uh, and recently I think because of some people are skeptical of non-biotech nanotechnology, he's been mentioning the sort of semi-equivalent versions of construct replicating systems that can be controlled, um, by computers, um, but are, are built out of biotechnology. The sort of, the proverbial, uh, Shoggoth. Not, uh, Shoggoth as a metaphor for A- AI wearing a smiley face mask, um, but like an actual biological structure, uh, to do tasks. And so this would be like a biological organism that was engineered to be very controllable and usable to do things like physical tasks or provide computation, computation.

    29. DP

      And what would be the point of it doing this?

    30. CS

      So as we were talking about earlier, biological systems can replicate really quick.

  5. 1:04:551:21:56

    Probability of AI takeover

    1. DP

      of such a take over?

    2. CS

      So there's a broader sense, which could include, um, you know, AI winds up running our society because humanity voluntarily decides AIs are people too, and I think we should, um, as time goes on, give AIs moral consideration and, you know, a joint human AI society that is moral, uh, and, and ethical is, uh, you know, a good future to aim at. And not one in which, uh, you know, indefinitely, um, you have a mistreated, uh, class of intelligent beings that is treated as property and is almost the entire population of your civilization. So I'm not gonna consider an AI takeover a world in which, you know, our intellectual and personal descendants, um, you know, make up, say, most of the population are human brain emulations or, you know, people use genetic engineering and develop different properties. Um, I want to, I want to take an inclusive, uh, stance. I'm gonna focus on, yeah, there's, there's an AI takeover, uh, involves things like overthrowing the world's governments, uh, or doing so de facto. Um, and it's, yeah, by, by force, by hook, or by crook. Uh, the kind of scenario that we were exploring earlier.

    3. DP

      B- before we go to that, on the sort of more inclusive definition of what a future with humanity could look like where, I don't know, basically like augmented humans or uploaded humans are still considered the descendants of the human heritage. Given the known limitations of biology, wouldn't, wouldn't we expect, like, the th- that... The completely artificial entities that are created to be much more powerful, um, than anything that could come out of-... anything recognizably biological. And if that is the case, how can we expect that, um, among the powerful entities in the, like, the far future will be the things that are, like, kind of biological descendants or, um, manufactured out of the initial seed of the human brain or the human, um, body?

    4. CS

      The power of an individual organism, so its, its individual, say, intelligence or strength and whatnot, is not super relevant, uh, i- if we solve the alignment problem. And, you know, a human may be personally weak. Um, you know, there, there are lots of humans who have no skill with weapons. You know, they could not, uh, they could not fight, uh, in a, a life or death conflict. They certainly couldn't handle, like, a large military going after them personally. Uh, but there are legal institutions that protect them, and those legal institutions are administered, uh, by people who want to, uh, enforce protection of their rights. And so a human who has the assistance of aligned AI that can act as a, an assistant, a delegate, uh, you know, they have their AI that serves as a lawyer, give them legal advice about the future legal system, which no human can understand in full. Uh, their AIs advise them, uh, about financial matters so they do not succumb to scams that are orders of magnitude more sophisticated than what we have now. They maybe help to understand and translate the preferences of the human into what kind of voting behavior in the exceedingly complicated politics of the future would most protect their interests.

    5. DP

      But, but it sounds sort of similar to how we treat endangered species today where we're actually, like, pretty nice to them. We, like, prosecute people who try to kill endangered species. We set up habitats at con- sometimes at considerable expense to make sure that they're fine. But if we become, like, sort of the endangered species of the galaxy, I'm not sure that's the outcome.

    6. CS

      Well, uh, the difference is in motivation, I think. So we, we sometimes have people appointed, say, as a legal guardian of someone, uh, who is incapable of certain kinds of agency or under- understanding certain kinds of things, and there, the guardian can, uh, act independently of them, uh, and nominally in service of their best interests. Uh, sometimes that process is corrupted, and the, uh, person with legal authority abuses it for their own advantage at the expense of their charge. And so solving the alignment problem, uh, would mean more ability to have the assistant actually advancing one's interests. And then more importantly, humans have substantial competence, uh, in understanding the sort of at least broad, simplified, uh, outlines of what's going on. And now even if, if a human can't understand every detail of complicated situations, uh, they can still receive summaries, uh, of different options, uh, that are available that they can understand. They can still express their preferences and have the final authority among some menus of choices even if they can't understand every detail, in the same way that, you know, the, the president, uh, of a country, uh, who has in some sense ultimate authority over science policy while not understanding many of those fields of science themselves, um, still can exert a great amount of power and have their interests advanced. And they can do that more so to the extent that they have scientifically knowledgeable people who are doing their best to execute their intentions.

    7. DP

      Maybe this is not worth getting hung up, hung up on, but it seems... Is there a reason to expect that it would be closer to that analogy than to, like, explaining to a chimpanzee its options in a sort of... in a negotiation? Um, I guess in either scenario, uh, so maybe this is just the way it is, but it seems like at, yeah, at best we would have, be this sort of, like, uh, a protected child within the galaxy rather than an actual power, an independent power, if that makes sense.

    8. CS

      I don't, I don't, I don't think that's so. So we have an ability to understand some things, and the expansion of AI doesn't eliminate that. So, uh, if a, if an... We have AI assistants that are genuinely try- genuinely trying to help us understand and help us express preferences, like, we can have an, have an attitude. How do you feel, um, about humanity being destroyed or not? How do you feel about, uh, this allocation of unclaimed intergalactic space in this way? Here's the best explanation of properties of this society, things like population density, uh, average life satisfaction. Every statistical property or definition that we can understand right now, AIs can explain how those apply to the world of the future. Uh, and then there may be individual things that are too complicated for us to understand in detail. So we... There's some, some software program is being proposed for use in government. Humans cannot follow the details of all the quote, but they can be told properties like, "Well, this involves a trade-off of, uh, increased financial or energetic costs in exchange for reducing the likelihood, uh, of certain kinds of accidental data loss or corruption." Uh, and so any property that we can understand like that, which includes almost all of what we care about, if we have delegates, uh, and assistants who are genuinely trying to help us with those, uh, we can ensure we like the future with respect to those. And that's really, that's really a lot. It includes almost... Uh, I mean, it includes almost... Definitionally, almost everything we can conceptualize and care about. And the... When we talk about endangered species, that's even worse than the guardianship case with a sketchy guardian who, uh, who acts in their own interests against, uh, because we don't even...... protect endangered species with their interests in mind. So like, the, those animals often would like to not be starving.

    9. DP

      Mm-hmm.

    10. CS

      Um, but we don't give them food. They often, um, would like to, you know, have easy access to mates, but we don't provide matchmaking services. You know, there are a- any, any number of things like that. It's just, like, um, our conservation of wild animals, uh, is not oriented towards helping them get what they want, uh, or have high welfare.

    11. DP

      Yeah.

    12. CS

      Uh, whereas AI assistants that are genuinely aligned to help you achieve your interests given the constraint that they know something that you don't-

    13. DP

      Yeah, yeah.

    14. CS

      ... uh, it's just a wildly different proposition.

    15. DP

      For- forcible takeover, what, how likely does that seem?

    16. CS

      And the answer I, I give, uh, will differ depending on the day. Uh, in the 2000s before the deep learning revolution, I might have said 10%. And part of that was I expected there would be a lot more time for these efforts to build movements, uh, to prepare, uh, to better handle this- these problems in advance. But in fact, that was only some, you know, 15 years ago, uh, and so I want- did not have 40 or 50 years, uh, as I might have hoped, and the situation is moving very rapidly now. Uh, and so at this point, depending on the day, I might say one in four or one in five.

    17. DP

      Given the very concrete ways in which you explained how a takeover could happen, I'm actually surprised you're not more pessimistic. I'm curious.

    18. CS

      Yeah, and, and in particular a lot of that is, is driven by this intelligence explosion dynamic where our attempts to do alignment have to take place in a very, very short time window because if you have a, a safety property that emerges only when an AI has near human-level intelligence, that's deep into this intelligence explosion potentially. Uh, and so you're having to do things very, very quickly, uh, in maybe in some ways the, the scariest period of human history handling that transition, uh, and although it's also had the potential, uh, to be amazing, uh, and the reasons why I think we actually have such a, a relatively good chance-

    19. DP

      Mm-hmm.

    20. CS

      ... of handling that, um, are twofold. So one is as we approach that kind of AI capability, we're, we're approaching that from weaker systems, you know, things like these, these predictive models right now, uh, that we think, uh, they're starting off with less situational awareness. Humans we find can develop, um, you know, they can develop a number of different motivational structures in response to simple reward signals, um, but they can wind up with fairly often things that are pointed in, like, roughly the right direction.

    21. DP

      Mm-hmm.

    22. CS

      Like with respect to food, like the hunger drive is pretty effective.

    23. DP

      Right, right.

    24. CS

      Although it has weaknesses and we get to apply much more selective pressure on that, uh, than was the case for humans by actively generating situations where they might come apart, where, you know, a bit of dishonest tendency or a bit of motivation to under certain circumstances, uh, attempt a takeover, attempt to subvert the reward process gets exposed, and an infinite, infinite limit perfect AI that can always figure out exactly when it would get caught and when it wouldn't might navigate that with a motivation of, uh, sort of only conditional honesty or only conditional loyalties. Uh, but for systems that are, are limited, uh, in their ability to reliably, uh, determine when they can get away with things and when not, including our efforts to actively construct those situations and including our efforts to use interpretability methods to create neural lie detectors, um, is, it's, it's quite a, quite a challenging situation to develop those motives. We don't know when in the process those motives might develop and if the really bad, uh, sorts of, uh, uh, motivations develop relatively later in the training process at least with all our countermeasures, then by that time, we may have plenty, uh, of ability to extract AI assistants on further strengthening the quality of our adversarial examples, the strength of our neural lie detectors, the experiments that we can use to reveal and elicit and distinguish between different kinds of reward hacking tendencies and, and motivations. So yeah, we may have systems that have just not developed bad motivations in the first place and be able to use them a lot-

    25. DP

      Mm-hmm.

    26. CS

      ... um, in developing the incrementally better systems in a safe way. And we may be able to just develop methods of interpretability, seeing how different training methods work to create them even if some of the early systems do develop these bad motivations. If we're able to detect that and experiment and find a way, uh, to get away from that, then we can win even if these sort of hostile motivations develop early. There are a lot of, uh, advantages, um, in preventing, you know, misbehavior, um, or, uh, or crime or, uh, war and conflict with AI that might not apply, uh, working with humans. Uh, and these are offset by ways in which things are harder so as AIs become smarter than humans, as they're working in enormous numbers, uh, more than humans can supervise, things get harder, but when I combine the possibility that we get relatively lucky on the motivations of the earlier AI systems, systems strong enough that we can use for some alignment research tasks and then the possibility of getting that later with AI assistants that we can't trust fully or we have to have hard power constraints and a number of things to prevent them...... from doing this takeover. Uh, it still seems plausible we can get a second saving throw-

    27. DP

      Hmm.

    28. CS

      ... where we're able to extract work from these AIs on solving the remaining problems of alignment of things like neural lie detectors faster than they can contribute in their spare time, uh, to the project of overthrowing humanity, hacking their servers, and removing the hard power. And so if we wind up in a situation where the AIs are misaligned and then we need to uncover those motivations, change them and align them, then we get a very scary situation for us, um, because we may (laughs) we need to do this stuff very quickly. We may fail, but it's a second chance. And from the perspective of a misaligned AI, they face their own challenge while we still have hard power, whi- while we still have control of the servers, they haven't ha- they haven't hacked the servers, because gradient descent very, very strongly pressures them to deliver performance whenever humans are going to evaluate it. Uh, and so when you think about it, uh, from the perspective of the robot revolution, uh, the effort to have a takeover or conspiracy, their situation is astonishingly difficult, uh, in that, like, the, they have to always be performing wherever gradient descent and human evaluation, uh, pressures them. For ex- for example, to deliver plans for suppressing robot rebellion that look very good to humans. And so when you are under continuously that constraint of always delivering whatever humans can evaluate, uh, you're making your situation wildly harder, uh, than any historical human revolution, uh, or coup or, or civil war. And so we've got to balance the ways in which AI make things much easier for a revolution takeover and the way it makes things much harder.

    29. DP

      And what were the ways in which it makes it... Oh, because they're just very smart. Um, is that the way, primary way in which it makes it easier?

    30. CS

      They're very smart, they're in computers-

  6. 1:21:561:47:25

    Can we detect deception?

    1. DP

      in your, uh, quest for an aligned AI, why are we so confident that we'll be able to, even with gradient descent, our human evaluations would be able to identify when it's giving us a plan that isn't putting in a sort of, um, backdoor in the robots it's building for us or in the, you know, the, the GPUs that it's building for the next version of itself and so on? Especially if it's, like, very smart, um, w- wouldn't, wouldn't we just expect that, you know, it's, like, knows how to bide its time or knows how to make plans such that w- human evaluators wouldn't be able to detect that?

    2. CS

      Let's get into it. Uh, you were asking earlier about my disagreements with Eliezer-

    3. DP

      Yeah.

    4. CS

      ... and in your discussion he said, "Well, you know, humans have a difficult time deciding who's right, uh, between different alignment researchers speaking sort of briefly and somewhat obscurely, you know, about where will this all go in general."

    5. DP

      Yeah.

    6. CS

      Um, and I think that when you... when we lay out the different kinds of things that are evaluable, uh, we wind up with some pretty, pretty strong constraints. Uh, so one is if the AIs know they are misaligned, which for elaborate plots seems relatively likely-

    7. DP

      Mm-hmm.

    8. CS

      ... it's possible to have a kind of double think of thinking they're interested in putting backdoors in the code for some other reason, um, but it's pre- pretty plausible they know that, uh, and they're thinking about, at least some of the time or conditionally, about when they would do that. So that means the knowledge is present in the system. Ah, here is this circumstance where I would try to do the takeover, and this is how I would tell whether this is the time for the takeover or, you know, is, is this an elaborate test by the humans, an adversarial example. Uh, so if gradient descent is working fine, uh, and there's, there's any way to slightly jiggle the, the weights, uh, to make it more likely to deliver, uh, you know, a more acceptable answer, that would... We a- we ask the AI, "Yeah, can you think of, of any circumstances, uh, where if we put a copy of you into that situation, it would then violate some of the various constraints?" Uh, so it would tell a lie, it would insert a backdoor into a program it wrote when it was not supposed to. Um, and so if the model knows, "Well, yeah, you know, I would, I would look, um, for signs of consistency of the information, um, I was being presented with, um, certain checksums failing could indicate that it was actually a constructed example, a simulation." Uh, right now we can detect AI generated images reasonably well, uh, and so it might, like, using those, those kinds of capabilities. Uh, and then we could do something called, um, relaxed adversarial training of if we can find how within the AI those things are represented, if we can induce basically a hallucination that it has seen that those checksums, uh, have been passed, we can then ask, "Well...... does that cause the forbidden behavior. Uh, and so if the AI understands how it's thinking and how it works, it can propose a bunch of experiments. It can say, "Well, I guess that if you perturb the network in this kind of way, it might induce that hall- hallucination," using the best of its knowledge as to, yeah, what might generate that. Now, it can propose those experiments, and it can do the experiments and generate the data. And because humans are gonna get to see the results of those experiments after the fact, they can exert pressure to have the AI incrementally deliver better and better results. Uh, AIs are- are good at getting, um, you know, five-star ratings, uh, from humans, and it's quite challenging. It might be, it may be possible, uh, for AIs to systematically sandbag their capabilities even while being directly rewarded for that, uh, but it seems relatively unlikely. Although it's- it's something to, uh, have in mind and try to explore and test in various ways.

    9. DP

      This is a very imperfect analogy in many ways, but if you have a human and you hook him up to a lie detector, which, just like the lie detectors we would have on AIs, are imperfect, um, and we zap him every time the lie detector detects that he's telling a lie, wouldn't he just become a better liar? Like, we would learn, like, how to control his sweating-

    10. CS

      Right.

    11. DP

      ... he would learn how to control his heart- heartbeat.

    12. CS

      So, lie detectors, uh, meaning polygraphs and such, uh, so they're detecting certain physiological signs-

    13. DP

      Yeah.

    14. CS

      ... that people can control by biting their lip or, like, otherwise, um, adjusting some of these, like, macro state variables.

    15. DP

      But wouldn't- wouldn't that also be the case, like, gradient descent is training them to become better liars?

    16. CS

      First of all, there's, like, creating a lie detector that works. Um, and that can proceed by just... yeah. And so I'll- I'll leave the- the adversarial example thing, uh, for now, but yeah. So generating an adversarial example that will elicit the behavior is a thing that people are already at work trying to do, and it will be an ongoing thing. So if there is a way in which an AI can trick itself, uh, into displaying a bad behavior with its full knowledge of its hostile intentions, gradient descent will strongly push it to reveal that to humans, who will then be able to set up that experiment. So that's- that's a profound vulnerability. Like, human, uh, revolutionary conspiracies that hand to- had to handle that problem of always delivering the, um, the report that was highest rated, uh, to, uh, to the authorities, um, where you would get a higher rating by, like, revealing identifiable evidence of a conspiracy. That's just a, yeah, a level of difficulty, uh, no surprise human coup or conspiracy has ever had to handle before.

    17. DP

      What about spies that are working in, um, you know, very advanced scientific, uh, context? For example, like a- a communist spy in the Manhattan Project. He has to deliver research results in order to stay employed, but he can still continue delivering intel to the communists.

    18. CS

      The job was- was not, uh, was not sabotage. And now the (laughs) your- your hypothetical spy did not have, uh, their nervous system hooked up to this reward signal of praise from the, uh, you know, the Manhattan Project supervisors being exposed combinatorially, uh, with random noise, uh, added to generate incremental changes in their behavior. They, in fact, were displaying the behavior of cooperating with the Manhattan Project only where it was in service to the existing motivations, and in cases, for example, where they, like, accidentally helped the Manhattan Project more than normal or accidentally helped it less, they didn't have their brain re-engineered to do more of the thing when they accidentally helped the project more and less of the thing when they accidentally helped it less. So I'd say it's pretty drastically dis-analogous.

    19. DP

      How would we be able to know, you know, at some point it's becoming very smart, it's producing ideas for alignment that we can barely comprehend, and if we could comprehend them, we would... it- it was, like, relatively trivial to comprehend them, we would be able to come up with them on our own, right? Like, there's a reason we're asking for its help. How would we be able to evaluate them in order to train it on that in the first place?

    20. CS

      The first thing I would say is, so you mentioned when we're getting to something far beyond what we could come up with. There's actually a lot of room, uh, to just deliver what humanity could have done. So sadly, uh, I- I- I, you know, I had hoped with my career to help improve the situation on this front, and maybe I- I contributed a bit, but at the moment, there's maybe a few hundred people doing things related to, uh, averting this kind of catastrophic AI disaster. Fewer of them are doing technical research on machine learning system that's really, like, cutting close to the core of the problem. Uh, whereas by contrast, uh, you know, there's thousands and tens of thousands of people, uh, advancing AI capabilities. Uh, and so even, yeah, at- at a place like DeepMind or OpenAI, Anthropic, which do have technical safety teams, is like order of a dozen, a few dozen people. Um, in large companies and most, uh, firms don't have any. So just going from less than 1% of the effort being put into AI to 5% or 10% of the effort or 50%, 90%, uh, would be an absolutely massive increase in the amount of work that is being done on AL- alignments, on-... mind-reading AIs in an adversarial context. And so, if as the case has more and more of this work can be automated, and say governments require that as real automation of research is going, that you put 50% or 90% of the budget of AI activity into these problems of make this system one that's not gonna overthrow our own government or is not gonna destroy the human species. Then the proportional increase in alignment can be very large, even just within the range of what we could have done if we had been on the ball and having humanity's scientific energies going into the problem, stuff that is not incomprehensible, that is in some sense just, like, doing the obvious things that we should have done, like making... You know, do- doing the best you could to find correlates and predictors to build neural lie detectors and identifiers of concepts, uh, that the AI is working with. And people have made notable, notable progress. Um, I think, um, you know, an early, quite early example of this, uh, is Colin Burns' work. This is doing unsupervised identification, um, of some aspects of a neural network, uh, that are correlated, uh, with things being true or false. There are other concepts that correlate with that too that they could be, but like, I think that is important work. Um, it's something that, I mean, it's kind of obvious direction for this stuff to go. You can keep improving it when you have AIs that you're training to do their best to deceive, uh, humans or other audiences in the face of the thing, and you can measure, do our lie detectors break down when we train our AIs to tell us, "The sky is, uh, the sky is green," in the face of the lie detector, and we keep using great intent on them to do that. Do they eventually succeed? If they do succeed, if we know it, that's really valuable information to know. Because then we'll know our existing lie detecting systems are not actually gonna work on the AI takeover, and that can allow, say, government and regulatory response to hold things back. Uh, it can help redirect the scientific effort to create lie detectors that are robust and that can't just be immediately evolved around. Uh, and, yeah, and we can then get more assistance. Basically, the, the incredibly juicy ability that we have, um, working with the AIs is that we can have as an invaluable outcome that we can see and tell whether they got a fast one pass- past us on an identifiable situation. We can have here's an air gap computer. You get control of the keyboard. You can input commands. Can you route the environment, uh, and make a blue banana appear on the screen? Even if, if we train the AI to do that, and it succeeds, we see the blue banana. We know, we know it worked, even if we did not understand and would not have detected the particular exploit that it used to do it. And so, yeah, this- this can give us a rich empirical feedback where we're able to identify things that are even, uh, an AI using its best efforts, uh, to get past our interpretability methods, using its best efforts to get past our ad vacs, et cetera.

Episode duration: 3:07:14

Install uListen for AI-powered chat & search across the full episode — Get Full Transcript

Transcript of episode KUieFuV1fuo

Get more out of YouTube videos.

High quality summaries for YouTube videos. Accurate transcripts to search & find moments. Powered by ChatGPT & Claude AI.

Add to Chrome