Dwarkesh Podcast

Joe Carlsmith — Preventing an AI takeover

Chatted with Joe Carlsmith about whether we can trust power/techno-capital, how to not end up like Stalin in our urge to control the future, gentleness towards the artificial Other, and much more. Check out Joe’s excellent essay series on Otherness and control in the age of AGI: https://joecarlsmith.com/2024/01/02/otherness-and-control-in-the-age-of-agi/. Enjoy!

EPISODE LINKS
* Transcript: https://www.dwarkeshpatel.com/p/joe-carlsmith
* Apple Podcasts: https://podcasts.apple.com/us/podcast/joe-carlsmith-otherness-and-control-in-the-age-of-agi/id1516093381?i=1000666255737
* Spotify: https://open.spotify.com/episode/0npJsKzUulSHDVAHumXNtO?si=vyKi0z_CRB6inwUBhIfeFA
* Me on Twitter: https://twitter.com/dwarkesh_sp

SPONSORS
* Bland is an AI agent that automates enterprise phone calls in any language, 24/7. Their technology uses "conversational pathways" for accurate, versatile communication across sales, operations, and customer support. Try Bland at 415-549-9654 or bland.ai. Enterprises can get exclusive access to their advanced model at https://bland.ai
* Stripe is financial infrastructure for the internet. Millions of companies, from Anthropic to Amazon, use Stripe to accept payments, automate financial processes, and grow their revenue. Learn more here: https://stripe.com/

If you’re interested in advertising on the podcast: https://www.dwarkeshpatel.com/p/advertise

TIMESTAMPS
00:00:00 - Understanding the Basic Alignment Story
00:44:04 - Monkeys Inventing Humans
00:46:43 - Nietzsche, C.S. Lewis, and AI
01:22:51 - How Should We Treat AIs?
01:52:33 - Balancing Being a Humanist and a Scholar
02:05:02 - Explore/Exploit Tradeoffs and AI

Joe Carlsmith (guest) · Dwarkesh Patel (host)
Aug 22, 2024 · 2h 31m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00 - 44:04

    Understanding the Basic Alignment Story

    1. JC

      (upbeat music plays) AIs can be more patient. Nazis are more patient. Enemy soldiers have souls, right? We need to learn the art of being both hawk and dove. We're going to transition to a world in which we've created beings that are vastly more powerful than us. Our continued empowerment is just effectively dependent on their motives. I think that's a transition we should not make until we have a very developed science of AI motivations. My actual prediction is that the AIs are gonna be very malleable. If you push an AI towards evil, it'll just go.

    2. DP

      The kinds of things that go visit Andromeda: did you really expect them to privilege whatever inclinations you have because you grew up in the African savanna? Of course they're gonna be weird. Today, I'm chatting with Joe Carlsmith. He's a philosopher, and in my opinion a capital-G Great philosopher. You can find his essays at joecarlsmith.com. So we have GPT-4, and it doesn't seem like a paper clipper kind of thing. It understands human values. In fact, you can have it explain why being a paper clipper is bad, or just ask it, "Tell me your opinions about being a paper clipper. Explain why the galaxy shouldn't be turned into paper clips." Okay, so what is happening such that... we have a system that takes over and converts the world into something valueless?

    3. JC

      One thing I'll just say off the bat: when I'm thinking about misaligned AIs, or the type that I'm worried about...

    4. DP

      Yeah.

    5. JC

      I'm thinking about AIs that have a relatively specific set of properties related to agency, planning, and awareness and understanding of the world. One is the capacity to plan: to make relatively sophisticated plans on the basis of models of the world.

    6. DP

      Yeah.

    7. JC

      And those plans are being evaluated according to criteria.

    8. DP

      Mm-hmm.

    9. JC

      That planning capability needs to be driving the model's behavior. There are models that are in some sense capable of planning, but when they give an output, it's not like that output was determined-

    10. DP

      Yeah.

    11. JC

      ... by some process of planning. Like, here's what will happen if I give this output.

    12. DP

      Yeah.

    13. JC

      And do I want that to happen? The model needs to really understand the world, right? It needs to be like, "Okay, here's what will happen. Here I am. Here's my situation. Here's the politics of the situation." It needs that kind of situational awareness to be able to evaluate the consequences of different plans.

    14. DP

      Yeah.

    15. JC

      I think the other thing is the verbal behavior of these models. When I talk about a model's values, I'm talking about the criteria that end up determining which plans the model pursues, right? And a model's verbal behavior, even if it has a planning process (which GPT-4 I think doesn't in many cases), just doesn't need to reflect those criteria. And we know that we're going to be able to get models to say what we want to hear, right? That is the magic of gradient descent.

    16. DP

      Yeah.

    17. JC

      You know, modulo some difficulties with capabilities, you can get a model to output the behavior that you want. If it doesn't, then you crank it till it does, right?

    18. DP

      (laughs)

    19. JC

      And I think everyone admits that suitably sophisticated models are gonna have a very detailed understanding of human morality. But the question is what relationship there is between a model's verbal behavior, which you've essentially clamped (you're like, "The model must say blah things"), and the criteria that end up influencing its choice between plans. And there, I'm pretty cautious about saying, "Well, when it says the thing I forced it to say, or gradient-descended it such that it says, that's a lot of evidence about how it's gonna choose in a bunch of different scenarios." (laughs) For one thing, even with humans, it's not necessarily the case that their verbal behavior reflects the actual factors that determine their choices. They can lie. They may not even know what they would do in a given situation.

    20. DP

      I think it is interesting to think about this in the context of humans, because there is that famous saying, "Be careful who you pretend to be, because you are who you pretend to be." And you do notice this. It's what culture does to children: your parents will punish you if you start saying things that are not consistent with your culture's values, and over time you will become like your parents, right? By default, it seems like it kind of works. (laughs)

    21. JC

      (laughs)

    22. DP

      And even with these models, it seems like it kind of works. It's like they don't really scheme against us. Why would this happen?

    23. JC

      You know, for folks who are unfamiliar with the basic story, maybe folks are like, "Wait, why are they taking over at all?"

    24. DP

      (laughs)

    25. JC

      Like, is there literally any reason that they would do that? So the general concern is this: if you're really offering someone power, especially power for free, well, power almost by definition is useful for lots of values. And if we're talking about an AI that really has the opportunity to take control of things, and some component of its values is focused on some outcome, like the world being a certain way, especially in a longer-term way, such that the horizon of its concern extends beyond the period that the takeover plan would encompass, then the thought is that it's just often the case that the world will be more the way you want it if you control everything than if you remain the instrument of the human will, or of some other actor-

    26. DP

      Yeah.

    27. JC

      ... which is sort of what we're hoping these AIs will be. So that's a very specific scenario. If we're in a scenario where power is more distributed, and especially where we're doing decently on alignment, right, and we're giving the AI some amount of inhibition about doing different things-

    28. DP

      Yup.

    29. JC

      ... and maybe we're succeeding in shaping their values somewhat.

    30. DP

      Yup.

  2. 44:04 - 46:43

    Monkeys Inventing Humans

    1. JC

      I think... (laughs) It sounds to me like the thing you're thinking is something more like this: we end up feeling, "Gosh, we wish we had paid no attention to the motives of our AIs, that we'd thought not at all about their impact on our society as we incorporated them," and that instead we had pursued, let's call it, a kind of "maximize for brute power" option.

    2. DP

      Yeah.

    3. JC

      (laughs) Which is: just make a beeline for whatever is the most powerful AI you can build, and don't think about anything else. Okay, so I'm very skeptical that that's-

    4. DP

      Mm.

    5. JC

      ... what we're gonna wish.

    6. DP

      A common example that's given of misalignment is humans from evolution, and you have one line in your series: "Here's a simple argument for AI risk: monkeys should be careful before inventing humans." The paper clipper metaphor implies something really banal and boring with regards to misalignment. And I think if I'm steelmanning the people who worship power, they have this sense that humans got misaligned and started pursuing other things. If a monkey was creating them (this is a weird analogy, because obviously monkeys didn't create humans), they're not thinking about bananas all day. They're thinking about other things. On the other hand, they didn't just make useless stone tools and pile them up in caves in a paper clipper fashion. All these things emerged because of their greater intelligence which were misaligned with evolution: creativity and love and music and beauty and all the other things we value about human culture. And the prediction maybe they have, which is more of an empirical statement than a philosophical statement, is, "Listen, with greater intelligence, even if it's misaligned, it will be misaligned in this kind of way: things that are alien to humans, but alien in the way humans are alien to monkeys, not in the way that a paper clipper is alien to a human."

    7. JC

      Cool, so I think there's a bunch of different things to potentially unpack there. One conceptual point that I want to name off the bat (I don't think you're necessarily making this mistake, but I want to name it as a possible mistake in this vicinity) is that I think we don't want to engage in the following form of reasoning. Let's say you have two entities. One is in the role of creator and one is in the role of creation, and then we're positing that there's this kind of misalignment relation between them-

    8. DP

      Mm-hmm.

    9. JC

      ... whatever that means, right? And here's a pattern of reasoning that I think you want to watch out for, which is to

  3. 46:43 - 1:22:51

    Nietzsche, C.S. Lewis, and AI

    1. JC

      say, "In my role as creator," or sorry, "In my role as creation," say, say you're thinking of humans in the role of creation relative to an entity like evolution or monkeys or mice or whoever you could imagine inventing humans or something like that, right? You say, "Uh, I'm, qua creation, I'm happy that I was created and happy with the misalignment."

    2. DP

      Mm-hmm.

    3. JC

      "Therefore, if, if I end up in the role of creator and, um, we have a structurally analogous relation in which there is misalignment-

    4. DP

      Yeah.

    5. JC

      ... with some creation, I should expect to be happy with that as well."

    6. DP

      Yeah. There are a couple of philosophers that you brought up in this area who, if you read the works you discuss, actually seem incredibly foresighted in anticipating something like a singularity: our ability to shape a future thing that's different, smarter, maybe better than us. Obviously C.S. Lewis's The Abolition of Man, which we'll talk about in a second, is one example. But even here, here's one passage from Nietzsche which I felt really highlighted this: "Man is a rope stretched between the animal and the Superman, a rope over an abyss, a dangerous crossing, a dangerous wayfaring, a dangerous looking back, a dangerous trembling and halting." Is there some explanation for why... Is it just somehow obvious that something like this is coming, even if you're thinking 200 years ago?

    7. JC

      I think I have a much better grip on what's going on with Lewis-

    8. DP

      Yeah.

    9. JC

      ... than with Nietzsche there, so maybe let's just talk about Lewis-

    10. DP

      Sure.

    11. JC

      ... for a second. (clears throat) We should distinguish two things. There's a version of the singularity that's specifically a hypothesis about feedback loops with AI capabilities.

    12. DP

      Right.

    13. JC

      Um, I don't think that's present-

    14. DP

      Sure.

    15. JC

      ... in Lewis. I think what Lewis is anticipating (and I do think this is a relatively simple forecast) is something like the culmination of the project of scientific modernity. Lewis is looking out at the world and seeing this process of increased understanding of the natural environment, and a corresponding increase in our ability to control and direct that environment. And he's pairing that with a metaphysical hypothesis. Well, his stance on this metaphysical hypothesis is, I think, problematically unclear in the book. But there is this metaphysical hypothesis, naturalism, which says that humans too, and minds, beings, agents generally, are a part of nature. And so, insofar as this process of scientific modernity involves progressively greater understanding of and ability to control nature, it will presumably at some point grow to encompass our own natures, and the natures of other beings that in principle we could create. And Lewis views this as a kind of cataclysmic event and crisis, in particular that it will lead to all these kind of tyrannical-

    16. DP

      Yeah.

    17. JC

      ... behaviors and tyrannical attitudes towards morality and stuff like that. You know, on his view that follows unless you believe in non-naturalism, or in some form of the Tao, which is this kind of objective morality. We can talk about that. But part of what I'm trying to do in that essay is to say no: I think we can be naturalists and also be decent humans who remain in touch with a rich set of norms that have to do with how we relate to the possibility of creating creatures, altering ourselves, et cetera. But I do think his is a relatively simple prediction: science masters nature; humans are part of nature; science masters humans.

    18. DP

      Hmm. And then you also have a very interesting other essay about humans: what should we expect of other humans if this sort of extrapolation held, if they had greater capabilities and so on?

    19. JC

      Yeah. I mean, I think an uncomfortable thing about the conceptual setup at stake in these abstract discussions is this: you have this agent, and it "fooms," which is this sort of amorphous (laughs) process of going from a seed agent to a superintelligent version of itself, often imagined to preserve its values along the way. There's a bunch of questions we can raise about that.

    20. DP

      Right.

    21. JC

      But many of the arguments that people will often give as reasons to be scared of AI are like: "Value is very fragile as you foom. Small differences in utility functions can decorrelate very hard and drive in quite different directions. Agents have instrumental incentives to seek power, and if it were arbitrarily easy to get power, they would take it." These are very general arguments, and it's not just an AI thing, right? (laughs) I mean, it's no surprise: take a thing, make it arbitrarily powerful, such that it's God-emperor of the universe or something. How scared are you of that? Clearly, we should be equally scared of that.

    22. DP

      (laughs)

    23. JC

      Or, I don't know, we should be really scared of that with humans too, right? So part of what I'm saying in that essay is that, in some sense, this is much more a story about balance of power-

    24. DP

      Right.

    25. JC

      ... and about maintaining checks and balances and a distribution of power, period. Not just about humans versus AIs and the differences between human values and AI values. Now, that said, I do think many humans would likely be nicer if they foomed than certain types of AIs would be. But the conceptual structure of the argument leaves it a very open question how much it applies to humans as well.

    26. DP

      I think one big question I have is, and I don't even know how to express this, how confident are we in this ontology of agents and capabilities? How do we know this is the thing that's happening, or that this is the way to think about what intelligences are?

    27. JC

      So it's clearly this kind of very janky-

    28. DP

      Yeah.

    29. JC

      ... kind of... I mean, well, people maybe disagree about this. I think it's obvious to everyone with respect to real-world human agents-

    30. DP

      Right.

Episode duration: 2:31:12
