Dwarkesh Podcast

Joe Carlsmith — Preventing an AI takeover

Chatted with Joe Carlsmith about whether we can trust power/techno-capital, how to not end up like Stalin in our urge to control the future, gentleness towards the artificial Other, and much more. Check out Joe’s excellent essay series on Otherness and control in the age of AGI: https://joecarlsmith.com/2024/01/02/otherness-and-control-in-the-age-of-agi/. Enjoy!

EPISODE LINKS
* Transcript: https://www.dwarkeshpatel.com/p/joe-carlsmith
* Apple Podcasts: https://podcasts.apple.com/us/podcast/joe-carlsmith-otherness-and-control-in-the-age-of-agi/id1516093381?i=1000666255737
* Spotify: https://open.spotify.com/episode/0npJsKzUulSHDVAHumXNtO?si=vyKi0z_CRB6inwUBhIfeFA
* Me on Twitter: https://twitter.com/dwarkesh_sp

SPONSORS
* Bland is an AI agent that automates enterprise phone calls in any language, 24/7. Their technology uses "conversational pathways" for accurate, versatile communication across sales, operations, and customer support. Try Bland at 415-549-9654 or bland.ai. Enterprises can get exclusive access to their advanced model at https://bland.ai
* Stripe is financial infrastructure for the internet. Millions of companies, from Anthropic to Amazon, use Stripe to accept payments, automate financial processes, and grow their revenue. Learn more here: https://stripe.com/

If you’re interested in advertising on the podcast: https://www.dwarkeshpatel.com/p/advertise

TIMESTAMPS
00:00:00 - Understanding the Basic Alignment Story
00:44:04 - Monkeys Inventing Humans
00:46:43 - Nietzsche, C.S. Lewis, and AI
01:22:51 - How Should We Treat AIs?
01:52:33 - Balancing Being a Humanist and a Scholar
02:05:02 - Explore/Exploit Tradeoffs and AI

Joe Carlsmith (guest) · Dwarkesh Patel (host)
Aug 22, 2024 · 2h 31m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00 - 44:04

    Understanding the Basic Alignment Story

    1. JC

      (upbeat music plays) AIs can be more patient. Nazis are more patient. Enemy soldiers have souls, right? We need to learn the art of being both hawk and dove. We're going to transition to a world in which we've created beings that are vastly more powerful than us. Our continued empowerment is just effectively dependent on their motives. I think that's a transition we should not make until we have a very developed science of AI motivations. My actual prediction is that the AIs are gonna be very malleable. If you push an AI towards evil, it'll just go.

    2. DP

      The kinds of things that go visit Andromeda: did you really expect them to privilege whatever inclinations you have because you grew up in the African savanna? Of course they're gonna be weird. Today, I'm chatting with Joe Carlsmith. He's a philosopher, and in my opinion a capital-G Great philosopher. You can find his essays at joecarlsmith.com. So we have GPT-4, and it doesn't seem like a paper clipper kind of thing. It understands human values. In fact, you can have it explain why being a paper clipper is bad, or just ask it, "Tell me your opinions about being a paper clipper. Explain why the galaxy shouldn't be turned into paper clips." Okay, so what is happening such that... we have a system that takes over and converts the world into something valueless?

    3. JC

      One thing I'll just say off the bat: when I'm thinking about misaligned AIs, or the type that I'm worried about...

    4. DP

      Yeah.

    5. JC

      I'm thinking about AIs that have a relatively specific set of properties related to agency, planning, and awareness and understanding of the world. One is the capacity to plan: to make relatively sophisticated plans on the basis of models of the world.

    6. DP

      Yeah.

    7. JC

      And those plans are being evaluated according to criteria.

    8. DP

      Mm-hmm.

    9. JC

      That planning capability needs to be driving the model's behavior. There are models that are in some sense capable of planning, but when they give an output, it's not like that output was determined-

    10. DP

      Yeah.

    11. JC

      ... by some process of planning. Like, here's what will happen if I give this output.

    12. DP

      Yeah.

    13. JC

      And do I want that to happen? The model needs to really understand the world, right? It needs to be like, "Okay, here's what will happen. Here I am. Here's my situation. Here's the politics of the situation." It needs that kind of situational awareness to be able to evaluate the consequences of different plans.

    14. DP

      Yeah.

    15. JC

      I think the other thing is the verbal behavior of these models. When I talk about a model's values, I'm talking about the criteria that end up determining which plans the model pursues, right? And a model's verbal behavior, even if it has a planning process (which GPT-4 I think doesn't in many cases), just doesn't need to reflect those criteria. And we know that we're going to be able to get models to say what we want to hear, right? That is the magic of gradient descent.

    16. DP

      Yeah.

    17. JC

      You know, modulo some difficulties with capabilities, you can get a model to output the behavior that you want. If it doesn't, then you crank it till it does, right?

    18. DP

      (laughs)

    19. JC

      And I think everyone admits that suitably sophisticated models are gonna have a very detailed understanding of human morality. But the question is what relationship there is between a model's verbal behavior, which you've essentially clamped (you're like, "The model must say blah things"), and the criteria that end up influencing its choice between plans. And there, I'm pretty cautious about saying, "Well, when it says the thing I forced it to say, or gradient-descended it such that it says, that's a lot of evidence about how it's gonna choose in a bunch of different scenarios." (laughs) For one thing, even with humans, it's not necessarily the case that their verbal behavior reflects the actual factors that determine their choices. They can lie. They may not even know what they would do in a given situation.

    20. DP

      I think it is interesting to think about this in the context of humans, because there is that famous saying, "Be careful who you pretend to be, because you are who you pretend to be." And you do notice this. It's what culture does to children: your parents will punish you if you start saying things that are not consistent with your culture's values, and over time you will become like your parents, right? By default, it seems like it kind of works. (laughs)

    21. JC

      (laughs)

    22. DP

      And even with these models, it seems like it kind of works. It's like they don't really scheme against us. Why would this happen?

    23. JC

      You know, for folks who are unfamiliar with the basic story, maybe folks are like, "Wait, why are they taking over at all?"

    24. DP

      (laughs)

    25. JC

      Like, is there literally any reason that they would do that? So the general concern is this: if you're really offering someone power, especially power for free, well, power almost by definition is useful for lots of values. And if we're talking about an AI that really has the opportunity to take control of things, and some component of its values is focused on some outcome, like the world being a certain way, especially in a longer-term way, such that the horizon of its concern extends beyond the period that the takeover plan would encompass, then the thought is that it's just often the case that the world will be more the way you want it if you control everything than if you remain the instrument of the human will, or of some other actor-

    26. DP

      Yeah.

    27. JC

      ... which is sort of what we're hoping these AIs will be. So that's a very specific scenario. If we're in a scenario where power is more distributed, and especially where we're doing decently on alignment, right, and we're giving the AI some amount of inhibition about doing different things-

    28. DP

      Yup.

    29. JC

      ... and maybe we're succeeding in shaping their values somewhat.

    30. DP

      Yup.

  2. 44:04 - 46:43

    Monkeys Inventing Humans

    1. JC

      I think... (laughs) It sounds to me like the thing you're thinking is something more like this: we end up feeling, "Gosh, we wish we had paid no attention to the motives of our AIs, that we'd thought not at all about their impact on our society as we incorporated them," and that instead we had pursued, let's call it, a kind of "maximize for brute power" option.

    2. DP

      Yeah.

    3. JC

      (laughs) Which is: just make a beeline for whatever is the most powerful AI you can build, and don't think about anything else. Okay, so I'm very skeptical that that's-

    4. DP

      Mm.

    5. JC

      ... what we're gonna wish.

    6. DP

      A common example that's given of misalignment is humans from evolution, and you have one line in your series: "Here's a simple argument for AI risk: monkeys should be careful before inventing humans." The paper clipper metaphor implies something really banal and boring with regards to misalignment. And I think if I'm steelmanning the people who worship power, they have this sense that humans got misaligned and started pursuing other things. If a monkey was creating them (this is a weird analogy, because obviously monkeys didn't create humans), they're not thinking about bananas all day. They're thinking about other things. On the other hand, they didn't just make useless stone tools and pile them up in caves in a paper clipper fashion. All these things emerged because of their greater intelligence which were misaligned with evolution: creativity and love and music and beauty and all the other things we value about human culture. And the prediction maybe they have, which is more of an empirical statement than a philosophical statement, is, "Listen, with greater intelligence, even if it's misaligned, it will be misaligned in this kind of way: things that are alien to humans, but alien in the way humans are alien to monkeys, not in the way that a paper clipper is alien to a human."

    7. JC

      Cool, so I think there's a bunch of different things to potentially unpack there. One conceptual point that I want to name off the bat (I don't think you're necessarily making this mistake, but I want to name it as a possible mistake in this vicinity) is that I think we don't want to engage in the following form of reasoning. Let's say you have two entities. One is in the role of creator and one is in the role of creation, and then we're positing that there's this kind of misalignment relation between them-

    8. DP

      Mm-hmm.

    9. JC

      ... whatever that means, right? And here's a pattern of reasoning that I think you want to watch out for, which is to

  3. 46:43 - 1:22:51

    Nietzsche, C.S. Lewis, and AI

    1. JC

      say, "In my role as creator," or sorry, "In my role as creation," say, say you're thinking of humans in the role of creation relative to an entity like evolution or monkeys or mice or whoever you could imagine inventing humans or something like that, right? You say, "Uh, I'm, qua creation, I'm happy that I was created and happy with the misalignment."

    2. DP

      Mm-hmm.

    3. JC

      "Therefore, if, if I end up in the role of creator and, um, we have a structurally analogous relation in which there is misalignment-

    4. DP

      Yeah.

    5. JC

      ... with some creation, I should expect to be happy with that as well."

    6. DP

      Yeah. There are a couple of philosophers that you brought up in this area who, if you read the works you discuss, actually seem incredibly foresighted in anticipating something like a singularity: our ability to shape a future thing that's different, smarter, maybe better than us. Obviously C.S. Lewis's The Abolition of Man, which we'll talk about in a second, is one example. But even here, here's one passage from Nietzsche which I felt really highlighted this: "Man is a rope stretched between the animal and the Superman, a rope over an abyss, a dangerous crossing, a dangerous wayfaring, a dangerous looking back, a dangerous trembling and halting." Is there some explanation for why... Is it just somehow obvious that something like this is coming, even if you're thinking 200 years ago?

    7. JC

      I think I have a much better grip on what's going on with Lewis-

    8. DP

      Yeah.

    9. JC

      ... than with Nietzsche there, so maybe let's just talk about Lewis-

    10. DP

      Sure.

    11. JC

      ... for a second. (clears throat) We should distinguish two things. There's a version of the singularity that's specifically a hypothesis about feedback loops with AI capabilities.

    12. DP

      Right.

    13. JC

      Um, I don't think that's present-

    14. DP

      Sure.

    15. JC

      ... in Lewis. I think what Lewis is anticipating (and I do think this is a relatively simple forecast) is something like the culmination of the project of scientific modernity. Lewis is looking out at the world and seeing this process of increased understanding of the natural environment, and a corresponding increase in our ability to control and direct that environment. And he's pairing that with a metaphysical hypothesis. Well, his stance on this metaphysical hypothesis is, I think, problematically unclear in the book. But there is this metaphysical hypothesis, naturalism, which says that humans too, and minds, beings, agents generally, are a part of nature. And so, insofar as this process of scientific modernity involves progressively greater understanding of and ability to control nature, it will presumably at some point grow to encompass our own natures, and the natures of other beings that in principle we could create. And Lewis views this as a kind of cataclysmic event and crisis, in particular that it will lead to all these kind of tyrannical-

    16. DP

      Yeah.

    17. JC

      ... behaviors and tyrannical attitudes towards morality and stuff like that. You know, on his view that follows unless you believe in non-naturalism, or in some form of the Tao, which is this kind of objective morality. We can talk about that. But part of what I'm trying to do in that essay is to say no: I think we can be naturalists and also be decent humans who remain in touch with a rich set of norms that have to do with how we relate to the possibility of creating creatures, altering ourselves, et cetera. But I do think his is a relatively simple prediction: science masters nature; humans are part of nature; science masters humans.

    18. DP

      Hmm. And then you also have a very interesting other essay about humans: what should we expect of other humans if this sort of extrapolation held, if they had greater capabilities and so on?

    19. JC

      Yeah. I mean, I think an uncomfortable thing about the conceptual setup at stake in these abstract discussions is this: you have this agent, and it "fooms," which is this sort of amorphous (laughs) process of going from a seed agent to a superintelligent version of itself, often imagined to preserve its values along the way. There's a bunch of questions we can raise about that.

    20. DP

      Right.

    21. JC

      But many of the arguments that people will often give as reasons to be scared of AI are like: "Value is very fragile as you foom. Small differences in utility functions can decorrelate very hard and drive in quite different directions. Agents have instrumental incentives to seek power, and if it were arbitrarily easy to get power, they would take it." These are very general arguments, and it's not just an AI thing, right? (laughs) I mean, it's no surprise: take a thing, make it arbitrarily powerful, such that it's God-emperor of the universe or something. How scared are you of that? Clearly, we should be equally scared of that.

    22. DP

      (laughs)

    23. JC

      Or, I don't know, we should be really scared of that with humans too, right? So part of what I'm saying in that essay is that, in some sense, this is much more a story about balance of power-

    24. DP

      Right.

    25. JC

      ... and about maintaining checks and balances and a distribution of power, period. Not just about humans versus AIs and the differences between human values and AI values. Now, that said, I do think many humans would likely be nicer if they foomed than certain types of AIs would be. But the conceptual structure of the argument leaves it a very open question how much it applies to humans as well.

    26. DP

      I think one big question I have is, and I don't even know how to express this, how confident are we in this ontology of agents and capabilities? How do we know this is the thing that's happening, or that this is the way to think about what intelligences are?

    27. JC

      So it's clearly this kind of very janky-

    28. DP

      Yeah.

    29. JC

      ... kind of... I mean, well, people maybe disagree about this. I think it's obvious to everyone with respect to real-world human agents-

    30. DP

      Right.

Episode duration: 2:31:12
