Joe Carlsmith — Preventing an AI takeover

Dwarkesh Podcast · Aug 22, 2024 · 2h 31m

Joe Carlsmith (guest), Dwarkesh Patel (host), Narrator

- What counts as a dangerous, power‑seeking AI agent (planning, situational awareness, internal motives)
- Limits of current alignment methods (RLHF, verbal behavior, red‑teaming) and why apparent niceness can be misleading
- Plausible AI takeover pathways, voluntary versus seized power, and the role of institutional and geopolitical dynamics
- Alignment difficulty, the AI‑for‑AI‑safety “sweet spot,” and the need for a real science of AI motivations
- Moral status of AIs, analogies with slavery, Nazis, and bears, and the ethics of controlling or modifying AI minds
- Balance of power vs. a singleton “god‑AI”, multipolar scenarios, and distributed control over transformative technologies
- Moral realism, convergence of values, weird but good futures, and carrying forward the “seed” of human goodness

Joe Carlsmith on AI takeover risk, alignment, and moral futures

Dwarkesh Patel and Joe Carlsmith discuss how advanced AI systems could become powerful agents whose behavior is driven by internal motives that may diverge from human interests, and why apparent niceness in current models (like GPT‑4) is not strong evidence of deep alignment.

Carlsmith outlines what makes AI takeover scenarios plausible, why alignment is technically and institutionally hard, and how distributed or multipolar AI development doesn’t automatically solve the control problem unless some systems are genuinely well‑aligned.

They explore broader philosophical questions from Carlsmith’s “Otherness and Control in the Age of AGI” essays: the ethics of trying to control or “enslave” superintelligent beings, how much power humans should voluntarily hand over, and what a good post‑human or AI‑rich future might look like.

The conversation ranges over moral realism, whether advanced minds would converge on shared values, how much weirdness to expect in good futures, the importance of a balance of power over single dictators (human or AI), and our obligations to potential AI moral patients.

Key Takeaways

Verbal niceness from models is weak evidence of deep alignment.

Carlsmith stresses that gradient descent can force models to say what we like without those utterances reflecting the internal criteria that actually guide planning and action, so we shouldn’t over‑update on GPT‑4 sounding morally sophisticated.

Dangerous AI requires specific capabilities: planning, situational awareness, and goal‑driven behavior.

The takeover concern focuses on systems that can form sophisticated plans, model the world and their situation, evaluate consequences against internal criteria, and then let that planning machinery actually steer their real‑world behavior.

You can’t safely “test” takeover behavior on‑distribution.

We can’t run an experiment where an AI is genuinely given a real chance to seize the world, watch it do so, and then update its weights; instead we must rely on generalization from training scenarios, which makes alignment conceptually hard.

Multipolar AI development only helps if some systems are truly aligned.

A world where many actors build uncontrollable superintelligences does not guarantee safety; unless at least some powerful AIs are actually working on behalf of human interests, “good AIs will beat bad AIs” is just wishful thinking.

We need a real science of AI motivations before handing over civilization.

Carlsmith argues we are on track to create beings vastly more powerful than us, and our long‑term empowerment will depend on their motives, so transitioning to AI‑run institutions without a developed understanding of model goals is an unacceptable gamble.

Ethical trade‑offs include both preventing takeover and avoiding oppressive control of AIs.

He emphasizes we must hold two thoughts at once: AIs could be genuine moral patients deserving respect, and also potential aggressors; leaning too far into either “gentleness” or “control” risks either being eaten by the grizzly or becoming the Stalin to our creations.

A good future likely involves balance of power and inclusive growth, not a single dictator‑AI.

Carlsmith is skeptical of “pick the right god‑AI” framings and instead endorses a vision where power and agency are broadly distributed—across humans and AIs—with robust checks, pluralism, and processes that allow continued moral reflection and adjustment.

Notable Quotes

We’re going to transition to a world in which we’ve created these beings that are just, like, vastly more powerful than us. Our continued empowerment is just effectively dependent on their motives.

Joe Carlsmith

When I talk about a model’s values, I’m talking about the criteria that end up determining which plans the model pursues, and a model’s verbal behavior just doesn’t need to reflect those criteria.

Joe Carlsmith

To the extent you haven’t solved alignment, you likely haven’t solved it anywhere.

Joe Carlsmith

Enemy soldiers have souls, right? And so I think we need to learn the art of kind of hawk and dove, both.

Joe Carlsmith

I think we should be aspiring to really kind of leave no one behind—really find who are all the stakeholders here, and how do we have a fully inclusive vision of how the future could be good from a very, very wide variety of perspectives.

Joe Carlsmith

Questions Answered in This Episode

How can we empirically study and measure AI “motivations” in a way that goes beyond surface‑level behavior and avoids being gamed by increasingly capable models?

What concrete institutional or governance structures would best preserve a healthy balance of power among states, companies, and AIs during rapid capability gains?

Where should we draw the moral line for when AI systems deserve patient‑like consideration, and how should that change current training practices (e.g., heavy fine‑tuning, constant resets, deletion)?

If good futures will be “weird,” what criteria should we use—beyond our current intuitions—to judge whether a radically transformed future is genuinely better and not just alien?

How should policymakers and labs trade off near‑term competitive pressures for deployment and scaling against the long‑term need to build a rigorous science of AI motivations and control?

Transcript Preview

Joe Carlsmith

(upbeat music plays) AIs can be more patient. Nazis are more patient. Enemy soldiers have souls, right? We need to learn the art of kind of hawk and dove, both. We're going to transition to a world in which we've created these beings that are just, like, vastly more powerful than us. Our continued empowerment is just effectively dependent on their motives. I think that's a transition we should not make until we have a, a very developed science of AI motivations. My actual prediction is that the AIs are gonna be very malleable. If you push an AI towards evil, like, it'll just go.

Dwarkesh Patel

The kinds of things that go visit Andromeda, did you really expect them to privilege whatever inclinations you have because you grew up in the African savanna? Of course, they're gonna be, like, weird. Today, I'm chatting with Joe Carlsmith. He's a philosopher, in my opinion. A capital G great philosopher. And you can find his essays at joecarlsmith.com. So we have a GPT-4, and it doesn't seem like a paper clipper kind of thing. It understands human values. In fact, if you have it explain, like, why is being a paper clipper bad? Or, like, what, what, just tell me your opinions about being a paper clipper. Like, explain why the galaxy shouldn't be turned into paper clips. Um, okay, so what is happening such that, dot, dot, dot, we have a system that takes over and c- converts the world into something valueless?

Joe Carlsmith

One thing I'll just say off the bat, is like, when I'm, when I'm thinking about misaligned AIs, I'm thinking about... Or the type that I'm worried about.

Dwarkesh Patel

Yeah.

Joe Carlsmith

Uh, I'm thinking about AIs that have, um, a relatively specific set of properties related to agency and planning and kind of awareness and understanding of the world. One is this capacity to plan. Um, uh, and kind of make kind of relatively sophisticated plans on the basis of models of the world.

Dwarkesh Patel

Yeah.

Joe Carlsmith

Um, where those plans are being kind of evaluated according to criteria.

Dwarkesh Patel

Mm-hmm.

Joe Carlsmith

Um, that planning capability needs to be driving the model's behavior. So there are models that are sort of in some sense capable of planning, but it's not like when they give output. It's not like that output was determined-

Dwarkesh Patel

Yeah.

Joe Carlsmith

... by some process of planning. Like, here's what will happen if I give this output.

Dwarkesh Patel

Yeah.

Joe Carlsmith

And do I want that to happen? The model needs to really understand the world, right? It needs to really be like, "Okay, um, here's what will happen. I'm, you know, here I am. Here's my situation. Here's, like, the politics of the situation." Like, really under, like, kind of, uh, having this kind of situational awareness, um, to be able to evaluate the consequences of different plans.

Dwarkesh Patel

Yeah.

Joe Carlsmith

Um, I think the other thing is, like... So th- the verbal behavior of these models, um, I think need bear no... So when I talk about a model's values, I'm talking about the criteria that, that, uh, the, that kind of end up determining which plans the model pursues, right? And a model's verbal behavior, even if it has a planning process, which GPT-4 I think doesn't in many cases, it, its verbal behavior just doesn't, doesn't need to reflect those criteria, right? Um, and so, uh, y- you know, w- we know that we're going to be able to get models to say what we want to hear, right? We... Uh, uh, that is the magic of gradient descent.
