
Joe Carlsmith — Preventing an AI takeover
Joe Carlsmith (guest), Dwarkesh Patel (host), Narrator
Joe Carlsmith on AI takeover risk, alignment, and moral futures
Dwarkesh Patel and Joe Carlsmith discuss how advanced AI systems could become powerful agents whose behavior is driven by internal motives that may diverge from human interests, and why apparent niceness in current models (like GPT‑4) is not strong evidence of deep alignment.
Carlsmith outlines what makes AI takeover scenarios plausible, why alignment is technically and institutionally hard, and how distributed or multipolar AI development doesn’t automatically solve the control problem unless some systems are genuinely well‑aligned.
They explore broader philosophical questions from Carlsmith’s “Otherness and Control in the Age of AGI” essays: the ethics of trying to control or “enslave” superintelligent beings, how much power humans should voluntarily hand over, and what a good post‑human or AI‑rich future might look like.
The conversation ranges over moral realism, whether advanced minds would converge on shared values, how much weirdness to expect in good futures, the importance of a balance of power rather than single dictators (human or AI), and our obligations to potential AI moral patients.
Key Takeaways
Verbal niceness from models is weak evidence of deep alignment.
Carlsmith stresses that gradient descent can force models to say what we like without those utterances reflecting the internal criteria that actually guide planning and action, so we shouldn’t over‑update on GPT‑4 sounding morally sophisticated.
Dangerous AI requires specific capabilities: planning, situational awareness, and goal‑driven behavior.
The takeover concern focuses on systems that can form sophisticated plans, model the world and their situation, evaluate consequences against internal criteria, and then let that planning machinery actually steer their real‑world behavior.
You can’t safely “test” takeover behavior on‑distribution.
We can’t run an experiment where an AI is genuinely given a real chance to seize the world, watch it do so, and then update its weights; instead we must rely on generalization from training scenarios, which makes alignment conceptually hard.
Multipolar AI development only helps if some systems are truly aligned.
A world where many actors build uncontrollable superintelligences does not guarantee safety: unless at least some powerful AIs are actually working on behalf of human interests, "good AIs will beat bad AIs" is just wishful thinking.
We need a real science of AI motivations before handing over civilization.
Carlsmith argues we are on track to create beings vastly more powerful than us, and our long‑term empowerment will depend on their motives, so transitioning to AI‑run institutions without a developed understanding of model goals is an unacceptable gamble.
Ethical trade‑offs include both preventing takeover and avoiding oppressive control of AIs.
He emphasizes we must hold two thoughts at once: AIs could be genuine moral patients deserving respect, and also potential aggressors; leaning too far into either “gentleness” or “control” risks either being eaten by the grizzly or becoming the Stalin to our creations.
A good future likely involves balance of power and inclusive growth, not a single dictator‑AI.
Carlsmith is skeptical of “pick the right god‑AI” framings and instead endorses a vision where power and agency are broadly distributed—across humans and AIs—with robust checks, pluralism, and processes that allow continued moral reflection and adjustment.
Notable Quotes
“We’re going to transition to a world in which we’ve created these beings that are just, like, vastly more powerful than us. Our continued empowerment is just effectively dependent on their motives.”
— Joe Carlsmith
“When I talk about a model’s values, I’m talking about the criteria that end up determining which plans the model pursues, and a model’s verbal behavior just doesn’t need to reflect those criteria.”
— Joe Carlsmith
“To the extent you haven’t solved alignment, you likely haven’t solved it anywhere.”
— Joe Carlsmith
“Enemy soldiers have souls, right? And so I think we need to learn the art of kind of hawk and dove, both.”
— Joe Carlsmith
“I think we should be aspiring to really kind of leave no one behind—really find who are all the stakeholders here, and how do we have a fully inclusive vision of how the future could be good from a very, very wide variety of perspectives.”
— Joe Carlsmith
Questions Answered in This Episode
How can we empirically study and measure AI “motivations” in a way that goes beyond surface‑level behavior and avoids being gamed by increasingly capable models?
What concrete institutional or governance structures would best preserve a healthy balance of power among states, companies, and AIs during rapid capability gains?
Where should we draw the moral line for when AI systems deserve patient‑like consideration, and how should that change current training practices (e.g., heavy fine‑tuning, constant resets, deletion)?
If good futures will be “weird,” what criteria should we use—beyond our current intuitions—to judge whether a radically transformed future is genuinely better and not just alien?
How should policymakers and labs trade off near‑term competitive pressures for deployment and scaling against the long‑term need to build a rigorous science of AI motivations and control?
Transcript Preview
(upbeat music plays) AIs can be more patient. Nazis are more patient. Enemy soldiers have souls, right? We need to learn the art of kind of hawk and dove, both. We're going to transition to a world in which we've created these beings that are just, like, vastly more powerful than us. Our continued empowerment is just effectively dependent on their motives. I think that's a transition we should not make until we have a, a very developed science of AI motivations. My actual prediction is that the AIs are gonna be very malleable. If you push an AI towards evil, like, it'll just go.
The kinds of things that go visit Andromeda, did you really expect them to privilege whatever inclinations you have because you grew up in the African savanna? Of course, they're gonna be, like, weird. Today, I'm chatting with Joe Carlsmith. He's a philosopher. In my opinion, a capital-G great philosopher. And you can find his essays at joecarlsmith.com. So we have a GPT-4, and it doesn't seem like a paper clipper kind of thing. It understands human values. In fact, if you have it explain, like, why is being a paper clipper bad? Or, like, what, what, just tell me your opinions about being a paper clipper. Like, explain why the galaxy shouldn't be turned into paper clips. Um, okay, so what is happening such that, dot, dot, dot, we have a system that takes over and c- converts the world into something valueless?
One thing I'll just say off the bat, is like, when I'm, when I'm thinking about misaligned AIs, I'm thinking about... Or the type that I'm worried about.
Yeah.
Uh, I'm thinking about AIs that have, um, a relatively specific set of properties related to agency and planning and kind of awareness and understanding of the world. One is this capacity to plan. Um, uh, and kind of make kind of relatively sophisticated plans on the basis of models of the world.
Yeah.
Um, where those plans are being kind of evaluated according to criteria.
Mm-hmm.
Um, that planning capability needs to be driving the model's behavior. So there are models that are sort of in some sense capable of planning, but it's not like, when they give output, it's not like that output was determined-
Yeah.
... by some process of planning. Like, here's what will happen if I give this output.
Yeah.
And do I want that to happen? The model needs to really understand the world, right? It needs to really be like, "Okay, um, here's what will happen. I'm, you know, here I am. Here's my situation. Here's, like, the politics of the situation." Like, really under, like, kind of, uh, having this kind of situational awareness, um, to be able to evaluate the consequences of different plans.
Yeah.
Um, I think the other thing is, like... So th- the verbal behavior of these models, um, I think need bear no... So when I talk about a model's values, I'm talking about the criteria that, that, uh, the, that kind of end up determining which plans the model pursues, right? And a model's verbal behavior, even if it has a planning process, which GPT-4 I think doesn't in many cases, it, its verbal behavior just doesn't, doesn't need to reflect those criteria, right? Um, and so, uh, y- you know, w- we know that we're going to be able to get models to say what we want to hear, right? We... Uh, uh, that is the magic of gradient descent.