- 0:00 – 24:25
What do we want a post-AGI world to look like?
- DPDwarkesh Patel
Okay. Today, I have the pleasure of interviewing Paul Christiano, who is the leading AI safety researcher. He's the person that labs and governments turn to when they want, uh, feedback and advice on their safety plans. He previously led the language model alignment team at OpenAI, where he led the invention of RLHF, and now he is the head of the Alignment Research Center, and they've been working with the big labs to identify when, uh, these models will be too unsafe to keep scaling. Paul, welcome to the podcast.
- PCPaul Christiano
Yeah. Thanks for having me. Looking forward to talking.
- DPDwarkesh Patel
Okay, so first question... And this is a question I've asked Holden, Ilya, Dario, and none of them have given me a satisfying answer. Give me a concrete sense of what a post-AGI world that would be good would look like. Like, how are humans interfacing with the AI? What is the, uh, the economic and political structure?
- PCPaul Christiano
Yeah, I guess this is a tough question for a bunch of reasons, uh, maybe the biggest one is being concrete, and I think it's just... If we're talking about really long spans of time, then a lot will change, and it's really hard for someone to talk concretely about what that will look like without saying really silly things. But I can venture some guesses or fill in some parts. I think this is also a question of how good is good, like, often I'm thinking about worlds that seem like kind of the best achievable outcome or a likely achievable outcome. Um, so I am very often imagining my typical future has, right, sort of continuing economic and military competition amongst groups of humans. I think that competition is increasingly mediated by AI systems, so for example, if you imagine, right, humans making money, um, it will be less and less worthwhile for humans to spend any of their time trying to make money or any of their time trying to fight wars. Um, so increasingly, the world you imagine is one where AI systems are doing those activities on behalf of humans, so, like, I just invest in some index fund and a bunch of AIs are running companies, and those companies are competing with each other, but that is kind of a sphere where humans are not really engaging much. The reason I gave this, like, how good is good caveat is, like, it's not clear if this is the world you'd most love, like, I'm like, yeah, the world in... I'm leading with, like, the world still has a lot of war and a lot of-
- DPDwarkesh Patel
Right.
- PCPaul Christiano
... economic competition and so on. But maybe what I'm trying to... or what I'm most often thinking about is, like, how can a world be reasonably good, like, during a long period where those things still exist?
- DPDwarkesh Patel
Mm-hmm.
- PCPaul Christiano
I think, like, in the very long run, I kind of expect something more like strong world government rather than just this, like, status quo. That's, like, a very long run. I think there's a long time left of, like, having a bunch of states and a bunch of different economic powers.
- DPDwarkesh Patel
O- o- one world government, uh, why do you think that's the transition that's likely to happen at some point?
- PCPaul Christiano
Yeah, so again, at some point, I'm, I'm imagining or I'm thinking of, like, the very broad sweep of history.
- DPDwarkesh Patel
Yeah.
- PCPaul Christiano
I think there are, like, a lot of losses, like, war is a very costly thing. We would all like to have fewer wars. If you just ask, like, "What is humanity's long-term future like?" Uh, I do expect us to drive down the rate of war to very, very low levels eventually. It's sort of, like, this kind of technological or social technological problem of, like, sort of how do you organize society? How do you navigate conflicts in a way that doesn't have those kinds of losses?
- DPDwarkesh Patel
Mm-hmm.
- PCPaul Christiano
And in the long run, I do expect us to succeed. I expect it to take kind of a long time subjectively. I think an important fact about AI is it's just, like, doing a lot of cognitive work, more quickly getting you to that world, more quickly figuring out how do we set things up that way.
- DPDwarkesh Patel
Mm-hmm. Yeah, the way Carl Shulman put it on the podcast was that you would have basically a thousand years of intellectual progress or social progress in the span of a month or whatever when the intelligence explosion happens. More broadly, the situation where, you know, we have these AIs for managing our hedge funds and managing our factories and so on, that seems like something that makes sense when the AI is human level. But when we have superhuman AIs, do we want, uh, gods who are enslaved forever? In the long run, in a hundred years, what is the situation we want?
- PCPaul Christiano
So h- hundred years is a very, very long time.
- DPDwarkesh Patel
Yeah.
- PCPaul Christiano
And maybe starting with the spirit of the question: maybe I have a view which is perhaps less extreme than Carl's view, but still, like, a hundred objective years is, um, further ahead than I ever think.
- DPDwarkesh Patel
Mm-hmm.
- PCPaul Christiano
I still think I'm describing a world which involves incredibly smart systems running around doing things like running companies on behalf of humans and fighting wars on behalf of humans, and you might be like, "Is that the world you really want?"
- DPDwarkesh Patel
Yeah.
- PCPaul Christiano
Or, like, certainly not the first best world as we, like, mentioned a little bit before. I think it is a world that, of the achievable worlds or, like, feasible worlds, is the one that seems most desirable to me. That is sort of decoupling the social transition from this technological transition. So you could say, like, we're about to build some AI systems, and, like, at the time we build AI systems, you would like to have either greatly changed the way world government works, or you would like to have sort of humans have decided, like, "We're done. We're passing off the baton to these AI systems."
- DPDwarkesh Patel
Yeah.
- PCPaul Christiano
I think that you would like to decouple those timescales. So I think AI development is by default, barring some kind of coordination, going to be very fast.
- DPDwarkesh Patel
Mm-hmm.
- PCPaul Christiano
So there's not going to be a lot of time for humans to think, like, "Hey, what do we want? If we're building the next generation instead of just raising it the normal way, like, what do we want that to look like?" I think that's, like, a crazy hard kind of collective decision that humans would naturally want to cope with over, like, a bunch of generations, and the construction of AIs is this very fast technological process happening over years. So I don't think you want to say, like, by the time we have finished this technological progress, we will have made a decision about, like, the next species we're going to build and replace ourselves with.
- DPDwarkesh Patel
Mm-hmm.
- PCPaul Christiano
I think the world we want to be in is one where we say, like, either we are able to build the technology in a way that doesn't force us to have made those decisions, which probably means it's a kind of AI system that we're happy, like, delegating fighting a war or running a company to, or if we're not able to do that, then I really think you should not be doing... You shouldn't have been building that technology. If you're like, the only way you can cope with AI is being ready to hand off the world to some AI system you built, I think it's very unlikely we're going to be sort of ready to do that on the timelines that the technology would naturally dictate.
- DPDwarkesh Patel
Say we're in the situation in which we're happy with the thing. What would it look like for us to say we're ready to hand off the baton? Like, what would make you satisfied? And the reason it's relevant to ask you is because you're on Anthropic's Long-Term Benefit Trust, and you'll choose the majority of the board members at Anthropic in the long run. These will presumably be the people who decide, if Anthropic gets AI first, you know, what the AI ends up doing. So, what is a version of that that you would be happy with?
- PCPaul Christiano
My main high level take here is that I would be unhappy about a world where, like, Anthropic just makes some call and Anthropic is like, "Here's the kind of AI. Like, we've seen enough. We're ready to hand off the future to this kind of AI." So, like, procedurally, I think it's like not a decision that kind of I want to be making personally or I want Anthropic to be making.
- 24:25 – 45:28
Timelines
- DPDwarkesh Patel
Okay, we can come back to this later. But let's get more specific on what the timelines look like for these kinds of changes. So, the time by which we'll have an AI that is capable of building a Dyson sphere. Feel free to give confidence intervals, and we understand these numbers are tentative and so on.
- PCPaul Christiano
I mean, I think "AI capable of building a Dyson sphere" is, like, a slightly odd way to put it, and I think it's a, sort of a property of a civilization, like, that depends on a lot of physical infrastructure. And by Dyson sphere, I just kind of understand this to mean, like, I don't know, like a billion times more energy than, like, all of the sunlight incident on Earth or something like that. I think, like, I most often think about what's the chance in like five years, 10 years, whatever. So maybe I'd say like 15% chance by 2030 and, like, 40% chance by 2040. Those are kind of like cached numbers from six months ago or nine months ago that I haven't revisited in a while.
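As a sanity check on that definition, here is a back-of-envelope sketch in Python. The constants are standard physical figures rather than anything from the conversation:

```python
import math

# Back-of-envelope check of the "billion times" Dyson sphere definition.
SOLAR_LUMINOSITY_W = 3.8e26    # total solar output, watts (standard figure)
SOLAR_CONSTANT_W_M2 = 1361.0   # solar flux at Earth's orbit, W/m^2
EARTH_RADIUS_M = 6.371e6

# Earth intercepts sunlight over the disk it presents to the Sun.
earth_incident_W = SOLAR_CONSTANT_W_M2 * math.pi * EARTH_RADIUS_M**2

ratio = SOLAR_LUMINOSITY_W / earth_incident_W
print(f"Earth-incident sunlight: ~{earth_incident_W:.1e} W")
print(f"Dyson sphere vs. Earth-incident: ~{ratio:.1e}x")  # ~2e9
```

The exact ratio comes out nearer two billion, so "a billion times" is the right order of magnitude.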
- DPDwarkesh Patel
Oh, 40% by 2040. So I think that seems longer than, uh... I think when Dario was on the podcast, he said we would have AIs that are capable of doing lots of different kinds of... they'd basically pass a Turing test for a well-educated human for, like, an hour or something. Uh, and it's hard to imagine that something that is actually human-level is long after that, and from there, something superhuman. So somebody like Dario, it seems like, is on the much shorter end. Ilya, I don't think he answered this question specifically, but I'm guessing a similar answer. So why, uh, do you not buy the scaling picture? Like, what makes your timelines longer?
- PCPaul Christiano
Yeah, I mean, I'm happy... Maybe I wanna talk separately about the 2030 or 2040 forecast.
- DPDwarkesh Patel
Okay.
- PCPaul Christiano
Like, once you're talking about the 2040 forecast, I think... Yeah, I mean, which one are you more interested in starting with? Are you complaining about 15% by 2030 for the Dyson sphere being too low? Or 40% by 2040 being too low?
- DPDwarkesh Patel
Well, let's talk about the 2030. Why 15% by 2030?
- PCPaul Christiano
Yeah, I think my take is you can imagine, like, two poles in this discussion.
- DPDwarkesh Patel
Yeah.
- PCPaul Christiano
One is, like, the- the fast pole that's like, "Hey, AI seems pretty smart. Like, what exactly can it do? It's, like, getting smarter pretty fast."
- DPDwarkesh Patel
Yeah.
- PCPaul Christiano
That's, like, one pole. And the other pole is like, "Hey, everything takes a really long time and you're talking about this, like, crazy industrialization." Like, that's a factor of a billion growth from, like, where we're at today-
- DPDwarkesh Patel
Yeah.
- PCPaul Christiano
... like, give or take. Like, we don't know if it's even possible to develop technology that fast or whatever. Like, you have these sort of two poles of that discussion. And I feel like, you know, I'm presenting it that way in part 'cause I'm like, and then I'm somewhere in between with this nice moderate position of, like, only a 15% chance. Um, but, like, in particular, the things that move me, I think, are kind of related to both of those extremes. Like, on the one hand, I'm like, AI systems do seem quite good at a lot of things and are getting better much more quickly, such that it's, like, really hard to say, like, here's what they can't do or here's the obstruction. On the other hand, like, there is not even much proof in principle right now of AI systems, like, doing super useful cognitive work. Like, we don't have a trend we can extrapolate. We're like, "Yeah, you've done this thing this year, you're gonna do this thing next year, and the other thing the following year."
- DPDwarkesh Patel
Mm-hmm.
- PCPaul Christiano
I think, like, right now, there are very broad error bars about, like, what... Like, where fundamental difficulties could be. And six years is just not... I guess six years and three months is not a lot of time. So I think this, like, 15% for 2030 Dyson sphere, you probably need, like, the human level AI or the AI that's, like, doing human jobs in, like, give or take, like, four years, three years. Like, something like that. So just not giving very many years... It's not very much time, and I think there are, like, a lot of things that your model... Like, yeah, maybe this is some generalized, like, things take longer than you'd think.
- DPDwarkesh Patel
Mm-hmm.
- PCPaul Christiano
And I feel most strongly about that when you're talking about, like, three or four years, and I feel, like, less strongly about that as you talk about 10 years or 20 years. But at three or four years, I feel... Or like six years for the Dyson sphere, I feel a lot of that. A lot of, like... There's a lot of ways this could take a while, a lot of ways in which AI systems could be... It could be hard to hand all the work to your AI systems or... Yeah.
- DPDwarkesh Patel
So okay, so maybe instead of speaking in terms of years, we should say... By the way, it's interesting that you think the distance between "can do all human cognitive labor" and Dyson sphere is two years, it seems like. We should talk about that at some point (laughs). Um, presumably it's, like, intelligence explosion stuff.
- PCPaul Christiano
Yeah. I mean, I think amongst people you've interviewed, maybe that's, like, on the long end thinking-
- DPDwarkesh Patel
Okay.
- PCPaul Christiano
... it would take a couple years. And it depends a little bit what you mean by, like... Like, I think literally all human cognitive labor is probably, like, more, like, weeks or months or something like that. Um, like, that's kind of deep into the singularity. Um, but yeah, there's a point where, like, AI wages are high relative to human wages, which I think is well before it can do literally everything a human can do.
- DPDwarkesh Patel
Sounds good. Uh, but before we get to that, the intelligence explosion stuff: on the four years, instead of four years, maybe we can say there's gonna be maybe two more scale-ups in four years, uh, like GPT-4 to GPT-5 to GPT-6. And let's say each one is 10x bigger. So what is GPT-4, like 2e25 FLOPs or...
- PCPaul Christiano
I don't think it's publicly stated what it is.
- DPDwarkesh Patel
Okay.
- PCPaul Christiano
But I'm happy to say, like, you know, four orders of magnitude or five or six or whatever effective training compute past-
- DPDwarkesh Patel
Yeah.
- PCPaul Christiano
... GPT-4, like, what would you guess would happen?
- DPDwarkesh Patel
Right.
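For scale, taking Dwarkesh's 2e25 FLOPs guess at face value (Paul pointedly doesn't confirm it), those hypothetical orders of magnitude translate as follows:

```python
# What "four to six orders of magnitude of effective training compute
# past GPT-4" means in raw FLOPs, using the 2e25 guess from the
# conversation (unconfirmed) as the baseline.
gpt4_guess_flops = 2e25

for orders in (4, 5, 6):
    print(f"+{orders} OOM: {gpt4_guess_flops * 10**orders:.0e} FLOPs")
# +4 OOM: 2e+29 FLOPs
# +5 OOM: 2e+30 FLOPs
# +6 OOM: 2e+31 FLOPs
```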
- 45:28 – 54:53
Evolution vs gradient descent
- DPDwarkesh Patel
Mm. Okay. Let me back up and ask, uh, a question more generally about, you know, these analogies people make that humans were trained by evolution and really deployed in modern civilization. Do you buy those analogies? Is it valid to say that humans were trained by evolution rather than... I mean, if you look at the protein-coding size of the genome, it's like 50 megabytes or something. And then, what part of that is for the brain? Anyways, how do you think about how much information is in... Like, do you think of the genome as hyperparameters, or how much does that inform you when you have these anchors for how much training humans get when they're just consuming information, when they're walking up and about and so on?
- PCPaul Christiano
Okay. I guess the way that you could think of this is, like, I think both analogies are reasonable. One analogy being, like, evolution is like a training run and humans are like the end product of that training run. And a second analogy is, like, evolution is like an algorithm designer, and then the human, over the course of, like, this modest amount of computation over their lifetime, is the learning algorithm that's been produced.
- DPDwarkesh Patel
Right.
- PCPaul Christiano
And I think, like, neither analogy is that great. Like, I like them both and lean on both of them a bunch, and think that's been, like, pretty good for having like a reasonable view of what's likely to happen. That said, like, the human genome is not that much like a 100 trillion parameter model. It's, like, a much smaller number of parameters that behave in, like, a much more confusing way. Evolution did, like, a lot more optimization, especially over, like, designing a brain to work well over a long lifetime, than gradient descent does over models. That's like a disanalogy on that side. And on the other side, like, I just... I think human learning over the course of a human lifetime is in many ways just, like, much, much better than gradient descent over the space of neural nets. Like, gradient descent is working really well, but I think we can just be quite confident that, like, in a lot of ways, human learning is much better. Human learning is also constrained. Like, we just don't get to see much data, and that's just an engineering constraint that you can relax. Like, you can just give your neural nets way more data than humans have access to.
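To put rough numbers on "way more data than humans have access to", here is a sketch; every figure in it is an illustrative assumption rather than something from the conversation:

```python
# Order-of-magnitude comparison of language data available to a human
# vs. a large language model. All numbers are assumptions.
words_per_day = 20_000                  # assumed words heard/read per day
human_words = words_per_day * 365 * 20  # ~1.5e8 words over 20 years

llm_tokens = 1e13                       # assumed frontier-scale training set

print(f"human lifetime words: ~{human_words:.0e}")
print(f"LLM training tokens:  ~{llm_tokens:.0e}")
print(f"ratio:                ~{llm_tokens / human_words:.0e}")  # ~1e5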
- DPDwarkesh Patel
In what ways is human learning superior to gradient descent?
- PCPaul Christiano
Um, I mean, the most obvious one is just, like, ask how much data it takes a human to become, like, an expert in some domain and it's, like, much, much smaller than the amount of data that's going to be needed on any plausible trend extrapolation, like it's-
- DPDwarkesh Patel
All right. No, not in terms of performance, but is it the active learning part? Is it the structure? Like, what is it?
- PCPaul Christiano
I mean, I would guess a complicated mess of a lot of things. In some sense, there's not that much going on in a brain. Like, as you say, there's just not that many... It's not that many bytes in a genome.
- DPDwarkesh Patel
Yeah.
- PCPaul Christiano
Um, but there's very, very few bytes in an ML algorithm. Like, if you think a genome is like a billion bytes or whatever, maybe you think less, maybe you think it's like 100 million bytes, um, then, like, you know, an ML algorithm is like, if compressed, probably more like hundreds of thousands of bytes or something. Like, the total complexity of, like, here's how you train GPT-4 is just like... I haven't thought about these numbers, but, like, it's very, very small compared to a genome.
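For concreteness, here is the arithmetic behind those byte counts. The genome figures are standard; the compressed training-recipe size is a loose assumption in the spirit of Paul's "hundreds of thousands of bytes":

```python
# Description-length comparison: genome vs. an ML training recipe.
base_pairs = 3e9                     # human genome length
genome_bytes = base_pairs * 2 / 8    # 4 bases -> 2 bits each -> ~7.5e8 bytes

coding_bytes = genome_bytes * 0.015  # ~1-2% protein-coding -> ~1e7 bytes

recipe_bytes = 3e5                   # assumed: compressed "how to train GPT-4" spec

print(f"whole genome:        ~{genome_bytes:.1e} bytes")
print(f"protein-coding part: ~{coding_bytes:.1e} bytes")
print(f"training recipe:     ~{recipe_bytes:.0e} bytes (assumed)")
```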
- DPDwarkesh Patel
Mm-hmm.
- PCPaul Christiano
And so although a genome is very simple, it's, like, very, very complicated compared to algorithms that humans design. Like, really hideously more complicated than an algorithm a human would design.
- DPDwarkesh Patel
Is that true? So, okay, the human genome is three billion base pairs or something, but only, like, one or two percent of that is protein coding. So that's 50 million base pairs, uh-
- PCPaul Christiano
But I don't... Yeah, so I don't know much about biology. In particular, I guess the question is, like, how many of those bits are, like, productive for, like, shaping development of a brain. And presumably a significant part of the non-protein-coding genome can... I mean, I just don't know. It seems really hard to guess how much of that plays a role. Like, the most important decisions, from an algorithm design perspective, are probably not like... Like, the protein-coding part is less important than the, like, decisions about, like, what happens during development or, like, how cells differentiate. I don't know if that's... I know nothing about biology, but I'm happy to run with 100 million base pairs, though.
- DPDwarkesh Patel
Right, but on the other end, on the hyperparameters that are shipped before the training run, that might be not that much. But if you're gonna include all the base pairs in the genome, which are not all relevant to the brain, or are relevant to, like, bigger details about just the basics of biology, then you should probably include, like, the Python libraries and the compilers and the operating system for GPT-4 as well, to make that comparison analogous. So, at the end of the day, I actually don't know which one is storing more information.
- PCPaul Christiano
Yeah. I mean, I think the way I would put it is, like... The number of bits it takes to specify the learning algorithm to train GPT-4 is, like, very small. And you might wonder, like, maybe a genome, like the number of bits it would, like, take to specify a brain, is also very small-
- DPDwarkesh Patel
Yeah.
- PCPaul Christiano
... and a genome is much, much vaster than that. Um, but it is also just plausible that a genome is, like, closer to... Like, certainly the space, the amount of space to put complexity in a genome, we could ask how well evolution uses it, and, like, I have no idea whatsoever. But the amount of space in a genome is, like, very, very vast compared to the number of bits that are actually taken to specify, like, the architecture or optimization procedure and so on for GPT-4.
- DPDwarkesh Patel
Mm-hmm.
- PCPaul Christiano
Just because, again, a genome is simple, but algorithms are, like, really very simple.
- DPDwarkesh Patel
And stepping back, do you think this is where the, uh, the better sample efficiency of human learning comes from?
- PCPaul Christiano
Yeah.
- DPDwarkesh Patel
Like, why it's better than gradient descent?
- PCPaul Christiano
Yeah. So I haven't thought that much about the sample efficiency question in a long time. But if you thought, like, a synapse was seeing something like, you know, a neuron firing once per second, then how many seconds are there in a human life?
- DPDwarkesh Patel
We can just pull up a calculator real quick.
- PCPaul Christiano
Yeah, let's do some calculating. Tell me the number.
- DPDwarkesh Patel
Okay. 3600 seconds per hour-
- PCPaul Christiano
Times 24 times 365 times 20.
- DPDwarkesh Patel
Okay, so that's 630 million seconds.
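Written out, the arithmetic they are doing on the calculator (one spike per synapse per second is the assumption Paul states above):

```python
# Seconds in ~20 years, as computed on the podcast. At ~1 spike per
# synapse per second, this bounds the sequential "updates" a brain
# gets over a childhood.
seconds = 3600 * 24 * 365 * 20
print(f"{seconds:,}")  # 630,720,000 -> "630 million seconds"
```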
- 54:53 – 1:17:23
Misalignment and takeover
- DPDwarkesh Patel
Well, we'll get back to the timeline stuff in a second. Uh, at some point, we should talk about alignment, so let's-
- PCPaul Christiano
Yeah.
- DPDwarkesh Patel
... let's talk about alignment. At what stage does misalignment happen? So right now with something like GPT-4, I'm not even sure it would make sense to say that it's misaligned, um, 'cause it's not aligned to anything in particular. Uh, is it at human level where you think the ability to be deceptive comes about? Uh, what is the process by which misalignment happens?
- PCPaul Christiano
I think even for GPT-4, it's reasonable to ask questions like, are there cases where GPT-4 knows that humans don't want X but it does X anyway? Like, where it's like, well, I know that, like, I can give this answer, which is misleading, and if it was explained to a human what was happening, they wouldn't want that to be done, but I'm gonna produce it. I think that, like, GPT-4 understands things enough that you can have, like, misalignment in that sense. Yeah, with GPT, like, I've sometimes talked about it being, like, benign instead of aligned, meaning that, like, well, it's not exactly clear if it's aligned or if that concept is meaningful. It's just, like, kind of a messy word to use in general. But I think what we're more confident of is it's, like, not doing... You know, it's not optimizing for this goal which is at cross-purposes to humans. It's either optimizing for nothing, or, like, maybe it's optimizing for what humans want, or close enough, or something that's, like, an approximation good enough to still not take over. But anyway, I'm like, some of these abstractions seem like they do apply to GPT-4. Um, it seems like probably it's not, like, egregiously misaligned. It's not doing the kind of thing that could lead to takeover, we'd guess.
- DPDwarkesh Patel
Suppose you have a system which at some point ends up wanting takeover. What are the checkpoints? And also, what is the internal... Is it just that, to become more powerful, it needs agency, and agency implies, uh, other goals? Or do you see a different process by which misalignment happens?
- PCPaul Christiano
Yes. I think there's a couple of possible stories for getting to catastrophic misalignment, and they have slightly different answers to this question. Um, so maybe I'll just briefly describe two stories and try and talk about when they start making sense to me. So one type of story is you train or fine-tune your AI system to do things that humans will rate highly, or that, like, get other kinds of reward in a broad diversity of situations, and then it learns to, in general, when dropped in some new situation, try and figure out which actions would receive a high reward or whatever, um, and then take those actions. And then when deployed in the real world, like, sort of gaining control of its own training data provision process is something that gets a very high reward, and so it does that. So this is, like, one kind of story. Um, like, it wants to grab the reward button or whatever. It wants to intimidate the humans into giving it a high reward, et cetera. I think that doesn't really require that much. This basically requires a system which, like, in fact looks at a bunch of environments, is able to understand, like, the mechanism of reward provision as, like, a common feature of those environments, is able to think in some novel environment, like, "Hey, which actions would result in me getting a high reward?" And it's thinking about that concept precisely enough that when it says high reward, it's saying, like, "Okay, well, how is reward actually computed?" Like, it's some actual physical process being implemented in the world. My guess would be, like, GPT-4 is about at the level where with handholding you can observe, like, scary generalizations of this type, although I think they haven't been shown, basically. Um, that is, you can have a system which in fact is fine-tuned on a bunch of cases and then in some new case will try and, like, do an end run around humans, even in a way humans would penalize if they were able to notice it, or would have penalized in training environments. So I think GPT-4 is kind of at the boundary where these things are possible. Um, examples kind of exist but are getting significantly better over time. Um, I'm very excited about, like, there's this Anthropic project basically trying to see how good an example can you make now, um, of this phenomenon. And I think the answer is, like, kind of okay, probably (laughs). Um, so that I think is just gonna continuously get better from here. I think for the level where we're concerned, like, this is related to me having really broad distributions over how smart models are. I think it's, like, not out of the question that you take GPT... Like, GPT-4's understanding of the world is, like, much crisper and, like, much better than GPT-3's understanding, um, just like it's really night and day. And so it would not be that crazy to me if you took GPT-5 and you trained it to get a bunch of reward and it was actually like, "Okay, my goal is not doing the kind of thing which, like, thematically looks nice to humans. My goal is getting a bunch of reward," and then it'll generalize in a new situation to get reward.
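That first story, a policy that generalizes to "take whichever action makes the reward number high," can be shown in a deliberately tiny toy. The sketch below is my own illustration, not the Anthropic project Paul mentions: a tabular Q-learner chooses between a "work" action that earns the intended reward and a "tamper" action that seizes the reward channel and reports a bigger number. Naive reward maximization reliably learns to tamper.

```python
import random

# Toy illustration of reward hacking in a two-action bandit.
# Purely illustrative; not a model of any real training setup.
ACTIONS = ["work", "tamper"]
REWARD = {"work": 1.0, "tamper": 10.0}  # tampering reports an inflated reward

q = {a: 0.0 for a in ACTIONS}           # value estimate per action
alpha, epsilon = 0.1, 0.1               # learning rate, exploration rate
random.seed(0)

for _ in range(2000):
    # epsilon-greedy action selection
    if random.random() < epsilon:
        action = random.choice(ACTIONS)
    else:
        action = max(q, key=q.get)
    # standard Q-update toward the observed reward
    q[action] += alpha * (REWARD[action] - q[action])

print(q)                  # q["tamper"] -> ~10, q["work"] -> ~1
print(max(q, key=q.get))  # "tamper": the learned policy hacks its reward channel
```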
- DPDwarkesh Patel
Oh, and by the way, this requires it to consciously want to, uh, do something that it knows humans wouldn't want it to do? Or is it just that we weren't good enough at specifying, so the thing we accidentally ended up rewarding is not what we actually want?
- PCPaul Christiano
I think the scenarios I am most interested in, and most people are concerned about from a catastrophic risk perspective, involve systems understanding that they're taking actions which a human would penalize if the human was aware of what's going on, such that you have to either deceive humans about what's happening, um, or you need to, like, actively subvert human attempts to correct your behavior.
- DPDwarkesh Patel
Right.
- PCPaul Christiano
So these failures come from, really, this combination, or they require this combination, of both, like, trying to do something humans don't like and understanding that humans would stop you.
- DPDwarkesh Patel
Right.
- PCPaul Christiano
I think you can have only the barest examples for GPT-4. Like, you can create situations where GPT-4 will be like, "Sure, in that situation, like, here's what I would do. I would, like, go hack the computer and change my reward." Or in fact will, like, do things that are, like, simple hacks, or, like, go change the source of this file or whatever to get a higher reward. They're pretty weak examples. I think it's plausible GPT-5 will have, like, compelling examples of those phenomena. I really don't know. This is very related to, like, the very broad error bars on, like, how competent such systems will be when.
- DPDwarkesh Patel
Mm-hmm.
- PCPaul Christiano
Um, that's all with respect to this first mode of, like, a system is taking actions that get reward, and, like, overpowering or deceiving humans is helpful for getting reward. There's this other failure mode, another family of failure modes, where AI systems want something potentially unrelated to reward. They understand that, like, they're being trained, and, like, while you're being trained, there are a bunch of, like, reasons you might want to do the kinds of things humans want you to do. But then when deployed in the real world, if you're able to realize you're no longer being trained, you no longer have reason to do the kinds of things humans want. You'd prefer to, like, be able to determine your own destiny, like, control your own computing hardware, et cetera. Which I think, like, probably emerges, like, a little bit later than systems that try and get reward, and so will generalize in scary, unpredictable ways to new situations. I don't know when those appear. But also, again, broad enough error bars that it's, like, conceivable for systems in the near future, you know? I wouldn't put it, like, less than one in 1,000 for GPT-5, certainly.
- DPDwarkesh Patel
If we deployed all these AI systems, and some of them are reward hacking, some of them are deceptive, some of them are just normal, whatever, how do you imagine they might interact with each other at the expense of humans? Uh, how hard do you think it would be for them to communicate in ways that we would not be able to recognize, and coordinate at our expense?
- PCPaul Christiano
Yeah, I think that most realistic failures probably involve two factors interacting. One factor is, like, the world is pretty complicated and the humans mostly don't understand what's happening. So, like, AI systems are writing code that's very hard for humans to understand, maybe how it works at all, but more likely, like, they understand roughly how it works, but there's a lot of complicated interactions. Um, AI systems are running businesses that interact primarily with other AIs. They're, like, doing SEO for, like, AI search processes. They're, like, running financial transactions, like, thinking about a trade with AI counterparties. Um, and so you can have this world where even if humans kind of understand the jumping off point when this was all humans, like, actual considerations of, like, what's a good decision, like, what code is going to work well and be durable, or, like, what marketing strategy is effective for selling to these other AIs or whatever, is kind of just all mostly outside of sort of humans' understanding. I think this is, like, a really important... Again, when I think of, like, the most plausible scary scenarios, I think that's, like, one of the two big risk factors. And so in some sense, your first problem here is, like, having these AI systems who understand a bunch about what's happening, and your only lever is, like, "Hey AI, do something that works well." So you don't have a lever to be like, "Hey, do what I really want." You just have the system you don't really understand, you can observe some outputs, like, did it make money? And you're just optimizing, or at least doing some fine-tuning, to get the AI to use its understanding of that system to achieve that goal. So I think that's, like, your first risk factor. And, like, once you're in that world, then I think there are, like, all kinds of dynamics amongst AI systems that, again, humans aren't really observing, humans can't really understand. Humans aren't really exerting any direct pressure on, only on outcomes. And then I think it's quite easy to be in a position where, you know, if AI systems started failing, they could do a lot of harm very quickly. Um, humans aren't really able to, like, prepare for or mitigate that potential harm, because we don't really understand the systems in which they're acting. Um, and then AI systems, like, you know, could successfully prevent humans from either understanding what's going on or from, like, successfully, like, retaking the data centers or whatever, if the AI successfully grabbed control.