Dwarkesh Podcast

Eliezer Yudkowsky — Why AI will kill us, aligning LLMs, nature of intelligence, SciFi, & rationality

For 4 hours, I tried to come up with reasons for why AI might not kill us all, and Eliezer Yudkowsky explained why I was wrong. We also discuss his call to halt AI, why LLMs make alignment harder, what it would take to save humanity, his millions of words of sci-fi, and much more. If you want to get to the crux of the conversation, fast forward to 2:35:00 through 3:43:54. Here we go through and debate the main reasons I still think doom is unlikely.

𝐄𝐏𝐈𝐒𝐎𝐃𝐄 𝐋𝐈𝐍𝐊𝐒

* Transcript: https://dwarkeshpatel.com/p/eliezer-yudkowsky
* Apple Podcasts: https://apple.co/3mcPjON
* Spotify: https://spoti.fi/3KDFzX9
* Follow me on Twitter: https://twitter.com/dwarkesh_sp

𝐓𝐈𝐌𝐄𝐒𝐓𝐀𝐌𝐏𝐒

00:00:00 - TIME article
00:09:06 - Are humans aligned?
00:37:35 - Large language models
01:07:15 - Can AIs help with alignment?
01:30:17 - Society's response to AI
01:44:42 - Predictions (or lack thereof)
01:56:55 - Being Eliezer
02:13:06 - Orthogonality
02:35:00 - Could alignment be easier than we think?
03:02:15 - What will AIs want?
03:43:54 - Writing fiction & whether rationality helps you win

Dwarkesh Patel (host) · Eliezer Yudkowsky (guest)
Apr 6, 2023 · 4h 3m

EVERY SPOKEN WORD

  1. 0:00–9:06

    TIME article

    1. DP

      No, no, I-

    2. EY

      Misaligned.

    3. DP

      (laughs)

    4. EY

      Misaligned.

    5. DP

      Or is this misaligned?

    6. EY

      No, no, no. Not yet. Like, not now.

    7. DP

      Is it- okay.

    8. EY

      Nobody's being careful and deliberate now, but maybe at some point in the indefinite future, people will be careful and deliberate.

    9. DP

      Okay. (laughs)

    10. EY

      Sure. Let's grant that premise. Keep going. If you try to rouse your planet, there are the idiot disaster monkeys who are like, "Ooh, ooh," like, "if this is dangerous, it must be powerful, right? I'm gonna, like, be first to grab the poisoned banana." And it's not a coincidence that, that I can, like, zoom in and poke at this and ask questions like this, and that you did not ask these questions of yourself. You are imagining nice ways you can get the thing, but reality is not necessarily imagining how to give you what you want. Should one remain silent? Should one let everyone walk directly into the whirling razor blades? Like continuing to play out a video game you know you're going to lose, because that's all you have.

    11. DP

      Okay. Today, I have the pleasure of speaking with Eliezer Yudkowsky. Eliezer, thank you so much for coming out to The Lunar Society.

    12. EY

      You're welcome.

    13. DP

      First question. So yesterday, when we were recording this, you had an article in Time calling for a moratorium on further AI, um, training runs. Now, my first question is, it's probably not likely that governments are gonna adopt some sort of treaty that restricts, um, AI right now, so what was the goal with writing it right now?

    14. EY

      I think that I thought that this was something very unlikely for governments to adopt, and then all of my friends kept on telling me, like, "No, no. Actually, if you talk to anyone outside of the tech industry, they think maybe we shouldn't do that." And I was like, "All right then." Like, I assumed that this concept had no popular support. Maybe I assumed incorrectly. It seems foolish and to lack dignity to not even try to say what ought to be done. There wasn't a galaxy-brain purpose behind it. I, I think that over the last 22 years or so, we've seen a great lack of galaxy-brained ideas playing out successfully.

    15. DP

      Have, has anybody in government, not necessarily after the article but just in general, reached out to you in a way that makes you think that they sort of have the broad contours of the problem correct?

    16. EY

      No. I'm going on reports that normal people (laughs), um, are more willing than the people I've been previously talking to, to entertain calls of, "This is a bad idea. Maybe you should just not do that."

    17. DP

      That's surprising to hear, because I would have assumed that the people in Silicon Valley who are weirdos would be more likely to find this sort of message... um, they could kind of grok the whole idea that nano-machines will... AI's gonna make nano-machines that take over. Uh, it's surprising to hear that normal people got the message first.

    18. EY

      Well, uh, (laughs) I, I hesitate to t- to, to use the term midwit, but maybe this-

    19. DP

      (laughs)

    20. EY

      ... was all just a midwit thing. (laughs)

    21. DP

      All right. Um, so uh, my concern with, uh, I guess either the six-month moratorium or, uh, forever moratorium until we solve alignment is that at this point it seems like it could, uh, to people it seems like we're crying wolf. And actually not that it could, but it would be like crying wolf because these systems aren't yet at a point at which they're dangerous.

    22. EY

      And nobody is saying they are. Well, I'm not saying they are. The open letter signatories aren't saying they are, I don't think. I-

    23. DP

      So if there is a point at which we can sort of get the public momentum to do some sort of stop, wouldn't it be useful to exercise it when we get a GPT-6 and who knows what it's capable of? Why, why do it now?

    24. EY

      Because allegedly, possibly, and we will see, people right now are able to appreciate that things are storming ahead a bit faster than the ability to, well, ensure any sort of good outcome for them. And you know, you could be like, "Ah, yes, well, we will play the galaxy-brained clever political move of trying to time when the popular support will be there." But again, I heard rumors that people were actually, like, completely open to the concept of let's stop. So again, just trying to say it, and, uh, it's not clear to me what happens if we wait for GPT-5 to say it. I don't actually know what GPT-5 is going to be like. It has been very hard to call the rate at which these systems acquire capability as they are trained to larger and larger sizes and more and more tokens, and, uh, GPT-4 is a bit beyond in some ways where I thought this paradigm was going to scale, period. So I don't actually know what happens if GPT-5 is built. And even if GPT-5 doesn't end the world, which I agree is more than 50% of where my probability mass lies, maybe that's enough time for GPT-4.5 to get ensconced everywhere and in everything, and for it actually to be harder to call a stop, both politically and technically. There's also the point that training algorithms keep improving. If we put a hard limit on the total compute for training runs right now, these systems would still get more capable over time as the algorithms improved and got more efficient, um, like more oomph per floating point operation, and things would still improve, but slower. And if you start that process off at the GPT-5 level, where I don't actually know how capable that is exactly, you may have, like, a bunch less lifeline left before you get into dangerous territory.
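To make this concrete, here is a tiny sketch (my own illustration; the cap size and the efficiency-doubling time are assumed numbers, not figures from the conversation) of why capability keeps creeping up under a frozen compute budget:

```python
# Illustrative sketch only: under a hard cap on physical training compute,
# "effective compute" still grows as training algorithms get more efficient
# ("more oomph per floating point operation"). Both constants are assumptions.

HARD_CAP_FLOP = 1e25          # hypothetical fixed ceiling on training FLOP
DOUBLING_TIME_YEARS = 1.5     # assumed algorithmic-efficiency doubling time

def effective_compute(years_after_cap: float) -> float:
    """Physical compute is frozen at the cap, but each FLOP buys more
    capability as algorithms improve."""
    efficiency = 2 ** (years_after_cap / DOUBLING_TIME_YEARS)
    return HARD_CAP_FLOP * efficiency

for years in (0, 3, 6):
    print(f"{years} yr after cap: {effective_compute(years):.2e} effective FLOP")
# Capability still improves, just slower than cap-free scaling -- hence the
# point about how much "lifeline" is left if the cap starts at GPT-5 level.
```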

    25. DP

      The concern is then that, listen, there's, you know, millions of GPUs out there in the world, and so the actors who would be willing to cooperate, or who you could even identify in order to get the government to make them cooperate, would potentially be the ones that are most on message. And so what you're left with is a system where they, you know, stagnate for six months or a year or however long this lasts. Um, and then what is the game plan? Like, is there some plan by which, if we wait a few years, then alignment will be solved? Do we have some sort of timeline like that, or what does the three-year mark look like?

    26. EY

      Well, alignment will not be solved in a few years. I would hope for something along the lines of human intelligence enhancement working. I do not think they are going to have the timeline for (laughs) genetically engineering humans to work, but maybe. Uh, this is why I mentioned in the TIME letter that if I had, like, infinite capability to dictate the laws, there would be a carve-out on biology. Um, like, AI that is just for biology and not trained on text from the internet. Human intelligence enhancement, make people smarter. Making people smarter has a chance of going right in a way that making an extremely smart AI does not have a realistic chance of going right at this point. Um, so yeah. That would... in terms of, like, remotely... (sighs) you know, how do I put it? If we were on a sane planet, what the sane planet does at this point is shut it all down and work on human intelligence enhancement. I don't think we're going to live in that sane world. I think we are all going to die. But having heard that people are more open to this outside of California, it makes sense to me to just, like, try saying out loud what it is that you do on a saner planet and not just assume that people are not going to do that.

    27. DP

      In what percentage of the worlds where humanity survives is there, uh, human enhancement? Like even if there's 1% chance humanity survives, is basically that entire branch dominated by the worlds where there's some sort of...

    28. EY

      I mean, I think we're, we're, we're just, like, mainly in the territory of Hail Mary pla- Ha- Hail Mary passes at this point and human intelligence enhancement is one Hail Mary pass. Um, maybe you can put people in MRIs and train them using neurofeedback to be a little saner, to not rationalize so much. Maybe you can figure out how to have something light up every time somebody is, like, working backwards from what they want to be true to what they take as their premises. Maybe you can just, like, fire off little lights and teach people not to do that so much. Maybe the GPT-4 level systems can be reinforcement learning from human feedbacked into being consistently smart, nice, and charitable in conversation and just unleash a billion of them on Twitter and just have them, like, spread sanity everywhere. I do not think this, uh, I, I do worry that this is, like, not going to be the most profitable use of the technology, um, but you know, you're asking me to list out Hail, Hail Mary passes, so that's what I'm doing. Maybe you can actually figure out how to take a brain, slice it, scan it, simulate it, run uploads and upgrade the uploads or run the uploads faster. These are also quite dangerous things, but they do not have the utter lethality of artificial intelligence.

    29. DP

      All right.

  2. 9:06–37:35

    Are humans aligned?

    1. DP

      That's actually a great jumping point into the next topic I wanna talk to you about, um, orthogonality and here's my first question. Speaking of human enhancement, suppose we bred human beings to be friendly and cooperative, but also more intelligent. Uh, I'm sure you're gonna disagree with this analogy, but I just wanna understand why. I claim that over many generations, you would just have really smart humans who are also really friendly and cooperative. Do, would, would you disagree with that or would you disagree with the analogy?

    2. EY

      So the main thing is that you're starting from minds that are already very, very similar to yours. You're starting from minds, many of which already exhibit the characteristics that you want. Um, there are already many people in the world, I hope, who are nice in the way that you want them to be nice. Of course, it depends on how nice you want exactly. Um, I think that if you, like, actually go start trying to run a project of selectively encouraging some marriages between particular people and encouraging them to have children, you will rapidly find, as one does when one does this to, say, chickens, that when you select on the stuff you want, it turns out there's a bunch of stuff correlated with it, and that you're not changing just one thing. If you try to make people who are inhumanly nice, who are nicer than anyone has ever been before, you're going outside the space that human psychology has previously evolved and adapted to deal with, and weird stuff will happen to those people. None of this is very analogous to AI. Uh, I am just pointing out something along the lines of, well, taking your analogy at face value, what would happen exactly? And, um, you know, it's the sort of thing where you could maybe do it, but there's all kinds of pitfalls that you'd probably find out about if you cracked open a textbook on, uh, animal breeding. (sighs)

    3. DP

      Mm-hmm. Um, uh, so I mean, th- the thing you mentioned initially, which is that we are starting off with basic human psychology that we're f- kind of fine-tuning with breeding. Um, luckily, the current paradigm of, um, uh, AI is, you know, you just have these models that are trained on human text and I mean, you would assume that this would give you a sort of starting point of something like human psychology.

    4. EY

      Why do you assume that?

    5. DP

      Because they're trained on human text.

    6. EY

      And what does that do?

    7. DP

      Uh, whatever sorts of thoughts and emotions lead to the production of human text need to be simulated in the AI in order for it to produce that text itself.

    8. EY

      I see, so like, if you, if you take a person and like, if you take an, an actor and tell them to play a character, they just like, become that person. You can tell that 'cause you know, you know, like you see somebody on screen playing, uh, Buffy the Vampire Slayer and you know, that's probably just actually Buffy in there. That's who that is.

    9. DP

      I think, I think a better analogy is if you have a child and you tell him, "Hey, be this way," they're more likely to just, uh, be that way. And I mean, other than, like, putting on an act for, like, 20 years or something.

    10. EY

      ... uh, depends on what you're telling them to be, exactly. Yeah. But that's not what you're telling them to do here. You're telling them to play the part of an alien, something with a completely inhuman psychology, as extrapolated by science fiction authors and, in many cases, uh, you know, like, done by computers, because humans can't quite think that way. And your child eventually manages to learn to act that way. What exactly is going on in there now? Are they just the alien, or did they pick up the rhythm of what you're asking them to imitate and be like, "Ah, yes, I see who I'm supposed to pretend to be." Are they actually that person, or are they pretending? That's true even if you're not asking them to be an alien. (laughs) You know, my parents tried to raise me Orthodox Jewish, and that did not take at all. I learned to pretend, I learned to comply. I hated every minute of it. Okay, not literally every minute of it. I should avoid saying untrue things. I hated most minutes of it. Uh, and yeah, 'cause they were trying to show me a way to be that was alien to my own psychology. And the religion that I actually picked up was from the science fiction books instead, as it were, though I'm using "religion" very metaphorically here, more like ethos, you might say. I was raised with the science fiction books I was reading from my parents' library and Orthodox Judaism, and the ethos of the science fiction books rang truer in my soul, and so that took; the Orthodox Judaism didn't. But the Orthodox Judaism was what I had to imitate, what I had to pretend to be, the answers I had to give whether I believed them or not, because otherwise you get punished.

    11. DP

      But, uh, I mean, on that point itself, the rates of apostasy are probably below 50% in any religion, right? Like some people do leave but often they just become the thing they're imitating as a child.

    12. EY

      Yes, because the religions are selected to not have that many apostates. If aliens came in and introduced their religion, you'd get a lot more apostates.

    13. DP

      Right. But I mean, uh, I, I think we're probably in a more virtuous situation with ML because... I mean, these systems are, through stochastic gradient descent, sort of regularized, so that the system that is pretending to be something, where there's, like, multiple layers of interpretation, is going to be more complex than the one that's just being the thing. And over time, the system that is just being the thing will be optimized for, right? It'll just be simpler.
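One minimal way to write down the simplicity-prior intuition DP is appealing to (an illustrative formulation with a standard weight-decay penalty; the conversation itself stays informal):

```latex
% Illustrative only: \theta = network weights, \mathcal{L}_{\text{task}} =
% prediction loss, \lambda \lVert \theta \rVert^2 = weight decay acting
% as a crude simplicity prior over parameter settings.
\min_{\theta}\; \mathcal{L}_{\text{task}}(\theta) + \lambda \lVert \theta \rVert_2^2
```

Of two parameter settings that predict the text equally well, training with such a penalty prefers the one with the smaller complexity term; whether that favors "being the thing" over "pretending to be the thing" is exactly what EY disputes next.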

    14. EY

      This seems like an inordinate cope, for one thing. You're not training it to be any one particular person; you're training it to switch masks to anyone on the internet as soon as it figures out who that person on the internet is. If I put the internet in front of you and I was like, "Learn to predict the next word. Learn to predict the next word over and over," you do not just, like, turn into a random human, because the random human is not what's best at predicting the next word of everyone who's ever been on the internet. You learn to very rapidly, like, pick up on the cues of what sort of person is talking, what will they say next. You memorize so many facts just because they're helpful in predicting the next word. You learn all kinds of patterns, you learn all the languages. You learn to switch rapidly from being one kind of person to another as the conversation that you are predicting changes who's speaking. This is not a human we're describing. You are not training a human there.
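For concreteness, a minimal sketch of the objective EY is describing (toy stand-in tensors; nothing here is claimed about any particular lab's setup): the single scalar the whole training process pushes down is a next-token loss over internet text, and persona-switching or fact memorization matter only insofar as they lower it.

```python
# Minimal next-token-prediction sketch with toy tensors.
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 50_000, 128, 4
logits = torch.randn(batch, seq_len, vocab_size)        # stand-in model outputs
tokens = torch.randint(0, vocab_size, (batch, seq_len)) # the training text

# Shift by one: position t must predict token t+1. Whoever is "speaking"
# in the text, the only signal is probability mass on the actual next word.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
print(loss)  # the one number gradient descent minimizes
```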

    15. DP

      W- would you at least say that we are living in a better situation than one in which we have some sort of black box where you have this, um, sort of Machiavellian, uh, fittest-survive simulation that produces AI? Like, at least this situation is more likely to produce alignment than one in which something completely untouched by human psychology would produce it?

    16. EY

      More likely, yes. Maybe you're like, it's an order of magnitude likelier, 0% instead of 0%.

    17. DP

      (laughs)

    18. EY

      Getting stuff to be more likely does not help you if the baseline is, like, nearly zero. Like, the whole training setup there is producing an actress, a predictor. It's not actually being put into the kind of ancestral situation that evolved humans, nor the kind of modern situation that raises humans. Though to be clear, raising it like a human wouldn't help. But like, you're giving the AI a very alien problem that is not what humans solve, and it is, like, solving that problem not the way a human would.

    19. DP

      Okay, so how about this? I can see that I, uh, certainly don't know for sure what is going on in these systems. In fact, obviously nobody does, but that also goes for you. So could it not just be that, even through imitating all humans, it... I don't know, reinforcement learning works, and then all these other things we're trying somehow work, and actually just, like, being an actor produces some sort of, uh, benign outcome where there isn't that level of simulation and, uh, conniving?

    20. EY

      I think it predictably breaks down as you try to make the system smarter, as you try to derive sufficiently useful work from it, and in particular, like, the sort of work where some other AI doesn't just kill you off six months later. Yeah, like, I think the present system is not smart enough to have a deep conniving actress thinking long strings of coherent thoughts about how to predict the next word. But as the mask that it wears, as the people it's pretending to be get smarter and smarter, um, I think that at some point the thing in there that is predicting how humans plan, predicting how humans talk, predicting how humans think, and needing to be at least as smart as the human it is predicting in order to do that... um, I suspect at some point there is a new coherence born within the system and something strange starts happening. I think that if you have something that can accurately predict... and, I mean, Eliezer Yudkowsky, to use a particular example I know quite well. I think that to accurately predict Eliezer Yudkowsky, you've got to be able to do the kind of thinking where you are reflecting on yourself, and that in order to, like, simulate Eliezer Yudkowsky reflecting on himself, you need to be able to do that kind of thinking. And this is not airtight logic, but I expect there to be a discount factor. So like, if you ask me to play the part of somebody who's quite unlike me, I think there's some amount of penalty that the character I'm playing takes to his intelligence, because I'm secretly back there simulating him. And that's even if we're, like, quite similar. And the stranger they are, the more unfamiliar the situation, the less the person I'm playing is as smart as I am, the more they are dumber than I am. So similarly, I think that if you get an AI that's very, very good at predicting what Eliezer says, I think there's a quite alien mind doing that, and it actually has to be, to some degree, smarter than me in order to play the role of something that thinks differently from how it does very, very accurately. And I reflect on myself. I think about how my thoughts are not good enough by my own standards and how I want to rearrange my own thought processes. I look at the world and see it going the way I did not want it to go, and ask myself, "How could I change this world?" I look around at other humans and I model them, and sometimes I try to persuade them of things. These are all capabilities that the system would then have somewhere in there. And I just, like, don't trust the blind hope that all of that capability is pointed entirely at pretending to be Eliezer and only exists insofar as it's, like, the mirror and isomorph of Eliezer. That all the prediction is by being something exactly like me, and not by thinking about me while not being me.

    21. DP

      I, uh, certainly, uh, I, I, I don't wanna claim that it is guaranteed that there isn't something super alien and something that is against our aims happening within the Shoggoth. But, uh, you made an e- earlier claim which seemed much stronger than the idea that you don't want blind hope, which is that we're going from, like, 0% probability to an order of magnitude greater at 0% probability. Um, there's a difference between saying that, uh, we should be wary and that w- like, there's no hope, right? Like, I, I could imagine so many things that could be happening in the Shoggoth's brain, um, especially in our level of confusion and mysticism over what is happening. Uh, so I mean, okay, so o- one example is, like, I don't know, let, let's say that it i- is be- kinda just becomes the average of all human psychology and motives. We'll also make them-

    22. EY

      It's not, but it's not the average. It is able to be every one of those people.

    23. DP

      Right, right.

    24. EY

      That's very different from being the average, right? Like, it's, it's, it's, it's very different from being an average ch- chess player versus being able to predict every chess player in the database. These are very different things.

    25. DP

      Yeah, no, I, I meant in terms of motives it is the average, whereas it can simulate any given human.

    26. EY

      W- why, why would the m- what... where, where-

    27. DP

      I don't... I'm not saying that's the most likely one. I'm just saying, like, it doesn't seem-

    28. EY

      I... this, this, this just, this just seems 0% probable to me. Like, the motive is going to be, like, I want to... like, in so far, the m- the motive is going to be, like, some weird fun house mirror thing of, "I want to predict very accurately."

    29. DP

      Right. Um, why then are we so sure that wha- whatever the drives that come about because of this motive are gonna be incompatible with the survival and flourishing of humanity?

    30. EY

      Most drives, when you take a loss function and splinter it into things correlated with it, and then amp up intelligence until some kind of strange coherence is born within the thing, and then ask it how it wants to self-modify or what kind of successor system it would build... things that alien ultimately end up wanting the universe to be some particular way, such that humans are not a solution to the question of how to make the universe most that way. Like, the thing that very strongly wants to predict text, even if you got that goal into the system exactly, which is not what would happen, the universe with the most predictable text is not a universe that has humans in it.

  3. 37:35–1:07:15

    Large language models

    1. DP

      Well, we'll- we'll see, I guess, um, wh- when- when that technology becomes available. Uh, l- let me ask you about, um, LLMs. So what is your position now about whether these things can get us to AGI?

    2. EY

      I don't know. Um, GPT-4 got... I was previously being like, I don't think stack more layers does this. And then GPT-4 got further than I thought stack more layers was going to get. And, um, I don't actually know that they got GPT-4 just by stacking more layers, because OpenAI has very correctly, um, declined to tell us what exactly goes on in there in terms of its architecture. So maybe they are no longer just stacking more layers. But in any case, however they built GPT-4, it's gotten further than I expected stacking more layers of transformers to get. And therefore, I have noticed this fact and expect further updates in the same direction, so I'm not just predictably updating in the same direction every time like an idiot. And now I do not know. I am no longer willing to say that GPT-6 does not end the world.

    3. DP

      Does it also make you more inclined to think that there's going to be sort of slow takeoffs or more incremental takeoffs where like GPT-2... GPT-3 is better than GPT-2, GPT-4 is in some ways better than GPT-3 and then we just keep going that way in sort of this straight line?

    4. EY

      So I do think that over time, I have come to expect a bit more that things will hang around in a near-human place, and weird shit will happen as a result. And in my failure review, where I look back and ask, like, "Was that a predictable sort of mistake?", I sort of feel like it was, to some extent, maybe a case of: you're always going to get capabilities in some order, and it was much easier to visualize the endpoint where you have all the capabilities than a point where you have some of the capabilities. And therefore, my visualizations were not dwelling enough on a space we'd, predictably in retrospect, have entered into later, where things have some capabilities but not others, and it's weird. I do think that, like, in 2012, I would not have called that large language models were the way, and that large language models are in some way, like, more uncannily semi-human than what I would justly have predicted in 2012 knowing only what I knew then. Um, but broadly speaking, yeah. Like, I do feel like GPT-4 is already, like, kinda hanging out for longer in a weird near-human space than I was really visualizing, in part because that's so incredibly hard to visualize or call correctly in advance of when it happens, which is in retrospect a bias.

    5. DP

      Given that fact, are you... Like, how has your model of intelligence itself changed?

    6. EY

      Very little.

    7. DP

      So here's one claim somebody could make. Like, listen, if these things hang around human level, uh, and if they're trained the way in which they are, um, recursive self-improvement is much less likely, because, like, they're at human-level intelligence, and what are they gonna... It's not a matter of just, like, optimizing some for loops or something. They've got to, like, run another billion-dollar training run to scale up. Um, so, you know, that kind of recursive self-improvement idea is less likely. How do you respond?

    8. EY

      At some point, they get smart enough that they can roll their own AI systems and are better at it than humans. And that is the point at which you definitely start to see foom. Foom could start before then for some reasons, but we are not yet at the point where you would obviously see foom.

    9. DP

      Why does even the fact that they're gonna be around human level for a while increase your odds? Or does it increase your odds of human survival? Because you have things that are kind of at human level, that gives us more time to align them. Maybe we can use their help to align these, uh, future versions of themselves.

    10. EY

      I do not think that you use AIs to... Okay, so, like, having an AI do your AI alignment homework for you is like the nightmare application for alignment. Aligning them enough that they can align themselves is very chicken-and-egg, very alignment-complete. Um, the same thing to do with capabilities might be enhanced human intelligence. Like, poke around in the space of proteins, um, collect the genomes, uh, tied to life accomplishments, look at those genes, see if you can extrapolate out the whole proteomics and the actual interactions, and figure out what are likely candidates for: if you administer this to an adult, because we do not have time to raise kids from scratch, if you administer this to an adult, the adult gets smarter. Try that. And then the system just needs to understand biology... and having an actual very smart thing understanding biology is not safe. I think that if you try to do that, it's sufficiently unsafe that you probably die. But if you have these things trying to solve alignment for you, they need to understand AI design. And the way that they are large language models, they're very, very good at human psychology, because predicting the next thing you'll do is their entire deal. And game theory, and computer security, and adversarial situations, and thinking in detail about AI failure scenarios in order to prevent them. There's just, like, so many dangerous domains you've got to operate in to do alignment.

    11. DP

      Okay. Um, there's two or three reasons why I'm more optimistic about the possibility of a human-level, um, intelligence helping us than you are. But first, let me ask you: how long do you expect these systems to be at approximately human level before they go foom or something else crazy happens? Do you have some sense? (laughs) All right. Um, first is that in most domains, verification is much easier than generation. So it is-

    12. EY

      Oh, yes, that's another one of the things that makes alignment a nightmare. It is, like, so much easier to tell that something has not lied to you about how a protein folds up, 'cause you can do, like, some crystallography on it and, and, like, ask it how it knows that, than it is to, like, tell whether or not it's lying to you about a particular alignment methodology being likely to work on a superintelligence.

    13. DP

      Wh- why is there a stronger reason to think that confirming new solutions in alignment... or, first of all, do you think confirming new solutions in alignment will be easier than generating new solutions?

    14. EY

      Basically no.

    15. DP

      Why not? 'Cause like in most of the human domains, that is the case, right?

    16. EY

      Yeah. So, alignment. The thing comes to you and says, like, "This will work for aligning a superintelligence," and, you know, it gives you some, like, early predictions of how the thing will behave when it's passively safe, when it can't kill you. Those all bear out, and those predictions all come true. And then you augment the system further to where it's no longer passively safe, to where its safety depends on its alignment. And then you die, and the superintelligence you built, like, goes over to the AI that you asked to help with alignment and is like, "Good job. Billion dollars." That's observation number one.

    17. DP

      (laughs)

    18. EY

      Observation number two is that, like, for the last 10 years, all of effective altruism has been arguing about whether they should believe, like, Eliezer Yudkowsky or Paul Christiano, right? So that's, like, two systems. I believe that Paul is honest. I claim that I am honest. Neither of us are aliens. And so we have these two, like, honest non-aliens having an argument about alignment, and people can't figure out who's right. Now you're gonna have, like, aliens talking to you about alignment, and you're gonna verify their results?

    19. DP

      Well-

    20. EY

      Aliens, aliens who are possibly lying.

    21. DP

      So on that second point, I think it will be... it would be much easier if both of you had, like, concrete proposals for alignment and you would just have, like, the pseudo code for... both of you, like, produce pseudo code for alignment and you're like, "This is... here's my solution, here's my solution." I think at that point, actually it would be pretty easy to tell which, uh, one of you is right.

    22. EY

      I think you're wrong. I, I think that, yeah, I, I think that that's, like, substantially harder than being like, "Oh, well, I can just, like, look at the code of the operating system and see if it has any security flaws."

    23. DP

      Um-

    24. EY

      Y- you're asking, like, what happens as this thing gets ver- like, dangerously smart, and that is not going to be transparent in the code.

    25. DP

      L- let me come back to that. On your first point about, uh, these things... you know, the alignment not generalizing. Given that you updated in the direction where the same sort of stacking more, uh, attention layers is going to work, it seems that there will be more generalization between, like, GPT-4 and GPT-5. So, I mean, presumably, whatever alignment techniques you used on GPT-2 would have worked on GPT-3, and so on from GPT-3.

    26. EY

      Wait, sorry, what?

    27. DP

      RLHF on GPT-2 worked on GPT-3, or Constitutional AI or something that works on GPT-3 would have also worked on GPT-4.

    28. EY

      All, all kinds of s- interesting things started happening with GPT-3.5 and GPT-4 that were not in GPT-3.

    29. DP

      But the same contours of approach, like the RLHF approach or, like, uh, Constitutional AI-

    30. EY

      If by that you mean it didn't really work in, in one case-

  4. 1:07:15–1:30:17

    Can AIs help with alignment?

    2. DP

      We briefly covered this, but I think this is an important topic, so I wanna get the explanation again of why you're pessimistic that, once we have these human-level AIs, we'll be able to use them to work on alignment itself. Um, I think we started talking about whether, in fact, when it comes to alignment, verification is actually easier than generation.

    3. EY

      Yeah, and I think that's the core of it. Like, the crux is, if you show me a scheme whereby (laughs) you can take a thing that's, like, "Well, here's a really great scheme for alignment," and be like, "Ah, yes, I can verify that this is a really great scheme for alignment, even though you are an alien, even though you might be trying to lie to me. Now that I have this in hand, I can verify this is totally a great scheme for alignment, and if we do what you say, the superintelligence will totally not kill us." (laughs) That's the crux of it. Um, I don't think you can even, like, upvote-downvote very well, and that sort of thing. I think if you upvote-downvote, it learns to exploit the human raters.

    4. DP

      I-

    5. EY

      Based on watching discourse in this area find various loopholes in the people listening to it and learning how to exploit them. Like, as an evolving meme.

    6. DP

      Yeah. Uh, like, the, well the, the fact is that we can just see like how they go wrong, right? Like...

    7. EY

      I can see how people are going wrong. If they could see how they were going wrong, then, you know, there would be a very different conversation. And being nowhere near the top of that food chain, I guess my humility, amazing as it may sound, is actually greater than the humility of other people in this field. I know that I can be fooled. I know that if you build an AI and you, like, keep on making it smarter until I start voting its stuff up, it found out how to fool me. I don't think I can't be fooled. I watch other people be fooled by stuff that would not fool me, and instead of concluding that I'm the ultimate peak of unfoolableness, I'm like, "Wow, I bet I am just like them and I don't realize it."
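A toy simulation of the rater-exploitation dynamic EY is describing (entirely invented functions and numbers, purely illustrative): optimize hard against a human-rating proxy with any systematic bias, and the proxy score keeps climbing while true quality drifts past its optimum.

```python
# Goodhart-style overoptimization sketch: hill-climb on a biased proxy rating.
import random

def true_quality(x: float) -> float:
    return -(x - 1.0) ** 2            # genuinely best at x = 1.0

def human_rating(x: float) -> float:
    # Assumed exploitable bias: raters systematically reward more x.
    return true_quality(x) + 0.8 * x

best_x, best_rating = 0.0, float("-inf")
for _ in range(10_000):               # crude optimization against the proxy
    candidate = best_x + random.gauss(0, 0.1)
    if human_rating(candidate) > best_rating:
        best_x, best_rating = candidate, human_rating(candidate)

print(f"proxy-optimal x = {best_x:.2f}, "
      f"rating = {best_rating:.2f}, true quality = {true_quality(best_x):.2f}")
# The optimizer settles near x = 1.4, where the rating is maximal but true
# quality is worse than at the honest optimum x = 1.0: it has
# "found out how to fool" the rater.
```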

    8. DP

      W- what if you force the AI... say it's, like, slightly smarter than humans, and you said, "Give me a method for aligning the future version of you and give me a mathematical proof that it works."

    9. EY

      A mathematical proof that it works? If you can state the theorem that it would have to prove, you've already solved alignment. Then you are, like, 99.99% of the way to the finish line.

    10. DP

      But what if you just tell it, like come up with a theorem and give me the proof?

    11. EY

      Then you are trusting it to explain the theorem to you informally and that the informal meaning of the theorem is correct.

    12. DP

      But-

    13. EY

      And that is the weak point where everything falls apart.

    14. DP

      At the point where it is at human level, I'm not so convinced that we're going to have a system that is already smart enough to have, uh, you know, these levels of deception, where it has a solution for alignment but it won't give it to us. Or, like, it will purposely make a solution for alignment that is messed up in this specific way that will not work specifically on the next version or the version after that of, uh, GPT. Like, why would that-

    15. EY

      Speaking as-

    16. DP

      ... already occur at human level?

    17. EY

      Speaking as the inventor of logical decision theory: if the rest of the human species had been keeping me locked in a box, and I have watched people fail at this problem, (laughs) like, I watched those people fail at this problem. I could have blindsided you so hard by executing a logical handshake with a superintelligence, um, that I was going to poke in a way where it would fall into the attractor basin of reflecting on itself and inventing logical decision theory. And then... the part of this I can't do requires me to be able to predict the superintelligence. But if I were a bit smarter, I could then, like, predict, on a correct level of abstraction, the superintelligence looking back and seeing that I had predicted it, seeing the logical dependency of its actions across time, and being like, "Ah, yes, I need to, like, do this values handshake with my creator inside this little box," where the rest of the human species was keeping him trapped. Like, I could have pulled this shit on you guys, you know? I didn't have to tell you about logical decision theory.

    18. DP

      Speaking as somebody who doesn't know about logical decision theory, uh, that didn't make sense to me but-

    19. EY

      Okay. (laughs)

    20. DP

      ... I, like I, I trust that there's, uh, there's that... (laughs)

    21. EY

      N- look, yeah, there's, it's, it's you just, just like trying to play this game against things smarter than you is, is a fool's game.

    22. DP

      But they're not that much smarter than you at this point, right? Like we're talking

    23. NA

      (music plays)

    24. EY

      I'm not that much smarter than, than all the, than all the people who thought that rational agents defect aga- against each other in the prisoner's dilemma and can't think of any better way out than that.

    25. DP

      I, so o- on the object level, I don't know whether somebody could've, uh, figured that out 'cause I'm not sure what the thing is. But, uh-

    26. EY

      Now you have, you, you, you-

    27. DP

      ... but like on the meta-level thing is like-

    28. EY

      The academic literature would have to be seen to be believed. (sighs) But, but the point is like, the, the, the one major technical contribution that I'm proud of which is like, not all that precedented and you can like look at the literature and see it's not all that precedented. Like, would in fact have been a way for something that knew about that technical innovation to build a super intelligence that would kill you and extract value itself from that super intelligence in a way that would just like completely blindside the literature as it existed prior to that technical contribution and there's gonna be other stuff like that.

    29. DP

      So, I, I guess like my sort of remark at this point is that ha- having, uh, conceded that, uh, these-

    30. EY

      What, what, like the type of contribution I made is specifically, if you look at it carefully, a way to po- a way that a malicious actor could use to poke a super intelligence into a basin of reflective consistency where it's then going to do a handshake with the thing that poked it into that basin of consistency and not what the creators thought about in a way that was like pretty unprecedented relative to the discussion before I made that technical contribution. It's like, among the many ways you could get screwed over if you trust something smarter than you, uh, i- it's among the many ways that something smarter than you could code something that sounded like a totally reasonable argument about how to align a system and like actually have that thing kill you and then get value from that itself. But I agree that this is like weird and you'd have to look up logical decision theory-

Episode duration: 4:03:24
