Dwarkesh Podcast

Dario Amodei (Anthropic CEO) — The hidden pattern behind every AI breakthrough

Here is my conversation with Dario Amodei, CEO of Anthropic. Dario is hilarious and has fascinating takes on what these models are doing, why they scale so well, and what it will take to align them.

EPISODE LINKS
* Transcript: https://www.dwarkeshpatel.com/dario-amodei
* Apple Podcasts: https://apple.co/3rZOzPA
* Spotify: https://spoti.fi/3QwMXXU
* Follow me on Twitter: https://twitter.com/dwarkesh_sp

I’m running an experiment on this episode. I’m not doing an ad. Instead, I’m just going to ask you to pay for whatever value you feel you personally got out of this conversation. Pay here: https://bit.ly/3ONINtp

TIMESTAMPS
00:00:00 - Introduction
00:01:00 - Scaling
00:15:46 - Language
00:22:58 - Economic Usefulness
00:38:05 - Bioterrorism
00:43:35 - Cybersecurity
00:47:19 - Alignment & mechanistic interpretability
00:57:43 - Does alignment research require scale?
01:05:30 - Misuse vs misalignment
01:09:06 - What if AI goes well?
01:11:05 - China
01:15:11 - How to think about alignment
01:31:31 - Is modern security good enough?
01:36:09 - Inefficiencies in training
01:45:53 - Anthropic’s Long Term Benefit Trust
01:51:18 - Is Claude conscious?
01:56:14 - Keeping a low profile

Dario Amodei (guest), Dwarkesh Patel (host)
Aug 8, 2023 · 1h 58m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00 - 1:00

    Introduction

    1. DA

      ... a generally well-educated human.

    2. DP

      Yeah.

    3. DA

      That could happen in, you know, two or three years. (screen whooshes)

    4. DP

      What does that imply for Anthropic when, in two-

    5. DA

      Yes.

    6. DP

      ... to three years, these leviathans are doing-

    7. DA

      Yes.

    8. DP

      ... like $10 billion training runs?

    9. DA

      Yes. The models, they just wanna learn, and it was a bit like a Zen koan. I listened to this and, and I became enlightened.

    10. DP

      (laughs)

    11. DA

      (laughs) The compute doesn't flow, like the spice doesn't flow. Or it, it, it's like-

    12. DP

      (laughs)

    13. DA

      ... you can't, like...

    14. DP

      (laughs)

    15. DA

      Like, the, the blob has to be unencumbered, right?

    16. DP

      (laughs)

    17. DA

      The big acceleration that, that happened late last year and, and beginning of this year, we didn't cause that. And honestly, I think if you look at the reaction of Google that, that might be 10 times more important than, than anything else. (screen whooshes) There was a running joke, the way building AGI would look like is, you know, there would be a data center next to a nuclear power plant, next to a bunker.

    18. DP

      But now it's 2030, what happens next? What, what are we doing with a superhuman god?

    19. DA

      Yeah. Yeah.

    20. DP

      Okay, today I have the pleasure of speaking with Dario Amodei, who is the CEO of Anthropic, and I'm really excited about this one. Dario, thank you so much for coming on the podcast.

    21. DA

      Thanks

  2. 1:00 - 15:46

    Scaling

    1. DA

      for having me.

    2. DP

      First question, you have been one of the very few people who has seen scaling coming for years, more than five years. I don't know how long it's been, but as somebody who's seen it coming, what is fundamentally the explanation for why scaling works? Why is the universe organized such that if you throw big blobs of compute at a wide enough distribution of data, the thing becomes intelligent?

    3. DA

      I think the truth is that we still don't know. I think it's almost entirely an empirical fact.

    4. DP

      Mm-hmm.

    5. DA

      Um, you know, I think it's a fact that you could kind of sense from the data and from a bunch of different places, um, but I think we don't still have a satisfying explanation for it. If I were to try to make one, but I'm just... I don't know, I'm just kind of waving my hands when I say this. You know, there, there, there's this, there's these ideas in physics around, like, long tail or power law of, like, correlations or effects.

    6. DP

      Mm-hmm.

    7. DA

      And so, like, when a bunch of stuff happens, right, when you have a bunch of, like, features, you get a lot of the data in, like, kind of the early, you know, the, the, the, the fat part of the distribution before the tails. Um, you know, for language this would be things like, oh, I figured out there are parts of speech and nouns follow verbs, and then there are these more and more and more and more subtle correlations. Um, and so it, it kind of makes sense why there would be this, you know, every log or order of magnitude that you add-

    8. DP

      Mm-hmm.

    9. DA

      ... you kind of capture more of the distribution. What I, what's not clear at all is why does it scale so smoothly with parameters?

    10. DP

      Mm-hmm.

    11. DA

      Why does it scale so smoothly with the amount of data? Why, y- you can think up some explanations of why it's linear, like the parameters are like a bucket and so the data's like water and so size of the bucket is proportional to size of the water, but, like, why does it lead to all these, this very smooth scaling? I think we still don't know. There's all these explanations. Our chief scientist, Gar- Jared Kaplan, did some stuff on, like, fractal manifold dimension that, like, you can use to explain it. So there's, there's all kinds of ideas, but I feel like we just don't really know for sure.

    12. DP

      And by the way, for, for the audience who's trying to follow along, by scaling, we're referring to the fact that you can very predictably see how you go from GPT-3 to GPT-4, or in th- this case, Claude I to Claude II, that the loss in terms of whether it can predict the next token scales very smoothly. So okay, we, we don't know why it's happening, but can you at least predict i- i- empirically, here is the loss at which this ability will emerge, here is the place where this circuit will emerge? Is, is that at all predictable or are you just looking at the loss number?

    13. DA

      So that, that is much less predictable.

    14. DP

      Mm-hmm.

    15. DA

      What's predictable is this statistical average, this loss, this entropy. It's super predictable. It's like, you know, predictable to, like, sometimes even to several significant figures, which you don't see outside of physics, right? You don't expect to see it in this messy empirical field. Um, but actually specific abilities are very hard to predict. So, you know, back when I was working on GPT-2 and GPT-3, like, when does arithmetic come in place? When do models learn to code? Sometimes it's very, it's very abrupt.

    16. DP

      Mm-hmm.

    17. DA

      Um, you know, it's kind of like you can predict statistical averages of the weather but the weather on one particular day is very, you know, very, very hard to predict.

    18. DP

      So, uh, uh, dumb it down for me. I don't understand manifolds, but mechanistically it doesn't know addition yet, now it knows addition.

    19. DA

      Yeah.

    20. DP

      What has happened?

    21. DA

      Uh, this is another question that we don't know the answer to. I mean, we're trying to answer this with things like mechanistic interpretability, but, you know, I'm not sure. I mean, you can think about these things about, like, circuits snapping into place, although there is some evidence that when you look at the models being able to add things that, you know, like if you look at its chance of getting the right answer, that shoots up all of a sudden. But if you look at, okay, what's the probability of the right answer? You'll see it climb from, like, one in a million to one in a hundred thousand to one in a thousand long before it, it actually gets the right answer. And so there's some cont- i- in many of these cases at least, I don't know if in all of them, there's some continuous pro- process going on behind the scenes. I don't understand it at all.

    22. DP

      Uh, does that imply that the circuit or the process for doing addition was pre-existing and it just got increased in salience?

    23. DA

      Yeah, I, I don't know if, like, there's this circuit that's weak and getting stronger. I don't know if it's something that works but not very well. Like, I, I think we don't know and these are some of the questions we're trying to answer with mechanistic interpretability.

    24. DP

      Are there abilities that won't emerge with scale?

    25. DA

      So I definitely think that, again, like, things like alignment and values are not guaranteed to emerge with scale, right? It's, it's kind of like, you know, one way to think about it is you, you train the model and it is... basically, it's like predicting the world, it's understanding the world. Its, its job is facts, not values, right? It's trying to predict what comes next, but there's, there's just... there's free variables here, where it's like, it... what should you do, what should you think, what should you value? Those, you know, like the, there, there just, there aren't the bits for that. There's just like, "Well, if I started with this, I should finish with this. If I started with this other thing, I should finish with this other thing." Um, and so I think that's not going to emerge.

    26. DP

      Mm. I wanna talk about alignment in a second, but o- on scaling, if it turns out that scaling plateaus before we reach human level intelligence, looking back on it, what would s- be your explanation? What do you think w- is likely to be the case if that turns out to be the outcome?

    27. DA

      Yeah. Um, so I guess I would distinguish some problem with the fundamental theory from some practical issue. So, uh, one practical issue we could have is we could run out of data. For various reasons, I think that's not going to happen, but, uh, you know, uh, if you look at it very, very naively, we're not that far from running out of data. And so it's like we just don't have the data to continue the, to continue the scaling curves. I think, uh, you know, another way it could happen is, like, oh, we just, we just used up our, all of our compute that was available and that, that wasn't enough, and then progress is slow after that. I wouldn't bet on either of those things happening, but they, they could. I, I think from a, from a fundamental perspective, I... personally, I think it's very unlikely that the scaling laws will just stop. If they do, another reason, again, this isn't fully fundamental, could just be we don't have quite the right architecture. Like, if we tried to do it with an LSTM or an RNN, the slope would be different. It still might be that we get there, but I think there are some things that are just very hard to represent when you don't have this ability to attend far in the past that transformers have. If, somehow, and I don't know how we would know this, it kind of wasn't about the architecture and we just hit a wall, I think I'd be very surprised by that. I think we're already at the point where the things the models can't do don't seem to me to be different in kind from the things they can do.

    28. DP

      Mm-hmm.

    29. DA

      Um, and it just... You know, you could have made a case a few years ago that it was like, they can't reason, they can't program. Like, you could have, you could have drawn boundaries and said, "Well, maybe you'll hit a wall." I didn't think that. I didn't think we would hit a wall.

    30. DP

      Right.
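
A note for readers on the regularity Dario is describing: the empirical scaling laws (the Jared Kaplan work he mentions) are usually written as power laws in parameter count N and dataset size D, with the constants and exponents fit from experiments rather than derived from theory, which is exactly the "we still don't know why" point above:

    L(N) \approx (N_c / N)^{\alpha_N},    L(D) \approx (D_c / D)^{\alpha_D}

Here L is the cross-entropy loss on held-out text. Because the fitted exponents are small, each order of magnitude of scale buys a modest but very predictable drop in loss, which is why the statistical average can be forecast "to several significant figures" even though individual abilities cannot.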
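
Dario's point that the probability of the right answer climbs smoothly (one in a million, then one in a hundred thousand, then one in a thousand) long before the model "gets" a task can be made concrete with a toy calculation. This is purely illustrative Python with made-up numbers, not real model measurements; the only claim is that a smooth log-probability curve looks like a sudden jump when you only track a pass/fail metric.

    # Toy illustration (made-up numbers): a smoothly improving probability of the
    # correct answer looks like an abrupt "grok" if you only track pass/fail.
    import math

    for compute in [10**k for k in range(1, 9)]:        # hypothetical scale axis
        # assumed smooth trend: log10 P(correct) rises linearly with log10(compute)
        log_p = -7.0 + 0.9 * math.log10(compute)
        p = min(10**log_p, 0.99)
        solves_task = p > 0.5                            # binary metric, e.g. exact match
        print(f"compute={compute:>12,}  P(correct)={p:.2e}  solves task: {solves_task}")

Running this, the continuous metric improves steadily at every step, while the binary "solves task" column stays False until the very end and then flips all at once.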

  3. 15:46 - 22:58

    Language

    1. DA

      that particular direction.

    2. DP

      When did it become obvious to you, uh, that language is the means to just feed a bunch of data into these things? Or was it just that you ran out of other things, like robotics, where there's not enough data?

    3. DA

      Yeah.

    4. DP

      This other thing, there's not enough data.

    5. DA

      Yeah. I mean, I think this whole idea of, like, the next word prediction that you could do self-supervised learning.

    6. DP

      Yeah.

    7. DA

      You know, that together with the idea that it's like, wow, for predicting the next word there's so much richness in structure there, right? You know, it might say two plus two equals and you have to know the answer is four. And you know, it might be telling a story about a character and then basically it's- it's posing to the model, you know, the e- the equivalent of these developmental tests that get posed to children. You know, Mary walks into the room and, you know, puts an item in the air and then, you know, Chuck walks into the room and removes the item and Mary- Mary doesn't see it. What does Mary think hap- You know, so like, so the models are gonna have to- to get this right in the service of predicting the next word, they're gonna have to solve all, you know, solve all these theory of mind problems, solve all these math problems. And so I di- you know, I- I, my thinking was just, well, you know, you scale it up as much as you can. You, you, you, you, you know, there's- there's kind of no limit to it. And I think, I kind of had abstractly that view, but the thing of course that like really solidified it and convinced me was the work that Alec Radford did on GPT-1.

    8. DP

      Mm-hmm.

    9. DA

      Um, which was not only could you get this- this language model that could predict things very well, but also you could fine-tune it. You needed to fine-tune it in those days to do all these other tasks. And so I was like, wow, you know, it, it, this isn't just some narrow thing where you get the language model right. It's sort of halfway to everywhere, right? It's like, you know, you get the language model right and then with a little move in this direction it can, you know, it can solve this, this, you know, logical dereference test or whatever. And you know, with this, this other thing, you know, it can, it can solve translation or something. And then you're like, wow, I think there's- there's really something to do it and, and of course we can, we can really scale it. (laughs)

    10. DP

      One- one thing that's confusing or that would have been hard to see: if you told me in 2018... we'll have models in 2023, like, la two that can write theorems in the style of Shakespeare or whatever theorem you want, uh, you want. They can ace standardized tests with open-ended questions, you know. Um, just all kinds of really impressive things. You would have said at that time, I would have said, "Oh, you have AGI. You clearly have something that is a human level intelligence." Whereas, while these things are impressive, it clearly seems we're not at human level, at least in the current generation and potentially for generations to come. What explains this discrepancy between super impressive performance in these benchmarks and in just, like, the things you could describe-

    11. DA

      Yeah.

    12. DP

      ... versus, yeah, generally?

    13. DA

      So that, that was one area where actually I was not prescient and I was surprised as well.

    14. DP

      Yeah.

    15. DA

      Um, so when I first looked at GPT-3 and, you know, more so the kind of things that we built in the early days at, at Anthropic, my, my general sense was, I, I, you know, I looked at these and I'm like, "It seems like they, they really grasped the es- essence of language."

    16. DP

      Yeah.

    17. DA

      I'm not sure how much we scale them up. Like, maybe we, maybe what's, what's more needed from here is, like, RL and all, and kinda, and kinda all the other stuff. Like, we might be kind of near the... You know, I thought in 2020, like, we can scale this a bunch more, but I wonder if it's more efficient to scale it more or to start adding on these other objectives, like, like RL. I thought maybe if you do as much RL as, you know, as, as you've done pre-training for a, for a, you know, 2020-style model that that's, that's the way to go and scaling it up will keep working, but, you know, is that, is that really the best path? And I, I think it, I don't know. It just keeps going. Like, I thought it had understood a lot of the essence of language, but then, you know, there's, there's kind of, there's kind of further to go. Um, and, and so, I don't know. Stepping back from it, like, one of the reasons why I'm sort of very empiricist about, about AI, about safety, about organizations is that you often get surprised, right? I, you know, I feel like I've been right about some things, but I've still, you know, with these theoretical pictures ahead, been wrong about most things. Being right about 10% of the stuff is, you know, sets you head and shoulders ab- above (laughs) , um, above, above many people. You know, if you look back to, I can't remember who it was, kind of, you know, made these diagrams that are like, you know, here's, here's the village i- idiot. Here's Einstein. Here's the scale of intelligence, right?

    18. DP

      Right.

    19. DA

      And the vi- village idiot and Einstein are, like, very close to each other. Like, that, maybe that's still true in some abstract sense or something, but it's, it's not really what we're seeing, is it?

    20. DP

      No.

    21. DA

      We're seeing, like, that it seems like the human range is pretty broad and doesn't... We don't hit the human range in the same place or at the same time for different tasks, right? Like, you know, like, write, write a sonnet, you know, in the style of Cormac McCarthy or something. Like, I don't know, I'm not very creative so I couldn't do that. But, like, you know, that's, that's a pretty high level human skill, right? Um, and even the model is starting to get good at stuff of, you know, like, constrained writing. You know, there's this, like... Write a, you know, write a page without using the letter E or something like-

    22. DP

      (laughs)

    23. DA

      ... write a page about X without using the letter E. Like, I think the models might be, like, superhuman or close to superhuman at that.

    24. DP

      Mm-hmm.

    25. DA

      Um, but when it comes to, you know, uh, yeah, I don't know, prove relatively simple mathematical theorems, like, they're, they're just starting to do the beginning of it.

    26. DP

      Mm-hmm.

    27. DA

      They make really dumb mistakes sometimes and they, they really lack any kind of broad, like, you know, correcting your errors or doing some extended task. And so, I don't know. It turns out that intelligence isn't, isn't a spectrum. There are a bunch of different areas of domain expertise. There are a bunch of different, like, kinds of skills. Like, memory is different. I mean, it's all, it's all formed in the blob. (laughs)

    28. DP

      (laughs)

    29. DA

      It's not... It's all formed in the blob. It's not complicated. But to the extent it even is on a spectrum, the spectrum is also wide. If you asked me 10 years ago, that's not what I would have expected at all, but, uh, I think that's very much the way it's turned out.

    30. DP

      Oh, man. I, I have so many questions just as follow-up on that. One is, do you expect that given the distribution of training that these models get from massive amounts of internet data versus what humans got from evolution, that the repertoire of skills that it elicits will be just barely overlapping, will be like concentric circles? How, how do you think about... W- do, do those matter or is it just-
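
The "next word prediction / self-supervised learning" idea Dario credits in this section is simple enough to show in a few lines. This is a minimal sketch, assuming PyTorch and a toy stand-in for a real transformer stack; the point is only that the label for every position is the next token of the same text, so no human annotation is needed, and the resulting cross-entropy loss is the quantity the scaling laws track so smoothly.

    # Minimal sketch of the self-supervised next-token objective (toy model, random
    # "text"); a real system replaces embed + lm_head with a deep transformer.
    import torch
    import torch.nn as nn

    vocab_size, d_model = 100, 32
    embed = nn.Embedding(vocab_size, d_model)
    lm_head = nn.Linear(d_model, vocab_size)

    tokens = torch.randint(0, vocab_size, (1, 16))       # pretend: "two plus two equals four ..."
    inputs, targets = tokens[:, :-1], tokens[:, 1:]      # the label at position t is token t+1

    logits = lm_head(embed(inputs))                      # (batch, seq-1, vocab)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size), targets.reshape(-1)
    )
    print(loss.item())                                   # the loss the scaling curves measure

The GPT-1-era fine-tuning Dario mentions amounts to adding or swapping a small head on top of the same pretrained stack and continuing training on a labeled task, which is why getting the language model right is "halfway to everywhere."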

  4. 22:58 - 38:05

    Economic Usefulness

    1. DA

      learned that (laughs) .

    2. DP

      Right. How likely do you think it is that these models will be superhuman for many years at economically valuable tasks while they are still below humans at many other relevant tasks, in a way that prevents, like, an intelligence explosion or something?

    3. DA

      I think this kind of stuff is, like, really hard to know, um, so I'll give, I'll give that caveat that like, you know... Again, like, the basic scaling laws you can kind of predict and then, like, this more granular stuff, which we really want to know to know how this all, all is gonna go, is, is much harder to know. But my guess would be the scaling laws are gonna continue. You know, again, subject to, you know, do people slow down for safety or for regulatory reasons.

    4. DP

      Yeah.

    5. DA

      Um, but, you know, let's just, let's just-... put all that aside and say, like, we have the economic capability to keep scaling. If we did that, what would happen? And I, I think my view is we're gonna keep getting better across the board and I don't see any area where the models are like super, super weak or not starting to make progress. Like, that used to be true of, like, math and programming, but I think over the last six months, you know, the, the 2023 generation of models compared to the 2022 generation have started to learn that. There may be more subtle things we don't know and so I, I kind of suspect, even if it isn't quite even, that the rising tide will lift all the boats.

    6. DP

      Does that include the thing you were mentioning earlier where if there's an extended task, it kind of loses its train of thought? Um...

    7. DA

      Yeah. Yeah.

    8. DP

      Or its ability to just, like, execute a series of steps?

    9. DA

      So, so I think that, that, that's gonna depend on things like RL training to have the model do longer horizon tasks. I don't expect that to require a substantial amount of additional compute.

    10. DP

      Mm-hmm.

    11. DA

      Um, I think that, um, that, that was probably an artifact of, uh, yeah, kind of thinking about RL in the wrong way and underestimating how much the model had learned on its own. In terms of, you know, are we gonna be superhuman in some areas and not others? I think it's complicated. I could imagine that we won't be superhuman in some areas because, for example, they involve, like, embodiment in the physical world and then it's like, what happens? Like, do the AIs help us train faster AIs and those faster AIs wrap around and solve that?

    12. DP

      Mm-hmm.

    13. DA

      Do you not need the physical world? It depends what you mean. Are we worried about an alignment disaster? Are we worried about misuse, like making weapons of mass destruction? Are we worried about the AI t- t- you know, or, you know, the AI taking over research from humans? Are we worried about it reaching some threshold of economic productivity where it can do what the average hu- I- these different thresholds, I think, have, have different answers. Although, I suspect they will all come within a few years of e-

    14. DP

      Let, let me ask about those thresholds. So if Claude was an employee at Anthropic-

    15. DA

      Yeah.

    16. DP

      ... what salary would it be worth? What, is it, like, meaningfully speeding up AI progress?

    17. DA

      Yeah. It feels to me like an intern in most areas. Um, but then some specific areas where it's better than that.

    18. DP

      Mm-hmm.

    19. DA

      Again, I think one thing that's, makes the comparison hard is, like, the form factor is kind of, like, not the same as a human, right? Like, a hu- like, you know, if you were to behave like one of these chatbots, like, we wouldn't really... I mean, I guess we could have this conversation. It's like, but, you know, they're, they're not really... They're more designed to answer single or a few questions, right? Um, and, and like, you know, they don't have a, the concept of having a long life of prior experience, right? We're talking here about, you know, things that, that I've experienced in the past, right? And chatbots don't, don't have that. And so there's, there's all kinds of stuff missing and so it's hard to make a comparison, but y- I don't know. It, it, they, they feel like interns in some areas and kind of then they have areas where they spike and are really savants where...

    20. DP

      Mm-hmm.

    21. DA

      They may be better than (laughs) they may be better than anyone here.

    22. DP

      But does the overall picture of something like an intelligence explosion... You know, my, my former guest is Carl Shulman and he has-

    23. DA

      Yeah. Yeah.

    24. DP

      ... this, like, very detailed model of an intelligence-

    25. DA

      Yeah.

    26. DP

      Does that, as somebody who would actually, like, see that happening, does that make sense to you as th- they go from interns to entry level software engineers, those entry level software engineers increase your productivity?

    27. DA

      Yeah. I, I think, I think the idea that the, the AI systems become more productive and first they speed up the productivity of humans-

    28. DP

      Yeah.

    29. DA

      ... then they, you know, kind of equal the productivity of humans and, and, and you know, and then they're in some meaningful sense the main contributor to scientific progress, that that happens at some point. I, I think that, that basic logic seems likely to me. Although I, I have a suspicion that when we actually go into the details, it's gonna be kind of like weird and different than we expect, that all the detailed models are kind of... You know? We're thinking about the wrong things or we're right about one thing and then are wrong about 10 other things and, and so I, I don't know. I think we might end up in like a weirder world than we expect.

    30. DP

      Mm. When you add all this together, like your estimate of when we get something kind of human level-

  5. 38:05 - 43:35

    Bioterrorism

    1. DA

      able to put these things together.

    2. DP

      On that point, last week in your Senate testimony, you said-

    3. DA

      Yes.

    4. DP

      ... that these models are two to three years away from potentially enabling large-scale bioterrorism attacks, or something like that.

    5. DA

      Yes.

    6. DP

      Can you make that more concrete, without obviously giving the kind of information-

    7. DA

      Yes.

    8. DP

      ... that would... (laughs) But is it, like, one-shotting how to weaponize something? Is it-

    9. DA

      Yeah, yeah.

    10. DP

      Or do you have to fine-tune an open-source model? Like, what would that actually look like?

    11. DA

      Yeah. I think it'd be good to clarify this, because we did a blog post on the Senate testimony, and, like, I think various people kinda didn't understand the point or didn't-

    12. DP

      Yeah.

    13. DA

      ... didn't understand what we'd done. So, I think today, and, you know, of course, in our models, we try and, you know, prevent this, but there's always jailbreaks. You can ask the models all kinds of things about biology and get them to say all kinds of scary things.

    14. DP

      Yeah.

    15. DA

      Uh, but often those scary things are things that you could Google, and I'm, I'm therefore not particularly worried about that. Um, I think it's actually an impediment to seeing the real danger, where, you know, someone just says, "Oh, I asked this model, like, you know, for the smallpo-, you know, for, to tell me some things about smallpox," and it will. That, that is actually, you know, kind of not what I'm worried about. So we spent about six months working with some of... basically some of the folks who are the most expert in the world on how do, how do biological attacks happen, um, you know, what, what would you need to conduct such an attack, and how do we defend against such an attack? They worked very intensively on just the entire workflow of, if I were trying to do a bad thing, it's not one shot, it's a long process, there are many steps to it. Um, it's not just like I asked the model for this one page of information. And again, without going into any detail, the thing I said in the te- the Senate testimony is, like, there are some steps where you can just get information on Google. There are some steps that are what I'd call missing. They're scattered across a bunch of textbooks, or they're not in any textbook, they're kind of implicit knowledge, and they're not really, like, they're not explicit knowledge. They're, they're, they're more like, "I have to do this lab protocol, and, like, what if I get it wrong? Oh, if this happens, then, then my temperature was too low. If that happened, I needed to add more of this particular reagent."

    16. DP

      Ah, yeah.

    17. DA

      What we found is that, for the most part, those missing, those key missing pieces, the models can't do them yet, but we found that sometimes they can. Um, and when they can, sometimes they still hallucinate, which is a thing that's-

    18. DP

      Yeah.

    19. DA

      ... that's kind of keeping us safe. But we saw enough signs of the models doing, doing those, those key things well. And if we look at, you know, state-of-the-art models and go backwards to previous models, we look at the trend, it shows every sign of, two or three years from now, we're gonna have a real problem.

    20. DP

      Yeah, especially the thing you mentioned, the, on the log scale, you go from, like, one in 100 times it gets it right to one in 10 to-

    21. DA

      E- exactly. So, you know, I've seen many of these, like, groks in my life, right? I was there when I, I watched when GPT-3 learned to do arithmetic, when GPT-2 learned to do regression a little bit above chance, when, you know, when we got, you know, with Claude and we got better on, like, you know, all, all these, all these tests of helpful, honest, harmless. I've seen a lot of groks. This is, this is unfortunately not one that I'm excited about, but I believe it's happening.

    22. DP

      So somebody might say, listen, you were a coauthor on this post that OpenAI released about GPT-2 where they said-

    23. DA

      Yes.

    24. DP

      ... you know, "We're not gonna release the weights or the details here-

    25. DA

      Yeah.

    26. DP

      ... because we're worried that this model will be used for something, you know, bad." And looking back on it, now it's laughable to think that GPT-2 could have done anything bad. Are we just, like, way too worried, this is a concern that doesn't make sense for...

    27. DA

      So it is interesting. Um, it might be worth looking back at the actual text of that post.

    28. DP

      Mm-hmm.

    29. DA

      Um, so I don't remember it exactly, but it, it should, it, you know, it's, it's, it's still up on the internet. It says something like, you know, "We're choosing not to release the weights, uh, because of concerns about misuse," but it also said, "This is an experiment. We're not sure if this is necessary or the right thing to do at this time, but we'd like to establish a norm of thinking carefully about these things." Um, you know, you could think of it a little like the, you know, the, the Asilomar Conference in the, in the 1970s, right? Where it's like... you know, they were just figuring out recombinant DNA. You know, there, it was not necessarily the case that someone could do something really bad with recombinant DNA, it's just the possibilities were starting to become clear. Those words, at least, were the right attitude. Now, I think there's a separate thing that, like, you know, people don't just judge the post, they judge the organization. Is this an organization that, you know, is... Produces a lot of hype or that has credibility or something like that? And so I think that had some effect on it. I guess you could also ask, like, is it inevitable that people would just interpret it as, like, eh, uh, uh, uh, y- you know, you can't get across any message more complicated than this thing right here is dangerous. Um, so you can argue about those but I think the, the basic thing that was in my head and the head, the head of others who, who were, who were involved in that and, you know, I think what, what is, what is evident in the post is, like, we actually don't know. We have pretty wide error bars on what's dangerous and what's not, so we should, you know, like, we, we want to establish a norm of being careful. I, I think, by the way, we have enormously more evidence, we've seen enormously more of these groks now and so we're well-calibrated, but there's still uncertainty, right? In all of these statements I said, like, in two or three years, we might be there, right? There's a substantial risk of it and we don't wanna take that risk. But, you know, I wouldn't say it's,

  6. 43:35 - 47:19

    Cybersecurity

    1. DA

      it's 100%, it could be 50/50.

    2. DP

      Okay. Let's talk about cybersecurity which, in addition-

    3. DA

      Yes.

    4. DP

      ... to bio-risk, is another thing Anthropic has been emphasizing. How have you prevented the Claude micro-arch- architecture from leaking? Because as you know, your competitors have been less successful at, uh, this kind of security.

    5. DA

      Can't comment on anyone else's security. Don't know what's going on in there. A thing that we have done is, uh, you know, so, so there are, there are these, these architectural innovations, right, that make training more efficient. We call them compute multipliers because they're the equivalent of, you know, improving, improving, uh, eh, you know, uh, uh, uh, uh... They're like having more compute. Our compute multipliers, again, I don't wanna say too much about it because it could allow an adversary to counteract our, our, our measures, but we limit the number of people who are aware of, o- of a given compute multiplier to those who need to know about it. Um, and so there's, there's a very small number of people who could leak all of these secrets. There's a larger number of people who could leak one of them. Um, but you know, but this is the standard compartmentalization strategy that's used in the intelligence community or, you know, resistant cells (laughs) or, or whatever. Um, so you know, w- we've over the last, uh, last few months, we've implemented these measures. So, you know, I don't wanna jinx anything by saying, "Oh, this could never happen to us." Um, but I think, I think it would be harder for it to happen. Um, I don't wanna go into any more detail and, and you know, by the way, I'd encourage all the other companies to do this as well. It's a- as much as, like, c- competitors' architectures leaking-

    6. DP

      Yeah.

    7. DA

      ... is, is narrowly helpful to Anthropic, it's not good for anyone in the long run, right?

    8. DP

      Sure.

    9. DA

      Um, so security around this stuff is really important.

    10. DP

      Even with all the security you have, could you, with your current security prevent a dedicated state level actor from getting the Claude 2 weights?

    11. DA

      It depends how dedicated, is what, is what I would say. Our, our head of security who, who was, you know, used to work on security for Chrome, which, you know-

    12. DP

      Yeah. (laughs)

    13. DA

      ... very widely used and attacked application, he likes to think about it in terms of how much would it cost to attack Anthropic successfully?

    14. DP

      Yeah.

    15. DA

      I, again, I don't wanna go into super detail of how much I think it will cost to attack and it's just kind of inviting people, but, like, one of our goals is that it costs more to attack Anthropic than, than it costs to just train your own model.

    16. DP

      Mm-hmm.

    17. DA

      Um, uh, which doesn't guarantee things because, you know, of course you need the talent as well, so you might still... But, you know, but it, but attacks have, have, have risks, d- diplomatic costs, uh, and you know, and, and, and they use up the very, the very sparse resources that nation state actors might have in order to, to do, to do the attacks.

    18. DP

      Mm-hmm.

    19. DA

      Um, so we're not there yet, by the way. But I, but I think, I think we're to a very high standard compared to the size of company that we are. Like, I think if you look at security for most 150-person companies, like, I think there's, there's just no comparison. Um, but you know, could we, could we resist if, if it was a state actor's top priority to steal our model weights? No. They would, they would succeed.

    20. DP

      How long does that stay true? Because at some point the value keeps increasing and increasing and another part of this question is that, what kind of a secret is how to train Claude 3 or Claude 2? Is it, you know, with nuclear, uh, weapons, for example, we had lots of spies. You just take a blueprint across and-

    21. DA

      Yes.

    22. DP

      ... that's, you know, the implosion device and that's what you need. Here, is it just, is it more tacit, like the thing you were talking about with biology, you need to know how these reagents work? Is it just like, you got the blueprint, you got the micro-architecture and the hyper-parameters, you're good to go?

    23. DA

      I mean there are, there are some things that are like, you know, a one-line equation and there are other things that are more complicated.

    24. DP

      Yeah.

    25. DA

      Um, and I think compartmentalization is the, the best way to do it. Just limit the number of people who know about something. If you're a thousand-person company and everyone knows every secret, like, one, I guarantee you have some, you have a leaker, and two, I guarantee you have a spy. Like, a
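
The compartmentalization argument Dario makes here (very few people can leak everything, more people can leak one thing) is just arithmetic. The sketch below uses made-up numbers that are in no way Anthropic's actual figures; it only illustrates why need-to-know compartments shrink the blast radius of a single insider.

    # Toy blast-radius comparison: full sharing vs. need-to-know compartments
    # (all figures are invented for illustration).
    staff = 1000
    people_per_secret = 10          # need-to-know group for one "compute multiplier"
    people_who_see_everything = 3   # unavoidable small set with full visibility

    # If a single random insider leaks whatever they happen to know:
    p_given_secret_full_sharing = 1.0                              # everyone knows everything
    p_given_secret_compartmented = people_per_secret / staff       # 0.01
    p_every_secret_compartmented = people_who_see_everything / staff  # 0.003

    print(p_given_secret_compartmented, p_every_secret_compartmented)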

  7. 47:19 - 57:43

    Alignment & mechanistic interpretability

    1. DA

      literal spy.

    2. DP

      Okay. Let's talk about alignment and let's talk about mechanistic interpretability which is the branch-

    3. DA

      Yes.

    4. DP

      ... of which you, um, you guys specialize in. While you're answering this question, you might want to explain what mechanistic interpretability is but just, um, the broader question is, mechanistically, what is alignment? Is it that you're locking in the model into a benevolent character? Are you disabling deceptive circuits and procedures? Like, what concretely is happening-

    5. DA

      Yeah. I-

    6. DP

      ... when you align a model?

    7. DA

      I think as with most things, you know, when we actually train a model to be aligned, we don't know what happens inside the model, right? There are different ways of training it to be aligned but I think we don't really know what happens. I mean, I think for some of the current methods, I think all the current methods that involve some kind of fine tuning of course have the property that the underlying knowledge and abilities that we might be worried about-... don't, don't disappear, it's just, you know, the- the model is just taught not to output them. I don't know if that's a fatal flaw or if, you know, or if that's just the way things have to be. I don't know what's going on inside mechanistically and I think that's the whole point of mechanistic interpretability, to really understand what's going on inside the models at the level of individual circuits.

    8. DP

      Eventually when it's solved, what does the solution look like? Where, what is then the case where, if called for, you do the mechanistic interpretability thing and now you're like, "I'm satisfied, it's aligned"? What is it that you've seen?

    9. DA

      Yeah. So, I, I think, I think we don't know that yet. I think we don't know enough to, to know that yet. I mean, I can, I can give you a sketch for like what the process looks like as opposed to what the final result looks like. Um, so I think verifiability is a lot of the challenge here, right? We have all these methods that purport to align AI systems and, and do succeed at doing so for today's tasks. But then the, the question is always, if you had a more powerful model or if you had a model in a different situation-

    10. DP

      Yeah.

    11. DA

      ... would it, would it, would it be aligned? And so I think this problem would be much easier if you had an oracle that could just scan a model and say like, "Okay, I know this model is aligned. I know what it'll do in every situation." Um, then the problem would be much easier and I think the closest thing we have to that is something like mechanistic interpretability. It's not anywhere near up to the task yet, but I guess I would say I think of it as almost like an extended training set and an extended test set, right?

    12. DP

      Mm-hmm.

    13. DA

      Everything we're doing, all the alignment methods we're doing are the training set, right? You, you know, you can, you can run tests on them, but will it really work out of distribution, will it really work in another situation? Mechanistic interpretability's the only thing that even in principle, and we're, we're nowhere near there yet, but even in principle is the thing where it's like, it's more like an x-ray of the model than a modification to the model, right? It's more like an assessment than an intervention.

    14. DP

      Mm-hmm.

    15. DA

      And so somehow we need to get into a dynamic where we have an extended training set, which is all these alignment methods, and an extended test set, which is kind of like you, you x-ray the model and say like, "Okay, what worked and what didn't?"

    16. DP

      Mm-hmm.

    17. DA

      In, in a way that goes beyond just the empirical tests that you, that you, that you've run, right? Um, where you're saying, "What is the, what it, what is the model going to do in these situations? What is it within its capabilities to do?" Instead of, "What did it do phenomenologically?" And of course, we have to be careful about that, right? One of the things that I think is very important is we should never train for interpretability because I think that is, that's taking away that advantage, right? You even have the problem, you know, similar to like validation versus test set where like if you look at the x-ray too many times, you can interfere, but I think that's a, a much weaker optim- we should worry about that, but that's a, that's a much weaker process. It's not automated optimization. We should just make sure as with validation in test sets that we don't look at the validation set too many times before running the test set. But you know, that's, uh, again, that's, that's more of a, that's, that's manual pressure rather than automated pressure. And so some solution where it's like we have some dynamic between the training and test set, where it's like we're, we're trying things out and we, we, we really figure out if they work via way of testing them that the model isn't optimizing against, some, some orthogonal way. Like if, if, if I, if I think of... And I think we're never gonna have a guarantee, but some process where we, we do those things together. Again, not in a stupid way, there's lots of stupid ways to do this where you fool yourself.

    18. DP

      Yeah, yeah.

    19. DA

      But like some way to put extended training for alignment ability with exte- extended testing for alignment ability together in a way that actually works.

    20. DP

      Hmm. I, I still don't feel like I understand the intuition that, l- l- why you think this is likely to work or this is promising to pursue, and let me ask the question in a cer- more specific way, and excuse the tortured analogy.

    21. DA

      Yeah.

    22. DP

      But listen, if you're, you're an economist and you want to understand the economy-

    23. DA

      Yeah.

    24. DP

      ... so you send a whole bunch of microeconomists out there-

    25. DA

      Yeah.

    26. DP

      ... and one of them studies how the restaurant business works, one of them studies how the tourism business works, you know, one of them studies how banking works and at the end, they all come together and you still don't know whether there's gonna be a recession in five years or not. Why is this not like that where you have an understanding of, we understand how induction heads work in a two-layer transformer, we understand, you know, modular arithmetic? How does this add up to, does this model want to kill us? Like, what does this model fundamentally want?

    27. DA

      Yes. A few things on that. I mean, I think that's like the right set of questions to ask. I think what we're hoping for in the end is not that we'll understand every detail, but again I would give like the x-ray or the MRI analogy that like we can be in a position where we can look at the broad features of the model and say like, "Is this a model whose internal state and plans are very different from what it externally represents itself to do?" Right? Is this a model where we're uncomfortable that, you know, far too much of its computational power is, uh, you know, is, is devoted to doing what look like fairly destructive and manipulative things?

    28. DP

      Mm-hmm.

    29. DA

      Again, we don't know for sure whether that's possible, but I, I think some at least positive signs that it might be possible, again, the model is not intentionally hiding from you, right? It might turn out that the training process hides it from you and I, you know, I can think of cases where if the model's really super intelligent it like thinks in a way so that it like affects its own cognition. I suspect w- we should think about that. We should consider everything. I, I, I, I suspect that it may roughly work to think of the model as, you know, if it's trained in, in, in, in the normal way just a- you know, at, at the, at the just getting to just above human level, it, it may be a reason we should check. It may be a reasonable assumption that the internal structure of the model is not intentionally optimizing against us. And I give an analogy like to humans. So uh, it's actually possible, um, to, you know, to look at an MRI of someone, um, and predict above random chance whether they're a psychopath.Um, there was actually a story a few years back about a neuroscientist who was studying this and then he looked at his own scan and discovered that he was a psychopath. And then everyone, e- everybody in his life was like, "No, no, no, that's just as obvious. Like, you're a complete asshole." (laughs) "Like, you must be a psychopath." Um, and he, he was totally unaware of this. The basic idea that, um, you know, that, that, there can be these macro features that, like, psychopath is probably a good analogy for it, right? They're like, you know, this is what we would be afraid of. A model that's kind of, like, charming on the surface, very goal-oriented and, you know, but very dark on the inside. Uh, you know, and, and on, you know, on the surface, their behavior might look like the behavior of someone else, but their goals are very different.

    30. DP

      A question somebody might have is, listen, you know, you mentioned earlier the importance of being empirical.
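
One way to read the training-set/test-set analogy in this exchange: alignment training methods are things you iterate on freely, cheap behavioral evals play the role of the validation set, and the interpretability "x-ray" is a held-out test you consult as rarely as possible so you never start optimizing against it. The sketch below is hypothetical Python with placeholder names and invented scores; it is not Anthropic's tooling, just the discipline being described.

    # Hypothetical workflow: iterate against behavioral evals, consult the
    # interpretability audit sparingly so it stays an honest held-out check.
    candidate_methods = ["rlhf_variant_a", "constitutional_v2", "debate_prototype"]

    def behavioral_eval(method):        # cheap proxy score you may optimize against
        return {"rlhf_variant_a": 0.71, "constitutional_v2": 0.78, "debate_prototype": 0.64}[method]

    def interpretability_xray(method):  # expensive internal audit; never train on this
        return {"rlhf_variant_a": 0.55, "constitutional_v2": 0.80, "debate_prototype": 0.60}[method]

    best = max(candidate_methods, key=behavioral_eval)   # iterate on the "validation set" only
    XRAY_BUDGET = 1                                      # look at the "test set" once
    if XRAY_BUDGET > 0:
        print(best, "x-ray score:", interpretability_xray(best))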

  8. 57:43 - 1:05:30

    Does alignment research require scale?

    1. DA

      more like us.

    2. DP

      Uh, talent density I'm sure is important, but another thing Anthropic has em- emphasized is that you need to have frontier models in order to do safety research.

    3. DA

      Yes.

    4. DP

      And of course, like, actually be a company as well. The current frontier models, somebody might guess, like, GPT-4, clock in at, like, $100 million or something like that.

    5. DA

      Uh, that general order of magnitude, in very broad terms, is not wrong.

    6. DP

      But, you know, two to three years from now, the kinds of things you're talking about, we're talking more and more orders of magnitude. To keep up with that and to... if it's the case that safety requires being on the frontier, I mean, what is the case in which Anthropic is, like, competing with these leviathans to stay on that same scale?

    7. DA

      Yeah. I mean, I think it's a, I think it's a very... it's a situation with a lot of trade-offs, right? I think it's, I think it's not easy. Um, I guess to go back t- uh, maybe I'll just like answer the questions one by one, right? So like, to go back to like, you know, why do... why is safety so tied to scale, right? Um, some people don't think it is. But like if I, if I just look at like, you know, where, where, where have been, where have been the areas that, you know, you know, like safety methods have, like, been put into practice or, like, worked for something, for anything, even if we don't think they'll, they'll work in general, you know, I go back to thinking of all the ideas, you know, something like, you know, debate and amplification, right? You know, back in 2018 when we wrote papers about those at O- at OpenAI it was like, well, human feedback isn't, isn't quite gonna work but, you know, debate and amplification will take us beyond that. But then if you, if you actually look at... and we've, you know, done attempts to do debates. We're really limited by the, by the quality of the model uh, where it's like, you know, for two models to have a debate that is coherent enough that a human can judge it so that the training process can actually work, you need models that are at or maybe even beyond on some topics the current frontier. Now you can come up with a method, you can come up with the idea without being on the frontier but I... you know, for me that's a very small fraction of what needs to be done, right? It's very easy to come up with these methods. It's very easy to come up with, like, "Oh, the problem is X, maybe a solution is Y." But, you know, I, I really want to know, you know, whe- whether things work in practice even for the systems we have today and I want to know what kinds of things go wrong with them. I, I, I just feel like you discover 10 new ideas and 10 new ways that things are going to go wrong by trying these in practice and that, that empirical learning, I think it's, it's n- just not as widely understood as it should be. Kind of every... you know, I would say the same thing about methods like Constitutional AI and some people say, "Oh, it doesn't matter. Like, we know this method doesn't work. It won't work for, you know, pure alignment." I neither agree nor disagree with that.... I think that's just kind of overconfident. The way we discover new things and understand the structure of what's gonna work and what's, what's not is by playing around with things. Not that we should just kind of blindly say, "Oh, this worked here and so, so it'll work there," but you, you, you really, you really start to understand the patterns, like with, like with the scaling laws. Even mechanistic interpretability, which might be the one area I see where a lot of progress has been made without the frontier models. We are, you know, we're seeing in, you know, the work that, say, OpenAI put out a couple, a couple months ago that, you know, using very powerful models to help you auto-interpret the weak models, again, that's not everything you can do in interpretability but, you know, that's a, that's a big component of it and we, you know, we found it useful too.

    8. DP

      Yeah.

    9. DA

      And so you see this f- this, this phenomenon over and over again where it's like, you know, the, the scaling and the safety are these two snakes that are, like, coiled with each other, always even more than you think, right? I, I, you know, uh, y- with interpretability, like, I think three years ago I didn't think that this would be as true of interpretability but somehow it manages to be true. Why? Because intelligence is useful, it's useful for a number of tasks. One of the tasks it's useful for is, like, figuring out how to judge and evaluate other intelligence, and maybe someday even, even for, you know, doing the alignment research itself.

    10. DP

      Uh, g- given all that's true, what, what does that imply for Anthropic when in two to three years-

    11. DA

      Yes.

    12. DP

      ... these leviathans are doing, like, $10 billion training runs?

    13. DA

      Yes. Yes, so, uh, choice one is if it, if we can't or if it costs too much to stay on the frontier then, you know, then, then we shouldn't, uh, then we shouldn't do it and, you know, we won't work with the most advanced models, we'll see what we can get with, you know, models that are not quite as advanced. I think you can get some value there, like, non-zero value, but I'm, I'm kind of skeptical that the value is all that high or the learning can be fast enough to really, to really be in favor of the task. The second option is you just, you just find a way, you just, uh, you know, you just accept the trade-offs. And I think the trade-offs are more positive than they appear because of a phenomenon that I've called race to the top. Um, I could go into that later but I'll just, let me put that aside for now. Uh, and then I think the third phenomenon is, you know, as things get, as things get to that scale, I think this may coincide with, you know, starting to get into some non-trivial probability of very serious danger. Again, I think it's gonna come first from misuse, the kind of bio stuff that I talked about, but I don't think we have the level of autonomy yet to worry about some of the, you know, alignment stuff happening in, like, two years but it might not be very far behind that at all. You know, that, that may, that may lead to unilateral or multilateral or government enforced, which we support, d- decisions, uh, not to scale as fast as we could.

    14. DP

      Yeah.

    15. DA

      Um, that may end up being the right thing to do. So I, I, you know, actually that's kind of like, I, I kind of hope things go in that, in that direction, um, and then we don't have this hard trade-off between we're not on the frontier and we can't quite do the research as well as, as well as we want or influence other orgs as well as we want, um, or versus we're kind of on the frontier and, like, have to accept the trade-offs which are, which are net positive but, like, have a, have a lot in both, in both directions.

    16. DP

      Okay. On the, on the misuse versus misalignment, those are both problems as you mentioned but in the long scheme of things, what ha- what is, what are you more concerned about? Like 30 years down the line-

    17. DA

      Yeah.

    18. DP

      ... which do you think will be a, considered a bigger problem?

    19. DA

      I, I think it's much less than 30 years, um, but I'm, I'm worried about both. I don't know. If you have, if you, if you have a model that could in theory, you know, like, take over the world on its own, um, if you were able to control that model then, you know, it follows pretty simply that, you know, if a model was following the wishes of some small subset of people and not others then those people could use it to take over the world on, on their, on their behalf. The very premise of misalignment means that we should be worried about misuse as well with similar levels of consequences.

    20. DP

      But, but so- some people who might be more doomer-y than you would say, with misuse, you're already working towards the optimistic scenario there, because you've at least figured out how to align the model with the bad guys; now you just need to make sure that it's aligned with the good guys instead.

    21. DA

      Yeah.

    22. DP

      Why do you think that you could get to the point where it, it's aligned with the bad, you know, where you haven't already solved the-

    23. DA

      Yes. I, I guess if you had the view that, like, alignment is completely unsolvable then, uh, you know, then you'd be like, "Well I don't, you know, we're dead anyway so I don't wanna worry about misuse." That's not my position at all but, but also, like, you should think in terms of, like, what's a plan that would actually succeed that would make things good? Any plan that actually succeeds regardless of how hard misalignment is to solve-

    24. DP

      Mm-hmm.

    25. DA

      Any problem, any plan that actually succeeds is gonna need to solve misuse as well as misalignment. It's going to need to solve the fact that, like, as the AI models get better, you know, faster and faster, they're gonna create a big problem around the balance of power between countries, they're gonna create a big problem around is it possible for a single individual to do something bad that it's hard for everyone else to stop? Any actual solution that needs to, leads to a good future needs to solve those problems as well.

  9. 1:05:30 - 1:09:06

    Misuse vs misalignment

    1. DA

    2. DP

      Mm-hmm.

    3. DA

      If your perspective is we're screwed because we can't solve the first problem so don't worry about problems two and three, like, that- that's not really a statement that you shouldn't worry about problems two and three, right? Like, w- they're, they're in our path, what, what, no matter what...

    4. DP

      Yeah. In, in the scenario we succeed we have to solve all of them so yeah-

    5. DA

      Yeah.

    6. DP

      ... we might as well operate... We should be planning for success-

    7. DA

      Right.

    8. DP

      ... not for failure. I- if misuse doesn't happen and the right people have the superhuman models, what does that look like? Like, who are the right people? Who, who is actually-

    9. DA

      Yeah.

    10. DP

      ... controlling the model five years from now?

    11. DA

      Yeah. I mean, my, my view is that these things are powerful enough that I think, you know, it's, it's going to involve, you know, a substantial role or at least involvement of, you know, some kind of government or assembly of government bodies. Again, like, you know, there are, there are kind of very naive versions of this like, you know, I don't think we should just, you know, I don't know, like, h- hand, h- hand the model over to the UN or whoever happens to be in office at, at a given time. Like, I could see that go poorly but there-... it's, it's too powerful. There needs to be some kind of legitimate process for managing this technology which i- you know, includes the role of the people building it, includes the role of like democratically elected authorities, includes the role of, you know, all the, all the (laughs) individuals who will be affected by it so that there, there... A, a- a- a- at the end of the day there, there needs to be some politically legitimate process.

    12. DP

      But what does that look like? If, if it's not the case that you just hand it to whoever the president is at the time-

    13. DA

      Yeah.

    14. DP

      ... uh, is, what does the body look like? What, uh, I mean, is it something you're-

    15. DA

      These are things it's really hard to know ahead of time. Like, I think, you know, people love to kind of propose these broad plans and say like, "Oh, this is the way we should do it, this is the way we should do it." I think the honest fact is that we're figuring this out as we go along and that, you know, a- a- a- and anyone who says, you know, "This is, this is the body that, you know, we should create this kind of body modeled after this thing," like I think, I think we should try things and experiment with them with less powerful versions of the technology. We, we need to figure this out in time but, but also it's not the, really the kind of thing you can know in advance.

    16. DP

      Mm. The, the long-term benefit trust-

    17. DA

      Yes.

    18. DP

      ... that you have, how do, how would that interface with this body? Is that the body itself? If not, is it like... So just for the context-

    19. DA

      Yeah.

    20. DP

      ... you might wanna explain what it is for the audience but-

    21. DA

      Yeah, yeah. So I don't know, I think of the long-term benefit trust as like a much, a much narrower thing. Like, this is something that, like, makes decisions for Anthropic. So this is basically a body... It was described in a recent Vox article, we'll be saying more about it in, you know, later, later this year. Uh, but it's basically a body that o- over time, uh, gains the ability to appoint the majority of the board seats of Anthropic.

    22. DP

      Right.

    23. DA

      Uh, and this is so, you know, it's a mixture of experts in I'd say, like, AI alignment, national security, and philanthropy in general.

    24. DP

      But if control of Anthropic is handed to them, that doesn't imply-

    25. DA

      Yes.

    26. DP

      ... that control of... If Anthropic has AGI, the control of AGI-

    27. DA

      Yeah.

    28. DP

      ... itself is handed to them.

    29. DA

      That doesn't, that doesn't imply that Anthropic or any other entity should be the entity that, like, makes decisions about AGI on behalf of humanity.

    30. DP

      Right.

  10. 1:09:06 - 1:11:05

    What if AI goes well?

    1. DA

      could deal with it.

    2. DP

      Mm-hmm. Okay, so let's forget about governance. Let's just talk about what this going well looks like. Obviously there's the things we can all agree on. You know, cure all the diseases-

