Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity | Lex Fridman Podcast #452
- 0:00 – 3:14
Introduction
- Dario Amodei
If you extrapolate the curves that we've had so far, right? If you say, well, I don't know, we're starting to get to, like, PhD level, and last year we were at undergraduate level, and the year before we were at, like, the level of a high school student. Again, you can quibble with at what tasks and for what. We're still missing modalities, but those are being added. Like computer use was added, like image generation has been added. If you just kind of eyeball the rate at which these capabilities are increasing, it does make you think that we'll get there by 2026 or 2027. I think there are still worlds where it doesn't happen in 100 years. The number of those worlds is rapidly decreasing. We are rapidly running out of truly convincing blockers, truly compelling reasons why this will not happen in the next few years. The scale-up is very quick. Like, we do this today. We make a model, and then we deploy thousands, maybe tens of thousands, of instances of it. I think by the time, you know, certainly within two to three years, whether we have these super powerful AIs or not, clusters are gonna get to the size where you'll be able to deploy millions of these. I am optimistic about meaning. I worry about economics and the concentration of power. That's actually what I worry about more, the abuse of power.
- Lex Fridman
And AI increases the, uh, amount of power in the world, and if you concentrate that power and abuse that power, it can do immeasurable damage.
- Dario Amodei
Yes. It's very frightening. It's very frightening.
- Lex Fridman
The following is a conversation with Dario Amodei, CEO of Anthropic, the company that created Claude, which is currently, and often, at the top of most LLM benchmark leaderboards. On top of that, Dario and the Anthropic team have been outspoken advocates for taking the topic of AI safety very seriously, and they have continued to publish a lot of fascinating AI research on this and other topics. I'm also joined afterwards by two other brilliant people from Anthropic. First, Amanda Askell, who is a researcher working on alignment and fine-tuning of Claude, including the design of Claude's character and personality. A few folks told me she has probably talked with Claude more than any human at Anthropic, so she was definitely a fascinating person to talk to about prompt engineering and practical advice on how to get the best out of Claude. After that, Chris Olah stopped by for a chat. He's one of the pioneers of the field of mechanistic interpretability, which is an exciting set of efforts that aims to reverse engineer neural networks, to figure out what's going on inside, inferring behaviors from neural activation patterns inside the network. This is a very promising approach for keeping future superintelligent AI systems safe. For example, by detecting from the activations when the model is trying to deceive the human it is talking to. This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Dario Amodei.
- 3:14 – 12:20
Scaling laws
- Lex Fridman
Let's start with the big idea of scaling laws and the scaling hypothesis. What is it? What is its history? And where do we stand today?
- Dario Amodei
So I can only describe it as it relates to my own experience, but I've been in the AI field for about 10 years, and it was something I noticed very early on. I first joined the AI world when I was working at Baidu with Andrew Ng in late 2014, which is almost exactly 10 years ago now, and the first thing we worked on was speech recognition systems. In those days, I think deep learning was a new thing. It had made lots of progress, but everyone was always saying, "We don't have the algorithms we need to succeed. We're only matching a tiny, tiny fraction; there's so much we need to discover algorithmically. We haven't found the picture of how to match the human brain." And, you know, in some ways it was fortunate. You can have almost beginner's luck, right? I was a newcomer to the field, and I looked at the neural net that we were using for speech, the recurrent neural networks, and I said, "I don't know, what if you make them bigger and give them more layers? And what if you scale up the data along with this?" I just saw these as independent dials that you could turn, and I noticed that the models started to do better and better as you gave them more data, as you made the models larger, as you trained them for longer. I didn't measure things precisely in those days, but along with colleagues, we very much got the informal sense that the more data and the more compute and the more training you put into these models, the better they perform. And so initially my thinking was, hey, maybe that is just true for speech recognition systems, right? Maybe that's just one particular quirk, one particular area. I think it wasn't until 2017, when I first saw the results from GPT-1, that it clicked for me that language is probably the area in which we can do this. We can get trillions of words of language data, we can train on them, and the models we were training in those days were tiny. You could train them on one to eight GPUs, whereas, you know, now we train jobs on tens of thousands, soon going to hundreds of thousands, of GPUs. And so when I saw those two things together... And, you know, there were a few people, like Ilya Sutskever, who you've interviewed, who had somewhat similar views. He might have been the first one, though I think a few people came to similar views around the same time, right? There was Rich Sutton's bitter lesson; Gwern wrote about the scaling hypothesis. But I think somewhere between 2014 and 2017 was when it really clicked for me, when I really got conviction that, hey, we're gonna be able to do these incredibly wide cognitive tasks if we just scale up the models. And at every stage of scaling, there are always arguments, and, you know, when I first heard them, honestly I thought, probably I'm the one who's wrong, and all of these experts in the field are right. They know the situation better than I do, right? There's the Chomsky argument, like, you can get syntax but you can't get semantics. There was this idea: oh, you can make a sentence make sense, but you can't make a paragraph make sense. The latest ones we have today are, you know, we're gonna run out of data, or the data isn't high quality enough, or models can't reason.
And each time, we managed to either find a way around, or scaling just is the way around. Sometimes it's one, sometimes it's the other. And so I'm now at this point where I still think, you know, it's always quite uncertain. We have nothing but inductive inference to tell us that the next few years are gonna be like the last 10 years. But I've seen the movie enough times, I've seen the story happen enough times, to really believe that probably the scaling is going to continue, and that there's some magic to it that we haven't really explained on a theoretical basis yet.
- Lex Fridman
And, of course, the scaling here is bigger networks, bigger data, bigger compute.
- Dario Amodei
Yes.
- Lex Fridman
All of those.
- Dario Amodei
In particular, linear scaling up of bigger networks, bigger training times, and more data. So all of these things... It's almost like a chemical reaction. You have three ingredients in the chemical reaction, and you need to linearly scale up the three ingredients. If you scale up one and not the others, you run out of the other reagents and the reaction stops. But if you scale up everything in series, then the reaction can proceed.
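For readers who want the quantitative version of this "three ingredients" picture, the scaling-laws literature (e.g., Kaplan et al. 2020 and the Chinchilla analysis of Hoffmann et al. 2022) fits parametric loss curves of roughly the form below. The symbols and the "both exponents near 0.5" figure come from that literature, not from this conversation:

```latex
% Parametric loss fit used in the scaling-laws literature
% (N = parameters, D = training tokens, C \approx 6ND = training compute):
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
% Minimizing L under a fixed compute budget C yields the "scale the
% ingredients together" prescription:
N^{*}(C) \propto C^{a}, \qquad D^{*}(C) \propto C^{b}, \qquad a + b \approx 1
% with a and b both near 0.5 in the Chinchilla fits, i.e. parameters and
% data grow in tandem, like reagents in a balanced reaction.
```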
- Lex Fridman
And, of course, now that you have this kind of empirical science/art, you can apply it to other, more nuanced things, like scaling laws applied to interpretability, or scaling laws applied to post-training, or just seeing how does this thing scale. But the big scaling law, I guess the underlying scaling hypothesis, has to do with big networks and big data leading to intelligence.
- Dario Amodei
Yeah. We've documented scaling laws in lots of domains other than language, right? So initially, the paper we did that first showed it was in early 2020, where we first showed it for language. There was then some work late in 2020 where we showed the same thing for other modalities like images, video, text-to-image, image-to-text, math, that they all had the same pattern. And you're right. Now there are other stages like post-training, or there are new types of reasoning models. And in all of those cases that we've measured, we see similar types of scaling laws.
- Lex Fridman
A bit of a philosophical question, but what's your intuition about why bigger is better in terms of network size and data size? Why does it lead to more intelligent models?
- Dario Amodei
So, in my previous career I was a biophysicist. I did physics undergrad and then biophysics in grad school. So when I think back to what I know as a physicist, which is actually much less than what some of my colleagues at (laughs) Anthropic have in terms of expertise in physics, there's this concept called 1/f noise and 1/x distributions, where often, just like if you add up a bunch of natural processes you get a Gaussian, if you add up a bunch of differently distributed natural processes, if you, like, take a probe and hook it up to a resistor, the distribution of the thermal noise in the resistor goes as one over the frequency. It's some kind of natural convergent distribution. And I think what it amounts to is that a lot of things that are produced by some natural process that has a lot of different scales, right, not a Gaussian, which is kind of narrowly distributed, but, you know, large and small fluctuations that lead to electrical noise, they have this decaying 1/x distribution. And so now I think of, like, patterns in the physical world, or in language. If I think about the patterns in language, there are some really simple patterns. Some words are much more common than others, like "the." Then there's basic noun-verb structure. Then there's the fact that nouns and verbs have to agree, they have to coordinate. Then there's the higher-level sentence structure. Then there's the thematic structure of paragraphs. And so given this progression of structure, you can imagine that as you make the networks larger, first they capture the really simple correlations, the really simple patterns, and there's this long tail of other patterns. And if that long tail of other patterns is really smooth, like it is with the 1/f noise in physical processes like resistors, then you can imagine that as you make the network larger, it's kind of capturing more and more of that distribution. And so that smoothness gets reflected in how well the models predict it, how well they perform. Language is an evolved process, right? We've developed language. We have common words and less common words. We have common expressions and less common expressions. We have ideas, cliches that are expressed frequently, and we have novel ideas. And that process has developed, has evolved with humans over millions of years. And so the guess, and this is pure speculation, would be that there is some kind of long-tail distribution of these ideas.
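A toy numerical illustration of this long-tail intuition (my sketch; the Zipf-like 1/x law and the "capacity" notion are illustrative assumptions, not anything from the conversation): if pattern frequencies decay smoothly, each increment of model capacity captures a predictable extra slice of the distribution, so performance improves smoothly rather than hitting a wall.

```python
import numpy as np

# Assume pattern frequencies follow a Zipf-like 1/x law (illustrative).
n_patterns = 1_000_000
ranks = np.arange(1, n_patterns + 1)
freq = 1.0 / ranks
prob = freq / freq.sum()

# Toy "capacity" model: a network with capacity k learns the k most
# common patterns perfectly and misses the rest entirely.
for k in (10, 1_000, 100_000, 1_000_000):
    covered = prob[:k].sum()
    print(f"capacity={k:>9,}  fraction of distribution captured={covered:.3f}")

# Because the tail decays smoothly, coverage grows smoothly (roughly
# logarithmically in k for a 1/x law): no sudden wall, but also no point
# where a small network has already captured everything.
```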
- Lex Fridman
So there's the long tail, but also there's the height of the hierarchy of concepts that you're building up, so the bigger the network, presumably you have a higher capacity to...
- Dario Amodei
Exactly. If you have a small network, you only get the common stuff, right? If I take a tiny neural network, it's very good at understanding that, you know, a sentence has to have verb, adjective, noun, right? But it's terrible at deciding what those verbs, adjectives, and nouns should be and whether they should make sense. If I make it just a little bigger, it gets good at that. Then suddenly, it's good at the sentences, but it's not good at the paragraphs. And so these rarer and more complex patterns get picked up as I add more capacity to the network.
- 12:20 – 20:45
Limits of LLM scaling
- Lex Fridman
Well, the natural question then is, what's the ceiling of this?
- Dario Amodei
Yeah.
- Lex Fridman
Like, how complicated and complex is the real world? How much of this stuff is there to learn?
- Dario Amodei
I don't think any of us knows the answer to that question. My strong instinct would be that there's no ceiling below the level of humans, right? We humans are able to understand these various patterns, and so that makes me think that if we continue to scale up these models, to kind of develop new methods for training them and scaling them up, that will at least get to the level that we've gotten to with humans. There's then a question of, you know, how much more is it possible to understand than humans do? How much is it possible to be smarter and more perceptive than humans? I would guess the answer has got to be domain-dependent. If I look at an area like biology, and you know, I wrote this essay, Machines of Loving Grace, it seems to me that humans are struggling to understand the complexity of biology, right? If you go to Stanford or to Harvard or to Berkeley, you have whole departments of folks trying to study, you know, like, the immune system or metabolic pathways, and each person understands only a tiny part of it, specializes, and they're struggling to combine their knowledge with that of other humans. And so I have an instinct that there's a lot of room at the top for AIs to get smarter. If I think of something like materials in the physical world, or, you know, addressing conflicts between humans or something like that, it may be that some of these problems are not intractable but much harder, and it may be that there's only so well you can do at some of these things, right? Just like with speech recognition: there's only so clearly I can hear your speech. So I think in some areas there may be ceilings that are very close to what humans have done. In other areas, those ceilings may be very far away. And I think we'll only find out when we build these systems. It's very hard to know in advance. We can speculate, but we can't be sure.
- Lex Fridman
And in some domains the ceiling might have to do with human bureaucracies and things like this, as you write about.
- Dario Amodei
Yes.
- Lex Fridman
So humans fundamentally have to be part of the loop. Maybe that's the cause of the ceiling, not the limit of the intelligence.
- Dario Amodei
Yeah. I think in many cases, you know, in theory, technology could change very fast. For example, all the things that we might invent with respect to biology. But remember, there's a clinical trial system that we have to go through to actually administer these things to humans. I think that's a mixture of things that are unnecessary and bureaucratic, and things that kind of protect the integrity of society, and the whole challenge is that it's hard to tell which is which, right? My view is definitely, in terms of drug development, that we're too slow and we're too conservative. But certainly, if you get these things wrong, it's possible to risk people's lives by being too reckless. And so at least some of these human institutions are in fact protecting people. So it's all about finding the balance. I strongly suspect that balance is kind of more on the side of pushing to make things happen faster, but there is a balance.
- Lex Fridman
If we do hit a limit, if we do hit a slowdown in the scaling laws, what do you think would be the reason? Is it compute-limited, data-limited, is it something else? Idea-limited?
- Dario Amodei
So, a few things. Now we're talking about hitting the limit before we get to the level of...
- Lex Fridman
Yeah.
- Dario Amodei
... of humans and the skill of humans. So I think one that's popular today, and I think, you know, could be a limit that we run into, I, like most of the limits, would bet against it, but it's definitely possible, is we simply run out of data. There's only so much data on the internet, and there's issues with the quality of the data, right? You can get hundreds of trillions of words on the internet, but a lot of it is repetitive, or it's search engine optimization drivel, or maybe in the future it'll even be text generated by AIs itself. And so I think there are limits to what can be produced in this way. That said, we, and I would guess other companies, are working on ways to make data synthetic, where you can, you know, use the model to generate more data of the type that you have already, or even generate data from scratch. If you think about what was done with DeepMind's AlphaGo Zero, they managed to get a bot all the way from no ability to play Go whatsoever to above human level, just by playing against itself. There was no example data from humans required in the AlphaGo Zero version of it. The other direction, of course, is these reasoning models that do chain of thought and stop to think and reflect on their own thinking. In a way, that's another kind of synthetic data, coupled with reinforcement learning. So my guess is that with one of those methods we'll get around the data limitation, or there may be other sources of data that are available. We could also just observe that even if there's no problem with data, as we start to scale models up, they just stop getting better. It has seemed to be a reliable observation that they've gotten better; that could just stop at some point for a reason we don't understand. The answer could be that we need to invent some new architecture. There have been problems in the past with, say, numerical stability of models, where it looked like things were leveling off, but actually, when we found the right unblocker, they didn't end up doing so. So perhaps there's some new optimization method or some new technique we need to unblock things. I've seen no evidence of that so far, but if things were to slow down, that perhaps could be one reason.
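A minimal, runnable sketch of the self-play recipe behind AlphaGo Zero that Dario references (a toy game and tabular policy of my own construction, not DeepMind or Anthropic code): the model's own games are the training data, and the game outcome supplies the label, so no human examples are needed.

```python
import random
from collections import defaultdict

# Toy self-play learner for Nim: take 1-3 stones from a pile; whoever
# takes the last stone wins. The policy learns purely from games it
# plays against itself -- no human example data anywhere.

value = defaultdict(float)  # (stones_remaining, move) -> learned value

def choose_move(stones, eps=0.1):
    moves = [m for m in (1, 2, 3) if m <= stones]
    if random.random() < eps:
        return random.choice(moves)                       # explore
    return max(moves, key=lambda m: value[(stones, m)])   # exploit

def self_play_episode(start=21):
    stones, player, history = start, 0, {0: [], 1: []}
    while stones > 0:
        move = choose_move(stones)
        history[player].append((stones, move))
        stones -= move
        winner = player          # the player who just moved took the last stone
        player = 1 - player
    return history, winner

for episode in range(50_000):
    history, winner = self_play_episode()
    for player, transitions in history.items():
        reward = 1.0 if player == winner else -1.0
        for stones, move in transitions:
            # Nudge each move's value toward the final game outcome:
            # this outcome label is the "synthetic data."
            value[(stones, move)] += 0.01 * (reward - value[(stones, move)])

# The learned policy tends to rediscover the known strategy:
# leave a multiple of 4 stones for the opponent.
print([max((1, 2, 3), key=lambda m: value[(s, m)]) for s in range(4, 22)])
```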
- Lex Fridman
What about the limits of compute? Meaning, uh, the expensive nature of building bigger and bigger data centers?
- Dario Amodei
So right now, I think most of the frontier model companies, I would guess, are operating at roughly $1 billion scale, plus or minus a factor of three, right? Those are the models that exist now or are being trained now. I think next year we're gonna go to a few billion, and then in 2026 we may go above $10 billion, and probably by 2027 there are ambitions to build $100 billion clusters. And I think all of that actually will happen. There's a lot of determination to build the compute to do it within this country, and I would guess that it actually does happen. Now, if we get to $100 billion and that's still not enough compute, that's still not enough scale, then either we need even more scale, or we need to develop some way of doing it more efficiently, of shifting the curve. Between all of these, one of the reasons I'm bullish about powerful AI happening so fast is just that if you extrapolate the next few points on the curve, we're very quickly getting towards human-level ability, right? Some of the new models that we developed, some reasoning models that have come from other companies, they're starting to get to what I would call the PhD or professional level, right? If you look at their coding ability: the latest model we released, Sonnet 3.5, the new or updated version, gets something like 50% on SWE-bench, and SWE-bench is an example of a bunch of professional, real-world software engineering tasks. At the beginning of the year, I think the state of the art was 3 or 4%. So in ten months we've gone from 3% to 50% on this task, and I think in another year we'll probably be at 90%. I mean, I don't know, might even be less than that. We've seen similar things in graduate-level math, physics, and biology from models like OpenAI's o1. So if we just continue to extrapolate this, right, in terms of the skills that we have, I think if we extrapolate the straight curve, within a few years we will get to these models being, you know, above the highest professional level in terms of humans. Now, will that curve continue? You've pointed to, and I've pointed to, a lot of possible reasons why that might not happen. But if the extrapolation curve continues, that is the trajectory we're on.
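To make the "extrapolate the curve" arithmetic concrete: here is a toy fit using only the two SWE-bench-style figures quoted above (~4% at the start of the year, ~50% ten months later). The logistic form is my assumption, chosen because a benchmark score saturates at 100%; it is not Anthropic's forecasting method.

```python
import math

# Two observed points quoted in the conversation: ~4% at month 0,
# ~50% at month 10. Fit a logistic curve through them: for a logistic,
# logit(p) is linear in time.
def fit_logistic(p0, t0, p1, t1):
    logit = lambda p: math.log(p / (1 - p))
    b = (logit(p1) - logit(p0)) / (t1 - t0)
    a = logit(p0) - b * t0
    return a, b

a, b = fit_logistic(0.04, 0, 0.50, 10)
for month in (0, 10, 20, 30):
    p = 1 / (1 + math.exp(-(a + b * month)))
    print(f"month {month:>2}: predicted score {p:.0%}")

# Under these assumptions the curve passes ~95% around month 20,
# consistent with the "90% in another year" guess, but only as a loose
# extrapolation, exactly as Amodei caveats.
```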
- 20:45 – 26:08
Competition with OpenAI, Google, xAI, Meta
- Lex Fridman
So Anthropic has several competitors. It'd be interesting to get sort of your view of it all. OpenAI, Google, xAI, Meta. What does it take to win, in the broad sense of win, in this space?
- Dario Amodei
Yeah, so I wanna separate out a couple of things, right? So, you know, Anthropic's mission is to kind of try to make this all go well, and we have a theory of change called Race to the Top. Race to the Top is about trying to push the other players to do the right thing by setting an example. It's not about being the good guy; it's about setting things up so that all of us can be the good guy. I'll give a few examples of this. Early in the history of Anthropic, one of our co-founders, Chris Olah, who I believe you're interviewing soon, he's the co-founder of the field of mechanistic interpretability, which is an attempt to understand what's going on inside AI models. So we had him and one of our early teams focus on this area of interpretability, which we think is good for making models safe and transparent. For three or four years, that had no commercial application whatsoever. It still doesn't today. We're doing some early betas with it, and probably it will eventually, but, you know, this is a very, very long research bet, and one in which we've built in public and shared our results publicly. And we did this because we think it's a way to make models safer. An interesting thing is that as we've done this, other companies have started doing it as well. In some cases because they've been inspired by it; in some cases because they're worried that, you know, if other companies doing this look more responsible, they wanna look more responsible too. No one wants to look like the irresponsible actor. And so they adopt this as well. When folks come to Anthropic, interpretability is often a draw, and I tell them, "The other places you didn't go, tell them why you came here."
- Lex Fridman
(laughs)
- Dario Amodei
And then soon you see that there are interpretability teams elsewhere as well. And in a way, that takes away our competitive advantage, because it's like, oh, (laughs) now others are doing it as well. But it's good for the broader system, and so we have to invent some new thing that we're doing that others aren't doing, and the hope is to basically bid up the importance of doing the right thing. And it's not about us in particular, right? It's not about having one particular good guy. Other companies can do this as well. If they join the race to do this, that's, you know, the best news ever, right? It's about kind of shaping the incentives to point upward instead of shaping the incentives to point downward.
- Lex Fridman
And we should say this example of the field of mechanistic interpretability is just a rigorous, non-hand-wavy way of doing AI safety.
- Dario Amodei
Yes.
- Lex Fridman
Or it's tending that way.
- Dario Amodei
Trying to. I mean, I think we're still early in terms of our ability to see things. But I've been surprised at how much we've been able to look inside these systems and understand what we see. Unlike with the scaling laws, where it feels like there's some, you know, law that's driving these models to perform better, on the inside, the models aren't... You know, there's no reason why they should be designed for us to understand them, right? They're designed to operate, they're designed to work, just like the human brain or human biochemistry. They're not designed for a human to open up the hatch, look inside, and understand them. But we have found, and you can talk in much more detail about this to Chris, that when we open them up, when we do look inside them, we find things that are surprisingly interesting.
- Lex Fridman
And as a side effect, you also get to see the beauty of these models. You get to explore the beautiful nature of large neural networks through the mech interp kind of methodology.
- Dario Amodei
I'm amazed at how clean it's been.
- Lex Fridman
Yeah. (laughs)
- Dario Amodei
I'm amazed at things like induction heads. I'm amazed at things like the fact that we can use sparse autoencoders to find these directions within the networks, and that the directions correspond to these very clear concepts. We demonstrated this a bit with Golden Gate Claude. So this was an experiment where we found a direction inside one of the neural network's layers that corresponded to the Golden Gate Bridge, and we just turned that way up. And so we released this model as a demo. It was kind of half a joke, for a couple of days, but it was illustrative of the method we developed. And you could take the model, you could ask it about anything. You could say, "How was your day?" And anything you asked, because this feature was activated, would connect to the Golden Gate Bridge. So it would say, you know, "I'm feeling relaxed and expansive, much like the arches-"
- Lex Fridman
(laughs) Okay.
- Dario Amodei
"... of the Golden Gate Bridge," or, you know, y-
- Lex Fridman
It would masterfully change topic-
- Dario Amodei
Yes.
- Lex Fridman
... to the Golden Gate Bridge, and integrate it. There was also a sadness to it, to the focus it had on the Golden Gate Bridge. I think people quickly fell in love with it. So people already miss it, 'cause it was taken down, I think, after a day.
- Dario Amodei
Somehow these interventions on the model, where you kind of adjust its behavior, somehow emotionally made it seem more human-
- Lex Fridman
Yeah. Yeah.
- Dario Amodei
... than any other version of the model we've seen.
- Lex Fridman
It's a strong personality, strong identity.
- Dario Amodei
It has a strong personality. It has these kind of, like, obsessive interests.
- Lex Fridman
(laughs)
- Dario Amodei
You know, we can all think of someone who's-
- Lex Fridman
Yeah.
- Dario Amodei
... like obsessed with something.
- Lex Fridman
Yeah.
- Dario Amodei
So it does make it feel somehow a bit more human.
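For the technically curious, here is a minimal numpy sketch of the intervention being described: activation steering along a feature direction found by a sparse autoencoder. This is a toy reconstruction of the published Golden Gate Claude idea, not Anthropic's code; the dimensions, the random stand-in direction, and `steering_strength` are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512   # width of a residual-stream activation (illustrative)

# Pretend a sparse autoencoder has already given us a unit vector that
# fires on Golden Gate Bridge text. Here it's just a random stand-in.
feature_direction = rng.normal(size=d_model)
feature_direction /= np.linalg.norm(feature_direction)

def steer(activations: np.ndarray, direction: np.ndarray,
          steering_strength: float = 10.0) -> np.ndarray:
    """Clamp each activation's component along `direction` to a large
    positive value -- i.e., 'turn the feature way up'."""
    current = activations @ direction  # per-token projection onto the feature
    # Remove the existing component, then add the boosted one back.
    return activations - np.outer(current, direction) \
                       + steering_strength * direction

# One forward pass worth of activations for a 7-token prompt (fake data).
acts = rng.normal(size=(7, d_model))
steered = steer(acts, feature_direction)
print(np.round(steered @ feature_direction, 2))  # every token now ~10.0
```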
- 26:08 – 29:44
Claude
- Lex Fridman
Let's talk about the present. Let's talk about Claude. So this year, (laughs) a lot has happened. In March, Claude 3 Opus, Sonnet, and Haiku were released. Then Claude 3.5 Sonnet in July, with an updated version just now released. And then also Claude 3.5 Haiku was released. Okay, can you explain the difference between Opus, Sonnet, and Haiku, and how we should think about the different versions?
- Dario Amodei
Yeah, so let's go back to March, when we first released these three models. So our thinking was, you know, different companies produce kind of large and small models, better and worse models. We felt that there was demand both for a really powerful model, which might be a little bit slower and you'd have to pay more for, and also for fast, cheap models that are as smart as they can be for how fast and cheap they are, right? Whenever you wanna do some kind of difficult analysis, like if I wanna write code, for instance, or I wanna brainstorm ideas, or I wanna do creative writing, I want the really powerful model. But then there's a lot of practical applications in a business sense where it's like, I'm interacting with a website, I'm doing my taxes, or I'm talking to, like, a legal advisor and I wanna analyze a contract. Or, you know, we have plenty of companies that are just like, "I want code completion in my IDE," or something. And for all of those things, you want the model to act fast, and you wanna use the model very broadly. So we wanted to serve that whole spectrum of needs. So we ended up with this kind of poetry theme. And so what's a really short poem? It's a haiku.
- Lex Fridman
Yeah.
- Dario Amodei
And so Haiku is the small, fast, cheap model that was, at the time, really surprisingly intelligent for how fast and cheap it was. A sonnet is a medium-sized poem, right? A couple of paragraphs. And so Sonnet was the middle model: smarter, but also a little bit slower, a little bit more expensive. And Opus, like a magnum opus is a large work, was the largest, smartest model at the time. So that was the original thinking behind it. And our thinking then was, well, each new generation of models should shift that trade-off curve. So when we released Sonnet 3.5, it has roughly the same cost and speed as the Sonnet 3 model, but it increased its intelligence to the point where it was smarter than the original Opus 3 model, especially for code, but also just in general. And so now, you know, we've shown results for Haiku 3.5, and I believe Haiku 3.5, the smallest new model, is about as good as Opus 3, the largest old model.
- Lex Fridman
Yeah.
- Dario Amodei
So basically, the aim here is to shift the curve, and then at some point there's gonna be an Opus 3.5. Now, every new generation of models has its own thing. They use new data. Their personality changes in ways that we kind of try to steer but are not fully able to steer. And so there's never quite that exact equivalence, where the only thing you're changing is intelligence. We always try and improve other things, and some things change without us knowing or measuring. So it's very much an inexact science. In many ways, the manner and personality of these models is more an art than it is a science.
- 29:44 – 34:30
Opus 3.5
- Lex Fridman
So what is the reason for the span of time between, say, Claude Opus 3.0 and 3.5? What takes that time, if you can speak to it?
- Dario Amodei
Yeah, so there are different processes. There's pre-training, which is, you know, just kind of the normal language model training, and that takes a very long time. That uses, these days, tens of thousands, sometimes many tens of thousands, of GPUs or TPUs or Trainium (we use different platforms, but accelerator chips), often training for months. There's then a kind of post-training phase where we do reinforcement learning from human feedback, as well as other kinds of reinforcement learning. That phase is getting larger and larger now, and often that's less of an exact science. It often takes effort to get it right. Models are then tested with some of our early partners to see how good they are, and they're then tested both internally and externally for their safety, particularly for catastrophic and autonomy risks. So we do internal testing according to our responsible scaling policy, which I could talk about in more detail. And then we have an agreement with the US and the UK AI Safety Institutes, as well as other third-party testers in specific domains, to test the models for what are called CBRN risks: chemical, biological, radiological, and nuclear. We don't think that models pose these risks seriously yet, but every new model we want to evaluate to see if we're starting to get close to some of these more dangerous capabilities. So those are the phases, and then it just takes some time to get the model working in terms of inference and launching it in the API. So there are just a lot of steps to actually making a model work. And of course, we're always trying to make the processes as streamlined as possible, right? We want our safety testing to be rigorous, but we want it to be rigorous and automatic, to happen as fast as it can without compromising on rigor. Same with our pre-training process and our post-training process. So it's just like building anything else. It's just like building airplanes: you want to make them safe, but you want to make the process streamlined. And I think the creative tension between those is an important thing in making the models work.
- Lex Fridman
Yeah. Rumor on the street, I forget who was saying it, is that Anthropic has really good tooling. So probably a lot of the challenge here is on the software engineering side, to build the tooling to have an efficient, low-friction interaction with the infrastructure.
- Dario Amodei
You would be surprised how much of the challenge of building these models comes down to software engineering, performance engineering. From the outside, you might think, "Oh man, we had this eureka breakthrough," right? Like the movie with the science: "We discovered it. We figured it out." But I think all things, even incredible discoveries, almost always come down to the details. And often super, super boring details. I can't speak to whether we have better tooling than other companies. I mean, you know, I haven't been at those other companies, at least not recently. But it's certainly something we give a lot of attention to.
- Lex Fridman
I don't know if you can say, but from Claude 3 to Claude 3.5, is there any extra pre-training going on, or is it mostly focused on the post-training? There have been leaps in performance.
- Dario Amodei
Yeah. I think at any given stage, we're focused on improving everything at once.
- Lex Fridman
Okay.
- Dario Amodei
Just naturally, there are different teams. Each team makes progress in a particular area, in making their particular segment of the relay race better. And it's just natural that when we make a new model, we put all of these things in at once.
- Lex Fridman
So the data you have, like the preference data you get from RLHF, is that applicable? Are there ways to apply it to newer models as they get trained up?
- Dario Amodei
Yeah. Preference data from old models sometimes gets used for new models, although of course it performs somewhat better when it's trained on the new models. Note that we have this Constitutional AI method, such that we don't only use preference data; there's also a post-training process where we train the model against itself. And there are new types of post-training the model against itself that are used every day. So it's not just RLHF, it's a bunch of other methods as well. Post-training, I think, is becoming more and more sophisticated.
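As a rough sketch of the "train the model against itself" idea in Constitutional AI, based on the published description of the method (the `generate` function is a hypothetical stand-in for a model call, and the principle text is abbreviated): the model critiques its own draft against a written principle and revises it, and the revised outputs become training data, with no human preference label needed at that step.

```python
# Sketch of the Constitutional AI critique-and-revision loop.
# `generate` is a hypothetical stand-in for the language model being
# trained; everything here is illustrative, not Anthropic's code.
def generate(prompt: str) -> str:
    raise NotImplementedError("stand-in for a real model call")

PRINCIPLE = "Choose the response that is most helpful, honest, and harmless."

def critique_and_revise(user_prompt: str) -> tuple[str, str]:
    draft = generate(user_prompt)
    critique = generate(
        f"Consider this principle: {PRINCIPLE}\n"
        f"Response: {draft}\n"
        "Identify specific ways the response violates the principle."
    )
    revision = generate(
        f"Original response: {draft}\n"
        f"Critique: {critique}\n"
        "Rewrite the response to address the critique."
    )
    # (user_prompt, revision) pairs then become fine-tuning data: the
    # model improving itself against a constitution rather than against
    # human preference labels.
    return draft, revision
```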
- 34:30 – 37:50
Sonnet 3.5
- Lex Fridman
Well, what explains the big leap in performance for the new Sonnet 3.5? I mean, at least on the programming side. And maybe this is a good place to talk about benchmarks. What does it-
- Dario Amodei
Yeah.
- Lex Fridman
... mean to get better? The number went up, but, you know, I program, but I also love programming, and Claude 3.5 through Cursor is what I use to assist me in programming. And at least experientially, anecdotally, it's gotten smarter at programming. So what does it take to get it smarter?
- Dario Amodei
We observed that as well, by the way. There were a couple of very strong engineers here at Anthropic for whom all previous code models, both produced by us and produced by all the other companies, hadn't really been useful. You know, they said, "Maybe this is useful to a beginner; it's not useful to me." But Sonnet 3.5, the original one, for the first time, they said, "Oh my God, this helped me with something that would have taken me hours to do. This is the first model that's actually saved me time." So again, the waterline is rising. And then I think the new Sonnet has been even better. In terms of what it takes, I mean, I'll just say it's been across the board. It's in the pre-training, it's in the post-training, it's in various evaluations that we do. We've observed this as well, and if we go into the details of the benchmarks: so SWE-bench is basically, since you're a programmer, you'll be familiar with pull requests. Pull requests are like a sort of atomic unit of work; you could say, I'm implementing one thing. And so SWE-bench actually gives you a real-world situation where the code base is in a current state, and I'm trying to implement something that's described in language. We have internal benchmarks where we measure the same thing, and you say, just give the model free rein to do anything, run anything, edit anything. How well is it able to complete these tasks? And it's that benchmark that's gone from "it can do it 3% of the time" to "it can do it about 50% of the time." So, you can game benchmarks, but I actually do believe that if we get to 100% on that benchmark in a way that isn't over-trained or gamed for that particular benchmark, it probably represents a real and serious increase in programming ability. And I would suspect that if we can get to 90 or 95%, it will represent the ability to autonomously do a significant fraction of software engineering tasks.
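A minimal sketch of how a SWE-bench-style evaluation scores a model, based on the public description of the benchmark (real repository state, a natural-language issue, held-out tests); the function names and the `run_model` agent call are illustrative stand-ins, not the actual harness:

```python
import subprocess

def evaluate_task(repo_dir: str, issue_text: str, test_cmd: str,
                  run_model) -> bool:
    """One SWE-bench-style task: let the model edit a real repo to
    resolve an issue, then check the held-out tests."""
    # The model gets the repository plus the issue description and
    # produces edits in place (run_model is a hypothetical agent call
    # with 'free rein to do anything, run anything, edit anything').
    run_model(repo_dir=repo_dir, instructions=issue_text)
    # A task counts as resolved iff the repo's failing tests now pass.
    result = subprocess.run(test_cmd, shell=True, cwd=repo_dir)
    return result.returncode == 0

def score(tasks, run_model) -> float:
    resolved = sum(evaluate_task(t["repo"], t["issue"], t["tests"], run_model)
                   for t in tasks)
    return resolved / len(tasks)   # the 3% -> 50% figure quoted above
```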
- Lex Fridman
Okay. Well, ridiculous timeline question. When is Claude Opus 3.5 coming out?
- Dario Amodei
Not giving you an exact date, but, you know, as far as we know, the plan is still to have a Claude 3.5 Opus.
- Lex Fridman
Are we gonna get it before GTA VI or no?
- Dario Amodei
Like Duke Nukem Forever-
- Lex Fridman
Duke Nukem?
- Dario Amodei
... or what was that game? There was some game that was delayed 15 years.
- Lex Fridman
That's right.
- Dario Amodei
Was that Duke Nukem Forever?
- Lex Fridman
Yeah. And I think GTA is now just releasing trailers.
- Dario Amodei
You know, it's only been three months since we released the first Sonnet. (laughs)
- Lex Fridman
Yeah, it's incredible, the pace of releases.
- Dario Amodei
It just tells you about the pace.
- Lex Fridman
Yeah.
- Dario Amodei
The expectations for when things are gonna come out.
- 37:50 – 42:02
Claude 4.0
- Lex Fridman
So what about 4.0? How do you think about versioning as these models get bigger and bigger, and also versioning in general? Why Sonnet 3.5 updated with the date? Why not (laughs) Sonnet 3.6-
- Dario Amodei
Yeah. It's actually-
- Lex Fridman
... which is what a lot of people are calling it...
- Dario Amodei
Naming is actually an interesting challenge here, right?
- Lex Fridman
Yeah.
- Dario Amodei
Because I think a year ago, most of the model was pre-training. And so you could start from the beginning and just say, "Okay, we're gonna have models of different sizes. We're gonna train them all together, and, you know, we'll have a family of naming schemes, and then we'll put some new magic into them, and then we'll have the next generation." The trouble starts already when some of them take a lot longer than others to train, right? That already messes up your timing a little bit. But as you make big improvements in pre-training, then you suddenly notice, oh, I can make a better pre-trained model, and that doesn't take very long to do, but clearly it has the same size and shape as previous models. So I think those two things together, as well as the timing issues... Any kind of scheme you come up with, the reality tends to frustrate that scheme, right? It tends to break out of the scheme. It's not like software, where you can say, "Oh, this is, like, 3.7, this is 3.8." No, you have models with different trade-offs. You can change some things in your models, you can change other things. Some are faster and slower at inference. Some have to be more expensive, some have to be less expensive. And so I think all the companies have struggled with this.
- Lex Fridman
Yeah.
- Dario Amodei
I think we were in a good position (laughs) in terms of naming when we had Haiku, Sonnet, and Opus.
- Lex Fridman
It was great. Great start.
- Dario Amodei
And we're trying to maintain it, but it's not perfect. So we'll try and get back to the simplicity. But it's just the nature of the field; I feel like no one's figured out naming. It's somehow a different paradigm from, like, normal software. And so none of the companies have been perfect at it. It's something we struggle with surprisingly much relative to (laughs) how trivial it seems next to the grand science of training the models.
- Lex Fridman
So from the user side, the user experience of the updated Sonnet 3.5 is just different than the previous June 2024 Sonnet 3.5. It would be nice to come up with some kind of labeling that embodies that, because people talk about Sonnet 3.5, but now there's a different one, and so how do you refer to the previous one and the new one when there's a distinct improvement? It just makes conversation about it challenging.
- Dario Amodei
Yeah, yeah. I definitely think this question of... There are lots of properties of the models that are not reflected in the benchmarks. I think that's definitely the case, and everyone agrees, and not all of them are capabilities. Some of them are, you know... Models can be polite or brusque. They can be very reactive, or they can ask you questions. They can have what feels like a warm personality or a cold personality. They can be boring, or they can be very distinctive, like Golden Gate Claude was. And we have a whole team kind of focused on, I think we call it, Claude's character. Amanda leads that team, and she'll talk to you about that. But it's still a very inexact science, and often we find that models have properties that we're not aware of. The fact of the matter is that you can talk to a model 10,000 times, and there are some behaviors you might not see. Just like with a human, right? I can know someone for a few months and not know that they have a certain skill, or not know that there's a certain side to them. And so I think we just have to get used to this idea, and we're always looking for better ways of testing our models to demonstrate these capabilities, and also to decide which are the personality properties we want models to have and which we don't want to have. That itself, the normative question, is also super interesting.
- 42:02 – 54:49
Criticism of Claude
- Lex Fridman
I got to ask you a question from Reddit.
- Dario Amodei
From Reddit?
- Lex Fridman
(laughs)
- Dario Amodei
Oh boy. (laughs)
- Lex Fridman
You know, there's just this fascinating, to me at least, psychological, social phenomenon where people report that Claude has gotten dumber for them over time. And so the question is, does the user complaint about the dumbing down of Claude 3.5 Sonnet hold any water? So are these anecdotal reports a kind of social phenomenon, or are there any cases where Claude would actually get dumber?
- Dario Amodei
So this actually doesn't apply... This isn't just about Claude. I believe I've seen these complaints for every foundation model produced by a major company. People said this about GPT-4. They said it about GPT-4 Turbo. So, a couple of things. One, the actual weights of the model, right, the actual brain of the model, that does not change unless we introduce a new model. There are just a number of reasons why it would not make sense practically to be randomly substituting in new versions of the model. It's difficult from an inference perspective, and it's actually hard to control all the consequences of changing the weights of the model. Let's say you wanted to fine-tune the model to, I don't know, say "certainly" less, which an old version of Sonnet used to do. You actually end up changing a hundred things as well. So we have a whole process for it, and we have a whole process for modifying the model: we do a bunch of testing on it, we do a bunch of user testing with early customers. So we both have never changed the weights of the model without telling anyone, and certainly, in the current setup, it would not make sense to do that. Now, there are a couple of things that we do occasionally do. One is sometimes we run A/B tests, but those are typically very close to when a model is being released, and for a very small fraction of time. So, you know, the day before the new Sonnet 3.5 (I agree, we should have (laughs) had a better name; it's clunky to refer to it), there were some comments from people like, "It's gotten a lot better," and that's because a fraction were exposed to an A/B test for those one or two days. The other is that occasionally the system prompt will change, and the system prompt can have some effects, although it's unlikely to dumb down models, it's unlikely to make them dumber. And we've seen that while these two things, which I'm listing to be very complete, happen quite infrequently, the complaints, for us and for other model companies, about "the model changed," "the model isn't good at this," "the model got more censored," "the model was dumbed down," those complaints are constant. And so I don't want to say people are imagining it or anything, but, like, the models are, for the most part, not changing. If I were to offer a theory, I think it actually relates to one of the things I said before, which is that models are very complex and have many aspects to them. And so often, if I ask a model a question, like, "Do task X" versus "Can you do task X?", the model might respond in different ways. And so there are all kinds of subtle things that you can change about the way you interact with the model that can give you very different results. To be clear, this itself is, like, a failing by us and by the other model providers: the models are just often sensitive to small changes in wording. It's yet another way in which the science of how these models work is very poorly developed.
And so, you know, if I go to sleep one night and I was talking to the model in a certain way, and then I slightly change the phrasing of how I talk to the model, I could get different results. So that's one possible explanation. The other thing is, man, it's just hard to quantify this stuff. It's hard to quantify this stuff. I think people are very excited by new models when they come out, and then as time goes on, they become very aware of the limitations, so that may be another effect. But that's all a very long-winded way of saying: for the most part, with some fairly narrow exceptions, the models are not changing.
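The "Do task X" versus "Can you do task X?" sensitivity is straightforward to measure in principle. Here is a toy harness for doing so; the `ask` and `solved` functions are hypothetical stand-ins (a real model API call and a task-specific checker), and the paraphrase list is illustrative:

```python
# Toy prompt-sensitivity probe. `ask` stands in for a model API call;
# `solved` is a task-specific success check you would supply.
def ask(prompt: str) -> str:
    raise NotImplementedError("stand-in for a real model call")

def solved(response: str) -> bool:
    raise NotImplementedError("task-specific success check")

PARAPHRASES = [
    "Do task X.",
    "Can you do task X?",
    "Please complete task X.",
    "I need task X done, please go ahead.",
]

def sensitivity(n_trials: int = 20) -> dict[str, float]:
    """Success rate per phrasing. A robust model would show nearly
    identical rates across paraphrases; a sensitive one, large spreads,
    which is what makes 'the model got dumber' reports hard to assess."""
    rates = {}
    for prompt in PARAPHRASES:
        wins = sum(solved(ask(prompt)) for _ in range(n_trials))
        rates[prompt] = wins / n_trials
    return rates
```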
- Lex Fridman
I think there is a psychological effect. You just start getting used to it; the baseline rises. Like when people first got wifi on airplanes, it's, like, amazing.
- Dario Amodei
It's, like, amazing.
- Lex Fridman
Magic.
- Dario Amodei
Yeah.
- Lex Fridman
And then you start... (laughs)
- Dario Amodei
And now I'm like, "I can't get this thing to work!"
- Lex Fridman
(laughs) Yeah.
- Dario Amodei
(laughs) "This is such a piece of crap!" (laughs)
- Lex Fridman
Exactly. So then it's easy to have the conspiracy theory that they're making the wifi slower and slower. This is probably something I'll talk to Amanda much more about, but another Reddit question: "When will Claude stop trying to be my puritanical grandmother, imposing its moral worldview on me as a paying customer?" And also, "What is the psychology behind making Claude overly apologetic?" So these kinds of reports about the experience-
- Dario Amodei
Yeah.
- Lex Fridman
... a different angle on the frustration. It has to do with the character.
- Dario Amodei
Yeah, so a couple of points on this. First, things that people say on Reddit and Twitter, or X, or whatever it is: there's actually a huge distribution shift between the stuff that people complain loudly about on social media and what actually, kind of, statistically, users care about and what drives people to use the models. People are frustrated with things like the model not writing out all the code, or the model just not being as good at code as it could be, even though it's the best model in the world on code. I think the majority of things are about that, but certainly a kind of vocal minority raise these concerns, right? They're frustrated by the model refusing things that it shouldn't refuse, or apologizing too much, or just having these kind of annoying verbal tics. The second caveat, and I just want to say this super clearly, because I think some people don't know it, and others kind of know it but forget it: it is very difficult to control across the board how the models behave. You cannot just reach in there and say, "Oh, I want the model to apologize less." You can do that. You can include training data that says, "Oh, the model should apologize less." But then in some other situation, they end up being, like, super rude, or, like, overconfident in a way that's misleading people. So there are all these trade-offs. For example, another thing is, there was a period during which models, ours and I think others as well, were too verbose, right? They would repeat themselves; they would say too much. You can cut down on the verbosity by penalizing the models for just talking for too long. What happens when you do that, if you do it in a crude way, is that when the models are coding, sometimes they'll say, "Rest of the code goes here," right? Because they've learned that that's a way to economize, and that they see it, and then... So that leads the model to be so-called lazy in coding, where they-
- Lex Fridman
Yeah.
- Dario Amodei
... where they're just like, "Ah, you can finish the rest of it."
- Lex Fridman
Yeah.
- Dario Amodei
It's not because we wanna save on compute, or because the models are lazy during winter break, or any of the other conspiracy theories that have come up. It's actually just very hard to control the behavior of the model, to steer the behavior of the model in all circumstances at once. There's this whack-a-mole aspect, where you push on one thing and these other things start to move as well, that you may not even notice or measure. And so one of the reasons that I care so much about, you know, kind of grand alignment of these AI systems in the future is actually that the systems are quite unpredictable. They're actually quite hard to steer and control. And this version we're seeing today, of you make one thing better, it makes another thing worse, I think that's a present-day analog of future control problems in AI systems that we can start to study today, right? I think that difficulty in steering the behavior, and making sure that if we push an AI system in one direction, it doesn't push it in another direction in some other ways that we didn't want, I think that's kind of an early sign of things to come. And if we can do a good job of solving this problem, right? Like, you ask the model to make and distribute smallpox, and it says no, but it's willing to help you in your graduate-level virology class. How do we get both of those things at once? It's hard. It's very easy to go to one side or the other, and it's a multidimensional problem. And so, you know, I think these questions of shaping the model's personality, I think they're very hard. I think we haven't done perfectly on them. I think we've actually done the best of all the AI companies, but still so far from perfect. And I think if we can get this right, if we can control the false positives and false negatives in this very kind of controlled, present-day environment, we'll be much better at doing it for the future, when our worry is: will the models be super autonomous? Will they be able to make very dangerous things? Will they be able to autonomously build whole companies, and are those companies aligned? So I think of this present task as both vexing but also good practice for the future.
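A toy illustration of the verbosity trade-off just described (my own sketch, not Anthropic's training code; the reward values and the penalty coefficient are made up): under a crude per-token length penalty, the truncated "rest of the code goes here" response becomes the reward-maximizing move.

```python
# Toy reward shaping: task_reward measures whether the response solved
# the task; a crude per-token length penalty is subtracted. All numbers
# are illustrative.
LAMBDA = 0.02  # penalty per token (made up)

def shaped_reward(task_reward: float, n_tokens: int) -> float:
    return task_reward - LAMBDA * n_tokens

# A complete 400-token solution vs. a lazy 60-token stub that writes
# "rest of the code goes here" and earns partial task credit:
complete = shaped_reward(task_reward=1.0, n_tokens=400)  # 1.0 - 8.0 = -7.0
lazy     = shaped_reward(task_reward=0.5, n_tokens=60)   # 0.5 - 1.2 = -0.7
print(complete, lazy)  # the lazy stub wins under this crude penalty
```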
- LFLex Fridman
What's the current best way of gathering sort of user feedback? Like, uh, not anecdotal data, but just large scale data about pain points or the opposite of pain points, positive things, so on? Is it internal testing? Is it-
- DADario Amodei
Yeah.
- LFLex Fridman
... a specific group testing, AB testing? What, what, what works?
- DADario Amodei
Uh, so, so, so typically, um, we'll have internal model bashings where all of Anthropic, Anthropic is almost 1,000 people, um, you know, people just, just try and break the model. They try and interact with it various ways. Um, uh, we have a suite of evals-
- LFLex Fridman
Mm-hmm.
- DADario Amodei
... uh, for, you know, oh, is the model refusing in ways that it shouldn't? I think we even had a certainly eval because, you know, a- again, an earlier model had this problem where, like, it had this annoying tic where it would, like, respond to a wide range of questions by saying, "Certainly, I can help you with that. Certainly, I would be happy to do that. Certainly, this is correct." Um, uh, and so we had a, like, certainly eval, which is like how, how often-
- LFLex Fridman
(laughs)
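A rough sketch of what such an eval could look like, assuming it simply measures how often responses open with the tic; the real eval is internal to Anthropic, and `get_response` below is a hypothetical stand-in for a model call:

```python
# Illustrative sketch of a "certainly" eval -- not Anthropic's internal eval.
# `get_response` is a hypothetical stand-in for a real model call.

def certainly_rate(prompts: list[str], get_response) -> float:
    """Fraction of responses that open with the verbal tic 'Certainly'."""
    hits = sum(
        1 for p in prompts
        if get_response(p).lstrip().lower().startswith("certainly")
    )
    return hits / len(prompts)

# Tracked across a diverse prompt set and across model versions,
# a spike in this rate flags the regression described above.
```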
- 54:49 – 1:05:37
AI Safety Levels
- LFLex Fridman
Okay. Can you explain the responsible scaling policy and the AI safety level standards-
- DADario Amodei
Yeah.
- LFLex Fridman
... ASL levels?
- DADario Amodei
As much as I'm excited about the benefits of these models, and, you know, we'll t- talk about that if we talk about Machines of Loving Grace, um, I'm, I'm worried about the risks and I continue to be worried about the risks. Uh, no one should think that, you know, Machines of Loving Grace was me, me saying, uh, you know, "I'm no longer worried about the risks of these models." I think they're two sides of the same coin. The, the, uh, power of the models and their ability to solve all these problems in, you know, biology, neuroscience, economic development, gover- governance and peace, large parts of the economy, those, those come with risks as well, right? With great power comes great responsibility, right? The two are, the two are paired. Uh, things that are powerful can do good things and they can do bad things. Um, I think of those risks as being in, you know, several different categories. Perhaps the two biggest risks that I think about, and that's not to say that there aren't risks today that are important, but when I think of the things that would happen on the grandest scale, um, one is what I call catastrophic misuse. These are misuse of the models in domains like cyber, bio, radiological, nuclear, right? Things that could, you know, that could harm or even kill thousands, even millions of people if they really, really go wrong. Um, like, these are the, you know, number one priority to prevent. And, and here, I would just make a simple observation, which is that the models, you know, if I look today at people who have done really bad things in the world, um, uh, I think actually humanity has been protected by the fact that the overlap between really smart, well-educated people and people who wanna do really horrific things has generally been small. Like, you know, let's say I'm someone who, you know, I have a PhD in this field, I have a well-paying job, um, there's so much to lose. Why do I wanna, like... You know, even, even assuming I'm completely evil, which, which most people are not, um, why, why would such a person risk their, you know, risk their life, risk their, their legacy, their reputation to do something, you know, truly, truly evil? If we had a lot more people like that, the world would be a much more dangerous place. And so my, my worry is that by being a, a much more intelligent agent, AI could break that correlation. And so I, I do have serious worries about that. I believe we can prevent those worries, uh, but, you know, I think as a counterpoint to Machines of Loving Grace, I wanna say that there's still serious risks. And, and the second range of risks would be the autonomy risks, which is the idea that models might on their own, particularly as we give them more agency than they've had in the past, uh, particularly as we give them supervision over wider tasks, like, you know, writing whole code bases or some day even, you know, effectively operating entire, entire companies, they're on a long enough leash. Are they, are they doing what we really want them to do? It's very difficult to even understand in detail what they're doing, let alone, let alone control it.
And like I said, this- these early signs that it's, it's hard to perfectly draw the boundary between things the model should do and things the model shouldn't do, that, that, you know, if, if you go to one side, y- you get things that are annoying and useless and you go to the other side, you get other behaviors. If you fix one thing, it creates other problems. We're getting better and better at solving this. I don't think this is an unsolvable problem. I think this is a, you know, this is a science, like, like the safety of airplanes or the safety of cars or the safety of drugs. I, you know, I, I don't think there's any big thing we're missing, I just think we need to get better at controlling these models. And so these are, these are the two risks I'm worried about and our responsible scaling plan, which I'll, uh, recognize is a very long-winded answer to your question-
- LFLex Fridman
(laughs) I love it. I love it.
- DADario Amodei
... uh, our responsible scaling plan is designed to address these two types of risks. And so, every time we develop a new model, we basically test it for its ability to do both of these bad things. So if I were to back up a little bit, um, I think we have an interesting dilemma with AI systems where they're not yet powerful enough to present these catastrophes. I don't know that, I don't know if they'll ever present these catastrophes. It's possible they won't, but the case for worry, the case for risk is strong enough that we should, we should act now, and, and they're, they're getting better very, very fast, right? I, you know, I testified in the Senate that, you know, we might have serious bio risks within two to three years. That was about a year ago. Things have proceeded, proceeded apace. Uh, so we have this thing where it's, like, it's surprisingly hard to address these risks because they're not here today, they don't exist, they're like ghosts, but they're coming at us so fast because the models are improving so fast. So how do you deal with something that's not here today, doesn't exist, but is coming at us very fast? Uh, so the solution we came up with for that, in collaboration with, uh, you know, people like, uh, the organization METR and Paul Christiano, is, okay, what you need for that are tests to tell you when the risk is getting close. You need an early warning system. And, and so every time we have a new model, we test it for its capability to do these CBRN tasks as well as testing it for, you know, how capable it is of doing tasks autonomously on its own. And, uh, in the latest version of our RSP, which we released in the last month or two, uh, the way we test autonomy risks is the AI model's ability to do aspects of AI research itself, uh, because when the AI models can do AI research, they become kind of truly, truly autonomous. Uh, and that, you know, that threshold is important in a bunch of other ways. And, and so what do we then do with these tasks? The RSP basically develops what we've called an if-then structure, which is if the models pass a certain capability, then we impose a certain set of safety and security requirements on them. So today's models are what's called ASL-2. ASL-1 is for systems that manifestly don't pose any risk of autonomy or misuse. So for example, a chess-playing bot, Deep Blue, would be ASL-1. It's just manifestly the case that you can't use Deep Blue for anything other than chess. It was just designed for chess. No one's gonna use it to, like, you know, conduct a masterful cyberattack or to, you know, run wild and take over the world. ASL-2 is today's AI systems, where we've measured them and we think these systems are simply not smart enough to, uh, to, you know, autonomously self-replicate or conduct a bunch of tasks, uh, and also not smart enough to provide meaningful information about CBRN risks and how to build CBRN weapons above and beyond what can be known from looking at Google. In fact, sometimes they do provide information, but not above and beyond a search engine, not in a way that can be stitched together, not in a way that kind of end-to-end is dangerous enough. So ASL-3 is gonna be the point at which, uh, the models are helpful enough to enhance the capabilities of non-state actors, right?
State actors, unfortunately, can already do a lot of these very dangerous and destructive things to a high level of proficiency. The difference is that non-state, non-state actors are not capable of it. And so when we get to ASL-3, we'll take special security precautions designed to be sufficient to prevent theft of the model by non-state actors and misuse of the model as it's deployed. Uh, we'll have to have enhanced filters targeted at these particular areas.
- LFLex Fridman
Cyber, bio, nuclear.
- DADario Amodei
Cyber, bio, nuclear, and model autonomy, which is less a misuse risk and more a risk of the model doing bad things itself. ASL-4 is getting to the point where these models could, could enhance the capability of an already knowledgeable state actor and/or become the, you know, the main source of such a risk. Like, if you wanted to engage in such a risk, the main way you would do it is through a model. And then I think ASL-4 on the autonomy side, it's some, some amount of acceleration in AI research capabilities with an AI model. And then ASL-5 is where we would get to the models that are, you know, kind of truly capable, that could exceed humanity in their ability to do any of these tasks. And so the, the point of the if-then structure commitment is basically to say, "Look, I don't know. I've been working with these models for many years and I've been worried about risk for many years." It's actually kinda dangerous to cry wolf. It's actually kinda dangerous to say, "This, you know, this model is risky," and, you know, people look at it and they say, "This is manifestly not dangerous." Again, it's the delicacy of: the risk isn't here today, but it's coming at us fast. How do you deal with that? It's really vexing to a risk planner to deal with it. And so this if-then structure basically says, "Look, we don't want to antagonize a bunch of people. We don't wanna harm our own, you know, our kind of own ability to have a place in the conversation by imposing these very onerous burdens on models that are not dangerous today." So the if-then, the trigger commitment, is basically a way to deal with this. It says you clamp down hard when you can show the model is dangerous. And of course, what has to come with that is, you know, enough of a buffer threshold that, you know, you're not at high risk of kind of missing the danger. It's not a perfect framework. We've had to change it; we came out with a new one just a few weeks ago and, probably going forward, we might release new ones multiple times a year, because it's hard to get these policies right, like, technically, organizationally, from a research perspective. But that is the proposal: if-then commitments and triggers in order to minimize burdens and false alarms now, but really react appropriately when the dangers are here.
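A schematic of the if-then structure described here, with the triggers and measures loosely paraphrased from the conversation; the code is purely illustrative and is not the RSP's actual policy machinery:

```python
# Schematic sketch of an if-then capability ladder. All triggers and
# measures are paraphrased for illustration, not Anthropic's policy text.
from dataclasses import dataclass

@dataclass
class AslTier:
    level: int
    trigger: str               # the "if": a capability threshold
    measures: tuple[str, ...]  # the "then": safeguards imposed

LADDER = (
    AslTier(2, "no meaningful CBRN uplift beyond a search engine",
            ("baseline security", "standard deployment safeguards")),
    AslTier(3, "meaningful uplift to non-state actors",
            ("security sufficient against non-state theft",
             "enhanced filters on cyber/bio/nuclear/autonomy")),
    AslTier(4, "uplift to state actors or acceleration of AI research",
            ("verification via interpretability", "stronger controls")),
)

def measures_for(highest_fired_level: int) -> tuple[str, ...]:
    """Clamp down with the safeguards of the highest tier whose trigger
    fired -- in practice with a buffer, so the danger isn't missed."""
    for tier in reversed(LADDER):
        if highest_fired_level >= tier.level:
            return tier.measures
    return ()
```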
- 1:05:37 – 1:09:40
ASL-3 and ASL-4
- LFLex Fridman
What do you think the timeline for ASL-3 is, where several of the triggers are fired, and what do you think the timeline is for ASL-4?
- DADario Amodei
Yeah. So that is hotly debated within the company. Um, uh, we are working actively to prepare ASL-3, uh, security, uh, security measures as well as ASL-3 deployment measures. Um, I'm not gonna go into detail, but we've made, we've made a lot of progress on both and, you know, we're, we're prepared to be, I think, ready quite soon. Uh, I would, I would not be surpri- I would not be surprised at all if we hit ASL-3, uh, next year. There was some concern that we, we might even hit it, uh, uh, uh, this year. That's still, that's still possible. That could still happen. It's, like, very hard to say, but, like, I would be very, very surprised if it was, like, 2030. Uh, I think it's much sooner than that.
- LFLex Fridman
So there's, uh, protocols for detecting it, the if-then, and then there's protocols for how to respond to it.
- DADario Amodei
Yes.
- LFLex Fridman
How difficult is the second, the latter? The, uh...
- DADario Amodei
Yeah. I think for ASL-3, it's primarily about security, um, and, and about, you know, filters on the model relating to a very narrow set of areas when we deploy the model, because at ASL-3, the model isn't autonomous yet. Um, and, and so you don't have to worry about, you know, kind of the model itself... behaving in a bad way even when it's deployed internally. So, I think the ASL-3 measures are, I won't say straightforward, they're, they're rigorous, but they're easier to reason about. I think once we get to ASL-4, um, we start to have worries about the models being smart enough that they might sandbag tests, they might not tell the truth about tests. Um, we had some results come out about, like, sleeper agents, and there was a more recent paper about, you know, can the models, uh, mislead attempts to evaluate them, right? Sandbag their own abilities, you know, present themselves as being less capable than they are. And so, I think with ASL-4, there's gonna be an important component of using other things than just interacting with the models. For example, interpretability or hidden chains of thought, uh, where you have to look inside the model and verify via some other mechanism, one that is not, you know, as easily corrupted as what the model says, uh, that the model indeed has some property. Uh, so we're still working on ASL-4. One of the properties of the RSP is that we, we don't specify ASL-4 until we've hit ASL-3.
- LFLex Fridman
Yeah.
- DADario Amodei
B- and, and I think that's proven to be a wise decision because even with ASL-3, it, again, it's hard to know this stuff in detail, and, and it, it, we wanna take as much time a- as we can possibly take to get these things right.
- LFLex Fridman
So for ASL-3, the bad actor would be the humans using it.
- DADario Amodei
Humans, yes.
- LFLex Fridman
And so there, it's a little bit more, um...
- DADario Amodei
For ASL-4, it's both, I think.
- LFLex Fridman
It's both.
- DADario Amodei
Both.
- LFLex Fridman
And so deception, and that's where mechanistic interpretability comes into play, and, uh, hopefully the techniques used for that are not made accessible to the model.
- DADario Amodei
Yeah, I mean, uh, of course you can hook up the mechanistic interpretability to the model itself, um, but then you, then you, then you, then you've kind of lost it as a reliable indicator of, uh, of, uh, of, of, of the model state. There are a bunch of exotic ways you can think of that it might also not be reliable, like if the, you know, model gets smart enough that it can, like, you know, jump computers and, like, read the code where you're, like, looking at its internal state. We've thought about some of those. I think they're exotic enough. There are ways to render them unlikely, but yeah. Generally, you wanna, you wanna preserve mechanistic interpretability as a kind of verification set or test set that's separate from the training process of the model.
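As a toy illustration of keeping interpretability as a held-out check, one could imagine a simple linear probe trained on internal activations and read only at evaluation time. Everything below (the data, the labels, the probe itself) is invented, and real interpretability methods go far beyond a linear probe:

```python
# Toy sketch of interpretability as a held-out "verification set".
# Purely illustrative; not Anthropic's tooling.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 64))     # stand-in for internal activations
labels = (acts[:, 0] > 0).astype(int)  # stand-in for "deceptive vs. honest" labels

probe = LogisticRegression().fit(acts[:800], labels[:800])
print("held-out probe accuracy:", probe.score(acts[800:], labels[800:]))

# The key discipline: the probe is only *read* at evaluation time. Its
# signal never enters the training objective, so the model is never
# optimized to fool it -- which is what "hooking it up to the model
# itself" would destroy.
```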
- LFLex Fridman
See, I think, uh, as these models become better and better at conversation and become smarter, social engineering becomes a threat too, 'cause they (laughs) -
- DADario Amodei
Oh, yeah.
- LFLex Fridman
... they can start being very convincing to the engineers inside companies.
- DADario Amodei
Oh, yeah. Yeah.
- LFLex Fridman
(laughs)
- DADario Amodei
It's actually, like, you know, we've, we've seen lots of examples of demagoguery in our life from humans, and, and you know there's a concern that models could do that, could do that as well.
- 1:09:40 – 1:19:35
Computer use
- LFLex Fridman
One of the ways that Claude has been getting more and more powerful is it's now able to do some agentic stuff. Um, computer use, uh, there's also an analysis tool within the sandbox of Claude.AI itself, but let's talk about computer use. That seems to me super exciting, that you can just give Claude a task and it, uh, it takes a bunch of actions, figures it out, and has access to your computer through screenshots. So, can you explain how that works, uh, and where that's headed?
- DADario Amodei
Yeah. It's actually relatively simple. So Claude has, has had for a long time, since, since Claude 3 back in March, the ability to analyze images and respond to them with text. The, the only new thing we added is those images can be screenshots of a computer, and in response, we train the model to give a location on the screen where you can click and/or buttons on the keyboard you can press in order to take action. And i- it turns out that with actually not all that much additional training, the models can get quite good at that task. It's a good example of generalization. Um, you know, people sometimes say if you get to low Earth orbit, you're like halfway to anywhere, right? Because of how much it takes to escape the gravity well. If you have a strong pretrained model, I feel like you're halfway to anywhere, uh, in, in ter- in terms of, in terms of the intelligence space. Uh, uh, uh, and, and, and so actually, it didn't, it didn't take all that much to get, to get Claude to do this, and you can just set that in a loop. Give the model a screenshot, tell it what to click on, give it the next screenshot, tell it what to click on, and, and that turns into a full kind of almost, almost 3D video interaction of the model, and it's able to do all of these tasks, right? You know, we, we showed these demos where it's able to, like, fill out spreadsheets, it's able to kind of, like, interact with a website, it's able to, you know, um, it's, you know, it's able to open all kinds of, you know, programs, different operating systems, Windows, Linux, Mac. Uh, uh, so, uh, you know, I think all of that is very exciting. I, I will say, while in theory there's nothing you could do there that you couldn't have done through just giving the model the API to drive the computer screen, uh, this really lowers the barrier, and you know, there's, there's, there's a lot of folks who, who, who either, you know, kind of, kind of aren't, aren't, aren't, you know, aren't in a position to, to interact with those APIs or it takes them a long time to do. It's just the screen is just a universal interface that's a lot easier to interact with, and so I expect over time this is gonna lower a bunch of barriers. Now honestly, the current model has... there's, there, it leaves a lot still to be desired, and we were, we were honest about that in the blog, right? It makes mistakes, it misclicks, and we, we, you know, we were careful to warn people, "Hey this thing isn't... you can't just leave this thing to, you know, run on your computer for minutes and minutes." Um, "You gotta give this thing boundaries and guardrails." And I think that's one of the reasons we released it first in an API form rather than kind of, you know, this, this kind of just, just hand, just hand it to the consumer and, and give it control of their, of their, of their, of their computer. Um, but, but, you know, I definitely feel that it's important to get these capabilities out there. As models get more powerful, we're gonna have to grapple with, you know, how do we use these capabilities safely? How do we prevent them from being abused? Uh, and, and, you know, I think, I think releasing, releasing the model while, while, while the capabilities are, are, you know, are, are still, are still limited is, is, is very helpful in terms of, in terms of doing that. 
Um, you know, I think since it's been released, a number of customers, I think, uh, Replit was maybe one of the quickest to deploy things, um, have, you know, have made use of it in various ways. People have hooked up demos for, you know, Windows desktops, Macs, uh, you know, Linux, Linux machines. Uh, so yeah. It's been, it's been very exciting. I think, as with anything else, you know, it comes with new exciting abilities, and then, you know, then with those new exciting abilities, we have to think about how to, you know, make the model, you know, safe, reliable, do what humans want them to do. I mean, it's the same story for everything, right? Same thing. It's that same tension.
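The loop described above is simple enough to sketch in outline. This skeleton is illustrative only: `take_screenshot`, `ask_model_for_action`, `is_allowed`, and `execute_action` are hypothetical stand-ins rather than Anthropic's actual computer-use API, and the bounds mirror the guardrails mentioned in the conversation:

```python
# Illustrative skeleton of the screenshot -> action loop described above.
# Every helper here is a hypothetical stand-in, not a real API.

def run_computer_use(task: str, max_steps: int = 20) -> None:
    """Loop: screenshot in, click/keypress out, until done or out of budget."""
    for _ in range(max_steps):                           # guardrail: bounded steps
        screenshot = take_screenshot()                   # hypothetical: capture the screen
        action = ask_model_for_action(task, screenshot)  # hypothetical model call
        if action.kind == "done":                        # model signals completion
            return
        if not is_allowed(action):                       # guardrail: restrict what it may do
            raise PermissionError(f"blocked action: {action!r}")
        execute_action(action)                           # hypothetical: click / type keys
    raise TimeoutError("step budget exhausted before the task finished")
```

The bounded step count and the action allow-list stand in for the "boundaries and guardrails" the blog post warned users to impose rather than letting the model run unattended.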
- LFLex Fridman
But- but the possibility of use cases here is just... the- the range is incredible. So, uh, how much... to make it work really well in the future, how much do you have to specially kind of, uh, go beyond what the pretrained model is doing? Do more post-training? RLHF or supervised fine-tuning or synthetic data just for the
- DADario Amodei
Yeah.
- LFLex Fridman
... agentic stuff?
- DADario Amodei
I think speaking at a high level, it's our intention to keep investing a lot in, you know, making the model better. Uh, like, I think, uh, you know, we look at some of the benchmarks where previous models were like, "Oh, could do it 6% of the time," and now our model can do it 14% or 22% of the time, and yeah, we want to get up to, you know, the human level of reliability of 80%, 90%, just like anywhere else, right? We're on the same curve that we were on with SWE-bench, where I think, I would guess, a year from now the models can do this very, very reliably. But you gotta start somewhere.
- LFLex Fridman
So you think it's possible to get to th- the human level, 90%, uh, basically doing the same thing you're doing now? Or is it... has to be special for computer use?
- DADario Amodei
I- I- I mean, uh, depends what you mean by, by, you know, special and... special in general. Um, but, but, I... you know, I generally think, you know, the same kinds of techniques that we've been using to train the current model, I- I expect that doubling down on those techniques in the same way that we have for code, for code, for models in general, for other ki- for, you know, for image input, um, uh, you know, for voice, uh, I expect those same techniques will scale here as they have everywhere else.
- LFLex Fridman
But this is giving sort of th- the power of action to Claude, and so you could do a lot of really powerful things but you could do a lot of damage also.
- DADario Amodei
Yeah, yeah. No, uh, and we've been very aware of that. Look, my- my view actually is... computer use isn't a fundamentally new capability like the CBRN or autonomy capabilities are. Um, it's more like... it kind of opens the aperture for the model to use and apply its existing abilities. Uh, and- and so the way we think about it, going back to our RSP, is... nothing that this model is doing inherently increases, you know, the risk from an RSP- R- RSP perspective. But, as the models get more powerful, having this capability may make it scarier once it... you know, once it has the cognitive capability to, um... you know, to do something at the ASL-3 and ASL-4 level. This- this, you know, th- th- this may be the thing that kind of unbounds it from doing so, so going forward, certainly this modality of interaction is something we have tested for and that we will continue to test for in RSP going forward. Um, I think it's probably better to have... to learn and explore this capability before the model is super, uh... you know, super capable.