Lex Fridman Podcast
DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters | Lex Fridman Podcast #459
EVERY SPOKEN WORD
150 min read · 30,182 words
- 0:00 – 3:33
Introduction
- LFLex Fridman
The following is a conversation with Dylan Patel and Nathan Lambert. Dylan runs SemiAnalysis, a well-respected research and analysis company that specializes in semiconductors, GPUs, CPUs, and AI hardware in general. Nathan is a research scientist at the Allen Institute for AI, and is the author of the amazing blog on AI called Interconnects. They are both highly respected, read, and listened to by the experts, researchers, and engineers in the field of AI. And personally, I'm just a fan of the two of them. So, I used the DeepSeek moment that shook the AI world a bit as an opportunity to sit down with them and lay it all out. From DeepSeek, OpenAI, Google, xAI, Meta, Anthropic, to NVIDIA and TSMC, and to US, China, Taiwan relations, and everything else that is happening at the cutting edge of AI. This conversation is a deep dive into many critical aspects of the AI industry. While it does get super technical, we tried to make sure that it's still accessible to folks outside of the AI field by defining terms, stating important concepts explicitly, spelling out acronyms, and in general, always moving across the several layers of abstraction and levels of detail. There is a lot of hype in the media about what AI is and isn't. The purpose of this podcast, in part, is to cut through the hype, through the bullshit, and the low-resolution analysis, and to discuss in detail how stuff works and what the implications are. Let me also, if I may, comment on the new OpenAI o3 mini reasoning model, the release of which we were anticipating during the conversation, and it did indeed come out right after. Its capabilities and cost are on par with our expectations, as we stated. OpenAI o3 mini is indeed a great model, but it should be stated that, uh, DeepSeek R1 has similar performance on benchmarks, is still cheaper, and it reveals its chain of thought reasoning, which o3 mini does not. It only shows a summary of the reasoning. Plus, R1 is open weight and, uh, o3 mini is not. 
By the way, I got a chance to play with, uh, o3 mini, and uh, anecdotal vibe check-wise, I felt that o3 mini, specifically o3 mini high, is, uh, better than R1. Still, for me personally, I find that Claude Sonnet 3.5 is the best model for programming, except for tricky cases where I will use o1 pro to brainstorm. Either way, many more better AI models will come, including reasoning models, both from American and Chinese companies. They will continue to shift the cost curve. But the "DeepSeek moment" is indeed real. I think it will still be remembered five years from now as a pivotal event in tech history, due in part to the geopolitical implications, but for other reasons too, as we discuss in detail from many perspectives in this conversation. This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description, and now, dear friends, here's Dylan Patel and Nathan Lambert.
- 3:33 – 25:07
DeepSeek-R1 and DeepSeek-V3
- LFLex Fridman
A lot of people are curious to understand China's DeepSeek AI models, so let's lay it out. Nathan, can you describe what DeepSeek V3 and DeepSeek R1 are, how they work, how they're trained? Let's, uh, look at the big picture and then we'll zoom in on the details.
- NLNathan Lambert
Yeah, so DeepSeek V3 is a new mixture of experts transformer language model from DeepSeek, who is based in China. They have some new specifics in the model that we'll get into. Largely, this is an open weight model, and it's an instruction model like what you would use in ChatGPT. Um, they also released what is called the base model, which is before these techniques of post-training. Uh, most people use instruction models today, and those are what's served in all sorts of applications. This was released on, I believe, December 26th or that week, and then weeks later, on January 20th, DeepSeek released DeepSeek R1, which is a reasoning model, which really accelerated a lot of this discussion. This reasoning model has a lot of overlapping training steps to DeepSeek V3, and it's confusing that you have a base model called V3 that you do something to to get a chat model, and then you do some different things to get a reasoning model. I think a lot of the AI industry is going through this challenge of communications right now, where OpenAI makes fun of their own naming schemes. They have GPT-4o, they have OpenAI o1, and there's a lot of types of models, so we're gonna break down what each of them are. There's a lot of technical specifics on training, and we'll go through them from high level to specific, and kind of go through each of them.
- LFLex Fridman
There's so many places we can go here, but maybe let's go to open weights first. What does it mean for a model to be open weights, and what are the different flavors of open source in general?
- NLNathan Lambert
Yeah, so this discussion has been going on for a long time in AI. It became more important, or more focal, since ChatGPT at the end of 2022. Open weights is the accepted term for, um, when model weights of a language model are available on the internet for people to download. Those weights can have different licenses, which is effectively the terms by which you can use the model. There are licenses that come from history in open source software. There are licenses that are designed by companies specifically. Um, all of LLaMA, DeepSeek, Qwen, Mistral, these popular names in open weight models, have some of their own licenses. It's complicated 'cause not all the same models have the same terms. The big debate is on what makes a model open weight. It's, like, why are we saying this term? It's kind of a mouthful. It sounds close to open source, but it's not the same. There's still a lot of debate on the definition and soul of open source AI. Open source software has a rich history on freedom to modify, freedom to take on your own, freedom from any restrictions on how you would use the software, and what that means for AI is still being defined. So, uh, for what I do, I work at the Allen Institute for AI. We're a nonprofit. We want to make AI open for everybody, and we try to lead on what we think is truly open source. There's not full agreement in the community, but for us, that means releasing the training data, releasing the training code, and then also having open weights like this. And we'll get into the details of the models. And again and again, as we try to get deeper into how the models were trained, we will say things like, "The data processing, data filtering, data quality is the number one determinant of the model quality," and then a lot of the training code is the determinant on how long it takes to train and how fast your experimentation is. 
So without fully open source models where you have access to this data, it is hard to know, or it's harder to replicate. So we'll get into cost numbers for DeepSeek V3 on mostly GPU hours and how much you could pay to rent those yourselves. But without the data, the replication cost is going to be far, far higher. And same goes for the code.
- LFLex Fridman
We should also say that this is probably one of the more open models out of the frontier models.
- NLNathan Lambert
Yeah.
- LFLex Fridman
So, like, in this full spectrum, we're probably the fullest open source, like you said. Open code, open data, open weights. This is not open code. This is probably not open data. And this is open weights, and the licensing is, uh, MIT license or it's... Uh, I mean, there's some nuance in the different models, but it's towards the free... In terms of the open source movement, these are the, kind of the good guys.
- NLNathan Lambert
Yeah. DeepSeek is doing fantastic work for disseminating understanding of AI. Uh, their papers are extremely detailed in what they do, and for other teams around the world, they're very actionable in terms of improving your own training techniques. Uh, and we'll talk about licenses more. The DeepSeek R1 model has a very permissive license. It's called the MIT license. That effectively means there's no downstream restrictions on commercial use. There's no use case restrictions. You can use the outputs from the models to create synthetic data. And this is all fantastic. I think the closest peer is something like LLaMA, where you have the weights and you have a technical report, and the technical report is very good for LLaMA. One of the most-read PDFs of last year is the LLaMA 3 paper. But in some ways, it's slightly less actionable. It has less details on the training specifics, I think less plots, um, and so on. And the LLaMA 3 license is more restrictive than MIT, and then between the DeepSeek custom license and the LLaMA license, we could get into this whole rabbit hole, I think. We'll make sure we want to go down the license rabbit hole before we do specifics.
- LFLex Fridman
Yeah, and I mean... So it should be stated that one of the implications of DeepSeek, it puts pressure on LLaMA and everybody else on OpenAI to push towards, uh, open source. And that's the other side of open source that, uh, you mentioned, is how much is published in detail about it. So how open are you with the sort of the insights behind the code? So like, how good is the technical reports? Are they hand wavy or is there actual, uh, details in there? And that's one of the things that DeepSeek did well, is they published a lot of the details.
- NLNathan Lambert
Yeah, especially in DeepSeek V3, which is their pre-training paper. They were very clear that they are doing interventions on the technical stack that go at many different levels. For example, to get highly efficient training, they're making modifications at or below the CUDA layer for NVIDIA chips. I have never worked at that level myself, and there are a few people in the world that do that very well, and some of them are at DeepSeek. These types of people are at DeepSeek and at the leading American frontier labs, but there are not many places.
- LFLex Fridman
To help people understand the other implication of open weights, just, you know, there's, um, a topic we'll return to often here. So, there's a, uh, fear that China, the nation, might have interest in, um, stealing American data, violating privacy of American citizens. What can we say about open weights to help us understand what the weights are able to do-
- NLNathan Lambert
Yeah.
- LFLex Fridman
... in terms of stealing people's data?
- NLNathan Lambert
Yeah. So these weights that you can download from Hugging Face or other platforms are very big matrices of numbers. You can download them to a computer in your own house that has no internet, and you can run this model and you're totally in control of your data. That is something that is different than how a lot of language model usage is actually done today, which is mostly through APIs, where you send your prompt to GPUs run by certain companies. And these companies will have different distributions and policies on how your data is stored, if it is used to train future models, where it is stored, if it is encrypted, and so on. So the open weights are you have your fate of data in your own hands, and that is something that is deeply connected to the soul of open source.
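The point Nathan makes, that downloaded weights are just big matrices of numbers you can run on an offline machine, can be sketched with a toy stand-in. Everything here is invented for illustration (two tiny made-up weight matrices, not any real model's architecture): the point is that running a model is just local arithmetic on those numbers, with no network call anywhere.

```python
# Toy illustration: "open weights" are just numbers you can run offline.
# These two matrices are made up; real model weights are the same idea at
# vastly larger scale. Note there is no network access anywhere below.

def matvec(weights, x):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def relu(x):
    return [max(0.0, v) for v in x]

# The downloaded "weights": plain matrices of numbers, nothing executable.
layer1 = [[0.5, -0.2], [0.1, 0.8]]
layer2 = [[1.0, -1.0]]

def forward(x):
    """Run the toy model entirely on the local machine."""
    return matvec(layer2, relu(matvec(layer1, x)))

print(forward([1.0, 2.0]))
```

Nothing in this computation leaves the machine, which is the sense in which open weights put the fate of your data in your own hands.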
- LFLex Fridman
So it's not the model that steals your data, it's whoever's hosting the model, which could be China, if you're using the DeepSeek app, or it could be Perplexity. Uh, you know, you're trusting them with your data. Or OpenAI, you're trusting them with your data. And some of these are American companies, some of these are Chinese companies. But the model itself is not doing the stealing. It's the host. All right. So, uh, back to the basics. What's the difference between DeepSeek V3 and DeepSeek R1? Can we try to, like-... uh, lay out the confusion potential?
- NLNathan Lambert
Yes. So, for one, I am very understanding of many people being confused by these two model names. So, I would say the best way to think about this is that when training a language model, you have what is called pre-training, which is when you're predicting on large amounts of mostly internet text: you're trying to predict the next token. And what to know about these new DeepSeek models is that they do this internet large-scale pre-training once to get what is called DeepSeek V3 Base. This is the base model; it's just going to finish your sentences for you, and it's going to be harder to work with than ChatGPT. And then, what DeepSeek did is they've done two different post-training regimes to make the models have specific desirable behaviors. So, the more normal model in terms of the last few years of AI (an instruct model, a chat model, an "aligned" model, a helpful model, there are many ways to describe this) gets more standard post-training. So, this is things like instruction tuning, Reinforcement Learning from Human Feedback. We'll get into some of these words. And this is what they did to create the DeepSeek V3 model. This was the first model to be released, and it is very performant; it's competitive with GPT-4, LLaMA 405B, and so on. And then when this release was happening (we don't know their exact timeline) or soon after, they were finishing the training of a different training process from the same next-token-prediction base model that I talked about, which is when this new reasoning training that people have heard about comes in, in order to create the model that is called DeepSeek R1. The R is good grounding for "reasoning" throughout this conversation, and the name is also similar to OpenAI's o1, which is the other reasoning model that people have heard about. 
And we'll have to break down the training for R1 in more detail, because, for one, we have a paper detailing it, but also, it is a far newer set of techniques for the AI community, so it's a much more rapidly evolving area of research.
- LFLex Fridman
Maybe we should also say the big two categories of training of pre-training and post-training. These umbrella terms that people use. So, what is pre-training and what is post-training, and what are the different flavors of things underneath post-training umbrella?
- NLNathan Lambert
Yeah. So pre-training, to use some of the same words that really get the message across, is doing what is called autoregressive prediction to predict the next token in a series of documents. This is done, by standard practice, over trillions of tokens. So this is a ton of data that is mostly scraped from the web. In some of DeepSeek's earlier papers, they talk about their training data being distilled for math (I shouldn't use this word yet, but taken from) Common Crawl, and that's public access: anyone listening to this could go download data from the Common Crawl website. This is a crawler that is maintained publicly. Yes, other tech companies eventually shift to their own crawler, and DeepSeek likely has done this as well, as most frontier labs do. But this sort of data is something that people can get started with, and you're just predicting text in a series of documents. This can be scaled to be very efficient, and there's a lot of numbers that are thrown around in AI training, like how many floating point operations or FLOPs are used, and then you can also look at how many hours of these GPUs, uh, that are used. And it's largely one loss function taken to a very large amount of (laughs) compute usage. You just set up really efficient systems. And then at the end of that, you have the base model. And post-training is where there is a lot more complexity in terms of how the process is emerging or evolving and the different types of training losses that we will use. I think this is a lot of techniques grounded in the natural language processing literature. The oldest technique, which is still used today, is something called instruction tuning, or also known as supervised fine-tuning. 
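The single pre-training loss Nathan describes, predicting the next token autoregressively, can be written down in a few lines. This is a minimal sketch: the "model" here is a stand-in function returning a probability distribution, not a real network, and the vocabulary is three made-up token IDs.

```python
import math

# Minimal sketch of the pre-training objective: for each position in a
# document, the model assigns probabilities to the next token, and the loss
# is the average negative log-probability of the token that actually came next.

def next_token_loss(token_ids, predict_probs):
    """Average negative log-likelihood of each next token given its prefix."""
    total = 0.0
    for t in range(len(token_ids) - 1):
        prefix = tuple(token_ids[: t + 1])
        probs = predict_probs(prefix)          # distribution over the vocabulary
        total += -math.log(probs[token_ids[t + 1]])
    return total / (len(token_ids) - 1)

# A toy "model" over a 3-token vocabulary that always predicts uniformly.
uniform = lambda prefix: {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
print(next_token_loss([0, 1, 2, 1], uniform))  # log(3) ≈ 1.0986 per token
```

Scaling this one loss function to trillions of tokens is, as Nathan says, essentially the whole of pre-training.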
These acronyms will be IFT or SFT (people really go back and forth between them, and I will probably do the same), which is where you add this formatting to the model where it knows to take a question that is like, "Explain the history of the Roman Empire to me," or a sort of question you'll see on Reddit or Stack Overflow, and then the model will respond in an information-dense but presentable manner. The core of that formatting is in this instruction tuning phase, and then there's two other categories of loss functions that are being used today. One, I will classify as preference fine-tuning. Preference fine-tuning is a generalized term for what came out of Reinforcement Learning from Human Feedback, which is RLHF. Reinforcement Learning from Human Feedback is credited as the technique that helped, uh, ChatGPT break through. It is a technique to make the responses that are nicely formatted, like these Reddit answers, more in tune with what a human would like to read. This is done by collecting pairwise preferences from actual humans out in the world to start, and now AIs are also labeling this data, and we'll get into those tradeoffs. And you have this kind of contrastive loss function between a good answer and a bad answer, and the model learns to pick up these trends. There's different implementation ways. You have things called reward models. You could have direct alignment algorithms. There's a lot of really specific things you can do, but all of this is about fine-tuning to human preferences. And the final stage is much newer and will link to what is done in R1 and these reasoning models. I think the OpenAI name for this is from the new API they had in the fall, which they called the Reinforcement Fine-Tuning API. This is the idea that you use the techniques of reinforcement learning, which is a whole framework of AI; there's a deep literature here. 
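The "contrastive loss function between a good answer and a bad answer" Nathan mentions can be sketched with the standard Bradley-Terry formulation used by reward models in RLHF. The reward scores below are made-up scalars standing in for a reward model's outputs on a chosen and a rejected answer; this is one common implementation, not the only one (direct alignment algorithms like DPO use a related but different objective).

```python
import math

# Sketch of a pairwise preference loss: a reward model scores each answer in
# a human-labeled pair, and the Bradley-Terry loss pushes the chosen answer's
# score above the rejected one's.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(reward_chosen, reward_rejected):
    """-log P(chosen preferred) under a Bradley-Terry model of the pair."""
    return -math.log(sigmoid(reward_chosen - reward_rejected))

print(preference_loss(2.0, 0.5))  # small loss: model already prefers chosen
print(preference_loss(0.5, 2.0))  # large loss: preference is backwards
```

Training on many such pairs is what tunes the nicely formatted SFT answers toward what humans actually prefer to read.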
To summarize, it's often known as trial-and-error learning, or the subfield of AI where you're trying to make sequential decisions in a potentially noisy environment. There's a lot of ways we could go down that, but... it's fine-tuning language models where they can generate an answer, and then you check to see if the answer matches the true solution. For math or code, you have an exactly correct answer for math, and you can have unit tests for code. And what we are doing is we are checking the language model's work, and we are giving it multiple opportunities on the same questions to see if it is right. And if you keep doing this, the models can learn to improve in verifiable domains. Uh, to a great extent, it works really well. It's a newer technique in the academic literature. It's been used at frontier labs in the US that don't share every detail, uh, for multiple years. So this is the idea of using reinforcement learning with language models, and it has been taking off, especially in this DeepSeek moment.
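The loop Nathan describes (generate an answer, check it against a known solution, reward the correct attempts) can be sketched as follows. Everything here is a toy stand-in: the "policy" is a noisy arithmetic guesser, not a language model, and the verifier is a simple exact-match check rather than a unit-test harness.

```python
import random

# Sketch of reinforcement learning with verifiable rewards: sample several
# candidate answers per question, score each 1 if it matches the known
# solution and 0 otherwise, then use those rewards to update the policy.

random.seed(0)

def toy_policy(question):
    """Stand-in for a language model: guesses an answer to 'a+b',
    sometimes right and sometimes off by one."""
    a, b = map(int, question.split("+"))
    return a + b + random.choice([-1, 0, 0, 1])

def collect_verified_samples(question, true_answer, attempts=8):
    """Give the policy multiple opportunities; reward = exact match
    against the ground-truth answer (the 'verifiable' part)."""
    samples = [toy_policy(question) for _ in range(attempts)]
    rewards = [1 if s == true_answer else 0 for s in samples]
    return samples, rewards

samples, rewards = collect_verified_samples("2+3", 5)
print(sum(rewards), "of", len(rewards), "attempts verified correct")
```

The real versions use gradient updates that raise the probability of the rewarded completions; the key property is the same: the reward comes from checking the answer, not from a human label.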
- LFLex Fridman
And we should say that there's a lot of exciting stuff going on on the, uh, again, across the stack, but the post-training probably this year, there's going to be a lot of interesting developments in the post-training. We'll, we'll talk about it. Uh, I almost forgot to talk about the, the difference between DeepSeek V3 and R1 on the user experience side. So forget the technical stuff, forget all of that. Just people that don't know anything about AI, they show up, like, what's the actual experience? What's the use case for each one when they actually, like, type and talk to it?
- NLNathan Lambert
Yeah.
- LFLex Fridman
What, what is each good at? That kind of thing.
- NLNathan Lambert
So let's start with DeepSeek V3 again. It's what most people would have tried, or something like it. You ask it a question, it'll start generating tokens very fast, and those tokens will look like a very human-legible answer. It'll be some sort of markdown list. It might have formatting to help draw you to the core details in the answer, and it'll generate tens to hundreds of tokens. A token is normally a word, for common words, or a sub-word part in a longer word. And it'll look like a very high-quality Reddit or Stack Overflow answer. These models are really getting good at doing these across a wide variety of domains. Uh, even things that, if you're an expert, are close to the fringe of knowledge, they will still be fairly good at. I think cutting-edge AI topics that I do research on, these models are capable for study aid, and they're regularly updated. Where this changes is with DeepSeek R1, what is called these reasoning models: when you see tokens coming from these models to start, it will be a large chain-of-thought process. We'll get back to chain of thought in a second, which looks like a lot of tokens where the model is explaining the problem. The model will often break down the problem and be like, "Okay, they asked me for this. Let's break down the problem. I'm going to need to do this." And you'll see all of this generating from a model. It'll come very fast in most user experiences. These APIs are very fast, so you'll see a lot of tokens, a lot of words show up really fast. It'll keep flowing on the screen, and this is all the reasoning process. And then eventually the model will change its tone in R1, and it'll write the answer, where it summarizes its reasoning process and writes a similar answer to the first types of model. But in DeepSeek's case, which is part of why this was so popular even outside the AI community, you can see how the language model is breaking down problems. 
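Nathan's aside that a token is "a word for common words or a sub-word part in a longer word" can be illustrated with a toy greedy tokenizer. The vocabulary below is invented for illustration; real tokenizers (e.g. BPE) learn tens of thousands of pieces from data, but the word-versus-subword split behaves the same way.

```python
# Toy illustration of tokens: common words stay whole, rarer words get split
# into sub-word pieces. The vocabulary is made up; real tokenizers learn
# their vocabularies from training data.

VOCAB = {"the", "model", "token", "ization", "izer", "un", "common"}

def tokenize(word):
    """Greedy longest-match split of one word into vocabulary pieces."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):       # try the longest piece first
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])              # unknown-character fallback
            i += 1
    return pieces

print(tokenize("token"))         # common word: one token
print(tokenize("tokenization"))  # longer word: sub-word pieces
```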
And then you get this answer. On a technical side, they train the model to do this specifically, where they have a section which is reasoning, and then it generates a special token, which is probably hidden from the user most of the time, which says, "Okay, I'm starting the answer." So the model is trained to do this two-stage process on its own. If you use a similar model in, say, OpenAI, OpenAI's user interface is trying to summarize this process for you nicely by kind of showing the sections that the model is doing, and it'll kind of click through. It'll say, "Breaking down the problem, making X calculation, cleaning the result," and then the answer will come for something like OpenAI.
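The two-stage output Nathan describes (reasoning, then a special token, then the answer) is simple to handle on the application side. A sketch, with the caveat that the separator name "</think>" below is an assumption for illustration; the actual special token is whatever the model's tokenizer defines.

```python
# Sketch of splitting a reasoning model's raw completion into its two stages.
# The separator string is a hypothetical placeholder, not a confirmed token name.

SEPARATOR = "</think>"

def split_reasoning(raw_output):
    """Split a raw completion into (reasoning, answer). Chat apps typically
    collapse the reasoning part behind a drop-down like 'Thought for N seconds'."""
    if SEPARATOR in raw_output:
        reasoning, answer = raw_output.split(SEPARATOR, 1)
        return reasoning.strip(), answer.strip()
    return "", raw_output.strip()   # non-reasoning models: everything is the answer

raw = "Okay, they asked me for X. Let me break it down... </think> The answer is 42."
reasoning, answer = split_reasoning(raw)
print(answer)  # The answer is 42.
```

This is why DeepSeek's app can show the full chain of thought in a drop-down while OpenAI's interface shows only a summary: the split point is explicit in the token stream, and what to display is a product decision.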
- LFLex Fridman
Maybe it's useful here to go through, like, an example of it, DeepSeek R1 reasoning.
- NLNathan Lambert
Yeah, so if you're looking at the screen here, what you'll see is a screenshot of the DeepSeek chat app. And at the top is "Thought for 157 seconds" with the drop-down arrow. Underneath that, if we were in an app that we were running, the drop-down arrow would have the reasoning.
- LFLex Fridman
So in this case, uh, the specific question, which, you know, I'm philosophically/pothead inclined, so this is asking DeepSeek R1 for one truly novel insight about humans. And it reveals the reasoning, and basically the truly novel aspect is what's pushing the reasoning: the model is constantly sort of asking itself, "Is this truly novel?" So it's actually challenging itself to be more novel, more counterintuitive, uh, less cringe, I suppose. So some of the reasoning says (this is just snapshots): "Alternatively, humans have a unique meta-emotion where they feel emotions about their own emotions, e.g. feeling guilty about being angry. This recursive emotional layering creates complex motivational drives that don't exist in other animals. The insight is that human emotions are nested." So it's reasoning through how humans feel emotions. It's reasoning about meta-emotions.
- NLNathan Lambert
Gonna have pages and pages of this.
- LFLex Fridman
Yeah, exactly.
- NLNathan Lambert
It's almost too much to actually read, but it's nice to skim as it's coming.
- LFLex Fridman
It's a James Joyce-like stream of consciousness, and then it goes, "Wait, the user wants something that's not seen anywhere else. Let me dig deeper and consider the human ability to hold contradictory beliefs simultaneously. Cognitive dissonance is known, but perhaps the function is to allow flexible adaptation," so on and so forth. I mean, that really captures the public imagination that, holy shit, this is, uh, I mean, intelligent, almost like an inkling of sentience, because, like, it's thinking through, it's self-reflecting, it's deliberating. And the final result of that, after 157 seconds, is, "Humans instinctively convert selfish desires into cooperative systems by collectively pretending abstract rules, money, laws, rights are real. These shared hallucinations act as 'games' where competition is secretly redirected to benefit the group, turning conflict into society's fuel." Pretty profound. I mean, you know, comment to-
- NLNathan Lambert
This is a potential digression, but a lot of people have found that these reasoning models can sometimes produce much more eloquent text. Like, that is a, at least interesting example. I think depending on how open-minded you are, you find language models interesting or not, and there's a spectrum there.
- 25:07 – 51:25
Low cost of training
- LFLex Fridman
How were they able to achieve such low cost on the training and the inference? Maybe you could talk the training first.
- DPDylan Patel
Yeah. So there's two main techniques that they implemented that are probably the majority of their efficiency, and then there's a lot of implementation details that maybe we'll gloss over or get into later that sort of contribute to it. But those two main things are: one is they went to a mixture of experts model, uh, which we'll define in a second, and then the other thing is that they invented this new technique called MLA, multi-head latent attention. Both of these are big deals. Mixture of experts is something that's been in the literature for a handful of years, and OpenAI with GPT-4 was the first one to productize a mixture of experts model. And what this means is, when you look at the common models around, uh, that most people have been able to interact with that are open, right, think LLaMA. LLaMA is a dense model, i.e. every single parameter or neuron is activated as you're going through the model for every single token you generate, right? Now, with a mixture of experts model, you don't do that, right? How does a human actually work, right? It's like, oh, well, my visual cortex is active when I'm thinking about, you know, vision tasks, or, um, like, you know, other things, right? My amygdala is active when I'm scared, right? These different aspects of your brain are focused on different things. A mixture of experts model attempts to approximate this to some extent. It's nowhere close to what a brain architecture is, but different portions of the model activate, right? You'll have a set number of experts in the model and a set number that are activated each time, and this dramatically reduces both your training and inference cost. 
Because now, you know, if you think about the parameter count as the sort of total embedding space for all of this knowledge that you're compressing down during training, when you're embedding this data in, instead of having to activate every single parameter every single time you're training or running inference, now you can just activate a subset, and the model will learn which expert to route to for different tasks. And so this is a humongous innovation in terms of, hey, I can continue to grow the total embedding space of parameters. And so DeepSeek's model is, you know, 600-something billion parameters, right? Relative to LLaMA 405B, which is 405 billion parameters, right? Or relative to LLaMA 70B, which is 70 billion parameters, right? So this model technically has more embedding space for information, right, uh, to compress all of the world's knowledge that's on the internet down. But at the same time, it is only activating around 37 billion of the parameters. So only 37 billion of these parameters actually need to be computed every single time you're training on data or inferencing data out of it. And so, versus, again, a LLaMA model, where 70 billion parameters must be activated, or 405 billion parameters must be activated, you've dramatically reduced your compute cost when you're doing training and inference with this mixture of experts architecture.
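The routing Dylan describes can be sketched in a few lines: a router scores all experts for a token, only the top-k are computed, and the compute cost scales with the active fraction rather than the total. The expert count, top-k, and per-expert sizes below are illustrative placeholders, not DeepSeek's actual configuration, and the router scores are faked deterministically instead of coming from a learned network.

```python
# Minimal sketch of mixture-of-experts routing: pick a small subset of experts
# per token, so only a fraction of total parameters is computed.

def route(router_scores, top_k=2):
    """Pick the top_k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)), key=lambda i: -router_scores[i])
    return ranked[:top_k]

num_experts, params_per_expert = 64, 1_000_000     # illustrative sizes
scores = [((i * 37) % 19) / 19 for i in range(num_experts)]  # fake router output

active = route(scores, top_k=2)
active_params = len(active) * params_per_expert
total_params = num_experts * params_per_expert
print(f"active experts: {active}")
print(f"compute fraction: {active_params / total_params:.3f}")  # 2/64 ≈ 0.031
```

This is the same arithmetic behind DeepSeek's numbers: roughly 37 billion active out of 600-something billion total means each token pays for only a small fraction of the full parameter count.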
- NLNathan Lambert
Should we break down where it actually applies and go into the transformer? Is that useful?
- LFLex Fridman
Let's go. Let's go into the transformer. (laughs)
- NLNathan Lambert
Cool. Okay. So the transformer is a thing that is talked about a lot, and we will not cover every detail. Uh, essentially, the transformer is built on repeated blocks of this attention mechanism and then a traditional dense, fully connected multi-layer perceptron (whatever word you want to use for your normal neural network), and you alternate these blocks. There's other details. And where mixture of experts is applied is at this dense layer. These dense layers hold most of the weights, if you count them, in a transformer model. So you can get really big gains from mixture of experts on parameter efficiency, uh, at training and inference, because you get this efficiency by not activating all of these parameters.
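Nathan's claim that the dense MLP layers hold most of a transformer's weights can be checked with a back-of-envelope count. The formulas below use a simplified standard block (four d x d attention projections for Q, K, V, and output; an MLP with a 4x hidden expansion), ignoring embeddings, biases, and normalization; real architectures vary, but the ratio is representative.

```python
# Back-of-envelope parameter count per transformer block, showing the MLP
# dominates. Simplified: attention = 4 * d^2 (Q, K, V, output projections),
# MLP = 2 * d * (4d) (up- and down-projections with a 4x hidden expansion).

def per_block_params(d_model, mlp_expansion=4):
    attention = 4 * d_model * d_model
    mlp = 2 * d_model * (mlp_expansion * d_model)
    return attention, mlp

attn, mlp = per_block_params(d_model=4096)
print(f"attention params per block: {attn:,}")
print(f"MLP params per block:       {mlp:,}")
print(f"MLP share: {mlp / (attn + mlp):.2f}")  # 8/12 ≈ 0.67
```

Since roughly two-thirds of the block's weights sit in the MLP, replacing that part with sparsely activated experts is exactly where the big parameter-efficiency gains come from.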
- LFLex Fridman
We should also say that a transformer is a giant neural network.
- NLNathan Lambert
Yeah.
- LFLex Fridman
And then, for 15 years now, there's what's called the deep learning revolution. Networks have gotten larger and larger. At a certain point, the scaling laws appeared, where people realized-
- NLNathan Lambert
This is a Scaling Laws shirt, by the way. (laughs)
- LFLex Fridman
(laughs) Representing scaling laws, where it became more and more formalized that bigger is better across multiple dimensions of what bigger means. So, uh, these are all sort of neural networks we're talking about, and we're talking about different architectures of how to construct these neural networks such that the training and the inference on them is super efficient.
- NLNathan Lambert
Yeah. Every different type of model has a different scaling law for it, which is effectively, for how much compute you put in, the architecture will get to different levels of performance at test tasks. And mixture of experts is one of the ones where, even if you don't consider the inference benefits, which are also big, at training time your efficiency with your GPUs is dramatically improved by using this architecture, if it is well-implemented. So you can get effectively the same performance model and evaluation scores with numbers like 30% less compute. I think there's gonna be a wide variation depending on your implementation details and stuff, but it is just important to realize that this type of technical innovation is something that gives huge gains, and I expect most companies that are serving their models to move to this mixture of experts implementation. Historically, the reason why not everyone might do it is because it's an implementation complexity, especially when doing these big models. So this is one of the things DeepSeek gets credit for: they do this extremely well. They do mixture of experts extremely well. This architecture for what is called DeepSeekMoE (MoE is the shortened version of mixture of experts) is multiple papers old. This part of their training infrastructure is not new to these models alone, and the same goes for what Dylan mentioned with multi-head latent attention. This is all about reducing memory usage during inference, and the same thing during training, by using some fancy low-rank approximation math. If you get into the details with this latent attention, it's one of those things I look at, and it's like, okay, they're doing really complex implementations, because there's other parts of language models, such as, uh, embeddings, that are used to extend the context length. The common one that DeepSeek uses is rotary positional embeddings, which is called RoPE. 
And if you want to use RoPE with a normal MoE, it's kind of a sequential thing. You take these, you take two of the attention matrices and you rotate them by a complex value rotation, which is a matrix multiplication. With DeepSeek's MLA, with this new attention architecture, they need to do some clever things because they're not set up the same and it just makes the implementation complexity much higher. So they're managing all of these things, and these are probably the sort of things that OpenAI, these closed labs are doing. We don't know if they're doing the exact same techniques, but they actually shared them with the world, which is really nice to feel like this is the cutting edge of efficient language model training.
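As a rough illustration of the rotation Nathan describes, here is a minimal sketch of rotary positional embeddings in NumPy. The head dimension and input values are toy assumptions, and real implementations apply this to learned query/key projections inside fused attention kernels.

```python
# Minimal sketch of rotary positional embeddings (RoPE); toy head
# dimension of 4, purely illustrative.
import numpy as np

def rope(x: np.ndarray, position: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive feature pairs by a position-dependent angle.

    Equivalent to multiplying each pair (x[2i], x[2i+1]) by the complex
    value exp(1j * position * theta_i), where theta_i = base**(-2i/d).
    """
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angles = position * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin        # 2D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = np.array([1.0, 0.0, 1.0, 0.0])
print(rope(q, position=0))   # position 0: no rotation, returns q unchanged
print(rope(q, position=5))   # later positions: each pair is rotated
```

The useful property is that the dot product of two rotated vectors depends only on their relative positions, which is why RoPE composes awkwardly with compressed-attention schemes like MLA.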
- GUGuest
And some of this requires low-level engineering; it is a giant mess and trickery. So as I understand, they went below CUDA. They went super low-level in programming the GPUs.
- DPDylan Patel
Effectively, NVIDIA builds this library called NCCL, right? Uh, in which, you know, when you're training a model, you have all these communications between every single layer of the model and you may have over 100 layers, right?
- NLNathan Lambert
What does NCCL stand for? It's N-C-C-L?
- DPDylan Patel
NVIDIA Collective Communications Library.
- NLNathan Lambert
Yeah.
- GUGuest
Nice. (laughs)
- NLNathan Lambert
(laughs)
- DPDylan Patel
Um, and so-
- GUGuest
Damn. (laughs)
- DPDylan Patel
(laughs) When you're training a model, right, you're gonna have all these all-reduces and all-gathers, right? Uh, between each layer, between the multi-layer perceptron or feed-forward network and the attention mechanism, you'll have the model synchronized, right? You'll have an all-reduce and an all-gather. Um, and this is a communication between all the GPUs in the network, whether it's in training or inference. So NVIDIA has a standard library. This is one of the reasons why it's really difficult to use anyone else's hardware for training: because no one's really built a standard communications library. Um, and NVIDIA's done this at a sort of higher level, right? Uh, DeepSeek, because they have certain limitations around the GPUs that they have access to, the interconnects are limited to some extent, um, by the restrictions of the GPUs that were shipped into China legally, not the ones that are smuggled, but legally shipped in, uh, that they used to train this model. They had to figure out how to get efficiencies, right? And one of those things is that instead of just calling the NVIDIA library NCCL, they instead scheduled their own communications, uh, which some of the labs do, right? Um, Meta talked about in Llama 3 how they made their own custom version of NCCL. They didn't talk about the implementation details; this is some of what they did, probably not as well as DeepSeek, because DeepSeek, you know, necessity is the mother of innovation, and they had to do this. Whereas, uh, you know, OpenAI has people that do this sort of stuff, Anthropic, et cetera. Uh, but, you know, DeepSeek certainly did it publicly, and they may have done it even better because they were gimped on a certain aspect of the chips that they have access to.
And so they scheduled communications, um, you know, by scheduling specific SMs. SMs you could think of as like the cores on a GPU, right? So there's a bit over 100 cores, SMs, on a GPU, and they were specifically scheduling: hey, which ones are running the model? Which ones are doing all-reduce? Which ones are doing all-gather, right? And they would flip back and forth between them, and this requires extremely low-level programming.
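To make the all-reduce/all-gather vocabulary concrete, here is a hedged sketch, in plain Python over lists rather than GPUs, of what these two collectives compute. NCCL implements the same operations over NVLink/InfiniBand with far more sophisticated ring and tree scheduling; the function names here are illustrative.

```python
# Toy simulation of the two collectives discussed above; each inner
# list plays the role of one GPU's local data.
def all_reduce(per_gpu_grads):
    """After an all-reduce, every 'GPU' holds the element-wise sum of
    everyone's gradients (how the model is kept synchronized)."""
    summed = [sum(vals) for vals in zip(*per_gpu_grads)]
    return [list(summed) for _ in per_gpu_grads]

def all_gather(per_gpu_shards):
    """After an all-gather, every 'GPU' holds the concatenation of
    everyone's shard (used when a tensor is split across GPUs)."""
    gathered = [x for shard in per_gpu_shards for x in shard]
    return [list(gathered) for _ in per_gpu_shards]

# Two "GPUs", each with its own local gradients / shard:
print(all_reduce([[1.0, 2.0], [3.0, 4.0]]))  # → [[4.0, 6.0], [4.0, 6.0]]
print(all_gather([[1.0], [2.0]]))            # → [[1.0, 2.0], [1.0, 2.0]]
```

The engineering problem DeepSeek attacked is not the arithmetic, which is trivial, but overlapping these communications with compute on a bandwidth-constrained interconnect.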
- NLNathan Lambert
This is what NCCL or other NVIDIA libraries usually handle automatically.
- DPDylan Patel
Yeah, exactly. And so technically they're using, you know, PTX, which you could think of as sort of like an assembly-type language or instruction set. It's not exactly that, but it's still technically part of CUDA. But it's like: do I want to write in Python, you know, the PyTorch equivalent, and call NVIDIA libraries? Do I wanna go down to the C level, right? Or code at an even lower level, or do I wanna go all the way down to the assembly or ISA level? And there are cases where you go all the way down there at the very big labs, but most companies just do not do that, right? Because it's a waste of time and the efficiency gains you get are not worth it. But DeepSeek's implementation is so complex, right? Especially with their mixture of experts, right? People have done mixture of experts, but they're generally eight, 16 experts, right? And they activate two. So, you know, one of the words we like to use is sparsity factor, right? Or usage, right? So you might have one fourth of your model activate, right? And that's what Mistral's Mixtral model did, right? Their model that really catapulted them to like, oh my god, they're really, really good. Um, OpenAI has also had models that are MoE, and so have all the other major closed labs. But what DeepSeek did, that maybe only the leading labs have just started recently doing, is have such a high sparsity factor, right? It's not one fourth of the model, right? Two out of eight experts activating every time you go through the model. It's eight out of 256. And-
- NLNathan Lambert
There's different implementations for mixture of experts where you can have some of these experts that are always activated, which just looks like a small neural network, and then all the tokens go through that, and then they also go through some that are selected by this routing mechanism. And one of the innovations in DeepSeek's architecture is that they changed the routing mechanism in mixture of expert models. There's something called an auxiliary loss, which effectively means during training you want to make sure that all of these experts are used across the tasks that the model sees. Why there can be failures in mixture of experts is that when you're doing this training, the one objective is token prediction accuracy, and if you just let training go with a mixture of expert model on its own, it can be that the model learns to only use a subset of the experts. And in the MoE literature, there's something called the auxiliary loss, which helps balance them. But if you think about the loss functions of deep learning, this even connects to the bitter lesson: you want to have the minimum inductive bias in your model to let the model learn maximally. And this auxiliary loss, this balancing across experts, could be seen as in tension with the prediction accuracy of the tokens. So we don't know the exact extent of what DeepSeekMoE changed, but instead of doing an auxiliary loss, they have an extra parameter in their routing, which after the batches they update to make sure that the next batches all have a similar use of experts. And this type of change can be big, it can be small, but they add up over time, and this is the sort of thing that just points to them innovating. And I'm sure all the labs that are training big MoEs are looking at these sorts of things, which is getting away from the auxiliary loss.
Some of them might already use it, but you just keep accumulating gains, and we'll talk about the philosophy of training and how you organize these organizations. A lot of it is just compounding small improvements over time in your data, in your architecture, in your post-training, and how they integrate with each other. DeepSeek does the same thing, and some of them are shared; a lot we have to take on face value, that they share their most important details. I mean, the architecture and the weights are out there, so we're seeing what they're doing, and it adds up.
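The bias-based balancing Nathan describes can be sketched roughly like this. The update rule, step size, and function names below are illustrative assumptions, not DeepSeek's exact published algorithm; the idea is only that a routing bias, adjusted between batches, replaces an auxiliary loss term.

```python
# Hedged sketch of auxiliary-loss-free load balancing for MoE routing.
import numpy as np

def select_experts(scores: np.ndarray, bias: np.ndarray, k: int) -> np.ndarray:
    """Pick the top-k experts by biased score; the bias steers routing
    without adding an extra balancing term to the training loss."""
    return np.argsort(scores + bias)[-k:]

def update_bias(bias: np.ndarray, counts: np.ndarray, step: float = 0.01) -> np.ndarray:
    """After a batch, nudge over-used experts down and under-used
    experts up so the next batch spreads tokens more evenly."""
    target = counts.mean()
    return bias - step * np.sign(counts - target)

bias = np.zeros(4)
counts = np.array([100.0, 10.0, 10.0, 10.0])   # expert 0 is overloaded
bias = update_bias(bias, counts)
print(bias)   # expert 0's bias decreases, the others' increase
```

Because the bias only affects which experts are chosen, not the gradient of the prediction loss, balancing no longer fights directly with token prediction accuracy, which is the tension Nathan points at.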
- DPDylan Patel
Going back to sort of the efficiency and complexity point, right? It's 32 versus 4, right? For, like, Mixtral and other MoE models that have been publicly released. So this ratio is extremely high, and sort of what Nathan was getting at there was, when you have such a different level of sparsity, um, you can't just have every GPU have the entire model, right? The model's too big, there's too much complexity there. So you have to split up the model, um, with different types of parallelism, right? And so you might have different experts on different GPU nodes. But now what happens when this set of data that you get, hey, all of it looks like this one way and all of it should route to one part of my model, right? Um, so when all of it routes to one part of the model, then you can have this overloading of a certain set of the GPU resources, or a certain set of the GPUs, and then the rest of the training network sits idle because all of the tokens are just routing to that. So this is one of the big complexities with running a very sparse mixture of experts model, i.e., this 32 ratio versus this 4 ratio: you end up with so many of the experts just sitting there idle. So how do I load balance between them? How do I schedule the communications between them? This is a lot of the extremely low-level detailed work that they figured out in the public first, and potentially, like, second or third in the world, and maybe even first in some cases.
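The ratios Dylan quotes, and the load-imbalance failure mode, can be put in numbers. The expert counts come from the conversation; the skewed-routing demo is an illustrative toy, not a real router.

```python
# Back-of-envelope look at MoE sparsity ratios and routing hot spots.
import random

def sparsity_ratio(total_experts: int, active_experts: int) -> float:
    """Total experts per active expert: higher means sparser."""
    return total_experts / active_experts

print(sparsity_ratio(8, 2))     # Mixtral-style: 4.0
print(sparsity_ratio(256, 8))   # DeepSeek-style: 32.0

# If routing is skewed, the GPU holding one expert does most of the
# work while the rest sit idle -- the load-balancing problem above.
random.seed(0)
skewed = [0 if random.random() < 0.7 else random.randrange(1, 8)
          for _ in range(1000)]
loads = [skewed.count(e) for e in range(8)]
print(loads)   # expert 0 gets ~700 of 1,000 tokens; seven experts split the rest
```

At a 32:1 ratio, experts must be spread across many nodes, so a routing hot spot stalls the whole cluster rather than one GPU, which is why the scheduling work matters so much more here than at 4:1.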
- LFLex Fridman
What, uh, lesson do you, uh, in the direction of the bitter lesson, do you take from all of this? Where... Is this going to be the direction where a lot of the gain is going to be, which is this kind of low level optimization? Or is this a short-term thing where the biggest gains will be more on the algorithmic high level side of like post-training? Is, is this like a short-term leap because they've figured out like a hack because constraints, necessity is the mother of invention? Or is th- is there still a lot of gain?
- NLNathan Lambert
I think we should summarize what the bitter lesson actually is about, is that-
- LFLex Fridman
Okay.
- NLNathan Lambert
... the bitter lesson, essentially, if you paraphrase it, is that the types of training that will win out in deep learning as we go are those methods which are scalable in learning and search, is what it calls out. And this scale word gets a lot of attention in this. The interpretation that I use is effectively to avoid adding in human priors to your learning process. And if you read the original essay, this is what it talks about: how researchers will try to come up with clever solutions to their specific problem that might get them small gains in the short term, while simply enabling these deep learning systems to work efficiently, and for these bigger problems, in the long term might be more likely to scale and continue to drive success. And we were talking about relatively small implementation changes to the mixture of experts model, so it's like, okay, we will need a few more years to know (laughs) if one of these is actually really crucial to the bitter lesson. But the bitter lesson is really this long-term arc of how simplicity can often win, and there's a lot of sayings in the industry, like, the models just want to learn. You have to give them the simple loss landscape where you put compute through the model and they will learn, and get the barriers out of the way.
- LFLex Fridman
That's where the power of something like NCCL comes in, where standardized code can be used by a lot of people to create sort of simple innovations that can scale. Which is why the hacks, the, uh... I imagine the code base for DeepSeek is probably a giant mess.
- 51:25 – 58:57
DeepSeek compute cluster
- LFLex Fridman
Okay. Uh, what do we understand about the hardware DeepSeek has been trained on?
- DPDylan Patel
DeepSeek is very interesting, right? This is where it's worth taking a second to zoom out into who they are, first of all, right? HighFlyer is a hedge fund that has historically done quantitative trading in China, as well as elsewhere, and they have always had a significant number of GPUs, right? In the past, a lot of these high-frequency trading, algorithmic quant traders used FPGAs, uh, but it's shifted to GPUs definitely, and there's both, right? But GPUs especially. And HighFlyer, which is the hedge fund that owns DeepSeek, and everyone who works for DeepSeek is part of HighFlyer to some extent, right? Uh, same parent company, same owner, same CEO. They had all these resources and infrastructure for trading, and then they devoted a humongous portion of them to training models, uh, both language models and otherwise, right? Because these techniques were heavily AI-influenced. Um, you know, more recently people have realized, hey, trading with... even when you go back to, like, Renaissance and all these quantitative firms, natural language processing is the key to, like, trading really fast, right? Understanding a press release, uh, and making the right trade, right? And so DeepSeek has always been really good at this. And even as far back as 2021, they have press releases and papers saying, like, "Hey, we're the first company in China with an A100 cluster this large."
- LFLex Fridman
Okay.
- DPDylan Patel
It was 10,000 A100 GPUs, right? This is, this is in 2021. Now, this wasn't all for training, you know, large language models. This was mostly for training models for their quantitative aspects, their, uh, quantitative trading, as well as, you know, a lot of that was natural language processing to be clear, right? Um, and so this is the sort of history, right? So verifiable fact is that in 2021 they built the largest Chinese, uh, cluster, at least they claim it was the largest cluster in China, 10,000 GPUs.
- NLNathan Lambert
Before export controls started.
- DPDylan Patel
Yeah. It's like-
- NLNathan Lambert
They've had a huge cluster before any conversation of export controls.
- DPDylan Patel
So then you step it forward to, like, what have they done over the last four years since then? Right? Um, obviously they've continued to operate the hedge fund, probably make tons of money. And the other thing is that they've leaned more and more and more into AI. The CEO, Liang Wenfeng, uh, Liang-
- NLNathan Lambert
You're not putting me on the spot on this. We discussed this before. (laughs)
- DPDylan Patel
Yeah, we were. Liang Wenfeng, right, the CEO-
- NLNathan Lambert
We're all fans. (laughs)
- DPDylan Patel
... he owns maybe... Liang Wenfeng, he owns maybe a little bit more than half the company allegedly, right? Um, is an extremely, like, Elon, Jensen kind of figure where he's just, like, involved in everything, right?
- NLNathan Lambert
Mm-hmm.
- DPDylan Patel
Um, and so over that time period, he's gotten really in-depth into AI. He actually has a bit of a, like a, if you- if you see some of his statements, a bit of an e/acc vibe almost, right?
- NLNathan Lambert
Total AGI vibes and, like, "We need to do this. We need to make a new ecosystem of open AI. We need China to lead on this sort of ecosystem, because historically the Western countries have led on software ecosystems." And he straight up acknowledges, like, "In order to do this, we need to do something different." DeepSeek is his way of doing this. Some of the translated interviews with him are fantastic.
- LFLex Fridman
So he has done interviews?
- NLNathan Lambert
Yeah.
- LFLex Fridman
Do you think he would do a Western interview or no? Or is there controls on the Chinese-
- NLNathan Lambert
There hasn't been one yet, but-
- LFLex Fridman
Okay.
- NLNathan Lambert
... I would try it. (laughs)
- LFLex Fridman
All right. Well, I just got a Chinese translator, so it's great. So, fascinating figure: an engineer pushing full on into AI, leveraging the success from the high-frequency trading.
- NLNathan Lambert
Very direct quotes like, "We will not switch to closed source," when asked about this stuff.
- DPDylan Patel
Oh.
- NLNathan Lambert
He, very long term motivated in how the ecosystem of AI should work. And I think from a Chinese perspective, he wants the Chinese company, a, a Chinese company to build this vision.
- DPDylan Patel
And so this is sort of like the "visionary" behind the company, right? This hedge fund still exists, right? This quantitative firm. And slowly he turned to this full view of, like, AI everything, right? At some point it slowly maneuvered and they made DeepSeek, um, and DeepSeek has done multiple models since then. They've acquired more and more GPUs. They share infrastructure with the fund, right? Um, and so, you know, there is no exact public number of the GPU resources that they have, beyond these 10,000 GPUs that they bought in 2021, right? And they were fantastically profitable, right? And then this paper claims they used only 2,000 H800 GPUs, which are a restricted GPU that was previously allowed in China, but no longer allowed, and there's a new version. But it's basically NVIDIA's H100 for China, right? Um, and there's some restrictions on it, specifically around the communication speed, the interconnect speed, right, which is why they had to do this crazy SM scheduling stuff, right? So going back to that, it's like this is obviously not true in terms of their total GPU count.
- LFLex Fridman
Obviously available GPUs, but for this training run, you think 2,000 is the correct number or no?
- DPDylan Patel
So this is where it takes, um, you know, a significant amount of zoning in, right? Like, what do you call your training run, right? Do you count all of the research and ablations that you ran, right? Picking all this stuff. Because yes, you can do a YOLO run, but at some level you have to do tests at the small scale, and then you have to do some tests at medium scale, before you go to a large scale.
- NLNathan Lambert
Accepted practice is that for any given model that is a notable advancement, you're gonna do 2 to 4x the compute of the full training run in experiments alone.
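Nathan's rule of thumb is easy to put into numbers; the 2-4x multiplier range is from the conversation, and the FLOP count below is a purely hypothetical placeholder.

```python
# Sketch of the "experiments cost 2-4x the final run" rule of thumb.
def total_training_compute(final_run_flops: float,
                           experiment_multiplier: float) -> float:
    """Total = final training run + experiments/ablations at some
    multiple of the run itself."""
    return final_run_flops * (1 + experiment_multiplier)

final_run = 3e24   # hypothetical FLOP count for one full training run
for m in (2, 4):
    print(f"{m}x experiments -> total {total_training_compute(final_run, m):.1e} FLOPs")
```

So a cluster sized only for the headline training run would understate a lab's real compute needs by roughly 3-5x, which is the point about research eating most of the GPUs.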
- LFLex Fridman
So a lot of this compute that's being scaled up is probably used in large part at this time for research.
- 58:57 – 1:09:16
Export controls on GPUs to China
- NLNathan Lambert
They have a lot of compute.
- LFLex Fridman
Can you in general actually just zoom out and also talk about the, the Hopper architecture, the NVIDIA Hopper GPU architecture and the difference between H100 and H800, like you mentioned the interconnects before?
- DPDylan Patel
Yeah, so, you know, Ampere was the A100 and then Hopper is the H100, right? People use them synonymously in the US because really there's just H100 and now there's H200, right? But same thing, uh, mostly. In China, there have been different salvos of export restrictions. So initially the US government limited on a two-factor scale, right, which is chip interconnect versus FLOPS, right? So any chip that had interconnect above a certain level and floating point operations above a certain level was restricted. Uh, later the government realized that this was a flaw in the restriction, and they cut it down to just floating point operations. And so, um-
- NLNathan Lambert
So the H800 had high FLOPS, low communication?
- DPDylan Patel
Exactly, so the H800 was the same performance as H100 on FLOPS, right? But it just had the interconnect bandwidth cut. DeepSeek knew how to utilize this: "Hey, even though we're cut back on the interconnect, we can do all this fancy stuff to figure out how to use the GPU fully anyways," right?
- LFLex Fridman
Mm-hmm.
- DPDylan Patel
And so that was back in October 2022. But later in 2023, end of 2023, implemented in 2024, the US government banned the H800, right? Um, and by the way, this H800 cluster, these 2,000 GPUs, was not even purchased in 2024, right? It was purchased in late 2023. Um-
- LFLex Fridman
Mm-hmm.
- DPDylan Patel
... and they're just getting the model out now, right, because it takes a lot of research, et cetera. Um, the H800 was banned, and now there's a new chip called the H20. Uh, the H20 is cut back on only FLOPS, but the interconnect bandwidth is the same, and in fact in some ways it's better than the H100 because it has better memory bandwidth and memory capacity. So, you know, NVIDIA is working within the constraints of what the government says and builds the best possible GPU for China.
- LFLex Fridman
Can we take an actual tangent here, and we'll return back to the hardware: the philosophy, the motivation, the case for export controls, what is it? Uh, Dario Amodei just published a blog post about export controls. The case he makes is that if AI becomes super powerful, and he says by 2026 we'll have AGI or super powerful AI, that's going to give whoever builds it a significant military advantage. And so, because the United States is a democracy, and as he says, China is authoritarian, or has authoritarian elements, you want a unipolar world where the superpower with super powerful AI is a democracy. It's a much more complicated world geopolitically when you have two superpowers with super powerful AI, and one is authoritarian. So that's the case he makes, and so the United States wants to use export controls to slow down, to make sure that (laughs), uh, China can't do these gigantic training runs that would be presumably required to build AGI.
- NLNathan Lambert
This is very abstract. I think this can be the goal of how some people describe export controls: this super powerful AI. And you touched on the training run idea. There's not many worlds where China cannot train AI models. The export controls are kneecapping the amount of compute, or the density of compute, that China can have. And if you think about the AI ecosystem right now, all of these AI companies' revenue numbers are up and to the right. Their AI usage is just continuing to grow. More GPUs are going to inference. A large part of export controls, if they work, is just that the amount of AI that can be run in China is going to be much lower. So on the training side, DeepSeek V3 is a great example: you have a very focused team that can still get to the frontier of AI. These 2,000 GPUs are not that hard to get, all things considered, in the world. They're still gonna have those GPUs. They're still gonna be able to train models. But if there's gonna be a huge market for AI, and you want to have 100,000-GPU clusters just serving the equivalent of ChatGPT, good export controls also just make it so that AI can be used much less in China. And I think that is a much easier goal to achieve than trying to debate on what AGI is. And if you have these extremely intelligent autonomous AIs in data centers, those are the things that could be running in these GPU clusters in the United States, but not in China.
- DPDylan Patel
To some extent, training a model does effectively nothing, right?
- NLNathan Lambert
Yeah. (laughs)
- DPDylan Patel
Like, they have a model. The thing that Dario is sort of speaking to is the implementation of that model, once trained, to then create huge economic growth, huge increases in military capabilities, huge increases in productivity of people, uh, betterment of lives. Whatever you want to direct super powerful AI towards, you can, but that requires a significant amount of compute, right? And so the US government has effectively said... and forever, right? Training will always be a portion of the total compute. Um, you know, we mentioned Meta's 400,000 GPUs. Only 16,000 made Llama, right? So the percentage that Meta's dedicating to inference... Now this might be for recommendation systems that are trying to hack our minds into spending more time and watching more ads, or for a super powerful AI that's doing productive things. It doesn't matter about the exact use that our economic system decides; it's that that can be delivered in whatever way we want. Whereas with China, right, export restrictions, great. You're never gonna be able to cut everything off, right? And I think that's quite well understood by the US government, uh, that you can't cut everything off. Um, you know-
- NLNathan Lambert
And they'll make their own chips. They're, they're
- DPDylan Patel
And, and they're trying to make their own chips. They'll be worse than ours, but, you know, this is... The whole point is to just keep a gap, right?
- NLNathan Lambert
Yeah.
- DPDylan Patel
Um, and therefore at some point, as the AI... You know, in a world of 2, 3% economic growth, this is really dumb, by the way, right? To cut off high tech and not make money off of it. But in a world where super powerful AI comes about and then starts creating significant changes in society, which is what all the AI leaders and big tech companies believe, I think... Super powerful AI is gonna change society massively, and therefore this compounding effect of the difference in compute is really important. There's some sci-fi out there where AI is measured in how much power is delivered to compute, right? That's sort of a way of thinking about it: the economic output is just how much power you are directing towards that AI.
- NLNathan Lambert
Should we talk about reasoning models with this as a way that this might be actionable, as something that people can actually see? So the reasoning models that are coming out with R1 and o1, they're designed to use more compute. There's a lot of buzzy words in the AI community about this: test-time compute, inference-time compute, whatever. But, um, Dylan has good research on this. You can get to the specific numbers on the ratio of, when you train a model, the amount of compute used at training and the amount of compute used at inference. These reasoning models are making inference way more important to doing complex tasks. In the fall, in December, OpenAI announced this o3 model. There's another thing in AI when things move fast: we get both announcements and releases. Announcements are essentially blog posts where you pat yourself on the back and you say you did things, and releases are when the model's out there, the paper's out there, et cetera. So OpenAI has announced o3, and we can check if o3 mini is out-
- DPDylan Patel
Mm-hmm.
- NLNathan Lambert
... as of recording potentially, but that doesn't really change the point, which is that the breakthrough result was something called the ARC-AGI task, the Abstraction and Reasoning Corpus, a task for artificial general intelligence. Um, François Chollet is the guy behind it. It's a multi-year-old paper. It's a brilliant benchmark, and for OpenAI o3 to solve this, it used some number of samples in the API. The API has, like, thinking effort and number of samples. They used 1,000 samples to solve this task, and it comes out to be, like, $5 to $20 per question. You're putting in, effectively, a math puzzle, and then it takes on the order of dollars to answer one question. And this is a lot of compute. If this is gonna take off in the US, OpenAI needs a ton of GPUs on inference to capture this. They have this, um, OpenAI ChatGPT Pro subscription, which is $200 a month, but they-
- DPDylan Patel
Which Sam said they're losing money on.
- NLNathan Lambert
Which means that people are burning a lot of GPUs on inference, and I've signed up with it. I've played with it. I don't think I'm a power user, but I use it. And it's like, that is the thing that a Chinese company, with medium-strong export controls, there will always be loopholes, might not be able to do at all. And the main result for o3 is also a spectacular coding performance, and if that feeds back into AI companies being able to experiment better.
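The per-question economics Nathan quotes can be sketched with simple arithmetic; the sample count and dollar figures are the rough numbers from the conversation, treated as illustrative.

```python
# Back-of-envelope cost of repeated-sampling inference, per the
# ARC-AGI figures discussed above (1,000 samples, $5-$20 per question).
def cost_per_question(samples: int, cost_per_sample: float) -> float:
    """Total dollars to answer one question with repeated sampling."""
    return samples * cost_per_sample

for per_sample in (0.005, 0.02):
    total = cost_per_question(1000, per_sample)
    print(f"${per_sample:.3f}/sample x 1,000 samples -> ${total:.0f}/question")
```

At dollars per question rather than fractions of a cent, inference compute, not training compute, becomes the binding constraint, which is the export-control argument being made here.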
- LFLex Fridman
So presumably the idea is, for an AGI, a much larger fraction of the compute will be used for this test-time compute, for the reasoning. The AGI goes into a room and thinks about how to take over the world and then, you know, comes back in 2.7 hours-
- NLNathan Lambert
This is what it's gonna-
- LFLex Fridman
... and, and that it's gonna take a lot of compute.
- NLNathan Lambert
This is what people like the CEOs or leaders of OpenAI and Anthropic talk about: autonomous AI models, where you give them a task and they work on it in the background. I think my personal definition of AGI is much simpler. Like, I think language models are a form of AGI, and all this super powerful stuff is a next step that's great if we get these tools. But a language model has so much value in so many domains, it is a general intelligence to me. But this next step of agentic things, where they're independent and they can do tasks that aren't in the training data, is the few-year outlook that these AI companies are driving for.
- LFLex Fridman
I think the terminology here that Dario uses is super powerful AI. So I agree with you on the AGI. I think we already have something like this, exceptionally impressive, that Alan Turing would for sure say is AGI. Uh, but he's referring more to something that, once in possession of it, you would have a significant military and geopolitical advantage over other nations. So it's not just, like, you can ask it how to cook an omelet.
- NLNathan Lambert
And he has a much more positive view in his essay, Machines of Loving Grace.
- LFLex Fridman
Yes. Yeah.
- 1:09:16 – 1:18:41
AGI timeline
- LFLex Fridman
So we're doing a depth-first search here on topics, uh, taking a tangent of a tangent, so let's continue on that depth-first search. You said that you're both feeling the AGI.
- NLNathan Lambert
(laughs)
- LFLex Fridman
So what's your timeline? Dario's 2026 for the super powerful AI that's, you know, basically agentic to a degree where it's a real security threat.
- NLNathan Lambert
(laughs)
- LFLex Fridman
That level of AGI. What's your, what's your timeline?
- NLNathan Lambert
I don't like to attribute specific abilities, because predicting specific abilities, and when, is very hard. Mostly, if I'm going to say that I'm feeling the AGI, it's that I expect continued rapid, surprising progress over the next few years. So something like R1 is less surprising to me from DeepSeek, because I expect there to be new paradigms where substantial progress can be made. I think DeepSeek R1 is so unsettling because we're kind of on this path with ChatGPT. It's like, it's getting better, it's getting better, it's getting better. And then we have a new direction for changing the models, and we took one step like this, and we took a step up. So it looks like a really fast slope, and then we're gonna just take more steps. So it's really unsettling when you have these big steps, and I expect that to keep happening. I've tried OpenAI Operator, I've tried Claude computer use, and they're not there yet. I understand the idea, but it's just so hard to predict what is the breakthrough that'll make something like that work. And I think it's more likely that we have breakthroughs that work in things that we don't know what they're gonna do. So everyone wants agents. Dario has a very eloquent way of describing this, and I just think that there's gonna be more than that, so just expect these things to come.
- LFLex Fridman
I'm gonna have to try to pin you down to a date on the AGI timeline. Uh, (laughs) like, the nuclear weapon moment, so a moment where, on the geopolitical stage, there's a real, like, you know... 'cause we're talking about export controls. When do you think, just even to throw out a date, when do you think that would be? Like, for me, it's probably after 2030. So I'm not as-
- NLNathan Lambert
That's what I would say.
- DPDylan Patel
So, so define that, right? Because to me it kind of almost has already happened, right? You look at elections in India and Pakistan; people get AI voice calls and think they're talking to the politician, right? The AI diffusion rules, which were enacted in the last couple weeks of the Biden admin, and which it looks like the Trump admin will keep and potentially even strengthen, limit cloud computing and GPU sales to countries that are not even related to China. It's like this is, this-
- NLNathan Lambert
Portugal and all these, like, normal countries are on the-
- DPDylan Patel
Yeah. It's like-
- NLNathan Lambert
... you need approval from the US list.
- DPDylan Patel
Like, yeah, Portugal and, like, you know, all these countries that are allies, right? Singapore, right? Like, they freaking have F-35s and we don't let them buy GPUs.
- NLNathan Lambert
Mm-hmm.
- DPDylan Patel
Like this is, this to me is already to the scale of, like, you know...
- LFLex Fridman
Well, that just means that, uh, the US military is really nervous about this new technology. That doesn't mean the technology is already there. So, like, they might just be very cautious about this thing that they don't quite understand. But that's a really good point: sort of the robocalls. Swarms of semi-intelligent bots could be a weapon, could be doing a lot of social engineering.
- DPDylan Patel
I mean, there's tons of talk about, you know, from the 2016 elections, like, Cambridge Analytica and all this stuff, Russian influence. I mean, every country in the world is pushing stuff onto the internet and has narratives they want, right? Like, every technically competent country, whether it's Russia, China, the US, Israel, et cetera, right? You know, people are pushing viewpoints onto the internet en masse. And language models crash the cost of, like, very intelligent-sounding language.
- NLNathan Lambert
There's some research that shows that distribution is actually the limiting factor. So language models haven't yet really changed the misinformation equation; the internet is still ongoing. There's a blog, AI Snake Oil, from some of my friends at Princeton who write on this stuff, so there is research. It's a default that everyone assumes, and I would have thought the same thing, that misinformation is gonna get far worse with language models. But in terms of internet posts and things that people have been measuring, it hasn't been an exponential increase or anything extremely measurable. And the things you're talking about, with voice calls and stuff like that, could be in modalities that are harder to measure. So it's too soon to tell. Political instability via the web is monitored by a lot of researchers to see what's happening. As for the AGI thing you're asking about: if you make me give a year, I would be like, "Okay, I have AI CEOs saying this, they've been saying two years for a while."
- DPDylan Patel
Mm-hmm.
- NLNathan Lambert
I think there are people like Dario, Anthropic's CEO, who have thought about this so deeply. I need to take their work seriously, but also understand that they have different (laughs) incentives. So I feel like you add a few years to that, which is how you get something similar to 2030 or a little after 2030.
- DPDylan Patel
I think to some extent we'll have capabilities that hit a certain point where any one person could say, "Oh, okay, if I can leverage those capabilities for X amount of time, this is AGI," right? Call it '27, '28. But then the cost of actually operating that capability-
- NLNathan Lambert
Yeah, this is gonna be my point.
- DPDylan Patel
... is so, so extreme that no one can actually deploy it at scale en masse to actually completely revolutionize the economy on a click and a snap of a finger. So I don't think it will be like a snap of the finger moment-
- NLNathan Lambert
Yeah. It's the physical constraint timeline.
- DPDylan Patel
Rather it'll be a, you know, "Oh, the capabilities are here, but I can't deploy it everywhere," right? And so one, one simple example going back sort of to 2023 was when, uh, you know, Bing with GPT-4 came out and everyone was freaking out about search, right?
- NLNathan Lambert
(laughs) Oh, my gosh.
- DPDylan Patel
Perplexity came out.
- NLNathan Lambert
(laughs)
- DPDylan Patel
If you did the cost on, like, "Hey, implementing GPT-3 into every Google search," it was like, "Oh, okay, this is just physically impossible to implement," right? And as we step forward, going back to the test-time compute thing, right? You ask ChatGPT a question, it costs cents, right, for their most capable chat model, to get a query back. To solve an ARC-AGI problem, though, costs five to 20 bucks, right? And this is, this is in a, you know-
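Dylan's deployment-cost point can be made concrete with a quick back-of-envelope calculation. The sketch below is illustrative only: the daily search volume and the one-cent chat-query price are assumptions for the sake of the arithmetic, not figures from the conversation; the $5–20 ARC-AGI figure is the range Dylan quotes.

```python
# Back-of-envelope inference-cost math (all numbers are illustrative
# assumptions, not measured figures).

searches_per_day = 8.5e9      # rough public estimate of daily Google searches
cost_per_chat_query = 0.01    # assume ~1 cent per chat-style LLM query

# Cost of bolting an LLM onto every single search query:
daily_cost = searches_per_day * cost_per_chat_query    # $85 million per day
yearly_cost = daily_cost * 365                         # ~$31 billion per year

# Test-time-compute models: ARC-AGI problems were quoted at $5-20 each.
arc_agi_cost_low = 5.0
ratio = arc_agi_cost_low / cost_per_chat_query  # reasoning vs. chat, low end

print(f"Daily cost:  ${daily_cost:,.0f}")
print(f"Yearly cost: ${yearly_cost:,.0f}")
print(f"A reasoning query is {ratio:.0f}x a chat query (at the low end)")
```

Even at a cent per query, the annual bill rivals the capex of the largest cloud providers, which is the sense in which a capability can exist but be impossible to deploy "on a click and a snap of a finger."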
Episode duration: 5:06:18