Lex Fridman Podcast

Raschka & Lambert on Lex Fridman: Why Post-Training Won 2025

RLVR and inference-time scaling, not architecture, drove 2025 AI gains. DeepSeek's open-weight releases showed frontier performance need not be closed-source.

Lex Fridman (host) · Sebastian Raschka (guest) · Nathan Lambert (guest)
Jan 31, 2026 · 4h 25m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00 – 1:57

    Introduction

    1. LF

      The following is a conversation all about the state-of-the-art in artificial intelligence, including some of the exciting technical breakthroughs and developments in AI that happened over the past year, and some of the interesting things we think might happen this upcoming year. At times, it does get super technical, but we do try to make sure that it remains accessible to folks outside the field without ever dumbing it down. It is a great honor and pleasure to be able to do this kind of episode with two of my favorite people in the AI community, Sebastian Raschka and Nathan Lambert. They are both widely respected machine learning researchers and engineers who also happen to be great communicators, educators, writers and Twitterers, X posters. Sebastian is the author of two books I highly recommend for beginners and experts alike. First is Build a Large Language Model From Scratch and Build a Reasoning Model From Scratch. I truly believe in the machine learning computer science world, the best way to learn and understand something is to build it yourself from scratch. Nathan is the post-training lead at the Allen Institute for AI and author of the definitive book on reinforcement learning from human feedback. Both of them have great X accounts, great Substacks. Sebastian has courses on YouTube, Nathan has a podcast, and everyone should absolutely follow all of those. This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description, where you can also find links to contact me, ask questions, give feedback, and so on. And now, dear friends, here's Sebastian Raschka and Nathan Lambert.

  2. 1:57 – 10:38

    China vs US: Who wins the AI race?

    1. LF

      So I think, uh, one useful lens to look at all of this through is the DeepSeek, so-called DeepSeek moment. This happened about a year ago, in January 2025, when the open-weight Chinese company, DeepSeek, released DeepSeek-R1 that, uh, I think it's fair to say, surprised everyone with, uh, near or at state-of-the-art performance, with allegedly much less compute for much cheaper. And from then to today, the AI competition has gotten insane, both on the research level and the product level. It's just been accelerating. Let's discuss all of this today, and maybe let's start with some spicy questions if we can. [chuckles] Uh, who is winning at the international level? Would you say it's the set of companies in China or the set of companies in the United States? And Sebastian, Nathan, it's good to see you guys. Uh, so Sebastian, who do you think is winning?

    2. SR

      Um, so winning is [chuckles] a very broad, uh, you know, term. I, I would say you mentioned the DeepSeek moment, and I do think DeepSeek is definitely winning the hearts of the people who work on open-weight models because they share these as open models. Um, winning, I think, has multiple timescales to it. We have today, we have next year, we have in ten years. One thing I know for sure is that, um, I don't think nowadays, 2026, that there will be any company who is, let's say, having access to a technology that no other company has access to. And that is mainly because researchers are frequently changing jobs, changing labs, they, uh, rotate. So I don't think there will be a clear winner in terms of technology access. However, I do think there will be, uh, the differentiating factor will be budget and hardware constraints. So I don't think the ideas will be proprietary, but the way or the resources that are needed to implement them. And so I don't see currently a take-it-all scenario where a winner takes it all. I, I can't see that at the moment.

    3. LF

      Uh, Nathan, what do you think?

    4. NL

      You see the labs put different energy into what they're trying to do, and I think to demarcate the point in time when we're recording this, um, the hype over Anthropic's Claude Opus 4.5 model has been absolutely insane. Which is just-- I mean, I've used it and built stuff in the last few weeks, and it's, uh, it's almost gotten to the point where it feels like a bit of a meme in terms of the hype. And it's kind of funny because this is very organic, and if we go back a bit, Gemini 3 from Google got released, and it seemed like the marketing and just, like, wow factor of that release was super high. But then at the end of November, Claude Opus 4.5 was released, and the hype has been growing. But Gemini 3 was before this, and it kind of feels like people don't really talk about it as much, even though when it came out, everybody was like, "This is, um, Gemini's moment to retake kind of Google's structural advantages in AI." And Gemini 3 is a fantastic model, and I still use it; it's just that the differentiation is lower. And I agree with Sebastian, what you're saying with all these-- like, the idea space is very fluid, but, um, culturally, Anthropic is known for betting very hard on code, and this Claude Code thing is working out for them right now. So I think that even if the ideas flow pretty freely, so much of this is bottlenecked by human effort and kind of the culture of organizations, where Anthropic seems to at least be presenting as the least chaotic, which is a bit of an advantage, if they can keep doing that for a while... But on the other side of things, there's a lot of ominous technology from China, where there are way more labs than DeepSeek. So DeepSeek kicked off a movement within China, I'd say kind of similar to how ChatGPT kicked off a movement in the US where everything had a chatbot.
There's now tons of tech companies in China that are releasing very strong frontier open-weight models, to the point where I would say that DeepSeek is kind of losing its crown as the preeminent open model maker in China, and the likes of, um, Z.AI with their GLM models, MiniMax's models, um, Kimi Moonshot, especially in the last few months, have shone more brightly. The new DeepSeek models are still very strong, but this could be looked back on as a big narrative point, where in 2025, DeepSeek came and kind of provided this platform for way more Chinese companies that are releasing these fantastic models to kind of have this new type of operation. So these models from these Chinese companies are open weights, and depending on this trajectory, the business models that these American companies are doing could be at risk, but currently, a lot of people are paying for AI software in the US, and historically, in China and other parts of the world, people don't pay a lot for software.

    5. LF

      So some of these models, like DeepSeek, uh, have the love of the people because they are open weight. Uh, how long do you think, uh, the Chinese companies keep releasing open-weight models?

    6. NL

      I would say for a few years. I think that, like in the US, there's not a clear business model for it. I have been writing about open models for a while, and these Chinese companies have realized it, so I get inbound from some of them. And they're smart and realize the same constraints, which is that a lot of US tech companies and other IT companies won't pay for an API subscription to Chinese companies over security concerns. This has been a long-standing, um, habit in tech, and the people at these companies then see open-weight models as a way to influence and take part in a huge, growing AI expenditure market in the US, and they're very realistic about this, and it's working for them. And I think that the government will see that that is building a lot of influence internationally in terms of uptake of the technology. So there's gonna be a lot of incentives to keep it going, but building these models and doing the research is very expensive. So at some point, I expect consolidation, but I don't expect that to be a story of 2026; there will be more open model builders throughout 2026 than there were in 2025, and a lot of the notable ones will be in China.

    7. LF

      You were gonna say something?

    8. SR

      Um, yes. You mentioned DeepSeek losing its crown. I do think to some extent, yes, but we also have to consider, though, they are still, I would say, slightly ahead, and it's not that DeepSeek got worse, it's just, like, the other ones are using the ideas from DeepSeek. For example, you mentioned Kimi: same architecture. They're training it, and then again, we have this leapfrogging, where they might be, at some point in time, a bit better because they have the more recent model. And I think this comes back to, um, the fact that there won't be a clear winner. It will just be like that: one person releases something, the other one comes in, and the most recent model is probably always the best model.

    9. NL

      Yeah. We'll also see that Chinese companies have different incentives. So, like, DeepSeek is very secretive, where some of these start-ups are, like the MiniMaxes and Z.AI's of the world, those two literally have filed IPO paperwork, and they're trying to get Western mindshare and do a lot of outreach there. So I don't know if these incentives will kind of change the model development, 'cause DeepSeek famously is built by a hedge fund-

    10. SR

      Mm-hmm

    11. NL

      ... High-Flyer Capital, and we don't know exactly what they [chuckles] like, we don't know what they use the models for or if they care about this.

    12. LF

      They're secret in terms of communication. They're not secret in terms of the technical reports-

    13. NL

      Mm

    14. LF

      ... that describe how their models work. They're still open on that front.

    15. NL

      Mm.

    16. LF

      And we should also say, on the Opus 4.5 hype, there's the layer of, uh, something being the darling of the X echo chamber, the Twitter echo chamber, and the actual amount of people that are using the model. I think it's probably fair to say that ChatGPT and Gemini are focused on the broad user base that just wants to solve problems in their daily lives, and that user base is gigantic. So the hype about the coding may not be representative of the actual use.

    17. SR

      I would say also, um, a lot of the usage patterns are, like you said, name recognition, brand, uh, and stuff, but also muscle memory almost, where, um, you know, like, ChatGPT has been around for a long time. People just got used to using it, and it's almost like a flywheel: they recommend it to other users and so on. One interesting point is also the customization of, uh, LLMs. For example, ChatGPT has a memory feature, right? And so you may have a subscription, and you use it for personal stuff, but I don't know if you want to use that same thing at work, you know, because that's a boundary between private and work. If you're working at a company, they might not allow that, or you may not want that. And I think that's also an interesting point, where you might have multiple subscriptions. One is just clean. It has nothing of your personal images or hobby projects in there. It's just, like, the work thing, and then the other one is your personal thing. So I think those are two different use cases, and it doesn't mean you only have to have one. I think the future is also multiple ones.

  3. 10:38 – 21:38

    ChatGPT vs Claude vs Gemini vs Grok: Who is winning?

    1. LF

      What model do you think won twenty twenty-five, and what model do you think is gonna win 'twenty-six?

    2. NL

      I think in the context of consumer chatbots, it's a question of: are you willing to bet on Gemini over ChatGPT?

    3. LF

      Mm-hmm.

    4. NL

      Which I would say in my gut feels like a bit of a risky bet, because OpenAI has been the incumbent, and there's so many benefits to that in tech. But I think the momentum, if you look at twenty twenty-five, was on Gemini's side, but they were starting from such a low point [chuckles] , I think, um, RIP Bard, and these earlier attempts, uh, at getting started. I think huge credit to them for powering through the organizational chaos to make that happen. But also, it's hard to bet against OpenAI because they always come off as so chaotic, but they're very good at landing things.

    5. LF

      Mm.

    6. NL

      And I think, like, personally, I have very mixed reviews of GPT-5, but it had to have saved them so much money, with the headline feature being a router, where most users are no longer, like, incurring as much GPU cost. So I think it's very hard to dissociate the things that I like out of models versus the things that are gonna actually be a general public differentiator.

    7. LF

      What do you think about twenty twenty-six? Who's gonna win?

    8. NL

      I'll say something, even though it's risky. I will say that I think Gemini will continue to gain on ChatGPT. I think Google's scale-- when both of these are operating at such extreme scales, Google has the ability to separate research and product a bit better, where you hear so much about OpenAI being chaotic operationally and chasing the high-impact thing, which is a very start-up culture. And then on the software and enterprise side, I think Anthropic will have continued success, as they've again and again been set up for that. And obviously, Google Cloud has a lot of offerings, but I think this kind of, like, Gemini name brand is important for them to build, and Google Cloud will continue to do well. But that's kind of a more complex thing to explain in the ecosystem because that's competing with the likes of Azure and AWS rather than on the model provider side.

    9. LF

      So in infrastructure, you think TPUs give an advantage?

    10. NL

      ... largely because the margin on NVIDIA chips is insane, and Google can develop everything from top to bottom to fit their stack and not have to pay that margin, and they've had a head start in building data centers. So on all of these things that have both long lead times and very hard margins on high costs, Google has just kind of a historical advantage. And, uh, if there's gonna be a new paradigm, it's most likely to come from OpenAI, where their research division, again and again, has shown this ability to land a new research idea or a product. I think, like, deep research, Sora, o1 thinking models-- all these definitional things have come from OpenAI, and that's gotta be one of their top traits as an organization. So it's kind of hard to bet against that, but I think a lot of this year will be about scale and optimizing what could be described as low-hanging fruit in models.

    11. LF

      And clearly, there's a trade-off between intelligence and speed. This is what GPT-5 was trying to solve behind the scenes. It's like, do people actually want intelligence, the broad public, or do they want speed?

    12. SR

      I think it's a nice variety, actually, or the option to, uh, have a toggle there. I mean, first, for my personal usage, most of the time when I look something up, I use ChatGPT to ask a quick question, get the information I wanted fast. For, you know, most daily tasks, I use the quick model. Nowadays, I think the auto mode is pretty good, where you don't have to specifically say thinking or, you know, non-thinking and stuff. Then again, I also sometimes want the pro mode. Very often, what I do is when I have something written, I put it into a ChatGPT and say, "Hey, do a very thorough check. Is-- are all my references correct? Are all my thoughts correct? Uh, did I make any formatting mistakes, and are the figure numbers wrong?" Or something like that. And I don't need that right away. It's something, okay, I finish my stuff, maybe have dinner, let it run, come back, and go through this [chuckles] and I think-- see, this is where I think it's important to have this option. I would go crazy if for each query, I would have to wait thirty minutes or ten minutes even. [laughing]

    13. NL

      That's me.

    14. SR

      Yeah. [laughing]

    15. NL

      Um, I'm, like, sitting over here losing my mind that you use the router and the non-thinking model. I'm like, oh, how do you-- how do you live with-- how do you live with that?

    16. LF

      Yeah.

    17. NL

      That's, like, my reaction. I've been heavily on ChatGPT for a while. Um, never touched 5 non-thinking. I find its tone off, and it just, like, has a higher likelihood of errors. Some of this is from back when OpenAI released o3, which was the first model to do this deep search and find many sources and integrate them for you. So I became habituated with that. So I will only use GPT-5.2 thinking or pro when I'm doing any sort of information query for work, whether that's a paper or some code reference that I found, and I will regularly have, like, five pro queries going simultaneously, each looking for one specific paper or feedback on an equation or something.

    18. SR

      I have a fun example of where I just needed the answer as fast as possible. For this podcast, before [chuckles] I was going on the trip, um, I have, like, a local GPU running at home, and I wanted to run a long, uh, RL experiment. And usually, I also unplug things, because you never know-- if you're not at home, you don't want to have things plugged in, and I accidentally unplugged the, the GPU. [chuckles] My wife was already in the car, and it's like, "Oh, dang!" And then basically, I wanted, as fast as possible, a Bash script that runs my different, uh, experiments and the evaluation. And it's something I know. I learned how to use the Bash, uh, interface, the Bash terminal, but in that moment, I just needed it in, like, ten seconds. Give me the command.

    19. LF

      This is a hilarious situation, but yeah. So what did you use?

    20. SR

      Uh, so I used the non-thinking, fastest model. It gave me the Bash, uh, command to chain different, uh, scripts to each other, and then the thing is, like, you have the tee thing where you want to route this to a log file. Off the top of my head-- I was just, like, in a hurry. I could have thought about it myself.

    21. LF

      By the way, I don't know if this is a representative case, wife waiting in the car.

    22. SR

      [laughing]

    23. LF

      You have to run, you unplug the GPU.

    24. SR

      [laughing]

    25. LF

      You have to generate a Bash script.

    26. SR

      Yeah.

    27. LF

      This sounds like a movie-

    28. NL

      I, I use-

    29. LF

      ... like Mission Impossible.

    30. NL

      I use Gemini for that. So I use thinking for all the information stuff, and then Gemini for fast things or stuff that I could sometimes Google, which is, like, it's good at explaining things, and I trust that it has this kind of background of knowledge, and it's simple. And the Gemini app has gotten a lot better-
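For the curious, the Bash pattern Sebastian describes earlier, chaining experiment scripts and routing their output to a log file with tee, looks roughly like this. The experiment scripts themselves aren't shown in the conversation, so they are replaced here by hypothetical stand-in shell functions so the sketch runs on its own:

```shell
#!/usr/bin/env bash
# Sketch: chain experiment stages with && (stop if one fails) and use
# tee -a to append everything to a log file while still watching live.
set -e
LOG=experiments.log

# Stand-ins for the actual training/evaluation scripts (hypothetical):
run_experiment() { echo "experiment $1 done"; }
run_eval()       { echo "evaluation done"; }

{ run_experiment 1 && run_experiment 2 && run_eval; } 2>&1 | tee -a "$LOG"
```

With real scripts, the grouped commands would just be `./train_1.sh && ./train_2.sh && ./eval.sh`; the `2>&1` ensures error output lands in the log as well.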

  4. 21:38 – 28:29

    Best AI for coding

    1. LF

      Uh, we didn't really mention programming. That's another use case that a lot of people deeply care about. So I use basically half-and-half Cursor and Claude Code because they're-- I find them to be, like, fundamentally different experience and both useful. Uh, what do you guys... You program quite a bit-

    2. SR

      Mm-hmm.

    3. LF

      -so what, what do you use? What's the current vibe?

    4. SR

      So I use the Codex plugin for VS Code. Uh, you know, it's very convenient. It's just like a plugin, and then it's a chat interface that has access to your repository. I know that Claude Code is, I think, a bit different. It is a bit more agentic. It touches more things. It does a whole project for you. I'm not quite there yet, where I'm comfortable with that because, ah, maybe I'm a control freak, but I still [chuckles] would like to see a bit what's going on, and Codex is kind of like, right now, for me, like, the sweet spot where it is helping me, but it is not taking completely over.

    5. LF

      I should mention, one of the reasons I do use Claude Code is to build the skill of programming with English. I mean, the experience is fundamentally different. As opposed to micromanaging the details of the code-generation process and, uh, looking at the diff, which you can do in Cursor, if that's the IDE you use, and changing, altering, and reading and understanding the code deeply as you progress, you're just kind of thinking in this design space and guiding it at this, uh, macro level, which I think, uh, is another way of thinking about the programming process. Also, we should say that Claude Code just seems to be somehow a better utilization of Claude Opus 4.5.

    6. NL

      It's a good side by side for people to do. So you can have Claude Code open, you can have Cursor open, and you can have VS Code open, and you can select the same models on all of them-

    7. LF

      Mm-hmm

    8. NL

      ... and ask questions, and it's very interesting, like the, like, Claude Code is way better in that domain.

    9. LF

      Yeah.

    10. NL

      It's remarkable.

    11. LF

      All right, we should say that both of you are, are legit on multiple fronts: researchers, programmers, educators, tweeters. [laughing] And on the book front, too. So Nathan, at some point soon, hopefully, has an RLHF book coming out.

    12. NL

      It's available for pre-order, and there's a full digital preprint, just making it pretty and better organized for the physical thing, which is a lot of why I do it, because it's fun to create things that you think are excellent in the physical form when so much of our life is digital.

    13. LF

      I should say, going to Perplexity here, Sebastian Raschka is a machine learning researcher and author known for several influential books. A couple of them that I wanted to mention, which is a book I highly recommend, Build a Large Language Model From Scratch, and the new one, Build a Reasoning Model From Scratch. So I'm really excited about that. Building stuff from scratch is one of the most powerful ways of learning.

    14. SR

      Honestly, building an LLM from scratch is a lot of fun. It's also a lot to learn, and like you said, it's probably the best way to learn how something really works. 'Cause you can look at figures, but figures can have mistakes. You can look at concepts, explanations, but you might misunderstand them. But if you see the code and the code works, you know it's correct. I mean, there's no misunderstanding. It's precise, otherwise it wouldn't work. And I think that's kind of the beauty behind coding. It doesn't lie; it's math, basically. Even with math, I think you can have mistakes in a book and you would never notice, because you're not running the math when you are reading the book, so you can't verify it. With code, what's nice is you can verify it.

    15. LF

      Yeah, I agree with you about the LLM From Scratch book. It's nice to tune out everything else, the internet and so on, and just focus on the book. But, you know, I've read, uh, several, like, you know, uh, history books this way, and it's just less lonely somehow. It's really more fun. Like, uh, for example, on the programming front, I think it's genuinely more fun to program with an LLM.

    16. SR

      Mm.

    17. LF

      And I think it's genuinely more fun to read with an LLM.

    18. SR

      Mm-hmm.

    19. LF

      But you're right, like, this distraction should be minimized. So it's, uh, you use the LLM to basically enrich the experience, maybe add more context.

    20. SR

      Mm.

    21. LF

      Maybe the... I just, the rate of aha moments for me in a small scale is really high with LLMs.

    22. SR

      Hundred percent. I also want to correct myself: I'm not suggesting not to use LLMs. Uh, I suggest doing it in multiple passes, like one pass, just offline focus mode, and then after that-- I mean, I also take notes, but I try to resist the urge to immediately look things up. I do a second pass. It's just, for me, more structured this way. I mean, sometimes things are answered in the chapter, but sometimes also it just helps to let it sink in and think about it. Other people have different preferences. I would highly recommend using LLMs when reading books. For me, it's just not the first thing to do; it's, like, the second pass.

    23. LF

      By way of recommendation, as you said, I do the opposite. I like to use the LLM at the beginning-

    24. SR

      Mm

    25. LF

      ... to lay out the full context of, like, what is this world that I'm now stepping into? But I try to avoid clicking out of the LLM into the world of, like, Twitter and blogs, because then you're down this rabbit hole, you're reading somebody's opinion, there's a flame war about a particular topic, and all of a sudden, you're in the realm of the internet and Reddit and so on. But if you're purely letting the LLM give you the context of why this matters, what the big-picture ideas are... Sometimes books themselves are good at doing that, but not always.

    26. NL

      This is why I like the ChatGPT app. It gives the AI a home on your computer where you can focus on it, rather than just being another tab in my mess of internet options. And I think Claude Code in particular does a good job of making that a joy, where it seems very engaging as a product, designed to be an interface from which your AI will then go out into the world. And there's something very kind of intangible between it and Codex, which is that it just feels kind of warm and engaging, where Codex from OpenAI can often be as good, but it just kind of feels a little bit rougher on the edges. Whereas, like, Claude Code makes it fun to build things, particularly from scratch, where you just don't-- like, you don't have to care, but you trust that it'll make something. Like, obviously, this is good for websites and kind of refreshing tooling and stuff like this, which I use it for, or data analysis. So for my blog, we scrape Hugging Face, so we keep the download numbers for every dataset and model over time now, so we have them. And Claude was just like, "Yeah, I've made use of that data, no problem." And I was like, "That would've taken me days." And then I have enough situational awareness to be like, okay, these trends obviously make sense, and you can check things. 'Cause that's just the kind of wonderful interface where you can have an intermediary and not have to do the kind of awful low-level work that you would have to do to maintain different web projects.

  5. 28:29 – 40:08

    Open Source vs Closed Source LLMs

    1. LF

      All right, so we just talked about a bunch of the closed-weight models. Let's talk about the open ones. Uh, so tell me about the landscape of open LLM models. Which are interesting ones, which stand out to you, and why? We already mentioned DeepSeek.

    2. NL

      Do you want to see how many we can name off the top of our head?

    3. LF

      Yeah, yeah, without looking at notes.

    4. NL

      DeepSeek, Kimi, MiniMax, Z.ai, Ant Ling. [chuckles] We're just going Chinese. [laughing]

    5. SR

      [chuckles] Let's throw in Mistral AI, Gemma, um-

    6. NL

      Yeah

    7. SR

      ... GPT-OSS, the open-weight model by, uh, OpenAI. Actually, NVIDIA Nemotron had a-- or NVIDIA had a very cool one, a Nemotron-3. Um, there's a lot of stuff, especially at the end of the year. Qwen, maybe the one-

    8. NL

      Oh, yeah, Qwen was the na- the obvious name I was missing. I was trying to get through the... You can get at least ten Chinese and at least [chuckles] ten Western. I think that, I mean, OpenAI released their first open model-

    9. SR

      Mm, a long time

    10. NL

      ... since GPT-2. When I was writing about OpenAI's open model release, people were all like, "Don't forget about GPT-2," which I thought was really funny [chuckles] because it's just such a different time. But GPT-OSS is actually a very strong model and does some things that the other models don't do very well. And I think that, selfishly, I'll promote a bunch of, like, Western companies. So both the US and Europe have these, like, fully open models. So I work at the Allen Institute for AI, where we've been building Olmo, which releases data and code and all of this, and now we have actual competition from people that are trying to release everything so that other people can train these models. So there's the Institute for Foundation Models / LLM360, which has had their K2 models of various types. Apertus is a Swiss research consortium. Hugging Face, um, has SmolLM, which is very popular, um, and NVIDIA's Nemotron has started releasing data as well. And then there's Stanford's Marin community project, which is kind of making it so there's a pipeline for people to open a GitHub issue and implement a new idea, and then have it run in a stable language modeling stack. So this space-- that list was way smaller in 2024-

    11. LF

      Mm

    12. NL

      ... so I think it was, like, just Ai2. So that's a great thing for more people to get involved in to understand language models, and it doesn't really have a Chinese analog. While I'm talking, I'll say that the Chinese open language models tend to be much bigger, and that gives them this higher peak performance as MoEs, where a lot of the things we like a lot, whether it was Gemma, um, or Nemotron, have tended to be smaller models from the US, which is starting to change in the US and Europe. Uh, Mistral Large 3 came out in December, which was a giant MoE model, very similar to the DeepSeek architecture. And then a start-up, Arcee AI, and NVIDIA's Nemotron have teased MoE models way bigger than one hundred billion parameters-

    13. LF

      Mm-hmm.

    14. NL

      -like this four-hundred-billion-parameter range, coming in this, like, Q1 2026 timeline. So I think this kind of balance is set to change this year in terms of what people are using the Chinese versus US open models for, which I'm personally gonna be very excited to watch.

    15. LF

      First of all, huge props for being [chuckles] able to name so many of these.

    16. NL

      [chuckles]

    17. LF

      Did you actually name Llama?

    18. NL

      Um, no.

    19. LF

      I feel like- [chuckles]

    20. NL

      RIP.

    21. SR

      This was not on purpose. [chuckles]

    22. LF

      RIP, Llama.

    23. NL

      Mm-hmm.

    24. LF

      All right. Can you mention what are some interesting models that stand out? You mentioned Qwen 3, which is obviously a standout.

    25. SR

      So I would say the year is almost bookended by both, uh, DeepSeek version three and R1, and then on the other hand, in December, uh, DeepSeek version three point two. Because what I like about those is they always have an interesting architecture tweak-

    26. LF

      Mm-hmm

    27. SR

      ... that others don't have. But otherwise, if you want to go with, um, you know, like, the familiar but really good performance, Qwen 3, and, like Nathan said, also GPT-OSS. And I think what's interesting about GPT-OSS is it's kind of the first public, or, like, open-weight model that was really trained with tool use in mind, which I do think is kind of a little bit of a paradigm shift, where the ecosystem was not quite ready for it. So by tool use, I mean that the LLM is able to do a web search, to call a Python interpreter.

    28. LF

      Mm-hmm.

    29. SR

      And I do think this, uh, it's a standout because I think it's a huge unlock, because, um, one of the most, uh, common complaints about LLMs are, for example, hallucinations, right?

    30. LF

      Mm-hmm.

  6. 40:0848:05

    Transformers: Evolution of LLMs since 2019

    1. LF

      And then maybe is it useful to step back and talk about transformer architecture in general?

    2. SR

      Yeah. So maybe we should start with the GPT-2 architecture, the transformer that was derived from the Attention Is All You Need paper.

    3. LF

      Mm-hmm.

    4. SR

      So the Attention Is All You Need paper had a transformer architecture that had two parts, an encoder and a decoder, and GPT focused just on the decoder part. It is essentially, uh, still a neural network, um, and it has this attention mechanism inside, and you predict one token at a time. You pass it through an embedding layer. There's the transformer block. The transformer block has attention modules and a fully connected layer, and there are some normalization layers in between, but it's essentially neural network layers with this attention mechanism. So coming from GPT-2, uh, when we move on to GPT-OSS, there is, for example, the Mixture of Experts, um, layer. It's not invented by GPT-OSS, it's a few years old, um, but it is essentially a, a tweak to make the model larger without consuming more compute in each forward pass. So there is this, uh, fully connected layer, and if listeners are familiar with, um, multilayer perceptrons, you can think of a mini multilayer perceptron, a fully connected neural network layer inside the transformer, and it's very expensive because it's fully connected. If you have a thousand inputs, a thousand outputs, that's like a million connections, and it's a very expensive part of this transformer. And the idea is to kind of expand that into multiple feedforward networks. So instead of having one, let's say you have two hundred and fifty-six. That would make it way more expensive, except you don't use all of them at the same time. So you now have a router that says, "Okay, based on this input token, it would be useful to use this, um, fully connected network." And in that context, it's called an expert. So a Mixture of Experts means you have multiple experts, and depending on what your input is, uh, let's say it's more math-heavy, it would use different experts compared to, let's say, translating input text from English to Spanish.
      It would maybe consult different experts. It's not quite as clear-cut to say, "Okay, this is only an expert for math, and this one for Spanish," it's a bit more fuzzy. But the idea is essentially that you pack more knowledge into the network, but not all the knowledge is used all the time.
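The routing Sebastian describes can be sketched in a few lines of NumPy. This is a toy stand-in, not GPT-OSS's actual configuration: the layer sizes, expert count, and top-k value here are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2  # toy sizes, purely illustrative

# Each "expert" is a tiny fully connected layer: d_model -> d_model.
experts = [rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(n_experts)]
router = rng.normal(size=(d_model, n_experts))  # scores each token per expert

def moe_forward(x):
    """Route each token to its top-k experts; only those experts run."""
    scores = x @ router                             # (tokens, n_experts)
    top = np.argsort(scores, axis=-1)[:, -top_k:]   # indices of chosen experts
    # softmax over just the selected experts' scores -> mixing weights
    sel = np.take_along_axis(scores, top, axis=-1)
    w = np.exp(sel - sel.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                     # per token
        for j, e in enumerate(top[t]):
            out[t] += w[t, j] * (x[t] @ experts[e])
    return out, top

tokens = rng.normal(size=(5, d_model))
out, chosen = moe_forward(tokens)
print(out.shape, chosen.shape)  # (5, 8) (5, 2): only 2 of 4 experts run per token
```

The point of the sketch is the sparsity: all four experts hold parameters, but each token only pays the compute cost of two of them.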

    5. LF

      Mm-hmm.

    6. SR

      That would be very wasteful. So you are kind of, like, during the token generation, more selective. There's a router that selects which tokens should go to which expert. It adds more complexity, it's harder to train. There's a lot, you know, that can go wrong, like expert collapse and everything. So I think that's why Olmo 3 still uses, uh, dense. I mean, there are, I think, Olmo models with Mixture of Experts, but dense models, where dense means... Uh, so also, it's jargon. There's a distinction between dense and sparse. Mixture of Experts is considered sparse because we have a lot of experts, but only a few of them are active, so that's called sparse. And then dense would be the opposite, where you only have, like, one fully connected module, and it's always, you know, utilized.

    7. LF

      So may- maybe this is a good place to also talk about KV cache.

    8. SR

      Mm-hmm.

    9. LF

      But actually before that, even zooming out, like, fundamentally, how many new ideas-... have been implemented from, from GPT-2 to today?

    10. SR

      Mm-hmm.

    11. LF

      Like, how different really are these architectures?

    12. SR

      Big picture, like, the mixture of experts. Um, the attention mechanism in GPT-OSS, that would be the grouped-query attention mechanism. So it's a slight tweak from multi-head attention to grouped-query attention, so that's a second one. I think they replaced, uh, LayerNorm by RMSNorm, but it's just, like, a different normalization layer. Not a big change, it's just, like, a, a tweak. Um, the nonlinear activation function, for people familiar with deep neural networks, I mean, it's the same as replacing sigmoid with ReLU. It's, it's not changing [chuckles] the network fundamentally, it's just, like, a tweak, a little, little tweak. Um, and, and that's about it, I would say. It's not really fundamentally that different. It's still the same, same architecture. You can go from one into the other by just adding these, these changes, basically.
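The multi-head versus grouped-query distinction Sebastian mentions is mostly bookkeeping over how many key/value heads you keep around. A toy comparison of KV-cache sizes, with invented head counts that don't correspond to any particular model:

```python
# Multi-head attention: every query head has its own key and value head.
# Grouped-query attention (GQA): several query heads share one KV head,
# which shrinks the KV cache without reducing the number of query heads.

def kv_cache_elems(n_kv_heads, head_dim, seq_len, n_layers):
    # 2x accounts for storing both keys and values
    return 2 * n_kv_heads * head_dim * seq_len * n_layers

head_dim, seq_len, n_layers = 128, 4096, 24  # illustrative config

mha = kv_cache_elems(n_kv_heads=32, head_dim=head_dim, seq_len=seq_len, n_layers=n_layers)
gqa = kv_cache_elems(n_kv_heads=8, head_dim=head_dim, seq_len=seq_len, n_layers=n_layers)

print(mha // gqa)  # prints 4: sharing 32 query heads across 8 KV heads cuts the cache 4x
```

The attention math itself barely changes; what changes is how much memory each generated token's keys and values occupy, which matters a lot for serving.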

    13. LF

      It fundamentally is still the same architecture.

    14. SR

      Mm-hmm. Yep. So for example, you mentioned my book earlier. That's a GPT-2 model in the book, because it's simple and it's very small. Um, so 124 million parameters, approximately. But in the bonus materials, I do have Olmo 3 from scratch, Gemma 3 from scratch, and other types of from-scratch models. And I always started with my GPT-2 model and just, you know, tweaked a few things or added different components, and you get from one to the other. It's, it's kind of like a lineage, in a sense. Yeah.

    15. LF

      Can you build up an intuition for people? Because s- sort of when you zoom out, you look at it, there's so much rapid advancement in the AI world, and at the same time, fundamentally, the architectures have not changed.

    16. SR

      Mm-hmm.

    17. LF

      So where is all the turbulence, the turmoil of the advancement happening? Where, where's the gains to be had?

    18. SR

      So there are the different stages where you develop the network, um, or train the network. You have the pre-training. Now, um, back in the day, it was just pre-training with GPT-2. Now you have pre-training, mid-training, and post-training. Um, and I think right now we are in the post-training focus stage. I mean, pre-training still gives you, um, advantages if you scale it up with better, higher-quality data. But then we have capability unlocks that were not there with GPT-2. For example, uh, ChatGPT is basically a GPT-3 model, and GPT-3 is the same as GPT-2 in terms of architecture. What was new was adding the, um, supervised fine-tuning and the reinforcement learning from human feedback. So it's more on the algorithmic side rather than the architecture.

    19. NL

      I would say that the systems also change a lot. I think if you listen to NVIDIA's announcements, they talk about these things like, "You can now do FP8, you can now do FP4," and what is happening is these labs are figuring out how to utilize more compute to put into one model, which lets them train faster, and that lets them put more data in, and then you can find better configurations faster by doing this. So, essentially, tokens per second per GPU is a metric that you look at when you're doing large-scale training, and you can go from, like, ten K to thirteen K by turning on FP8 training, which means you're using less memory per parameter in the model, and by storing less information, you do less communication, so you can train faster. So all of these, like, systems things underpin way faster experimentation on data and algorithms. It's this, it's this kind of loop that keeps going, [chuckles] wh- where it's kind of hard to describe when you look at the architectures and they're exactly the same, but the code base used to train these models is gonna be vastly different.
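The arithmetic behind Nathan's point is simple: fewer bits per parameter means fewer bytes to store and to move between GPUs. A rough illustration (the 20B model size is an arbitrary example; the ten-K-to-thirteen-K tokens/sec figure is his anecdote, not something this sketch derives):

```python
# Bytes needed to hold a model's parameters at different precisions.
def param_bytes(n_params, bits):
    return n_params * bits // 8

n = 20_000_000_000  # a 20-billion-parameter model, as an example size

for name, bits in [("FP32", 32), ("BF16", 16), ("FP8", 8), ("FP4", 4)]:
    print(f"{name}: {param_bytes(n, bits) / 1e9:.0f} GB")
# Each halving of precision halves the bytes stored and communicated per
# parameter, which is where the extra tokens per second per GPU comes from.
```

Lower precision isn't free, of course; as discussed later in the conversation, the question is how coarse the arithmetic can get before model quality degrades.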

    20. SR

      Mm-hmm.

    21. NL

      And you could probably like, I don't... The GPUs are different, but you probably train GPT-OSS 20B way faster in wall clock time than GPT-2-

    22. SR

      Mm-hmm

    23. NL

      ... was trained at the time.

    24. SR

      Yeah, like you said, they had, for example, in the mixture of experts, this NVFP4 optimization, for example, where you get more throughput. But I, I do think this is true for the speed, but, uh, it, it doesn't give the model new capabilities in a sense. It's just, how much can we make, make the computation coarser without suffering in terms of model performance degradation? Um, but I do think, I mean, there are alternatives popping up to the transformer. There's text diffusion models, uh, a completely different paradigm. Um, and, I mean, text diffusion models might use transformer architectures, but it's not an autoregressive, um, transformer. And also Mamba models. Uh, it's a state-space model. But they do have trade-offs, and, uh, what's true is, nothing has replaced the autoregressive transformer as the state-of-the-art model. So, like, for state of the art, you would still go with that thing, but there are now alternatives for the cheaper end, like, alternatives that are kind of, um, making compromises. But i- it's not just one architecture anymore. There are little ones coming up. But if we talk about the state of the art, it's pretty much still the, the transformer architecture, autoregressive, derived from GPT-2,

  7. 48:051:04:12

    AI Scaling Laws: Are they dead or still holding?

    1. SR

      essentially.

    2. LF

      I guess the big question here is, we talked quite a bit here on the architecture behind the, the pre-training. Are the scaling laws holding strong across pre-training, post-training, inference, context size, data, synthetic data?

    3. NL

      I'd like to start with the technical definition of scaling law-

    4. LF

      Yes

    5. NL

      ... which kind of informs all of this. The scaling law is a power law relationship between, you can think of the X-axis, so kind of what you are scaling, is a combination of compute and data, which are kind of similar, and then the Y-axis is, like, the held-out prediction accuracy over next tokens. We talk about models being autoregressive. It's like, if you hold out a set of text that the model has not seen, how accurate does it get as you train? And the idea of scaling laws came when people figured out that that was a very predictable relationship, and I think that, in that technical sense, it is continuing, and then the question is, like, what do users get out of it? And then there are more types of scaling, where, um, OpenAI's o1 was famous for introducing inference-time scaling, and I think less famously for also showing that you can scale reinforcement learning training and get kind of this log X-axis and then a linear increase in performance on the Y-axis. So there's kind of these three axes now, where the traditional scaling laws are talked about for pre-training, which is how big your model is and how big your dataset is, and then-... scaling reinforcement learning, which is, like, how long can you do this trial-and-error learning that we will talk about? We'll define more of this. And then this inference-time compute, which is just letting the model generate more tokens on a specific problem. So I'm kind of bullish, where they, they're all really still working, but the low-hanging fruit has mostly been taken, especially in the last year, on, um, reinforcement learning with verifiable rewards, which is this RLVR, and then inference-time scaling, which is just why these models feel so different to use, where previously you would get that first token immediately, and now they'll go off for seconds, minutes, or even hours, generating these hidden thoughts before giving you the first word of your answer.
      And that's all about this inference-time scaling, which is such a wonderful kind of step function in terms of how the models' abilities change. They've kind of enabled this tool use stuff and enabled this much better software engineering that we were talking about. And when we say enabled, this is almost entirely downstream of the fact that this reinforcement learning with verifiable rewards training just kind of let the models pick up these skills very easily. So it let the models learn. If you look at the reasoning process, when the models are generating a lot of tokens, what they'll often be doing is: try a tool, look at what comes back, try another API, see what comes back, and see if it solves the problem. So the models, when you're training them, very quickly learn to do this. And at the end of the day, that gives this kind of general foundation where the model can use CLI commands very nicely in your repo and handle Git for you and move things around and organize things, or search to find more information, which, if we were sitting in these chairs a year ago, is something that we didn't really think of the models doing. So this is just kind of something that has happened this year and has totally transformed how we think of using AI, which I think is very magical. It's such an interesting evolution and just unlocks so much value. But it's, it's, like, not clear what the next avenue will be in terms of unlocking stuff like this. I think there's... we'll get to continual learning later, but there's a lot of buzz around certain areas of AI, but no one knows when the next step function will, will really come.
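The "power law with a log X-axis" relationship Nathan describes is easy to see numerically: if loss follows L = a * C^(-b), it's a straight line in log-log space, so a linear fit on the logs recovers the exponent. A toy sketch with made-up numbers (the exponent 0.1 is invented for illustration, not an empirical value):

```python
import math

# Fake (compute, loss) pairs following L = 5 * C**-0.1, a power law.
data = [(10**k, 5 * (10**k) ** -0.1) for k in range(3, 9)]

# In log space the relation is log L = log a - b * log C: a straight line.
xs = [math.log10(c) for c, _ in data]
ys = [math.log10(l) for _, l in data]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# Least-squares slope of the log-log points
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
print(round(-b, 3))  # prints 0.1: the fit recovers the exponent
```

This predictability is the whole point: labs fit lines like this on small runs and extrapolate to decide whether a much larger run is worth the compute.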

    6. LF

      So you've, you've actually said quite a lot of things there and said profound things quickly. It would be nice to unpack them a little bit. You say you're bullish basically on every version of scaling. So can we just even start at the beginning? Pre-training, are we kind of implying that the low-hanging fruit on pre-training scaling has been picked? Is the, is, has pre-training hit a plateau, or is even pre-training still you're bullish on?

    7. NL

      Pre-training has gotten extremely expensive. I think scaling up pre-training also implies that you're gonna serve a very large model to the users. So I think it's been loosely established that the likes of GPT-4 and similar models were around, like, this order of a trillion parameters at the biggest size. There's a lot of rumors that they've actually gotten smaller as training has gotten more efficient. You want to make the model smaller because then your costs of serving go down proportionally. For these models, the cost of training them is really low relative to the cost of serving them to hundreds of millions of users. I think DeepSeek had this famous number of about five million dollars for pre-training at cloud market rates. For Olmo 3, um, in section two point four of the paper, we just detailed how long we had the GPU clusters sitting around for training, which includes engineering issues, multiple seeds, and it was, like, about two million dollars to rent the cluster and, like, deal with all the problems and headaches of training a model. So these models are pretty... like, a lot of people could get one to ten million dollars to train a model, but the recurring cost [chuckles] of serving millions of users is really billions of dollars of compute. I think you can look at, like, a thousand-GPU rental, which you can pay a hundred grand a day for, and these companies could have millions of GPUs. [chuckles] Like, you can look at how much these things cost to sit around. So that's kind of a big thing, and then it's, like, if scaling is actually giving you a better model, like, is it gonna be financially worth it? And I think we'll kind of slowly push it out as AI solves more compelling tasks. So, like, the likes of Claude Opus four point five making Claude Code just work for things.
      I, I launched this project called, like, the ATOM Project, which is, like, American Truly Open Models, in July, and that was, like, a true vibe-coded website. [chuckles] And, like, I had it, um, make plots and stuff, and then I came back to refresh it in the last few weeks, and, like, Claude Opus four point five versus whatever model at the time, like, just crushed all the issues that it had from building in June and July. And, like, it might be a bigger model. There's a lot of things that go into this, but that's, like... there's still progress coming.

    8. LF

      So, so what you're speaking to is the nuance of the Y-axis of the scaling laws, that the, the way it's experienced versus on a benchmark, the actual intelligence is, might, might be different. But still, your intuition about pre-training, if you scale the, the size of compute, will the models get better? Not whether it's financially viable, but just from the law aspect of it, do you think the models will get smarter?

    9. NL

      Yeah, and I think that there's... and this sometimes comes off as, like, almost disillusioned when people, leadership at AI companies, say this, but they're like: "It's held for thirteen orders of magnitude of compute or something. Like, why would it ever end?" So I think fundamentally it is pretty unlikely to stop. It's just, like, eventually we're not even gonna be able to test the bigger scales because of all the problems that come with more compute. I think there's a lot of talk on how twenty twenty-six is a year when very large Blackwell compute clusters, like, gigawatt-scale facilities from hyperscalers, are coming online, and these were all contracts for power and data centers that were signed and sought out in, like, twenty twenty-two and twenty twenty-three, so before or right after ChatGPT. So it took this two- to three-year lead time to build these bigger clusters to train the models, while there's obviously immense interest in building even more data centers than that. So that is, like, kind of the crux of what people are saying: these new clusters are coming, the labs are gonna have more compute for training, they're going to utilize this. But it's not a given. And it's like I...... I've seen so much progress that I expect it, and I expect a little bit bigger models, and I expect, um, I would say it's more like we will see a two thousand dollar subscription this year. We've seen two hundred dollar subscriptions. It's like that could ten-X again, and these are the kind of things that could come, and they're all downstream of this, like, bit bigger model that offers just a little bit more cutting edge.

    10. LF

      So, you know, it's reported that xAI is gonna hit that, uh, one gigawatt scale early '26 and full two gigawatt by year-end. W- how do you think they'll utilize that in the context of scaling laws? Is, is a lot of that inference, is a lot of that training?

    11. NL

      It ends up being all of the above. So I think that all of your decisions when you're training a model come back to pre-training. So if you're gonna scale RL in a model, you still need to decide on your architecture that enables this. We were talking about, like, other architectures and, uh, using different types of attention. We're also talking about mixture of experts models. The sparse nature of MoE models makes it much more efficient to do, um, generation, which becomes a big part of, um, post-training. And it's like you need to have your architecture ready so that you can actually scale up this compute. I still think most of the compute is going in at pre-training because you can still make a model better, you still want to go and revisit this. You still want the best base model that you can. And in a few years, that'll saturate, and the, the RL compute will just go longer.

    12. LF

      Is there people who disagree with you that say, basically, pre-training is dead, it's all about scaling inference, scaling post-training, scaling context, continual learning, uh, scaling data, synthetic data?

    13. NL

      People vibe that way and describe it in that way, but I think it's not the practice that is happening.

    14. LF

      It's just the general vibe of people saying this thing is dead.

    15. NL

      The excitement is elsewhere.

    16. LF

      Yes.

    17. NL

      So the low-hanging fruit in RL is elsewhere. Like, for example, we released our model in November. Every company has deadlines; our deadline was, like, November twentieth, and for that, our RL run was five days, which, compared to twenty twenty-four, is a very long time to just be doing post-training on a model of, like, thirty billion parameters. It's not a big model. And then in December, we had another release, which was just... we let the RL run go for another three and a half weeks, and the model got notably better, so we released it. And, like, that's a big amount of time to just allocate to, like, something that is gonna be your, um, peak-

    18. LF

      Mm-hmm

    19. NL

      ... for the year. [chuckles] So it's like-

    20. LF

      So reasoning is-

    21. NL

      There's these types of decisions that happen when they're training a model, where they just, like, can't leave it forever. You have to keep, you have to keep pulling in the improvements you have from your researchers. So that's, like, you redo pre-training, you'll do this post-training for a month, but then you need to give it to your users. You need to do safety testing. So there's kind of just, like... I think there's a lot in place that reinforces this cycle of just keep updating the models. There's things to improve. You get a new compute cluster that lets you do something maybe more stably or faster. It's like, you hear a lot about Blackwell having rollout issues. At Ai2, most of the models were pre-trained on around, like, one to two thousand GPUs. But when you're pre-training on ten thousand or a hundred thousand GPUs, you hit very different failures. GPUs are known to break in weird ways, and doing a hundred-thousand-GPU run, you're pretty much guaranteed to always have at least one GPU that is down. And you need to have your training code handle that redundancy, which is just a very different problem.

    22. LF

      Mm-hmm.

    23. NL

      Whereas, like, what we're doing, like, I'm playing with post-training on a DGX Spark, or you have your book, or people learning ML... What they're battling to train these biggest models is just, like, massive distributed scale, and it's very different. But that's somewhat different than, like... that's a systems problem-

    24. LF

      Mm-hmm

    25. NL

      ... in order to enable the scaling laws, especially at pre-training, you need all of these GPUs at once. When we shift to reinforcement learning, it actually lends itself to heterogeneous compute, because you have many copies of the model. And, uh, to do a primer on reinforcement learning for a language model: what you're doing is you have two sets of GPUs. One you can call the actor, and one you call the learner. The learner is where your actual reinforcement learning updates happen. These are traditionally policy gradient algorithms; um, proximal policy optimization, PPO, and group relative policy optimization, GRPO, are the two popular classes. And on the other side, you're gonna have actors, which are generating completions, and these completions are the things that you're gonna grade. So reinforcement learning is all about optimizing reward. And in practice, what you can do is have a lot of different actors in different parts of the world doing different types of problems, and then you send it back to this highly networked compute cluster to do the actual learning, where you take the gradients, and you need to have a tightly meshed network where you can do different types of parallelism and spread out your model for efficient training. So there's just, like, a lot of... every different type of training and serving has these considerations you need to scale. Like, we talked about pre-training, we talked about RL, and then inference-time scaling is like: how do you serve a model that's thinking for an hour to a hundred million users? I'm like, I don't really know about that, but I know that's a hard problem, and in order to give people this intelligence, there's all these systems problems, and we need more compute, and you need more stable compute to do it.
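The actor/learner split Nathan outlines can be sketched as a toy loop: actors sample completions, a verifiable reward grades them against a checkable rule, and a GRPO-style group-relative advantage is what the learner would feed into its policy-gradient update. Everything below is a stand-in for illustration; the "guess" generator and answer-matching reward are invented, and no real RL framework is used:

```python
import random

random.seed(0)

def actor_generate(prompt, n=4):
    """Stand-in for actor GPUs sampling n completions per prompt."""
    return [f"{prompt}-guess{random.randint(0, 3)}" for _ in range(n)]

def verifiable_reward(completion, answer):
    """RLVR: reward comes from a checkable rule, not a learned reward model."""
    return 1.0 if completion.endswith(answer) else 0.0

def group_relative_advantages(rewards):
    """GRPO-style: score each completion against its own group's mean reward."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# One actor-to-learner round trip: actors generate, rewards are graded, and
# the learner would then take policy-gradient steps weighted by the advantages.
completions = actor_generate("2+2", n=4)
rewards = [verifiable_reward(c, "guess0") for c in completions]
advs = group_relative_advantages(rewards)
print(rewards, [round(a, 2) for a in advs])
```

The group-relative trick is what lets GRPO skip the separate value network PPO uses: the group mean serves as the baseline.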

    26. LF

      But you're bullish on all of these kinds of scaling, is what I'm hearing, on the inference, on the reasoning, even on the pre-training.

    27. SR

      Yeah. So that's a, a big can of worms here, but there are basically two knobs where you can get gains: training and, uh, inference scaling. And so in a world where we had, let's say, infinite compute resources, you would want to do all of them. So you have training, you have inference scaling, and training is like a hierarchy: pre-training, mid-training, post-training. Changing the model size, more training data... training a bigger model gives you more knowledge in the model. Then the model, um, let's say, is a better base model, as we called it back in the day, or we still call it a foundation model, and it unlocks... But let's say the model isn't able to solve your most complex tasks during pre-training or after pre-training. You still have these other unlock phases, where you have mid-training, for long context, for example, or post-training with, uh, RLVR, that unlock capabilities the model has just from the knowledge in the pre-training. And I think, sure, if you, uh, do more pre-training, you get a better base model that you can unlock later. But like Nathan said, it, it just becomes too expensive, and we don't have infinite compute. So you have to decide: Do I want to spend that compute more on making the model larger? You know, it's a trade-off. In an ideal world, you want to do all of them. And I think in that sense, scaling is still pretty much alive. You would still get a better model, but like we saw with GPT-4.5, it's just not worth it, because you can, let's say, unlock more performance with other techniques at that current moment. Especially, um, if you look at inference scaling, that's one of the biggest gains this year, with o1, um, where it took a smaller model further than pre-training a larger model like GPT-4.5.
      So it, it's like, I wouldn't say pre-training scaling is dead, it's just that there are other, more attractive ways to scale right now at the moment. But at some point, you know, you will still wanna make some progress on the pre-training. The other thing to consider is where you want to spend your money. If you spend it more on the pre-training, it's like a fixed cost. You train the model, and then it has this capability forever. You, you can always use it, uh, and so forth. With inference scaling, you don't spend money during training, you spend money later, per query. And then it's also, like, the math: how long is my model gonna be on the market? If I replace it in half a year, maybe it's not worth spending $5 million, $10 million, $100 million on training it longer. Maybe I will just do more inference scaling and get the performance from there, and maybe that costs me $2 million in terms of user queries. It becomes a question of how many users you have, and then doing the math. Um, and I think that's also where it's interesting, where ChatGPT is in a position, I think, where they have a lot of users, so they need to go a bit cheaper, where they have that, uh, GPT-5 model that is a bit smaller. Other companies have... let's say your customers have other, uh, other, um, trade-offs. For example, there was also the Math Olympiad, or some of these, these math, uh, problems, where ChatGPT, or, or OpenAI, they had a proprietary model, and I'm pretty sure it's just, like, a model that has maybe been fine-tuned a little bit more, but most of it was inference scaling to achieve this peak performance on certain tasks, where you don't need that all the time. But yeah, long [chuckles] story short, I do think all of these, uh, pre-training, mid-training, post-training, inference scaling, they are all still things you want to do.
      It's just that, at the moment, in this year, it's about finding the right ratio that gives you the best bang for the buck, basically.
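The trade-off Sebastian describes, fixed training cost versus per-query inference cost, is a break-even calculation. A toy version, with all dollar figures invented for illustration:

```python
# Option A: spend more on pre-training once, get cheaper queries afterward.
# Option B: smaller training spend, but more inference-time compute per query.
# All numbers below are made up to show the shape of the math.

train_a, per_query_a = 100e6, 0.002   # big pre-train, cheap queries
train_b, per_query_b = 10e6, 0.010    # small pre-train, heavy inference scaling

# Break-even query count q solves: train_a + q * per_query_a == train_b + q * per_query_b
q = (train_a - train_b) / (per_query_b - per_query_a)
print(f"break-even at {q:,.0f} queries")  # prints: break-even at 11,250,000,000 queries
```

With these made-up numbers, the bigger pre-train only pays off past about eleven billion queries, which is why, as he says, the answer depends on how many users you have and how long the model stays on the market.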

  8. 1:04:121:37:18

    How AI is trained: Pre-training, Mid-training, and Post-training

    1. LF

      I think this might be a good place to define pre-training, mid-training, and post-training.

    2. SR

      So pre-training is the classic training, uh, one next-token prediction at a time. You have a big corpus of data, and, uh, Nathan also has very interesting insights there because of Olmo 3; a big portion of the paper focuses on the right data mix. So pre-training is essentially just, you know, training with a cross-entropy loss on next-token prediction on a, a vast corpus of internet data, books, papers, and so forth. It has changed a little bit over the years, in the sense that people used to throw in everything they could. Now it's not just raw data, it's also synthetic data, where people, um, let's say, rephrase certain things. Uh, so synthetic data doesn't necessarily mean purely AI-made-up, uh, data. It's also taking something from an article, a Wikipedia article, and then rephrasing it as a Q&A question, or, um, summarizing it, rewording it, and, and making, uh, better data that way. 'Cause I think of it also like with humans: if someone, let's say, reads a book compared to a messy, I don't know, no offense, but, like, a Reddit post or something like that. I do think you learn- [laughing] You... No offense, uh, but I think-

    3. LF

      There's gonna be a post about this-

    4. SR

      Yeah [chuckles]

    5. LF

      ... Sebastian.

    6. SR

      Some Reddit data is very coveted and excellent for training.

    7. SR

      Yep, yep.

    8. SR

      You just have to filter it. [chuckles]

    9. LF

      Yeah.

    10. SR

      I, I think that's the idea. Uh, I, I think it's like, if someone took that and rephrased it in a, let's say, more concise and structured way-

    11. LF

      Mm-hmm

    12. SR

      ... I think it's higher-quality data. You get the same LLM out of it at the end, maybe, but it gets there faster, it trains faster, because... Let's say, if the grammar and the punctuation are correct, it already learns the correct way, versus getting information in a messy way and then learning later how to correct that, and stuff like that. So I think that is how pre-training evolved, and why scaling still works: it's not just about the amount of data, it's also the tricks to make that data better for you, in a sense. And, and mid-training, I mean, it used to be called, uh, pre-training. I think it's called mid-training because it was awkward to have pre-training and post-training but nothing in the middle, right? It sounds a bit weird to have pre-training and post-training, but what's the actual training? So mid-training is usually similar [chuckles] to, uh, pre-training, but, you know, it's a bit more, I would say, specialized than pre-training. It's the same algorithm, but what you do is you focus, for example, on long context. Like, as one example, you have long-context documents. The reason you don't do that during just pure pre-training is because you don't have that many long-context documents, so you have a specific phase. And one problem of LLMs is also, still, it's a neural network. It has the problem of catastrophic forgetting. So you teach it something, it forgets other things. And you wanna... It's not one hundred percent forgetting, but, you know, it's like no free lunch. It's also the same with humans: if you ask me some math I learned ten years ago, I don't know, [chuckles] I would have to look at it again.
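The next-token objective Sebastian says is shared by pre-training and mid-training can be written down in a few lines. A toy bigram-level sketch, where a count table stands in for the neural network; a real model replaces the counts with a transformer, but the average negative log-likelihood loss is the same idea:

```python
import math
from collections import Counter

text = "the cat sat on the mat".split()  # a stand-in "corpus"

# Toy "model": predict the next word from bigram counts in the corpus.
pairs = list(zip(text, text[1:]))
counts = Counter(pairs)
context_totals = Counter(w for w, _ in pairs)

def next_token_prob(prev, nxt):
    return counts[(prev, nxt)] / context_totals[prev]

# Pre-training loss = average negative log-likelihood of each next token.
nll = -sum(math.log(next_token_prob(p, n)) for p, n in pairs) / len(pairs)
print(round(nll, 3))  # prints 0.277: only "the" is ambiguous (cat vs. mat)
```

Scaling laws are statements about how this held-out loss falls as the corpus and model grow; everything else, including the data-mix and rephrasing tricks discussed here, is about making each token of that corpus worth more.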

    13. LF

      Oh, Nathan was actually saying that he's consuming so much content that there's a catastrophic forgetting issue.

    14. SR

      Yeah.

    15. SR

      Yep.

    16. NL

      I'm, like, trying to learn so much about AI, and it's like I was learning about pre-training parallelism, I'm like, I lost something and I don't know what it was. [laughing]

    17. LF

      [laughing]

    18. SR

      I don't want to anthropomorphize, uh, LLMs, but it's, I think, the same kind of in that sense, how humans learn. I mean-... the quantity is not always better because, yeah, you, it, it's like being selective. And I, and the mid-training is being selective in terms of quality content at the end. So the last thing the LLM has seen is the quality stuff. And then post-training is all the, uh, fine-tuning, supervised fine-tuning, uh, DPO, um, reinforcement learning with verifiable rewards, hu- uh, with human feedback, and so forth. So the refinement stages. And it's also interesting, it's like the cost thing, right? I mean, it's like pre-training, you spend a lot of money on that right now. RL, a bit less. RL, you don't really, I would say, teach it knowledge. It's more like unlocking the knowledge. Uh, it's, it's more like a skill learning, like how to solve problems with the knowledge that it has from pre-training. There are paper-- there are actually three papers [chuckles] this year, or last year, two thousand and twenty-five, on, uh, RL for pre-training, but I, I mean, I don't think anyone does that in production.

    19. NL

      Toy, toy examples for now. [chuckles]

    20. SR

      Toy examples, right. But to generalize, RL, uh, post-training is more like the skill unlock, where pre-training is like soaking up the knowledge, essentially, yeah.

    21. NL

      A few things that could be helpful for people. A lot of people, like, think of synthetic data as being bad for training the models. You mentioned, like, the DeepSeek OCR, which is an optical character recognition paper. A lot of labs did one. Ai2 had, uh, like, had multiple. And the reason that each of these labs has these is because there are vast amounts of PDFs and other digital documents on the web that are in formats that aren't encoded with text easily. So you use these, uh, DeepSeek OCR, and we called ours olmOCR, to extract what can be trillions of tokens of, um, candidate data for pre-training. And pre-training data set size is on the order of trillions, is measured in trillions of tokens. Smaller models from researchers can be something like five to ten trillion. Um, Qwen is documented going up to, like, fifty trillion, and there's rumors that these closed labs can go to, like, a hundred trillion tokens. And just getting this potential data to put in, I think they have a very big funnel, and then the data you actually train the model on is a small percentage of this. Like, this character recognition data would be described as synthetic data for pre-training in a lab. And then there's also the things like ChatGPT now gives wonderful answers, and you can train on those best answers, and that's synthetic data. It's very different than, like, the early ChatGPT, lots-of-hallucinations data, which is where people's intuitions about synthetic data got grounded.

    22. SR

      One interesting question is, if I recall correctly, Olmo 3 was trained with less data than, uh, specifically some other open-weight models, maybe even Olmo 2, but you still got better performance, and that might be one of the examples of how the data helped.

    23. NL

      It's mostly down to data quality.

    24. SR

      Mm-hmm.

    25. NL

      I think if we had more compute, we would train for longer. I think we'd ultimately see that as, like, just something we would want to do. And especially with big models, you need to have more compute, because we talked about having more parameters, and we talked about knowledge. And essentially, there's a ratio where big models can absorb more from data, and then you get more benefit out of this. It's like any logarithmic graph in your mind: a small model will level off sooner if you're measuring trillions of tokens, and bigger models need more. But mostly, we aren't training that big of models right now at Ai2, and getting the highest quality data we can is the natural starting point.
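The leveling-off intuition above can be made concrete with a toy Chinchilla-style loss curve. This is a sketch under invented constants (none of them fitted to any real model), meant only to show the shape of the argument:

```python
# Toy sketch of "small models level off sooner," using a Chinchilla-style
# parametric loss L(N, D) = E + A / N**a + B / D**b.
# All constants below are invented for illustration, not fitted values.

E, A, B = 1.7, 400.0, 1800.0   # hypothetical irreducible loss and scale terms
a, b = 0.34, 0.28              # hypothetical exponents

def toy_loss(n_params: float, n_tokens: float) -> float:
    """Loss of an n_params model trained on n_tokens tokens (toy model)."""
    return E + A / n_params ** a + B / n_tokens ** b

def data_free_floor(n_params: float) -> float:
    """Loss the model approaches with unlimited data (the parameter floor)."""
    return E + A / n_params ** a

small, big = 1e9, 70e9                 # 1B vs. 70B parameters
for tokens in (1e12, 5e12, 50e12):     # 1T, 5T, 50T training tokens
    print(f"{tokens / 1e12:>3.0f}T tokens | "
          f"1B loss {toy_loss(small, tokens):.3f} "
          f"(floor {data_free_floor(small):.3f}) | "
          f"70B loss {toy_loss(big, tokens):.3f} "
          f"(floor {data_free_floor(big):.3f})")
```

The small model flattens out against a higher parameter-count floor, while the bigger model has a lower floor and more headroom, which is the log-curve picture described above.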

    26. LF

      Is there something to be said, uh, about the topic of data quality? Is there some low-hanging fruit there still, where the quality could be improved?

    27. NL

      It's like turning the crank. So I think historically, in the open, there's been, like, a canonical best pre-training data set that has moved around between who has the most recent one or the best recent effort. Like, Ai2's Dolma was very early with the first Olmo, and Hugging Face had FineWeb, and there's the DCLM project, which stands for DataComp for Language Models. There's been DataComp for other machine learning projects, and they had a very strong data set. And a lot of it is the Internet is becoming fairly closed off. So we have Common Crawl, which I think is hundreds of trillions of tokens, and you filter it, and it ends up being a lot of scientific work, where you're training classifiers and making decisions on how you prune down this data set into the highest quality stuff and the stuff that suits your tasks. So previously, language models were tested a lot more on, like, knowledge and just kind of conversational things, but now they're expected to do math and code. So to train a reasoning model, you need to remix your whole data set. And there's a lot of actually wonderful scientific methods here where you can, like, take your gigantic data set and sample a lot of really tiny things from different sources. So say you have GitHub, Stack Exchange, Reddit, Wikipedia. You can sample small things from them, and you train small models on each of these mixes and measure their performance on your evaluations. And you can just do, like, basic linear regression, and it's like, here's your optimal data set. But if your evaluations change, your data set changes a lot. So a lot of Olmo 3 was new sources for reasoning, to be better at math and code. And then you do this mixing procedure, and it gives you the answer. [chuckles] And I think a lot of that's happened at labs this year. 
So there's new hot things, whether it's, like, coding environments or web navigation, and you just need to bring in new data. You need to change your whole pre-training so your post-training can work better, and stuff like this. So that's, like, the constant re-evolution and redetermining of what they care about for their models.
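The mixing procedure described above (sample random mixtures, score a small proxy model on each, fit a linear model, read off the best mix) can be sketched in a few lines. All scores and source qualities below are synthetic stand-ins; a real pipeline would get each score by actually training a small model on that mixture:

```python
# Sketch of data-mix selection via proxy runs plus linear regression.
# The "proxy scores" are faked with invented per-source values; only the
# shape of the procedure is meant to be realistic.
import random

SOURCES = ["github", "stackexchange", "reddit", "wikipedia"]

def random_mix(rng):
    raw = [rng.random() for _ in SOURCES]
    total = sum(raw)
    return [x / total for x in raw]      # mixture weights sum to 1

def fake_proxy_score(mix, rng):
    # Stand-in for an eval score from training a small model on this mix.
    value = [0.9, 0.7, 0.4, 0.6]         # invented source qualities
    return sum(w * v for w, v in zip(mix, value)) + rng.gauss(0, 0.01)

def fit_linear(xs, ys):
    """Least-squares fit of y ~ sum_j w_j * x_j via the normal equations."""
    k = len(xs[0])
    A = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    b = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(k)]
    for col in range(k):                 # Gaussian elimination w/ pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * k
    for r in reversed(range(k)):
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, k))) / A[r][r]
    return w

rng = random.Random(0)
mixes = [random_mix(rng) for _ in range(40)]     # 40 tiny proxy "runs"
scores = [fake_proxy_score(m, rng) for m in mixes]
weights = fit_linear(mixes, scores)
best = max(zip(SOURCES, weights), key=lambda t: t[1])
print("fitted per-source weights:", [round(w, 2) for w in weights])
print("highest-value source:", best[0])
```

One caveat worth noting: a purely linear fit always piles weight onto the top source, so real pipelines add constraints or richer models, and, as said above, the whole fit has to be redone whenever the evaluation suite changes.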

    28. LF

      Are there fun anecdotes of what sources of data are particularly high quality that we wouldn't expect? You mentioned Reddit sometimes can be a source.

    29. NL

      Reddit was very useful. I think that, um, like, PDFs is definitely one.

    30. SR

      Especially archive.

  9. 1:37:181:58:11

    Post-training explained: Exciting new research directions in LLMs

    1. LF

      as developers. Now, we've had this fascinating conversation that started with pre-training and mid-training. Let's get to post-training. A lot of fun stuff in post-training. So what are some of the interesting ideas in post-training?

    2. NL

      The biggest one from 2025 is learning this reinforcement learning with verifiable rewards. You can scale up the training there, which means doing a lot of this kind of iterative generate grade loop, and that lets the models learn both interesting behaviors on the tool use and software side. This could be searching, running commands on their own, and seeing the outputs, and then also that training enables this inference time scaling very nicely. And it just turned out that this paradigm was very nicely linked in this, where it's this kind of RL training enables inference time scaling, but inference time scaling could have been found in different ways. So it's kind of this perfect storm of the models change a lot, and the way that they're trained is a major factor in doing so. And this has changed how people approach post-training.... dramatically.

    3. LF

      Can you describe RLVR, popularized by DeepSeek R1? Can you describe how it works?

    4. NL

      Yeah, fun fact, um, I was on the team that came up with the term RLVR, which is from our Tülu 3 work before DeepSeek. We don't take a lot of credit for being the people to popularize scaling RL. But one of the fun things academics get, as an aside, is the ability to name and influence-

    5. LF

      Mm

    6. NL

      - the discourse, because the closed labs can only say so much. That one of the things you can do as an academic is, like, you might not have the compute to train the, the model, but you can frame things in a way that ends up being... I describe it as, like, a community can come together around this RLVR term, which is very fun. And then DeepSeek is the people that did the training breakthrough, which is they scaled the reinforcement learning, which was: you have the model generate answers and then grade the completion, if it was right, and then that accuracy is your reward for reinforcement learning. So reinforcement learning is classically an agent that acts in an environment, and the environment gives it a state and, and a reward back, and you try to maximize this reward. In the case of language models, the reward is normally accuracy on a set of verifiable tasks, whether it's math problems, coding tasks, and it starts to get blurry with things like factual domains. Like, that is also in some ways verifiable, or constraints on your instruction, like respond only with sen-- words that start with A. Like, all of these things are verifiable in some way, and the core idea of this is you find a lot more of these problems that are verifiable, and you let the model try it many times while taking these RL steps, these RL gradient updates. The infrastructure evolved from this reinforcement learning from human feedback, where in that era, the score they were trying to optimize was a learned reward model of aggregate human preferences. So you kind of change the problem domains, and that let the optimization go on to much bigger scales, which kind of kickstarted a major change in what the models can do and how people use them.
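The generate-and-grade loop just described can be sketched as follows. The completions are hard-coded stand-ins for real model samples, the `#### answer` format is just one common convention, the baseline is a simplified GRPO-style group mean (no std normalization), and no actual gradient step is taken:

```python
# Minimal RLVR sketch: sample several completions per prompt, score each with
# a programmatic verifier, and turn scores into advantages with a group-mean
# baseline. Everything model-related is mocked.

def verify_math(completion: str, gold: str) -> float:
    """1.0 if the final '#### answer' matches the gold answer, else 0.0."""
    tail = completion.rsplit("####", 1)[-1].strip()
    return 1.0 if tail == gold else 0.0

def verify_constraint(completion: str) -> float:
    """Example 'constraint' reward: every word starts with 'a'."""
    words = completion.split()
    return 1.0 if words and all(w.lower().startswith("a") for w in words) else 0.0

def group_advantages(rewards):
    """Reward minus the group mean: above-average samples get pushed up."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Four mock samples for one math prompt whose gold answer is "12".
samples = [
    "3 * 4 = 12 #### 12",
    "3 + 4 = 7 #### 7",
    "think step by step ... #### 12",
    "#### twelve",
]
rewards = [verify_math(s, "12") for s in samples]
advs = group_advantages(rewards)
print(rewards)   # [1.0, 0.0, 1.0, 0.0]
print(advs)      # [0.5, -0.5, 0.5, -0.5]
print(verify_constraint("an apple always arrives"))  # 1.0
```

In a real trainer, these advantages would weight the log-probabilities of each sampled token in a policy-gradient update; the verifier is the only "label" in the loop.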

    7. LF

      What kind of domains is, uh, RLVR amenable to?

    8. NL

      Math and code are the famous ones, and then there's a lot of work kind of on what is called rubrics, which is related to a phrase people might have heard, LLM-as-a-judge, which is like: for each problem, I'll have a set of problems in my training data set. I'll then have another language model and ask it: "What would a good answer to this problem look like?" And then you could try the problem a bunch of times, over and over again, and assign a score based on this rubric. So that's not necessarily verifiable, like a math and code domain, but this rubrics idea, and other scientific problems that might be a little bit more vague, is where a lot of the attention is, where they're trying to push this set of methods into these kind of, uh, more open-ended domains, so the models can learn a lot more.
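A rubric reward of this kind can be sketched like this. Both judge functions are hypothetical stubs standing in for real LLM calls (one writing rubric items, one answering yes/no per item); only the shape of the loop is the point:

```python
# Sketch of rubric-based rewards for non-verifiable domains: a judge model
# writes rubric items per question, and the reward is the fraction of items
# the judge says the answer satisfies. Both "judge" functions are invented
# stubs, not real model calls.

def judge_write_rubric(question: str) -> list[str]:
    # Stand-in for asking a judge LLM: "What would a good answer look like?"
    return [
        "states the final answer explicitly",
        "explains the key mechanism in one or two sentences",
        "does not contradict itself",
    ]

def judge_check_item(answer: str, item: str) -> bool:
    # Stand-in for an LLM-as-a-judge yes/no call; here a trivial keyword check.
    return "answer:" in answer.lower() if "final answer" in item else True

def rubric_reward(question: str, answer: str) -> float:
    items = judge_write_rubric(question)
    hits = sum(judge_check_item(answer, item) for item in items)
    return hits / len(items)

print(rubric_reward("Why is the sky blue?", "Answer: Rayleigh scattering ..."))
print(rubric_reward("Why is the sky blue?", "It just is."))
```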

    9. SR

      I think that's called reinforcement learning from AI feedback, right?

    10. NL

      That's the older term from it that was coined in Anthropic's constitutional AI paper. [chuckles] So it's like a lot of these things come in cycles.

    11. SR

      Also, just one step back for the RLVR. So I think the interesting, beautiful thing here is that you j-- you ask the LLM, let's say, a math question, and then you know the correct answer, and you let the LLM, like you said, figure it out. But how it does it, I mean, you don't really constrain it much. There are some constraints you can add, like use the same language. Don't switch between Spanish and English. But let's say you're pretty much hands-off. You only give the question and the answer, and then the LLM just has the task to arrive at the, uh, right answer. But the beautiful thing here is, what happens in practice is that the LLM will do a step-by-step description. Like, you know, like as a student or like as a, yeah, m- mathematician, how you would derive the solution. It will give you, or it will use those steps, and that actually helps the model to improve its own accuracy. And then, like you said, the inference scaling. So inference scaling loosely means spending more compute while using the LLM, during inference. And here, the inference scaling is that the model would use more tokens. And, and also, I think in the R1 paper, they showed the longer they train the model, the longer the responses are. They, they grow over time, they use more tokens, so it becomes more expensive, becomes more expensive for simple tasks. But these explanations, they help the model with the accuracy. There are also, interestingly, [chuckles] a lot of papers showing that what the model explains does not necessarily have to be correct, or maybe it's even unrelated to the answer, but for some reason, it still helps the model. Like, it's the fact that it is, um, explaining. And I think it's also, again, I don't want to anthropomorphize these LLMs, but it's kind of like how we humans operate, right? 
If there's a complex math problem, let's say in a math, uh, class, you, you usually have a note paper, and you do it step by step, you cross out things, and the model also self-corrects. And that, that was, I think, the aha moment in the R1 paper. They called it aha moment because the model itself recognized it made a mistake and then said, "Ah, I did something wrong, and so let me try." And I think that's just so cool that this falls out of just giving it the correct answer and having it figure out how to do it, that it kind of does, in a sense, what a human would do. Although LLMs don't think like humans, [chuckles] it's kind of like an interesting coincidence. And, and the, the other side-- nice side effect is it's great for us humans often to see these steps. It builds trust, but also we learn, we can double-check things.

    12. NL

      There's a lot in here.

    13. SR

      Mm.

    14. NL

      I think some of the debate-- There's been a lot of debate this year on if the language models like these aha... I think the aha moments are kind of fake because in pre-training, you essentially have seen the whole Internet.

    15. SR

      Yep.

    16. NL

      So you have definitely seen people explaining their work, even, even verbally, like a transcript of a math lecture. You try this, "Oh, I messed this up." And what reinforcement learning is, this RLVR is very good at doing, is amplifying these behaviors-

    17. SR

      Mm-hmm

    18. NL

      ... because they're very useful in enabling the model to think longer and to check its work. And-... I agree that it is very beautiful that this training kind of, the model learns to amplify this in a way that is just so useful at the final answers being better.

    19. SR

      I can give you also a hands-on example. I was training the Qwen3 base model with RLVR on MATH-500. The base model had an accuracy of about fifteen percent. Just fifty steps, like in a few minutes, with RLVR, the model went from fifteen percent to fifty percent accuracy. And the model, you can't tell me it's learning anything fundamentally about math in-

    20. NL

      The Qwen example is weird because there's been two papers this year, one of which I was on, that talks about data contamination in Qwen-

    21. SR

      Mm-hmm, mm-hmm

    22. NL

      ... and specifically that they trained on a lot of this special mid-training phase that we did-

    23. SR

      Exactly

    24. NL

      -like a minute ago, but it's weird since they train on problems that are almost identical to that.

    25. SR

      Exactly. And so you can see that basically the RL, it's not teaching the model any new knowledge about math. You can't do that in fifty steps. So the knowledge is already there in the pre-training, you're just unlocking it.

    26. NL

      I still disagree with the kind of premise, because there's a lot of weird complexities that you can't prove. Because one of the things that points to weirdness is that if you take the Qwen-three so-called base model, and you can, you could Google on the screen, you could Google, like, math dataset hugging face, and you could take a problem. And what you do if you put it into Qwen-three base, the-- all these math problems have words. So it'd be like, Alice has five apples and takes one and gives three to whoever, and they're these word problems. With these Qwen-based models, why people are suspicious of them is if you change the numbers but keep the words-

    27. SR

      Mm.

    28. NL

      -Qwen will produce, like, wi- without tools, a very high-accuracy, like, decimal representation-

    29. SR

      Mm-hmm, mm-hmm

    30. NL

      ... of the answer. Which means there's some like, at some time, it was shown problems that were almost identical to the test set, and it was using tools to get a very high precision answer. But a language model without tools will never actually have this. So it's kind of been this big debate in the research community is like how much of these reinforcement learning papers that are training on Qwen and measuring specifically on this like math benchmark, where there's been multiple papers talking about contamination, it's like, how much can you believe them? And I think this is what caused the reputation of RLVR being about formatting, because you can get these gains so quickly, and therefore, it must already be in the model. But there's a lot of complexity here that we-- it's not really like controlled experimentation.

  10. 1:58:112:21:03

    Advice for beginners on how to get into AI development & research

    1. LF

      I was wondering if we could take at this point a bit of a tangent and talk about education and learning. If you're somebody listening to this who's a smart person, interested in programming, interested in AI, so I presume building something from scratch is a good beginning. So can you just take me through, like, what you would recommend people do?

    2. SR

      So I would personally start, like you said, uh, implementing a simple model from scratch that you can run on your computer. The goal, if you build a model from scratch, is not to have something you use every day for your personal projects. Like, it's not going to be your personal assistant replacing an existing open-weight model or ChatGPT. It's to see what exactly goes into the LLM, what exactly comes out of the LLM, how the pre-training works in that sense, uh, on your own computer, preferably. Um, and then you learn about the pre-training, the supervised fine-tuning, the attention mechanism. You get a solid understanding of how things work, but at some point you will reach a limit because small models can only do so much. And the problem with learning about LLMs at scale is, I would say, it's exponentially more complex to make a larger model, because it's not that the model just becomes larger; you have to now think about sharding your parameters across multiple GPUs. Even for the KV cache, there are multiple ways you can implement it. One is just to understand how it works: you just grow the cache. It's like a cache you grow step by step by, let's say, concatenating lists, um, growing it, but that wouldn't be optimal. On GPUs, you wouldn't do that. You would pre-allocate a tensor and then fill it in, but that adds, again, another twenty, thirty lines of code, and for each thing, you add so much code. And I think the trick with the book is basically to understand how the LLM works. It's not gonna be your production-level LLM, but once you have that, you can understand the production-level LLM.
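The two KV-cache styles mentioned here can be sketched with plain Python lists standing in for tensors. This is a deliberate simplification for illustration; a real cache holds tensors of shape `(batch, heads, seq, head_dim)` and the pre-allocated version writes into a slice of a fixed buffer:

```python
# Two KV-cache strategies, with lists standing in for key/value tensors.

class GrowingKVCache:
    """Pedagogical version: grow the cache step by step by appending.
    Easy to read, but on a GPU this means repeated reallocation/concat."""
    def __init__(self):
        self.keys, self.values = [], []

    def update(self, k, v):
        self.keys.append(k)
        self.values.append(v)
        return self.keys, self.values

class PreallocatedKVCache:
    """Production-style version: allocate max_len slots up front and fill
    them in place, tracking how many positions are valid."""
    def __init__(self, max_len: int):
        self.keys = [None] * max_len
        self.values = [None] * max_len
        self.pos = 0

    def update(self, k, v):
        self.keys[self.pos] = k
        self.values[self.pos] = v
        self.pos += 1
        # Only the filled prefix is valid for attention.
        return self.keys[: self.pos], self.values[: self.pos]

grow, prealloc = GrowingKVCache(), PreallocatedKVCache(max_len=8)
for t in range(3):                      # three decode steps
    k, v = f"k{t}", f"v{t}"
    grow.update(k, v)
    prealloc.update(k, v)
print(grow.keys)                        # ['k0', 'k1', 'k2']
print(prealloc.keys[: prealloc.pos])    # ['k0', 'k1', 'k2']
```

Both yield the same valid prefix; the difference is purely about allocation, which is exactly the extra bookkeeping code being described.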

    3. LF

      So you're trying to always build an LLM that's gonna fit on one GPU?

    4. SR

      Yes. The-- Most of them I have, they-- I have some bonus materials on some, uh, MoE models. I think one or, or two of them, they may require multiple GPUs, but the goal is to have it on one GPU. And the beautiful thing is also you can self-verify. [chuckles] It's almost like RLVR when you code these from scratch. You can take, uh, an existing model from the Hugging Face transformer library. Um, so the Hugging Face transformer library is great, but if you want to learn about LLMs, I think that's not the best place to start because the code is so complex, because it has to full-- it has to fit so many use cases. Also, some people use it in production. It has to be really sophisticated, and it's really intertwined and really hard. It's not linear to read.

    5. NL

      It was started as a fine-tuning library, and then it grew to be like the standard representation of every model architecture and the way it is loaded. So Hugging Face is like the default place to get a model, and Transformers is the software that enables it-

    6. SR

      Mm-hmm.

    7. NL

      -so people can easily load a model-

    8. SR

      Mm

    9. NL

      ... and do something basic with it.

    10. SR

      And all frontier labs that have open weight models have a Hugging Face Transformers version of it, like from DeepSeek to GPT-OSS. That's like the canonical weight, uh, that you can load there. But again, also, even Transformers, the library, is not used in production. People then use SGLang or vLLM, and it adds another layer of complexity.

    11. LF

      We should say that the Transformers Library has, like, four hundred models.

    12. SR

      So it's one library that tries to implement a lot of LLMs, and so you have a huge code base. Basically, it's, like, huge. It's, uh, I don't know, maybe millions-

    13. NL

      It's crazy [chuckles]

    14. SR

      ... hundreds of thousands of lines of code, and-- It's like, understanding the part that you want to understand is finding the needle in the haystack. But what's beautiful about it is you have a working implementation, and so you can work backwards from it. What I would recommend doing, or what I also do, is if I want to understand, for example, how Olmo 3 is implemented, I would, uh, look at the weights in the model hub, the config file, and then you can see, oh, they use so many layers. They use, let's say, grouped-query attention or multi-head attention in that case. Then you see all the components in, like, a human-readable, I don't know, hundred lines of config file. And then you start, let's say, with your GPT-2 model and add these things, you know? And the cool thing here is you can then load the pre-trained weights and see if they work in your model. And you want to match the same output that you get with the Transformers model, and then you can use that as, basically, a verifiable reward to make your architecture correct. And sometimes it takes me a day. [chuckles] With Olmo 3, the challenge was, uh, RoPE for the position embeddings. They had a YaRN, uh, extension, and there was some custom, uh, scaling there, and I couldn't quite match these things. And in this struggle, you kind of understand things. But the cool thing is, at the end, you know you have it correct because you can unit test it, you can check against the reference implementation, and I think that's maybe one of the best ways to learn, really, like to basically reverse engineer something. Yeah.
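The check-against-the-reference workflow can be sketched generically. Softmax stands in here for a full model forward pass; with a real model, the same pattern compares your from-scratch logits against the Hugging Face Transformers implementation after loading the same checkpoint:

```python
# Sketch of "unit test against the reference": run your implementation and a
# trusted one on the same random inputs and assert the outputs match within
# tolerance. Softmax is a stand-in for a real forward pass.
import math
import random

def softmax_mine(xs):
    # Numerically stable version: subtract the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def softmax_reference(xs):
    # "Reference" implementation (naive, fine for small inputs).
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def check_against_reference(mine, ref, n_cases=100, tol=1e-9):
    rng = random.Random(0)
    for _ in range(n_cases):
        xs = [rng.uniform(-5, 5) for _ in range(8)]
        got, want = mine(xs), ref(xs)
        assert all(math.isclose(g, w, abs_tol=tol) for g, w in zip(got, want))
    return True

print(check_against_reference(softmax_mine, softmax_reference))  # True
```

When the check fails, bisecting which sub-module first diverges (embeddings, attention, RoPE scaling, and so on) is exactly the struggle-then-understand loop described above.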

    15. NL

      ... I think that that is something that everybody that's interested in getting to AI today should do. And I think that's why I liked your book, is like I came to language models from this RL and robotics field. Like I never had taken the time to just, like, learn all the fundamentals. And this transformer architecture I describe as being like so fundamental as like deep learning was a thing that I had to learn in the past, and people need to do this. I think that where a lot of people kind of get overwhelmed is: how do I apply this to have impact or find like a career path? Because like AI and language models make this fundamental stuff so accessible, and people with motivation will learn it. And then it's like, how do I get the cycles on goal to contribute to research? And I think that I'm actually fairly optimistic in this because the field moves so fast that a lot of times the best people, like, don't fully solve a problem because there's a bigger, lower hanging, [chuckles] like a bigger problem to solve that's very low-hanging fruit, so they move on. And I think that a lot of what I was trying to do in this RLHF book is like, take post-training techniques and just describe how people think about them influencing the model and what people are doing. And then it's remarkable how many things I just think are just like, people stop studying them or don't. So I think people trying to get narrow after doing the fundamentals is good, and then reading the relevant papers and being engaged in the ecosystem, it's like you actually, the proximity that random people have online from [chuckles] the leading researchers, like no one knows who all the... The anonymous account on X and ML is very popular for whatever reason, and no one knows who all these people are. Like, it could just be random people that study the stuff deeply, especially with the AI tools, and just be like, "Keep-- I don't understand this, keep digging into it," I think is a very useful thing. 
But there's a lot of research areas that just like are maybe three papers that you need to read, [chuckles] and then one of the authors will probably email you back. But you have to put in a lot of effort into these emails to understand the field. Like, I think it would be, for a newcomer, easily weeks of work to feel like they can truly grasp, like, what is a very narrow ara-- area. But I think going narrow after you have the fundamentals will be very useful to people because it's like I became very interested in, like, character training, which is like how you make the model funny or sarcastic or serious, and like, what do you do to the data to do this? And it's like a student at Oxford reached out to me and was like: "Hey, I'm interested in this," and I advised him. And I was like: "That paper now exists." And it's like, I don't know, there's like two or three people in the world that were very interested in this. He's a PhD student, which gives you an advantage, but like, for me, that was a topic I was waiting for someone to be like: "Hey, I have time to spend cycles on this," and I'm sure there's a lot more very narrow things where you're just like: "Oh, it doesn't make sense that there was no answer to this." And I think that it's just like there's so much information coming that people are like, "I can't grab on to any of these," but if you just actually stick in an area, I think there's a lot of interesting things to learn.

    16. SR

      Yeah, I think you can't try to do it all because it would be very overwhelming, and you would burn out if you tried to keep up with everything. For me, for example, I haven't kept up with computer vision in a long time, just focused on LLMs. But coming back to your book, for example, I think this is also a really great book and a really good bang for the buck because you want to learn about RLHF. I wouldn't go out there and read RLHF papers because I would be-- you would be spending two years-

    17. NL

      Some of them contradict.

    18. SR

      Yeah.

    19. NL

      There's-- I've just edited the book, and I was like, there's a chapter where I had to be like, "X papers say one thing, and X papers say another thing, and we'll see what comes out to be true." [chuckles]

    20. LF

      What, what are some of the-- just to go through some of the table of contents, some of the ideas we might have missed in the bigger picture of the post-training. So first of all, you do the problem setup, training overview, what are preferences, preferences data and the optimization tools, reward modeling, regularization, instruction tuning, rejection sampling, reinforcement learning, i.e., policy gradients, direct alignment algorithms. Then constitutional AI and AI feedback, reasoning and inference time scaling, tool use and function calling, synthetic data and distillation, evaluation, and then open question section over optimization, style and information, and then product UX, character, and post-training. So what are some ideas worth mentioning that connect both the educational component and the research component? You mentioned the character training, which is pretty interesting.

    21. NL

      Character training is interesting because there's so little out there on it, but we talked about how people engage with these models, and, like, we feel good using them because they're positive, but that can go too far. It could be too positive, and it's like, essentially, it's: how do you change your data and, or decision making to make it exactly what you want? And I-- OpenAI has this thing called a model spec, which is essentially their internal guideline for what they want the model to do, and they publish this to developers. So essentially, you can know what is a failure of OpenAI's training, which is like they have the intentions, and they haven't met it yet, versus what is something that they like actually wanted to do and that you don't like. And that transparency is very nice. But all the methods for curating these documents and how easy it is to follow them is not very well known. I think the way the book is designed is that the reinforcement learning chapter is obviously what people want because everybody hears about it with RLVR, and it's the same algorithms [chuckles] and the same math, but it's just like you can use it in, in very different domains. So I think the core of prefe-- of RLHF is like how messy preferences are, is essentially a rehash of a paper that I wrote years ago. But this is essentially the chapter that will tell you why RLHF is never, ever fully solvable. Because, like, the way that even RL is set up is that, um, it assumes that preferences can be quantified and that multiple preferences can be reduced to single values. And I think it relates in the economics literature to the Von Neumann-Morgenstern utility theorem. And like, that is the chapter where all of that philosophical, economic, and like psychological context, it tells you what gets compressed into doing RLHF. So it's like you have all of this, and then at the-- later in the book, it's like you use this RL math to make the number go up. 
And I think that that's why I think it would be very rewarding for people to do research on, is because it's like quantifying preferences is something that is just like-... humans have designed the problem in order to make preferences studiable. But there's kind of fundamental debates on, like, an example is in a language model response, you have different things you care about, whether it's accuracy or in style, and when you're collecting the data, they all get compressed into like a, "I like this more than another." And it's like, like that is happening. And there's a lot of philoso- [chuckles] there's a lot of research in other areas of the world that go into like, how should you actually do this? I think social choice theory is the subfield of economics around how you should aggregate preferences. [chuckles] And there's like, uh, I was-- I went to a workshop that published a white paper on like, how can you think about using social choice theory for RLHF? So I mostly would want people that get excited about the math to come and have things where they can stumble into and learn this kind of broader context. I think there's a fun thing. I just keep a list of all the tech reports that I like of reasoning models. So in the, in chapter fourteen, which is kind of like a short summary of RLVR, there's just like a gigantic table where I just, like, list every single reasoning model that I like. So there's just like, I think in education, a lot of it needs to be like... At this point, it's like what I like. Mm-hmm. Because the language models are so good at the math, where it's like famous paper, direct preference optimization, which is like a much simpler way of pro- solving the problem than RL. Um, the derivations in the appendix skip steps of math, and it's like I tried for this book. Like, I redid the derivations, and I'm like: What the heck is this log trick that they use to change the math? But doing it with language models, they're like: "This is the log trick." 
And I'm like, I don't know if I like this, that the math is so commoditized. I think, like, some of the struggle in reading this appendix- Mm-hmm ... and following the math, I think is good for learning. And I, [chuckles] - Yeah, so we're actually returning to this often just on the topic of education. You bo- both have brought up the word struggle- Mm-hmm ... quite a bit. So there is value. If you're not struggling as part of this process, you're not fully following the, the proper process for learning, I suppose. Some of the providers are starting to work on models for education, which are designed to not give-- actually, I haven't used them, but I would guess they're designed to not give all the information at once- Right ... and make people work to do this. So I think you could train models to do this, and it would be a wonderful contribution where, like all of this stuff in the book, you have to reevaluate every decision for it- Yeah ... which is such a great example. I, I think there's, there's a chance you work on that at Ai2, which I, which I was like: "Oh, I think this would be so fun." Makes sense. I did something like that, uh, the other day for video games, for example. [chuckles] I sometimes, for my pastime, play video games. Like, I like, uh, video games with, uh, puzzles, you know, like Zelda and Metroid, and there's this new game where I got stuck, and I really got stuck, and I was, okay, I, you know, I don't want to struggle f- uh, like, two, uh, two days. And so I used an LLM. But then you, uh, say: "Hey, please don't, uh, add any spoilers, just, you know, I'm here and there. What do I have to do next?" And the same thing you can do, I guess, for math, where you say: "Okay, I'm here at this point. I'm getting stuck. Don't give me the full solution, but w- what is something I could try?" You know, like where you kind of carefully probe it. But the problem here is, I think it requires discipline, and a lot of people do math for like... 
I mean, there are a lot of people who enjoy math, but there are also a lot of people who need to do it for their homework. And then it's like the shortcut. And yeah, we can develop an educational LLM, but the other LLM is still there, and there's still a temptation to use the other LLMs. I think a lot of people, especially in college, they, they understand the stuff they're passionate about. Mm-hmm. They're self-aware about it, and they understand it shouldn't be easy. Mm-hmm. Mm-hmm. Like, I think you just have to develop a good taste. Mm-hmm. We talk about research taste, like school taste, about stuff that you should be struggling on and, and stuff you shouldn't be struggling on. Which is tricky to know, 'cause sometimes you don't have good, um, long-term vision about what would be actually useful to you in your career. But you have to, you have to d- develop that taste. Yeah. I was talking to maybe my fiancée or friends about this, and it's like there's this brief ten-year window where all of the homework and all the exams could be digital. But before that, everybody had to do all the exams in blue book because there was no other way. And now, after AI, everybody's gonna need to be in blue books and oral exams because everybody could cheat so easily. [chuckles] It's like this brief generation that had a different education system where, like, everything could be digital, but you still couldn't cheat, and now it's just gonna go back. [chuckles] And it's just very funny. You mentioned character training. Just zooming out on, on a more general topic. For that topic, how much compute was required? And in general, to contribute as a researcher, are there places where not too much compute is required, where you can actually contribute as an individual researcher? 
For-- on the character training thing, I think this research is built on fine-tuning about seven billion parameter models with LoRA, which is, essentially, where you only fine-tune a small subset of the weights of the model. I don't know exactly how many GPU hours that would take- But it's doable. Not doable for every academic. So the situation for some academics is, like, so dire that the only work you can do is doing inference, where you have closed models or open models, and you get completions from them, and you can look at them and understand the models. And that's very well suited to evaluation, where you become ex-- you want to be the best at creating representative problems that the models fail on or show certain abilities, which I think you can break through with. So I've-- like, I think that the top-end goal for a researcher working on evaluation, if you want to have career momentum, is that the frontier labs pick up your evaluation. So it's like, you don't need to have every project do this, but if you go from a small university with no compute, and you figure out something that Claude struggles with, and then the next Claude model has it in the blog post, like there, there's your career rocket ship. I think that that's hard, but it's like if you want to scope the maximum possible impact with minimum compute, it's something like that, which is just get very narrow, and it takes learning of where the models are going. So you need to, like, build a tool that tests not where Claude Four point five will fail. If you're gonna do a re-- if I'm gonna start a research project, I need to think where the models in eight months are gonna be struggling. But what about developing totally novel ideas?... this is a trade-off. I think that if you're doing a PhD, you could also be like, "It's too risky to work in language models. I'm going way longer term," which is like, what is-
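The LoRA fine-tuning mentioned above is cheap because the original weight matrix W stays frozen and only a low-rank update B @ A is trained. A rough sketch of the parameter savings (the 4096-by-4096 matrix and rank 16 are illustrative choices, not figures from the episode):

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters for one layer: full fine-tuning updates all of
    W (d_out x d_in); LoRA freezes W and trains A (rank x d_in) and
    B (d_out x rank), so the effective weight becomes W + B @ A."""
    full = d_out * d_in           # parameters updated by full fine-tuning
    lora = rank * (d_in + d_out)  # parameters updated by LoRA
    return full, lora

full, lora = lora_trainable_params(4096, 4096, 16)
# roughly 16.8M weights per matrix under full fine-tuning
# versus roughly 131k trainable weights under rank-16 LoRA
```

Applied across every attention and MLP matrix in a seven-billion-parameter model, this is why LoRA runs fit on hardware that full fine-tuning never could.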

    22. LF

      Mm.

    23. NL

      -what is the thing that's gonna define language model development in ten years? Which I think that I end up being a person that's pretty practical. I mean, I went into my PhD where it's like, "Oh, I got into Berkeley. Worst case, I get a master's, and then I go work in tech." It's like I'm very practical about it, so I'm like, the life afforded to people to work at these AI companies, the amount of-- like, OpenAI's average compensation is over a million dollars in stock a year for an employee. For any normal person in the US, getting into an AI lab is transformative for your life. So I'm pretty practical of like-

    24. LF

      Mm-hmm

    25. NL

      ... there's still a lot of upward mobility working in language models if you're focused, and the outcomes are like, look at these jobs. But from a research perspective, the transformative impact and these academic awards, it's like, be the next Yann LeCun, is from not working on, not caring about language model development very much.

    26. LF

      It's a big financial sacrifice in that case.

    27. NL

      So I get to work with some awesome students, and they're like, "Should I go work in an AI lab?" And I'm like, "Uh, like, you're getting a PhD at a top school, or you're gonna leave to go to a lab?" I'm like: I don't know. Like, if you go work at a top lab, I don't blame you. Don't go work at some random start-up that might go to zero, but if you're going to OpenAI, I'm like, it could be worth leaving a PhD for. [chuckles]

    28. LF

      Let's more rigorously think through this. Where would you give a recommendation for people to do a research contribution? So the options are academia, so get, get a PhD, spend five years publishing. Compute resources are constrained. There's, uh, there's research labs that are more focused on open-weight models, and so working there, or closed frontier labs, research labs-

    29. NL

      The-

    30. LF

      ... OpenAI, Anthropic, xAI, so on.

  11. 2:21:032:24:49

    Work culture in AI (72+ hour weeks)

    1. SR

      Yep.

    2. LF

      Can you describe nine-nine-six, a culture that was, I believe you could say, invented in China and, uh, adopted in Silicon Valley? What's, what's nine-nine-six? It's nine AM to nine PM and-

    3. SR

      Six days a week. [chuckles]

    4. LF

      Six days a week. What is that? Seventy-two hours. Okay, so what-- is this basically the standard in AI companies in Silicon Valley?... more and more of this kind of grind mindset?

    5. SR

      Yeah, I mean, not, maybe not exactly like that, but I think there is a trend towards it. And it's interesting, I think it almost flipped because when I was in academia, I felt like that because, uh, as a professor, you had to write grants, you had to do t-- you had to teach, and you had to do your research. It's like three jobs in one, and it is more than a full-time job if you want to be successful. And, um, I feel like now, like Nathan just said, [chuckles] the professors, in comparison to a lab, I think they have less, like, even maybe pressure or workload than at a frontier lab because-

    6. NL

      I think they work a lot. They're just so fulfilled by-

    7. SR

      Yeah. Yeah.

    8. NL

      Like working with students and having a constant runway of mentorship and, like, a mission that is very people-oriented, I think in an era when things are moving very fast and are very chaotic, is very rewarding to people.

    9. SR

      Yeah, and I think at a start-up, I think it's this pressure. It's like you have to make it, and it's like it is really important that people put in the time, but, well, it is really hard because you have to deliver constantly. And I've been at a start-up. I had a good time, but [chuckles] I don't know if I could do it forever. It's like an interesting pace, uh, and it's exactly like we talked about in the beginning. Uh, these models are leapfrogging each other, and they are just constantly, like, trying to take the next step compared to the competitors. It's just ruthless, I think, right now.

    10. NL

      I think this leapfrogging nature and having multiple players is actually an underrated driver of language modeling progress-

    11. SR

      Mm-hmm, mm

    12. NL

      ... where competition is so deeply ingrained to people, and these companies have intentionally created very strong culture. Like, Anthropic is known to be so culturally, like, deeply committed and organized. I mean, like, we hear so little from them, and everybody at Anthropic seems very aligned, and it's like being in a culture that is super tight and having this competitive dynamic is like, talk about a thing that's gonna make you work hard and create things that are better.

    13. SR

      Mm-hmm.

    14. NL

      So I think that this-- but that comes at the cost of human capital, which is like, you can only do this for so long, and people are definitely burning out. I think I've-- I wrote a post on burnout, as, like, I've tread in and out of this myself, especially trying to, like, be a manager of full model training. It's a crazy job doing this. In the book Apple in China by Patrick McGee, he talked about the-- how hard the Apple engineers worked to set up the supply chains in China, and he was like... They had marriage-saving programs, and he said on a, a podcast, he was like, "People died from this level of working hard." So I think this is just like, it's a perfect environment for creating progress at [chuckles] human expense, and I-- it's-- there's gonna be a lot of-- there's a lot of... The human expense is the nine-nine-six that we started this with, which is like-

    15. SR

      Yeah

    16. NL

      ... people do really grind.

    17. SR

      I also read this book. I think they had a code word for if someone had to go home to spend time with their family to save the marriage, and, uh, it's crazy. Then, then colleagues would honestly say, "Okay, this is like red alert for this situation. We have to let that person go home this weekend," and, um, but at the same time, I don't think they were forced to work. It's really that they were so passionate about the product, I guess, that it is, it is-- you, you get into that mindset, and I, I had that sometimes as an academic but also as an independent person. I have that sometimes. I overwork, and it's unhealthy. I had, you know, I had back issues, I had neck issues because I did not take the breaks that I maybe should have taken, but it's not because-- no one forced me to. It's because I wanted to work, because it's exciting stuff.

    18. NL

      That's what OpenAI and Anthropic-

    19. SR

      Yeah

    20. NL

      ... are like. They want to do this work.

  12. 2:24:492:28:46

    Silicon Valley bubble

    1. SR

      Yeah.

    2. LF

      Yeah, but there's also, there's also a, a feeling, a fervor that's building, especially in Silicon Valley, aligned with the scaling laws idea, where there's this hype where the world will be transformed on a scale of weeks, and you want to be at the center of it. And then, you know, um, I'm-- I have this great fortune of having conversations with a wide variety of human beings, and from there I get to see all these bubbles and echo chambers across the world, and it's fascinating to see how we humans form them. And I think it's fair to say that Silicon Valley is a kind of echo chamber, uh, a kind of, um, silo and bubble. I think bubbles are actually really useful and effective. It's not necessarily a negative thing because it could be ultra productive. It could be the, the Steve Jobs reality distortion field, 'cause you just convince each other the breakthroughs are imminent, and by convincing each other of that, you make the breakthroughs imminent.

    3. NL

      Mm-hmm. Byrne Hobart wrote a book classifying bubbles, but essentially one of them is financial bubbles, which is like speculation, which is bad, and the other one is like, I don't know the term, but effectively for build-outs because it pushes people to build these things. And I do think AI is in this, but I worry about it transitioning to a financial bubble, which is like, it's-

    4. LF

      Yeah, but also in the space of ideas, that bubble, you are doing a reality distortion field, and that means you are deviating from reality. And if you go too far from reality while also working, you know, nine-nine-six, and you, you might miss some fundamental aspects of the human experience, including in Silicon Valley, and this is a common problem in Silicon Valley. It's like, it's a very specific geographic area. You might not understand the Midwest perspective, the full experience of all the other different humans in the United States and across the world, and you, and you speak a certain way to each other, you convince each other of a certain thing, and that, that can get you into real trouble. Whether AI is a big success and becomes a powerful technology or it's not, in either trajectory, you can get yourself into trouble. So you have to consider all of that. Here you are, a young person trying to decide what you want to do with your life.

    5. NL

      The thing that is... I don't even really understand this, but the SF AI memes have gotten to the point where permanent underclass was one of them, which was the idea that the last six months of twenty twenty-five was the only time to build durable value in an AI start-up or model. Otherwise, all the value will be captured by existing companies, and you will therefore be poor, which-... like, that's an example of the SF thing that goes so far. I still think, for young people that are going to be able to tap into it, if you are really passionate about wanting to have an impact in AI, like, being physically in SF is the most likely place where you're going to do this, but it has, it has trade-offs. [chuckles]

Episode duration: 4:25:12


Transcript of episode EV7WhVT270Q
