Y Combinator
Inference, Diffusion, World Models, and More | YC Paper Club
- 0:00 – 0:12
Intro
- Francois Chaubard
[on-hold music] All right. Hello, everyone. How you guys doing?
- 0:12 – 3:49
Intro from YC Visiting Partner Francois Chaubard
- Francois Chaubard
Welcome to the first ever YC Paper Club. [applause] This is, like, a very exciting thing. Absolutely thrilled with the response. We had over a thousand folks that applied to come in. Uh, it was a very hard selection. If you guys have friends that didn't make the cut, I'm very sorry. [laughs] We're, we kinda-- we need to keep it to about a hundred. Um, and so we selected a very, very cool group. Um, the mission is to create this kinda community of great founders and great researchers and try to pull them together. I guess just for you guys to get a sense for how cool the people in this room are, um, raise your hand if you have at least five citations. [laughs] Ten citations. A hundred citations. A thousand citations. Wow, this is insane. Okay, ten thousand citations. Oh my God. Okay. [laughs] All right. This is awesome. I, I would go up to three hundred thousand, but I think it's like Chris Manning, and that's about it. Um, so... [laughs] Uh, raise your hand if you've raised at least a million dollars. Raise your hand if you've re-raised at least five million dollars. At least ten million dollars. At least fifty million dollars. We still got one. We still got two over here. [laughs] All right. Okay. Awesome. The hidden mission that I'll also kind of add on this is we had, uh, Harj and I had, um, this, uh, awesome, uh, breakfast in, uh, Woodside, and this place is so, so unique and special, and we kinda just don't use it enough at YC. So the hidden mission is to make Pioneer great again. And so I went through Winter '16 here. Um, it was an unbelievable time. I think a hundred and forty companies went through that batch. Ten of the, uh, fifteen of them are unicorns. It's an insane number. Um, Rappi, uh, Astranis, um, Deepgram, all these companies were in the batch. 
And during that time, uh, Sam was still running the show, and basically s-sitting right there would be me, Andrej Karpathy, Wojciech Zaremba, and Greg Brockman because they were starting this thing called OpenAI, and it was like the very early stages, and there was, like, not that many AI companies. And so they would ask me and Steve from Deepgram, like, "What are you guys, what are you working on? What are the problems you're working on?" They're looking for problems 'cause they didn't even know what to research. And so it's such a, such a special time. This place is so special, uh, to, to me i-in particular, uh, to Harj as well, and we just, the, it's, it, we don't really use it enough. So I wanted, um, to kinda make this community down here. And I also think that a hundred percent of the AI talent or a-AI people in the Bay Area, probably about half of them are in the city, maybe is a good number. There's Anthropic, uh, there's OpenAI, there's Cursor, there's all this stuff in the city. Then there's a lot that are down here that are not making the trek up to the city to join YC, and so he's like, "Yes, emphatically yes." Um, and so you have Google DeepMind around the corner, you have, um, Tesla, you have xAI, you have Thinking Machines, you have all these other people in Palo Alto. You have a lot of startups. And so, uh, I wanted to kinda, like, solve six birds with one stone and kinda pull together this community down here as well, and Harj, uh, uh, is super excited about it as well. And so thank you very much, Harj, for letting us do this. We got, uh, five great papers here coming up. The first one is Tanishq, Speculative Speculative Decoding. You wanna come up? [applause] All right.
- Tanishq Kumar
It's all yours.
- Francois Chaubard
Do you want me to pull it on? Yeah, I got you.
- 3:49 – 18:33
Tanishq Kumar — Speculative Speculative Decoding (https://arxiv.org/abs/2603.03251)
- Tanishq Kumar
Cool. I know it, uh, looks like maybe I was sloppy and I added an extra word in the title, but, uh, it is intentional, um, and it'll make sense in a good time. Um, my name's Tanishq. I'm a grad student at Stanford. Um, this is a project I worked on with Tri Dao and Avner May. I'm going to be evangelizing inference for people today. Hopefully, you'll be inference enjoyers by the end. So I'm not sure how much I have to motivate inference. I worked on training before inference, and I sort of-- the sort of mental model I had in mind for how inference works was, you know, you do this beautiful craftsmanship during the training process, and you get these like, you know, very intricate weights, and then you kinda just hand it off and use them to generate tokens. In my mind, it's sort of like you have the weights, just multiply the matrices. It's why do you need a team for it? Um, I was very confused, but there is, in fact, a lot of subtlety involved. Um, it's a lot of fun, the algorithms and systems behind inference at scale. I'm not sure I need to spend too long talking about why inference is important. Um, there is one point I wanna make that I don't hear people talk about enough. So things you may have heard are that inference costs are high. They dominate training costs when you're serving a model for billions of users or, you know, ten Claude Code power users, that's trillions of tokens. Um, not only are inference costs dominating training costs, but even within training, RL is starting to exceed the compute requirements of pre-training. And what is RL but a wrapper on inference, right? So these are two things you've probably heard before. The third is one I fear isn't really talked about, but it's the reason that I started working on inference, and I use the phrase working on inference lightly. This was the only inference project I've ever done. Um, but the, the reason I got interested in making inference fast was not because of cost or for convenience.
It was entirely because of capability. So the claim I'm gonna make, and maybe this is the one thing to take away from the message I'm trying to send in this talk, is that inference today is seen as a sort of like cost or convenience lever. But, uh, in one, two, or three years, inference is gonna be seen as a capability. And what I mean by that is that if you have a method, an algorithm, a system where its performance scales with the amount of thinking it does- Then fundamentally, the speed at which you can do inference, the tokens per second, is exactly the peak intelligence that you can deliver. So inference should be thought of as not so much as a, a cost or, or convenience factor, but as a capability. Um, and that's why I got interested in it. I, I wanted to work towards the future where we have an entire data c- data center of twenty thousand B200s just working on the Riemann hypothesis. Um, okay, yes, that's the future that, uh, I had in mind. Perhaps this meme is a little outdated because it has an A100 on it, but, uh, yeah. Okay, so to motivate things, here is an example of fast inference. So I'm gonna give you a little demo of, uh, three algorithms side by side. We're gonna sample, you know, a code prompt from vLLM with just normal autoregressive decoding. We're gonna use their speculative decoding, and then I'm gonna put next to it the sort of janky hand-rolled inference engine I wrote over a summer for this project, um, whose main strength is just that it implements a new algorithm. And so you can see them side by side, SSD is on the right, and you can see it is quite a bit faster than what you can get if you tried to use an open source engine. Um, and it's not the systems, it's, it's the algorithm. Um, so yeah, that's what we wanna work towards, understanding both how speculative decoding works as well as the algorithm on the right. 
Okay, um, I'll start by introducing what speculative decoding is, how it works, and then we'll move into what speculative speculative decoding is. I hope that if you have, like, a reasonably strong understanding of how speculative decoding works, the, the problem that SSD is trying to solve will feel very motivated and, and the algorithm should just become clear in good time. Okay, so this is the schematic I'm gonna use to explain how vanilla speculative decoding works. Um, it has a small model, the tiny llama up top, as well as a big model, the big llama, and our goal is simply to sample fast from the big llama. We want tokens generated from the big model, and we're gonna use a small model as a sort of proxy or an instrument to be able to sample quickly from the big model. Okay, so what the draft is gonna be responsible for is basically generating a bunch of tokens one by one. One by one is important. It's autoregressive, so you need to do three forward passes on the draft or, you know, however many, some constant number. Um, and these are going to be guesses for what the draft believes that the big model is going to output next. It wants to sort of predict ahead of time. The job that the big model has, I'm gonna call it the target model, is verifying these guesses. What does verification mean? Verification means doing one forward pass over these generated tokens to see how likely it is that the big model would have generated them. The sort of key asymmetry here, the reason that speculation works, is that it is easier to verify than to generate. This is a feature of the transformer architecture where you can get the probabilities for many tokens in a sequence in parallel in one forward pass, um, but you can't generate them in parallel. Autoregressive decoding as, uh, one at a time. Um, so we're leaving the autoregressive decoding, which is slow, uh, to a very quick and small model, and then we're doing just one forward pass on these tokens. 
And the way you verify tokens is basically by having the big model look at the probabilities of each of the generated tokens and see how plausible it is that it would have generated those tokens. And sort of the intuition here is that we will accept precisely those tokens that the big model could plausibly have generated. Its probabilities were reasonably high. There's subtleties in exactly what the algorithm is, um, that I'm gonna gla- gloss over, but that's the way to think about it. Um, and then we're gonna find a point perhaps where we don't think it's plausible the big model would have generated those tokens, and we're gonna reject those tokens. So in the little schematic on the right, uh, there, the draft samples three, and the big model verifies them and concludes that only the first token was something it would plausibly have generated. It will reject the second token onwards. And importantly, this is a sort of critical but subtle detail of vanilla speculative decoding. Because you have the probabilities at each of the sequence positions, you can sample an extra token at the point at which you rejected a token for free, as in without doing any more forward passes. And so that yellow token is what I'm gonna call a bonus token that you sample for free. This is gonna be important in SSD. Um, so yeah, that's, uh, that's an important conceptual point, and this sort of sets the stage for how SSD works. Okay, we have our schematic, and the way we've set up speculative decoding is that it's a way to exchange flops for latency. So speculation in general is not actually something that, uh, only LLMs do. It's like a, a deep idea in computer science. It's used in CPUs as well, where the general philosophy is that you precompute something ahead of time. Some of what you precompute may be useless because it may be an incorrect prediction of the future, but if you're right, you get to fast-forward in time, um, and you get lower latency as a result. 
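The accept/reject step and the free bonus token described above can be sketched in a few lines. This is a minimal sketch, assuming the standard speculative sampling rule (accept a drafted token with probability min(1, p/q), and on rejection resample from the positive part of p − q); the talk explicitly glosses over the exact rule, so treat that choice as an assumption. The names `verify_draft` and `sample_from`, and the use of plain dicts for token distributions, are hypothetical.

```python
import random


def sample_from(weights, rng):
    """Sample a key from a dict of non-negative weights (need not sum to 1)."""
    items = [(tok, w) for tok, w in weights.items() if w > 0.0]
    total = sum(w for _, w in items)
    r = rng.random() * total
    cumulative = 0.0
    for tok, w in items:
        cumulative += w
        if r < cumulative:
            return tok
    return items[-1][0]  # guard against floating-point round-off


def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Verify k drafted tokens against the target model's distributions.

    draft_probs[i]  : draft distribution (dict token -> prob) at position i
    target_probs[i] : target distribution at position i; has k + 1 entries,
                      because one target forward pass also scores the
                      position after the last drafted token.
    Returns (accepted_tokens, bonus_token).
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        q = draft_probs[i][tok]
        p = target_probs[i].get(tok, 0.0)
        # Assumed rule: accept with probability min(1, p / q).
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)
            continue
        # First rejection: the bonus token comes from the residual
        # distribution max(0, p - q), with no extra forward pass.
        residual = {
            t: max(0.0, pt - draft_probs[i].get(t, 0.0))
            for t, pt in target_probs[i].items()
        }
        return accepted, sample_from(residual, rng)
    # Everything accepted: the bonus token is sampled from the target's
    # distribution at the position after the last drafted token.
    return accepted, sample_from(target_probs[len(draft_tokens)], rng)
```

With three drafted tokens where the target agrees only on the first, the function returns one accepted token plus one bonus token, which mirrors the schematic in the talk.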
So the, the sort of like moral philosophy of speculative decoding is that it's currency exchange. The difficulty with normal speculative decoding is that you can't push this arbitrarily far. You cannot keep sampling more and more tokens on the draft and keep getting speed-ups because at some point, you're gonna get to a point where you're spending a lot of time drafting and you're not accepting all that many tokens. And in particular, like a big bottleneck in vanilla speculative decoding is the sequential dependence between the small llama and the big llama. Um, the drafting in round T has to take place before the verification of those tokens. Um, and the drafting in round T plus one can't take place before you know the outcome of verification of the previous round because you need that as a prefix to draft, draft on top of. So there's a logical de- dependency here. The goal of SSD is very simple. There's a lot of gnarly and subtle details, but the high-level idea is incredibly simple. It is simply to parallelize the sequential operation. We want drafting and verification to be happening at the same time. Normally in speculation, they happen on the same hardware, and that's fine because there's only one of them happening at a time. In our setup, they're gonna be happening at the same time, so we're not gonna be co-locating them. And the main question basically becomes, how do you parallelize this inherently sequential algorithm that has a logical dependency? Um, and the way we're gonna do that is we are going to have the draft model send back its draft tokens in a certain round. So we've sent back a bunch of blue tokens. That's now the job of the verifier to do a forward pass over and verify. And this is gonna take a while because the verifier is a big model. What we on the draft are gonna do is basically start anticipating the most likely verification outcomes immediately as soon as we send back like a certain round of speculation.
And once we, we have in mind some of the most likely verification outcomes, we are going to start drafting the next round on top of those immediately while verification is taking place. If we're right, the next time the verifier asks for a draft, we'll have it ready immediately. The-- We're entirely hiding the latency of drafting. If we're wrong, well, we'll have to figure out a backup strategy, and there's, uh, there's, there's, there's some subtleties on what you do and how you do it there. Um, so yeah, the way that Speculative Decoding looks like this, and perhaps unsurprisingly, the analog for SSD is this diagram on the right, where now drafting and verification happen in parallel. Um, the, the principal difficulty or algorithmic design space in SSD is how do you predict verification outcomes ahead of time? I thought verification is where you are leveraging the intelligence of the big model that should by construction be difficult to predict. Um, and the intuition for why it's plausible at all is that you can make many guesses on the draft for what a verification outcome is. And a verification outcome here is just, you know, a plausible number of accepted tokens and then a bonus token o-on top of that. Now, this is hard to predict because a bonus token comes from a vocabulary which has size, you know, tens to hundreds of thousands. Um, so it's a large space to cover. Um, but it turns out you can do it well, um, reasonably well. You can get it right about eighty to ninety percent of the time, which is more than enough to get big speedups. And the way we do that, the short of it, is basically we use information on the draft to predict what the verification outcome is likely to be. When we generated the blue tokens on the draft, we had other tokens that we chose not to sample. Those other tokens are plausible verification bonus token candidates. 
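To make the idea of anticipating verification outcomes concrete, here is a toy sketch of a speculation cache. It is not the paper's implementation: the bigram table standing in for a draft model, the function names, and the `top_m` parameter are all hypothetical. A verification outcome is modeled as a pair (number of accepted tokens, bonus token), and bonus candidates are the tokens the draft ranked highly but did not sample.

```python
def toy_draft_dist(prefix):
    """Hypothetical stand-in for a draft model: next-token distribution
    that depends only on the last token of the prefix."""
    table = {
        "the": {"cat": 0.6, "dog": 0.3, "sat": 0.1},
        "cat": {"sat": 0.7, "ran": 0.2, "the": 0.1},
        "dog": {"ran": 0.6, "sat": 0.3, "the": 0.1},
        "sat": {"the": 0.8, "cat": 0.1, "dog": 0.1},
        "ran": {"the": 0.9, "cat": 0.05, "dog": 0.05},
    }
    return table[prefix[-1]]


def greedy_draft(prefix, k):
    """Draft k tokens one by one (autoregressively) with the toy model."""
    out = []
    for _ in range(k):
        dist = toy_draft_dist(prefix + out)
        out.append(max(dist, key=dist.get))
    return out


def build_speculation_cache(prefix, drafted, top_m, k):
    """Pre-draft the next round for the most likely verification outcomes.

    Runs while the verifier is still busy with `drafted`. Keys are
    (n_accepted, bonus_token); values are the pre-drafted next round.
    """
    cache = {}
    for n in range(len(drafted) + 1):
        dist = toy_draft_dist(prefix + drafted[:n])
        # At a rejection position the bonus token is not the drafted token,
        # so the unsampled alternatives are the candidates.
        candidates = {
            t: p for t, p in dist.items()
            if n == len(drafted) or t != drafted[n]
        }
        best = sorted(candidates, key=candidates.get, reverse=True)[:top_m]
        for bonus in best:
            cache[(n, bonus)] = greedy_draft(prefix + drafted[:n] + [bonus], k)
    return cache


def next_draft(cache, prefix, drafted, outcome, k):
    """Return (next_round_tokens, cache_hit) once the real outcome arrives."""
    n_accepted, bonus = outcome
    if outcome in cache:
        return cache[outcome], True  # drafting latency fully hidden
    # Cache miss: fall back to drafting just in time (one possible strategy).
    return greedy_draft(prefix + drafted[:n_accepted] + [bonus], k), False
```

In a real system the cached branches would be decoded in parallel on top of a shared prefix rather than in a Python loop; the loop here only shows which branches get prepared and how a hit or miss is resolved.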
And so you basically use information from the token distributions of the draft model to predict what likely outcomes on the target are. And then once you have all of these predictions, you can decode them in parallel as just different sequences that you're decoding on top of a shared prefix. And voila, it, uh, it's, it gives you speedups because you get to hide the latency of drafting altogether. Um, there's also a, a, an additional bonus that since verification actually kind of takes a while, you get more time to draft, uh, in the first place. So you can draft more tokens, which increases the expected tokens per round and sort of gives you f-further speedups. There's a bunch of stuff that we work through in the paper that's, uh, that's sort of reckoning with the, the implementation details of this. One of it is how you handle cache misses. One plausible thing you could do, perhaps naively, is to just fall back to ordinary speculation just in time. Turns out that actually this is not always optimal. Um, there's trade-offs. You know, as batch size increases, you're gonna fail to predict some of the sequence's verification outcomes, um, and so you need different ways to predict and handle cache misses. Should you be allocating your compute on the draft equally amongst plausible prefix length? Uh, the short answer is no. You can be clever about it, and all of this trickery just helps you increase your cache hit rate, so to speak, the, uh, amount of time you're able to correctly predict verification outcomes. And there's, there's some trade-offs between cache hit rate and the actual quality of the drafting you're doing. Um, and this is totally non-obvious. Um, and, and, and we, we go into why that exists and how you can navigate it in the paper. Um, I'm happy to talk about it in, in, in Q&A as well. Um, okay, so what do you get for the, the price of this, uh, mind-numbing complexity and, uh, pain wrangling an inference engine? 
Well, you get the privilege of watching a number go up, which I guess is the North Star of all AI research. And so here we have, uh, a bunch of inference algorithms and inference engines. The blue ones are sort of, uh, my inference engine, and, uh, the light blue is just the im-- baseline implementation of Speculative Decoding. The red is SGLang, which is, you know, of all the inference engines we tried, the fastest with Speculative Decoding, and the dark blue is, is SSD. Um, and normally Speculative Decoding, um, is a, is a win for latency, but it's sort of unclear whether it's useful for throughput. Um, for us it turn-- in, in, in this setting, it's actually a win for both. Um, and so you get numbers going up, and you also get the ability, next time you are at a San Francisco house party, um, to see other people dancing and knowing in the corner that, uh, you know what it takes to sample at three hundred tokens per second, uh, for Llama 3 70B on four H100s. So this is, uh, sensitive information. Um, but yeah, that's, that's about it. Thank you. [audience applauding]
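A back-of-the-envelope model helps show why overlapping drafting with verification raises tokens per second. This is a simplified sketch under stated assumptions: each drafted token is accepted independently with probability `alpha` (which gives the usual geometric-series expression for expected tokens per round, bonus token included), a cache hit costs only the longer of drafting and verification, and a miss falls back to sequential drafting then verification. None of the parameter values come from the talk except that the hit rate is said to be roughly eighty to ninety percent.

```python
def expected_tokens_per_round(alpha, k):
    """Expected tokens emitted per round (accepted tokens plus the bonus
    token), assuming each of k drafted tokens is accepted independently
    with probability alpha: (1 - alpha**(k + 1)) / (1 - alpha)."""
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def vanilla_tokens_per_sec(alpha, k, t_draft, t_verify):
    """Sequential speculation: draft k tokens, then verify."""
    round_time = k * t_draft + t_verify
    return expected_tokens_per_round(alpha, k) / round_time


def overlapped_tokens_per_sec(alpha, k, t_draft, t_verify, hit_rate):
    """Drafting overlapped with verification (simplified model).

    On a hit, the round costs max(drafting, verification); on a miss,
    it costs the full sequential time.
    """
    hit_time = max(k * t_draft, t_verify)
    miss_time = k * t_draft + t_verify
    round_time = hit_rate * hit_time + (1.0 - hit_rate) * miss_time
    return expected_tokens_per_round(alpha, k) / round_time


if __name__ == "__main__":
    # Illustrative numbers only (seconds per forward pass are made up).
    args = dict(alpha=0.8, k=4, t_draft=0.002, t_verify=0.02)
    print("vanilla   :", vanilla_tokens_per_sec(**args))
    print("overlapped:", overlapped_tokens_per_sec(hit_rate=0.85, **args))
```

The model ignores batching effects and the extra draft compute spent on branches that are never used, both of which the talk flags as real trade-offs.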
- Francois Chaubard
All right. That was awesome. Okay, so for this next paper, this is, um, my first experience being scooped. The only issue is that he didn't talk to me, and he did it six months before me. Um, [laughs] but, uh, Isaac can vouch for me on this, and maybe Robert as well. I ha- basically w- fell in love with the diffusion policy paper. I was like, "This is definitely like, you know, a full, uh, predicting like TH horizon steps for your robotic control. Um, we have these amazing video models. Why don't we just use the video model to like run this, like at test time to like play out the movie, and where do I end up?" And then you have your classic Push-T. And then I started like looking around, uh, and then DeepMind of course already did it. So, [laughs] so I wasted like a month, and I was not happy. But anyway, thank you very much. Please welcome Stannis.
- 18:33 – 30:26
Guangyao (Stannis) Zhou — Diffusion-MPC (https://arxiv.org/abs/2410.05364)
- Guangyao (Stannis) Zhou
Hi, everyone. I'm Stannis. I'm a staff research scientist at, uh, Google DeepMind. Uh, currently, I'm, uh, co-leading a new project on world modeling for robotics, uh, where we try to build general purpose policies on top of, uh, video and world models. But, uh, this is an early work that, uh, I did about, uh, two years ago. Uh, so this was, uh, before I switched to working on hardcore robotics and, uh, going into hardware, really scaling up the data. But, uh, you can probably see a lot of, uh, very similar ideas, early version of ideas, uh, demonstrated on toy problems. Okay. So, uh, first, to give some background, what is, uh, model predictive control? So model predictive control, also called, uh, receding horizon control, uses a dynamics model, or some people also call it a world model, and, uh, a action selector mechanism, uh, which is a planner to construct agents that can solve a wide variety of tasks by means of, uh, maximizing a known objective. So the main advantages, uh, of, uh, model pre-predictive control is, uh, it can adapt to novel reward functions at test time. So, uh, the dynamics model are also easier to learn and, uh, generalizes better than just policies. And, uh, the action proposal dynamics model factorization also allows, uh, easy adaptation to novel dynamics. So we're going to, uh, demonstrate some of these, uh, in later experiments. But basically, here we are showing the overall idea, which is extremely simple. We have a action proposal, which, uh, proposes a sequence of actions. We have a dynamics model, which can evolve these actions and, uh, give you the future states. And, uh, finally, we have some objective functions, uh, that we are trying to optimize. We basically use a planner to optimize that and, uh, pick the actions and, uh, execute it, uh, in the environment. So what is, uh, diffusion model predictive control?
So the motivation, uh, mainly is, uh, uh, there are a couple of problems we need to address in order to make MPC effective in practice. One, the dynamics model needs to be accurate to avoid the problem of, uh, compounding errors. And, uh, two, the planning algorithm also needs to be powerful enough to select a good sequence of actions. So with, uh, DMPC, what we did is, uh, to use, uh, diffusion models to learn both multi-step action proposals and, uh, multi-step, uh, dynamics models. So the advantages are mainly to reduce compounding errors, and we also found that, uh, it can simplify the planning algorithm. Essentially, we can just use a very simple, uh, sampling-based planner, and, uh, we can already outperform a lot of the previous, uh, approaches. So, uh, before we dive into the details, also want to give a hierarchical view of, uh, some related works. We organized, uh, so there are a lot of related works in the literature, and, uh, we organized it, uh, uh, in this way, where we basically look at how different approaches, um ... So basically, all approaches essentially try to build a joint, uh, distribution of, uh, the states and the actions, but, uh, they do it in different ways, and also use the different components in different ways. So, for example, you can build it in a factorized way where you have, uh, ρ(a), which is your policy predicting the actions, and then conditioning on the action predicts the state, which is a dynamics model. And, uh, for this, you have the Dyna paradigm, where you basically learn a model and use the, the model to also generate, uh, data in the imagination and then learn a policy. But, uh, you can also do MPC, uh, where you, uh, essentially use a planner to select the actions. And, uh, we also have, uh, some, uh, uh, there are also approaches where you build a joint model of the state and actions, and you're essentially also doing MPC. And there are also model-free approaches, where you directly learn a policy.
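The factorized view described here is the chain rule applied to a joint distribution over future states and actions. As a sketch, with notation assumed for illustration (an action proposal written as ρ, a dynamics model written as p_d, and a planning horizon of F steps), it can be written as:

```latex
% Joint distribution over the next F states and actions, factorized into
% an action proposal and a multi-step dynamics model (notation assumed).
p\bigl(s_{t+1:t+F},\, a_{t:t+F-1} \mid s_t\bigr)
  = \underbrace{\rho\bigl(a_{t:t+F-1} \mid s_t\bigr)}_{\text{action proposal}}
    \;\cdot\;
    \underbrace{p_d\bigl(s_{t+1:t+F} \mid s_t,\, a_{t:t+F-1}\bigr)}_{\text{dynamics model}}
```

Joint-modeling approaches learn the left-hand side directly with one model, while the factorized approaches learn the two right-hand factors separately, which is what later allows one factor to be swapped or adapted without touching the other.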
Uh, I won't dive into the full details, but, uh, uh, there are ba- basically different trade-offs in terms of, uh, runtime pla-- uh, whether we can do runtime planning and, uh, adapting to novel rewards and, uh, adapting to novel dynamics, leveraging non-expert data, and, uh, also the, uh, general speed at, uh, runtime. And there is also the distinction between whether you're doing single-step modeling or multi-step modeling. Okay. So coming to the diffusion model, diffusion model has, uh, enjoyed, uh, a lot of successes, uh, in, uh, generative AI, especially for generating images and, uh, videos. But, uh, in recent years, uh, they also found a lot of successes in robotics. So currently, uh, so here I'm also showing a slide where, uh, this is, uh, kind of the exploration space for, uh, diffusion-based, uh, I would call it diffusion-based agents. So we of course start with the diffusion policy, where we condition on the observation and, uh, generate future actions. But then we also have this, uh, work called, uh, Diffuser, which, uh, is, uh, uh, you can think of, uh, it as, uh, a way to joint, jointly model, uh, observations and the states, but, uh, in, uh, toy space. There are, of course, uh, uh, these ideas are explored in tons of different papers, but, uh, this is just a very simple and, uh, uh, conceptual way to describe it. And, uh, then there's also Decision Diffuser, where we condition on the observations. We directly generate future, uh, we condition on the history, directly generate future observations, and then train a separate inverse dynamics model to derive the actions. And, uh, finally, we have, uh, the diffusion model predictive control, where we first have a action proposal to propose future actions and use a dynamics model to evolve it and, uh, then use, uh, a planner to select the actions. There are different, uh, trade-offs among these.
So, for example, diffusion policy is, uh, sort of, um, complex, uh, complex control, like, uh, day to day, we still rely on it, uh, a lot. But, uh, this requires expert demonstrations, so essentially you can't, uh, move out of the behavior cloning paradigm. Uh, for Diffuser, it's, uh, jointly modeling state and, uh, action. So it has, uh, implicit world modeling and also world model-based planning. And this is actually something that, uh, we are trying to explore at, uh, scale, uh, similar ideas. But, uh, and then there's also, uh, Decision Diffuser, where you do ob- observation-only learning. The main benefit of this is, uh, it allows you to leverage of, uh- uh, video-only data, to learn from video-only data, because, uh, for robotics, uh, the data is a, a main bottleneck. And then finally, there's a Diffusion MPC, which, uh, allows us, uh, to do runtime adaptation to novel rewards and, uh, novel dynamics. So what does the algorithm look like? It actually is, uh, extremely simple. We have, uh, uh, offline dataset, and, uh, we have, uh, some hyper parameters. Essentially, we are learning a couple of, um, uh, learning a couple of, uh, models, all from the offline datasets. We are learning a policy which, um, uh, given the current observation, predicts the actions. We're learning a dynamics model which, uh, given the ob- uh, given the actions, uh, evolves the, the observations, uh, to predict the future states. And, uh, uh, basically, after learning all this, at, uh, um, at, uh, inference time, when we actually deploy it as a policy, we, uh, sample the action proposal and, uh, score it, uh, uh, rank it, and, uh, pick the best. But, uh, the main difference, uh, compared to previous approaches is, uh, we adopted, uh, a multi-step action proposal, which, uh, is, uh, essentially very similar to a diffusion policy. But, uh, if you train it on more diverse data, it can give you, uh, more coverage in terms of, uh, the action space.
And, uh, we are also using a multi-step, um, uh, dynamics model, which, uh, allows you to, uh, evolve for a long time horizon without, uh, a lot of, uh, compounding error. And, uh, this, uh, allows us, uh, to... Uh, and also, uh, there's the fact that, uh, we leverage a diffusion model, which is, uh, a really powerful way to model data, especially multi-modal data. And, uh, uh, what we observed, uh, empirically is, uh, the, uh, stronger modeling, uh, capabilities also allows us, uh, to, uh, simplify the planning algorithm so that, uh, we can just, uh, use, uh, such a simple, uh, planner to, to, to solve the tasks. Yeah. Um, also contrasting with a few of the representative, uh, uh, past works, uh, including, uh, model-based, uh, offline control, uh, offline planning, and, uh, this, uh, diffuser work, uh, which I mentioned. Uh, it learns, uh, a joint model and, uh, uses, uh, classifier-free guidance, uh, for planning. Okay, uh, so yeah. Next, to dive into some, uh, results. Uh, there are lots of numbers, but, uh, the short answer is, uh, we obtain very competitive results in fixed-reward, single-task setups. This is, uh, just to demonstrate, uh, that, uh, uh, the approach, uh, when you deploy it, uh, in a single-reward, uh, fixed-reward, single-task, uh, setup, it can perform competitively to the current, uh, state-of-the-art, uh, uh, previous state-of-the-art approaches. But, uh, I think, uh, there are a couple of, uh, more interesting, uh, properties of, uh, DMPC. One is it can adapt to novel rewards at runtime. Here we are showing some, uh, examples where, uh, essentially we train the model to, uh, these are very simple MuJoCo tasks, but, uh, we train the model to just, uh, uh, locomotion tasks, uh, run forward and, uh, jump, et cetera. But, uh, at, uh, inference time, we can just, by changing the reward function to, uh, make it, uh, exhibit, uh, novel behaviors like, uh, jumping, et cetera.
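The "sample, score, rank, pick the best" planner and the runtime reward swap can be illustrated with a toy example. This is a minimal sketch, not the paper's code: the uniform random proposal and the one-dimensional integrator below are hypothetical stand-ins for the learned diffusion action proposal and the learned multi-step dynamics model, and all function names and default values are made up for illustration.

```python
import random


def propose_actions(state, horizon, rng):
    """Hypothetical stand-in for a learned multi-step action proposal:
    returns one candidate sequence of `horizon` actions."""
    return [rng.uniform(-1.0, 1.0) for _ in range(horizon)]


def predict_states(state, actions):
    """Hypothetical stand-in for a learned multi-step dynamics model:
    a 1-D integrator that returns the state after each action."""
    states = []
    x = state
    for a in actions:
        x = x + a
        states.append(x)
    return states


def plan(state, reward_fn, horizon, num_samples, rng):
    """Sampling-based planner: sample proposals, score each imagined
    trajectory with the reward function, and keep the best sequence."""
    best_actions, best_score = None, float("-inf")
    for _ in range(num_samples):
        actions = propose_actions(state, horizon, rng)
        states = predict_states(state, actions)
        score = sum(reward_fn(s) for s in states)
        if score > best_score:
            best_actions, best_score = actions, score
    return best_actions


def run_episode(start, reward_fn, steps, horizon=5, num_samples=64, seed=0):
    """Receding-horizon control: replan at every step and execute only the
    first planned action. The toy environment equals the toy model here."""
    rng = random.Random(seed)
    x = start
    for _ in range(steps):
        actions = plan(x, reward_fn, horizon, num_samples, rng)
        x = predict_states(x, actions[:1])[0]
    return x
```

Calling `run_episode(0.0, lambda s: s, steps=10)` and then `run_episode(0.0, lambda s: -s, steps=10)` drives the same proposal and dynamics in opposite directions. Only the reward function changes, which is the runtime adaptation property described above.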
So, uh, here's another example where we show that, uh, uh, DMPC can adapt to novel dynamics while, uh, these kind of, uh, joint modeling approaches, uh, struggle. This is, uh, really the benefit of the factorization of the action proposal and, uh, the dynamics model. So the, here, the idea is, uh, we can keep the action proposal the same, but, uh, we, uh, we have, uh, scenarios where the dynamics, uh, of the environment changed. So, for example, the walker has, uh, a broken left ankle, and as, as a result, when it starts to execute actions, the, the consequence of the actions change. So in such cases, uh, because of the factorized, uh, representation in DMPC, we can, uh, simply just adapt the dynamics model on, uh, some play data collected in the new environment, and, uh, we observe that, uh, we can, uh, recover a lot of the performance, uh, because of the changing dynamics. Finally, we dug into the various components of, uh, the DMPC design, and, uh, we demonstrated that, uh, the different components in DMPC basically contributed to improved, uh, performance. Uh, this, uh, the- these include, uh, the diffusion active proposals, action proposals, improve performance, and, uh, simplify the planning. We do multi-step diffusion action proposals, and, uh, the, the fact that we do multi-step also, uh, contributes to improved performance. And finally, multi-step, uh, dynamics modeling also, uh, contributes to improved, uh, performance. Uh, that's it. Thank- [audience applauding]
- Francois Chaubard
All right, and that was the last Google DeepMind paper that they're gonna publish, so good luck out there. Um, this next one is one of my lab mates that I work with a lot, that is the most world model-pilled person that I know. [laughing] And so I can't imagine, you know, anyone else presenting this paper other than Yann LeCun himself. Um, [laughing] Isaac Ward. There you go.
- Isaac Ward
Yeah. Thanks a lot. [audience applauding] All right, guys. Is that a good distance? You all can hear me at the back? Cool,
- 30:26 – 43:54
Isaac Ward — LeWorldModeling (https://arxiv.org/abs/2603.19312)
- Isaac Ward
cool. Yeah, I'm enjoying a, uh, a cool little period in life where I started working on world models a couple years ago, kind of before they got really hot, and now they're enjoying a moment in the sun, and suddenly everyone wants to talk to me, which is nice. I'm presenting LeWorld Modeling, which is a call out, uh, of course, out of Yann LeCun's group. Uh, QR code here if you wanna follow along with the project page, but I'll explain through it. And yeah, really excited to talk to you about this one. Uh, hidden in this presentation is really, like, a billion-dollar question, and it's not hyperbole. Uh, Yann LeCun's raise of $1.03 billion back in March, basically just to train world models is sort of what this presentation is about. I wanna get at some of the questions that they're gonna be testing. First five slides here, just gonna do some basics on world models. I think we've all heard the term, but I wanna just make sure we're all on the same page, and then we'll jump into, uh, what this paper is really, uh, offering and what it means for world models at large. But first of all, world models, what are they? Why do we care about them? So really, it's about learning the dynamics of the world, which is to say we're trying to come up with some model, typically we're using, like, a big neural network to predict how a system will change over time based on its input. So you have your current state or scenario using S for notation here. You're applying some action, maybe that's like a movement or a, a command for a robot, um, or a language command for a robot, and then you're trying to predict, like, what its outcome is gonna be. Like, what scenario will it end up in once it's executed that action? So you're really trying to model the system or the environment that the robot is in, modeling the world. It's a world model. Uh, these kinds of models are really cool. They enable a few really interesting capabilities, and one of them is generating imagined outcomes.
We've probably all seen, like, these sort of weird, kind of, um, hallucinate-y, uh, imagination sequences coming out of world models over the last couple of years. We'll talk more about those and why they're useful. Uh, this let-- allows us to get to model-based control. I'm glad Stannis kind of explained that in the last talk for me, so I'll skip over it. Um, and the last piece is really cool, surprise quantification. Uh, I'll get to that later, um, but a really powerful capability of world models. I wanted to communicate to you all that this is not a new idea at all. It's really just kinda new advertising or packaging on an old idea. So I started going back through Google Scholar, and this is a paper that I think is older than the average age of this room, um, from Europe's 1990. And of course, Richard S. Sutton, who we know from reinforcement learning, basically describes exactly a modern world model, a black box that takes as input its situation and its action that it's gonna execute and outputs a prediction of its immediate next situation. So a really, really old idea, and, uh, that's the fly F in Europe's 1990. Great. So getting a little bit more explicit, um, and changing the notation from state to observation, just because in real world systems, we typically don't have access to the exact true state. We typically have some observation from sensors. Uh, this is just an example that I pulled up from some world models that we're training, uh, on a quadrotor. So as an example, the observation that the quadrotor gets might be its current kinematic state, position, velocity, this kind of thing, in addition to the images that it's taking from a forward-facing camera. The action might be a control input, in this case, a yaw, and move back to the left. And then we want to make a prediction that says, "Well, if you do that action, you're gonna end up slightly back in the room and looking to the left." 
And we actually wanna generate what the sensor, um, would result, uh, in, in this case. So highly, uh, dimensional observations, images, uh, and also LiDAR and things like that are completely on the table in world models. Uh, they're really challenging because action sequences can be quite long, um, and the really big thing is that the minima in the optimization landscape for these kinds of models may not correspond to the desired behavior, and more on that later. Um, but hopefully you'll agree that if you've trained a system that's capable of doing this thing, it must have an internal model of the world. And imbuing agents with an internal model of the world, um, is potentially a very useful capability. And that really is the big question. Are we gonna have model-free or model-based policies? Are our agents gonna have an internal model of the world or are they not? And this is sort of being fought out right now, both in the research community and in, like, the startup community. So on the left, model-free. The idea is you're taking some observations, you're feeding this into some kind of big neural network, potentially with a bunch of interesting learning tricks there, but you're getting some optimal action out. So it's just mapping between observation and some optimal action. But at no point is there an explicit representation of what the future might look like if you execute that action. These kinds of models are pretty good. There is growing evidence to show that internal to these neural networks are highly obfuscated and challenging to interpret world models, uh, sort of in the, in the weights. Uh, I'll talk about a paper very briefly that's, um, speaks to that, and maybe someone can present on it in a future week. And then over on the, um, other side, model-based approaches, right? So now we're saying we're gonna train this world model up explicitly and actually use that in our policy to be able to explicitly predict the outcome of potential actions. 
So yeah, totally, like, two different species of policies. The model-free stuff, some of the weaknesses is they show a little bit of brittleness to out-of-distribution. Um, model-based ones are great 'cause you can kind of quantify modeling error, and this is really important when you're deploying things in the real world. Uh, we'll talk a little about this. I have a little asterisk here, some biological precedent, which we'll speak to more. Um, and you have to have this additional mechanism, of course, which is a downside, where you actually need to propose action candidates to evaluate with the world model, um, which Stannis spoke to in the previous talk. This is a great paper, I just wanted to chuck this in there, uh, which talks about how even model-free based policies do have world models in them, and a really, really cool paper that hopefully can be presented in a future week. Uh, just to make it concrete before we jump into the paper, I wanted to just bring a little toy here just to show you what this looks like. So of course, went to PushT, like all good researchers do, and in PushT, we basically just have an image of a little blue ball agent, and you're trying to push the blue T into the green slot. Uh, the state is comprised-- or the observation is comprised of that image plus the 2D position of the end effector and the 2D action of where you're gonna move the end effector. So you can make a little architecture that looks like this. I just whipped this up, couple hundred thousand parameters and, um... Oh, let's play this. So if that's the actual rollout, this is what the model thinks the action sequence is gonna do. So you can see it's a little bit wobbly 'cause it's a tiny model, but we can certainly train up models of these kinds of toy environments and indeed more complex ones. So what are the challenges associated with training this kind of model? 
Well, one is you're trying to learn the representation of the world, so how you're gonna compactly represent those highly dimensional images or LiDAR inputs or highly dimensional sensor inputs, at the same time as you're trying to learn how actions change that representation. So you're co-learning representation and dynamics, and there are many solutions in the optimization landscape that will essentially just cause you to do nothing. So for example, a, uh, a local minimum in the optimization landscape is to say, "Well, every state is just the same." It's a trivial collapse, basically. Um, and there are many techniques in the literature to say how can you avoid these. So there are solutions of a variety of different kinds that basically say they're a way to avoid the collapse associated with training world models, and that's really where LeWorldModel comes in. It says, "Well, instead of having to use some manner of trick or, like, special method or a bunch of, like, hyperparameter tuning schedule, we're instead gonna really drastically simplify this and go for a more elegant method." So if you know a little bit about world models, there's some popular ones in the top right here. This is a figure straight out of the paper. So PLDM is planning in, uh, with latent dynamic models, DINO, um, Distillation with No Labels world model, Dreamer out of DeepMind, and then Temporal Difference MPC as the final one. So in some way, shape, or form, I'll explain this, they use some kind of trick or, um, like challenging to configure design to get away with, uh, this collapse, to avoid this collapse, and LeWorldModel's coming in and saying basically, "We can do this with sort of one hyperparameter and one loss term," which I'll talk about. There's really no time to go through all the different tricks that different world model approaches use because it really is the Wild West out there right now, so many different methods. 
Uh, but they basically fall into one of these three categories. So one is you could do some explicit heuristic that stops collapse by, like, enforcing some special, um, healthiness in, like, the latent space of your embeddings. Um, the language trick is maybe a bit unfair here, but it's what's used in the paper. Uh, you could use some foundational methods, so you could take some, like, existing autoencoder or diffusion model or video model, uh, and use that as, as a basis for your world model and add an action conditioning element in there. Um, or you could use some privileged data that may not be usually available to the model outside of train time, uh, to be able to avoid collapse. And LeWorldModel, even though it says that it's doing something very different, I really think, uh, it's just offering a new kind of trick, uh, which I'll talk about here. So JEPA is Joint Embedding Predictive Architecture. It's sort of Yann LeCun's main work, and LeWorldModel is a kind of JEPA model. Uh, basically, the way it works is you're gonna take an autoencoder, um, or I should say an image encoder, uh, encode this observation. In this case, it's of a robot doing a push cube task. That's gonna turn that image into a latent vector. In the latent space of this encoder, uh, you're gonna train an action-conditioned forecasting module, this predictor, to be able to predict what is the next latent embedding gonna look like when I execute this action. So not what the next image is gonna look like, but what's the next latent gonna look like. And you can use the decoder attached to that encoder to decode that back out into a useful image. But for the most part, all the interesting work is gonna be done in the latent space. And basically what they say is over a batch, all of those latent embeddings, uh, should be in a healthy distribution, which they describe as a Gaussian distributed, uh, distribution in, in the latent space. 
And thus enters the SIGReg regularizer, which is this sort of new term they add. So S for sketching, as in, uh, doing one-dimensional passes over high-dimensional data, um, I for isotropic, so this should look the same when you slice it in any direction, and G for Gaussian distributed, SIGReg. So basically you're taking all of these embeddings of your different predictions, doing a one-dimensional slice over each direction, like in that high-dimensional space, and then you want each of the curves across those slices to be Gaussian distributed. And if that's true, then your, um, distribution in the latent space must be very healthy. Uh, so the idea is you can quite cheaply evaluate how Gaussian distributed your embeddings are and thus how healthy your world model is and how non-collapsing it is. So essentially they're just saying instead of training up on the normal predict the next, uh, latent, you add on this additional SIGReg term. So I'd argue that basically this paper is just, um, providing a very elegant kind of regularization. And to finish off, I'll just talk about three capabilities that you get from this. So one is the open-loop prediction quality. This is what world models do. So you feed in, like, the context, this PushT at the top, um, and you can see the top row is the real example, the bottom is the imagined, and they look about the same. This is good. It means your world model is really good at predicting what your next action is gonna do. They do that on PushT and then on a slightly, um, like a 3D analog task, like a push cube. This is all great. I love seeing these, um, these plots. Um, but really what matters is how does this actually affect the policy? Like, for the actual task completion, how is this useful? Um, and that sort of brings us into how you can use these models for model predictive control. Basically, you take your initial observation and a goal observation. 
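A rough sketch of the slicing idea described above, in Python. This is not the paper's regularizer: the moment-matching penalty, the function names, and the toy data are all assumptions made for illustration, and the test statistic the paper applies to each slice may well differ. It only shows why one-dimensional projections are a cheap way to tell a collapsed batch of embeddings from a roughly Gaussian one.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sliced_gaussianity_penalty(embeddings, num_slices=64, seed=0):
    """Illustrative penalty (an assumption, not the paper's statistic):
    project the batch onto random directions and penalize each 1D slice
    for deviating from zero mean and unit variance."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    total = 0.0
    for _ in range(num_slices):
        u = random_unit_vector(dim, rng)
        proj = [sum(e_i * u_i for e_i, u_i in zip(e, u)) for e in embeddings]
        n = len(proj)
        mean = sum(proj) / n
        var = sum((p - mean) ** 2 for p in proj) / n
        total += mean ** 2 + (var - 1.0) ** 2
    return total / num_slices


rng = random.Random(1)
# A "healthy" batch: isotropic Gaussian embeddings.
healthy = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(512)]
# A collapsed batch: every observation maps to the same embedding.
collapsed = [[0.5] * 8 for _ in range(512)]

healthy_penalty = sliced_gaussianity_penalty(healthy)
collapsed_penalty = sliced_gaussianity_penalty(collapsed)
print(healthy_penalty, collapsed_penalty)
```

The collapsed batch has zero variance along every slice, so its penalty sits near one or above, while the Gaussian batch scores close to zero.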
I put an asterisk there because how often do you have a goal observation in a robotics task? Like, you don't always know exactly the situation that you wanna end up in, but in this case, that's how they frame it. So they say, you know, "The world looks like this right now. I want the world to look like this." You encode both of those, and then you're basically doing a search over the actions that will get you in the latent space from this starting point to this ending point, and there are well-defined optimization methods to, um, to achieve that. It works pretty well. I'll make it, um, make it simple. LeWorldModel is better than the competition on these, like, small 2D tasks. As soon as you go to 3D, DINO world model wins. It does have a big foundational backbone trained on that kind of image data, so you'd expect it to, um, to win. Um, they run on a really simple environment called Two Room and kind of say, "You know, we don't do so well on this, but that's because we're promoting, like, really high-dimensional healthy embeddings, and it's a very low-dimensional problem." I'm not sure if I'd truly go for that, um, but a good takeaway is that it's about fifty times faster than any of the competition across the board because it's doing all this work in the latent space, and it doesn't have to have any, like, additional tricks relating to more forward passes or, like, having two copies of the model in memory. And, uh, you can actually boot this thing up on, like, a single card with less than twenty-four gigabytes of VRAM, and it's only fifty million parameters, so that is pretty nice. Final piece. This is, uh, what I think is a really cool capability of world models. Um, you can quantify the model error. So basically they just come up with some trajectories that kind of screw with the world model. So the top one is going from left to right. That's time. Uh, so that's just, like, a nominal example. Everything's normal. 
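The search from a starting latent to a goal latent can be sketched in a few lines of Python. Everything here is a toy assumption: the "predictor" just adds the action to a 2D latent, and plain random shooting stands in for whatever optimizer the paper actually uses. The point is only the shape of the loop: propose action sequences, roll them out in latent space, keep the one that ends closest to the goal.

```python
import math
import random


def predict_next_latent(latent, action):
    """Toy stand-in for a learned predictor: the action shifts the latent."""
    return (latent[0] + action[0], latent[1] + action[1])


def rollout(latent, actions):
    """Apply a sequence of actions using only the latent predictor."""
    for action in actions:
        latent = predict_next_latent(latent, action)
    return latent


def plan_by_random_shooting(start, goal, horizon=5, num_candidates=500, seed=0):
    """Sample random action sequences and keep the one ending nearest the goal."""
    rng = random.Random(seed)
    best_actions, best_cost = None, float("inf")
    for _ in range(num_candidates):
        actions = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(horizon)]
        cost = math.dist(rollout(start, actions), goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


# Hypothetical encoded start and goal observations.
start_latent, goal_latent = (0.0, 0.0), (2.0, 1.0)
plan, final_distance = plan_by_random_shooting(start_latent, goal_latent)
print(len(plan), final_distance)
```

In a model predictive control loop you would typically execute only the first action of the plan, observe again, and replan.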
Then they take the same example, but they change the color of the T, and then they take the same example, but they just teleport the T into a different location. And this is really cool because you can actually see the moment they apply those perturbations, you get a spike in the model error, and this is detectable, which is to say world model-enabled agents can quantify how poor their predictions are. They have good estimates of their uncertainty. This is really powerful. Model-free based approaches don't natively give you this stuff. This is my last slide. Um, a few discussion points and broader themes maybe we can chat about here. Obviously, you know, are we gonna go with model-based? Are we gonna go with model-free? Um, what's gonna be the best way to enable intelligent agents to do interesting things in the world? Regularization and representation learning. Um, in this paper, they are co-learning the, uh, representation of the world that the agent has and the dynamics of the world. Should this be separated? Can we take some bioinspiration? Should we use preexisting, um, like foundation models and stuff like that? And then finally, how can we fight, uh, representational collapse elegantly? I think this work does a really great job of that, but the question is still out on what the best way to do it is. So, um, that's my talk. Thanks very much for your attention
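The teleport experiment above can be mimicked with a toy example in Python. The constant-velocity "world model", the threshold, and the trajectory are all hypothetical; the sketch only shows how a one-step prediction error spikes at the moment the world stops behaving the way the model expects.

```python
def predict_next_position(position, velocity=0.1):
    """Toy world model: the object keeps drifting at a constant velocity."""
    return position + velocity


def surprise_per_step(positions):
    """One-step prediction error for each observed step, starting at step 1."""
    return [abs(positions[t] - predict_next_position(positions[t - 1]))
            for t in range(1, len(positions))]


# Nominal drift, with the object teleported by +3.0 from step 10 onward.
positions = [0.1 * t + (3.0 if t >= 10 else 0.0) for t in range(20)]
errors = surprise_per_step(positions)

# errors[i] scores the arrival at step i + 1; flag anything above a threshold.
flagged_steps = [i + 1 for i, err in enumerate(errors) if err > 0.5]
print(flagged_steps)  # prints [10]
```

Only the step where the teleport lands is flagged; once the object resumes its normal drift the error drops back to roughly zero.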
- FCFrancois Chaubard
[clapping] All right. Okay. So for the next two, um, we're kind of focusing on, um, less world model stuff and more heady high-level stuff that I think is pretty interesting. Um, this is a, a paper that's gonna be presented by Akshay, one of the YC, uh, startups here named Q Labs, and you're pr- a co-founder president. You're president of Q Labs? Is that right? Okay. Welcome, Akshay.
- AVAkshay Vegesna
[clapping]
- 43:54 – 51:24
Akshay Vegesna — Deep Learning is Not So Mysterious or Different (https://arxiv.org/abs/2503.02113)
- AVAkshay Vegesna
Hey, everybody. Today, I'm gonna be talking through Andrew Gordon Wilson's paper, uh, Deep Learning Is Not So Mysterious or Different. Uh, we actually work with Andrew on the generalization problem at Q Labs, so I'm really excited for more people to know about his work. The current state of machine learning is that we know that scaling, that scaling models leads to better generalization, but we don't have a mechanistic understanding of why that is the case. Um, yeah, if we can understand generaliz-generalization, then we might be able to optimize for it as well. So the payoff to understanding it is actually really, really large. Um, when you talk to people in the field, they often explain that generalization is a mystery, and they point to examples like overparameterization, benign overfitting, and des- and double descent as reasons why we might not be able to understand generalization at all. So Andrew's work here basically dispels those mysteries by using classical theories of generalization, uh, which, which have to date not really been used to explain things like, like overparameterization thus far. So the first classical theory that we'll go through is, uh, PAC-Bayes. So PAC-Bayes basically bounds the test loss, which is the generalization, this is the quantity that we care about, with the training loss and a compression term. Um, the thing is, in the past, when people overparameterize models, this compression term tends to dominate, and so in practice, these bounds become loose and vacuous, meaning that we can't use them for anything at all. This was basically due to a misapplication of the bound. You can compute the, the compression term in an alternative way as we'll get into sort of later in the talk here. So let's go through the first mystery that, uh, Andrew goes through in his paper. 
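To make the role of the compression term concrete, here is a small Python sketch of one standard finite-hypothesis bound of this flavor, not necessarily the exact bound used in the paper. It assumes a loss bounded in an interval of width `loss_range` and a prior that assigns a hypothesis probability 2^(-bits), where `bits` is its compressed size; the function name and all numbers are made up for illustration.

```python
import math


def compression_bound(train_risk, compressed_bits, num_samples,
                      delta=0.05, loss_range=1.0):
    """Upper bound on test risk holding with probability at least 1 - delta:
    train_risk + loss_range * sqrt((bits * ln 2 + ln(1 / delta)) / (2 n))."""
    complexity = compressed_bits * math.log(2) + math.log(1.0 / delta)
    return train_risk + loss_range * math.sqrt(complexity / (2.0 * num_samples))


# Same training risk and data size, very different compressed sizes.
tight = compression_bound(0.10, compressed_bits=1e4, num_samples=1e6)
vacuous = compression_bound(0.10, compressed_bits=1e9, num_samples=1e6)
print(tight, vacuous)
```

With a loss in [0, 1], any bound above one says nothing. Counting a billion raw bits gives exactly that, while a model that compresses to ten thousand bits gets a meaningful guarantee from the same formula.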
Um, the, the, the mystery that he talks about is overparameterization, and this is basically the idea that as you scale up the, the model parameter size, from the bias-variance trade-off, you would expect that you might overfit. But in practice, we see the opposite. The scaling laws tell us that we actually get better generalization. Um, this scaling and the better generalization from overparameterization is behind, like, the massive gains in model capability over the last couple of years, but we still don't really understand why it improves generalization. So the PAC-Bayes framework gives us a pretty useful way to think about the success of overparameterization. The first is with empirical risk. Empirical risk is basically training loss. When you increase the number of parameters, you can fit your data better. Um, so the empirical risk, the left, uh, the, the first term goes down, and Andrew's work also finds that when we increase the number of parameters, um, we also find more compressible solutions. So this is work by Lotfi et al., and they develop methods to basically compress the, uh, the, yeah, they compress the training set and the model, and they basically find a negative correlation between the, the bits required to encode the training set and the number of parameters. Um, and so we find that as we increase the model size, we can find more efficient encodings of the training set. So the, the second term in this bound also gets lower. Another perspective on this model compressibility point is a perspective of flatness. As you increase the number of parameters, uh, it turns out that the volume of flat minima in parameter space exponentially increases. This is the green region, and, uh, and comparatively, the, the volume of sharp minima increases much less. 
And, uh, this is interesting, and this is useful for the compressibility view because flat minima are known to be more compressible than sharp minima. And so overparameterization fits within existing theories, and through Andrew's work, we actually see useful bounds on generalization, even for models at, at like a billion parameter scale. And so we go to the next so-called mystery of deep learning, which is called, uh, benign overfitting, which Andrew also dispels in, or at, at least partially explains in his paper. So the idea of benign overfitting is that deep neural networks are able to fit totally random noise, but at the same time, they are able to sh- to, to generalize well when you have structured data. The mystery is, how can you have an inductive bias that allows you to generalize well if you can also fit totally random data? I think a regularized polynomial model, um, in Andrew's paper gives us pretty good intuition for how this might be the case. Here you can see that on random data, so section C of the figure, that we have enough parameters to fit the data, and so we, we, we can, we can fit the totally random data. But on structured data, the, the regularization pushes us to use the lower order terms. And so we are able to both get the flexibility, but also have inductive bias that allows us to generalize. And generally, this is, this is the view to take, um, f-for, for neural networks. Like, uh, they are expressive models with a soft inductive bias. Um, we can go through this concept, um, just using this figure right here. So, uh, on the left-hand side, we have an example of, of what's, like, a flexible hypothesis space. And a flexible hypothesis space would allow you to fit the data that you have, but the problem is that you would almost certainly overfit if you, if you, um, if you do not have a bias towards one solution over the other. 
But on the other hand, if you have an inductive bias, you would solve this overfitting problem, but instead you wouldn't, you wouldn't be able to model all the details of reality. Um, and so the middle ground is to have a very expressive hypothesis space, but also have a bias towards solutions that might generalize. For example, in the PAC-Bayes framework, we might want to bias towards more compressible models if we can. And so we see that, uh, deep learning so-called mysteries are actually consistent and partially explained by existing theories such as soft inductive biases and PAC-Bayes. And sort of the thing I wanna leave you with is that, um, if, if we can find the right inductive biases building on these theories, we might be able to optimize for them as well. And by the no-free-lunch theorem, the only way that we get improvements in learning efficiency is through inductive biases. So I, I think that this is-- that working on this problem is, is a really good bet to make. Given the massive sample efficiency gap between AI and humans, we might actually see massive gains in capability if we work on this problem. Um, and so, yeah, that's where I wanna leave you with. Short presentation. [clapping]
- FCFrancois Chaubard
Okay. Um, so for this last paper, then after this we have some boba for everyone. So sit tight, fifteen minutes. Um, this is an idea that, you know, I've been obsessed with. Back to the sample efficiency thing, I think that, like, the two major problems we have left really to solve in, in AI is intelligence per watt, um, and intelligence per sample. And if you compare that to, to where we're at today compared to humans, um, I would say that we're still an order or two of magnitude off on intelligence per watt, uh, and we're me-- like orders of magnitude off on intelligence per sample. I don't know what percent of the internet that you guys have read, but I have not read the entire internet. In Chris Ré's lab in particular, we've been obsessed with this idea that, um, if I have, uh, under the, the, a fixed size amount of data, and I have infinite compute, just go nuts, how much generalization can I actually achieve? And so this is exactly, uh, the paper that starts to answer that question, and I'm really excited to, uh, introduce, uh, Konwoo. [clapping]
- 51:24 – 1:07:16
Konwoo Kim — Pretraining Under Infinite Compute (https://arxiv.org/pdf/2509.14786)
- KKKonwoo Kim
Uh, hi, I'm Konwoo. Um, this is a paper that I co-led with my amazing collaborator, Suhas, as well as Percy Liang also. So part of the motivation for this paper is just the fact that over the past, uh, six or seven years, pre-training has continued to improve model capabilities in pretty surprising ways. So in 2020, with GPT-3, we had sort of the emergence of in-context learning. In 2022, with Anthropic's RLHF, we had sort of the advent of alignment, and maybe most notably in 2024, with both o1 from OpenAI and then later DeepSeek R1, we had the emergence of reasoning. And in fact, even still today, we see that with these newer and bigger pre-training runs, like Mythos and 5.5, the models just continue to get better. And so because pre-training is very expensive, a lot of the focus on the research side of things has been on how do we improve compute efficiency. And in general, people have found that to improve compute efficiency, you need to scale both the number of parameters in your model and the number of data points that you train your model on. And so these were quantified with the so-called Chinchilla scaling laws. The problem with compute efficiency is that we're soon going to be constrained by data. And so if you look at these sort of public projections of the rate of growth of internet data, they suggest that the amount of sort of human-generated text on the internet grows by roughly three percent per year, and the amount of compute that we're spending on pre-training is growing by roughly four or five X per year. And so what this suggests is that as time passes on, the amount of compute that we're willing to spend per data point is going to continue to increase by roughly four X year over year. And so this sort of motivates the core question in this paper, which is: how should you approach pre-training when you're constrained by data but totally unconstrained by compute? 
And it's worth maybe spending a few seconds to think for yourself if you haven't already seen this paper, like, what would you do in this situation? This is a very different algorithmic regime from sort of the compute-efficient pre-training world that we've sort of lived in for sort of most of, uh, uh, modern time. And it's also worth noting that this question is not that different from how machine learning worked before the modern LLM era. So for things like classical statistics, where maybe you really care about your rates with respect to the number of points of data you have, and you don't care about compute, or even older benchmarks like MNIST and Penn Treebank, where you're sort of implicitly data constrained because the benchmarks don't have that many data points. And so sort of the core contribution that I'll explain in this paper is that we bring the modern toolkit of scaling laws to, to sort of answer this problem. And so what we'll show is that we'll propose a few different scaling recipes, and we'll sort of chase scaling recipes that monotonically decrease your IID validation loss, so sort of in-distribution generalization. And we'll show that these scaling laws have a really clean functional form, and they follow a super clean power law. And when you're able to fit these power laws, what you can do is you can estimate the best possible loss of your recipe by looking at the asymptote of the power law. And this is in some sense a quantification of your best possible performance under infinite compute. And our goal in this paper is sort of to think more carefully about what types of algorithms allow you to lower your compute asymptote, uh, and we're sort of gonna chase these types of infinite compute ones. And so to start, I'm going to introduce this canonical setting that we reference in this paper, which is that we're going to simulate a data-constrained world by just constraining the number of pre-training tokens we have to be a very small amount. 
So we're gonna assume access to only two hundred million tokens from DCLM, which is general web data. And what we're gonna do is we're gonna pre-train larger and larger models, which is the X-axis, using different kinds of pre-training recipes. And the Y-axis here is going to be, again, our IID validation loss on DS-- DCLM. And our goal is going to be to find recipes that allow us to spend more compute and train larger models while monotonically decreasing our loss. So to start, we can consider sort of the obvious approach that you might take when you're in this setting, which is first to epoch your data, so to train on the same data points over and over again until you start overfitting, as well as scaling up your model, so making your model larger and larger. And what we can do is we can do both of these at the same time, and we can do sort of an exhaustive grid search over these parameters until we start ov-- until we start overfitting, and then we do early stopping. And this is sort of the red line, which is what we call a standard recipe. And what you'll see with the standard recipe is that even if you are willing to spend more compute, as you train more and more overparameterized models, you start to overfit more quickly, and your loss starts to increase after a certain point. And so if you see this line, sort of the natural instinct you should have is: How do we fix this? And one possible approach is to do really aggressive regularization. And so sort of the first baseline in this paper is going to be doing really aggressive regularization by cranking up your weight decay. And so what we do is we show that if you optimally tune your weight decay for each total parameter count, so we're going to optimally tune learning rate, weight decay, and epoch count for each one of these purple points, you can show that your loss follows a really clean power law as you increase the number of parameters in your model. 
And this is really aggressive regularization, so for context, we use weight decays that are something like thirty times larger than the weight decays that people do for compute optimal pre-training. And so on the legend here, you can see the, the sort of the form of this power law, and it has a few nice properties. One is that the exponent on the model parameters, N, is one, and this is actually predicted by sort of the data constraint theory. The second nice property that it has is that the scaling law has an asymptote, which is three point four three in this case. And this characterizes the performance of the best possible regularized model in this setting if you had, like, infinite compute. So you'll notice that the baseline approaches, because they overfit more quickly, they don't even have a measurable asymptote. And so once we start going down the rabbit hole of regularization and these other types of classical machine learning techniques, there's a whole basket of techniques to, to get into. And so perhaps maybe the most famous one is to do ensembling. And so what we show in this paper is that you can bring back ensembling in the modern world of pre-training language models, and they turn out to be incredibly data efficient. So what these light blue points correspond to is they correspond to three hundred million parameter models that we're ensembling with more and more members. So the fifth point will correspond to one point five total billion total parameters, which is fi- a five ensemble of three hundred million parameter models. We show that you can also fit really clean scaling laws to ensembles, so you also get a power law that has exponent one in the number of ensemble members, and it also has an asymptote. But most importantly, the asymptote of ensembling is much lower than the asymptote of the regularized recipe. So it's giving you a true data efficiency win if you had an infinite amount of compute. 
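As a sketch of what fitting a power law and reading off its asymptote looks like, here is a small Python example. It assumes the exponent-one form described above, loss = A / N + E, which makes the fit a plain linear regression in 1/N. The data points are synthetic, generated from a known curve whose asymptote is set to the 3.43 mentioned in the talk; they are not measurements from the paper, and the scale constant is made up.

```python
def fit_asymptote_power_law(param_counts, losses):
    """Least-squares fit of loss = A / N + E (a power law with exponent 1).
    Returns (A, E), where E is the estimated infinite-parameter asymptote."""
    xs = [1.0 / n for n in param_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(losses) / len(losses)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
             / sum((x - mean_x) ** 2 for x in xs))
    asymptote = mean_y - slope * mean_x
    return slope, asymptote


# Synthetic losses generated from a known curve, NOT results from the paper.
true_scale, true_asymptote = 3.0e7, 3.43
param_counts = [1.5e8, 3.0e8, 6.0e8, 1.4e9]
losses = [true_scale / n + true_asymptote for n in param_counts]

scale, asymptote = fit_asymptote_power_law(param_counts, losses)
print(scale, asymptote)
```

The intercept of the fit is the loss you would extrapolate to for an infinitely large model under that recipe, which is the quantity the different recipes are compared on.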
There's also this interesting property, which is that ensemblings, if you do a compute matched comparison, so the same number of parameters, are actually better than the regularized recipe. So if your goal is just to train the best one point five billion parameter model, it's better to train an ensemble of a bunch of small models when you're data constrained than to train one really large model. The last thing we show in this plot is that you can actually compose the benefits of regularization and ensembling. So one way to think about this is that regularization gives you this ability to continue to make the models larger and larger, while ensembling introduces this new axis for scaling compute, which is by training more and more models. And so what this gold line, which we call the joint scaling recipe, is we quantify this hypothetical performance if we were able to train an ensemble, an infinitely large ensemble of infinitely large models. And so the way in which we actually quantify this performance is we fit two scaling laws. So we'll take a double limit. What we'll first do is we'll train ensembles of a hundred fifty million parameter models, three hundred million parameter models, and so on and so forth. And then we'll look at the asymptotes of the ensembles, and then we'll take a second... We'll fit a second scaling law to the asymptotes of these ensembles. And this is essentially taking... The first limit is taking the limit over K, and the second limit is taking the limit over N. And what we find is that if you're willing to sort of go through the effort of training infinitely large m-models and infinitely many ensembles, uh, you get a huge loss improvement. And so all of these experiments are sort of in this toy data constraint setup of two hundred million tokens, and obviously, this is very different from sort of the standard regime of pre-training. So what we also do in this paper is we spend some effort on trying to confirm that our recipes scale. 
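A minimal Python sketch of why an ensemble can beat its members on loss. It assumes the members are combined by averaging their predicted next-token probabilities, which is an assumption for illustration and may not be how the paper combines them; the three distributions are made up.

```python
import math


def ensemble_probs(member_probs):
    """Average the members' next-token distributions entry by entry."""
    k = len(member_probs)
    vocab = len(member_probs[0])
    return [sum(p[i] for p in member_probs) / k for i in range(vocab)]


def nll(probs, target):
    """Negative log-likelihood of the correct token."""
    return -math.log(probs[target])


# Three hypothetical members over a 3-token vocabulary; token 0 is correct.
members = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.6, 0.1, 0.3]]
target = 0

mean_member_loss = sum(nll(p, target) for p in members) / len(members)
ensemble_loss = nll(ensemble_probs(members), target)
print(mean_member_loss, ensemble_loss)
```

Because the negative log is convex, the loss of the averaged distribution can never exceed the average of the members' losses, so adding members is a safe way to spend extra compute on a fixed dataset.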
So the first way in which we do this is that we build data scaling laws. So for data scaling laws, we repeat the exact same set of experiments from the previous slide at four different pre-training token counts, up to one point seven billion, uh, tokens. And so for each slice on the X-axis, at each seed token count, we're gonna quantify the best possible performance of each recipe if we had an infinite amount of compute. So for the red points, they overfit more quickly, so these will be actual models. While for the purple and the gold points, these will correspond to sort of a single limit or a double limit. What these data scaling laws let us do is they let us quantify the data efficiency numbers of our approaches. So one way in which we do this is if we have some new recipe that we believe should improve upon the standard recipe that we're using right now, you can take the loss of your new recipe, and you can project it onto the data scaling law, so the red line of the standard recipe. And this projection lets you measure essentially the effective number of extra tokens that your algorithmic improvement is buying you. So in this case, what we see is that this joint scaling recipe gives you roughly a 5X data efficiency win over, uh, the standard recipe. It's also worth noting that, uh, these data efficiency wins are something that we can realize with sort of finite models, not just double limits. So for example, if you're willing to train a five ensemble of one billion parameter models, this will give you roughly a 3.7X data efficiency win. The other interesting aspect about these data scaling laws is if you look at the functional form in the legend, you'll see that they all have really similar exponents, and they all have very similar asymptotes.
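The projection onto the standard recipe's data scaling law amounts to inverting that law. The sketch below assumes the standard recipe follows a power law with an asymptote in the token count D. The constants `SCALE`, `EXPONENT`, and `ASYMPTOTE` are hypothetical placeholders and are not the fitted values from the paper.

```python
# Hypothetical data scaling law for the standard recipe: L(D) = SCALE / D**EXPONENT + ASYMPTOTE
SCALE = 300.0
EXPONENT = 0.3
ASYMPTOTE = 2.9
SEED_TOKENS = 2.0e8  # the two hundred million token setting from the talk


def standard_loss(tokens):
    """Loss of the standard recipe at a given token count (hypothetical fit)."""
    return SCALE / tokens ** EXPONENT + ASYMPTOTE


def effective_tokens(new_recipe_loss):
    """Tokens the standard recipe would need to reach new_recipe_loss."""
    gap = new_recipe_loss - ASYMPTOTE
    if gap <= 0:
        # At or below the standard recipe's asymptote, no finite token count matches.
        raise ValueError("loss is at or below the standard recipe's asymptote")
    return (SCALE / gap) ** (1.0 / EXPONENT)


def data_efficiency(new_recipe_loss, tokens_used):
    """Effective-token multiplier of a new recipe trained on tokens_used tokens."""
    return effective_tokens(new_recipe_loss) / tokens_used


# Sanity check: a recipe that exactly matches the standard recipe at 5x the data
# should be reported as a 5x data efficiency win.
loss_at_5x = standard_loss(5 * SEED_TOKENS)
print(f"multiplier: {data_efficiency(loss_at_5x, SEED_TOKENS):.2f}x")
```

If two recipes share similar exponents and asymptotes, as described for the legend, this multiplier stays roughly constant as the seed token count grows.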
And so the reason why this matters is this suggests that even if you repeated these experiments at a much, much larger token scale, if you believe that these data scaling laws extrapolate, this data efficiency win is going to be constant over the actual token count that you have. So this suggests that this joint scaling recipe has a 5X data efficiency win even if you are willing to send the seed token count to like 10 trillion tokens or whatever people are doing pre-training at these days. So now I'll go over some methods to sort of make this data efficiency win perhaps slightly more practical. And so even though these recipes require a lot of training compute, we also show that you can reduce the amount of inference compute you need by using distillation. So the plot on the right here, the purple line corresponds to the same regularized recipe. The light blue points correspond to the same ensemble scaling. So we first show that what you can do is you can take an eight ensemble, which is roughly two point four billion total parameters, and you can distill it into a single dense three hundred million parameter model, which is the pink star at the bottom. And you can do this while retaining roughly 83% of the loss improvement. So this shows you that data efficiency is not something that you need a large amount of inference compute for. If you're willing to amortize the test time compute during training time, you can get an extremely data efficient model that's still very, very small. The other surprising result we show in this section is that you can do self-distillation to even improve your loss. So with self-distillation, what we're doing is we're starting with the three hundred million parameter model at the start of the light blue curve, and then we're distilling this model into a fresh three hundred million parameter model, which is the green star.
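A rough sketch of ensemble distillation on a single next-token prediction is below. It makes assumptions that are not from the talk: the ensemble is combined by averaging member probabilities (the paper may combine members differently), the student is trained with cross-entropy against those soft targets, and "retained loss improvement" is read as the student's gain over a baseline divided by the teacher's gain. All logits and loss values are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_probs(member_logits):
    """Teacher distribution: average of the members' next-token probabilities.

    Probability-space averaging is an assumption made for this sketch.
    """
    member_probs = [softmax(logits) for logits in member_logits]
    k = len(member_probs)
    vocab = len(member_probs[0])
    return [sum(p[i] for p in member_probs) / k for i in range(vocab)]


def distillation_loss(teacher_probs, student_logits):
    """Cross-entropy of the student against the teacher's soft targets."""
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


def retained_fraction(baseline_loss, teacher_loss, student_loss):
    """Share of the teacher's loss improvement that the student keeps."""
    return (baseline_loss - student_loss) / (baseline_loss - teacher_loss)


# Hypothetical three-member ensemble over a four-token vocabulary.
members = [[2.0, 0.5, 0.1, -1.0], [1.5, 1.0, 0.0, -0.5], [2.2, 0.2, 0.3, -1.2]]
teacher = ensemble_probs(members)
student_logits = [1.8, 0.6, 0.1, -0.9]
print("teacher distribution:", [round(p, 3) for p in teacher])
print("distillation loss:", round(distillation_loss(teacher, student_logits), 4))

# Hypothetical losses: single 300M model, eight ensemble, distilled student.
print("retained:", round(retained_fraction(3.75, 3.45, 3.501), 2))
```

Self-distillation would reuse the same loss with a single model's distribution as the teacher and a freshly initialized model of the same size as the student.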
And what we find, very surprisingly, is that even doing self-distillation gives you a huge loss improvement. It even beats the asymptote of the regularized recipe. This is actually pretty counterintuitive, and we have a longer sort of, uh, description of this result in the paper, but it turns out to have pretty surprising connections to, uh, ensembling. And there's actually a view, uh, from prior work of self-distillation as implicitly training a two ensemble. We also show that even though we're only chasing IID val loss in all of our experiments, pretty much all of the trends in this paper directly carry over to downstream benchmarks. And this was like a fully held out sort of test set where we only looked at the benchmarks at the very end of the paper because the advisors told us to. Um, and you can see that everything tracks. The standard recipe still overfits. Model scaling gives you improvements. Ensembling is even better, and you can still retain a lot of the benefits through distillation. And finally, we also show that you can do this for other settings beyond pre-training, so things like continued pre-training. So we consider a setup where you're trying to CPT a 3B model, and we assume access to sort of this restricted set of four billion math-related tokens, where the whole corpus of data is actually 73 billion tokens. And what we show is that if you're willing to do these data efficiency tricks like aggressive epoching and things like ensembling, you can match the performance of training on the full 73 billion tokens even using only 4 billion tokens, which is roughly a 17X data efficiency win. So to sort of wrap up this talk, maybe the main point I wanna make is that when you're constrained by data and you're unconstrained by compute in this sort of new algorithmic regime, the types of algorithmic choices you make matter a lot, and we should be willing to sort of rethink every aspect of the stack.
In this paper, we mostly do this by revisiting a lot of these classical ideas from, uh, machine learning and deep learning. Things like regularization, ensembling, distillation have existed for many, many years. And we also introduce this evaluative tool of asymptotes. And maybe the hope is that if you're willing to chase algorithms that have lower compute asymptotes, uh, these will give you like better ideas for data efficiency. But like ultimately, what we really want is for these asymptotes to help us develop new and better ideas under infinite compute that don't already exist. And so if you're interested in the details, that's the QR code for the paper, and we've also done some follow-up work on looking at how synthetic data interacts with data efficiency. So feel free to check that out as well if you're interested. Thanks.
- FCFrancois Chaubard
[audience applauding] All right. Thank you guys so much for coming. This is like a dream come true. I'm in one of my favorite places, um, one of the most important places in my life, and now I get to talk about AI here. So super, super fun. I think there's a lot of potential for this club. I think I don't have nearly, you know, one percent of all the ideas to make this club really great that are probably, um, in all of your heads. And so we wanna make sure all of you guys get in on the Slack. So please send me a note if you're not already on there, and then we can kinda make this thing whatever we want. So it's kinda fun, and I intend to. So like, please come with ideas. We wanna make this super fun. Um, obviously, you know, there's some ground rules, be respectful, all that kinda stuff, um, and definitely be involved, and that's kinda the biggest thing that we really ask. That's all I got. That's a wrap. We got some boba tea. Thank you. [audience applauding]
Episode duration: 1:07:18
Transcript of episode wE1ZgJdt4uM