Dwarkesh Podcast
Francois Chollet — Why the biggest AI models can't solve simple puzzles
EVERY SPOKEN WORD
150 min read · 30,449 words
- 0:00 – 11:53
The ARC benchmark
- FCFrancois Chollet
LLMs are very good at memorizing static programs. If you scale up the size of your database, you are not increasing the intelligence of the system one bit.
- DPDwarkesh Patel
I feel like you're using words like memorization, which we would never use for human children if, if they can just solve any ar- uh, arbitrary algebraic problem. You wouldn't say, like, they've memorized algebra. They'd say they've learned algebra.
- MKMike Knoop
So we got a million dollar prize pool, and there's a $500,000 prize for the first team that can get to the 85% benchmark. If ARC survives three months from here, we'll, we'll up the prize.
- FCFrancois Chollet
OpenAI basically set back progress towards AGI by probably, like, five to 10 years. They caused this complete closing down of frontier research publishing. And now LLMs have, uh, sucked the oxygen out of the room. Like, everyone is just doing LLMs.
- DPDwarkesh Patel
Okay. Today, I have the pleasure to speak with Francois Chollet, who is an AI researcher at Google and creator of Keras. And he's launching a prize in collaboration with Mike Knoop, the co-founder of Zapier, who we'll also be talking to in a second: a million dollar prize to solve the ARC benchmark that he created. So first question, what is the ARC benchmark, and why do you even need this prize? Why won't the biggest LLM we have in a year be able to just saturate it?
- FCFrancois Chollet
Sure. So ARC is intended as a kind of IQ test for machine intelligence. And what makes it different from, uh, most LLM benchmarks out there is that it's designed to be resistant to memorization. So if you look at the way LLMs work, they are basically this, uh, big interpolative memory. And the way you scale up their capabilities is by trying to cram as much, uh, knowledge and patterns as possible into them. And, uh, by contrast, uh, ARC does not require a lot of knowledge at all. It's designed to only require what's known as, uh, core knowledge, which is, uh, basic knowledge about things like, um, elementary physics, objectness, counting, that sort of thing. Um, the sort of knowledge that any four-year-old or five-year-old, uh, possesses, right? Um, but what's interesting is that each puzzle in ARC is novel, is something that you've probably not encountered before, even if you've memorized the entire internet. And that's what, that's what makes it... (clears throat) Sorry. That's what, uh, makes, makes ARC challenging for LLMs. And so far, LLMs have not, uh, been doing very well on it. In fact, the approaches that are working well, uh, are more towards, uh, discrete program search, program synthesis.
- DPDwarkesh Patel
Mm. So f- first of all, I'll, I'll make a comment that I'm glad that, as a skeptic of LLMs, you have put out, yourself, a benchmark that... Uh, is it accurate to say that, suppose the biggest model we have in a year is able to get 80% on this, then your view would be we are on track, uh, to AGI with LLMs? How would you think about that?
- FCFrancois Chollet
Right. Um, I'm pretty skeptical that we're going to see LLMs do 80% in a year.
- DPDwarkesh Patel
Yeah.
- FCFrancois Chollet
Uh, that said, if we do see it, you would also have to look at how this was achieved. If you just, uh, train the model on millions or billions of puzzles similar to ARC, so that you're relying on, uh, the ability to have some overlap between the tasks that you train on and the tasks that you're going to see at test time, then you're still using memorization, right? And maybe, maybe it can work. You know, hopefully, um, ARC is gonna be good enough that it's gonna be, uh, resistant to this sort of attempt at brute-forcing.
- DPDwarkesh Patel
Mm.
- FCFrancois Chollet
Um, but, you know, you never know. May- maybe, maybe it could happen. I'm not saying it's not going to happen. ARC is not a perfect benchmark. Maybe, maybe it has flaws. Maybe it could be hacked in that way.
- DPDwarkesh Patel
Mm. I'm... So I guess I'm curious about what would GPT-5 ha- ha- have to do that you're very confident that, you know, it's on the path to AGI?
- FCFrancois Chollet
What, what would make me change my mind about LLMs is basically if I start seeing, uh, a critical mass of cases where you show the model with something it has not seen before, a task that's actually novel from the perspective of its training data-
- DPDwarkesh Patel
Mm.
- FCFrancois Chollet
... something that's not in the training data, if it can actually adapt on the fly. Um, and this is true for LLMs, but really, uh, this would catch my attention with any AI technique out there. If I can see the ability to adapt to novelty on the fly, to pick up new skills efficiently, then, uh, I would be extremely interested. I wou- I would, I would think this is, uh, on the path to AGI.
- DPDwarkesh Patel
So the advantage they have is that they do get to see everything. Maybe I'll take issue with how much they are relying on that. But le- suppose that they are relying. Obviously, they're relying on that more than humans do. To the extent that they do have so much in distribution, to the extent that we have trouble distinguishing whether a sin- uh, an example is in distribution or not, well, if they have everything in distribution, then they can do everything that we can do. Maybe it's not in distribution for us. Why, uh, why is it so crucial that it has to be out of distribution for them? Uh, you know, eh, why can't we just leverage the fact that they do get to see everything?
- FCFrancois Chollet
Right. What, you're, you're asking basically what's the difference between actual intelligence, which is the ability to adapt to things you've not been prepared for, and pure memorization, like reciting, uh, what you've seen before? And it's not just some semantic difference. Uh, the big difference is that you can never, uh, pre-train on everything that you might see at test time, right? Uh, because the world changes all the time. So it's not just the fact that the space of possible tasks is infinite, and even if you're trained on millions of them, uh, you've, you've only seen 0% of the total space. It's also the fact that the world is changing every day, right? Um, this is why, uh, the human species has developed intelligence in the first place.

If there was such a thing as a distribution for the world, for the universe, for our lives, then we would not need intelligence at all. In fact, uh, many, uh, creatures, many insects, for instance, do not have intelligence. Instead, what they have is, uh, in their, in their connectome, uh, in their genes, uh, hardcoded programs, behavioral programs that map some stimuli to appropriate responses. And they can actually navigate their lives, their environments, in a way that's very evolutionarily fit, without needing to learn anything. And well, if our environment was, uh, static enough, predictable enough, uh, what would have happened is that evolution would have found the perfect behavioral program, a hardcoded, static behavioral program, and written it, uh, into our genes. We would have a hardcoded brain connectome, and that's what we would be running on. But no, that's not what happened. Instead, we have general intelligence. So we are born with extremely little knowledge about the world, but we are born with the ability to learn very efficiently and to adapt in the face of things that we, we, we've never seen before.
And that's what makes us unique, and that's what, that's what is really, really challenging to recreate in machines.
- DPDwarkesh Patel
I, I want to rabbit hole on that a little bit. But before I do that, maybe, uh, I'm gonna overlay some examples of what an ARC-like challenge look like for, uh, for the YouTube audience. But maybe for people listening on audio, can you just describe what, what, what would an... a sample ARC challenge look like?
- FCFrancois Chollet
Sure. So one ARC puzzle looks kind of like an IQ test puzzle. You've got a number of demonstration input-output pairs. So, uh, one pair is, uh, made of two grids. One grid shows you an input, and the second grid shows you, uh, what you should produce as a response to that input. And you get, uh, a couple, uh, pairs like this to demonstrate the nature of the task, to demonstrate what you're supposed to do with your inputs. And then you get, uh, a new test input, and your job is to produce the corresponding, uh, test output. You look at the demonstration pairs, and from that, you figure out what you're supposed to do, and you show that you've understood it on this new test pair. And, um, importantly, the sort of, like, knowledge basis that you need in order to approach these challenges is just core knowledge. And core knowledge is, uh, it's basically the knowledge of what makes an object, uh, basic counting, basic geometry, topology, symmetries, uh, that sort of thing. So extremely basic knowledge. LLMs for sure possess such knowledge. Any child possesses, uh, such knowledge. Um, and what's really interesting is that each puzzle is new. So it's not something that you're gonna find, uh, elsewhere on the internet, for instance. Uh, and that means that whether it's as a human or as a machine, every puzzle, uh, you have to approach it from scratch. You have to actually reason your way through it. You cannot just fetch the response from your memory.
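A toy illustration of the format Chollet describes, with hypothetical grids and a hypothetical hidden rule (horizontal mirroring), not an actual ARC task:

```python
# Toy ARC-style task: a few demonstration input/output grid pairs, plus
# one test input whose output must be inferred from the demonstrations.
# Cell values 0-9 stand for colors, as in the real benchmark.

def mirror(grid):
    """The hidden transformation for this toy task: flip left-right."""
    return [list(reversed(row)) for row in grid]

task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0]],      "output": [[0, 3, 3]]},
    ],
    "test": {"input": [[0, 5], [6, 0]]},
}

# A solver must infer the rule from the train pairs alone...
for pair in task["train"]:
    assert mirror(pair["input"]) == pair["output"]

# ...and then apply it to the unseen test input.
print(mirror(task["test"]["input"]))  # [[5, 0], [0, 6]]
```

Real ARC tasks use the same JSON-like structure, but each task hides a different, previously unseen rule.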
- DPDwarkesh Patel
So the core knowledge, one contention here is we are only now getting multimodal models who, because of the data they're trained on, are trained to do spatial reasoning. Whereas obviously not only humans, but for billions of years of evolution, we've had... our ancestors have had to learn how to understand abstract, physical, and spatial properties and recognize the patterns there. And so one view would be, in the next year, as we get models that are multimodal-native, where multimodality isn't just a sort of second-class add-on but a priority, they will understand these kinds of patterns because that's something they'd see natively. Whereas right now, what the model sees in ARC is some JSON string of 100100, and it's supposed to recognize a pattern there. And even if you showed a human such a g- like just a, a sequence of these kinds of numbers, they would have a challenge making, uh, making sense of what kind of question you're asking. So wouldn't it be the case that as soon as we get multimodal models, which we're on the path to unlock right now, they're gonna be so much better at ARC-type spatial reasoning?
- FCFrancois Chollet
That, that's an empirical question, so I guess we're gonna see the answer within a few months. But, uh, my answer to that is, you know, our grids, they're just, uh, discrete 2D grids of symbols. They're pretty, pretty small. Like, it's not like, uh, if you flatten an i- an image as a sequence of pixels, for instance, uh, then you get something that's actually very, very difficult to parse. That's not true for ARC, because the grids are very small and, uh, you only have 10 possible symbols. So these 2D grids are actually very easy to flatten, uh, as sequences, and transformers, LLMs, they're very good at parsing such sequences. In fact, uh, you can show that LLMs do fine, uh, with processing ARC-like data, uh, by simply fine-tuning, uh, an LLM on, uh, some subset, uh, of the tasks and then trying to test it on small variations of these tasks. And you see that, yeah, the, the LLM can encode solution programs just fine for tasks that it has seen before. So it does not really have a problem parsing the input or figuring out, uh, um, the program. The reason why, uh, LLMs, uh, don't do well on ARC is really just, uh, the unfamiliarity aspect. The fact that each new task is different from e- every other task. You cannot... basically, you cannot memorize the solution programs in advance. You have to synthesize a new solution program on the fly for each new task. And that's really what LLMs are struggling with.
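A minimal sketch of the flattening Chollet describes. Because each cell is one of only 10 symbols, a grid serializes to a short character sequence; the row delimiter here is an assumption, and real experiments vary in how they serialize grids:

```python
# Flatten a small ARC-style grid into a token-friendly string.
# With 10 possible symbols (0-9), each cell is a single character.

def serialize(grid):
    return "\n".join("".join(str(cell) for cell in row) for row in grid)

grid = [
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
]

print(serialize(grid))
# 001
# 010
# 100
```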
- DPDwarkesh Patel
Mm-hmm. So before I do more devil's advocate, I just wanna step back and explain why I'm especially interested in having this conversation, and obviously the million-dollar ARC Prize. I'm, uh, excited to actually play around with it myself. And hopefully, um... the Vesuvius Challenge, which was Nat Friedman's prize for, uh, decoding the scrolls that were buried by the volcanic eruption in, uh, the Herculaneum library, that was solved by a 22-year-old who was listening to the podcast, Luke Farritor. So hopefully somebody listening will find this challenge intriguing and
- 11:53 – 19:43
Why LLMs struggle with ARC
- DPDwarkesh Patel
find a solution. So I'm... and the reason I... I've had on recently a lot of people who are, uh... bullish on LLMs, and I've had discussions with them, before interviewing you, about how do we explain the fact that LLMs don't seem to be natively performing that well on ARC? And I found their explanations somewhat contrived, and I'll try out some, uh, some of the reasons on you. But it is actually an intriguing fact that some of these problems are relatively straightforward for humans to understand, and they do struggle with them if you just input them natively.
- FCFrancois Chollet
All of them are very easy for humans. Like, any- any smart human should be able to do 90%, 95% on ARC.
- DPDwarkesh Patel
A smart human.
- FCFrancois Chollet
A smart human. But even a five-year-old, so with very, very little knowledge, they could- they could definitely do over 50%.
- DPDwarkesh Patel
Hmm. So le- let's talk about that because you, y- I agree that smart humans will do very well on this test, but the im- average human will probably do, you know, mediocre.
- FCFrancois Chollet
Not- not really. So we actually tried with average humans, uh, they score about 85%.
- DPDwarkesh Patel
That was with Amazon Mechanical Turk workers, right?
- FCFrancois Chollet
Yeah, that's right.
- DPDwarkesh Patel
I'm- I- I honestly don't know the demographic profile of Amazon Mechanical Turk workers, but imagine just interacting with, uh, the- the platform that Amazon has set up to do remote work. That's not the median human across the planet, I'm guessing.
- FCFrancois Chollet
Uh-
- DPDwarkesh Patel
Well, I mean, the broader point here being that... So we see this spectrum in humans, where humans obviously have AGI. But even within humans, you see a spectrum where some people are relatively dumber, and they'll perform worse on IQ-like tests. For example, uh, Raven's Progressive Matrices. If you look at how the average person performs on that, and you look at the kind of questions that are sort of hit or miss, half of the people will get it right, half of the people will get it wrong, some of them are, like, pretty trivial. Uh, for us, we might think, like, "This is a little..." This is kind of trivial. And so humans have AGI, but from relatively small tweaks, you can go from somebody who misses these kinds of basic IQ test questions to somebody who gets them all right. Which suggests that, actually, if these models are doing natively... Um, we'll talk about some of the previous performances that people have tried with these models. But somebody, uh, Jack Cole, with a 240 million parameter model, got 35%. Doesn't that suggest that they're on the spectrum that clearly exists within humans, and they're gonna saturate it pretty soon?
- FCFrancois Chollet
Yeah, so that's, uh, that's a bunch of interesting points. Yeah. So, uh, there is indeed a, um, a branch of LLM approaches, spearheaded by Jack Cole, that are doing quite well, that are in fact, uh, state of the art. Uh, but you have to look at, uh, what's going on there. So there are two things. The first thing is that, uh, to get these numbers, you need to pre-train your LLM on millions of generated ARC tasks. And of course, if you compare that to a five-year-old child looking at ARC for the first time, uh, the child has never done an IQ test before, has never seen something like an ARC test before. The only overlap between what they know and what they have to do in the test is core knowledge: knowing about, like, counting and objects and symmetries and things like that. And still, uh, they're gonna do really well, and they're gonna do much better than the LLM trained on millions of similar tasks. And the second thing that's, uh, that's, um, something to note about, uh, the Jack Cole approach is, um, one thing that's really critical to making the model work at all is test-time fine-tuning. And that's something that's really missing, by the way, from LLM approaches, um, right now. You know, most of the time when you're using an LLM, it's just doing static inference. The model is frozen, and you're just, uh, prompting it, and then you're getting an answer. So the model is not actually learning anything on the fly. Its state is not adapting, uh, to the task at hand. And what Jack Cole is actually doing is, for every test problem, he is, on the fly, fine-tuning a version of the LLM, uh, for that task. And that's really what's unlocking performance. If you don't do that, you get, like, 1%, 2%. So basically, something completely negligible. And if you do test-time fine-tuning and you add a bunch of tricks on top, then you end up with interesting performance numbers.
So I think what he's doing is trying to address one of the key limitations of LLMs today, uh, which is the lack of active inference, is actually adding active inference to LLMs and that's working ex- extremely well actually. So that's fascinating to me.
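The loop shape of test-time fine-tuning can be sketched schematically. A tiny linear model stands in for the LLM here, and the tasks are made up; this illustrates the idea of adapting a frozen model's copy per task, not Jack Cole's actual method:

```python
import numpy as np

def fine_tune(w_base, X, y, lr=0.1, steps=2000):
    """Clone the frozen base weights and adapt the copy to one task's
    demonstration pairs with plain gradient descent on squared error."""
    w = w_base.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # MSE gradient
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
w_pretrained = rng.normal(size=3)  # stands in for frozen pre-trained weights

# Each hypothetical "task" has its own demonstration pairs; the hidden
# weights play the role of the task's unknown rule.
tasks = [
    {"X": rng.normal(size=(8, 3)), "w_true": np.array([1.0, -2.0, 0.5])},
    {"X": rng.normal(size=(8, 3)), "w_true": np.array([0.0, 3.0, -1.0])},
]

for t in tasks:
    y = t["X"] @ t["w_true"]                        # the demonstrations
    w_adapted = fine_tune(w_pretrained, t["X"], y)  # adapt on the fly
    # The adapted copy recovers the task's rule, while w_pretrained
    # itself stays frozen between tasks.
    assert np.allclose(w_adapted, t["w_true"], atol=1e-2)
print("adapted to each task at test time")
```

Static inference would apply `w_pretrained` directly to every task; the point of the loop is that adaptation happens per problem, at test time.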
- DPDwarkesh Patel
That- there's so many interesting rabbit holes there. Should I take them in sequence or deal with them all at once? Let me, let me just start. So, uh, the point you made about the fact that you need to unlock adaptive compute/test-time compute, a lot of the scale maximalists... I think this will be an interesting rabbit hole to explore with you, because a lot of the scaling maximalists have your broader perspective in the sense that they think that in addition to scaling, you need these kinds of things, like unlocking adaptive compute or doing some sort of RL, to get the system working. And their perspective is that this is a relatively straightforward thing that will be added atop the capab- the representations that a scaled-up model has greater access to.
- FCFrancois Chollet
No, it's- it's not, it's not just a technical detail. It's not a straightforward thing. It is everything. It is the important part. And the- the scale, uh, maximalist argument, what it boils down to, um... You know, th- these people, they refer to scaling laws, which is this empirical relationship that you can draw between how much compute you spend on training a model and the performance you're getting on benchmarks, right? And the key question here, of course, is, well, how do you measure performance? What is it that you're actually, uh, improving by adding more compute and more data? And well, it's- it's benchmark performance, right? And th- the thing is, the way you measure performance is not a technical detail. Uh, it's not an afterthought, because it's gonna, uh, narrow down the sort of questions that you're asking. And so, uh, accordingly, it's gonna narrow down the set of answers that you're looking for. If you look at, uh, the benchmarks we're using for LLMs, they are all memorization-based benchmarks. Like, sometimes they're literally just knowledge-based, like a school test. And even if you look at the ones that are, uh... you know, uh, explicitly about reasoning, you realize, if you look closely, that in order to solve them, it's enough to memorize, uh, a finite set of, uh, reasoning patterns, uh, and then you just reapply them. They're, they're like static programs. LLMs are very good at memorizing static programs, small static programs. And, and they've got this sort of, like, bank of, uh, solution programs. And when you give them a new puzzle, uh, they can just fetch, uh, the appropriate program, uh, apply it, and it's looking like it's reasoning, but really it's not doing any sort of on-the-fly program synthesis. All it's doing is program fetching. So you can actually solve all these benchmarks with memorization.
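The empirical relationship Chollet refers to is typically a straight line in log-log space between training compute and loss. A minimal sketch of that kind of fit, with made-up numbers rather than measurements from any real model:

```python
import numpy as np

# Hypothetical training-compute budgets (FLOPs) and benchmark losses.
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
loss    = np.array([3.2, 2.7, 2.3, 1.95, 1.65])

# "Scaling law": fit loss ~ a * compute**slope as a straight line in
# log-log space. The fitted slope is negative: more compute, lower loss.
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)
print(f"fitted slope: {slope:.3f}")
```

The fit says nothing about *what* the benchmark measures, which is exactly the objection being raised here.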
And so what, what you're scaling up here, like, if you look at the models, they are, uh, big parametric curves, uh, fitted to a data distribution. So they are basically these big interpolative, uh, databases, interpolative memories. And of course, if you scale up the size of your database and you cram into it, uh, more knowledge, more patterns and so on, uh, you are gonna be increasing its, its performance as measured by memorization benchmarks. That's, that's kind of obvious. But as you're doing it, you are not increasing the intelligence of the system one bit. You are increasing the skill of the system. You are, you are increasing its usefulness, its, uh, scope of applicability, but not its intelligence, because skill is not intelligence. And that's the fundamental confusion, um, that, that people, uh, run into, is that they're confusing skill
- 19:43 – 28:38
Skill vs intelligence
- FCFrancois Chollet
and intelligence.
- DPDwarkesh Patel
Th- th- yeah, there, there's a lot of fascinating things to talk about here. So skill, intelligence, interpolation... Um, I mean, okay, so the, the thing about them fitting some manifold, uh, that maps the input data: there's a reductionist way to talk about what happens in the human brain that says that it's just axons firing at each other, but we, but we don't care about the reductionist explanation of what's happening. We care about the sort of m- meta, uh, at the macroscopic level, what happens when these things, uh, combine. As far as the interpolation goes, so okay, let, let's look at one of the benchmarks here. There's th- there's one benchmark that does grade school math, and these are problems that, like, a smart high schooler would be able to solve. Um, it's called GSM8K. And these models get 95% on these, like, basically-
- FCFrancois Chollet
Sure.
- DPDwarkesh Patel
... they always nail it.
- FCFrancois Chollet
That's a memorization benchmark.
- DPDwarkesh Patel
Okay. Let, let's talk about what that means. So here's one question from that benchmark: 30 students are in a class, one-fifth of them are 12-year-olds, one-third are 13-year-olds, one-tenth are 11-year-olds. How many of them are not 11, 12, or 13 years old? So I agree, like, this is not rocket science, right? You can write down on paper how you go through this problem, and a high school kid, at least a smart high school kid, should be able to solve it.
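The arithmetic in the quoted problem can be checked directly:

```python
# Working through the GSM8K-style problem quoted above.
students = 30
twelve   = students // 5   # one-fifth are 12-year-olds  -> 6
thirteen = students // 3   # one-third are 13-year-olds  -> 10
eleven   = students // 10  # one-tenth are 11-year-olds  -> 3

other = students - (twelve + thirteen + eleven)
print(other)  # 11
```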
- FCFrancois Chollet
(smacks lips)
- DPDwarkesh Patel
Now, when you say memorization, it still has to reason through how to think about fractions and what is the context of the whole problem, and then combining the different calculations it's doing.
- FCFrancois Chollet
It depends how you, how you wanna define reasoning, but there, there are two definitions you can use. So one is: I have available, uh, a set of program templates. It's, it's like the structure of the puzzle, uh, which, which can also generate its solution. And I'm just gonna identify the right template, which is in my memory. Um, I'm gonna input the new values into the template, run the program, get the solution. And you could say this is reasoning, and I say, "Yeah, sure. Okay." Uh, but another definition we can use is: reasoning is the ability, when you're faced with a, with a puzzle, given that you don't already have a program in memory to solve it, to synthesize on the fly a new program based on, uh, bits and pieces of existing programs that you have. You have to do on-the-fly program synthesis. And this is actually dramatically harder than just fetching the right memorized program and reapplying it.
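The second definition can be sketched as a brute-force search over compositions of primitives. The primitives and the task here are hypothetical toys; real program-synthesis approaches to ARC search far richer spaces, but the shape is the same:

```python
# On-the-fly program synthesis: search for a composition of primitive
# operations that explains all demonstration pairs, rather than fetching
# a memorized solution program.
from itertools import product

PRIMITIVES = {
    "inc":    lambda x: x + 1,
    "double": lambda x: x * 2,
    "neg":    lambda x: -x,
}

def synthesize(pairs, max_depth=3):
    """Enumerate primitive compositions until one fits every demo pair."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            def program(x, names=names):
                for n in names:
                    x = PRIMITIVES[n](x)
                return x
            if all(program(i) == o for i, o in pairs):
                return names, program
    return None

# Demonstrations of an unseen task; the hidden rule is double-then-inc.
names, program = synthesize([(1, 3), (2, 5), (10, 21)])
print(names)       # ('double', 'inc')
print(program(7))  # 15
```

Fetching, by contrast, would be a single dictionary lookup into a bank of stored programs; the search above is what makes synthesis "dramatically harder."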
- DPDwarkesh Patel
So I think maybe we are overestimating the extent to which humans are so sample efficient, they also don't need training in this way where they have to drill in these kinds of pathways of reasoning through c- certain kind of problems. So let's take math, for example.
- FCFrancois Chollet
Yeah.
- DPDwarkesh Patel
It's not like you can just show a baby the axioms of set theory, and now they know math, right? So they, they, when they're growing up, you had to do years of teaching them pre-algebra, then you gotta do a year of teaching them, uh, doing drills and going through the same kind of problem in algebra, then geometry, pre-calculus, calculus.
- FCFrancois Chollet
Uh, absolutely. So training-
- DPDwarkesh Patel
And, but, yeah. But isn't that like the same kind of thing where you, you, you can't, you can't just see one example and now you have the program or whatever? You actually had to drill it. These models also had to drill with a bunch of free training data.
- FCFrancois Chollet
Sure. I mean, in order to do on-the-fly program synthesis, you actually need, uh, building blocks to work from. So knowledge and memory are actually tremendously important in the process. I'm not say- I'm not saying it's memory versus reasoning. In order to do effective reasoning, you need memory.
- DPDwarkesh Patel
But it sounds like it's compatible with your story that through seeing a lot of different kinds of examples, these things can learn to reason within the context of those examples. And, uh, we can also see within bigger and bigger models. So that was an example of a high school level math problem. Um, uh, let's say a model that's like smaller than GPT-3 couldn't do that at all. As these models get bigger, they can, they seem to be able to pick up bigger and bigger-
- FCFrancois Chollet
It's, it's not, it's not really a size issue, it's more like a training data issue in this case.
- DPDwarkesh Patel
Well, uh, uh, y- yeah, bigger models can pick up these kinds of circuits, which smaller models apparently, uh, uh, don't do a good job of doing this, even if you were to train them on this kind of data. Doesn't that just suggest that as you have bigger and bigger models, they can pick up bigger and bigger pathways or, uh, uh, more general ways of reasoning?
- FCFrancois Chollet
Absolutely.
- DPDwarkesh Patel
But then isn't that intelligence?
- FCFrancois Chollet
No. No, it's not. If, if you scale up your database and you keep adding to it more knowledge, uh, more program templates, then sure, it becomes more and more skillful, you can apply it to more and more tasks. But general intelligence is not task-specific skill scaled up to many skills, because there is an infinite, uh, space of possible skills. General intelligence is the ability to approach any problem, any skill, and very quickly master it using very little data, because this is what makes you able to face anything you might encounter. This is what makes, uh... this, this is the definition of generality. Like, generality is not specificity scaled up. It is, uh, the ability to apply your mind to anything at all, to arbitrary things. And this requires, fundamentally, this requires the ability to adapt, to learn on the fly, efficiently.
- DPDwarkesh Patel
So I, my claim i- is that by doing this pre-training on bigger and bigger models, you are gaining the capacity to then generalize very efficiently. Let me give you an example-
- FCFrancois Chollet
And I think in practice-
- DPDwarkesh Patel
Let me, let me give you an example. So, your own company, Google, in their paper on Gemini 1.5, they had this very interesting example where, um, they would give, in context, they would give the model the grammar book and the dictionary of a language that has less than 200 living speakers. So it's not in the pre-training data. And you just give them the, the dictionary, and it basically is able to speak this language and translate to it, including the complex and organic ways in which languages are structured. So, a human... if you showed me a dictionary from, like, English to Spanish, I'm not gonna be able to pick up how to structure sentences and how to say things in Spanish. Eh, the fact that, because of the representations that it has gained through this pre-training, it is able to now extremely efficiently learn a new language... Doesn't that show that, uh, this kind of pre-training actually does increase your abil- ability to learn new tasks?
- FCFrancois Chollet
If you're right, if you were right, LLMs would do really well on ARC puzzles, because ARC puzzles are not complex. Each one of them requires very little knowledge. Each one of them is, is very low on complexity. You don't need to think that hard about it. Uh, they're actually extremely obvious for humans. Like, even children can do them. But LLMs cannot. Uh, even LLMs that have, you know, 100,000 times more knowledge than you do, they still cannot. And the only thing, uh, that makes ARC special is that it was designed with this intent to resist memorization. This is the only thing, and this is the huge blocker, uh, for LLM performance. Right. And so, uh, you know, I think if, if you look at LLMs closely, it's pretty obvious that they're not really, like, synthesizing new programs on the fly t- to solve the tasks that, that they're faced with. They're very much reapplying things that they've, uh, they've stored in memory. For instance, um, one thing that's very striking is LLMs can solve a Caesar cipher. You know, like a-
- DPDwarkesh Patel
Mm-hmm.
- FCFrancois Chollet
... a Caesar cipher, like shifting, uh, letters to, to, to encode, uh, a message. Um, and, well, that's a fairly simple algorithm, right? Uh, and it comes up quite a bit on the internet. So they've b- basically memorized it. And what's really interesting is that they can do it, uh, for a shift value of, like, three or five, because those are very, very common numbers in examples provided on the internet. But if you, if you try to do it with an arbitrary number, like nine, it's gonna fail, because it has not encoded the generalized form of the algorithm, but only specific cases. It has memorized specific cases of the algorithm, right? And if, if it could actually synthesize on the fly the solver algorithm, uh, then the value of N would not matter at all, because it does not increase the problem complexity.
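The generalized form of the algorithm makes the point concrete: once you have it, the shift value N is just a parameter, so N = 9 is exactly as hard as N = 3:

```python
# General Caesar cipher: shift every letter by n positions, wrapping
# around the alphabet. Complexity does not depend on the value of n.

def caesar(text, n):
    shifted = []
    for ch in text:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            shifted.append(chr((ord(ch) - base + n) % 26 + base))
        else:
            shifted.append(ch)  # leave spaces, digits, punctuation as-is
    return "".join(shifted)

print(caesar("hello", 3))  # khoor  (the common case LLMs handle)
print(caesar("hello", 9))  # qnuux  (an uncommon shift, equally easy here)
assert caesar(caesar("hello", 9), -9) == "hello"  # decoding inverts it
```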
- DPDwarkesh Patel
I, I think this is true of humans as well, where, uh, ev- what was the study that-
- FCFrancois Chollet
Humans use memorization and pattern matching all the time, of course, but humans are not limited to memorization and pattern matching. They have this very unique ability to adapt to new situations on-the-fly. This is exactly what, uh, enables you to navigate, uh, uh, every-
- DPDwarkesh Patel
So-
- 28:38 – 49:11
Do we need “AGI” to automate most jobs?
- DPDwarkesh Patel
Suppose there's a programmer at Google. They g- go into the office in the morning. At what point are they doing something that 100% cannot be done by fetching some template? Like, e- even if they, suppose they were an LLM, at what point could they not do it by fetching some template from memory? Like, at w- at what point do they have to use this so-called extreme generalization capability?
- FCFrancois Chollet
Forget about Google software developers. Every human, every day of their lives is full of novel things that they've not been prepared for. You cannot navigate your life based on memorization alone. It's impossible.
- DPDwarkesh Patel
I'm sort of denying the premise that it's all... you've also agreed they're not doing, like, quote-unquote memorization. It seems like you're saying they're less capable of generalization, but I'm just curious about the kind of generalization they do. If you get into the office and you can't do this kind of generalization, you're gonna fail at your job. You're a programmer: what is the first point where you try to do that generalization, and you would lose your job because you can't do the extreme generalization?
- FCFrancois Chollet
I don't have any specific examples, but literally, take this situation, for instance. You've never been in this room. Maybe you've been in this city a few times, I don't know. But there's a fair amount of novelty. You've never been interviewing me before. There's a fair amount of novelty in every hour of every day of your life. And it's, in fact, by and large, more novelty than any LLM could handle. Like, if you just put an LLM in a robot, it could not be doing all the things that you've been doing today. Right. Or take, I don't know, self-driving cars, for instance. You take a self-driving car operating in the Bay Area. Do you think you could just drop it in New York City, or drop it in London, where people drive on the left? No, it's gonna fail. So not only can you not make it generalize to a change of driving rules, you cannot even make it generalize to a new city. It needs to be trained on each specific environment.
- DPDwarkesh Patel
Yeah. I mean, I, I agree that self-driving cars aren't AGI. (laughs) Um, but it -
- FCFrancois Chollet
But it's the same type of model. They are transformers as well.
- DPDwarkesh Patel
I mean, we... w- w- ... I don't know. It's... it's also-
- FCFrancois Chollet
It's the same architecture.
- DPDwarkesh Patel
... has brains with neurons in them, but they're less intelligent because they're smaller.
- FCFrancois Chollet
Based on the same architecture.
- DPDwarkesh Patel
Uh, we can get into that. But I still don't understand... like, a concrete thing of... we also need training. That's why education exists, that's why we have to spend the first 18 years of our lives doing drills.
- FCFrancois Chollet
We have a memory, but we are not a memory. We are not limited to just a memory.
- DPDwarkesh Patel
But I'm denying the premise that that's necessarily the only thing these models are doing. And I'm still not sure what the task is that a remote worker would have to do... like, suppose you just started out remote work with an LLM, and they're a programmer. What is the first point at which you realize this is not a human, this is an LLM?
- FCFrancois Chollet
What if I just send them an ARC puzzle and see how they do? (laughs)
- DPDwarkesh Patel
No, no. Like part of their job, you know.
- FCFrancois Chollet
But you, you have to deal with novelty, uh, all the time.
- DPDwarkesh Patel
Okay. So is there a world in which all the programmers are replaced, and we're still saying, "Ah, but they're only doing memorization-laden programming tasks," while they're producing a trillion dollars' worth of output in the form of code?
- FCFrancois Chollet
So software development is actually a pretty good example of a job where you're dealing with novelty all the time. Or if you're not, well, I'm not sure what you're doing. I personally use generative AI very little in my software development job. And before LLMs were a thing, I was also using Stack Overflow very little. You know, some people maybe are just copy-pasting stuff from Stack Overflow, or nowadays, copy-pasting stuff from an LLM. Personally, I try to focus on problem-solving. The syntax is just a technical detail.
- DPDwarkesh Patel
Mm-hmm.
- FCFrancois Chollet
What's really important is the problem-solving. The essence of programming is engineering mental models, mental representations of the problem you're trying to solve.
- DPDwarkesh Patel
But, you know, people can interact with these systems themselves. You can go to ChatGPT and say, "Here's a specification of the kind of program I want," and they'll build it for you.
- FCFrancois Chollet
Yeah, as long as there are many examples of this program on like GitHub and Stack Overflow and so on, sure, they will fetch the program for you from their memory.
- DPDwarkesh Patel
But you can change arbitrary details.
- FCFrancois Chollet
No, it doesn't work.
- DPDwarkesh Patel
You can say, "I need it to work on this different kind of server." It, uh... "I, I need it to-"
- FCFrancois Chollet
If, if, if that were true, there would be no software engineers today.
- DPDwarkesh Patel
I agree we're not at full AGI yet, in the sense that these models have, let's say, less than a trillion parameters, and a human brain has somewhere on the order of 10 to 30 trillion synapses. If you're just doing some naive math, you're at least 10x under-parameterized. So I agree we're not there yet. But I'm sort of confused about why we're not on the spectrum, where, yes, I agree that there are many kinds of generalization they can do, but it seems like they're on this kind of smooth spectrum that we see even within humans, where some humans would have a hard time doing an ARC-type test. We see that based on their performance on Raven's Progressive Matrices-type IQ tests.
- FCFrancois Chollet
I'm not a fan of IQ tests because, for the most part, you can train on IQ tests and get better at them. So they're very much memorization-based. And that's actually the main pitfall that ARC tries not to fall into.
- DPDwarkesh Patel
I'm still not... So if all remote jobs are automated in the next five years, let's say, at least the ones that don't require you to be, like, a human presence... it's not like a salesperson, where you want the human to be talking, but, like, programming or whatever. In that world, would you say that's not possible, because a lot of what a programmer needs to do definitely requires things that would not be in any pretraining corpus?
- 49:11 – 1:01:23
Future of AI progress: deep learning + program synthesis
- DPDwarkesh Patel
level is the fluid intelligence?
- FCFrancois Chollet
It's intrinsically limited, because the substrate of your model is a big parametric curve, and all you can do with that is local generalization. If you want to go beyond that, towards broader or even extreme generalization, you have to move to a different type of model, and my paradigm of choice is discrete program search, program synthesis. If you want to understand that, you can compare it, contrast it, with deep learning. In deep learning, your model is a differentiable parametric curve. In program synthesis, your model is a discrete graph of operators. You've got a set of logical operators, like a domain-specific language; you're picking instances of it and structuring them into a graph, and that's a program. It's actually very similar to a program you might write in Python or C++ and so on. And in deep learning, your learning engine, because we are doing machine learning here, we're trying to automatically learn these models, your learning engine is gradient descent, right? And gradient descent is very compute-efficient, because you have this very strong, informative feedback signal about where the solution is, so you can get to the solution very quickly. But it is very data-inefficient, meaning that in order to make it work, you need a dense sampling of the operating space. You need a dense sampling of the data distribution, and then you're limited to only generalizing within that data distribution. And the reason you have this limitation is that your model is a curve. Meanwhile, if you look at discrete program search, the learning engine is combinatorial search. You're just trying a bunch of programs until you find one that actually meets your spec. This process is extremely data-efficient.
You can learn a generalizable program from just one example, two examples, which is why it works so well on ARC, by the way. But the big limitation is that it's extremely compute-inefficient, because you're running into combinatorial explosion, of course. And so you can sort of see here how deep learning and discrete program search have very complementary strengths and limitations. Every limitation of deep learning has a corresponding strength in program synthesis, and inversely. And I think the path forward is gonna be to merge the two.
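The "try a bunch of programs until one meets your spec" idea can be sketched in a few lines. This is a toy illustration of mine, not Chollet's system: the grid operators and their names are invented, and the search is pure brute force over compositions of them, fit to a single input-output example.

```python
from itertools import product

# A toy DSL of grid operators (hypothetical names, for illustration only).
DSL = {
    "rot90":  lambda g: [list(r) for r in zip(*g[::-1])],  # rotate 90° clockwise
    "flip_h": lambda g: [r[::-1] for r in g],              # mirror left-right
    "flip_v": lambda g: g[::-1],                           # mirror top-bottom
}

def search(example_in, example_out, max_depth=3):
    """Brute-force program search: try every composition of DSL operators
    up to max_depth until one maps the training input to the training output."""
    for depth in range(1, max_depth + 1):
        for ops in product(DSL, repeat=depth):  # combinatorial explosion lives here
            g = example_in
            for name in ops:
                g = DSL[name](g)
            if g == example_out:
                return ops  # a program that fits the one example
    return None

# A single example is enough to pin down a program in this tiny space:
prog = search([[1, 2], [3, 4]], [[3, 1], [4, 2]])
print(prog)  # -> ('rot90',)
```

The found program then generalizes to grids it has never seen, which is the data-efficiency Chollet describes; the cost is that the loop over `product(DSL, repeat=depth)` grows exponentially with depth.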
- DPDwarkesh Patel
Mm.
- FCFrancois Chollet
To basically start doing... So, another way you can think about it is this: these parametric curves trained with gradient descent are a great fit for everything that's System 1-type thinking, like pattern recognition, intuition, memorization, and so on. And discrete program search is a great fit for System 2-type thinking: for instance, planning, reasoning, quickly figuring out a generalizable model that matches just one or two examples, like for an ARC puzzle. And I think humans are never doing pure System 1 or pure System 2; they're always mixing and matching both. Right now we have all the tools for System 1 and almost nothing for System 2. The way forward is to create a hybrid system, and I think the form it's gonna take is mostly System 2: the outer structure is gonna be a discrete program search system, but you're gonna fix the fundamental limitation of discrete program search, which is combinatorial explosion, with deep learning. You're gonna leverage deep learning to provide intuition in program space, to guide the program search. And I think that's very similar to what you see when you're playing chess, for instance, or when you're trying to prove a theorem: it's mostly a reasoning thing, but you start out with some intuition about the shape of the solution. And that's very much something you can get via deep learning models. Deep learning models are very much intuition machines, pattern-matching machines. So you start from this shape of the solution, and then you do actual explicit discrete program search, but you're not gonna do it via brute force. You're not gonna try things randomly.
You're actually gonna ask another deep learning model for suggestions, like, "Here's the most likely next step."
- DPDwarkesh Patel
Yeah.
- FCFrancois Chollet
"Here's where in the graph you should be going." And you can also use yet another deep learning model for feedback: "Here's what I have so far. Is it looking good? Should I backtrack and try something new?" So I think discrete program search is gonna be the key, but you want to make it dramatically better, orders of magnitude more efficient, by leveraging deep learning. And by the way, another thing you can use deep learning for is, of course, things like common-sense knowledge, and knowledge in general. I think you're gonna end up with this sort of system where you have this on-the-fly synthesis engine that can adapt to new situations. The way it adapts is that it's gonna fetch, from a bank of patterns, modules that could themselves be curves, differentiable modules, and others that could be algorithmic in nature. It's gonna assemble them via this intuition-guided process, and for every new situation you might be faced with, it's gonna give you a generalizable model that was synthesized using very, very little data.
- DPDwarkesh Patel
Mm.
- FCFrancois Chollet
... and I think that something like this would solve ARC.
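The "intuition guides the search" loop Chollet describes can be sketched as best-first search, where a scoring function decides which partial program to expand next. Everything below is my own toy illustration: the operator set is invented, and `guide_score` is a cheap hand-written heuristic standing in for the learned model that would supply the intuition.

```python
import heapq

# A tiny operator space over integers (invented for illustration).
OPS = {
    "inc":    lambda x: x + 1,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
}

def guide_score(value, target):
    """Stand-in for a learned guidance model: rank partial programs by how
    promising their current output looks (lower is better)."""
    return abs(target - value)

def guided_search(start, target, max_nodes=10_000):
    """Best-first search over programs: always expand the most promising
    partial program first, instead of enumerating blindly."""
    frontier = [(guide_score(start, target), start, ())]
    seen = {start}
    while frontier and max_nodes > 0:
        _, value, prog = heapq.heappop(frontier)
        if value == target:
            return prog  # sequence of operator names reaching the target
        max_nodes -= 1
        for name, op in OPS.items():
            nxt = op(value)
            if nxt not in seen and nxt <= target * 2:  # crude pruning
                seen.add(nxt)
                heapq.heappush(
                    frontier,
                    (guide_score(nxt, target), nxt, prog + (name,)),
                )
    return None

prog = guided_search(3, 100)
```

In a real system the heuristic would be a neural network scoring candidate next steps (and another giving backtracking feedback, as Chollet says); the point of the sketch is only the control flow, where guidance replaces blind enumeration.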
- DPDwarkesh Patel
That's actually a really interesting prompt, because I think an interesting crux here is this: when I talk to my friends who are extremely optimistic about LLMs and expect AGI within the next couple of years, they also, in some sense, agree that scaling is not all you need, but they think the rest of the progress is undergirded and enabled by scaling. You still need to add the System 2, the test-time compute, atop these models. And their perspective is that it's relatively straightforward to do that, because you have this library of representations that you built up from pretraining. But right now it's almost like it's just skimming through textbooks. You need some more deliberate way for it to engage with the material it learns. In-context learning is extremely sample-efficient, but to actually distill that into the weights, you need the model to, like, talk through the things it sees and then add that back to the weights. As far as the System 2 goes, they talk about adding some kind of RL setup so that it's encouraged to proceed down the reasoning traces that end up being correct. And they think this is relatively straightforward stuff that will be added within the next couple of years.
- FCFrancois Chollet
That's an empirical question.
- DPDwarkesh Patel
Yeah.
- FCFrancois Chollet
So I think we'll see.
- DPDwarkesh Patel
Uh, your intuition, I assume, is not that. I'm curious why.
- FCFrancois Chollet
My intuition is, in fact, that this whole System 2 architecture is the hard part, the very hard and non-obvious part. Scaling up the interpolative memory is the easy part. It's literally just a big curve; all you need is more data. It's an interpolative representation of a dataset. That's the easy part. The hard part is the architecture of intelligence. Memory and intelligence are separate components. We have the memory; we don't have the intelligence yet. And I agree with you that having the memory is actually very useful, and if you just had the intelligence but it was not hooked up to an extensive memory, it would not be that useful, because it would not have enough material to work from.
- DPDwarkesh Patel
Yeah. The alternative hypothesis here, which former guest Trenton Bricken advanced, is that intelligence is just hierarchically-associated memory, where higher-level patterns... When Sherlock Holmes goes into a crime scene, he's extremely sample-efficient: he can just look at a few clues and figure out who the murderer was. And the way he's able to do that is that he has learned higher-level associations. It's memory in some fundamental sense, but... So here's one way to ask the question. In the brain, supposedly we do program synthesis, but it's just synapses connected to each other, so physically it's gotta be that you just query the right circuit, right?
- FCFrancois Chollet
You are, yeah, yeah, yeah.
- DPDwarkesh Patel
So...
- FCFrancois Chollet
You know, it's, it's a matter of degree.
- DPDwarkesh Patel
But if you can learn it... if training in the environment that human ancestors were trained in means you learn those circuits, then training on the same kinds of outputs that humans produce, which require these kinds of circuits to replicate, wouldn't that train the same kind of thing, whatever humans have?
- FCFrancois Chollet
You know, it's a matter of degree. If you have a system that has a memory and is only capable of doing local generalization from it, it's not gonna be very adaptable. To be really general, you need the memory plus the ability to search to quite some depth, to achieve broader or even extreme generalization. You know, one of my favorite psychologists, Jean Piaget, was the founder of developmental psychology. He had a very good quote about intelligence. He said, "Intelligence is what you use when you don't know what to do." And as a human living your life, in most situations you already know what to do, because you've been in this situation before. You already have the answer, right? You're only gonna need to use intelligence when you're faced with novelty, with something you didn't expect, something you weren't prepared for, either by your own life experience or by your evolutionary history. Like, this day that you're living right now is different in some important ways from every day you've lived before, but it's also different from any day ever lived by any of your ancestors, and still you're capable of being functional, right?
- DPDwarkesh Patel
Right, I mean, uh...
- FCFrancois Chollet
How is it possible?
- DPDwarkesh Patel
I'm not denying that generalization is extremely important, and is the basis for intelligence. That's not the crux. The crux is how much of that is happening in the models. But okay, let me ask a separate question, since we might keep going in this circle. There are differences in intelligence between humans. Maybe the intelligence tests, for the reasons you mentioned, are not measuring it well, but clearly there are differences in intelligence between different humans.
- FCFrancois Chollet
Sure.
- DPDwarkesh Patel
What is your explanation for what's going on there? Because I think that's sort of compatible with my story, that there's a spectrum of generality, and that these models are climbing up to a human level, and even some humans haven't climbed up to the Einstein level, or the, uh, the Francois level... (laughs)
- FCFrancois Chollet
So that's a great question. You know, there is extensive evidence that differences in intelligence are mostly genetic in nature, right? Meaning that if you take someone who is not very intelligent, there is no amount of training, of, like, training data-
- DPDwarkesh Patel
Yeah.
- FCFrancois Chollet
... you can expose that person to that would make them become Einstein. And this kind of points to the fact that you really need a better architecture, you need a better algorithm; more training data is not, in fact, what you need.
- DPDwarkesh Patel
I think I agree with that. The way I might phrase it is that the people who are smarter have, in ML language, better initializations. Their neural wiring, if you just look at it, is more efficient. They have maybe a greater density of firing. And so some part of this story is scaling: there is some correlation between brain size and intelligence. And we also see, within the context of the quote-unquote "scaling" that people talk about with LLMs, architectural improvements, where a model like Gemini 1.5 Flash performs as well as GPT-4 did when GPT-4 was released a year ago, but is 57 times cheaper on output. So part of the scaling story is that we're in extremely low-hanging-fruit territory when it comes to those architectural improvements.
- 1:01:23 – 1:09:20
How Mike Knoop got nerd-sniped by ARC
- DPDwarkesh Patel
Okay, we're back now with the co-founder of Zapier, Mike Knoop. We had to restart a few times there. (laughs) And you're funding this prize and running it with Francois. So tell me how this came together. What prompted you guys to launch this prize?
- MKMike Knoop
Yeah. I guess I've been sort of AI-curious for 13 years. I co-founded Zapier and have been running it for the last 13 years. I think I first got introduced to your work during COVID. I kind of went down the rabbit hole. (laughs) I had a lot of free time, and it was right after you'd published your On the Measure of Intelligence paper, where you introduced this concept of AGI as efficiency of skill acquisition being, like, the right definition, and the ARC puzzles. But I don't think the first Kaggle contest was done yet; I think it was still running. So it was interesting, but I just parked the idea. I had bigger fish to fry at Zapier; we were in the middle of this big turnaround, trying to get to our second product. And then it was January 2022, when the Chain of Thought paper came out, that really awakened me to the progress. I'd even given a whole presentation to Zapier on the GPT-3 paper. So I sort of felt like I had priced in everything that LLMs could do, and that paper was really shocking to me: oh, there are these latent capabilities that LLMs have that I didn't expect. And so I actually gave up my exec team role at Zapier. I was running half the company at that point. I went back to being an individual contributor, to go do AI research alongside Bryan, my co-founder. And ultimately that led me back towards ARC. I was looking into it again, and I had expected to see the saturation effect that MMLU has, that GSM-8K has. And when I looked at the scores and the progress over the last four years, I was, again, really shocked to see that we've actually made very little objective progress towards it. And it felt like a really, really important eval.
And as I spent the last year asking people about it, quizzing people in my network and community, very few people even knew it existed. And that felt like: okay, if it's right that this is a really, like, globally, singularly unique AGI eval, and it's different from every other eval that exists, which more narrowly measure AI skill, then more people should know about this thing. I had my own ideas on how to beat ARC as well, so I was working nights and weekends on that, and I flew up to meet Francois earlier this year to quiz him, show him my ideas, and ultimately
- FCFrancois Chollet
Yeah.
- MKMike Knoop
... I was like, "Well, why don't you think more people know about ARC?" Actually, I think you should answer that. It's a really interesting question. Why don't you think more people know about ARC?
- FCFrancois Chollet
Sure. You know, I think the benchmarks that gain traction in the research community are benchmarks that are already fairly tractable. The dynamic you see is that some research group makes an initial breakthrough, and then this catches the attention of everyone else, so you get follow-up papers with people trying to beat the first team, and so on. And for ARC, this has not really happened, because ARC is actually very hard for existing AI techniques. ARC kind of requires you to try new ideas. And that's very much the point, by the way. The point is not that you should just be able to apply existing technology and solve ARC. The point is that existing technology has reached a plateau, and if you want to go beyond that, if you want to start being able to tackle problems that you haven't memorized, that you haven't seen before, you need to try new ideas. And ARC is not just meant to be this sort of measure of how close we are to AGI. It's also meant to be a source of inspiration. I want researchers to look at these puzzles and be like, "Hey, it's really strange that these puzzles are so simple, and most humans can just do them very quickly. Why is it so hard for existing AI systems? Why is it so hard for LLMs?" And so on. And this is true for LLMs, but ARC was actually released before LLMs were really a thing. The only thing that made it special at the time was that it was designed to be resistant to memorization. And the fact that it has survived LLMs, and gen AI in general, so well kind of shows that, yes, it actually is resistant to memorization.
- MKMike Knoop
Yeah. This is what nerd-sniped me, because I went and took a bunch of the puzzles myself. I showed them to all my friends and family too, and they're all like, "Oh yeah, this is, like, super easy. Are you sure AI can't solve this?" That's the reaction, and it was the same one for me as well. And the more you dig in, you're like: okay, yep, there's not just empirical evidence over the last four years that it's unbeaten, there are theoretical concepts behind why. And I completely agree at this point that new ideas are basically needed to beat ARC. And there are a lot of current trends in the world that are actually working against that happening; I think we're actually less likely to generate new ideas right now. You know, one of those trends is the closing-up of frontier research, right? The GPT-4 paper from OpenAI had no technical details shared. The Gemini paper had no technical details shared on, like, the long-context part of that work. And yet that open innovation, that open progress and sharing, is what got us to transformers in the first place. It's what got us to LLMs in the first place. So it's a little bit disappointing, actually, that so much frontier work has gone closed. It's really making a bet that these individual labs are gonna have the breakthrough, and not the ecosystem. And I think the internet, open source, has shown that that's, like, the most powerful innovation ecosystem that's ever existed, probably in the entire world.
- FCFrancois Chollet
I think it's actually really sad that frontier research is no longer being published. If you look back four years ago, everything was just openly shared. All the state-of-the-art results were published. This is no longer the case. OpenAI single-handedly changed the game, and I think OpenAI basically set back progress towards AGI by quite a few years, probably like five to 10 years, for two reasons. One is that they caused this complete closing-down of frontier research publishing. But also, they triggered this initial burst of hype around LLMs, and now LLMs have sucked the oxygen out of the room. Everyone is just doing LLMs. And I see LLMs as more of an off-ramp on the path to AGI, actually. All these new resources are actually going to LLMs instead of everything else they could be going to. And, you know, if you look further into the past, to, like, 2015 or 2016, there were a thousand times fewer people doing AI back then, and yet I feel like the rate of progress was higher, because people were exploring more directions. The world felt more open-ended. You could just have a cool idea over lunch, go try it, and get some interesting results. There was this energy. And now everyone is very much doing some variation of the same thing. The big labs also tried their hand at ARC, but because they got bad results, they didn't publish anything. You know, people only publish positive results.
- DPDwarkesh Patel
I wonder how much effort people have put into trying to prompt or scaffold, do some sort of Devin-type approach, to get good solutions on ARC out of the frontier models, and the frontier models of today, not just a year ago, because a lot of post-training has gone into making them better: so Claude 3 Opus or GPT-4o. I hope one of the things this episode does is get people to try out this open competition, where they have to put in an open-source model to compete, but also to figure out whether maybe the capability is latent in Claude Opus, and just see if you can show that.
- 1:09:20 – 1:11:16
Million $ ARC Prize
- DPDwarkesh Patel
I, I think that would be super interesting. So let's talk about the prize. How much do you win if you solve it? Uh, you know, get whatever percent on ARC-
- FCFrancois Chollet
Yeah.
- DPDwarkesh Patel
How much do you get if you get the best submission but don't crack it?
- MKMike Knoop
So we've got a million... actually, a little over a million dollars in the prize pool. We're running the contest on an annual basis. We're starting it today, and it runs through the middle of November. The goal is to get 85%; that's the lower bound of the human average that you guys talked about earlier. And there's a $500,000 prize for the first team that can get to the 85% benchmark. We don't actually expect that to happen this year. One of the early statisticians at Zapier gave me a line that has always stuck with me: the longer it takes, the longer it takes. So my prior is that ARC is gonna take years to solve. And so we're also gonna break it down and do a progress prize this year. There's a $100,000 progress prize, which we will pay out to the top scorers. $50,000 is gonna go to the top objective scorers this year on the Kaggle leaderboard, which is where we're hosting it. And then we're gonna have a $50,000 pot set aside for a paper award, for the best paper that conceptually explains the scores the team was able to achieve. And one of the interesting things we're also gonna be doing is requiring that, in order to win the prize money, you put your solution or your paper out into the public domain. The reason for this is that typically with contests you see a lot of closed-up sharing. People are kind of private, secretive; they want to hold their alpha to themselves (laughs) during the contest period. And because we expect this is gonna take multiple years, we want open sharing early in the game here.
So the plan is, at the end of November, we'll award the $100,000 of progress prize money to the top scorers, and then use the downtime between December, January, and February to share out all the knowledge from the top scorers and the approaches folks were taking, in order to re-baseline the community up to whatever the state of the art is. Then we'll run the contest again next year, and keep doing that on a yearly basis until we get to 85%.
- 1:11:16 – 1:18:51
Resisting benchmark saturation
- DPDwarkesh Patel
I'll give people some context on why I think this prize is very interesting. I was having conversations with friends who are very much believers in models as they exist today, and first of all, it was intriguing to me that they didn't know about ARC. These are experienced ML researchers. This happened a couple of nights ago: we went to dinner and I showed them an example problem, and they said, "Of course an LLM would be able to solve something like this." And then we took a screenshot of it, put it into the ChatGPT app, and it didn't get the pattern. So I think that is a notable fact. I was sort of playing devil's advocate against you on these kinds of questions, but this is a very intriguing fact, and I think this prize is extremely interesting because we're gonna learn something fascinating one way or another.
- FCFrancois Chollet
Yeah.
- DPDwarkesh Patel
So with regards to the 85%, separate from this prize, I'd be very curious if somebody could replicate that result, because in psychology and other fields that this result seems analogous to, when you run a test on some small sample of people, the results are often hard to replicate. So I'd be very curious, if you tried to replicate this, what does an average human actually score on ARC?
As for the difficulty and how long it will take to crack this benchmark: it's very interesting, because with the other benchmarks that are now fully saturated, like MMLU and MATH, the people who made them, Dan Hendrycks and Collin Burns, were, I think, grad students or college students at the time. And the goal when they made them, just a couple of years ago, was that they would be a test of AGI, and of course they got totally saturated. I know you argue that these are tests of memorization, but the pattern we've seen... in fact, Epoch AI has a very interesting graph that I'll overlay for the YouTube version here, where you see this almost-exponential curve: a model gets 5%, 10%, 30%, 40% as you increase the compute across models, and then it just shoots up. And in the GPT-4 technical report, they had this interesting graph of the HumanEval problem set, which was 22 coding problems. They had to plot it as a mean log pass rate curve, basically because early in training, or even in smaller models, the model can have the right idea of how to solve a problem, but it takes a lot of reliability to make sure it stays on track to solve the whole problem. So you really want to upweight the signal where it gets the problem right at least some of the time, maybe one in a hundred or one in a thousand times. And so models go from, like, one in 1,000, to one in 100, to one in 10, and then they just totally saturate it.
I guess the question this is all leading up to is: why won't the same thing happen with ARC? People had to try really hard with bigger models, and now they've figured out these techniques, like the ones Jack Cole has figured out with only a 240-million-parameter language model that can get 35%. Shouldn't we see the same pattern we saw across all these other benchmarks, where you eke out progress, and then once you get the general idea you just go all the way to 100?
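The "mean log pass rate" idea mentioned above can be made concrete with a short sketch. The setup below is illustrative (the function names and the n-samples-per-problem framing are mine, not from the episode): with n samples drawn per problem of which c pass, pass@k is the chance that at least one of k samples solves it, and averaging log pass rates upweights problems solved only rarely.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: the probability that at least one
    of k samples drawn from n (of which c are correct) is correct.
    Computed as 1 - C(n-c, k) / C(n, k), as a numerically stable product."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some draw must pass
    prob_all_fail = 1.0
    for i in range(k):
        prob_all_fail *= (n - c - i) / (n - i)
    return 1.0 - prob_all_fail

def mean_log_pass_rate(pass_rates, eps=1e-6):
    """Average of log pass rates across problems. A problem solved 1 time
    in 1,000 contributes meaningful signal instead of rounding to zero."""
    return sum(math.log(max(p, eps)) for p in pass_rates) / len(pass_rates)
```

For example, a problem solved 2 times out of 4 samples gives `pass_at_k(4, 2, 1) == 0.5`, and the log-average lets rare successes move the aggregate curve long before raw accuracy does.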
- FCFrancois Chollet
That's an empirical question, so we'll see in practice what happens. But what Jack Cole is doing is actually very unique. It's not just pre-training an LLM and then prompting it. He's actually trying to do active inference. And then-
- MKMike Knoop
He's doing it at test time, right?
- FCFrancois Chollet
Yeah.
- MKMike Knoop
He's doing, like, test-time fine-tuning.
- FCFrancois Chollet
Like test-time fine-tuning.
- MKMike Knoop
Yeah.
- FCFrancois Chollet
And this is actually trying to lift one of the key limitations of LLMs, which is that at inference time they cannot learn anything new. They cannot adapt on the fly to what they are seeing. He's actually trying to learn at test time. So what he's doing is effectively a form of program synthesis, because the LLM contains a lot of useful building blocks, like programming building blocks, and by fine-tuning it on the task at test time, you are trying to assemble these building blocks into the right pattern, the pattern that matches the task. That is exactly what program synthesis is about.
And the way we'd contrast this approach with discrete program search: in discrete program search, you're trying to assemble a program from a set of primitives, and you have very few primitives. People working on discrete program search for ARC, for instance, tend to work with DSLs that have maybe 100 to 200 primitive programs. So it's a very small DSL, but they're trying to combine these primitives into very complex programs, so there's a very deep search.
On the other hand, if you look at what Jack Cole is doing with LLMs, he's got this sort of vector-program database, a DSL of millions of building blocks in the LLM, mined by pre-training the LLM not just on a ton of programming problems but also on millions of generated ARC-like tasks. So you have an extraordinarily large DSL, and then the fine-tuning is a very, very shallow recombination of these primitives. Discrete program search is very deep recombination over a very small set of primitive programs, and the LLM approach is the same thing but at the complete opposite end of that spectrum: you scale up the memorization by a massive factor, and you do very, very shallow search. But they are the same thing-
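The "small DSL, deep search" side of the spectrum Chollet describes can be sketched in a few lines. This is a toy illustration, not any competitor's actual solver: a handful of made-up grid primitives stand in for a real ARC DSL of 100-200, and a brute-force search over compositions stands in for smarter discrete program search.

```python
from itertools import product

# Toy DSL of grid-to-grid primitives (hypothetical; real ARC DSLs are
# far larger and richer).
def identity(g):  return g
def flip_h(g):    return [row[::-1] for row in g]   # mirror left-right
def flip_v(g):    return g[::-1]                     # mirror top-bottom
def transpose(g): return [list(r) for r in zip(*g)]

DSL = [identity, flip_h, flip_v, transpose]

def search(examples, max_depth=3):
    """Brute-force discrete program search: try every composition of DSL
    primitives up to max_depth, return the first one that maps every
    example input to its output."""
    for depth in range(1, max_depth + 1):
        for prog in product(DSL, repeat=depth):
            def run(g, prog=prog):
                for f in prog:  # apply primitives left to right
                    g = f(g)
                return g
            if all(run(inp) == out for inp, out in examples):
                return [f.__name__ for f in prog]
    return None  # no program of this depth fits the examples
```

With one demonstration pair `([[1, 2], [3, 4]], [[2, 1], [4, 3]])`, the search finds `["flip_h"]`. The cost is exponential in depth (|DSL|^depth candidates), which is exactly why real systems need a small primitive set and clever pruning, and why the LLM approach flips the trade-off: a huge implicit primitive set with very shallow recombination.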
Episode duration: 1:34:39
Transcript of episode UakqL6Pj9xo