Y Combinator

Francois Chollet: Why ARC-AGI Shows Scaling Hits a Wall

ARC-AGI benchmarks expose where LLMs stop at pattern recognition; Ndea pursues program synthesis as a more efficient alternative to gradient descent.

François Chollet (guest) · Garry Tan (host) · Diana Hu (host)
Mar 27, 2026 · 57m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00–0:31

    AGI by 2030?

    1. FC

      I think we're probably looking at AGI twenty thirty. Around the time, uh, that we're gonna be releasing like maybe ARC six or ARC seven. You're not gonna stop, uh, AI progress. I think, I think it's too late for that. And so the next question is, okay, like AI progress is here. Uh, it's actually gonna keep accelerating. How do you make use of it? How do you leverage? How do you ride the wave? That's the question to ask.

    2. GT

      [on-hold music]

  2. 0:31–1:08

    Introducing Ndea: A New Path Beyond Deep Learning

    1. GT

      Today, we're lucky to be joined by François Chollet, founder of the ARC Prize, a global competition to solve the ARC-AGI benchmark. His latest project is Ndea, a lab exploring a new paradigm in frontier AI research. François is one of the best people in the world to help us understand the current AI moment and where all of this is going. François, thank you so much for joining us today, and congrats on the launch of ARC-AGI V3.

    2. FC

      Thanks so much for having me. I'm super excited to be here. Super exciting time to talk about AI.

    3. DH

      So François, tell us a little bit about Ndea. So what exactly is it and what are you guys trying to achieve?

  3. 1:08–1:30

    A New ML Paradigm

    1. FC

      Right. So Ndea is this new AGI research lab, and we are trying some very different ideas. And so our goal is basically to build this new branch of machine learning that will be much closer to optimal, unlike, unlike deep learning.

    2. GT

      All of us right now are sort of taken by what's going on with code. Uh, I have sort of this viral moment right

  4. 1:30–3:04

    Replacing neural nets with compact symbolic programs

    1. GT

      now where I got to forty thousand stars this morning-

    2. FC

      Oh, wow

    3. GT

      ... on, uh, GStack. So it's like, oh, this is an open source project that now is one of the biggest ones, and I have more than a hundred PRs from contributors to deal with. I guess you're, you know, one of the best people to talk to about this because you're, you're actually literally coming up with something that is a totally different pathway.

    4. FC

      That's right. That's right. So, uh, what we're doing at Ndea is, uh, we're doing program synthesis research. And when I talk about program synthesis, often people ask me, "Oh, so are you doing like codegen? Are you, uh, building an alternative to coding agents?" And it's actually not at all what we are doing. We are working at a much, much more, uh, much lower level than that. Uh, what we're actually doing is that we are trying to build a new branch of machine learning, an alternative to deep learning itself, uh, rather than like coding agents. Coding agents are like this very, very high level last layer piece of the stack, and we're actually trying to rebuild the whole stack on top of different foundations. So we're building a new learning substrate that's very different from, you know, parametric learning, deep learning. So if you go back to, uh, the problem of machine learning, you have some input data, some target data, and you're trying to find a function that will map the inputs to the targets that will hopefully generalize to new inputs. And, uh, if you're doing deep learning, what you're doing is that you have this parametric curve that serves as your, as your function, as your model, and you're trying to fit the parameters of the curve via gradient descent. And this is basically what we are doing, uh, except

  5. 3:04–5:20

    Why Ndea Isn’t Competing With Coding Agents

    1. FC

      we are replacing the parametric curve with a symbolic model that is meant to be as small as possible. It's like the simplest, uh, possible, uh, model to explain the data, to model what's going on. Uh, and of course, if you're doing that, you cannot apply gradient descent anymore, so we are building something that we call, uh, symbolic descent, which is like the symbolic space equivalent of gradient descent. The idea is to build this new machine learning engine that's giving you, uh, extremely concise symbolic models of the data you're feeding into it, and then we are gonna make it scale. And so everything you're doing with machine learning today, with parametric curves, we should be able to do it, uh, with symbolic models in the future in a, in a way that will be much, much closer to optimality. Much closer to optimality in the sense that you're gonna need much less data to obtain the models. The models are gonna run much more efficiently at, at inference time because they're gonna be so small. And because they are so small, they will also generalize much better and compose much better. You know, the, the minimum description length principle, that the model of the data that is most likely to generalize is the shortest. And I think you cannot find a model like this if you're doing parametric learning. You need to, you need to try symbolic learning.
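
      To make the contrast concrete, here is a minimal, illustrative sketch (not Ndea's actual symbolic descent, which is unpublished): instead of tuning the parameters of a fixed curve with gradient descent, search a tiny space of symbolic expressions and keep the one with the best MDL-style score, i.e. fit error plus a penalty for program length.

      ```python
      # Toy illustration only: fit data with the *shortest* symbolic program that
      # explains it, rather than a parametric curve tuned by gradient descent.
      # The candidate space, scoring, and search are stand-ins, not Ndea's method.
      import itertools

      xs = [0, 1, 2, 3, 4]
      ys = [1, 3, 5, 7, 9]  # generated by y = 2*x + 1

      def candidates():
          consts = range(-3, 4)
          for a, b in itertools.product(consts, consts):
              yield (f"{a}*x + {b}", 2, lambda x, a=a, b=b: a * x + b)        # size 2
              yield (f"{a}*x*x + {b}", 3, lambda x, a=a, b=b: a * x * x + b)  # size 3

      def mdl_score(size, fn):
          fit_error = sum((fn(x) - y) ** 2 for x, y in zip(xs, ys))
          return fit_error + 0.1 * size  # description-length penalty: shorter programs win ties

      best_desc, _, _ = min(candidates(), key=lambda c: mdl_score(c[1], c[2]))
      print("shortest model that explains the data:", best_desc)  # -> 2*x + 1
      ```

      Real program synthesis has to search astronomically larger program spaces than this brute-force toy, which is exactly where a guided "symbolic descent" would matter.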

    2. GT

      That's fascinating.

    3. DH

      So the rest of the industry is just pouring more and more billions of dollars down an approach that was set years ago. Can you like help make the case for why you think that it's the right thing to explore alternate approaches instead of just to keep putting more money into the current approach?

    4. FC

      I mean, everybody's, uh, is, uh, uh, you know, building on top of the LLM stack these days, which makes sense because, you know, the, the returns are there, like it's actually working. So it would seem very sensible for everybody to just be doing, uh, what seems to be the, the, the currently most productive path. Uh, but often it's actually-- it's, it's counterproductive to have everybody working on the same thing. Like, I personally don't think that, um, machine learning or AI in fifty years is still gonna be built on this stack. I think this is a stack that is, uh, very nice. Maybe it even gets us to AGI, uh, but it's not as efficient as it should be. I think i-it's inevitable that, uh, the world of AI will, uh, trend over time

  6. 5:20–7:22

    Why Everyone Might Be Wrong About Scaling LLMs

    1. FC

      towards optimality. And so I'm trying to sort of like leapfrog directly, uh, to optimality, like to build, to build the foundations of optimality today. But in general, you know, our vision is very ambitious, and I'm not saying that we're gonna be successful, like we have maybe a, a ten or fifteen percent chance of success. Uh, but that is enough, uh, that it's worth trying, right? And I think in general, like among, among listeners, if you have, uh, a big idea and it has very low chance of success, but, uh, if it works, it's gonna be big and no one else is gonna be working on it, right? It's, it's not something popular. It's something where, if you don't do it, no one else will do it. And this is basically our situation. If you're in this situation, then you should, you should try your chance. You know? You should go and work on it.

    2. DH

      I mean, that's almost like the mission statement of Y Combinator, the thing that you just said. [laughs]

    3. FC

      Yeah.

    4. DH

      [laughs]

    5. FC

      Yeah. The reason it's important is that, again, if we don't do it, no one else will do it, right? So it's worth trying. Even if we don't succeed, it's worth trying.

    6. SP

      Has the success-- well, very specifically of the coding agents, I guess, built on top of the LLM stack, like, has their success surprised you at all, in, in particular, like, say, over the last six months or so?

    7. FC

      Yeah, absolutely. I think it has surprised many people. It definitely did surprise me. If you look at why everything is, is starting to work so well with coding agents, it's really because, uh, code provides you with a, a verifiable reward signal. And I think right now we're in the situation where any problem where the solutions you propose can be, uh, uh, formally verified, and you can actually trust the reward signal. It's not just some guess made by a model. Any domain like this, uh, can be fully automated with current technology, with, with the LLM-based stack. And, uh, code is sort of like the first domain to fall, but there will be many others in the future. I think mathematics is also, is also primed to see a, a revolution in the next few years for the same reasons, again, because the domain just gives you verifiable rewards.

    8. DH

      I guess a challenge for a formally verified domain is you have to somehow take a domain and make it verifiable, which is the trick. I mean, code is very natural.

  7. 7:22–8:50

    Why Coding Agents Suddenly Work So Well

    1. DH

      You could test, there's bugs, compiles, et cetera. And mathematics as well, whether all the theorems and proofs work out. I guess it becomes more nebulous when you go a couple degrees off, where there are fields that are not naturally formally verified-

    2. FC

      Yeah

    3. DH

      ... and you need to come with a, again, with some sort of a function to come up with that reward that makes it verifiable.

    4. FC

      Yeah, yeah.

    5. DH

      With very fuzzy things like, let's say, English language and composing the perfect essay, how do you make that formally verifiable?

    6. FC

      Yeah, yeah. Absolutely. I mean, writing essays is, you know, the typical example of a domain that's not-

    7. DH

      Mm-hmm

    8. FC

      ... uh, verifiable. And so what you're gonna see is that progress of reasoning models and, and base LLMs on this type of, of, of domain is, is, you know, is gonna be very slow because the stack we're using, like the LLM stack, is very, very reliant on its training data. It's basically just operationalizing the training data. And for writing essays, the training data is coming from, uh, human experts, like annotating, uh, answers, and that's costly. So you're gonna see this very, very slow progress. Maybe, maybe it's even gonna stall. But for, for any, any verifiable domain, like take code for instance, which was the big unlock, is, uh, when, uh, when people started creating these code-based like training environments, uh, for, for post-training, uh, where the, the, the reward signal, the verification signal is provided by things like, uh, unit tests-

  8. 8:50–10:48

    The Limits of LLMs in Non-Verifiable Domains

    1. DH

      Mm-hmm

    2. FC

      ... and so on. And so that means that, uh, the model was not just working from human provided annotations. It was actually trying some things, uh, verifying the answer, and, uh, and generating a lot, lot more training data in the process. So a much denser coverage of the problem space. And not just coverage in terms of like, is, is the answer right or wrong, but also, uh, starting to build, uh, models of the execution traces, right? Uh, so that the models could start incorporating a, uh, an execution model. Very much the way that, uh, uh, human programmers, you know, when they look at code, they're, they're sort of like executing the code in their minds. They, they keep track of the value of variables and so on, is also what the models are trying to do now, and this is why it's working so well. And it's possible because you're working with this very, uh, formal, fully verifiable environment. You cannot do that with essays, you cannot do that with, you know, law or, or many other problems.
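
      A hedged sketch of what the unit-test-based verification signal just described looks like in practice: the reward for a candidate program comes from actually executing it against tests, not from a model's opinion, so a post-training loop can trust it. The `solve` convention and the exec-based harness here are illustrative assumptions, not any lab's actual setup.

      ```python
      # Illustrative only: a trusted, verifiable reward for generated code.
      # The candidate must define a function `solve`; the reward is the fraction
      # of unit tests it passes when actually executed (no learned judge involved).

      def verifiable_reward(candidate_source: str, tests: list) -> float:
          namespace = {}
          try:
              exec(candidate_source, namespace)   # run the candidate's definitions
              solve = namespace["solve"]
          except Exception:
              return 0.0                          # fails to even define `solve`
          passed = 0
          for args, expected in tests:
              try:
                  if solve(*args) == expected:
                      passed += 1
              except Exception:
                  pass                            # runtime error counts as a failure
          return passed / len(tests)

      tests = [((2, 3), 5), ((0, 0), 0), ((-1, 4), 3)]
      print(verifiable_reward("def solve(a, b):\n    return a + b", tests))  # -> 1.0
      ```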

    3. DH

      I think I really like how you define intelligence and how to measure it, which brings to the question of, uh, also sharing, having you share the history of, uh, ARC-AGI.

    4. FC

      Yeah. So my, my definition of, uh, general intelligence, you know, many people, uh, around the industry these days, they say, uh, AGI is gonna be a system that can automate most economically, economically valuable tasks. And to me, that definition is, uh, it's, it's about automation. It's not about intelligence, it's not about general intelligence.

    5. DH

      Mm-hmm.

    6. FC

      So my definition is, uh, AGI is basically gonna be a system that can approach any new problem, any new task, any new domain, and make sense of it, like model it, uh, become competent at it, uh, with the same degree of efficiency as a human could.

    7. DH

      Mm-hmm.

    8. FC

      So meaning it's gonna need basically the same amount of training data, uh, and training compute-

    9. DH

      Mm-hmm

    10. FC

      ... as, as a human would. Which is, which is very little. Like humans are really, really, uh, data efficient. So general intelligence is human level skill acquisition efficiency on the, on the same scope

  9. 10:48–13:30

    What AGI Actually Means (And Why Most Definitions Are Wrong)

    1. FC

      of tasks that, uh, humans could potentially, uh, le-learn to do.

    2. DH

      Do you think it's possible that we will accomplish the first definition of AGI, the automate most economically useful work, before we accomplish your definition?

    3. FC

      Absolutely. I think that's, that's the trajectory that we're on right now. And I think it's already true that in principle, current technology can fully automate at human level or beyond any domain where you have, uh, verifiable rewards, right? And code, code being the first one. And I think figuring out AGI, figuring out like human level, uh, you know, learning efficiency over arbitrary tasks, that's probably gonna take, uh, a different sort of technology, a different, a different mindset, a different approach.

    4. DH

      Do you think that LLMs can be bent to have the same sample efficiency as humans, or do you think it's like fundamentally just impossible and we need a new approach, and that's, that's the thing that you're hoping, hoping to solve?

    5. FC

      With enough compute, everything starts looking like everything else. Every... Like compute is a great equalizer. Every approach starts looking the same. And I think it's possible in principle to build something that looks a lot like AGI on top of the LLM stack. Uh, but it's not gonna be LLMs per se, it's gonna be this new layer. Perhaps, you know, it's gonna be even a, a, a few layers above, not just one layer above, but a few layers above. Uh, but it... you, you can build it on top of, uh, LLMs because LLMs are a kind of computer, right?

    6. DH

      I see.

    7. FC

      Uh, I do believe, however, this would be the wrong thing to do. Because it will be very inefficient. I think AI, AI research will have to trend towards not just efficiency, but in fact optimality over time. And for this reason, future AI in a few decades, uh, it's not gonna be this, uh, harness on top of a reasoning model on top of a base LLM. Uh, it's gonna be much, much lower than that.

    8. GT

      To Diana's question, do you wanna talk about how you actually designed ARC-AGI and why it's a good barometer of that?

    9. FC

      I mean, I, I, you know, I've been doing deep learning for a very, very long time, and initially my, my, my take, my mindset was that deep learning was gonna be able to do everything.

    10. DH

      You were the creator of Keras before even all the other frameworks became very popular.

    11. FC

      Yeah, that's right. That's right. I was, uh, training deep learning models, uh, for natural language processing, in fact-

    12. DH

      Mm-hmm

    13. FC

      ... in, uh, 2014, and, uh, from that work, uh, you know, I actually started, uh, developing this open source library, which I, I released, uh, in fact, uh, exactly 11 years ago, uh, March, March 2015.

    14. DH

      Mm.

    15. FC

      Uh, so it was Keras, and, and then it got popular, and then I ended up, uh, sort of like doing less of the research that I, that I started Keras for and, uh, more of working on the framework itself, just because it had really, really good product market fit. And so my, my take,

  10. 13:30–14:00

    Why Deep Learning Hits a Wall

    1. FC

      you know, around that time, around like 2015, 2016, was that deep learning was extremely general, that you could do everything with deep learning. That you didn't need anything else. It was Turing-complete. So, uh, my take was basically that deep learning was differentiable programming. Uh, so anything you would do with software, you could in principle train a deep learning model on the right inputs and outputs to do the same thing. And, uh, in, uh, 2016, I was doing, uh, research at Google Brain on trying

  11. 14:00–18:20

    ARC’s Origin Story

    1. FC

      to train deep learning models to help with, uh, reasoning problems, and in particular, uh, uh, first order logic problems, uh, uh, theorem proving, and so on. And I started finding that you could not really get gradient descent to encode, uh, uh, sort of like reasoning style algorithms. It was not because the models could not represent these algorithms. It was because gradient descent could not find them, right? So the problem was that it wasn't about deep learning not being Turing-complete or anything like that. Like, it... That was not the problem. The problem was gradient descent, right? Gradient descent would not find generalizable programs. It would instead, uh, end up doing, uh, overfit pattern matching, right, uh, over, over sequences of, uh, uh, input tokens. And-

    2. GT

      Which I guess people could argue like that's what's happening.

    3. FC

      I mean, that's, that's-

    4. GT

      It's useful, but-

    5. FC

      That's still what's happening today-

    6. GT

      Yeah

    7. FC

      ... in a, in a, in a slightly... It's, it's a, it's slightly higher-

    8. GT

      Yeah

    9. FC

      ... higher level version of, uh-

    10. DH

      It's with a lot of data, so it doesn't feel like overfitting-

    11. GT

      Yeah

    12. DH

      ... because the data has a lot more distribution with-

    13. FC

      Yeah

    14. DH

      ... what you see.

    15. FC

      With a lot more data, and also I, I think models today, uh, they are a lot more compressive-

    16. GT

      Yeah

    17. FC

      ... of the data, which is why, why they, they generalize better.

    18. GT

      So all models are wrong, but some models are useful, and then I guess what I'm hearing is like your method might find the right model.

    19. FC

      That's right. That's, uh, that's, uh, where, where the, uh, idea came from, and I was like, you know, at the time back in 2016, 2017, I was like, "Okay, we are gonna need a, a benchmark-

    20. DH

      Yes, a benchmark

    21. FC

      ... to capture these ideas. Uh, we're gonna need a program synthesis benchmark." And, uh, my, my mental model for that was ImageNet.

    22. DH

      Mm.

    23. FC

      I was like, "Oh, I'm gonna make the ImageNet of reasoning." So I started brainstorming a few ideas around like 20, 2017. I explored many different things. Uh, I tried working with, uh, in particular cellular automata, like, uh, uh, a setup where you show a model, uh, cellular automata outputs, and it must recreate, uh, the program that generated them, like that sort of thing. Uh, and eventually I settled on the, uh, ARC-AGI format, uh, around like early 2018. You know, I was doing this on the side. It was a side project. Like, my main project was, uh, developing Keras at Google. I wasn't moving very, very fast, uh, on that. Uh, so summer 2018, uh, I wrote the ARC task editor, and then I started just making lots of tasks by hand. And about one year later, I had made 1,000 tasks. And so I wrote up, uh, the paper that was explaining what this was about, what the big idea was, like intelligence as a, as a skill acquisition efficiency, uh, and I published, uh, all of that in, uh, in 2019.

    24. DH

      In parallel, GPT-3 was coming out in 2020 and starting to show signs until the ChatGPT moment around 2022, end of the year, and the industry took off with that. And this was one of the benchmarks that models were performing really badly on, and it was very obscure. I don't think many people knew about it. It was mostly niche research communities that maybe read your paper.

    25. FC

      Yeah. People who worked on program synthesis knew about it, uh, but a lot of people who worked on, on deep learning, on scaling up LLMs didn't really care for it. And part of the reason why is because LLMs did not work well or at all on the benchmark. For a benchmark to capture the attention of the research community, it needs to start working a little.

    26. GT

      [laughs]

    27. FC

      Right? Uh, if it's too hard, people are gon- are just gonna dismiss it.

    28. GT

      You're just ahead of your time, clearly, because we're not on ARC-AGI V1 anymore, and then II is reaching saturation, and then-

    29. FC

      That's right

    30. GT

      ... III is out now.

  12. 18:20–22:49

    ARC Benchmarks Explained: From V1 to V3

    1. FC

      reasoning, yeah. So the base models. So performance of, uh, of base- base LLMs on, on V1 stayed very, very low even though, in the meantime, you know, we had scaled up these models by 50,000x, right?

    2. GT

      Mm-hmm.

    3. FC

      So it was really telling you that, you know, more scale, scaling up pre-training alone was not gonna crack the benchmark. This was not enough to demonstrate that the model had fluid intelligence. And then, uh, the moment, uh, models started performing well on ARC 1 was with the first reasoning models. In particular, uh, the, the OpenAI o1 and then o3, uh, models, which by the way, they were demonstrated by OpenAI on ARC because it was the one unsaturated reasoning benchmark that was really showing that this model was different. It had new capabilities that we had not seen before. And so with reasoning models, you start seeing this sudden like step function change, uh, on, on ARC 1. And so ARC 1 was really the benchmark that signaled that at this moment in time, something was happening. And so-

    4. GT

      Something big.

    5. FC

      Yeah, something big, like new capabilities were emerging, like reasoning was new and different. And it was actually not obvious at the time. Like, you know, I don't know if you remember when the, when the, uh, o3, uh, preview was, was announced by OpenAI. It was-

    6. GT

      That was end of 2024 actually.

    7. FC

      Yeah, December 2024. And like sure, it was like a, a, a huge like step function progress on ARC, uh, but it was very expensive. It did not really have product market fit effectively. But if you looked at, uh, at ARC results, you knew that this was big and important. And then we released ARC 2, which was the same format but, uh, more difficult, like with more, uh, uh, composition, uh, uh, at the level of the, the, the reasoning chains. And what happened is that, so the, the earliest reasoning models started very, very low on ARC 2. And then around the same time as, uh, coding agents started working, you saw this-

    8. GT

      Just last year.

    9. FC

      Yeah. So ve-very, very recent, just a few months ago, you saw this, uh, uh, uh, very, very fast like saturation, uh, of ARC 2. And so again, like ARC 2 signaled that, yes, there was this, uh, this new set of capabilities emerging. So I think the benchmark did a really good job at capturing the advent of reasoning models and then the advent, uh, of agentic coding. Like this, this new paradigm where if you have, uh, verifiable rewards, then you can basically fully automate, uh, the domain, which by the way is true of ARC. Like ARC does provide a, a verifiable reward.

    10. GT

      I guess for V2, what, what caused the... So one was clearly reasoning. Two, a benchmark doesn't care how you solve it. I guess embedded in what you said, like were people using codegen to then solve V2?

    11. FC

      That's right. So not, not necessarily codegen, uh, per se, but, uh, the frontier labs have been targeting ARC V2. And, uh, the progress you saw on ARC V2 is actually a result, uh, of this very, very large scale targeting.

    12. GT

      Yeah.

    13. FC

      So what you can do to solve ARC V2 is you ask your reasoning model to make more tasks like those in the benchmark. Uh, and then you try to solve them using, let's say, let's say program induction, for instance, uh, uh, still using your reasoning model. Then you verify the solution. Again, it's verifiable, so you can, you can trust, uh, the answer. Um, and then you fine-tune the model on the successful reasoning chains, and then you keep repeating, like you generate new tasks, you solve them, you verify the solution, you fine-tune the model on the reasoning chains. And, um, you can keep doing this millions of times, right? Like you, you just need to spend more money.

    14. GT

      Yeah. This is the RL loop that-

    15. FC

      This is... Yeah, exactly

    16. GT

      ... is happening. Yeah.

    17. FC

      And the, the new paradigm in AI is basically that any domain where this is true, where you have, uh, the ability to generate these, these true, uh, uh, verification signals, you, you can run this, this kind of loop, right? If you can run this kind of loop, you can mine, uh, you can brute force mine effectively the entire space and get extremely high performance. This is basically the, the process through which ARC 2 was saturated. So what it tells you is that it's not so much that the models have higher fluid intelligence, uh, than, than they did with the, with the first reasoning models. It's just that you have this new paradigm of post-training, and this is exactly what led to agentic coding. So it does matter. It is, it is valuable. It is useful.
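
      Structurally, the loop he is describing looks something like the sketch below. Every callable is a hypothetical placeholder for a frontier lab's internal tooling; only the shape of the loop, generate tasks, attempt them, verify, fine-tune on verified traces, comes from the description above.

      ```python
      # Structural sketch of the generate-solve-verify-fine-tune loop described above.
      # All four callables are hypothetical stand-ins for a lab's internal systems.
      from typing import Callable, Iterable, List, Tuple

      def targeting_loop(
          generate_tasks: Callable[[], Iterable],              # model proposes ARC-like tasks
          attempt: Callable[[object], Tuple[object, object]],  # returns (reasoning_trace, answer)
          verify: Callable[[object, object], bool],            # trusted check, e.g. exact grid match
          fine_tune: Callable[[List[object]], None],           # update the model on good traces
          rounds: int = 10,
      ) -> None:
          for _ in range(rounds):
              verified_traces = []
              for task in generate_tasks():
                  trace, answer = attempt(task)
                  if verify(task, answer):        # verifiable reward: no guessing involved
                      verified_traces.append(trace)
              fine_tune(verified_traces)          # reinforce only what provably worked
      ```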

    18. GT

      It's not that the mar- models are smarter, it's that they're suddenly more useful. It is possible to be more useful in particular domains without being smarter.

    19. FC

      Yeah, absolutely.

    20. GT

      Clearly, because that means good

  13. 22:49–27:03

    The RL Loop Powering Coding Agents Today

    1. GT

      things for me. I'm not getting any smarter [chuckles] right now, like at, at, you know, age 45, but, you know, I can learn how to do things, and that's sort of what's happening with the models as of like late.

    2. FC

      Yeah, absolutely. When, when it comes to, uh, competency, there's always a trade-off between intelligence and knowledge. If you have more knowledge, if you have better training, uh, you need less intelligence to be competent. And that's exactly, uh, uh, what happened with the, the rise of coding agents, right? The models don't have higher fluid intelligence per se. They don't have like a higher, uh, uh, IQ so to speak. It's just that they're way better trained, and they're way better trained in, uh, in two ways. So they're, they're not just trained to, to complete code anymore. They're actually trained via trial and error in these, uh, RL, uh, post-training environments with, you know, true reward signals. And also they're trained, uh, to embed this, uh, model of code execution, right? Where they, they, they, they, they learn to keep track of the value of variables, uh, uh, over es- an execution cycle. And that's what, what's leading to this extremely strong product market fit, uh, for agentic coding today. And it's really, it's completely changing software engineering.

    3. GT

      This has happened not too long ago, the saturation. We actually had the founders of Poetic that came and spoke about-

    4. FC

      Yeah

    5. GT

      ... the approach, which really sounds like this new way of, uh, getting LLMs to perform is building this, uh, agent harness, right?

    6. FC

      Yeah.

    7. GT

      And the harness is basically structuring a problem domain into something that can be formally verified, and they did that basically for ARC V2.

    8. FC

      Yeah.

    9. GT

      Which, uh, when they released it, they were at the top of the benchmark. But then the crazy thing is I actually worked with a company in the Winter '26 batch not too long ago called Confluence Lab, which actually ended up saturating the V2 results with 97%, and I think their task cost was, uh, a lot more efficient too.

    10. DH

      And the approach they basically took is similar to this. I think they built the harnesses on top of it in order to get the LLMs to, to go and build different tasks and program through it.

    11. FC

      Yeah.

    12. DH

      Which then for me, I was like, wow, is this batch-- during the batch, they only worked on it for a couple of months, and they were able to saturate this benchmark that has been around for a long time. It's like something special is happening.

    13. FC

      Yeah. Yeah. There's a lot of progress right now that's driven by custom harnesses around the task, and the harness is basically a way for the, the human programmer to, um, input into the model like, uh, higher level, like, uh, solution strategies, basically. I mean, to me, the fact that you need humans to engineer these harnesses is also a sign that we're, we're, we're short of AGI today because if we had AGI, you know, AGI would just make its own harness. It would not need to be told how to solve a problem. It would just figure it out. But it is very effective. Like harnesses, I don't think they get us closer to AGI in any sense, uh, but they-- it's a very valuable area of research because that can lead to task automation at scale.

    14. SP

      YC's next batch is now taking applications. Got a startup in you? Apply at ycombinator.com/apply. It's never too early, and filling out the app will level up your idea. Okay, back to the video.

    15. DH

      Can you tell us about then what V3 is gonna measure that's, uh-

    16. FC

      Yes

    17. DH

      ... just got released?

    18. FC

      Yeah, absolutely. So if you look at V1, V2, uh, it was really focusing on your ability to, uh, produce like causal models, uh, of a pattern that was just given to you, like the data was given to you. Uh, so it was static, it was, uh, passive and really focused on, uh, modeling. And, uh, V3, it's completely different. We are trying to measure, uh, agentic intelligence. So it's interactive, it's active, like the data is not provided to you. You must go get it. The idea is that your agent is dropped into a new environment, which is kind of like a, a mini video game, and it's not provided any instructions. It's not told what to do. It's not told, uh, what the goal even is or what the controls even are. And it must figure out everything on its own via trial and error. So we are, we are not just, uh, measuring, you know, the, uh, the AI's

  14. 27:03–31:14

    ARC-AGI V3: Measuring “Agentic Intelligence”

    1. FC

      ability to model its environment. We are also looking at, uh, its exploration efficiency, its ability to acquire goals on its own, like goal setting, and of course, its ability to plan, uh, through the model of the environment it's created and, and to execute the plan. Uh, and so together, you know, all, all of these abilities, we call that agentic intelligence, and we are looking for AI systems that could learn to play these games and, and, you know, crack them with the same degree of action efficiency as a human. If you look at the human, they are dropped into this new environment. They, they try a few things. They start understanding how things work. Uh, they can, they can solve the environment, you know, in, in a few hundreds to thousands of actions. We're trying to look for AI systems that could match, uh, this efficiency. And by the way, we know that all of these test environments in ARC3 are solvable by humans with no prior training because we actually, uh, tested them, uh, on, on regular people. Yeah, at first you just see this screen, and you, you know, you have, uh, these keys available, but you don't know what they do, and you must figure out everything from scratch. And humans are really good at that, by the way. They're really good at exploring efficiently, with making sense of something new, and eventually cracking the game. And frontier models today, they are not very good at it.
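
      As a rough sketch of the evaluation setting he describes (not the official ARC-AGI-3 harness or scoring formula): the agent interacts with an environment it has never seen, is given no instructions, and its score depends both on solving the game and on how few actions it needed relative to a human baseline. `env` and `agent` below are placeholders.

      ```python
      # Illustrative sketch of an action-efficiency evaluation; `env` and `agent`
      # are placeholders, and the scoring formula is an assumption, not the
      # official ARC-AGI-3 metric.

      def evaluate_agent(env, agent, human_baseline_actions: int, max_actions: int = 10_000) -> float:
          observation = env.reset()                   # no goal description, no instructions
          actions_taken = 0
          solved = False
          while not solved and actions_taken < max_actions:
              action = agent.act(observation)         # explore, model the world, plan
              observation, solved = env.step(action)  # the only feedback is the next game state
              actions_taken += 1
          if not solved:
              return 0.0                              # never cracked the game
          return min(1.0, human_baseline_actions / actions_taken)  # action efficiency vs. humans
      ```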

    2. DH

      If the reasoning models cracked V1 and the, like, reinforcement learning environments cracked V2, do, do we need a new advance to crack V3? Do the... Do, do even the best techniques currently, like, not work?

    3. FC

      Yeah. I mean, I'm pretty curious to see how frontier labs are gonna react to V3 and how they're gonna start to target it. Um, it is designed to be more resistant, uh, to the same kind of targeting strategy as what we saw for V2 in, in particular. Like, of course, you can try to just make more ARC3-like games and then train your agents, uh, in them. Um, but the thing is, we've, uh, deliberately tried to create a private set of environments that is significantly different from the public set. Like, you can look at the public set. It's not actually giving you that much information about what's in the private set. Uh, in the private set, you will have very different games with very different concepts.

    4. DH

      Mm-hmm.

    5. FC

      And also the public set is meant to be substantially easier. So your performance on the public set is not actually, it's not representative of how well the system would do on private. So for this reason, it's gonna be harder to target.

    6. DH

      Uh.

    7. FC

      And that makes it a better test of fluid intelligence as opposed to a test of how much effort you put into, into cracking it.

    8. SP

      I'm so curious, how do you come up with these games? They're so creative.

    9. FC

      Yeah. We set up a, an entire, uh, uh, video game studio, right, to, to create them. Uh, so we got, uh, over 250 games. Uh, and you know, they're, they're, they're pretty quick to play. Like, uh, each game takes you maybe 10 minutes or, or, or a bit less, uh, uh, to play from scratch, like up- upon first contact. And we have like 250 plus, and, uh, we set up this, uh, uh, a very productive game studio where we had any given week, we had, uh, multiple games, uh, in progress. We had like this, this pipeline, uh, including, you know, design, implementation, uh, review, human testing and, and, uh, and, uh, many, many iteration cycles to, to, to make sure that the, the, the game comes out right.

    10. DH

      Who, who's working in the studio?

    11. FC

      Right. Uh, we-

    12. DH

      Who are the creators?

    13. FC

      Yeah. We hired a, a team of game developers, and we built our own game engine.

    14. DH

      Wow. So, so it's actually people who, like, previously worked in the game-- in the, in the video game industry.

    15. FC

      That's right. That's right. So one thing to keep in mind though is that the games in ARC3 are unique, right? They're, they're trying to not borrow elements, concepts from previous video games. Uh, they're built entirely on top of, uh, core knowledge priors. Like things like just, just, you know, elementary knowledge, like basic physics, uh, understanding of objects, understanding of the notion of, uh, agents, for instance, like agents being objects with goals and in-in-intentions. Um, but we are, we are not incorporating any language, any, like, cultural symbols like, you know, arrows for instance, uh, or the color green meaning go and color red meaning stop, that sort of thing. Uh, there's no external

  15. 31:14–35:31

    Inside the ARC Game Studio

    1. FC

      knowledge that's involved, uh, in these games.

    2. SP

      It's like one of those, uh, IQ tests that are just pattern matching, but now it has time series.

    3. FC

      Yeah. Uh, and it's not just time series, it's interactive.

    4. SP

      Interactive.

    5. FC

      You must create your own path through game space, right? You mu- you must-

    6. SP

      Mm.

    7. FC

      You know, in, in, in an IQ test like problem, like, you know, what ARC one and two is, the data that you must model is provided to you. You already have the data, you just, you just need to find a causal rule to explain it. With ARC three, actually you must gather the data, uh, and you must do so efficiently.

    8. SP

      Mm.

    9. FC

      Like of course you could say, "Well, I'm just gonna, you know, brute force mine, uh, the space of, uh, every possible game state, and then I find the solution." You cannot do that because if, if you try to do that, you will score extremely low, even if you manage to solve the level, uh, because you're scored on your efficiency. You must match human level efficiency.

    10. GT

      It's funny, it's like a almost, uh, coming full circle. This level of AGI with games sort of is the match pair to OpenAI writing... I mean, you know, Tom Brown, uh, one of the co-founders of Anthropic, had to write like the harness code to allow like the, you know, pre-GPT AI at OpenAI to play StarCraft. [chuckles]

    11. FC

      Yeah, yeah. OpenAI worked on the, on the, in particular on the, on DOTA 2.

    12. GT

      Mm-hmm.

    13. FC

      Uh, they had the OpenAI 5 model which was, if I recall correctly... So this was like not just pre-GPT, but also mostly pre-transformers because they were working with a stack of LSTM-

    14. GT

      Yeah

    15. FC

      ... uh, layers if I recall correctly. And even before OpenAI, uh, DeepMind worked a lot on video game, uh, uh, you know, solving video games. You had DeepRL, uh, and they were the first to do, uh, uh, Atari games right back in 2013. That, you know, they were very, very early, very, very visionary in that sense to, to work on, on this problem so early with these methods, uh, which are still very modern methods.

    16. GT

      Yeah.

    17. FC

      So the big difference is that if you look at, um, at Atari games for instance, or even DOTA, you're training, uh, on, on the same environment as what you use for testing. So effectively you're just trying to memorize the best strategies. You're trying to, uh, at, at training time, explore the full, uh, space of possible game states and productionize, operationalize, uh, that knowledge into, into, into the model and then at inference time you're basically just recalling that knowledge. And that's explicitly what you are trying to avoid with ARC three. Uh, you're not playing games, uh, that you've seen before. You're not playing games that you've been trained on like for millions of hours. Like the, the OpenAI 5 model for instance was playing a, a restricted version of DOTA 2 and it was trained on like tens of thousands of, of hours of gameplay effectively. I think may-maybe in millions. But it was just an insane amount of training data. With ARC three you're being evaluated on games that you're seeing for the very first time.

    18. GT

      Mm.

    19. FC

      And every action you spend exploring is counted towards your efficiency score, right? So you're really focused on measuring fluid intelligence, your ability to efficiently explore, efficiently produce a world model, uh, of the environment and then use this model, uh, to infer goals, uh, plan towards these goals, uh, and, and eventually crack the game.

    20. GT

      One of the arguments for, um, you know, Ndea is that you're able to do all of the intelligent tasks for, you know, maybe 0.3 cents for an ARC task. But you know, for the same task on a foundation model with LLMs it's, you know, a dollar to $10. And then there's this other aspect that we've been tracking where it seems like, uh, more and more intelligence, um, at least on the LLM side, uh, can be distilled down into smaller and smaller models. And so on the one hand like they're scaling up, but then they're like distilling smarter and smarter small models. I guess your approach might indicate that it's not billions of parameters, like the, you know, Ndea achieving AGI might not be a, a, you know, sort of inherently a scale thing at all. There's a platonic ideal of the Ndea model that achieves AGI.

  16. 35:31–44:01

    Could AGI Fit in 10,000 Lines of Code?

    1. FC

      Yeah.

    2. GT

      Do you ever think about it in terms of like, well it would fit on a floppy disk?

    3. FC

      Well okay, there are, there are two things to separate. There's the sort of like fluid intelligence engine.

    4. GT

      Mm.

    5. FC

      I think it's gonna be a very, very small code base, uh, and a very small set of models associated with it, and it's probably gonna be on the order of megabytes, right? And then you have the knowledge base so to speak, uh, that's gonna be, uh, layered below this, this fluid intelligence engine. Like you know, fluid intelligence has to draw on some knowledge and that knowledge is gonna take up a lot more space. So I think it's, it's-

    6. GT

      Yeah

    7. FC

      ... uh, it's important to, to differentiate the two. I do believe that, you know, when you create AGI retrospectively it will turn out that it's a code base that's less than 10,000 lines of code.

    8. GT

      Hmm.

    9. FC

      And that if you had, if you had known about it back in the, in the 1980s you could have done AGI back then-

    10. GT

      Oh really?

    11. FC

      ... using the, the computer resources available back then.

    12. GT

      Wow, that's a crazy prediction.

    13. FC

      That's... I, I think retrospectively this will turn out-

    14. GT

      My God

    15. FC

      ... to be, to be true.

    16. GT

      Wow.

    17. FC

      Yeah.

    18. GT

      So it was just like hiding under our noses in plain sight for like 40 years. It took us like 40 years to figure it out.

    19. FC

      Yeah, that's right. That's right.

    20. GT

      Well that second thing sounds like Douglas Lenat's like Cyc project or is that the wrong way to think about it? It's like there's sort of knowledge about the world.

    21. FC

      Yeah, yeah.

    22. GT

      And then there's methods. Like the program, what I hear is like the program might be 10,000 lines and then it operates on like-

    23. FC

      On knowledge base that's very large. So the problem with Cyc, uh, I mean th-th-there were many issues with it, but one of the big issues is that, uh, there was no learning involved.

    24. GT

      Yeah.

    25. FC

      Right?

    26. GT

      It's just the knowledge l- like in some sort of-

    27. FC

      Like the knowledge was hand crafted

    28. GT

      ... it's like purely symbolic knowledge and it was probably inaccurate.

    29. FC

      The way you want to be building AGI is that you want to be removing humans, uh, from, from the improvement loop as much as possible. You don't want a system where every improvement in system capability, uh, has to involve a human engineer doing something. And it's actually the strength, uh, of deep learning and foundation models that you can just scale up the knowledge base. Like an LLM is effectively a knowledge base. It's a bank, uh, of, uh, of, you know, modular, uh, vector programs that map patterns of input tokens to patterns of output tokens. And you can, can scale up that knowledge base by just adding training data and training compute with no further human involvement. I mean, of course, there's still a little bit of human involvement in, in making sure the training job completes, but it's, it's minor. You've managed to remove humans, uh, from this improvement loop as much as possible, and that's also, uh, what we want for our system. We want a system that's, uh, self-improving, where the improvements are compounding, meaning that every time the system increases its capabilities, it's also increasing the rate at which it increases its capabilities.

    30. GT

      I think this is a PGism. It's like, I'm sorry the essay is so long. Uh, if I had more time, I would make it shorter. [chuckles]

  17. 44:01–46:46

    Building Ndea: From Idea to Compounding Research Stack

    1. DH

      Will there be an ARC four, five, six? Can you keep making it harder?

    2. FC

      Yeah. Yeah. I think there, there will absolutely be ARC four and, and ARC five. I mean, we're currently planning ARC five. Um, the, the point of the ARC-AGI benchmark series is not to say that, well, you know, here's this test. If you pass it, this is AGI. Um, instead, what we are trying to do is we are target- we're targeting, uh, the residual gap in AI capabilities. Like frontier AI is advancing, and we are saying, well, uh, if you compare it to, to, to human abilities, there, there's all these tasks, all these things, it's not doing well. So we are gonna create a benchmark to target that. Uh, and so it's a moving target, right? It's, it's not a fixed point, it's a moving target. There will be ARC four, which will be, uh, in the spirit of ARC three, but more focused on continual learning and, and curriculum learning at longer timescales. So you're gonna, you're gonna have fewer games, uh, but they're gonna have way more levels, and the levels are gonna be compounding, meaning that for, for each level, you need to reuse stuff that you've learned before. Then there's gonna be ARC five. And I'm actually really excited with ARC five. It's very, very new and different, and it's all about invention. And I mean, you, you will see, you will see what that means. Eventually, I expect we will, we'll run out of things to test. Like, as, uh, as we get closer to AGI, um, eventually there will be no measurable difference, uh, between human capabilities and particularly human learning efficiency and, and frontier AI. And when that happens, when it, when it becomes effectively impossible to measure the gap, this is the AGI moment.

    3. GT

      Well, then the machines will take over, and then they will create ARC ASI one.

    4. FC

      Yes. ARC ASI.

    5. GT

      And then it'll continue from there.

    6. FC

      Exactly. Yeah. Yeah.

    7. GT

      Yeah. If you had to put a guess, I mean, years, decades, months? [chuckles]

    8. FC

      Uh, my timeline to AGI, you know, if you, if you just try to, to extrapolate from the, the current rate of progress and the amount of investment that's going into not just the LLM stack, but also like, uh, side ideas, side bets that might work out, like, you know, Ndea for instance, I think we're probably looking at AGI 2030, early 2030s, uh, most likely. So around the time, uh, that we're gonna be releasing like maybe ARC six or ARC seven, uh, that's probably gonna be AGI.

    9. DH

      You guys are doing a different approach to LLMs. Um, do you think there's room for more startups to explore other new approaches? And are there any other ones that you think are promising that don't have time to explore yourself?

    10. FC

      Yeah, absolutely. I mean, there are many different approaches that you could try. I've said like compute is a, is a great equalizer. I think if you look at the amount

  18. 46:46–47:21

    The Future of ARC: Benchmarks That Evolve With AI

    1. FC

      of compute and resources that we've thrown at, uh, deep learning and, and gradient descent and, and scaling that up, if you had thrown the same amount of investment into almost anything else, you would also have seen ex-extremely exciting results, like genetic algorithms, for instance. Uh, if you try to scale up genetic algorithms, I mean, I'm sure you can do incredible things with that. Um, you could, you could in fact probably do new, new science, uh, because, uh, uh, that's based on search, and search is the, is the, is the best fit for, uh, automating the scientific method. Uh, I think, so right now, there's also like approaches

  19. 47:21–53:37

    Why There’s Still Huge Opportunity for New AI Paradigms

    1. FC

      that, uh, build on top of the current stack, but they're slightly alternative, like, uh, state space models, for instance. Uh, there's, uh, the, the xNCM architecture. Like you, you can basically, you know, current frontier AI is it's, it's a stack of things, and you, you can take any layer in the stack and try to propose an alternative. Like if you propose an alternative architecture, uh, you can be doing, for instance, like, yeah, like more like, uh, recurrent models instead of transformers, uh, for, for the architecture. Uh, or you can do even lower level. You're gonna be like, "Okay, we're still gonna be training, uh, parameter curves, but you're gonna get rid of gradient descent, right? We're gonna use like search." Maybe you're gonna do neural evolution. Uh, that's, that's lower level. And the lowest level is, uh, the low, the level where, where we're operating, where we're saying, "Well, actually, uh, forget about curves, uh, forget about parameter tuning, for-forget about gradient descent. We're just gonna do something completely different." Um, and I think if you want to build optimal AI, you are kind of forced to go back to the foundation of the stack. It cannot be like, uh, uh, one, one layer added onto the pile.
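
      One way to picture the "keep parametric models, drop gradient descent" layer he mentions: tune parameters by mutation and selection, with no gradients anywhere. The toy below is a crude (1+1)-style evolutionary search, shown only to make the idea concrete, not a claim about how well it scales.

      ```python
      # Toy gradient-free training: evolve the parameters of a tiny linear model by
      # random mutation and selection instead of backpropagation. Purely illustrative.
      import random

      xs = [0.0, 1.0, 2.0, 3.0]
      ys = [1.0, 3.0, 5.0, 7.0]          # target relationship: y = 2x + 1

      def loss(params):
          a, b = params
          return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))

      best = [random.uniform(-1, 1), random.uniform(-1, 1)]
      for _ in range(5000):
          child = [p + random.gauss(0, 0.1) for p in best]   # mutate: no gradients involved
          if loss(child) < loss(best):                        # select the fitter candidate
              best = child

      print("evolved parameters:", [round(p, 2) for p in best])  # ~ [2.0, 1.0]
      ```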

    2. DH

      Do you think for aspiring researchers to want to do a new neural lab with a different approach, they should be reading research papers from the '70s or '80s and-

    3. FC

      Yeah

    4. DH

      ...go deeply into those, with approaches that are not as invested in nowadays?

    5. FC

      That is actually a great idea because, uh, earlier in the, in the history of the AI research timeline, people were exploring more things and very different things. You've had this sort of like collapse of everything into one approach. It's, it's actually kind of a bad idea. Uh, like consider that not too long ago, like about, about 20 years ago-

    6. DH

      We had the collapse into SVMs too.

    7. FC

      Yeah. I mean, it's, it wasn't-- I wouldn't describe it as a collapse because there, there weren't that many people doing SVMs, and AI was a much, much, uh, smaller field back then. But there was this, uh, widespread understanding that neural networks were, were a failed approach, that neural networks didn't work.

    8. GT

      Mm-hmm.

    9. FC

      And it, it was a waste of time to, to, to keep trying that.

    10. GT

      In the '90s, right?

    11. FC

      Yeah. No, even, even in the, in the, in the late, uh, 2000s-

    12. GT

      Right

    13. FC

      ... this was, this was, this was the sort of things. Uh, basically like when, when I got into, into AI-

    14. GT

      Mm-hmm

    15. FC

      ... uh, people were telling me like, "Hey, neural networks, don't, don't try that." I was like, "Yeah, but it, it looks a lot like what the brain is doing. Like I'm-"

    16. GT

      Mm-hmm.

    17. FC

      "I'm interested in that." If everybody's working on something, you are discarding ideas that will, uh, actually turn out to be very proactive ideas, right? And yeah, like back in the '70s, back in the '80s, people were trying more things, and I think genetic algorithms are actually a very good example of that. Uh, I think this is an approach that has a tremendous amount of potential, but there's, there's not too many people are looking into scaling it up, uh, deeply.

    18. GT

      Are there any characteristics that you would be looking for? I mean, is it as simple as, like, if there's a scaling law that could happen, then even if it's different, or is it... is that too, like, you know, thinking by analogy?

    19. FC

      I think you are looking for approaches that scale.

    20. GT

      Yeah.

    21. FC

      Uh, I think it's, it's a non-starter. If you're working on something, but the only way to increase capabilities of the system is to have, uh, human engineers and researchers spend time on it, it will not work. 'Cause even if the idea is very clever and very elegant and works really well, capabilities are gonna be bounded. They're gonna be bounded by human investment.

    22. GT

      Mm.

    23. FC

      Right? You want to be in a setup where the system can improve its capabilities-

    24. GT

      Yeah

    25. FC

      ... with no human in the loop, with no human bottleneck.

    26. GT

      So you would say, like, don't just do it the way we did it, like, 10 years ago. Do it with the idea that recursive self-improvement is baked in at the beginning.

    27. FC

      Yeah. Yeah, not necessarily recursive self-improvement because deep learning, for instance, is not, is not recursively self-improving.

    28. GT

      Mm.

    29. FC

      But with the idea of scaling up with no human bottlenecks.

    30. GT

      Got it.

  20. 53:37–56:39

    How to Build a Breakout Open Source Project - Lessons From Keras

    1. FC

      the, the API simple and intuitive. There was this b- big focus on usability, and this was inspired by Scikit-learn. Like, Scikit-learn was sort of like the OG, um, machine learning library for Python, and what made it successful was that it was so easy to get started with it. So at first it was like, okay, uh, I'm gonna package, uh, all this functionality I've created under a really, really simple API. It's gonna be like the Scikit-learn API. That was, like, the big idea. The focus on usability is not just making sure the API is simple. It's also making sure the entire on- onboarding experience is nice and easy. Like, the docs should be very informative. You should... You know, the docs should be not just telling you about how to use this thing, but they should actually be teaching you about the domain in the first place because the, the folks who land on your website, they're not gonna be already deep learning experts. They're gonna be people looking to maybe start using deep learning, and so you, you have to teach them not just how to use the tool, but what the tool is good for, um, and, and the entire field around it. And then, uh, you know, you have to put a lot of investment into community building. Um, one thing we, uh, we did a bit, uh, at Google, in fact, you know, Google made it kind, kind of difficult and, and I was sad about that, is, uh, hire your power users.
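
      The kind of few-lines onboarding he is describing looks roughly like this standard minimal Keras workflow, shown here purely for illustration with synthetic data.

      ```python
      # A standard minimal Keras workflow: define, compile, fit.
      # Synthetic data, purely to illustrate the "simple API" point.
      import numpy as np
      from tensorflow import keras

      x = np.random.rand(256, 4).astype("float32")
      y = (x.sum(axis=1) > 2.0).astype("float32")   # toy binary labels

      model = keras.Sequential([
          keras.Input(shape=(4,)),
          keras.layers.Dense(8, activation="relu"),
          keras.layers.Dense(1, activation="sigmoid"),
      ])
      model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
      model.fit(x, y, epochs=5, batch_size=32, verbose=0)
      ```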

    2. GT

      Mm.

    3. FC

      Like, hire your fans. This, this is-

    4. GT

      Smart

    5. FC

      ... a really, really good idea.

    6. GT

      Yeah.

    7. FC

      Like, find, find the, the most enthusiastic users from your community, uh, and, and, and just hire them on your team.

    8. GT

      Amazing.

    9. FC

      Yeah. And, uh, th- this is, these are the always the best people, right?

    10. GT

      All right, time to start gstack.org.

    11. FC

      [laughs]

    12. GT

      Uh, put in a bunch of my own money, and then hire a bunch of people to work on it. That sounds good. I think you've been a leader and pioneer, and we're so lucky to have you sit with us. There are people watching who are at the beginning of their, you know, adulthood even, like their c- certainly their professional careers, uh, or actually like people just around the world, they're like trying to understand like what does this mean as intelligence becomes broadly applicable? Like, what would you tell... You know, if you were 18 right now, what would you tell them?

    13. FC

      Yeah. I mean, there's a lot of people today who are very, uh, pessimistic, very negative takes about the, the rise in AI capabilities. They say, "Oh, you know, uh, I'm gonna be out of a job soon, and there's gonna be mass unemployment. Uh, AI's just gonna take over completely." And my, my take is actually, you know, the more you know, the more expertise you have about things like programming, for instance, the better you're able to use and leverage these tools for your own benefit. And with the right kind of expertise, uh, all this AI progress is actually empowerment. Like, it's something that you can leverage for yourself. I mean, that's, that's exactly what you did with your project, right?

    14. GT

      Yeah.

    15. FC

      And yeah, more people should have this mindset of trying to learn as much as possible, not just about AI, uh, but about the, the domain that they want, uh, uh, uh, to apply AI to, right? So that they should, they should seek to

  21. 56:39–57:22

    Advice For How To Think About AI

    1. FC

      turn this, uh, uh, this, this new development into an opportunity, into, into a tool they can use for themselves to improve their own lives. I think that's, that's the right mindset because, you know, you're not gonna stop, uh, AI progress. I think, I think it's too late for that. And so the next question is, okay, like AI progress is here. Uh, it's actually gonna keep accelerating. How do you make use of it? How do you leverage? How do you ride the wave? That's the question to ask.

    2. GT

      I wish we could, uh, keep going for a couple hours, 'cause I'm sure we could. François, thank you so much for spending time with us.

    3. FC

      Thanks so much for having me. [outro music]

Episode duration: 57:23
