Francois Chollet — Why the biggest AI models can't solve simple puzzles

Dwarkesh Podcast · Jun 11, 2024 · 1h 34m

Francois Chollet (guest), Dwarkesh Patel (host), Mike Knoop (guest)

- Design and purpose of the ARC benchmark as an IQ test for machines
- Limitations of LLMs: memorization vs. true generalization and intelligence
- Core knowledge in humans vs. learned knowledge and its role in ARC
- Program synthesis, discrete search, and hybrid System 1 / System 2 architectures
- Details and goals of the $1M ARC Prize competition
- Impact of LLM hype and closed research practices on AGI progress
- Debate over scaling laws, multimodal models, and the path to AGI

In this episode of the Dwarkesh Podcast, Dwarkesh Patel speaks with Francois Chollet and Mike Knoop about why the biggest AI models can't solve simple puzzles.

ARC prize challenges LLM dominance, demands true machine intelligence progress

Francois Chollet explains the ARC (Abstraction and Reasoning Corpus) benchmark and a new $1M ARC Prize as a way to measure and drive progress toward genuine machine intelligence, not just larger language models.

ARC is designed like an IQ test: it requires only humans' core knowledge (objects, counting, basic physics, spatial patterns) and is intentionally resistant to memorization, which makes it very hard for current LLMs despite being easy for humans.

Chollet argues that today’s LLMs scale skill and memorized patterns, not true general intelligence, which he defines as the ability to rapidly acquire new skills and adapt to novel tasks from very little data.

He and co-sponsor Mike Knoop hope ARC will catalyze new architectures that merge deep learning with program synthesis / discrete search, and that open, reproducible solutions will reorient AI research away from closed, LLM-only paths.

Key Takeaways

ARC is explicitly built to defeat memorization-based AI.

Each ARC puzzle is a small, novel grid-based transformation task that cannot be solved by recalling seen examples; success requires synthesizing a new solution program from minimal demonstrations, much like human IQ tests.
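
To make that concrete, here is a minimal Python sketch of how an ARC task is stored and how a candidate solution program would be scored, assuming the public JSON layout from the ARC repository (a "train" list of demonstration pairs and a "test" list, each grid a list of lists of color integers 0–9). The solver and file path are illustrative, not from the episode.

```python
import json

def check_candidate(solver, task_path):
    """Score a candidate solution program against one ARC task.

    An ARC task is a JSON file with a few "train" demonstration pairs
    and one or more "test" pairs; each grid is a list of lists of
    integers 0-9 (colors). A solver should reproduce every training
    output exactly before its test predictions count for anything.
    """
    with open(task_path) as f:
        task = json.load(f)

    # The candidate must fit all demonstrations, not just most of them.
    fits_demos = all(
        solver(pair["input"]) == pair["output"]
        for pair in task["train"]
    )
    if not fits_demos:
        return 0.0

    test_pairs = task["test"]
    correct = sum(
        solver(pair["input"]) == pair["output"] for pair in test_pairs
    )
    return correct / len(test_pairs)

# Hypothetical solver for a task whose rule is "flip the grid upside down".
flip_vertical = lambda grid: grid[::-1]
# score = check_candidate(flip_vertical, "data/training/some_task.json")
```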

Current LLMs mainly scale skill and stored patterns, not intelligence.

Chollet distinguishes between expanding a model’s bank of solution templates (skill) and the ability to quickly learn entirely new tasks from sparse data (intelligence); scaling LLMs improves the former but shows little evidence of the latter on ARC.

True general intelligence hinges on rapid adaptation to novelty.

Because real-world tasks and environments constantly change, no system can be pre-trained on every possible situation; intelligence is therefore defined as efficiently acquiring new skills and handling out-of-distribution scenarios on the fly.

Hybrid architectures that combine deep learning with program search are likely necessary.

Deep learning excels at pattern recognition and intuition (System 1), while discrete program synthesis excels at data-efficient, explicit reasoning (System 2) but is computationally expensive; Chollet argues AGI will require fusing these strengths.
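
As a rough illustration of what discrete program search means here, the sketch below enumerates compositions of a toy grid DSL, shortest first, and returns the first program consistent with all demonstration pairs. The optional `prior` argument stands in for the System 1 side: a learned model that ranks candidates so the System 2 search explores promising programs first. The DSL and interface are invented for this example, not taken from any ARC entrant.

```python
from itertools import product

# A toy DSL of grid-to-grid primitives. Real ARC DSLs are far richer;
# these four are purely illustrative.
PRIMITIVES = {
    "identity":  lambda g: g,
    "flip_v":    lambda g: g[::-1],
    "flip_h":    lambda g: [row[::-1] for row in g],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}

def compose(names):
    """Turn a sequence of primitive names into one grid-to-grid program."""
    def program(grid):
        for name in names:
            grid = PRIMITIVES[name](grid)
        return grid
    return program

def search(train_pairs, max_depth=3, prior=None):
    """Enumerate primitive compositions and return the first one
    consistent with every demonstration pair.

    `prior` is an optional System-1 component: a callable scoring
    candidate programs (e.g. a neural model's log-likelihood) so the
    discrete System-2 search tries promising candidates first.
    """
    for depth in range(1, max_depth + 1):
        candidates = list(product(PRIMITIVES, repeat=depth))
        if prior is not None:
            candidates.sort(key=prior, reverse=True)
        for names in candidates:
            program = compose(names)
            if all(program(i) == o for i, o in train_pairs):
                return names
    return None

# demo = [([[1, 2], [3, 4]], [[3, 4], [1, 2]])]  # rule: flip vertically
# print(search(demo))  # -> ('flip_v',)
```

The brute-force enumeration is exactly why Chollet calls pure program search computationally expensive: the candidate space grows exponentially with depth, which is where a learned prior earns its keep.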

Test-time learning and ‘active inference’ are crucial missing ingredients.

Most LLM use is static inference over frozen weights; methods like Jack Cole’s ARC approach, which fine-tune or adapt the model per task at inference time, point toward architectures that genuinely learn during problem-solving.
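
The transcript does not detail Jack Cole's method, but the general shape of test-time fine-tuning can be sketched as follows: clone the frozen base model, take a few gradient steps on the current task's demonstration pairs, then decode the test prediction with the adapted weights. The serialization scheme, hyperparameters, and function names below are assumptions for illustration, written against the Hugging Face / PyTorch APIs.

```python
import copy
import torch

def solve_with_test_time_tuning(base_model, tokenizer, task, steps=50, lr=1e-5):
    """Illustrative test-time fine-tuning: adapt a copy of the model to
    THIS task's demonstrations at inference time, so the system learns
    during problem-solving rather than running static inference.
    """
    model = copy.deepcopy(base_model)   # leave the base weights frozen
    model.train()
    opt = torch.optim.AdamW(model.parameters(), lr=lr)

    # Serialize demonstration pairs as text (one hypothetical scheme).
    demos = [
        f"input: {pair['input']} output: {pair['output']}"
        for pair in task["train"]
    ]

    for _ in range(steps):
        for text in demos:
            batch = tokenizer(text, return_tensors="pt")
            out = model(**batch, labels=batch["input_ids"])
            out.loss.backward()
            opt.step()
            opt.zero_grad()

    # Greedy-decode the test prediction with the adapted weights.
    model.eval()
    prompt = tokenizer(
        f"input: {task['test'][0]['input']} output:", return_tensors="pt"
    )
    with torch.no_grad():
        pred = model.generate(**prompt, max_new_tokens=256)
    return tokenizer.decode(pred[0], skip_special_tokens=True)
```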

The $1M ARC Prize is designed to force openness and new ideas.

The prize requires winning solutions and papers to be open source/public domain, is run annually with progress prizes, and uses constrained compute to encourage algorithmic innovation rather than brute-force scaling or data leakage.

LLM hype and closed frontier research may be slowing AGI-relevant progress.

Chollet contends that OpenAI’s shift to secrecy and the field’s narrow focus on LLM scaling have both reduced exploration of alternative architectures, potentially delaying AGI progress by years.

Notable Quotes

If you scale up the size of your database, you are not increasing the intelligence of the system one bit.

Francois Chollet

ARC is intended as a kind of IQ test for machine intelligence… it’s designed to be resistant to memorization.

Francois Chollet

General intelligence is not task-specific skill scaled up to many skills.

Francois Chollet

Intelligence is what you use when you don’t know what to do.

Francois Chollet (quoting Jean Piaget)

OpenAI basically set back progress towards AGI by probably, like, five to ten years.

Francois Chollet

Questions Answered in This Episode

If a frontier multimodal model reaches human-level performance on ARC, what specific evidence would convince skeptics that it’s demonstrating true generalization rather than clever memorization?

How could we rigorously distinguish between ‘program fetching’ and genuine on-the-fly program synthesis in neural models?

What would a practical hybrid System 1/System 2 architecture look like in deployed products, and how would it change the way we develop software or use AI tools?

To what extent can human core knowledge be learned from scratch by a general learner, versus needing to be hardcoded or architecturally baked in?

If ARC remains unsolved for many years despite large prizes, what should that force us to reconsider about current deep learning paradigms and our assumptions about scaling laws?

Transcript Preview

Francois Chollet

LLMs are very good at memorizing static programs. If you scale up the size of your database, you are not increasing the intelligence of the system one bit.

Dwarkesh Patel

I feel like you're using words like memorization, which we would never use for human children if they can just solve any arbitrary algebraic problem. You wouldn't say they've memorized algebra. You'd say they've learned algebra.

Mike Knoop

So we've got a million-dollar prize pool, and there's a $500,000 prize for the first team that can get to the 85% benchmark. If ARC survives three months from here, we'll up the prize.

Francois Chollet

OpenAI basically set back progress towards AGI by probably five to ten years. They caused this complete closing down of frontier research publishing. And now LLMs have sucked the oxygen out of the room. Everyone is just doing LLMs.

Dwarkesh Patel

Okay. Today I have the pleasure of speaking with Francois Chollet, who is an AI researcher at Google and the creator of Keras. He's launching a prize in collaboration with Mike Knoop, the co-founder of Zapier, who we'll also be talking to in a second: a million-dollar prize to solve the ARC benchmark that he created. So, first question: what is the ARC benchmark, and why do you even need this prize? Why won't the biggest LLM we have in a year be able to just saturate it?

Francois Chollet

Sure. So ARC is intended as a kind of IQ test for machine intelligence. And what makes it different from most LLM benchmarks out there is that it's designed to be resistant to memorization. If you look at the way LLMs work, they are basically this big interpolative memory, and the way you scale up their capabilities is by trying to cram as much knowledge and as many patterns as possible into them. By contrast, ARC does not require a lot of knowledge at all. It's designed to only require what's known as core knowledge, which is basic knowledge about things like elementary physics, objectness, counting, that sort of thing. The sort of knowledge that any four-year-old or five-year-old possesses, right? But what's interesting is that each puzzle in ARC is novel. It's something that you've probably not encountered before, even if you've memorized the entire internet. That's what makes ARC challenging for LLMs, and so far LLMs have not been doing very well on it. In fact, the approaches that are working well are more towards discrete program search, program synthesis.

Dwarkesh Patel

Mm. So first of all, I'll make a comment that I'm glad that, as a skeptic of LLMs, you have yourself put out a benchmark. Is it accurate to say that if the biggest model we have in a year is able to get 80% on this, then your view would be that we are on track to AGI with LLMs? How would you think about that?
