This Startup Beat Gemini 3 on ARC-AGI — at Half the Cost

Poetiq is a new startup founded by former DeepMind researchers that recently achieved a major jump on the ARC-AGI benchmark by layering a recursive self-improvement system on top of Gemini 3. In this conversation at NeurIPS, YC's Francois Chaubaurd sat down with Poetiq co-founder Ian Fisher to find out how they're increasing performance using prompts and system design alone. They also explore recursive self-improvement, benchmarking progress toward AGI, and why automating prompt engineering may be one of the most powerful levers in AI today.

Chapters
00:11 - Introducing Poetiq and the ARC-AGI Breakthrough
00:49 - How Big Is the Performance Jump?
01:18 - Ian Fisher’s Background: YC, Google, DeepMind
02:00 - Recursive Self-Improvement Explained
03:00 - Why Poetiq Targeted ARC-AGI
03:58 - Improving Models Without Access to Weights
04:26 - Ensembles, Voting, and System-Level Optimization
05:30 - Why Gemini 3 Changed Everything
06:21 - What’s Next: Benchmarks, Research, and Customers
07:14 - Is Recursive Self-Improvement a Path to AGI?
08:46 - When to Stop Hill-Climbing
09:16 - Automating Prompt Engineers and Agents

Francois Chaubaurd (host) · Ian Fisher (guest)

Jan 29, 2026 · 11m

EVERY SPOKEN WORD

10 min read · 2,281 words

0:00 – 0:11
Intro
1. FCFrancois Chaubaurd
  [upbeat music]
0:11 – 0:49
Introducing Poetiq and the ARC-AGI Breakthrough
1. FCFrancois Chaubaurd
  My name is Francois. I'm a visiting partner here at Y Combinator. We're here with Ian at NeurIPS, uh, to learn a little bit about Poetiq and your background and your big, uh, announcement.
2. IFIan Fisher
  Great.
3. FCFrancois Chaubaurd
  Uh, maybe introduce yourself.
4. IFIan Fisher
  Yeah. Uh, I'm Ian Fisher. I'm co-founder, co-CEO of, of Poetiq. Poetiq's a new company we just started, like, back in June, mostly ex-DeepMind folks. We just announced, uh, a pretty exciting result where with Poetiq on top of Gemini 3, we have, uh, 54% on the ARC 2 private test set evaluation, which is, you know, uh, a very, very exciting, uh, increase over the previous
0:49 – 1:18
How Big Is the Performance Jump?
1. IFIan Fisher
  state-of-the-art.
2. FCFrancois Chaubaurd
  How much is that over Gemini 3?
3. IFIan Fisher
  Yeah. So, uh, Gemini 3, I think, uh, whoa, don't quote me on this, and that's a weird thing to say in front of a camera-
4. FCFrancois Chaubaurd
  [laughs]
5. IFIan Fisher
  But, uh, I think it was, uh, they were at, like, 33%, uh, uh, 31%.
6. FCFrancois Chaubaurd
  So you got, like, a 17% bump?
7. IFIan Fisher
  Uh, yeah, but the, the, the more fair comparison is, uh, Gemini 3 DeepThink-
8. FCFrancois Chaubaurd
  Mm-hmm
9. IFIan Fisher
  ... which, uh, got 45%, but it costs twice as much as Poetiq.
10. FCFrancois Chaubaurd
  Oh, right, right, right. I see, I see.
11. IFIan Fisher
  So yeah, 9, 10 percentage points better and, uh, half the cost.
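The figures quoted in this exchange can be compared either in percentage points or as a relative change, and the two are easy to mix up. The following is a minimal sketch of that arithmetic using only the scores as stated in the conversation (the speaker himself hedges the Gemini 3 number); the function and constant names are illustrative, not anything from Poetiq.

```python
# Scores as quoted in the conversation; the Gemini 3 baseline is hedged
# by the speaker ("don't quote me on this"), so treat these as approximate.
POETIQ = 54.0                    # Poetiq on top of Gemini 3 (percent)
GEMINI_3_QUOTED = (31.0, 33.0)   # the two baseline figures mentioned
DEEP_THINK = 45.0                # Gemini 3 Deep Think (percent)


def point_gain(new: float, old: float) -> float:
    """Absolute difference in percentage points."""
    return new - old


def relative_gain(new: float, old: float) -> float:
    """Relative improvement over the old score, in percent."""
    return (new - old) * 100.0 / old


if __name__ == "__main__":
    print("vs Deep Think:", point_gain(POETIQ, DEEP_THINK), "points,",
          round(relative_gain(POETIQ, DEEP_THINK), 1), "% relative")
    for base in GEMINI_3_QUOTED:
        print(f"vs Gemini 3 at {base}%:", point_gain(POETIQ, base), "points,",
              round(relative_gain(POETIQ, base), 1), "% relative")
```

Keeping the two measures separate makes it clear that the "9, 10 percentage points" figure is an absolute difference against Deep Think, not a relative one.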
12. FCFrancois Chaubaurd
  Remind me, uh, your background.
1:18 – 2:00
Ian Fisher’s Background: YC, Google, DeepMind
1. IFIan Fisher
  Uh, yeah. So this is, uh, Poetiq's actually my third company. Second company was a YC company, uh, called Apportable. Uh, we sold that to Google in 2015, and, uh, when I joined Google, I, uh, realized I really wanted to be doing machine learning research. It turns out that was a really good place to be doing machine learning research, so I switched into Google Research, uh, and just did fundamental research for, for a while, but then LLMs came along. It was clearly the most important thing happening, so refocused, uh, my research direction. This led to the genesis of Poetiq. Uh, I realized there was this, um, uh, there was a much faster and cheaper way to do recursive self-improvement, where the AI is
2:00 – 3:00
Recursive Self-Improvement Explained
1. IFIan Fisher
  making itself smarter, and of course, y- you know, many people are going after this. Uh, there's a l- a lot of competition in this space, both from the major labs and from other startups like Poetiq, which I think is great, right? You know, who, who knows what the actual right answer will be? But, um, uh, you know, recursive self-improvement is kind of the holy grail of AI.
2. FCFrancois Chaubaurd
  Yeah.
3. IFIan Fisher
  If we can get the models to just make themselves better, then we, you know, we can sit back and relax. Uh, you know, of course, there are differing opinions there about whether or not we should want that.
4. FCFrancois Chaubaurd
  Mm-hmm.
5. IFIan Fisher
  Poetiq obviously wants to do this safely. Um, uh, I think, you know, most, most people want to do recursive self-improvement safely, so we have a particular perspective there as well.
6. FCFrancois Chaubaurd
  Tell me about, um, I guess, the story of, like, you, you targeted ARC-AGI, you're running it, Gemini 3 comes out, you're running this procedure on top of it, and you're seeing it hill climb. Like, what are your thoughts? Did you expect it to be as good as it was, like it was fully in expectation, or this was, like, beat expectation, um, and then when you finally got the results, you're like, "Wow, this is cool"?
3:00 – 3:58
Why Poetiq Targeted ARC-AGI
1. IFIan Fisher
  Yeah. It's, uh, you know, it was really interesting. We were actually really focused on ARC 1. Uh, we weren't paying that much attention to ARC-AGI 2. We, we, like, ran our models on ARC 2, uh, just to make certain, you know, it was, like, reasonable.
2. FCFrancois Chaubaurd
  With different, uh, API models, right?
3. IFIan Fisher
  Yeah, yeah, with d- different API providers. Um, but, uh, we, you know, we were getting very exciting results on ARC 1, and we figured, you know, it's like y- it's easier, we'll, like, start with that. ARC 2 seems really hard. We were in a really good position. You know, I don't, I don't want to, um, I don't want to, like, overclaim. You know, I, I think what Poetiq's done is, like, very good, but Gemini 3 came out. It was, it's a really quite astonishingly good model.
4. FCFrancois Chaubaurd
  Yeah.
5. IFIan Fisher
  So a little bit, a little bit of technical background. The recursive self-improvement loop, what it does is, like, we run it on other tasks that we can evaluate. Uh, so the, the, our system is improving itself by improving other systems, right?
6. FCFrancois Chaubaurd
  And you, and you don't have access to the weights, so the only thing-
3:58 – 4:26
Improving Models Without Access to Weights
1. IFIan Fisher
  Right, exactly
2. FCFrancois Chaubaurd
  ... you can really, the only thing in your action space to change is the prompt itself.
3. IFIan Fisher
  It's the prompt and the system around the prompt. Like, so, you know, where the, the system that we are using, it, you know, it's like an ensemble, um, that calls, uh, you know, the underlying model, in this case Gemini 3, um, in, at multiple times to refine. Each ensemble member's independent and is refining its own answer, and then they, we combine them with some voting scheme that works well.
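The structure described in this turn (independent ensemble members, each calling the underlying model several times to refine its own answer, then a vote) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not Poetiq's system: the prompt wording, the `refine` and `ensemble_solve` helpers, the stub models, and the use of simple majority voting are all hypothetical, since the conversation does not say which voting scheme is used.

```python
from collections import Counter
from typing import Callable, List

# A "model" here is any function from prompt text to answer text.
# In practice this would wrap an API call; that wrapper is not shown.
ModelFn = Callable[[str], str]


def refine(model: ModelFn, task: str, rounds: int) -> str:
    """One ensemble member: draft an answer, then revise it `rounds` times."""
    answer = model(f"Task: {task}\nAnswer:")
    for _ in range(rounds):
        answer = model(
            f"Task: {task}\nPrevious answer: {answer}\nImprove it:"
        )
    return answer


def ensemble_solve(members: List[ModelFn], task: str, rounds: int = 2) -> str:
    """Run each member independently, then combine by majority vote
    (an assumed scheme; ties go to the answer seen first)."""
    answers = [refine(member, task, rounds) for member in members]
    return Counter(answers).most_common(1)[0][0]


def make_stub(final_answer: str) -> ModelFn:
    """Hypothetical stand-in for a real model, so the sketch runs offline."""
    def stub(prompt: str) -> str:
        return final_answer if "Previous answer" in prompt else "draft"
    return stub


if __name__ == "__main__":
    members = [make_stub("4"), make_stub("4"), make_stub("5")]
    print(ensemble_solve(members, "What is 2 + 2?"))
```

Because each member only sees its own previous answer, the members stay independent until the final vote, which is the property the speaker emphasizes.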
4. FCFrancois Chaubaurd
  And there was some DSPy stuff
4:26 – 5:30
Ensembles, Voting, and System-Level Optimization
1. FCFrancois Chaubaurd
  that was similar w- way back when that I've tried, and I've not really seen it be super great. Um-
2. IFIan Fisher
  Right
3. FCFrancois Chaubaurd
  ... and you guys are, you know, like, in the same spirit, but-
4. IFIan Fisher
  Yeah, yeah, yeah
5. FCFrancois Chaubaurd
  ... meaningfully better.
6. IFIan Fisher
  Yeah, so DS- uh, DSPy is a very cool project. Uh, and, uh, I, you know, I wish, I wish I could hire the, the people who made it.
7. FCFrancois Chaubaurd
  [laughs]
8. IFIan Fisher
  Uh, if you're watching and you're thinking about, like, leaving your current job-
9. FCFrancois Chaubaurd
  You have a job offer coming [laughs]
10. IFIan Fisher
  Yeah. Uh, but, uh, I, I think, you know, there, there's, uh, some, uh, you know, trade secret insights that, that we have that go a little bit beyond, um, that, and, uh, it seemed to make a big difference.
11. FCFrancois Chaubaurd
  Right.
12. IFIan Fisher
  So basically, the system out, is an output of our system. The, the, the ARC-AGI solver is an output of, uh, of our system. Uh, and it was really designed and, and, and trained on ARC 1, so we never trained at all on ARC 2. So when Gemini 3 came out, uh, we saw this big, uh, jump in performance also on ARC 1, relatively large. We were at, like, 89% with other models, and then we got to 95% with Gemini 3 on ARC 1. And of course, we had to try it on ARC 2, and we saw a, like,
5:30 – 6:21
Why Gemini 3 Changed Everything
1. IFIan Fisher
  you know, kind of holy cow moment of, like, this is amazingly good.
2. FCFrancois Chaubaurd
  Mm-hmm.
3. IFIan Fisher
  Um, and the, you know, I think that's the thing driving the performance improvement there is the Google team has done some, somehow in this particular model, they've done a really good job at, uh, having a model that is good at coding, writing code for, like, visual problem solving-
4. FCFrancois Chaubaurd
  Mm
5. IFIan Fisher
  ... better than, you know, uh, kind of all the previous models that had been out.
6. FCFrancois Chaubaurd
  Yeah.
7. IFIan Fisher
  Um, of course, Opus, uh, 4.5, 4.5 came out from Anthropic, um, uh, you know, similar, you know, pretty quickly thereafter, and, uh, it, it, it's ... Quality seems to be pretty similar, uh, to Gemini 3. It's, it's more expensive. What we saw is, like, we could just replace Gemini 3 with Opus and get, uh, you know, similar results.
8. FCFrancois Chaubaurd
  I guess, what's next for you guys?
9. IFIan Fisher
  Yeah.
10. FCFrancois Chaubaurd
  Other benchmarks?
6:21 – 7:14
What’s Next: Benchmarks, Research, and Customers
1. FCFrancois Chaubaurd
  You wanna go, like, more benchmarks, proving more stuff out, uh, productizing other ideas, more research, all the above?
2. IFIan Fisher
  All the above. Yeah, yeah.
3. FCFrancois Chaubaurd
  [laughs]
4. IFIan Fisher
  Yeah, so we have some more benchmarks in mind that we think, uh, are, you know, really high-impact benchmarks that we might be able to make, uh, you know, an interesting dent on. Um, we'll ... I won't say which ones so that, uh, not everybody's, like, um, jumping in front of us, but, uh, uh, you know, you can probably guess at what some of them would be.
5. FCFrancois Chaubaurd
  How, how big is Poetiq?
6. IFIan Fisher
  Oh, yeah. Poetiq is, uh, currently six people.
7. FCFrancois Chaubaurd
  Wow.
8. IFIan Fisher
  We have our-
9. FCFrancois Chaubaurd
  Six people, and you're state-of-the-art.
10. IFIan Fisher
  Yeah.
11. FCFrancois Chaubaurd
  That's pretty impressive.
12. IFIan Fisher
  Um, yeah. They ... I- I mean, I'm really honored to be working with the team. They're ... Everybody is fantastic.
13. FCFrancois Chaubaurd
  Yeah.
14. IFIan Fisher
  Um, uh, we have a seventh person joining who is also fantastic starting January, so, um, yeah.
15. FCFrancois Chaubaurd
  And the DSPy team coming soon.
16. IFIan Fisher
  Yeah, yeah. [laughs]
17. FCFrancois Chaubaurd
  [laughs] Um, I mean, do you think that ... Ob- obviously ARC-AGI,
7:14 – 8:46
Is Recursive Self-Improvement a Path to AGI?
1. FCFrancois Chaubaurd
  um, AGI is in the name, and so do you think that, uh, RSI, recursive self-improvement, is a path to AGI? Or do you think that this is just like ... It just gives you a nice bump. It's like dropout. You don't do dropout. You do dropout. You just get, like, a nice 3, 4% bump.
2. IFIan Fisher
  Yeah. That's a, a ... It's a really nice way of, of putting it. Like, I- I- I think that both things are true, right? Like, you want that, uh, that bump from doing this because, uh, you know, uh, as we showed in our, um, initial blog post, well, it's a little bit, it's a little bit of a hack. I, I don't, again, don't wanna overclaim things here, but on ARC-AGI, because they allow you to present two solutions, uh, that allowed us to actually outperform the underlying models while being cheaper. We, we, we only provided one solution, but because of the bump in performance, we were able to still do better than when the underlying model was providing two solutions, right?
3. FCFrancois Chaubaurd
  Mm-hmm.
4. IFIan Fisher
  So in general, if you're only allowed one response, Poetiq will always be more expensive-
5. FCFrancois Chaubaurd
  Mm-hmm
6. IFIan Fisher
  ... um, uh, or at least, uh, at least the same price, right? But, uh, if you're allowed, you know, if you're dealing with multiple response settings, then Poetiq could be cheaper, but it, it should always be better. Uh, and so you always want that bump. But then coming back to the original question, uh, does this lead to AGI? I mean, I don't believe it's the only path, but I believe it's, like, you know, the most exciting ... In my mind, it's the most exciting path, and it is a path to AGI and beyond.
7. FCFrancois Chaubaurd
  Um, did you actually stop it from hill climbing and say it's, it's good enough or did it actually plateau?
8. IFIan Fisher
  I stopped it. It, it's, uh ... Yeah.
8:46 – 9:16
When to Stop Hill-Climbing
1. IFIan Fisher
  It, uh ... This ARC-AGI was fairly expensive to run the hill climbing on.
2. FCFrancois Chaubaurd
  Okay, so you need money.
3. IFIan Fisher
  Yeah. Yeah. We-
4. FCFrancois Chaubaurd
  But then, then you could have gotten even better. [laughs]
5. IFIan Fisher
  Right. Uh-
6. FCFrancois Chaubaurd
  We can solve that in the world.
7. IFIan Fisher
  Yeah, yeah. [laughs]
8. FCFrancois Chaubaurd
  [laughs] We know how to do that.
9. IFIan Fisher
  Yeah. If anybody has any money who's listening. Uh, but yeah, you know, we want to service our customers, right? And we can't be, like, w- you know, w- out of money when we need to run experiments for our customers, so yeah.
10. FCFrancois Chaubaurd
  What else are you more, most excited about, uh, coming up in the future for you guys?
11. IFIan Fisher
  You know, there's the benchmarks, but, uh, yeah,
9:16 – 11:22
Automating Prompt Engineers and Agents
1. IFIan Fisher
  we're starting to have conversations with, with customers, uh, around how we can help them. Uh, we're very excited about that. You know, this is a company that's doing research, uh, but we always intended it to be a company that makes a real difference in the market, right?
2. FCFrancois Chaubaurd
  Mm-hmm.
3. IFIan Fisher
  Like, we want to solve important problems for actual businesses, uh, along the way to, uh ... You know, while we run our recursive self-improvement.
4. FCFrancois Chaubaurd
  Right. Yeah. I mean, I just see it so obvious because, like, in the action space of things that if you believe Sam that the models are only going to get better-
5. IFIan Fisher
  Mm-hmm
6. FCFrancois Chaubaurd
  ... and you should, and you want to use them, the only thing in your action space to c- you know, uh, condition the model on what you want it to do is the prompt, and the ... And it's just prompt engineering.
7. IFIan Fisher
  Right.
8. FCFrancois Chaubaurd
  Right? And just, like, try stuff really is the answer. And then you have some evals, and you're just trying stuff and then testing the evals, and it feels like we're back to, like, feature engineering-
9. IFIan Fisher
  Right
10. FCFrancois Chaubaurd
  ... and just, like, HOG, SIFT, SURF descriptors again-
11. IFIan Fisher
  [laughs]
12. FCFrancois Chaubaurd
  ... like back in the day. [laughs] And, like, but that's not, you know, clearly not the answer. Like, get ... The whole thing of deep learning since 2012 is get yourself out of the way-
13. IFIan Fisher
  Yes
14. FCFrancois Chaubaurd
  ... out of the loop.
15. IFIan Fisher
  Absolutely. Yeah.
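The "try stuff, then test against the evals" loop described in the preceding turns is, in its simplest form, hill climbing over prompts. Below is a minimal sketch of that idea, assuming a hypothetical `mutate` step that proposes a new prompt and a hypothetical `score` function standing in for an eval suite; in a real setup both would likely involve model calls, and nothing here reflects Poetiq's actual (undisclosed) method.

```python
import random
from typing import Callable, Tuple

# Toy search space: a fixed pool of instruction snippets (purely illustrative).
HINTS = ["Think step by step.", "Return only the grid.", "Check your answer."]


def toy_mutate(prompt: str, rng: random.Random) -> str:
    """Propose a candidate by appending one hint the prompt lacks."""
    missing = [h for h in HINTS if h not in prompt]
    if not missing:
        return prompt
    return f"{prompt} {rng.choice(missing)}"


def toy_score(prompt: str) -> float:
    """Stand-in for running evals: fraction of hints present in the prompt."""
    return sum(h in prompt for h in HINTS) / len(HINTS)


def hill_climb(
    seed_prompt: str,
    mutate: Callable[[str, random.Random], str],
    score: Callable[[str], float],
    steps: int,
    rng: random.Random,
) -> Tuple[str, float]:
    """Keep a candidate only if it scores strictly better than the best so far."""
    best, best_score = seed_prompt, score(seed_prompt)
    for _ in range(steps):
        candidate = mutate(best, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


if __name__ == "__main__":
    prompt, value = hill_climb(
        "Solve the puzzle.", toy_mutate, toy_score, steps=10,
        rng=random.Random(0),
    )
    print(value, "|", prompt)
```

The `steps` budget is the knob that corresponds to stopping the climb early for cost reasons rather than waiting for a plateau.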
16. FCFrancois Chaubaurd
  So it makes a lot, a ton of sense. I'm really excited for you guys.
17. IFIan Fisher
  Yeah, yeah. I mean, the way, the way you're putting it is, is really nice. The, like, uh ... You know, this, um, relates back to research that we were doing at, at DeepMind before we left, where, um, we were building systems like what Poetiq can build, but we were doing it manually. And so the Poetiq technology is completely different from that research that we did in that, uh, uh, you know, that was like we put together a car by hand, right?
18. FCFrancois Chaubaurd
  Mm-hmm.
19. IFIan Fisher
  Uh, and now we've, like, built a factory to build cars, which is something completely different. But, you know, we were in ... You know, we are quite intentionally automating ourselves. Automating prompt engineers, automating people who are building agents. It's a power tool, right?
20. FCFrancois Chaubaurd
  Yeah, yeah.
21. IFIan Fisher
  Um ...
22. FCFrancois Chaubaurd
  Well, I'm really excited for you. Thanks for joining us.
23. IFIan Fisher
  Yeah. Thanks so much. Thanks. [outro music]

Episode duration: 11:23

Transcript of episode OLEjyBLo8sQ