YC Root Access

This Startup Beat Gemini 3 on ARC-AGI — at Half the Cost

Poetiq is a new startup founded by former DeepMind researchers that recently achieved a major jump on the ARC-AGI benchmark by layering a recursive self-improvement system on top of Gemini 3. In this conversation at NeurIPS, YC's Francois Chaubaurd sat down with Poetiq co-founder Ian Fisher to find out how they're increasing performance using prompts and system design alone. They also explore recursive self-improvement, benchmarking progress toward AGI, and why automating prompt engineering may be one of the most powerful levers in AI today.

Chapters
00:11 - Introducing Poetiq and the ARC-AGI Breakthrough
00:49 - How Big Is the Performance Jump?
01:18 - Ian Fisher’s Background: YC, Google, DeepMind
02:00 - Recursive Self-Improvement Explained
03:00 - Why Poetiq Targeted ARC-AGI
03:58 - Improving Models Without Access to Weights
04:26 - Ensembles, Voting, and System-Level Optimization
05:30 - Why Gemini 3 Changed Everything
06:21 - What’s Next: Benchmarks, Research, and Customers
07:14 - Is Recursive Self-Improvement a Path to AGI?
08:46 - When to Stop Hill-Climbing
09:16 - Automating Prompt Engineers and Agents

Francois Chaubaurd (host), Ian Fisher (guest)
Jan 29, 2026 · 11m

EVERY SPOKEN WORD

  1. 0:00-0:11

    Intro

    1. FC

      [upbeat music]

  2. 0:11-0:49

    Introducing Poetiq and the ARC-AGI Breakthrough

    1. FC

      My name is Francois. I'm a visiting partner here at Y Combinator. We're here with Ian at NeurIPS, uh, to learn a little bit about Poetiq and your background and your big, uh, announcement.

    2. IF

      Great.

    3. FC

      Uh, maybe introduce yourself.

    4. IF

      Yeah. Uh, I'm Ian Fisher. I'm co-founder, co-CEO of, of Poetiq. Poetiq's a new company we just started, like, back in June, mostly ex-DeepMind folks. We just announced, uh, a pretty exciting result where with Poetiq on top of Gemini 3, we have, uh, 54% on the ARC 2 private test set evaluation, which is, you know, uh, a very, very exciting, uh, increase over the previous

  3. 0:49-1:18

    How Big Is the Performance Jump?

    1. IF

      state-of-the-art.

    2. FC

      How much is that over Gemini 3?

    3. IF

      Yeah. So, uh, Gemini 3, I think, uh, whoa, don't quote me on this, and that's a weird thing to say in front of a camera-

    4. FC

      [laughs]

    5. IF

      But, uh, I think it was, uh, they were at, like, 33%, uh, uh, 31%.

    6. FC

      So you got, like, a 17% bump?

    7. IF

      Uh, yeah, but the, the, the more fair comparison is, uh, Gemini 3 Deep Think-

    8. FC

      Mm-hmm

    9. IF

      ... which, uh, got 45%, but it costs twice as much as Poetiq.

    10. FC

      Oh, right, right, right. I see, I see.

    11. IF

      So yeah, 9, 10 percentage points better and, uh, half the cost.
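The figures quoted in this exchange can be checked with a few lines of arithmetic. This is only a sketch using the numbers as spoken; the Gemini 3 baseline is explicitly from memory ("don't quote me on this"), so it is kept as a range, and the variable names are illustrative.

```python
# Scores as quoted in the conversation (percent of tasks solved).
poetiq = 54.0                          # Poetiq on top of Gemini 3
deep_think = 45.0                      # Gemini 3 Deep Think
gemini_low, gemini_high = 31.0, 33.0   # from-memory range for plain Gemini 3

# Absolute gaps are in percentage points, not percent.
gap_vs_deep_think = poetiq - deep_think
gap_vs_gemini = (poetiq - gemini_high, poetiq - gemini_low)

# Relative improvement over Deep Think, as a fraction of its score.
relative_gain = gap_vs_deep_think / deep_think

print(gap_vs_deep_think)   # 9.0
print(gap_vs_gemini)       # (21.0, 23.0)
print(relative_gain)       # 0.2
```

On these numbers the gap to Deep Think is 9 percentage points, a 20% relative gain, at the stated half cost.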

    12. FC

      Remind me, uh, your background.

  4. 1:18-2:00

    Ian Fisher’s Background: YC, Google, DeepMind

    1. IF

      Uh, yeah. So this is, uh, Poetiq's actually my third company. Second company was a YC company, uh, called Apportable. Uh, we sold that to Google in 2015, and, uh, when I joined Google, I, uh, realized I really wanted to be doing machine learning research. It turns out that was a really good place to be doing machine learning research, so I switched into Google Research, uh, and just did fundamental research for, for a while, but then LLMs came along. It was clearly the most important thing happening, so refocused, uh, my research direction. This led to the genesis of Poetiq. Uh, I realized there was this, um, uh, there was a much faster and cheaper way to do recursive self-improvement, where the AI is

  5. 2:00-3:00

    Recursive Self-Improvement Explained

    1. IF

      making itself smarter, and of course, y- you know, many people are going after this. Uh, there's a l- a lot of competition in this space, both from the major labs and from other startups like Poetiq, which I think is great, right? You know, who, who knows what the actual right answer will be? But, um, uh, you know, recursive self-improvement is kind of the holy grail of AI.

    2. FC

      Yeah.

    3. IF

      If we can get the models to just make themselves better, then we, you know, we can sit back and relax. Uh, you know, of course, there are differing opinions there about whether or not we should want that.

    4. FC

      Mm-hmm.

    5. IF

      Poetiq obviously wants to do this safely. Um, uh, I think, you know, most, most people want to do recursive self-improvement safely, so we have a particular perspective there as well.

    6. FC

      Tell me about, um, I guess, the story of, like, you, you targeted ARC-AGI, you're running it, Gemini 3 comes out, you're running this procedure on top of it, and you're seeing it hill climb. Like, what are your thoughts? Did you expect it to be as good as it was, like it was fully in expectation, or this was, like, beat expectation, um, and then when you finally got the results, you're like, "Wow, this is cool"?

  6. 3:00-3:58

    Why Poetiq Targeted ARC-AGI

    1. IF

      Yeah. It's, uh, you know, it was really interesting. We were actually really focused on ARC 1. Uh, we weren't paying that much attention to ARC-AGI 2. We, we, like, ran our models on ARC 2, uh, just to make certain, you know, it was, like, reasonable.

    2. FC

      With different, uh, API models, right?

    3. IF

      Yeah, yeah, with d- different API providers. Um, but, uh, we, you know, we were getting very exciting results on ARC 1, and we figured, you know, it's like y- it's easier, we'll, like, start with that. ARC 2 seems really hard. We were in a really good position. You know, I don't, I don't want to, um, I don't want to, like, overclaim. You know, I, I think what Poetiq's done is, like, very good, but Gemini 3 came out. It was, it's a really quite astonishingly good model.

    4. FC

      Yeah.

    5. IF

      So a little bit, a little bit of technical background. The recursive self-improvement loop, what it does is, like, we run it on other tasks that we can evaluate. Uh, so the, the, our system is improving itself by improving other systems, right?

    6. FC

      And you, and you don't have access to the weights, so the only thing-

  7. 3:58-4:26

    Improving Models Without Access to Weights

    1. IF

      Right, exactly

    2. FC

      ... you can really, the only thing in your action space to change is the prompt itself.

    3. IF

      It's the prompt and the system around the prompt. Like, so, you know, where the, the system that we are using, it, you know, it's like an ensemble, um, that calls, uh, you know, the underlying model, in this case Gemini 3, um, in, at multiple times to refine. Each ensemble member's independent and is refining its own answer, and then they, we combine them with some voting scheme that works well.
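The shape Ian describes here (independent ensemble members, each calling the model several times to refine its own answer, then a vote) can be sketched roughly as follows. This is not Poetiq's implementation, which is not public: `toy_model` is a hypothetical stand-in for the underlying model call, and plain plurality voting is an assumption, since the conversation only says "some voting scheme that works well".

```python
import random
from collections import Counter
from typing import Callable, List

# Hypothetical signature for a model call: (task, previous answer, rng) -> answer.
ModelFn = Callable[[str, str, random.Random], str]


def toy_model(task: str, previous: str, rng: random.Random) -> str:
    """Pretend model: returns "42" about 60% of the time, otherwise a near miss."""
    return "42" if rng.random() < 0.6 else rng.choice(["41", "43"])


def run_member(model: ModelFn, task: str, rounds: int, seed: int) -> str:
    """One ensemble member refines its own answer over several model calls."""
    rng = random.Random(seed)
    answer = ""
    for _ in range(rounds):
        answer = model(task, answer, rng)  # each call sees the prior answer
    return answer


def majority_vote(answers: List[str]) -> str:
    """Plurality vote; ties go to the answer that appeared first."""
    return Counter(answers).most_common(1)[0][0]


def solve(model: ModelFn, task: str, members: int = 5, rounds: int = 3) -> str:
    """Run independent members, then combine their final answers by voting."""
    answers = [run_member(model, task, rounds, seed) for seed in range(members)]
    return majority_vote(answers)


print(solve(toy_model, "toy task"))
```

Members share no state, so they could run in parallel. The total number of model calls is `members * rounds`, which is where the cost of such a system comes from.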

    4. FC

      And there was some DSPy stuff

  8. 4:26-5:30

    Ensembles, Voting, and System-Level Optimization

    1. FC

      that was similar w- way back when that I've tried, and I've not really seen it be super great. Um-

    2. IF

      Right

    3. FC

      ... and you guys are, you know, like, in the same spirit, but-

    4. IF

      Yeah, yeah, yeah

    5. FC

      ... meaningfully better.

    6. IF

      Yeah, so DS- uh, DSPy is a very cool project. Uh, and, uh, I, you know, I wish, I wish I could hire the, the people who made it.

    7. FC

      [laughs]

    8. IF

      Uh, if you're watching and you're thinking about, like, leaving your current job-

    9. FC

      You have a job offer coming [laughs]

    10. IF

      Yeah. Uh, but, uh, I, I think, you know, there, there's, uh, some, uh, you know, trade secret insights that, that we have that go a little bit beyond, um, that, and, uh, it seemed to make a big difference.

    11. FC

      Right.

    12. IF

      So basically, the system out, is an output of our system. The, the, the ARC-AGI solver is an output of, uh, of our system. Uh, and it was really designed and, and, and trained on ARC 1, so we never trained at all on ARC 2. So when Gemini 3 came out, uh, we saw this big, uh, jump in performance also on ARC 1, relatively large. We were at, like, 89% with other models, and then we got to 95% with Gemini 3 on ARC 1. And of course, we had to try it on ARC 2, and we saw a, like,

  9. 5:30-6:21

    Why Gemini 3 Changed Everything

    1. IF

      you know, kind of holy cow moment of, like, this is amazingly good.

    2. FC

      Mm-hmm.

    3. IF

      Um, and the, you know, I think that's the thing driving the performance improvement there is the Google team has done some, somehow in this particular model, they've done a really good job at, uh, having a model that is good at coding, writing code for, like, visual problem solving-

    4. FC

      Mm

    5. IF

      ... better than, you know, uh, kind of all the previous models that had been out.

    6. FC

      Yeah.

    7. IF

      Um, of course, Opus, uh, 4.5, 4.5 came out from Anthropic, um, uh, you know, similar, you know, pretty quickly thereafter, and, uh, it, it, it's- quality seems to be pretty similar, uh, to Gemini 3. It's, it's more expensive. What we saw is, like, we could just replace Gemini 3 with Opus and get, uh, you know, similar results.

    8. FC

      I guess, what's next for you guys?

    9. IF

      Yeah.

    10. FC

      Other benchmarks?

  10. 6:21-7:14

    What’s Next: Benchmarks, Research, and Customers

    1. FC

      You wanna go, like, more benchmarks, proving more stuff out, uh, productizing other ideas, more research, all the above?

    2. IF

      All the above. Yeah, yeah.

    3. FC

      [laughs]

    4. IF

      Yeah, so we have some more benchmarks in mind that we think, uh, are, you know, really high-impact benchmarks that we might be able to make, uh, you know, an interesting dent on. Um, we'll ... I won't say which ones so that, uh, not everybody's, like, um, jumping in front of us, but, uh, uh, you know, you can probably guess at what some of them would be.

    5. FC

      How, how big is Poetiq?

    6. IF

      Oh, yeah. Poetiq is, uh, currently six people.

    7. FC

      Wow.

    8. IF

      We have our-

    9. FC

      Six people, and you're state-of-the-art.

    10. IF

      Yeah.

    11. FC

      That's pretty impressive.

    12. IF

      Um, yeah. They ... I- I mean, I'm really honored to be working with the team. They're ... Everybody is fantastic.

    13. FC

      Yeah.

    14. IF

      Um, uh, we have a seventh person joining who is also fantastic starting January, so, um, yeah.

    15. FC

      And the DSPy team coming soon.

    16. IF

      Yeah, yeah. [laughs]

    17. FC

      [laughs] Um, I mean, do you think that ... Ob- obviously ARC-AGI,

  11. 7:14-8:46

    Is Recursive Self-Improvement a Path to AGI?

    1. FC

      um, AGI is in the name, and so do you think that, uh, RSI, recursive self-improvement, is a path to AGI? Or do you think that this is just like ... It just gives you a nice bump. It's like dropout. You don't do dropout. You do dropout. You just get, like, a nice 3, 4% bump.

    2. IF

      Yeah. That's a, a ... It's a really nice way of, of putting it. Like, I- I- I think that both things are true, right? Like, you want that, uh, that bump from doing this because, uh, you know, uh, as we showed in our, um, initial blog post, well, it's a little bit, it's a little bit of a hack. I, I don't, again, don't wanna overclaim things here, but on ARC-AGI, because they allow you to present two solutions, uh, that allowed us to actually outperform the underlying models while being cheaper. We, we, we only provided one solution, but because of the bump in performance, we were able to still do better than when the underlying model was providing two solutions, right?

    3. FC

      Mm-hmm.

    4. IF

      So in general, if you're only allowed one response, Poetiq will always be more expensive-

    5. FC

      Mm-hmm

    6. IF

      ... um, uh, or at least, uh, at least the same price, right? But, uh, if you're allowed, you know, if you're dealing with multiple response settings, then Poetiq could be cheaper, but it, it should always be better. Uh, and so you always want that bump. But then coming back to the original question, uh, does this lead to AGI? I mean, I don't believe it's the only path, but I believe it's, like, you know, the most exciting ... In my mind, it's the most exciting path, and it is a path to AGI and beyond.
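The point about two allowed solutions comes down to how a task is scored. The sketch below assumes a task counts as solved if any submitted attempt (up to two) matches the target grid exactly. The toy grids and both sets of attempts are invented for illustration and are not real benchmark data.

```python
from typing import List, Sequence

Grid = List[List[int]]


def solved(attempts: Sequence[Grid], target: Grid, max_attempts: int = 2) -> bool:
    """A task is solved if any of the first max_attempts matches exactly."""
    return any(a == target for a in attempts[:max_attempts])


def score(all_attempts: Sequence[Sequence[Grid]], targets: Sequence[Grid]) -> float:
    """Fraction of tasks solved."""
    hits = sum(solved(a, t) for a, t in zip(all_attempts, targets))
    return hits / len(targets)


def g(v: int) -> Grid:
    """Tiny 1x1 grid, just to keep the toy data readable."""
    return [[v]]


targets = [g(1), g(2), g(3), g(4)]

# Hypothetical baseline that uses both allowed attempts on every task.
two_attempts = [[g(0), g(1)], [g(2), g(9)], [g(0), g(0)], [g(0), g(9)]]

# Hypothetical stronger system that submits a single attempt per task.
one_attempt = [[g(1)], [g(2)], [g(3)], [g(0)]]

print(score(two_attempts, targets))  # 0.5
print(score(one_attempt, targets))   # 0.75
```

In this toy case the single-attempt system scores higher while submitting half as many answers. That mirrors the argument in the conversation: one better answer can beat two weaker ones.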

    7. FC

      Um, did you actually stop it from hill climbing and say it's, it's good enough or did it actually plateau?

    8. IF

      I stopped it. It, it's, uh ... Yeah.

  12. 8:46-9:16

    When to Stop Hill-Climbing

    1. IF

      It, uh ... This ARC-AGI was fairly expensive to run the hill climbing on.
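Stopping a hill climb because of cost rather than a plateau can be sketched as a search loop with an explicit budget. This is a generic greedy hill climber, not Poetiq's procedure. The hint list, `mutate_prompt`, `evaluate_prompt`, and the cost figures are all hypothetical, and the sketch assumes the budget covers at least the first evaluation.

```python
import random
from typing import Callable, Tuple


def hill_climb(
    start: str,
    mutate: Callable[[str, random.Random], str],
    evaluate: Callable[[str], float],
    cost_per_eval: float,
    budget: float,
    seed: int = 0,
) -> Tuple[str, float, float]:
    """Greedy hill climbing that stops when the next evaluation is unaffordable.

    Returns (best candidate, best score, amount spent).
    """
    rng = random.Random(seed)
    best, best_score = start, evaluate(start)
    spent = cost_per_eval
    while spent + cost_per_eval <= budget:
        candidate = mutate(best, rng)
        candidate_score = evaluate(candidate)
        spent += cost_per_eval
        if candidate_score > best_score:  # keep only strict improvements
            best, best_score = candidate, candidate_score
    return best, best_score, spent


# Toy problem: a "prompt" scores higher the more distinct hints it contains.
HINTS = ["think step by step", "write code", "check the output"]


def mutate_prompt(prompt: str, rng: random.Random) -> str:
    return prompt + " " + rng.choice(HINTS)


def evaluate_prompt(prompt: str) -> float:
    return sum(hint in prompt for hint in HINTS) / len(HINTS)


best, best_score, spent = hill_climb(
    "Solve the puzzle.", mutate_prompt, evaluate_prompt,
    cost_per_eval=1.0, budget=10.0,
)
print(best_score, spent)
```

The loop ends because money runs out, not because the score has stopped improving. A larger budget might therefore have found a better candidate.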

    2. FC

      Okay, so you need money.

    3. IF

      Yeah. Yeah. We-

    4. FC

      But then, then you could have gotten even better. [laughs]

    5. IF

      Right. Uh-

    6. FC

      We can solve that in the world.

    7. IF

      Yeah, yeah. [laughs]

    8. FC

      [laughs] We know how to do that.

    9. IF

      Yeah. If anybody has any money who's listening. Uh, but yeah, you know, we want to service our customers, right? And we can't be, like, w- you know, w- out of money when we need to run experiments for our customers, so yeah.

    10. FC

      What else are you more, most excited about, uh, coming up in the future for you guys?

    11. IF

      You know, there's the benchmarks, but, uh, yeah,

  13. 9:16-11:22

    Automating Prompt Engineers and Agents

    1. IF

      we're starting to have conversations with, with customers, uh, around how we can help them. Uh, we're very excited about that. You know, this is a company that's doing research, uh, but we always intended it to be a company that makes a real difference in the market, right?

    2. FC

      Mm-hmm.

    3. IF

      Like, we want to solve important problems for actual businesses, uh, along the way to, uh ... You know, while we run our recursive self-improvement.

    4. FC

      Right. Yeah. I mean, I just see it so obvious because, like, in the action space of things that if you believe Sam that the models are only going to get better-

    5. IF

      Mm-hmm

    6. FC

      ... and you should, and you want to use them, the only thing in your action space to c- you know, uh, condition the model on what you want it to do is the prompt, and the ... And it's just prompt engineering.

    7. IF

      Right.

    8. FC

      Right? And just, like, try stuff really is the answer. And then you have some evals, and you're just trying stuff and then testing the evals, and it feels like we're back to, like, feature engineering-

    9. IF

      Right

    10. FC

      ... and just, like, HOG, SIFT, SURF descriptors again-

    11. IF

      [laughs]

    12. FC

      ... like back in the day. [laughs] And, like, but that's not, you know, clearly not the answer. Like, get ... The whole thing of deep learning since 2012 is get yourself out of the way-

    13. IF

      Yes

    14. FC

      ... out of the loop.

    15. IF

      Absolutely. Yeah.

    16. FC

      So it makes a lot, a ton of sense. I'm really excited for you guys.

    17. IF

      Yeah, yeah. I mean, the way, the way you're putting it is, is really nice. The, like, uh ... You know, this, um, relates back to research that we were doing at, at DeepMind before we left, where, um, we were building systems like what Poetiq can build, but we were doing it manually. And so the Poetiq technology is completely different from that research that we did in that, uh, uh, you know, that was like we put together a car by hand, right?

    18. FC

      Mm-hmm.

    19. IF

      Uh, and now we've, like, built a factory to build cars, which is something completely different. But, you know, we were in ... You know, we are quite intentionally automating ourselves. Automating prompt engineers, automating people who are building agents. It's a power tool, right?

    20. FC

      Yeah, yeah.

    21. IF

      Um ...

    22. FC

      Well, I'm really excited for you. Thanks for joining us.

    23. IF

      Yeah. Thanks so much. Thanks. [outro music]

Episode duration: 11:23


Transcript of episode OLEjyBLo8sQ
