YC Root Access
This Startup Beat Gemini 3 on ARC-AGI, at Half the Cost
- 0:00 – 0:11
Intro
- FCFrancois Chaubaurd
[upbeat music]
- 0:11 – 0:49
Introducing Poetiq and the ARC-AGI Breakthrough
- FCFrancois Chaubaurd
My name is Francois. I'm a visiting partner here at Y Combinator. We're here with Ian at NeurIPS, uh, to learn a little bit about Poetiq and your background and your big, uh, announcement.
- IFIan Fisher
Great.
- FCFrancois Chaubaurd
Uh, maybe introduce yourself.
- IFIan Fisher
Yeah. Uh, I'm Ian Fisher. I'm co-founder, co-CEO of, of Poetiq. Poetiq's a new company we just started, like, back in June, mostly ex-DeepMind folks. We just announced, uh, a pretty exciting result where with Poetiq on top of Gemini 3, we have, uh, 54% on the ARC 2 private test set evaluation, which is, you know, uh, a very, very exciting, uh, increase over the previous
- 0:49 – 1:18
How Big Is the Performance Jump?
- IFIan Fisher
state-of-the-art.
- FCFrancois Chaubaurd
How much is that over Gemini 3?
- IFIan Fisher
Yeah. So, uh, Gemini 3, I think, uh, whoa, don't quote me on this, and that's a weird thing to say in front of a camera-
- FCFrancois Chaubaurd
[laughs]
- IFIan Fisher
But, uh, I think it was, uh, they were at, like, 33%, uh, uh, 31%.
- FCFrancois Chaubaurd
So you got, like, a 17% bump?
- IFIan Fisher
Uh, yeah, but the, the, the more fair comparison is, uh, Gemini 3 DeepThink-
- FCFrancois Chaubaurd
Mm-hmm
- IFIan Fisher
... which, uh, got 45%, but it costs twice as much as Poetiq.
- FCFrancois Chaubaurd
Oh, right, right, right. I see, I see.
- IFIan Fisher
So yeah, 9, 10 percentage points better and, uh, half the cost.
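The figures quoted in this exchange can be checked with a few lines of arithmetic. The sketch below uses only the numbers as spoken (the base Gemini 3 score is the speaker's rough recollection of 31 or 33), and it separates percentage points from relative gain, since the two are easy to mix up. With these inputs the gap to Deep Think is 9 points, and the gap to base Gemini 3 works out to 21 to 23 points.

```python
# Scores as quoted in the conversation, in percent.
# The base Gemini 3 figure is a rough recollection ("33 ... 31"), so both are kept.
poetiq = 54
gemini_3_low, gemini_3_high = 31, 33
deep_think = 45


def point_gain(new: float, old: float) -> float:
    """Absolute gain in percentage points."""
    return new - old


def relative_gain(new: float, old: float) -> float:
    """Gain relative to the old score, in percent."""
    return 100.0 * (new - old) / old


print(point_gain(poetiq, deep_think))                # 9
print(point_gain(poetiq, gemini_3_high))             # 21
print(point_gain(poetiq, gemini_3_low))              # 23
print(round(relative_gain(poetiq, deep_think), 1))   # 20.0
```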
- FCFrancois Chaubaurd
Remind me, uh, your background.
- 1:18 – 2:00
Ian Fisher’s Background: YC, Google, DeepMind
- IFIan Fisher
Uh, yeah. So this is, uh, Poetiq's actually my third company. Second company was a YC company, uh, called Apportable. Uh, we sold that to Google in 2015, and, uh, when I joined Google, I, uh, realized I really wanted to be doing machine learning research. It turns out that was a really good place to be doing machine learning research, so I switched into Google Research, uh, and just did fundamental research for, for a while, but then LLMs came along. It was clearly the most important thing happening, so refocused, uh, my research direction. This led to the genesis of Poetiq. Uh, I realized there was this, um, uh, there was a much faster and cheaper way to do recursive self-improvement, where the AI is
- 2:00 – 3:00
Recursive Self-Improvement Explained
- IFIan Fisher
making itself smarter, and of course, y- you know, many people are going after this. Uh, there's a l- a lot of competition in this space, both from the major labs and from other startups like Poetiq, which I think is great, right? You know, who, who knows what the actual right answer will be? But, um, uh, you know, recursive self-improvement is kind of the holy grail of AI.
- FCFrancois Chaubaurd
Yeah.
- IFIan Fisher
If we can get the models to just make themselves better, then we, you know, we can sit back and relax. Uh, you know, of course, there are differing opinions there about whether or not we should want that.
- FCFrancois Chaubaurd
Mm-hmm.
- IFIan Fisher
Poetiq obviously wants to do this safely. Um, uh, I think, you know, most, most people want to do recursive self-improvement safely, so we have a particular perspective there as well.
- FCFrancois Chaubaurd
Tell me about, um, I guess, the story of, like, you, you targeted ARC-AGI, you're running it, Gemini 3 comes out, you're running this procedure on top of it, and you're seeing it hill climb. Like, what are your thoughts? Did you expect it to be as good as it was, like it was fully in expectation, or this was, like, beat expectation, um, and then when you finally got the results, you're like, "Wow, this is cool"?
- 3:00 – 3:58
Why Poetiq Targeted ARC-AGI
- IFIan Fisher
Yeah. It's, uh, you know, it was really interesting. We were actually really focused on ARC 1. Uh, we weren't paying that much attention to ARC-AGI 2. We, we, like, ran our models on ARC 2, uh, just to make certain, you know, it was, like, reasonable.
- FCFrancois Chaubaurd
With different, uh, API models, right?
- IFIan Fisher
Yeah, yeah, with d- different API providers. Um, but, uh, we, you know, we were getting very exciting results on ARC 1, and we figured, you know, it's like y- it's easier, we'll, like, start with that. ARC 2 seems really hard. We were in a really good position. You know, I don't, I don't want to, um, I don't want to, like, overclaim. You know, I, I think what Poetiq's done is, like, very good, but Gemini 3 came out. It was, it's a really quite astonishingly good model.
- FCFrancois Chaubaurd
Yeah.
- IFIan Fisher
So a little bit, a little bit of technical background. The recursive self-improvement loop, what it does is, like, we run it on other tasks that we can evaluate. Uh, so the, the, our system is improving itself by improving other systems, right?
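The loop described here (propose a change, score it on tasks that can be evaluated, keep it if it helps) has the shape of ordinary hill climbing. The sketch below is a generic illustration of that shape only, not Poetiq's method: `hill_climb`, `toy_propose`, `toy_evaluate`, and the bit-vector "config" are all hypothetical stand-ins for a prompt plus the system around it, and the budget parameter reflects that every evaluation costs money.

```python
import random
from typing import Callable, List, Tuple

Config = Tuple[int, ...]  # hypothetical stand-in for a prompt plus surrounding system


def hill_climb(
    start: Config,
    propose: Callable[[Config, random.Random], Config],
    evaluate: Callable[[Config], float],
    budget: int,
    seed: int = 0,
) -> Tuple[Config, float, List[float]]:
    """Keep a candidate only if it scores strictly higher on the evaluable tasks."""
    rng = random.Random(seed)
    best, best_score = start, evaluate(start)
    history = [best_score]
    for _ in range(budget):  # each step spends one evaluation
        candidate = propose(best, rng)
        candidate_score = evaluate(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
        history.append(best_score)
    return best, best_score, history


# Toy problem, purely illustrative: a config is a tuple of 0/1 switches and the
# "benchmark score" is the fraction of switches that match a hidden target.
TARGET: Config = (1, 0, 1, 1, 0, 1, 0, 0)


def toy_evaluate(config: Config) -> float:
    return sum(a == b for a, b in zip(config, TARGET)) / len(TARGET)


def toy_propose(config: Config, rng: random.Random) -> Config:
    i = rng.randrange(len(config))  # flip one randomly chosen switch
    return config[:i] + (1 - config[i],) + config[i + 1:]


best, score, history = hill_climb((0,) * 8, toy_propose, toy_evaluate, budget=200)
print(best, score)
```

Stopping is a budget decision in this sketch: the loop ends when the evaluations run out, whether or not the score has plateaued.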
- FCFrancois Chaubaurd
And you, and you don't have access to the weights, so the only thing-
- 3:58 – 4:26
Improving Models Without Access to Weights
- IFIan Fisher
Right, exactly
- FCFrancois Chaubaurd
... you can really, the only thing in your action space to change is the prompt itself.
- IFIan Fisher
It's the prompt and the system around the prompt. Like, so, you know, the system that we are using, it's like an ensemble, um, that calls, uh, you know, the underlying model, in this case Gemini 3, um, multiple times to refine. Each ensemble member's independent and is refining its own answer, and then we combine them with some voting scheme that works well.
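A rough picture of the structure described here: several independent members, each calling the model repeatedly to refine its own answer, followed by a vote. The sketch below is an assumption-heavy illustration, not Poetiq's system. The model signature, the `passes_check` stopping test, the plurality vote, and the toy model are all hypothetical; the conversation only says "some voting scheme that works well".

```python
import random
from collections import Counter
from typing import Callable, List, Optional

# Hypothetical signature for a model call: (task, previous_answer, rng) -> answer.
Model = Callable[[str, Optional[str], random.Random], str]


def run_member(model: Model, passes_check: Callable[[str], bool],
               task: str, max_calls: int, rng: random.Random) -> str:
    """One independent member: call the model up to max_calls times,
    feeding back its previous answer until the answer passes a check."""
    answer = model(task, None, rng)
    for _ in range(max_calls - 1):
        if passes_check(answer):
            break
        answer = model(task, answer, rng)
    return answer


def vote(answers: List[str]) -> str:
    """Plurality vote; ties go to the answer that was seen first."""
    return Counter(answers).most_common(1)[0][0]


def solve(model: Model, passes_check: Callable[[str], bool], task: str,
          members: int = 5, max_calls: int = 4, seed: int = 0) -> str:
    """Run each member with its own random stream, then combine by voting."""
    answers = [
        run_member(model, passes_check, task, max_calls, random.Random(seed + i))
        for i in range(members)
    ]
    return vote(answers)


# Toy stand-ins, not a real LLM client.
def toy_model(task: str, previous: Optional[str], rng: random.Random) -> str:
    # Pretend model: returns "42" about 60% of the time, otherwise a wrong guess.
    return "42" if rng.random() < 0.6 else rng.choice(["41", "43"])


def toy_check(answer: str) -> bool:
    # Deliberately imperfect check: it also accepts one wrong answer ("43").
    return answer in {"42", "43"}


print(solve(toy_model, toy_check, "toy task"))
```

The check is imperfect on purpose: a single member can settle on a wrong answer that happens to pass, which is the case the vote across members is meant to absorb.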
- FCFrancois Chaubaurd
And there was some DSPy stuff
- 4:26 – 5:30
Ensembles, Voting, and System-Level Optimization
- FCFrancois Chaubaurd
that was similar w- way back when that I've tried, and I've not really seen it be super great. Um-
- IFIan Fisher
Right
- FCFrancois Chaubaurd
... and you guys are, you know, like, in the same spirit, but-
- IFIan Fisher
Yeah, yeah, yeah
- FCFrancois Chaubaurd
... meaningfully better.
- IFIan Fisher
Yeah, so DS- uh, DSPy is a very cool project. Uh, and, uh, I, you know, I wish, I wish I could hire the, the people who made it.
- FCFrancois Chaubaurd
[laughs]
- IFIan Fisher
Uh, if you're watching and you're thinking about, like, leaving your current job-
- FCFrancois Chaubaurd
You have a job offer coming [laughs]
- IFIan Fisher
Yeah. Uh, but, uh, I, I think, you know, there, there's, uh, some, uh, you know, trade secret insights that, that we have that go a little bit beyond, um, that, and, uh, it seemed to make a big difference.
- FCFrancois Chaubaurd
Right.
- IFIan Fisher
So basically, the system is an output of our system. The, the, the ARC-AGI solver is an output of, uh, of our system. Uh, and it was really designed and, and, and trained on ARC 1, so we never trained at all on ARC 2. So when Gemini 3 came out, uh, we saw this big, uh, jump in performance also on ARC 1, relatively large. We were at, like, 89% with other models, and then we got to 95% with Gemini 3 on ARC 1. And of course, we had to try it on ARC 2, and we saw a, like,
- 5:30 – 6:21
Why Gemini 3 Changed Everything
- IFIan Fisher
you know, kind of holy cow moment of, like, this is amazingly good.
- FCFrancois Chaubaurd
Mm-hmm.
- IFIan Fisher
Um, and the, you know, I think that's the thing driving the performance improvement there is the Google team has done some, somehow in this particular model, they've done a really good job at, uh, having a model that is good at coding, writing code for, like, visual problem solving-
- FCFrancois Chaubaurd
Mm
- IFIan Fisher
... better than, you know, uh, kind of all the previous models that had been out.
- FCFrancois Chaubaurd
Yeah.
- IFIan Fisher
Um, of course, Opus, uh, 4.5 came out from Anthropic, um, uh, you know, pretty quickly thereafter, and, uh, its quality seems to be pretty similar, uh, to Gemini 3. It's, it's more expensive. What we saw is, like, we could just replace Gemini 3 with Opus and get, uh, you know, similar results.
- FCFrancois Chaubaurd
I guess, what's next for you guys?
- IFIan Fisher
Yeah.
- FCFrancois Chaubaurd
Other benchmarks?
- 6:21 – 7:14
What’s Next: Benchmarks, Research, and Customers
- FCFrancois Chaubaurd
You wanna go, like, more benchmarks, proving more stuff out, uh, productizing other ideas, more research, all the above?
- IFIan Fisher
All the above. Yeah, yeah.
- FCFrancois Chaubaurd
[laughs]
- IFIan Fisher
Yeah, so we have some more benchmarks in mind that we think, uh, are, you know, really high-impact benchmarks that we might be able to make, uh, you know, an interesting dent on. Um, we'll ... I won't say which ones so that, uh, not everybody's, like, um, jumping in front of us, but, uh, uh, you know, you can probably guess at what some of them would be.
- FCFrancois Chaubaurd
How, how big is Poetiq?
- IFIan Fisher
Oh, yeah. Poetiq is, uh, currently six people.
- FCFrancois Chaubaurd
Wow.
- IFIan Fisher
We have our-
- FCFrancois Chaubaurd
Six people, and you're state-of-the-art.
- IFIan Fisher
Yeah.
- FCFrancois Chaubaurd
That's pretty impressive.
- IFIan Fisher
Um, yeah. They ... I- I mean, I'm really honored to be working with the team. They're ... Everybody is fantastic.
- FCFrancois Chaubaurd
Yeah.
- IFIan Fisher
Um, uh, we have a seventh person joining who is also fantastic starting January, so, um, yeah.
- FCFrancois Chaubaurd
And the DSPy team coming soon.
- IFIan Fisher
Yeah, yeah. [laughs]
- FCFrancois Chaubaurd
[laughs] Um, I mean, do you think that ... Ob- obviously ARC-AGI,
- 7:14 – 8:46
Is Recursive Self-Improvement a Path to AGI?
- FCFrancois Chaubaurd
um, AGI is in the name, and so do you think that, uh, RSI, recursive self-improvement, is a path to AGI? Or do you think that this is just like ... It just gives you a nice bump. It's like dropout. You don't do dropout. You do dropout. You just get, like, a nice 3, 4% bump.
- IFIan Fisher
Yeah. That's a, a ... It's a really nice way of, of putting it. Like, I- I- I think that both things are true, right? Like, you want that, uh, that bump from doing this because, uh, you know, uh, as we showed in our, um, initial blog post, well, it's a little bit, it's a little bit of a hack. I, I don't, again, don't wanna overclaim things here, but on ARC-AGI, because they allow you to present two solutions, uh, that allowed us to actually outperform the underlying models while being cheaper. We, we, we only provided one solution, but because of the bump in performance, we were able to still do better than when the underlying model was providing two solutions, right?
- FCFrancois Chaubaurd
Mm-hmm.
- IFIan Fisher
So in general, if you're only allowed one response, Poetiq will always be more expensive-
- FCFrancois Chaubaurd
Mm-hmm
- IFIan Fisher
... um, uh, or at least, uh, at least the same price, right? But, uh, if you're allowed, you know, if you're dealing with multiple response settings, then Poetiq could be cheaper, but it, it should always be better. Uh, and so you always want that bump. But then coming back to the original question, uh, does this lead to AGI? I mean, I don't believe it's the only path, but I believe it's, like, you know, the most exciting ... In my mind, it's the most exciting path, and it is a path to AGI and beyond.
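The point about two allowed solutions can be made concrete with a small scoring sketch. It assumes, as described in the conversation, that a task counts as solved if any of up to two submitted answers is correct. The task names and answers below are invented for illustration; they only show how a system that submits a single, more accurate answer can outscore a baseline that uses both attempts.

```python
from typing import Dict, Sequence

MAX_ATTEMPTS = 2  # up to two answers per task, as described in the conversation


def score(submissions: Dict[str, Sequence[str]], truth: Dict[str, str]) -> float:
    """Fraction of tasks where any of the first MAX_ATTEMPTS answers is correct."""
    solved = sum(
        truth[task] in list(attempts)[:MAX_ATTEMPTS]
        for task, attempts in submissions.items()
    )
    return solved / len(truth)


# Invented toy data, not benchmark results.
truth = {"t1": "A", "t2": "B", "t3": "C", "t4": "D"}
two_attempt_baseline = {
    "t1": ["A", "X"], "t2": ["X", "B"], "t3": ["X", "Y"], "t4": ["X", "Y"],
}
one_attempt_system = {"t1": ["A"], "t2": ["B"], "t3": ["C"], "t4": ["X"]}

print(score(two_attempt_baseline, truth))  # 0.5
print(score(one_attempt_system, truth))    # 0.75
```

Under this rule the baseline pays for two answers per task while the one-attempt system pays for one, which is the setting where a wrapper around the model can come out cheaper overall.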
- FCFrancois Chaubaurd
Um, did you actually stop it from hill climbing and say it's, it's good enough or did it actually plateau?
- IFIan Fisher
I stopped it. It, it's, uh ... Yeah.
- 8:46 – 9:16
When to Stop Hill-Climbing
- IFIan Fisher
It, uh ... This ARC-AGI was fairly expensive to run the hill climbing on.
- FCFrancois Chaubaurd
Okay, so you need money.
- IFIan Fisher
Yeah. Yeah. We-
- FCFrancois Chaubaurd
But then, then you could have gotten even better. [laughs]
- IFIan Fisher
Right. Uh-
- FCFrancois Chaubaurd
We can solve that in the world.
- IFIan Fisher
Yeah, yeah. [laughs]
- FCFrancois Chaubaurd
[laughs] We know how to do that.
- IFIan Fisher
Yeah. If anybody has any money who's listening. Uh, but yeah, you know, we want to service our customers, right? And we can't be, like, w- you know, w- out of money when we need to run experiments for our customers, so yeah.
- FCFrancois Chaubaurd
What else are you more, most excited about, uh, coming up in the future for you guys?
- IFIan Fisher
You know, there's the benchmarks, but, uh, yeah,
- 9:16 – 11:22
Automating Prompt Engineers and Agents
- IFIan Fisher
we're starting to have conversations with, with customers, uh, around how we can help them. Uh, we're very excited about that. You know, this is a company that's doing research, uh, but we always intended it to be a company that makes a real difference in the market, right?
- FCFrancois Chaubaurd
Mm-hmm.
- IFIan Fisher
Like, we want to solve important problems for actual businesses, uh, along the way to, uh ... You know, while we run our recursive self-improvement.
- FCFrancois Chaubaurd
Right. Yeah. I mean, I just see it so obvious because, like, in the action space of things that if you believe Sam that the models are only going to get better-
- IFIan Fisher
Mm-hmm
- FCFrancois Chaubaurd
... and you should, and you want to use them, the only thing in your action space to c- you know, uh, condition the model on what you want it to do is the prompt, and the ... And it's just prompt engineering.
- IFIan Fisher
Right.
- FCFrancois Chaubaurd
Right? And just, like, try stuff really is the answer. And then you have some evals, and you're just trying stuff and then testing the evals, and it feels like we're back to, like, feature engineering-
- IFIan Fisher
Right
- FCFrancois Chaubaurd
... and just, like, HOG, SIFT, SURF descriptors again-
- IFIan Fisher
[laughs]
- FCFrancois Chaubaurd
... like back in the day. [laughs] And, like, but that's not, you know, clearly not the answer. Like, get ... The whole thing of deep learning since 2012 is get yourself out of the way-
- IFIan Fisher
Yes
- FCFrancois Chaubaurd
... out of the loop.
- IFIan Fisher
Absolutely. Yeah.
- FCFrancois Chaubaurd
So it makes a lot, a ton of sense. I'm really excited for you guys.
- IFIan Fisher
Yeah, yeah. I mean, the way, the way you're putting it is, is really nice. The, like, uh ... You know, this, um, relates back to research that we were doing at, at DeepMind before we left, where, um, we were building systems like what Poetiq can build, but we were doing it manually. And so the Poetiq technology is completely different from that research that we did in that, uh, uh, you know, that was like we put together a car by hand, right?
- FCFrancois Chaubaurd
Mm-hmm.
- IFIan Fisher
Uh, and now we've, like, built a factory to build cars, which is something completely different. But, you know, we were in ... You know, we are quite intentionally automating ourselves. Automating prompt engineers, automating people who are building agents. It's a power tool, right?
- FCFrancois Chaubaurd
Yeah, yeah.
- IFIan Fisher
Um ...
- FCFrancois Chaubaurd
Well, I'm really excited for you. Thanks for joining us.
- IFIan Fisher
Yeah. Thanks so much. Thanks. [outro music]
Episode duration: 11:23