Y Combinator

Ian Fischer: How Stilts Beat a Frontier Model on ARC-AGI V2

Poetiq's stilts pair recursive self-improvement with an inference harness rather than fine-tuning; the approach topped ARC-AGI V2 at lower cost than frontier deep-thinking modes.

Ian Fischer (guest) · Jared Friedman (host) · Diana Hu (host)
Feb 27, 2026 · 19m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00–0:40

    Intro

    1. IF

      The world is changing so quickly. This is probably a little bit obvious, but you should just try things, and, like, every day, do something with AI. Last summer, I took a weekend and used GPT-5 to help me build an iPhone app. I hadn't done that in a decade.

    2. JF

      So fast.

    3. IF

      And yeah, it's so fast and so easy, and that was, you know, an age ago. That was like eight months ago; now it's even faster and easier. Don't limit yourself. Anything that you imagine, you should just try to use AI and see how far you can get with it, and you'll be, you know, making the world better. [upbeat music]

  2. 0:40–1:07

    What Is Poetiq?

    1. JF

      Welcome to another episode of The Lightcone. Ian Fischer is the co-founder and co-CEO of Poetiq, which is building recursively self-improving AI reasoning harnesses for LLMs. Previously, he spent a decade as a researcher at Google DeepMind and founded a mobile dev tools company through YC years ago. Welcome, Ian.

    2. IF

      Thank you. I'm so happy to be here.

    3. JF

      What is Poetiq? How is it different from RL? You know, how is it different from context engineering?

  3. 1:07–2:07

    Recursive Self-Improvement Explained

    1. IF

      At Poetiq, what we're building is a recursively self-improving system. Recursive self-improvement is kind of the holy grail of AI, where the AI is making itself smarter. The core insight we had is that we could do recursive self-improvement far faster and cheaper than all the other ways people had been proposing. Obviously, I can't go into details about our particular approach, but most of the approaches out there require you to train a new LLM from scratch. And training LLMs from scratch costs hundreds of millions of dollars and takes months of effort, and so the-

    2. JF

      And then Anthropic or OpenAI will come along and just eat your lunch in the next model release.

    3. IF

      Right, right. And, you know, of course, Anthropic and OpenAI and Google are exploring recursive self-improvement, but typically at that level of having to train a new model for every step of self-improvement

  4. 2:07–2:59

    The Fine-Tuning Trap

    1. IF

      that they do.

    2. JF

      I mean, that seems like actually the defining thing that a startup really wants. I know that I want to take advantage of whatever the next model is, but the second you're in fine-tuning land, I'm spending millions to hundreds of millions of dollars, and then guess what? I just lit it on fire, 'cause the next version of the frontier model comes out, and I'll never catch up.

    3. IF

      Yeah.

    4. JF

      Whereas, like, working with your systems means that I will always have the thing that is, uh, better-

    5. IF

      Right

    6. JF

      ... than the thing that's out of box, and that's sort of like the holy grail.

    7. IF

      Yeah. We think this is incredibly valuable to anybody who's building on top of large language models. And we don't view the frontier models as competitors. They're the ones that we're using the stilts -- you know, building stilts-

    8. JF

      Mm

    9. IF

      ... to stand on top of. But if we didn't have that foundational layer, then, you know, Poetiq

  5. 2:59–3:14

    “Stilts” for LLMs

    1. IF

      couldn't exist.

    2. JF

      Yeah. I mean, being the smartest model is a game of inches, actually, [chuckles] and those inches matter a lot.

    3. IF

      Right, right.

    4. JF

      How do we actually get started? I mean, you've built something that basically any startup could use. It's sort of like stilts,

  6. 3:14–5:05

    Recursive Self-Improvement vs. Fine-Tuning

    1. JF

      really.

    2. IF

      We have built a system that can automatically generate systems for your particular problem that will always outperform the underlying language models, and without the massive expense, as you're saying about the bitter lesson. What would you have done without Poetiq? You probably would've said, "Okay, we're going to first collect a large data set, like tens of thousands of examples for the particular problem we're working on, and we're going to fine-tune the best model we can get our hands on." Maybe that's one of the frontier models, or maybe it's an open-weights model; it doesn't particularly matter. You're going to spend a lot of money on that fine-tuning. The compute is so expensive, and at the end of it, you have something that works better than the thing you fine-tuned on top of, but by then a new model has come out, and it's better than the thing you fine-tuned. You fine-tuned, like, three years ago on top of GPT-3.5 or whatever, and then GPT-404 comes out and just blows you out of the water. So are you going to do that again, or are you going to go out of business? And in some cases, it's the latter. With Poetiq, what we end up giving you is -- people are calling these things harnesses now, or an agentic system, or whatever you want to call it -- something that sits on top of one or more language models, and it just performs better than them. And when the new model comes out, that same harness is perfectly compatible with it, and you don't need to change anything to get an even bigger performance bump. Additionally, we can continue to optimize for whatever new model you want to use and make it even better. 
      But you don't lose out on the hundreds of millions of dollars. In fact, we do this much more cheaply than fine-tuning would cost

  7. 5:05–6:37

    Taking the Top Spot on ARC-AGI

    1. IF

      as well.

    2. JF

      And you've done this actually a bunch of times, right? I remember when you first came out with your paper in December of last year, you shot to the top of ARC-AGI V2, and then you've done this a bunch of times for other benchmarks, too. What was that like?

    3. IF

      ARC-AGI V2 was, yeah, us coming out of stealth, letting people know that we could tackle these really hard problems. In particular, we wanted to show that our system -- we call it the Poetiq meta-system -- can generate reasoning systems that are highly effective. Gemini 3 Deep Think had just come out, and they were quite dramatically at the top of the leaderboard at forty-five percent. Two days later, we released our results, where we were showing that we could get a lot higher than that-

    4. JF

      So they come out with SOTA, and then you come in right above them every single time.

    5. IF

      Yeah, yeah.

    6. JF

      Which is, like, wild to see, honestly.

    7. IF

      Right.

    8. JF

      That's what it's like to have stilts, you know?

    9. IF

      Yeah, yeah.

    10. JF

      Like, whatever model comes out, you can be taller than that one with Poetiq, [chuckles] which is like, that's so awesome!

    11. IF

      Yeah, so the interesting thing is that we were half the cost of Gemini 3 Deep Think, because we were building on top of Gemini 3 Pro, which is a much cheaper model. But we still got, in the end, a nine-percentage-point improvement on the official verification. So they were at forty-five percent and, like, seventy-something dollars, and we were at fifty-four percent and thirty-two dollars

  8. 6:37–8:40

    Beating Claude on Humanity’s Last Exam

    1. IF

      per problem.

    2. DH

      So recently, you guys just announced some incredible results for Humanity's Last Exam. Can you tell us more about those?

    3. IF

      Humanity's Last Exam is a set of twenty-five hundred really, really hard questions written by experts in many different domains. They're meant to be challenging even for PhDs in those fields. AI hasn't passed it yet, but we got to fifty-five percent, which is almost two percentage points higher than the previous state of the art, which came out just last week from Anthropic with Claude Opus 4.6. They got fifty-three point one percent, and we got fifty-five percent on it.

    4. DH

      And one thing that Humanity's Last Exam doesn't publish is the cost of getting those results. In your case, this run was done for less than six figures. How much was it?

    5. IF

      We didn't publish any cost for this, but I can say that the optimization cost us less than one hundred K. Yeah.

    6. DH

      Which is impressive, because each of these big foundation models' training runs is in the hundreds of millions of dollars. And you guys, as a company, you're only seven people?

    7. IF

      That's right. Yeah, seven research scientists and research engineers.

    8. DH

      That's impressive. And I think the thing that's very interesting about your approach is taking a very scientific approach to the emergent behaviors that a lot of the best founders are getting out of models. A lot of founders who get very good results for agents treat the underlying model as a common layer you can switch between, and certain tasks get routed accordingly -- for example, very hard-to-verify bugs get sent to GPT-5.2, versus architecture that gets sent to Claude 4.6. But you're doing this automatically instead of having a human conducting, which is very impressive. I think there's something more special going on underneath. Can you tell us a bit about how it works?

    9. JF

      Yeah, it sounds magical, so-

    10. IF

      Yeah, yeah.

    11. JF

      What can you tell us?

  9. 8:40–10:26

    How the Meta-System Works

    1. IF

      Right, you're getting at a really core thing. These harnesses are code, prompts, data, built on top of one or more language models, right? And so this is something that, in principle, you can build by hand, or with, like, Claude Code or whatever. But in practice, it takes a lot of work to have all the insights to make these work well. So the core technology we've developed at Poetiq is recursive self-improvement. We have a recursively self-improving system, which we call the Poetiq meta-system. The output of that system is systems that solve hard problems, where a hard problem is something that, if you gave it to GPT-5.2, it would struggle to give you a reliable, robust result, just to use an example. So this is a very big advantage for us. We can generate these systems in a much more automated manner, which means we can do it much more quickly and cheaply than if you hired a team yourself to try to make your own agent to solve your particular task. But not only that: since this is really an automated optimization process, if you've already done that work -- you're a startup going after a particular vertical, you think you understand your problem pretty well, you've put together your agent, and maybe it's working pretty well, but you know you can get something better, or you really need something better -- then you can bring that to us, and we can optimize that entire agent or pieces of that agent. We could optimize just the prompts, just the reasoning strategies; there are a lot of different things we can do, depending on your particular needs.
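For intuition only, the shape of a harness like the one described here -- code and prompts wrapped around one or more model calls, with selection logic on top -- can be sketched as below. This is a hypothetical toy, not Poetiq's system: `call_model` and `verify` are stubs standing in for a real LLM API and a real verifier.

```python
# Toy sketch of an inference harness: sample several candidate answers from
# one or more models, score each with a verifier, and return the best one.
# Hypothetical illustration only -- `call_model` and `verify` are stubs,
# not Poetiq's actual system or any real LLM API.

def call_model(model: str, prompt: str, seed: int) -> str:
    """Stub standing in for a real LLM API call."""
    return f"{model}-answer-{seed}"

def verify(answer: str) -> float:
    """Placeholder verifier: pretend the trailing seed digit is a quality
    score. A real harness might execute generated code or check constraints."""
    return float(answer[-1])

def harness(prompt: str, models=("model-a", "model-b"), samples=3) -> str:
    """Fan out over models and samples, keep the highest-scoring candidate."""
    candidates = [
        call_model(m, prompt, seed)
        for m in models
        for seed in range(samples)
    ]
    return max(candidates, key=verify)

best = harness("Solve the puzzle")
```

The point of the sketch is only that the selection logic lives in ordinary code outside the model, so swapping in a newer model changes nothing about the harness itself.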

  10. 10:26–11:32

    Beyond RL: A New S-Curve

    1. DH

      It sounds like this is a completely different paradigm than RL, because we went through the S-curve of regular pre-training, then RL when OpenAI released o1, and now this feels like a new one. It sounds special. It rhymes a lot with, uh, RNNs-

    2. IF

      Mm-hmm

    3. DH

      ... which is a whole different paradigm than R- than RL, right?

    4. IF

      It's going to depend on the particular task, the particular type of problem we're trying to solve, and the underlying models we're working with. But effectively, you could say each model or each set of models that we're working with will have its own S-curve. The Poetiq meta-system itself is also going to have its own S-curve. And so as the Poetiq meta-system gets better, and as the underlying models get better, you'll find that the S-curve you're dealing with keeps shifting higher and higher until ultimately either you saturate or, like-

    5. DH

      Reach AGI? [chuckles]

    6. IF

      ... yeah, reach AGI, reach superintelligence. Yeah.

    7. JF

      Given its stilts, you might, like, hit the ceiling first, then.

    8. IF

      That's the goal, right?

    9. JF

      Yeah.

    10. IF

      You wanna hit the ceiling first with Poetiq. [laughing]

  11. 11:32–13:37

    Automating Prompt Engineering

    1. JF

      I think a lot of startups that we work with -- and I, in my spare time -- do a bunch of context engineering.

    2. IF

      Mm-hmm.

    3. JF

      And the thing is, we're sort of, like, tuning it, tuning evals -- like, we're context stuffing ourselves. What does it even feel like to have a recursively self-improving version of prompt engineering and context engineering?

    4. IF

      We don't spend a lot of time looking at the particular data we're working with. Instead, we're letting the Poetiq meta-system look at that data. So, like, if the meta-system thinks it needs to put more things into context -- do more context stuffing or whatever -- it'll do that. If it needs to generate a bunch of examples to get better performance, it'll do that for you, right? It was pretty interesting to look at the prompt outputs in particular, I'd say, for ARC-AGI, in that I think you can read those and say, "Well, that's not what a human would've written,"-

    5. JF

      Hmm

    6. IF

      ... pretty clearly. There's some unexpected stuff; it made some really simple examples, and one of the examples is actually wrong. But we didn't change it. We're like, "Well, this is the thing that it output. We'll just leave it be." We don't wanna go in and monkey around with things. Historically, in machine learning, the rule was that you have to know your data set really well. But now we're kind of outsourcing that to the AI itself, where it's the AI's job to understand the data set and figure out where the failure modes are, and what kinds of robust reasoning strategies the agent could use to get better performance.

    7. JF

      How much of the output is, like, much better prompts, and how much of it is the harness itself -- context stuffing, or summarizing in the right way, or re-ranking in the right way -- so that you have some number of, like, mega LLM calls-

    8. IF

      Right

    9. JF

      ... and then how do you get the most out of, um, each of

  12. 13:37–14:50

    From 5% to 95% Performance

    1. JF

      those calls?

    2. IF

      Yeah, and that definitely varies per problem. But what we've seen -- in fact, our last paper at DeepMind was not doing this recursive self-improving stuff, but we were showing that you could build these harnesses manually to solve really hard problems. And what we saw is that we manually optimized the prompts really hard for these very hard problems, and that got us a little bit of the way. In this particular case, on the hardest task we were working on, we got to, like, 5% performance with Gemini 1.5 Flash. This was a while ago. And then, when we added on the reasoning strategies, we went from 5% to 95%.

    3. JF

      Oh, my God.

    4. IF

      And so this is typically what we see. I wouldn't say everybody, but many people are out there doing some amount of automated prompt optimization. You know, GEPA is this very popular paper; everybody's kind of re-implementing that. That will get you some performance improvements, but it's very far from everything you can get if you actually think about these reasoning strategies, which are really going to be written in code rather than in just better

  13. 14:50–16:17

    Early Access & Putting Your Agent on Stilts

    1. IF

      prompts.
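The automated prompt optimization Ian contrasts with code-level reasoning strategies can be illustrated with a toy loop: propose prompt variants, score each on a small eval set, keep the winner. This is a hedged sketch in the spirit of GEPA-style optimizers, not any real system; `run_agent` is a contrived stub so the loop has something to find.

```python
# Toy prompt-optimization loop: propose prompt variants, score each on a
# small eval set, keep the winner. Hypothetical illustration -- a real
# optimizer would use an LLM to propose variants from failure analysis.

EVAL_SET = {"2+2": "4", "3+3": "6"}

def run_agent(prompt: str, question: str) -> str:
    """Stub agent: answers correctly only if the prompt mentions 'arithmetic'."""
    return EVAL_SET[question] if "arithmetic" in prompt else "unknown"

def score(prompt: str) -> float:
    """Fraction of eval questions answered correctly under this prompt."""
    return sum(run_agent(prompt, q) == a for q, a in EVAL_SET.items()) / len(EVAL_SET)

def propose_variants(prompt: str) -> list[str]:
    """Fixed mutations; a real system would rewrite the prompt based on failures."""
    return [prompt + " Think step by step.",
            prompt + " This is an arithmetic task."]

def optimize(prompt: str, rounds: int = 3) -> str:
    best = prompt
    for _ in range(rounds):
        candidates = [best] + propose_variants(best)
        best = max(candidates, key=score)  # greedy hill-climb on eval score
    return best

tuned = optimize("Answer the question.")
```

A loop like this only mutates prompt text; the 5%-to-95% jump described above came from reasoning strategies expressed as code, which text-only mutation cannot discover.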

    2. SP

      So if startups want to use Poetiq to put their agent on stilts, what should they do?

    3. IF

      Yeah, so right now we haven't released anything yet. But if you go to poetiq.ai, there's a button you can click to sign up for early access. And if you're a startup or a company with a really hard problem -- you've tried everything you can to make it reliable and robust, and you just can't get all the way there; you need something more -- then let us know. We're looking for problems like that. Just tell us what you're working on, and we'll reach out. You'll be the first to know when we're ready to work with you.

    4. JF

      I mean, if you're at the top of Humanity's Last Exam, that's pretty big. So [chuckles] you're already all the way out there at SOTA, and then I guess the stilts basically let any agentic company become SOTA.

    5. IF

      That's the idea. Yeah. And, you know, we view the ARC-AGI results and the Humanity's Last Exam results as showing two different capabilities that we have: we can really improve your reasoning, and we can really improve deep knowledge extraction from these models.

    6. JF

      And then you're just totally vaccinated against the bitter lesson.

    7. IF

      Exactly.

    8. JF

      YC's next batch is now taking applications. Got a startup in you? Apply at ycombinator.com/apply. It's never too early, and filling out the app will level up your idea. Okay, back to the

  14. 16:17–18:29

    From YC Founder to DeepMind Researcher

    1. JF

      video.

    2. SP

      A slight change of topic, but something I was curious about: you arrived at Google over a decade ago when they acquired your first YC startup, Apportable. Apportable was porting mobile apps cross-platform, right? Like to Android or whatever. It's quite different from recursively self-improving AGI. [chuckles] How did you make that leap? What happened once you got to Google? What made you think that you maybe wanted to shift out and do something different? I'd just love to hear that story.

    3. IF

      The acquisition was this amazing opportunity to reflect on what I really wanted to be doing next, right? Google itself is a place where you can do so many different things. So I spent some time thinking about where I wanted to go next in my journey. I realized that the problems I was most excited about were really AI and robotics. And many of the best people in the world in those fields were at Google at the time. So I went and talked to them. They let me join a new AI robotics team in Google Research, which was this amazing opportunity for me, since that wasn't my background. My background was computer security, and then this cross-platform mobile systems-building stuff. I was able to join this team, and I'll tell you the truth: I very quickly realized that hardware is hard-

    4. SP

      [chuckles]

    5. IF

      ... and I didn't really wanna be doing robotics. It was more aspirational at that moment. But I was really passionate about machine learning, so I made a very hard switch into just doing machine learning research, and did that for about a decade at Google -- Google and then DeepMind.

    6. SP

      What's some advice that you have today for engineers who want to get into more of the AI side -- probably applied AI -- and build startups around AI? How should they think about that?

    7. IF

      You know,

  15. 18:29–19:45

    Advice for Engineers in the AI Era

    1. IF

      the world is changing so quickly. This is probably a little bit obvious, but you should just try things, and, like, every day, do something with AI. Always try to push yourself to find the boundaries of what these models are capable of, and build the things that you want to build, right? Even for me -- you know, last summer, I took a weekend and used GPT-5 to help me build an iPhone app. I hadn't done that in-

    2. SP

      That's amazing

    3. IF

      ... a decade.

    4. JF

      So fast.

    5. IF

      Yeah, it's so fast and so easy, and that was, you know, an age ago. That was, like, eight months ago. Now it's even faster and easier. Don't limit yourself. Anything that you imagine, you should just try to use AI and see how far you can get with it, and you'll be, you know, making the world better.

    6. JF

      That's all we have time for today. But Ian, thank you so much for giving us all stilts. We can't wait to use it at YC. I can't wait to use it for Garry's List. I mean, there's just so much to do, so...

    7. IF

      Yeah. Thank you for having me. This was a lot of fun. [upbeat music]

Episode duration: 19:45


Transcript of episode UPGB-hsAoVY
