How A Team Of 7 Keeps Breaking AI Benchmark Records

Y Combinator · Feb 27, 2026 · 19m

Ian Fischer (guest), Jared Friedman (host), Diana Hu (host)

Poetic and “reasoning harnesses” · Recursive self-improvement vs. retraining · The fine-tuning trap and the bitter lesson · “Stilts” metaphor: staying ahead of frontier models · ARC-AGI V2 leaderboard results and cost per problem · Humanity’s Last Exam result and optimization cost · Automating prompt engineering and coded reasoning strategies · Engineer advice: daily experimentation with AI

In this episode of Y Combinator, hosts Jared Friedman and Diana Hu talk with Ian Fischer about how Poetic’s seven-person team keeps breaking AI benchmark records.

Poetic’s seven-person team builds “stilts” that boost LLM reasoning

Poetic (founded by ex-DeepMind researcher Ian Fischer) develops a “recursively self-improving” meta-system that generates task-specific reasoning harnesses—code, prompts, data, and multi-model routing—that sit on top of existing LLMs.

The company argues this approach avoids the “fine-tuning trap,” where teams spend heavily customizing a model only to be leapfrogged by the next frontier release.

They showcase benchmark wins (ARC-AGI V2 and Humanity’s Last Exam), claiming higher scores at lower per-problem or overall optimization cost by leveraging cheaper base models plus better reasoning systems.

The conversation frames this as a new post-RL S-curve: automation of prompt/context engineering plus coded reasoning strategies that can transfer forward to newer models and keep improving.

Key Takeaways

Avoid fine-tuning if model churn will erase your gains.

Fischer and Friedman describe fine-tuning as expensive and fragile: by the time you’ve tuned on today’s model, a new release can outperform your customized version, forcing an endless and costly catch-up cycle.

A reusable harness can outperform the base model and carry forward to new releases.

Poetic’s “stilts” are positioned as model-agnostic systems layered on top of one or more LLMs; when a better base model arrives, the same harness can immediately benefit without being rebuilt from scratch.
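The “model-agnostic harness” idea can be sketched in a few lines of Python. Everything here, the `harness` function, the stub models, and the decompose/verify steps, is an illustrative assumption, not Poetic’s actual system; the point is only that the harness code is written against a model *interface*, so a newer base model drops in without rebuilding the harness:

```python
from typing import Callable

# A "model" is just a function from prompt to completion; any base LLM
# (or a stub, as here) can be plugged in without changing the harness.
Model = Callable[[str], str]

def harness(model: Model, task: str) -> str:
    """Illustrative reasoning harness: plan, draft, then verify.
    The same harness code runs unchanged on a newer base model."""
    plan = model(f"Break this task into steps: {task}")
    draft = model(f"Follow this plan: {plan}\nTask: {task}")
    check = model(f"Verify this answer to '{task}': {draft}")
    return draft if "OK" in check else model(f"Fix: {draft}")

# Stub models standing in for two generations of base LLM.
def old_model(prompt: str) -> str:
    return "OK draft" if "Verify" in prompt else "draft"

def new_model(prompt: str) -> str:
    return "OK better draft" if "Verify" in prompt else "better draft"

# The harness carries forward: swap the model, keep the system.
print(harness(old_model, "sort [3,1,2]"))  # draft
print(harness(new_model, "sort [3,1,2]"))  # better draft
```

Swapping `old_model` for `new_model` improves the output with zero changes to the harness itself, which is the claimed advantage over per-model fine-tuning.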

Recursive self-improvement can be done without training new foundation models.

Poetic’s core claim is that many recursive self-improvement approaches require retraining an LLM each iteration (slow and costly), while their method improves performance via system-level optimization instead.

Big performance jumps often come from coded reasoning strategies, not just better prompts.

Fischer cites internal experience where aggressive prompt optimization barely helped, but adding reasoning strategies implemented in code drove a jump from ~5% to ~95% on a hard task—implying “agent architecture” is the main lever.

Automated systems can replace manual ‘context engineering’ trial-and-error.

Instead of humans iterating on context stuffing, examples, and routing, Poetic’s meta-system inspects failures and proposes changes (prompts, examples, strategies) as part of an automated optimization loop.
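Such a loop can be sketched as: evaluate, inspect failures, propose a change, and keep the change only if the score improves. The toy task, the stub `model`, and the single mutation rule (adding a worked example to the prompt) are assumptions for illustration, not Poetic’s method:

```python
# Minimal sketch of an automated prompt-optimization loop, assuming a
# toy "model" whose accuracy depends on whether the prompt carries a
# worked example. All names here are illustrative.

TASKS = [("2+2", "4"), ("3+5", "8"), ("7+1", "8")]

def model(prompt: str, question: str) -> str:
    # Stub: answers correctly only when the prompt includes an example.
    if "example" in prompt:
        a, b = question.split("+")
        return str(int(a) + int(b))
    return "?"

def evaluate(prompt: str) -> tuple[float, list[str]]:
    failures = [q for q, gold in TASKS if model(prompt, q) != gold]
    return 1 - len(failures) / len(TASKS), failures

def optimize(prompt: str, rounds: int = 3) -> str:
    best_score, _ = evaluate(prompt)
    for _ in range(rounds):
        _, failures = evaluate(prompt)   # inspect current failures
        if not failures:
            break
        # Propose a change (here, the only move: add an example).
        candidate = prompt + " Here is an example: 1+1 -> 2."
        score, _ = evaluate(candidate)
        if score > best_score:           # keep only improvements
            prompt, best_score = candidate, score
    return prompt

final = optimize("Add the numbers.")
print(evaluate(final)[0])  # 1.0
```

A real system would propose many kinds of changes (examples, routing, strategies) and evaluate them at scale, but the evaluate-inspect-propose-keep skeleton is the same loop a human context engineer runs by hand.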

Benchmark wins are used to demonstrate cost-effective ‘stilts’ on cheaper models.

On ARC-AGI V2, Poetic claims to beat a top entry by building on a cheaper base model (Gemini 3 Pro vs. ...

Small teams can compete by optimizing above the foundation layer.

Poetic highlights achieving state-of-the-art benchmark results with a seven-person research team and sub-$100k optimization cost (for one run), contrasting with $100M+ foundation model training budgets.

Notable Quotes

Most of the approaches out there involve... train a new LLM from scratch. And training LLMs from scratch costs... hundreds of millions of dollars.

Ian Fischer

The second you're in fine-tuning land... I just lit it on fire 'cause... the next version of the frontier model comes out.

Jared Friedman

We have built a system that... can automatically generate systems for your particular problem that will always outperform the underlying language models.

Ian Fischer

We were at... fifty-four percent and thirty-two dollars per problem.

Ian Fischer

When we added on the... reasoning strategies, we went from 5% to 95%.

Ian Fischer

Questions Answered in This Episode

What exactly is being recursively improved in Poetic’s meta-system—prompts, tool/code structure, data generation, model routing, or all of the above?

On ARC-AGI V2, what parts of the gain came from more LLM calls vs. better reasoning structure per call (e.g., decomposition, verification, search)?

You mention one ARC-AGI prompt example was ‘actually wrong’ yet kept—why would an incorrect example help, and how do you prevent that from becoming a reliability risk?

For Humanity’s Last Exam, what did you change to boost “deep knowledge extraction”—retrieval, citation/verification, multi-model arbitration, or reasoning protocols?

How do you define and measure “optimization cost” (tokens, wall-clock time, engineer time), and how does it scale with task complexity?

Transcript Preview

Ian Fischer

The world is changing so quickly. This is probably a little bit obvious, but you should just try things, and, and like every day, do something with AI. Last summer, I took a weekend and used, um, GPT-5 to help me build an iPhone app. I hadn't done that in a decade.

Jared Friedman

So fast.

Ian Fischer

And yeah, it's so fast and so easy, and that was a, you know, an age ago. That was like eight months ago, uh, now it's even faster and easier. Don't limit yourself. Like, anything that you imagine, you should just try to use AI and see how far you can get with it, and you'll be, you know, making the world better. [upbeat music]

Jared Friedman

Welcome to another episode of The Lightcone. Ian Fischer is the co-founder and co-CEO of Poetic, which is building recursively self-improving AI reasoning harnesses for LLMs. Previously, he spent a decade as a researcher at Google DeepMind and founded a mobile dev tools company through YC years ago. Welcome, Ian.

Ian Fischer

Thank you. I'm so happy to be here.

Jared Friedman

What is Poetic? How's it different than RL? You know, how's it different than context engineering?

Ian Fischer

At Poetic, what we're building is a recursively self-improving system, and so recursive self-improvement is this, uh, you know, kind of the holy grail of AI, where the AI is making itself smarter. The core insight that we had is that, uh, we could do recursive self-improvement far faster and cheaper than all of the other ways that people had been proposing to do this. Uh, and so obviously, I'm-- I can't go into details about what that, what that is, um, what our particular approach is, but, um, most of the approaches out there involve, uh, you know, they require you to train a new LLM from scratch. And training LLMs from scratch costs, you know, hundreds of millions of dollars and takes, uh, months of effort, and so the-

Jared Friedman

And then Anthropic or OpenAI will come along and just eat your lunch in the next model release.

Ian Fischer

Right, right. Uh, and, you know, of course, Anthropic and OpenAI and Google, they're exploring recursive self-improvement, but a- typically at that level of, um, having the, you know, having to train a new model, uh, for every step of, uh, self-improvement that they do.

Jared Friedman

I mean, that seems like actually the, like, defining thing that a startup really, really wants. Like, I know that I want to take advantage of whatever the next model is, but the second you're in fine-tuning land, I'm spending, you know, millions to hundreds of millions of dollars, and then guess what? Like, it-- I just lit it on fire 'cause, you know, the next version of the frontier model comes out, and I'll never catch up.

Ian Fischer

Yeah.

Jared Friedman

Whereas, like, working with your systems means that I will always have the thing that is, uh, better-

Ian Fischer

Right

Jared Friedman

... than the thing that's out of box, and that's sort of like the holy grail.
