
How A Team Of 7 Keeps Breaking AI Benchmark Records
Ian Fischer (guest), Jared Friedman (host), Diana Hu (host)
In this episode of Y Combinator's Lightcone podcast, hosts Jared Friedman and Diana Hu talk with Ian Fischer about how his startup, Poetic, keeps breaking AI benchmark records with a seven-person team.
Poetic’s seven-person team builds “stilts” that boost LLM reasoning
Poetic (founded by ex-DeepMind researcher Ian Fischer) develops a “recursively self-improving” meta-system that generates task-specific reasoning harnesses—code, prompts, data, and multi-model routing—that sit on top of existing LLMs.
The company argues this approach avoids the “fine-tuning trap,” where teams spend heavily customizing a model only to be leapfrogged by the next frontier release.
They showcase benchmark wins (ARC-AGI V2 and Humanity’s Last Exam), claiming higher scores at lower per-problem or overall optimization cost by leveraging cheaper base models plus better reasoning systems.
The conversation frames this as a new post-RL S-curve: automation of prompt/context engineering plus coded reasoning strategies that can transfer forward to newer models and keep improving.
Key Takeaways
Avoid fine-tuning if model churn will erase your gains.
Fischer and Friedman describe fine-tuning as expensive and fragile: by the time you’ve tuned on today’s model, a new release can outperform your customized version, forcing an endless and costly catch-up cycle.
A reusable harness can outperform the base model and carry forward to new releases.
Poetic’s “stilts” are positioned as model-agnostic systems layered on top of one or more LLMs; when a better base model arrives, the same harness can immediately benefit without being rebuilt from scratch.
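As a rough illustration (not Poetic's actual design, which is not disclosed in the episode), a model-agnostic harness can be sketched as a class that owns a fixed reasoning strategy and treats the base model as a swappable function. The names `Harness` and `stub_model` below are hypothetical:

```python
from typing import Callable

# Any text-in/text-out LLM client fits this signature.
Model = Callable[[str], str]

class Harness:
    """Wraps a base model with a decompose-solve-verify strategy.
    Swapping `model` for a newer release upgrades the whole system
    without rebuilding the harness."""

    def __init__(self, model: Model):
        self.model = model

    def solve(self, task: str) -> str:
        plan = self.model(f"Break this task into steps:\n{task}")
        draft = self.model(f"Follow this plan:\n{plan}\nTask:\n{task}")
        return self.model(f"Verify and correct this answer:\n{draft}")

# A stub standing in for a real LLM client:
def stub_model(prompt: str) -> str:
    return f"answer({prompt.splitlines()[0]})"

result = Harness(stub_model).solve("sort [3, 1, 2]")
```

Because the harness only depends on the `Model` signature, pointing it at a stronger model is a one-line change.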
Recursive self-improvement can be done without training new foundation models.
Poetic’s core claim is that many recursive self-improvement approaches require retraining an LLM each iteration (slow and costly), while their method improves performance via system-level optimization instead.
Big performance jumps often come from coded reasoning strategies, not just better prompts.
Fischer cites internal experience where aggressive prompt optimization barely helped, but adding reasoning strategies implemented in code drove a jump from ~5% to ~95% on a hard task—implying “agent architecture” is the main lever.
Automated systems can replace manual ‘context engineering’ trial-and-error.
Instead of humans iterating on context stuffing, examples, and routing, Poetic’s meta-system inspects failures and proposes changes (prompts, examples, strategies) as part of an automated optimization loop.
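The loop described above can be sketched in miniature: evaluate a configuration against a task set, inspect the failures, propose a change, and keep it only if the score improves. Everything here (`evaluate`, `propose_change`, the toy tasks) is an illustrative assumption, not Poetic's method:

```python
def evaluate(config: dict, tasks: list) -> tuple[float, list]:
    """Score a harness configuration and collect failing tasks."""
    failures = [t for t in tasks if not t["solver"](config)]
    return 1 - len(failures) / len(tasks), failures

def propose_change(config: dict, failures: list) -> dict:
    """Inspect failures and propose a modified configuration
    (here, crudely: add one more in-context example)."""
    new = dict(config)
    new["examples"] = config.get("examples", 0) + 1
    return new

def optimize(config: dict, tasks: list, rounds: int = 10):
    """Hill-climb: accept a candidate only if it scores at least as well."""
    best_score, _ = evaluate(config, tasks)
    for _ in range(rounds):
        _, failures = evaluate(config, tasks)
        if not failures:
            break
        candidate = propose_change(config, failures)
        score, _ = evaluate(candidate, tasks)
        if score >= best_score:
            config, best_score = candidate, score
    return config, best_score

# Toy tasks: task k succeeds once the config has at least k examples.
tasks = [{"solver": (lambda k: lambda c: c.get("examples", 0) >= k)(k)}
         for k in (0, 1, 2, 3)]
config, score = optimize({"examples": 0}, tasks)
```

In a real system the proposal step would itself call an LLM to rewrite prompts or reasoning code based on the failure traces, but the accept-if-better skeleton is the same.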
Benchmark wins are used to demonstrate cost-effective ‘stilts’ on cheaper models.
On ARC-AGI V2, Poetic claims to beat a top entry by building on a cheaper base model (Gemini 3 Pro vs. ...
Small teams can compete by optimizing above the foundation layer.
Poetic highlights achieving state-of-the-art benchmark results with a seven-person research team and sub-$100k optimization cost (for one run), contrasting with $100M+ foundation model training budgets.
Notable Quotes
“Most of the approaches out there involve... train a new LLM from scratch. And training LLMs from scratch costs... hundreds of millions of dollars.”
— Ian Fischer
“The second you're in fine-tuning land... I just lit it on fire 'cause... the next version of the frontier model comes out.”
— Jared Friedman
“We have built a system that... can automatically generate systems for your particular problem that will always outperform the underlying language models.”
— Ian Fischer
“We were at... fifty-four percent and thirty-two dollars per problem.”
— Ian Fischer
“When we added on the... reasoning strategies, we went from 5% to 95%.”
— Ian Fischer
Questions Answered in This Episode
What exactly is being recursively improved in Poetic’s meta-system—prompts, tool/code structure, data generation, model routing, or all of the above?
On ARC-AGI V2, what parts of the gain came from more LLM calls vs. better reasoning structure per call (e.g., decomposition, verification, search)?
You mention one ARC-AGI prompt example was ‘actually wrong’ yet kept—why would an incorrect example help, and how do you prevent that from becoming a reliability risk?
For Humanity’s Last Exam, what did you change to boost “deep knowledge extraction”—retrieval, citation/verification, multi-model arbitration, or reasoning protocols?
How do you define and measure “optimization cost” (tokens, wall-clock time, engineer time), and how does it scale with task complexity?
Transcript Preview
The world is changing so quickly. This is probably a little bit obvious, but you should just try things, and, and like every day, do something with AI. Last summer, I took a weekend and used, um, GPT-5 to help me build an iPhone app. I hadn't done that in a decade.
So fast.
And yeah, it's so fast and so easy, and that was a, you know, an age ago. That was like eight months ago, uh, now it's even faster and easier. Don't limit yourself. Like, anything that you imagine, you should just try to use AI and see how far you can get with it, and you'll be, you know, making the world better. [upbeat music]
Welcome to another episode of The Lightcone. Ian Fischer is the co-founder and co-CEO of Poetic, which is building recursively self-improving AI reasoning harnesses for LLMs. Previously, he spent a decade as a researcher at Google DeepMind and founded a mobile dev tools company through YC years ago. Welcome, Ian.
Thank you. I'm so happy to be here.
What is Poetic? How's it different than RL? You know, how's it different than context engineering?
At Poetic, what we're building is a recursively self-improving system, and so recursive self-improvement is this, uh, you know, kind of the holy grail of AI, where the AI is making itself smarter. The core insight that we had is that, uh, we could do recursive self-improvement far faster and cheaper than all of the other ways that people had been proposing to do this. Uh, and so obviously, I'm-- I can't go into details about what that, what that is, um, what our particular approach is, but, um, most of the approaches out there involve, uh, you know, they require you to train a new LLM from scratch. And training LLMs from scratch costs, you know, hundreds of millions of dollars and takes, uh, months of effort, and so the-
And then Anthropic or OpenAI will come along and just eat your lunch in the next model release.
Right, right. Uh, and, you know, of course, Anthropic and OpenAI and Google, they're exploring recursive self-improvement, but a- typically at that level of, um, having the, you know, having to train a new model, uh, for every step of, uh, self-improvement that they do.
I mean, that seems like actually the, like, defining thing that a startup really, really wants. Like, I know that I want to take advantage of whatever the next model is, but the second you're in fine-tuning land, I'm spending, you know, millions to hundreds of millions of dollars, and then guess what? Like, it-- I just lit it on fire 'cause, you know, the next version of the frontier model comes out, and I'll never catch up.
Yeah.
Whereas, like, working with your systems means that I will always have the thing that is, uh, better-
Right
... than the thing that's out of box, and that's sort of like the holy grail.