Y Combinator | Ian Fischer: How Stilts Beat a Frontier Model on ARC-AGI V2
Poetic's "stilts" pair recursive self-improvement with an inference harness rather than fine-tuning; the system topped ARC-AGI V2 at lower cost than frontier models' deep-thinking modes.
At a glance
WHAT IT’S REALLY ABOUT
Poetic’s seven-person team builds “stilts” that boost LLM reasoning
- Poetic (founded by ex-DeepMind researcher Ian Fischer) develops a “recursively self-improving” meta-system that generates task-specific reasoning harnesses—code, prompts, data, and multi-model routing—that sit on top of existing LLMs.
- The company argues this approach avoids the “fine-tuning trap,” where teams spend heavily customizing a model only to be leapfrogged by the next frontier release.
- They showcase benchmark wins (ARC-AGI V2 and Humanity’s Last Exam), claiming higher scores at lower per-problem or overall optimization cost by leveraging cheaper base models plus better reasoning systems.
- The conversation frames this as a new post-RL S-curve: automation of prompt/context engineering plus coded reasoning strategies that can transfer forward to newer models and keep improving.
IDEAS WORTH REMEMBERING
5 ideas
Avoid fine-tuning if model churn will erase your gains.
Fischer and Friedman describe fine-tuning as expensive and fragile: by the time you’ve tuned on today’s model, a new release can outperform your customized version, forcing an endless and costly catch-up cycle.
A reusable harness can outperform the base model and carry forward to new releases.
Poetic’s “stilts” are positioned as model-agnostic systems layered on top of one or more LLMs; when a better base model arrives, the same harness can immediately benefit without being rebuilt from scratch.
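That separation can be sketched as a thin interface: the harness owns the prompts, examples, and routing, while the base model is a swappable dependency. A minimal Python sketch of the idea (all names here are illustrative, not Poetic's actual API):

```python
from typing import Callable

# A "model" is just a text-in, text-out callable; the harness never
# depends on which provider or version sits behind it.
Model = Callable[[str], str]

class Harness:
    """Task-specific reasoning layer (prompts, examples, routing).
    Swapping `model` upgrades the whole system without rebuilding it."""

    def __init__(self, model: Model, examples: list[str]):
        self.model = model
        self.examples = examples

    def solve(self, task: str) -> str:
        # The harness owns context construction; the model only completes it.
        prompt = "\n".join(self.examples) + f"\nTask: {task}\nAnswer:"
        return self.model(prompt)

# Stand-in "models" to show the same harness carrying forward unchanged.
old_model: Model = lambda prompt: "answer-from-old-model"
new_model: Model = lambda prompt: "answer-from-new-model"

harness = Harness(old_model, examples=["Example: 1+1 -> 2"])
print(harness.solve("2+2"))   # -> answer-from-old-model
harness.model = new_model     # a frontier release drops in
print(harness.solve("2+2"))   # -> answer-from-new-model
```

The point of the sketch is the dependency direction: everything Poetic describes optimizing lives in the `Harness`, so a better base model is a one-line swap rather than a rebuild.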
Recursive self-improvement can be done without training new foundation models.
Poetic’s core claim is that many recursive self-improvement approaches require retraining an LLM each iteration (slow and costly), while their method improves performance via system-level optimization instead.
Big performance jumps often come from coded reasoning strategies, not just better prompts.
Fischer cites internal experience where aggressive prompt optimization barely helped, but adding reasoning strategies implemented in code drove a jump from ~5% to ~95% on a hard task—implying “agent architecture” is the main lever.
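The anecdote contrasts prompt wording with strategies implemented as code. A toy illustration of the difference, using a propose-verify-retry loop (the stub model and verifier are hypothetical; the transcript does not describe Poetic's actual strategies):

```python
# A reasoning strategy written as code (propose -> verify -> retry)
# rather than as prompt wording.

def model(prompt: str) -> str:
    # Stand-in LLM call that is deliberately wrong on the first attempt.
    return "42" if "attempt 1" in prompt else "4"

def verify(task: str, answer: str) -> bool:
    # A coded check the harness runs independently of the model; here we
    # just re-evaluate the arithmetic (toy input, so eval is safe).
    return str(eval(task)) == answer

def solve_with_strategy(task: str, max_attempts: int = 3) -> str:
    # The strategy lives in code: propose an answer, verify it, retry.
    answer = ""
    for attempt in range(1, max_attempts + 1):
        answer = model(f"attempt {attempt}: {task}")
        if verify(task, answer):
            break
    return answer

print(solve_with_strategy("2+2"))  # -> 4 (verified, despite a bad first try)
```

No amount of prompt polishing changes the stub model's first answer, but the coded verify-and-retry step recovers the correct one, which is the shape of the ~5% to ~95% claim.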
Automated systems can replace manual ‘context engineering’ trial-and-error.
Instead of humans iterating on context stuffing, examples, and routing, Poetic’s meta-system inspects failures and proposes changes (prompts, examples, strategies) as part of an automated optimization loop.
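Described abstractly, that loop is failure-driven search over harness configurations. A minimal sketch under heavy simplification (the "harness" is reduced to a single numeric knob standing in for prompts, examples, and strategy choices; every component is hypothetical):

```python
# Failure-driven optimization loop over harness configurations.

def evaluate(config: dict, tasks: list) -> float:
    # Score a config on held-out tasks.
    correct = sum(1 for x, y in tasks if x + config["bias"] == y)
    return correct / len(tasks)

def propose_change(config: dict, failures: list) -> dict:
    # "Inspect failures and propose changes": nudge the knob toward the
    # average error instead of having a human hand-tune context.
    errors = [y - x - config["bias"] for x, y in failures]
    step = round(sum(errors) / len(errors))
    return {**config, "bias": config["bias"] + step}

def optimize(config: dict, tasks: list, rounds: int = 5) -> dict:
    for _ in range(rounds):
        failures = [(x, y) for x, y in tasks if x + config["bias"] != y]
        if not failures:          # nothing left to learn from
            break
        config = propose_change(config, failures)
    return config

tasks = [(1, 4), (2, 5), (10, 13)]          # hidden rule: y = x + 3
best = optimize({"bias": 0}, tasks)
print(best["bias"], evaluate(best, tasks))  # -> 3 1.0
```

In the real system the "proposed change" would be a prompt edit, new examples, or a different routing strategy, and evaluation would be an LLM run, but the loop structure (evaluate, collect failures, propose, repeat) is the same.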
WORDS WORTH SAVING
5 quotes
Most of the approaches out there involve... train a new LLM from scratch. And training LLMs from scratch costs... hundreds of millions of dollars.
— Ian Fischer
The second you're in fine-tuning land... I just lit it on fire 'cause... the next version of the frontier model comes out.
— Jared Friedman
We have built a system that... can automatically generate systems for your particular problem that will always outperform the underlying language models.
— Ian Fischer
We were at... fifty-four percent and thirty-two dollars per problem.
— Ian Fischer
When we added on the... reasoning strategies, we went from 5% to 95%.
— Ian Fischer
High quality AI-generated summary created from speaker-labeled transcript.