YC Root Access: This Startup Beat Gemini 3 on ARC-AGI — at Half the Cost
At a glance
WHAT IT’S REALLY ABOUT
Poetic boosts Gemini 3 ARC-AGI scores via recursive optimization
- Poetic reports 54% on the ARC-AGI 2 private test set by running its system on top of Gemini 3, exceeding Gemini 3 DeepThink’s ~45% while costing about half as much.
- The company frames its core method as recursive self-improvement: an automated loop that iteratively improves a solver using evaluable tasks, without changing model weights.
- Because Poetic operates through API models, its “action space” is prompt-and-system design—using ensembles, multi-step refinement, and voting schemes rather than fine-tuning.
- A key catalyst was Gemini 3’s jump in code-writing ability for visual problem-solving, which also lifted Poetic’s ARC-1 performance (from roughly 89% to 95%) despite no training on ARC-2.
- Poetic positions this as both a practical performance “bump” for customers today and a potentially meaningful path toward AGI, while acknowledging cost/compute limits and the need to stop hill-climbing strategically.
IDEAS WORTH REMEMBERING
System-level optimization can beat a stronger baseline at lower cost.
Poetic claims 54% on ARC-AGI 2 using Gemini 3 underneath, outperforming Gemini 3 DeepThink’s ~45% while being ~50% cheaper, suggesting orchestration can dominate raw model choice in some settings.
Recursive self-improvement doesn’t require weight updates to be useful.
Their loop improves performance by iterating on prompts and the surrounding execution scaffold (refinement steps, aggregation), making RSI accessible even when only an API is available.
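The loop described above can be sketched as a simple hill climb: propose a variant of the solver’s prompt/scaffold, score it against evaluable tasks, and keep it only if it improves. This is a minimal illustrative sketch, not Poetic’s actual system; `mutate` and `score` are toy stand-ins for an LLM-driven proposer and a real evaluation harness.

```python
import random

def mutate(prompt: str, rng: random.Random) -> str:
    """Toy stand-in for an LLM proposing a revised prompt; appends a hint."""
    hints = ["Think step by step.", "Write code to solve it.", "Check your answer."]
    return prompt + " " + rng.choice(hints)

def score(prompt: str) -> float:
    """Toy stand-in for running the solver on evaluable tasks and measuring accuracy."""
    return sum(hint in prompt for hint in ("step", "code", "Check")) / 3

def hill_climb(seed_prompt: str, rounds: int = 20, seed: int = 0) -> tuple[str, float]:
    rng = random.Random(seed)
    best, best_score = seed_prompt, score(seed_prompt)
    for _ in range(rounds):
        candidate = mutate(best, rng)
        s = score(candidate)
        if s > best_score:
            # Keep only improvements: no model weights change anywhere,
            # only the prompt and surrounding scaffold evolve.
            best, best_score = candidate, s
    return best, best_score

best_prompt, acc = hill_climb("Solve the ARC task.")
```

The point of the sketch is the shape of the loop, not the toy scoring: because the optimizer only ever edits text fed to an API model, it works even when fine-tuning is impossible.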
Ensembles plus refinement plus voting are the practical levers.
Poetic repeatedly calls the base model to produce/refine multiple candidate solutions and then combines them with a voting scheme, treating the solver as a “system” rather than a single prompt.
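The solve-time system described above can be sketched as sample, refine, then vote. This is an illustrative sketch only; `call_model` is a hypothetical stand-in for the underlying API model (e.g. Gemini 3), here made deterministic so the example runs offline.

```python
from collections import Counter

def call_model(prompt: str, sample: int) -> str:
    # Hypothetical stand-in: a real system would call the model API here,
    # with `sample` indexing independent sampled completions.
    return "B" if sample % 3 else "A"

def refine(prompt: str, draft: str, sample: int) -> str:
    # Stand-in for a "critique and revise" follow-up call on each draft.
    return call_model(f"{prompt}\nDraft: {draft}\nRevise it.", sample)

def solve(prompt: str, n_candidates: int = 5) -> str:
    drafts = [call_model(prompt, i) for i in range(n_candidates)]
    revised = [refine(prompt, d, i) for i, d in enumerate(drafts)]
    # Majority vote over the refined candidates; ties broken by first seen.
    return Counter(revised).most_common(1)[0][0]

answer = solve("What letter does the grid encode?")
```

Treating the solver as this whole pipeline, rather than a single prompt, is what makes the ensemble/refinement/voting structure itself an optimizable object.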
Model capability shifts (especially coding) can unlock sudden benchmark jumps.
They attribute the “holy cow” ARC-2 gains to Gemini 3 being unusually strong at writing code for visual problem solving, which made their previously ARC-1-trained solver transfer better than expected.
Benchmark rules can change the apparent cost-performance frontier.
Ian notes ARC-AGI allows submitting two solutions; Poetic’s improvement let it submit one and still beat a baseline that uses two—making it look cheaper in that specific multi-response setting.
WORDS WORTH SAVING
We just announced a pretty exciting result: with Poetic on top of Gemini 3, we have 54% on the ARC 2 private test set evaluation, which is a very, very exciting increase over the previous state of the art.
— Ian Fisher
So yeah, 9, 10 percentage points better and half the cost.
— Ian Fisher
I realized there was a much faster and cheaper way to do recursive self-improvement, where the AI is making itself smarter... recursive self-improvement is kind of the holy grail of AI.
— Ian Fisher
It's the prompt and the system around the prompt.
— Ian Fisher
We are quite intentionally automating ourselves. Automating prompt engineers, automating people who are building agents. It's a power tool, right?
— Ian Fisher
High quality AI-generated summary created from speaker-labeled transcript.