This Startup Beat Gemini 3 on ARC-AGI — at Half the Cost

Poetiq is a new startup founded by former DeepMind researchers that recently achieved a major jump on the ARC-AGI benchmark by layering a recursive self-improvement system on top of Gemini 3. In this conversation at NeurIPS, YC's Francois Chaubaurd sat down with Poetiq co-founder Ian Fisher to find out how they're increasing performance using prompts and system design alone. They also explore recursive self-improvement, benchmarking progress toward AGI, and why automating prompt engineering may be one of the most powerful levers in AI today.

Chapters
00:11 - Introducing Poetiq and the ARC-AGI Breakthrough
00:49 - How Big Is the Performance Jump?
01:18 - Ian Fisher’s Background: YC, Google, DeepMind
02:00 - Recursive Self-Improvement Explained
03:00 - Why Poetiq Targeted ARC-AGI
03:58 - Improving Models Without Access to Weights
04:26 - Ensembles, Voting, and System-Level Optimization
05:30 - Why Gemini 3 Changed Everything
06:21 - What’s Next: Benchmarks, Research, and Customers
07:14 - Is Recursive Self-Improvement a Path to AGI?
08:46 - When to Stop Hill-Climbing
09:16 - Automating Prompt Engineers and Agents

Host: Francois Chaubaurd
Guest: Ian Fisher

Jan 29, 2026 · 11m

CHAPTERS

0:11 – 0:49
Poetiq’s ARC-AGI 2 result: 54% on the private test set
At NeurIPS, Ian Fisher introduces Poetiq (a new startup founded in June, largely by ex-DeepMind researchers) and announces their headline benchmark result. Using Poetiq on top of Gemini 3, they achieve 54% on the ARC-AGI 2 private test set—described as a major leap over prior state of the art.
0:49 – 1:18
Quantifying the jump: better than Gemini 3 DeepThink at half the cost
Ian breaks down the performance comparison and emphasizes that the fairest baseline is Gemini 3 DeepThink. Poetiq is roughly 9–10 points higher than that stronger baseline while costing about half as much, framing the win as both a quality and efficiency improvement.
1:18 – 2:00
Ian Fisher’s path: YC founder → Google acquisition → ML research → Poetiq
Ian shares his background and how it led to Poetiq. After founding and selling a YC company to Google, he moved into Google Research, then refocused heavily as LLMs became central—eventually motivating him to start Poetiq around recursive self-improvement.
2:00 – 3:00
Recursive self-improvement (RSI): the “holy grail,” pursued with safety in mind
Ian frames recursive self-improvement as a major competitive frontier: systems that can make themselves smarter. He notes both the promise (rapid capability gains) and the need for safety, acknowledging that many labs and startups are racing toward similar goals.
3:00 – 3:58
Why ARC benchmarks: starting with ARC 1, then a surprise on ARC-AGI 2
Poetiq initially focused on ARC 1 because it was more tractable and served as the training ground for their solver-generation system. ARC-AGI 2 wasn’t the main target early on; they only sanity-checked it across different API models until Gemini 3 triggered a standout “holy cow” improvement.
3:58 – 4:26
How Poetiq improves performance without model weights: prompt + system optimization
Because Poetiq doesn’t have access to model weights, their improvement “action space” is prompting and the surrounding system design. Ian describes a recursive loop that improves itself by optimizing performance on tasks with measurable evaluation signals.
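
To make the idea of a loop that improves prompts against a measurable signal concrete, here is a minimal hill-climbing sketch. It is not Poetiq's method, which the interview does not detail. The names `hill_climb`, `propose`, `evaluate`, and `HINTS` are hypothetical, and the scorer is a toy stand-in for running a real model on benchmark tasks.

```python
import random
from typing import Callable, Tuple

def hill_climb(
    prompt: str,
    propose: Callable[[str, random.Random], str],
    evaluate: Callable[[str], float],
    steps: int = 20,
    seed: int = 0,
) -> Tuple[str, float]:
    """Keep a candidate prompt only when it scores strictly higher."""
    rng = random.Random(seed)
    best_prompt, best_score = prompt, evaluate(prompt)
    for _ in range(steps):
        candidate = propose(best_prompt, rng)
        score = evaluate(candidate)
        if score > best_score:
            best_prompt, best_score = candidate, score
    return best_prompt, best_score

# Toy stand-ins (assumptions): a real system would ask a model to rewrite
# the prompt and would score it by solving held-out tasks.
HINTS = ["think step by step", "write code", "check your answer"]

def propose(prompt: str, rng: random.Random) -> str:
    """Append one randomly chosen hint to the current prompt."""
    return prompt + " " + rng.choice(HINTS)

def evaluate(prompt: str) -> float:
    """Fraction of hints present in the prompt, in [0, 1]."""
    return sum(hint in prompt for hint in HINTS) / len(HINTS)

if __name__ == "__main__":
    best, score = hill_climb("Solve the puzzle.", propose, evaluate, steps=50)
    print(score, best)
```

Because only strict improvements are accepted, the score never decreases. Each step costs at least one evaluation, which is why a loop like this gets expensive when the evaluation means calling a frontier model on many tasks.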
4:26 – 5:30
Ensembles and voting: multiple calls, refinement, and aggregation
Ian outlines the system structure: an ensemble that calls the underlying model multiple times, with each ensemble member iteratively refining its own answer. Final outputs are combined with a voting scheme, and Poetiq claims additional “trade secret” insights beyond prior art like DSPy.
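
The structure described above (several members, each refining its own answer, then a vote) can be sketched as follows. This is a generic illustration under stated assumptions, not Poetiq's implementation: the member functions are toy stand-ins for model calls, the names `run_member`, `majority_vote`, `ensemble`, and `MEMBERS` are hypothetical, and plain majority voting is only one possible aggregation scheme.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence, Tuple

Task = Sequence[int]
Solver = Callable[[Task], Hashable]
Refiner = Callable[[Task, Hashable], Hashable]

def run_member(solve: Solver, refine: Refiner, task: Task, rounds: int) -> Hashable:
    """Produce a first answer, then let the member refine it `rounds` times."""
    answer = solve(task)
    for _ in range(rounds):
        answer = refine(task, answer)
    return answer

def majority_vote(answers: List[Hashable]) -> Hashable:
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]

def ensemble(task: Task, members: List[Tuple[Solver, Refiner]], rounds: int = 2) -> Hashable:
    """Run every member independently and aggregate by majority vote."""
    answers = [run_member(solve, refine, task, rounds) for solve, refine in members]
    return majority_vote(answers)

# Toy members (assumptions): the "task" is summing a list of integers.
def careful(task: Task) -> int:
    return sum(task)

def sloppy(task: Task) -> int:
    return sum(task) + 1  # off-by-one first draft

def stuck(task: Task) -> int:
    return 0  # always wrong, never improves

def keep(task: Task, answer: Hashable) -> Hashable:
    return answer  # no-op refinement

def recheck(task: Task, answer: Hashable) -> int:
    return sum(task)  # toy refinement: recompute from scratch

MEMBERS = [(careful, keep), (sloppy, recheck), (stuck, keep)]

if __name__ == "__main__":
    print(ensemble([1, 2, 3], MEMBERS))
```

Answers must be hashable for `Counter` to count them, so a grid-shaped answer such as an ARC output would need converting to a tuple of tuples before voting.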
5:30 – 6:21
Why Gemini 3 was the inflection point: coding for visual problem solving
When Gemini 3 arrived, Poetiq saw a strong improvement on ARC 1 and then an unexpectedly large jump on ARC-AGI 2—despite not training on ARC 2. Ian attributes much of the gain to Gemini 3’s apparent strength in writing code for visual reasoning tasks.
6:21 – 7:14
Model portability: swapping Gemini 3 with Anthropic Opus yields similar quality (higher cost)
Ian notes that Poetiq’s system can generalize across frontier models. They tested replacing Gemini 3 with Anthropic’s Opus and observed comparable quality, but with higher cost—reinforcing their emphasis on efficiency as well as accuracy.
7:14 – 8:46
What’s next: more high-impact benchmarks, plus real customers
Poetiq plans to expand to additional benchmarks and continue research, but also wants commercial impact. Ian describes early customer conversations and a dual mandate: advance RSI research while delivering practical value for businesses.
8:46 – 9:16
Small team, outsized result—and the economics of hill-climbing
Ian shares that Poetiq is a six-person team (with a seventh joining soon), underscoring how much leverage system-level approaches can provide. He also explains that their optimization/hill-climbing is expensive to run, and they stopped before plateauing to conserve resources for customer work.
9:16 – 11:23
Is RSI a path to AGI—and automating prompt engineers and agent builders
Ian argues RSI is both a practical way to get incremental performance bumps and, potentially, a path to AGI and beyond (though not the only path). He frames Poetiq as building a “factory” that automates what prompt engineers and agent designers do manually—turning trial-and-eval into a scalable power tool.
