YC Root Access

This Startup Beat Gemini 3 on ARC-AGI — at Half the Cost

Poetiq is a new startup founded by former DeepMind researchers that recently achieved a major jump on the ARC-AGI benchmark by layering a recursive self-improvement system on top of Gemini 3. In this conversation at NeurIPS, YC's Francois Chaubaurd sat down with Poetiq co-founder Ian Fisher to find out how they're increasing performance using prompts and system design alone. They also explore recursive self-improvement, benchmarking progress toward AGI, and why automating prompt engineering may be one of the most powerful levers in AI today.

Chapters
00:11 Introducing Poetiq and the ARC-AGI Breakthrough
00:49 How Big Is the Performance Jump?
01:18 Ian Fisher’s Background: YC, Google, DeepMind
02:00 Recursive Self-Improvement Explained
03:00 Why Poetiq Targeted ARC-AGI
03:58 Improving Models Without Access to Weights
04:26 Ensembles, Voting, and System-Level Optimization
05:30 Why Gemini 3 Changed Everything
06:21 What’s Next: Benchmarks, Research, and Customers
07:14 Is Recursive Self-Improvement a Path to AGI?
08:46 When to Stop Hill-Climbing
09:16 Automating Prompt Engineers and Agents

Francois Chaubaurd (host), Ian Fisher (guest)
Jan 29, 2026 · 11m

CHAPTERS

  1. 0:11 – 0:49

    Poetiq’s ARC-AGI 2 result: 54% on the private test set

    At NeurIPS, Ian Fisher introduces Poetiq (a new startup founded in June, largely by ex-DeepMind researchers) and announces the headline benchmark result: running Poetiq's system on top of Gemini 3 scores 54% on the ARC-AGI 2 private test set, which he describes as a major leap over the prior state of the art.

  2. 0:49 – 1:18

    Quantifying the jump: better than Gemini 3 DeepThink at half the cost

    Ian breaks down the performance comparison and emphasizes that the fairest baseline is Gemini 3 DeepThink. Poetiq scores roughly 9–10 percentage points higher than that stronger baseline while costing about half as much, which frames the win as an improvement in both quality and efficiency.

  3. 1:18 – 2:00

    Ian Fisher’s path: YC founder → Google acquisition → ML research → Poetiq

    Ian shares his background and how it led to Poetiq. After founding and selling a YC company to Google, he moved into Google Research, then refocused heavily as LLMs became central—eventually motivating him to start Poetiq around recursive self-improvement.

  4. 2:00 – 3:00

    Recursive self-improvement (RSI): the “holy grail,” pursued with safety in mind

    Ian frames recursive self-improvement as a major competitive frontier: systems that can make themselves smarter. He notes both the promise (rapid capability gains) and the need for safety, acknowledging that many labs and startups are racing toward similar goals.

  5. 3:00 – 3:58

    Why ARC benchmarks: starting with ARC 1, then a surprise on ARC-AGI 2

    Poetiq initially focused on ARC 1 because it was more tractable and served as the training ground for their solver-generation system. ARC-AGI 2 wasn’t the main target early on; they only sanity-checked it across different API models until Gemini 3 triggered a standout “holy cow” improvement.

  6. 3:58 – 4:26

    How Poetiq improves performance without model weights: prompt + system optimization

    Because Poetiq doesn’t have access to model weights, their improvement “action space” is prompting and the surrounding system design. Ian describes a recursive loop that improves itself by optimizing performance on tasks with measurable evaluation signals.
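As a rough illustration of what optimizing in "prompt space" against a measurable evaluation signal can look like, here is a minimal hill-climbing sketch. It is not Poetiq's method, which the interview does not detail. `toy_model`, `FRAGMENTS`, `TASKS`, and the `max_evals` budget are all hypothetical stand-ins: the "model" is a deterministic stub that only answers correctly when the prompt contains certain instructions, so the loop can be run without any API.

```python
from typing import Callable, List, Tuple

Task = Tuple[int, int]  # (input, expected output)


def toy_model(prompt: str, x: int) -> int:
    """Hypothetical stand-in for a black-box model call (no weight access)."""
    answer = 2 * x
    if "step by step" not in prompt:
        answer += 1  # careless without a reasoning instruction
    elif x % 2 == 0 and "check" not in prompt:
        answer += 1  # even inputs also need a verification instruction
    return answer


def score(prompt: str, tasks: List[Task], model: Callable[[str, int], int]) -> float:
    """Evaluation signal: fraction of tasks answered correctly."""
    return sum(model(prompt, x) == y for x, y in tasks) / len(tasks)


def hill_climb(seed: str, fragments: List[str], tasks: List[Task],
               model: Callable[[str, int], int], max_evals: int = 50) -> Tuple[str, float]:
    """Greedily append the single best prompt fragment until nothing improves."""
    best, best_score = seed, score(seed, tasks, model)
    evals = 1
    improved = True
    while improved and evals < max_evals:
        improved = False
        round_best, round_score = best, best_score
        for frag in fragments:
            if frag in best or evals >= max_evals:
                continue
            candidate = f"{best} {frag}"
            s = score(candidate, tasks, model)
            evals += 1
            if s > round_score:
                round_best, round_score = candidate, s
        if round_score > best_score:
            best, best_score, improved = round_best, round_score, True
    return best, best_score


# Hypothetical evaluation set and candidate prompt edits.
TASKS: List[Task] = [(x, 2 * x) for x in range(1, 11)]
FRAGMENTS = ["Think step by step.", "Double-check your answer.",
             "Be concise.", "Answer in French."]

best_prompt, best_score = hill_climb("Solve the task.", FRAGMENTS, TASKS, toy_model)
print(best_score, "|", best_prompt)
```

The only things the loop touches are the prompt text and the score, which mirrors the constraint described above: the model itself stays a black box.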

  7. 4:26 – 5:30

    Ensembles and voting: multiple calls, refinement, and aggregation

    Ian outlines the system structure: an ensemble that calls the underlying model multiple times, with each ensemble member iteratively refining its own answer. Final outputs are combined with a voting scheme, and Poetiq claims additional “trade secret” insights beyond prior art like DSPy.
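The structure described here (several ensemble members, each refining its own draft, followed by a vote) can be sketched in a few lines. This is an assumption-laden toy, not Poetiq's implementation: `toy_model` is a hypothetical deterministic stub standing in for a real model call, and plain majority voting is used because the interview does not specify the voting scheme.

```python
from collections import Counter
from typing import Callable, Optional, Tuple

# Signature of one model call: (question, previous draft or None, member id) -> answer
Model = Callable[[int, Optional[int], int], int]


def toy_model(question: int, draft: Optional[int], member_id: int) -> int:
    """Hypothetical stub: first drafts are off by 0..2, each revision removes 1 of error."""
    truth = question * question
    if draft is None:
        return truth + member_id % 3
    return draft - 1 if draft > truth else draft


def run_member(question: int, member_id: int, model: Model, rounds: int) -> int:
    """One ensemble member: draft an answer, then revise it up to `rounds` times."""
    draft = model(question, None, member_id)
    for _ in range(rounds):
        revised = model(question, draft, member_id)
        if revised == draft:  # converged, so stop spending calls
            break
        draft = revised
    return draft


def ensemble_answer(question: int, model: Model,
                    members: int = 5, rounds: int = 2) -> Tuple[int, int]:
    """Collect every member's final answer and return (winner, vote count)."""
    answers = [run_member(question, m, model, rounds) for m in range(members)]
    return Counter(answers).most_common(1)[0]


print(ensemble_answer(7, toy_model, members=5, rounds=1))  # (49, 4)
print(ensemble_answer(7, toy_model, members=5, rounds=2))  # (49, 5)
```

With one refinement round, one member is still wrong and the vote absorbs the error. With two rounds all members agree, at the price of more model calls, which is the cost-versus-accuracy trade-off the interview keeps returning to.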

  8. 5:30 – 6:21

    Why Gemini 3 was the inflection point: coding for visual problem solving

    When Gemini 3 arrived, Poetiq saw a strong improvement on ARC 1 and then an unexpectedly large jump on ARC-AGI 2, despite not training on ARC-AGI 2. Ian attributes much of the gain to Gemini 3’s apparent strength in writing code for visual reasoning tasks.
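To make "writing code for visual reasoning tasks" concrete: ARC-style tasks give a few input/output grid pairs and ask for the output of a new input. One common approach is to propose candidate programs and keep one that reproduces every training pair. The sketch below assumes that setup. The candidate functions are hand-written stand-ins for programs a model might generate, and `TRAIN`, `CANDIDATES`, and `solve` are hypothetical names, not anything from Poetiq's system.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]
Program = Callable[[Grid], Grid]


def flip_horizontal(g: Grid) -> Grid:
    return [row[::-1] for row in g]


def flip_vertical(g: Grid) -> Grid:
    return g[::-1]


def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]


# Stand-ins for model-written candidate solvers (hypothetical).
CANDIDATES: Dict[str, Program] = {
    "flip_vertical": flip_vertical,
    "transpose": transpose,
    "flip_horizontal": flip_horizontal,
}

# Toy training pairs in the ARC style: each pair is (input grid, output grid).
TRAIN: List[Tuple[Grid, Grid]] = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[5, 6, 7]], [[7, 6, 5]]),
]


def consistent(program: Program, pairs: List[Tuple[Grid, Grid]]) -> bool:
    """A program is kept only if it reproduces every training output exactly."""
    return all(program(inp) == out for inp, out in pairs)


def solve(train: List[Tuple[Grid, Grid]], test_input: Grid,
          candidates: Dict[str, Program]) -> Optional[Tuple[str, Grid]]:
    """Return the first candidate consistent with the training pairs, applied to the test input."""
    for name, program in candidates.items():
        if consistent(program, train):
            return name, program(test_input)
    return None


print(solve(TRAIN, [[4, 0, 0], [0, 9, 0]], CANDIDATES))
```

Because candidate programs can be checked mechanically against the training pairs, a model that writes better code gets a direct, measurable advantage on this kind of task.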

  9. 6:21 – 7:14

    Model portability: swapping Gemini 3 with Anthropic Opus yields similar quality (higher cost)

    Ian notes that Poetiq’s system can generalize across frontier models. They tested replacing Gemini 3 with Anthropic’s Opus and observed comparable quality, but with higher cost—reinforcing their emphasis on efficiency as well as accuracy.

  10. 7:14 – 8:46

    What’s next: more high-impact benchmarks, plus real customers

    Poetiq plans to expand to additional benchmarks and continue research, but also wants commercial impact. Ian describes early customer conversations and a dual mandate: advance RSI research while delivering practical value for businesses.

  11. 8:46 – 9:16

    Small team, outsized result—and the economics of hill-climbing

    Ian shares that Poetiq is a six-person team (with a seventh joining soon), underscoring how much leverage system-level approaches can provide. He also explains that their optimization/hill-climbing is expensive to run, and that they stopped before performance plateaued to conserve resources for customer work.

  12. 9:16 – 11:23

    Is RSI a path to AGI—and automating prompt engineers and agent builders

    Ian argues RSI is both a practical way to get incremental performance bumps and, potentially, a path to AGI and beyond (though not the only path). He frames Poetiq as building a “factory” that automates what prompt engineers and agent designers do manually—turning trial-and-eval into a scalable power tool.
