YC Root Access

This Startup Beat Gemini 3 on ARC-AGI — at Half the Cost

Poetiq is a new startup founded by former DeepMind researchers that recently achieved a major jump on the ARC-AGI benchmark by layering a recursive self-improvement system on top of Gemini 3. In this conversation at NeurIPS, YC's Francois Chaubaurd sat down with Poetiq co-founder Ian Fisher to find out how they're increasing performance using prompts and system design alone. They also explore recursive self-improvement, benchmarking progress toward AGI, and why automating prompt engineering may be one of the most powerful levers in AI today.

Chapters
00:11 - Introducing Poetiq and the ARC-AGI Breakthrough
00:49 - How Big Is the Performance Jump?
01:18 - Ian Fisher’s Background: YC, Google, DeepMind
02:00 - Recursive Self-Improvement Explained
03:00 - Why Poetiq Targeted ARC-AGI
03:58 - Improving Models Without Access to Weights
04:26 - Ensembles, Voting, and System-Level Optimization
05:30 - Why Gemini 3 Changed Everything
06:21 - What’s Next: Benchmarks, Research, and Customers
07:14 - Is Recursive Self-Improvement a Path to AGI?
08:46 - When to Stop Hill-Climbing
09:16 - Automating Prompt Engineers and Agents

Host: Francois Chaubaurd · Guest: Ian Fisher
Jan 29, 2026 · 11m

CHAPTERS

  1. 0:11 – 0:49

    Poetiq’s ARC-AGI 2 result: 54% on the private test set

    At NeurIPS, Ian Fisher introduces Poetiq (a new startup founded in June, largely by ex-DeepMind researchers) and announces their headline benchmark result. Using Poetiq on top of Gemini 3, they achieve 54% on the ARC-AGI 2 private test set—described as a major leap over prior state of the art.

    • Poetiq is a newly formed company (started in June), mostly ex-DeepMind
    • Core announcement: 54% on ARC-AGI 2 private test set evaluation
    • Result is achieved by running Poetiq “on top of” Gemini 3
    • Positioned as a significant jump over previous best results
  2. 0:49 – 1:18

    Quantifying the jump: better than Gemini 3 DeepThink at half the cost

    Ian breaks down the performance comparison and emphasizes that the fairest baseline is Gemini 3 DeepThink. Poetiq is roughly 9–10 points higher than that stronger baseline while costing about half as much, framing the win as both a quality and efficiency improvement.

    • Gemini 3 (non-DeepThink) is cited around ~31–33% (approximate)
    • Fairer comparison: Gemini 3 DeepThink at ~45%
    • Poetiq achieves 54%—about 9–10 points better than DeepThink
    • Cost claim: Poetiq’s approach is ~half the cost of Gemini 3 DeepThink
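
The comparison above is simple arithmetic. A rough sketch, using only the approximate figures quoted in this chapter (45%, 54%, and "about half" the cost); the variable names are illustrative, and the "accuracy per unit cost" ratio is a derived figure that is not quoted in the talk.

```python
# Approximate figures as quoted in the summary above; not official benchmark data.
deepthink_score = 45.0  # Gemini 3 DeepThink, percent (approximate)
poetiq_score = 54.0     # Poetiq on top of Gemini 3, percent
cost_ratio = 0.5        # Poetiq cost relative to DeepThink ("about half")

# Absolute gain in percentage points.
point_gain = poetiq_score - deepthink_score

# Gain relative to the DeepThink baseline.
relative_gain = point_gain / deepthink_score

# Derived figure (an assumption of this sketch): accuracy per unit of cost,
# expressed as a multiple of the DeepThink baseline.
score_per_cost = (poetiq_score / cost_ratio) / deepthink_score

print(f"absolute gain: {point_gain:.0f} points")
print(f"relative gain: {relative_gain:.0%}")
print(f"accuracy per unit cost vs. DeepThink: {score_per_cost:.1f}x")
```

With the rounded 45% baseline the gap comes out at 9 points, the low end of the "9–10 points" range quoted above.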
  3. 1:18 – 2:00

    Ian Fisher’s path: YC founder → Google acquisition → ML research → Poetiq

    Ian shares his background and how it led to Poetiq. After founding and selling a YC company to Google, he moved into Google Research, then refocused heavily as LLMs became central—eventually motivating him to start Poetiq around recursive self-improvement.

    • Poetiq is Ian’s third company
    • Second company (Apportable) was YC-backed and acquired by Google in 2015
    • He transitioned into Google Research to do fundamental ML research
    • LLMs shifted his focus and seeded the idea for Poetiq
  4. 2:00 – 3:00

    Recursive self-improvement (RSI): the “holy grail,” pursued with safety in mind

    Ian frames recursive self-improvement as a major competitive frontier: systems that can make themselves smarter. He notes both the promise (rapid capability gains) and the need for safety, acknowledging that many labs and startups are racing toward similar goals.

    • RSI described as AI making itself smarter via iterative improvement loops
    • Seen as a highly competitive space (big labs + startups)
    • Poetiq’s stated intent: pursue RSI safely
    • RSI presented as a particularly exciting route to major capability gains
  5. 3:00 – 3:58

    Why ARC benchmarks: starting with ARC 1, then a surprise on ARC-AGI 2

    Poetiq initially focused on ARC 1 because it was more tractable and served as the training ground for their solver-generation system. ARC-AGI 2 wasn’t the main target early on; they only sanity-checked it across different API models until Gemini 3 triggered a standout “holy cow” improvement.

    • Primary early focus: ARC 1 (easier, better for iteration)
    • ARC-AGI 2 initially treated as “really hard,” not a main focus
    • They tested across multiple API model providers for reasonableness
    • Gemini 3 release prompted a surprising jump worth exploring on ARC 2
  6. 3:58 – 4:26

    How Poetiq improves performance without model weights: prompt + system optimization

    Because Poetiq doesn’t have access to model weights, their improvement “action space” is prompting and the surrounding system design. Ian describes a recursive loop that improves itself by optimizing performance on tasks with measurable evaluation signals.

    • No weight access: improvements are via prompts and system scaffolding
    • RSI loop runs on evaluable tasks to guide optimization
    • System output includes an ARC solver produced by their method
    • Emphasis on system-level levers rather than fine-tuning
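
The general idea of optimizing prompts against a measurable score can be illustrated with a plain hill-climbing loop. This is a minimal sketch under stated assumptions, not Poetiq's method (which is not disclosed): `hill_climb`, `propose`, `evaluate`, and `HINTS` are hypothetical names, and both the "model" and the "benchmark" are toy stand-ins so the example runs without any API.

```python
import random
from typing import Callable, Tuple


def hill_climb(
    initial: str,
    propose: Callable[[str, random.Random], str],
    evaluate: Callable[[str], float],
    steps: int = 50,
    seed: int = 0,
) -> Tuple[str, float]:
    """Propose a variation of the best prompt so far; keep it only if the score improves."""
    rng = random.Random(seed)
    best, best_score = initial, evaluate(initial)
    for _ in range(steps):
        candidate = propose(best, rng)
        score = evaluate(candidate)
        if score > best_score:  # accept strict improvements only
            best, best_score = candidate, score
    return best, best_score


# Toy stand-ins (hypothetical): a real system would call an LLM API in both places.
HINTS = ["think step by step", "write code", "check the examples", "verify the output"]


def propose(prompt: str, rng: random.Random) -> str:
    """Append one randomly chosen hint (stand-in for a model rewriting the prompt)."""
    return prompt + " | " + rng.choice(HINTS)


def evaluate(prompt: str) -> float:
    """Fraction of distinct hints present (stand-in for accuracy on scored tasks)."""
    return sum(h in prompt for h in HINTS) / len(HINTS)


start = "Solve the puzzle"
best_prompt, best_score = hill_climb(start, propose, evaluate)
print(best_prompt)
print(best_score)
```

The only requirement the loop places on the task is the one named above: a score that can be computed automatically for each candidate.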
  7. 4:26 – 5:30

    Ensembles and voting: multiple calls, refinement, and aggregation

    Ian outlines the system structure: an ensemble that calls the underlying model multiple times, with each ensemble member iteratively refining its own answer. Final outputs are combined with a voting scheme, and Poetiq claims additional “trade secret” insights beyond prior art like DSPy.

    • Uses an ensemble that queries the base model (e.g., Gemini 3) multiple times
    • Independent ensemble members refine their solutions over iterations
    • A voting/aggregation scheme selects or combines final answers
    • DSPy is cited as similar in spirit, but Poetiq claims key extra insights
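
The structure described above (independent members, per-member refinement, then a vote) can be sketched as follows. This is an illustrative toy, not Poetiq's implementation: `run_ensemble`, `majority_vote`, `first_try`, `refine`, and `TARGET` are hypothetical, and the "model calls" are random stand-ins. For ARC-style tasks the answers would be grids, which could be made hashable as tuples of tuples.

```python
import random
from collections import Counter
from typing import Callable, Hashable, List, Tuple


def majority_vote(answers: List[Hashable]) -> Tuple[Hashable, int]:
    """Return the most common answer and its vote count (ties go to the earliest seen)."""
    return Counter(answers).most_common(1)[0]


def run_ensemble(
    first_try: Callable[[random.Random], Hashable],
    refine: Callable[[Hashable, random.Random], Hashable],
    members: int = 5,
    rounds: int = 3,
    seed: int = 0,
) -> Tuple[Hashable, int, List[Hashable]]:
    """Run independent members, let each refine its own answer, then vote."""
    answers = []
    for m in range(members):
        rng = random.Random(seed + m)  # each member gets its own random stream
        answer = first_try(rng)
        for _ in range(rounds):
            answer = refine(answer, rng)
        answers.append(answer)
    winner, votes = majority_vote(answers)
    return winner, votes, answers


# Toy stand-ins (hypothetical): a real system would query the base model here.
TARGET = 7


def first_try(rng: random.Random) -> int:
    """Initial guess somewhere near the target (stand-in for a first model call)."""
    return rng.randint(TARGET - 2, TARGET + 2)


def refine(answer: int, rng: random.Random) -> int:
    """With probability 0.7, move one step toward the target; otherwise keep the answer."""
    if rng.random() < 0.7:
        return answer + (TARGET > answer) - (TARGET < answer)
    return answer


winner, votes, answers = run_ensemble(first_try, refine)
print(answers, "->", winner, f"({votes} votes)")
```

Voting only helps when members can converge on identical answers, which is why the sketch compares whole answers for equality rather than averaging them.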
  8. 5:30 – 6:21

    Why Gemini 3 was the inflection point: coding for visual problem solving

    When Gemini 3 arrived, Poetiq saw a strong improvement on ARC 1 and then an unexpectedly large jump on ARC-AGI 2—despite not training on ARC 2. Ian attributes much of the gain to Gemini 3’s apparent strength in writing code for visual reasoning tasks.

    • Solver was designed/trained on ARC 1; no training on ARC 2
    • ARC 1 performance rose from ~89% (other models) to ~95% (Gemini 3)
    • ARC-AGI 2 results produced a “holy cow” reaction
    • Hypothesis: Gemini 3 is unusually strong at coding for visual problem solving
  9. 6:21 – 7:14

    Model portability: swapping Gemini 3 with Anthropic Opus yields similar quality (higher cost)

    Ian notes that Poetiq’s system can generalize across frontier models. They tested replacing Gemini 3 with Anthropic’s Opus and observed comparable quality, but with higher cost—reinforcing their emphasis on efficiency as well as accuracy.

    • Opus (Anthropic) tested as a drop-in replacement for Gemini 3
    • Observed result quality was “pretty similar” across the two models
    • Cost matters: Opus described as more expensive
    • Poetiq positioned as model-agnostic system-level capability layer
  10. 7:14 – 8:46

    What’s next: more high-impact benchmarks, plus real customers

    Poetiq plans to expand to additional benchmarks and continue research, but also wants commercial impact. Ian describes early customer conversations and a dual mandate: advance RSI research while delivering practical value for businesses.

    • Plans include tackling additional “high-impact” benchmarks (not named)
    • Continued research and iteration on their approach
    • Customer discovery and solutioning are now starting
    • Company goal: solve real business problems while pursuing RSI
  11. 8:46 – 9:16

    Small team, outsized result—and the economics of hill-climbing

    Ian shares that Poetiq is a six-person team (with a seventh joining soon), underscoring how much leverage system-level approaches can provide. He also explains that their optimization/hill-climbing is expensive to run, and that they stopped before plateauing to conserve resources for customer work.

    • Team size: 6 people, with a 7th starting in January
    • They view the team as a key strength and source of execution speed
    • Hill-climbing/optimization runs are costly on ARC-AGI
    • They stopped optimization due to budget/prioritization, not because it plateaued
  12. 9:16 – 11:23

    Is RSI a path to AGI—and automating prompt engineers and agent builders

    Ian argues RSI is both a practical way to get incremental performance bumps and, potentially, a path to AGI and beyond (though not the only path). He frames Poetiq as building a “factory” that automates what prompt engineers and agent designers do manually—turning trial-and-eval into a scalable power tool.

    • RSI provides immediate gains; may also scale toward AGI-level progress
    • Benchmark quirks: ARC-AGI allows multiple solutions, affecting cost/benefit dynamics
    • Poetiq aims to automate manual workflow: prompt engineering and agent construction
    • Analogy: moving from hand-building systems to building a “factory” that builds them
