YC Root Access
This Startup Beat Gemini 3 on ARC-AGI, at Half the Cost
CHAPTERS
- 0:11 – 0:49
Poetiq’s ARC-AGI 2 result: 54% on the private test set
At NeurIPS, Ian Fisher introduces Poetiq, a startup founded in June largely by ex-DeepMind researchers, and announces its headline benchmark result. Running Poetiq’s system on top of Gemini 3, the team scores 54% on the ARC-AGI 2 private test set, a result described as a major leap over the prior state of the art.
- 0:49 – 1:18
Quantifying the jump: better than Gemini 3 Deep Think at half the cost
Ian breaks down the performance comparison and emphasizes that the fairest baseline is Gemini 3 Deep Think. Poetiq scores roughly 9–10 percentage points higher than that stronger baseline at about half the cost, so the result is an improvement in both quality and efficiency.
- 1:18 – 2:00
Ian Fisher’s path: YC founder → Google acquisition → ML research → Poetiq
Ian shares his background and how it led to Poetiq. After founding a YC company and selling it to Google, he moved into Google Research, then shifted his focus heavily as LLMs became central. That shift eventually motivated him to start Poetiq around recursive self-improvement.
- 2:00 – 3:00
Recursive self-improvement (RSI): the “holy grail,” pursued with safety in mind
Ian frames recursive self-improvement as a major competitive frontier: systems that can make themselves smarter. He notes both the promise (rapid capability gains) and the need for safety, acknowledging that many labs and startups are racing toward similar goals.
- 3:00 – 3:58
Why ARC benchmarks: starting with ARC 1, then a surprise on ARC-AGI 2
Poetiq initially focused on ARC 1 because it was more tractable and served as the training ground for their solver-generation system. ARC-AGI 2 wasn’t the main target early on; they only sanity-checked it across different API models until Gemini 3 triggered a standout “holy cow” improvement.
- 3:58 – 4:26
How Poetiq improves performance without model weights: prompt + system optimization
Because Poetiq doesn’t have access to model weights, their improvement “action space” is prompting and the surrounding system design. Ian describes a recursive loop that improves itself by optimizing performance on tasks with measurable evaluation signals.
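The loop described here can be illustrated with a minimal hill-climbing sketch in Python. It is not Poetiq’s actual method, which the summary does not disclose. The function names (`hill_climb_prompt`, `toy_propose`, `toy_score`), the hint strings, and the scoring rule are all hypothetical stand-ins. In a real setting, `propose` might ask a model to rewrite the prompt, and `score` would run the prompt against tasks that have a measurable evaluation signal.

```python
from typing import Callable, List, Tuple


def hill_climb_prompt(
    seed_prompt: str,
    propose: Callable[[str], List[str]],
    score: Callable[[str], float],
    max_rounds: int = 10,
) -> Tuple[str, float]:
    """Greedy hill-climbing: keep the best-scoring prompt variant each round."""
    best, best_score = seed_prompt, score(seed_prompt)
    for _ in range(max_rounds):
        candidates = propose(best)
        if not candidates:
            break  # nothing left to try
        # Score every candidate once and take the highest-scoring one.
        scored = [(score(c), c) for c in candidates]
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no candidate improves on the current prompt
        best, best_score = top, top_score
    return best, best_score


# Hypothetical toy stand-ins so the sketch runs without any model API.
HINTS = [
    "Think step by step.",
    "Write code to transform the grid.",
    "Verify against the examples.",
]
USEFUL = {HINTS[1], HINTS[2]}  # assumption: only these hints help


def toy_propose(prompt: str) -> List[str]:
    """Propose variants by appending one hint that is not yet in the prompt."""
    return [f"{prompt} {hint}" for hint in HINTS if hint not in prompt]


def toy_score(prompt: str) -> float:
    """Fraction of useful hints present; stands in for a benchmark score."""
    return sum(hint in prompt for hint in USEFUL) / len(USEFUL)


best_prompt, best_score = hill_climb_prompt(
    "Solve the puzzle.", toy_propose, toy_score
)
print(best_prompt, best_score)
```

The search stops as soon as no candidate beats the current prompt, which is why the unhelpful hint is never added in this toy run.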
- 4:26 – 5:30
Ensembles and voting: multiple calls, refinement, and aggregation
Ian outlines the system structure: an ensemble that calls the underlying model multiple times, with each ensemble member iteratively refining its own answer. Final outputs are combined with a voting scheme, and Poetiq claims additional “trade secret” insights beyond prior art like DSPy.
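The structure described here (several ensemble members, each refining its own answer, followed by a vote) can be sketched as follows. This is only an illustration under stated assumptions: the names `refine`, `ensemble_vote`, and `make_member` are hypothetical, the members are deterministic toy functions rather than model calls, and plain majority voting stands in for whatever voting scheme Poetiq actually uses.

```python
from collections import Counter
from typing import Callable, List, Optional

# A member takes the task and its previous answer (None on the first call)
# and returns a new answer. In practice this would wrap a model API call.
Member = Callable[[str, Optional[str]], str]


def refine(member: Member, task: str, max_steps: int = 3) -> Optional[str]:
    """Let one member revise its own answer until it stops changing."""
    answer: Optional[str] = None
    for _ in range(max_steps):
        new_answer = member(task, answer)
        if new_answer == answer:
            break  # the answer has converged
        answer = new_answer
    return answer


def ensemble_vote(members: List[Member], task: str, max_steps: int = 3) -> str:
    """Refine each member independently, then return the majority answer."""
    answers = [refine(member, task, max_steps) for member in members]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


def make_member(drafts: List[str]) -> Member:
    """Toy member that walks through a fixed list of drafts (assumption)."""
    def member(task: str, previous: Optional[str]) -> str:
        if previous is None:
            return drafts[0]
        index = drafts.index(previous)
        return drafts[min(index + 1, len(drafts) - 1)]
    return member


members = [
    make_member(["A", "B"]),  # starts at "A", refines to "B"
    make_member(["B"]),       # answers "B" immediately
    make_member(["C"]),       # stays at "C"
]
result = ensemble_vote(members, "toy task")
print(result)
```

Refinement matters in this toy run: without it, the first member would vote "A" and the three answers would all differ, leaving no majority.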
- 5:30 – 6:21
Why Gemini 3 was the inflection point: coding for visual problem solving
When Gemini 3 arrived, Poetiq saw a strong improvement on ARC 1 and then an unexpectedly large jump on ARC-AGI 2, despite not training on ARC-AGI 2. Ian attributes much of the gain to Gemini 3’s apparent strength in writing code for visual reasoning tasks.
- 6:21 – 7:14
Model portability: swapping Gemini 3 for Anthropic’s Opus yields similar quality (higher cost)
Ian notes that Poetiq’s system can generalize across frontier models. They tested replacing Gemini 3 with Anthropic’s Opus and observed comparable quality at higher cost, which reinforces their emphasis on efficiency as well as accuracy.
- 7:14 – 8:46
What’s next: more high-impact benchmarks, plus real customers
Poetiq plans to expand to additional benchmarks and continue research, but also wants commercial impact. Ian describes early customer conversations and a dual mandate: advance RSI research while delivering practical value for businesses.
- 8:46 – 9:16
Small team, outsized result, and the economics of hill-climbing
Ian shares that Poetiq is a six-person team (with a seventh joining soon), underscoring how much leverage system-level approaches can provide. He also explains that their optimization/hill-climbing is expensive to run, and they stopped before plateauing to conserve resources for customer work.
- 9:16 – 11:23
Is RSI a path to AGI? Automating prompt engineers and agent builders
Ian argues RSI is both a practical way to get incremental performance bumps and, potentially, a path to AGI and beyond (though not the only path). He frames Poetiq as building a “factory” that automates what prompt engineers and agent designers do manually, turning trial-and-eval into a scalable power tool.