YC Root Access
This Startup Beat Gemini 3 on ARC-AGI — at Half the Cost
CHAPTERS
- 0:11 – 0:49
Poetiq’s ARC-AGI 2 result: 54% on the private test set
At NeurIPS, Ian Fischer introduces Poetiq (a startup founded in June, largely by ex-DeepMind researchers) and announces its headline benchmark result. Running Poetiq on top of Gemini 3, the team scores 54% on the ARC-AGI 2 private test set, described as a major leap over the prior state of the art.
- •Poetiq is a newly formed company (started in June), mostly ex-DeepMind
- •Core announcement: 54% on ARC-AGI 2 private test set evaluation
- •Result is achieved by running Poetiq “on top of” Gemini 3
- •Positioned as a significant jump over previous best results
- 0:49 – 1:18
Quantifying the jump: better than Gemini 3 DeepThink at half the cost
Ian breaks down the performance comparison and emphasizes that the fairest baseline is Gemini 3 DeepThink. Poetiq is roughly 9–10 points higher than that stronger baseline while costing about half as much, framing the win as both a quality and efficiency improvement.
- •Gemini 3 (non-DeepThink) is cited at roughly 31–33%
- •Fairer comparison: Gemini 3 DeepThink at ~45%
- •Poetiq achieves 54%, about 9–10 points better than DeepThink
- •Cost claim: Poetiq’s approach is ~half the cost of Gemini 3 DeepThink
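The comparison above can be worked through as plain arithmetic. The sketch below uses only the approximate figures quoted in this chapter; the "score per unit cost" ratio is an illustrative metric introduced here, not one used in the talk.

```python
# Approximate figures as quoted in the chapter summary.
poetiq_score = 54.0      # % on the ARC-AGI 2 private test set
deepthink_score = 45.0   # % for Gemini 3 DeepThink (approximate)
relative_cost = 0.5      # Poetiq cost as a fraction of DeepThink's (approximate)


def point_gain(new: float, baseline: float) -> float:
    """Absolute improvement in percentage points."""
    return new - baseline


def relative_gain(new: float, baseline: float) -> float:
    """Improvement as a fraction of the baseline score."""
    return (new - baseline) / baseline


# Illustrative metric (an assumption of this sketch): score per unit of cost,
# with DeepThink's cost normalized to 1.0.
efficiency_ratio = (poetiq_score / relative_cost) / (deepthink_score / 1.0)

print(point_gain(poetiq_score, deepthink_score))     # 9.0 points
print(relative_gain(poetiq_score, deepthink_score))  # 0.2, i.e. 20% relative
print(efficiency_ratio)                              # 2.4x score per unit cost
```

Because the 45% and "half the cost" figures are themselves approximate, the outputs should be read as rough magnitudes rather than precise measurements.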
- 1:18 – 2:00
Ian Fischer’s path: YC founder → Google acquisition → ML research → Poetiq
Ian shares his background and how it led to Poetiq. After founding a YC-backed company and selling it to Google, he moved into Google Research. As LLMs became central to the field, he shifted his focus to them, which eventually motivated him to start Poetiq around recursive self-improvement.
- •Poetiq is Ian’s third company
- •Second company (Apportable) was YC-backed and acquired by Google in 2015
- •He transitioned into Google Research to do fundamental ML research
- •LLMs shifted his focus and seeded the idea for Poetiq
- 2:00 – 3:00
Recursive self-improvement (RSI): the “holy grail,” pursued with safety in mind
Ian frames recursive self-improvement as a major competitive frontier: systems that can make themselves smarter. He notes both the promise (rapid capability gains) and the need for safety, acknowledging that many labs and startups are racing toward similar goals.
- •RSI described as AI making itself smarter via iterative improvement loops
- •Seen as a highly competitive space (big labs + startups)
- •Poetiq’s stated intent: pursue RSI safely
- •RSI presented as a particularly exciting route to major capability gains
- 3:00 – 3:58
Why ARC benchmarks: starting with ARC 1, then a surprise on ARC-AGI 2
Poetiq initially focused on ARC 1 because it was more tractable and served as the training ground for their solver-generation system. ARC-AGI 2 wasn’t the main target early on; they only sanity-checked it across different API models until Gemini 3 triggered a standout “holy cow” improvement.
- •Primary early focus: ARC 1 (easier, better for iteration)
- •ARC-AGI 2 initially treated as “really hard,” not a main focus
- •They ran it against models from multiple API providers as a sanity check
- •The Gemini 3 release produced a surprising jump on ARC-AGI 2 that was worth exploring
- 3:58 – 4:26
How Poetiq improves performance without model weights: prompt + system optimization
Because Poetiq doesn’t have access to model weights, their improvement “action space” is prompting and the surrounding system design. Ian describes a recursive loop that improves itself by optimizing performance on tasks with measurable evaluation signals.
- •No weight access: improvements are via prompts and system scaffolding
- •RSI loop runs on evaluable tasks to guide optimization
- •System output includes an ARC solver produced by their method
- •Emphasis on system-level levers rather than fine-tuning
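A loop of this kind, where the only thing being changed is the prompt and the only guide is a measurable score, can be sketched as a simple hill climb. This is a minimal illustration under stated assumptions, not Poetiq's method: `hill_climb`, `toy_propose`, `toy_evaluate`, and `HINTS` are hypothetical names, and the evaluator is a toy stand-in for running a real model on an evaluable task.

```python
import random
from typing import Callable, Tuple


def hill_climb(
    initial: str,
    propose: Callable[[str, random.Random], str],
    evaluate: Callable[[str], float],
    steps: int,
    seed: int = 0,
) -> Tuple[str, float]:
    """Keep a candidate prompt only if it scores strictly better."""
    rng = random.Random(seed)
    best, best_score = initial, evaluate(initial)
    for _ in range(steps):
        candidate = propose(best, rng)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Toy stand-ins (assumptions): a real system would call a model and score
# its outputs on held-out tasks instead of counting substrings.
HINTS = [
    "Think step by step.",
    "Write Python code.",
    "Check your answer against the examples.",
]


def toy_propose(prompt: str, rng: random.Random) -> str:
    """Mutate the prompt by appending one randomly chosen instruction."""
    return prompt + " " + rng.choice(HINTS)


def toy_evaluate(prompt: str) -> float:
    """Fraction of the known-useful hints that the prompt contains."""
    return sum(h in prompt for h in HINTS) / len(HINTS)


best, score = hill_climb("Solve the puzzle.", toy_propose, toy_evaluate, steps=50)
print(score)
print(best)
```

The key property is that no model weights are touched: the search space is text and system configuration, and the evaluation signal decides which changes survive.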
- 4:26 – 5:30
Ensembles and voting: multiple calls, refinement, and aggregation
Ian outlines the system structure: an ensemble that calls the underlying model multiple times, with each ensemble member iteratively refining its own answer. Final outputs are combined with a voting scheme, and Poetiq claims additional “trade secret” insights beyond prior art like DSPy.
- •Uses an ensemble that queries the base model (e.g., Gemini 3) multiple times
- •Independent ensemble members refine their solutions over iterations
- •A voting/aggregation scheme selects or combines final answers
- •DSPy is cited as similar in spirit, but Poetiq claims key extra insights
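The structure described in this chapter (several members, each refining its own answer, then a vote) can be sketched as follows. The member functions here are hypothetical stubs standing in for model calls, and plain majority voting is an assumption; the talk does not specify the voting scheme.

```python
from collections import Counter
from typing import Callable, Hashable, List, Optional, Tuple

# A member takes the task and its own previous answer (None on the first
# call) and returns a new answer. In a real system this would be a model call.
Member = Callable[[int, Optional[Hashable]], Hashable]


def run_ensemble(task: int, members: List[Member], n_refine: int) -> Tuple[Hashable, int]:
    """Let each member refine independently, then take a majority vote."""
    finals = []
    for member in members:
        answer = None
        for _ in range(n_refine):
            answer = member(task, answer)
        finals.append(answer)
    winner, votes = Counter(finals).most_common(1)[0]
    return winner, votes


# Toy members for a toy task whose correct answer is task * 2.
def always_right(task, previous):
    return task * 2


def right_after_refining(task, previous):
    # Wrong on the first call, corrected once it can see its earlier answer.
    return task + 1 if previous is None else task * 2


def always_wrong(task, previous):
    return task + 1


members = [always_right, right_after_refining, always_wrong]
print(run_ensemble(3, members, n_refine=1))  # (4, 2): the wrong answer wins
print(run_ensemble(3, members, n_refine=2))  # (6, 2): refinement flips the vote
```

Answers must be hashable for `Counter` to tally them, so grid-shaped outputs would need to be converted to tuples of tuples before voting.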
- 5:30 – 6:21
Why Gemini 3 was the inflection point: coding for visual problem solving
When Gemini 3 arrived, Poetiq saw a strong improvement on ARC 1 and then an unexpectedly large jump on ARC-AGI 2, despite not training on ARC-AGI 2. Ian attributes much of the gain to Gemini 3’s apparent strength in writing code for visual reasoning tasks.
- •Solver was designed/trained on ARC 1; no training on ARC-AGI 2
- •ARC 1 performance rose from ~89% (other models) to ~95% (Gemini 3)
- •ARC-AGI 2 results produced a “holy cow” reaction
- •Hypothesis: Gemini 3 is unusually strong at coding for visual problem solving
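"Coding for visual problem solving" can be made concrete with a small sketch. ARC tasks present a few input/output grid pairs plus a test input, so one way to use code is to treat each candidate program as a grid transformation and keep only those that reproduce every training pair. The candidate functions below are hand-written stand-ins for programs a model would write; `fits_examples` and `solve` are hypothetical names, and nothing here is taken from Poetiq's solver.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]  # each cell is an integer color code
Program = Callable[[Grid], Grid]


def identity(grid: Grid) -> Grid:
    return [row[:] for row in grid]


def transpose(grid: Grid) -> Grid:
    return [list(col) for col in zip(*grid)]


def flip_horizontal(grid: Grid) -> Grid:
    return [list(reversed(row)) for row in grid]


def fits_examples(program: Program, examples: List[Tuple[Grid, Grid]]) -> bool:
    """True if the program reproduces every training output exactly."""
    return all(program(inp) == out for inp, out in examples)


def solve(
    candidates: List[Program],
    examples: List[Tuple[Grid, Grid]],
    test_input: Grid,
) -> Optional[Grid]:
    """Apply the first candidate consistent with all examples, if any."""
    for program in candidates:
        if fits_examples(program, examples):
            return program(test_input)
    return None


examples = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]
candidates = [identity, transpose, flip_horizontal]
print(solve(candidates, examples, [[5, 6, 7]]))  # [[7, 6, 5]]
```

The training pairs act as an automatic check on generated code, which is one reason a model that writes better code could plausibly score higher on this kind of task.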
- 6:21 – 7:14
Model portability: swapping Gemini 3 with Anthropic Opus yields similar quality (higher cost)
Ian notes that Poetiq’s system can generalize across frontier models. They tested replacing Gemini 3 with Anthropic’s Opus and observed comparable quality but at higher cost, which reinforces their emphasis on efficiency as well as accuracy.
- •Opus (Anthropic) tested as a drop-in replacement for Gemini 3
- •Observed result quality was “pretty similar” across the two models
- •Cost matters: Opus described as more expensive
- •Poetiq is positioned as a model-agnostic, system-level capability layer
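One way to picture a model-agnostic layer is a pipeline that depends only on a small interface, so the underlying model can be swapped without changing the surrounding system. The sketch below is an assumption-laden illustration: `Model`, `StubModel`, and `Pipeline` are hypothetical names, the backends are stubs rather than real API clients, and the per-call costs are made-up units, not real prices.

```python
from dataclasses import dataclass
from typing import Protocol, Tuple


class Model(Protocol):
    """Minimal interface the pipeline relies on (hypothetical)."""

    name: str
    cost_per_call: float

    def complete(self, prompt: str) -> str: ...


@dataclass
class StubModel:
    """Stand-in backend; a real one would wrap a provider's API client."""

    name: str
    cost_per_call: float  # made-up units

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] answer to: {prompt}"


@dataclass
class Pipeline:
    """System-level logic that is identical for every backend."""

    model: Model
    calls_per_task: int = 4

    def run(self, task: str) -> Tuple[str, float]:
        outputs = [self.model.complete(task) for _ in range(self.calls_per_task)]
        total_cost = self.calls_per_task * self.model.cost_per_call
        return outputs[-1], total_cost


for backend in (StubModel("model_a", 1.0), StubModel("model_b", 3.0)):
    answer, cost = Pipeline(backend).run("solve task 17")
    print(answer, cost)
```

With this shape, quality and cost become properties of the chosen backend, which matches the observation in this chapter that a swap kept quality similar while changing cost.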
- 7:14 – 8:46
What’s next: more high-impact benchmarks, plus real customers
Poetiq plans to expand to additional benchmarks and continue research, but also wants commercial impact. Ian describes early customer conversations and a dual mandate: advance RSI research while delivering practical value for businesses.
- •Plans include tackling additional “high-impact” benchmarks (not named)
- •Continued research and iteration on their approach
- •Customer discovery and solutioning are now starting
- •Company goal: solve real business problems while pursuing RSI
- 8:46 – 9:16
Small team, outsized result—and the economics of hill-climbing
Ian shares that Poetiq is a six-person team (with a seventh joining soon), underscoring how much leverage system-level approaches can provide. He also explains that their optimization/hill-climbing is expensive to run, and they stopped before plateauing to conserve resources for customer work.
- •Team size: 6 people, with a 7th starting in January
- •They view the team as a key strength and source of execution speed
- •Hill-climbing/optimization runs are costly on ARC-AGI
- •They stopped optimization due to budget/prioritization, not because it plateaued
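The distinction between stopping on budget and stopping on a plateau can be shown with a short sketch. The function name, the stop reasons, and all numbers below are hypothetical; the scores are a made-up sequence standing in for successive optimization rounds.

```python
from typing import Iterable, Tuple


def run_with_budget(
    round_scores: Iterable[float],
    cost_per_round: float,
    budget: float,
    patience: int,
) -> Tuple[float, int, str]:
    """Return (best score, rounds run, reason the run stopped)."""
    best = float("-inf")
    spent = 0.0
    stale = 0   # consecutive rounds without improvement
    rounds = 0
    for score in round_scores:
        if spent + cost_per_round > budget:
            return best, rounds, "budget"
        spent += cost_per_round
        rounds += 1
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return best, rounds, "plateau"
    return best, rounds, "exhausted"


# Scores still rising when the money runs out: stopped by budget.
print(run_with_budget([0.40, 0.45, 0.50, 0.55, 0.60], 10.0, 30.0, patience=2))
# Scores flat for two rounds with budget to spare: stopped by plateau.
print(run_with_budget([0.40, 0.50, 0.50, 0.50], 10.0, 100.0, patience=2))
```

A "budget" stop leaves open the possibility that further rounds would have improved the score, which is the situation described in this chapter.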
- 9:16 – 11:23
Is RSI a path to AGI—and automating prompt engineers and agent builders
Ian argues RSI is both a practical way to get incremental performance bumps and, potentially, a path to AGI and beyond (though not the only path). He frames Poetiq as building a “factory” that automates what prompt engineers and agent designers do manually, turning manual trial-and-evaluation into a scalable power tool.
- •RSI provides immediate gains; may also scale toward AGI-level progress
- •Benchmark quirk: ARC-AGI accepts multiple solution attempts per task, which affects cost/benefit dynamics
- •Poetiq aims to automate manual workflow: prompt engineering and agent construction
- •Analogy: moving from hand-building systems to building a “factory” that builds them