YC Root Access: This Startup Beat Gemini 3 on ARC-AGI — at Half the Cost
At a glance
WHAT IT’S REALLY ABOUT
Poetic boosts Gemini 3 ARC-AGI scores via recursive optimization
- Poetic reports 54% on the ARC-AGI 2 private test set by running its system on top of Gemini 3, exceeding Gemini 3 DeepThink’s ~45% while costing about half as much.
- The company frames its core method as recursive self-improvement: an automated loop that iteratively improves a solver using evaluable tasks, without changing model weights.
- Because Poetic operates through API models, its “action space” is prompt-and-system design—using ensembles, multi-step refinement, and voting schemes rather than fine-tuning.
- A key catalyst was Gemini 3’s jump in code-writing ability for visual problem-solving, which also lifted Poetic’s ARC-1 performance (from roughly 89% to 95%) despite no training on ARC-2.
- Poetic positions this as both a practical performance “bump” for customers today and a potentially meaningful path toward AGI, while acknowledging cost/compute limits and the need to stop hill-climbing strategically.
IDEAS WORTH REMEMBERING
System-level optimization can beat a stronger baseline at lower cost.
Poetic claims 54% on ARC-AGI 2 using Gemini 3 underneath, outperforming Gemini 3 DeepThink’s ~45% while being ~50% cheaper, suggesting orchestration can dominate raw model choice in some settings.
Recursive self-improvement doesn’t require weight updates to be useful.
Their loop improves performance by iterating on prompts and the surrounding execution scaffold (refinement steps, aggregation), making RSI accessible even when only an API is available.
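The loop described above can be sketched as a simple hill climb: propose a variant of the solver’s prompt/scaffold, score it against evaluable tasks, and keep it only if it improves. This is a minimal illustrative sketch, not Poetic’s actual system; `mutate` and `score` are toy stand-ins for an LLM-driven proposer and a real evaluation harness.

```python
import random

def mutate(prompt: str, rng: random.Random) -> str:
    """Toy stand-in for an LLM proposing a revised prompt; appends a hint."""
    hints = ["Think step by step.", "Write code to solve it.", "Check your answer."]
    return prompt + " " + rng.choice(hints)

def score(prompt: str) -> float:
    """Toy stand-in for running the solver on evaluable tasks and measuring accuracy."""
    return sum(hint in prompt for hint in ("step", "code", "Check")) / 3

def hill_climb(seed_prompt: str, rounds: int = 20, seed: int = 0) -> tuple[str, float]:
    rng = random.Random(seed)
    best, best_score = seed_prompt, score(seed_prompt)
    for _ in range(rounds):
        candidate = mutate(best, rng)
        s = score(candidate)
        if s > best_score:
            # Keep only improvements: no model weights change anywhere,
            # only the prompt and surrounding scaffold evolve.
            best, best_score = candidate, s
    return best, best_score

best_prompt, acc = hill_climb("Solve the ARC task.")
```

The point of the sketch is the shape of the loop, not the toy scoring: because the optimizer only ever edits text fed to an API model, it works even when fine-tuning is impossible.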
Ensembles plus refinement plus voting are the practical levers.
Poetic repeatedly calls the base model to produce/refine multiple candidate solutions and then combines them with a voting scheme, treating the solver as a “system” rather than a single prompt.
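The solve-time system described above can be sketched as sample, refine, then vote. This is an illustrative sketch only; `call_model` is a hypothetical stand-in for the underlying API model (e.g. Gemini 3), here made deterministic so the example runs offline.

```python
from collections import Counter

def call_model(prompt: str, sample: int) -> str:
    # Hypothetical stand-in: a real system would call the model API here,
    # with `sample` indexing independent sampled completions.
    return "B" if sample % 3 else "A"

def refine(prompt: str, draft: str, sample: int) -> str:
    # Stand-in for a "critique and revise" follow-up call on each draft.
    return call_model(f"{prompt}\nDraft: {draft}\nRevise it.", sample)

def solve(prompt: str, n_candidates: int = 5) -> str:
    drafts = [call_model(prompt, i) for i in range(n_candidates)]
    revised = [refine(prompt, d, i) for i, d in enumerate(drafts)]
    # Majority vote over the refined candidates; ties broken by first seen.
    return Counter(revised).most_common(1)[0][0]

answer = solve("What letter does the grid encode?")
```

Treating the solver as this whole pipeline, rather than a single prompt, is what makes the ensemble/refinement/voting structure itself an optimizable object.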
Model capability shifts (especially coding) can unlock sudden benchmark jumps.
They attribute the “holy cow” ARC-2 gains to Gemini 3 being unusually strong at writing code for visual problem solving, which made their previously ARC-1-trained solver transfer better than expected.
Benchmark rules can change the apparent cost-performance frontier.
Ian notes ARC-AGI allows submitting two solutions; Poetic’s improvement let it submit one and still beat a baseline that uses two—making it look cheaper in that specific multi-response setting.
WORDS WORTH SAVING
We just announced a pretty exciting result: with Poetic on top of Gemini 3, we have 54% on the ARC 2 private test set evaluation, which is a very, very exciting increase over the previous state of the art.
— Ian Fisher
So yeah, 9, 10 percentage points better and half the cost.
— Ian Fisher
I realized there was a much faster and cheaper way to do recursive self-improvement, where the AI is making itself smarter... recursive self-improvement is kind of the holy grail of AI.
— Ian Fisher
It's the prompt and the system around the prompt.
— Ian Fisher
We are quite intentionally automating ourselves. Automating prompt engineers, automating people who are building agents. It's a power tool, right?
— Ian Fisher
High quality AI-generated summary created from speaker-labeled transcript.