At a glance
WHAT IT’S REALLY ABOUT
Hands-on Opus 4.8 review: strong one-shots, weak edge reliability
- Opus 4.8 impressed in a greenfield, one-shot build by planning and shipping a working prototype that matched requested architecture.
- Performance degraded in the “last 10%” as iteration continued, with repeated edge-case bugs and notable hallucinations during debugging.
- When applied to existing codebases (rebases and branch fixes), the model struggled to orient itself and required many corrective cycles.
- In business strategy comparisons, Opus 4.7 produced more data-anchored, structured analysis while Opus 4.8 skewed hand-wavy and over-weighted small data points.
- Ergonomics were strong—fast, token-efficient, and pleasant voice—but the reviewer questions whether efficiency came at the cost of grounding and accuracy.
IDEAS WORTH REMEMBERING
5 ideasOpus 4.8 is best used for greenfield prototypes and first drafts.
In a one-shot build of a new prototyping capability, it planned and shipped working code quickly and followed architectural constraints, making it strong for rapid initial implementations.
Expect reliability issues during iterative polishing and edge-case handling.
Once the prototype moved from “works” to “make it robust,” the model began introducing bugs repeatedly, aligning with the reviewer’s core theme: it performs well until the final, detail-heavy stretch.
Hallucinations reappeared in practical debugging and business tasks.
Claire reports the model making claims based on hypothesis rather than evidence, including admitting it didn’t actually validate bugs or search sources—so confidence cannot be treated as proof.
Existing codebases are a stress test where Opus 4.8 can lose its footing.
When asked to rebase and reconcile branches after a foundational PR, the model needed cycle after cycle of fixes and struggled to understand boundaries and “where to operate” within the code.
Agentic “ambition” may be lower than expected without heavy prompting and scaffolding.
Even when pushed to create impressive games for a nine-year-old (including a 3D follow-up), outputs were cool but not the “10x blow-my-mind” agentic leap the reviewer expected.
WORDS WORTH SAVING
5 quotesIt does really, really well until it doesn't do well, and I found it did not do well consistently over time with the same types of trouble.
— Claire Vo
Where it failed was this last 10%, and this is really gonna be my theme of this episode.
— Claire Vo
But over my experience early testing Opus 4.8, both on business use cases as well as coding use cases, it 100% made up things based on hypothesis, not data.
— Claire Vo
It's just over-tuned and has kind of narrow vision. So it's smart, it's fast, it's efficient, but it's overly confident absent true validation.
— Claire Vo
I would use it for greenfield prototypes. It's really impressive on a one-shot.
— Claire Vo
High quality AI-generated summary created from speaker-labeled transcript.
