No hype Claude Opus 4.8 review—my real experience

I got a few hours of early-access testing with Anthropic’s newly released model Opus 4.8. I walk through real coding, design, and strategy tasks across Claude Code and Claude Cowork, and give you my unfiltered view on what impressed me and what didn’t.

*What you’ll learn:*
1. Where Opus 4.8 excels: greenfield prototypes, one-shot features, and fast execution
2. Where it struggles: the last 10%, edge cases in existing codebases, and hallucinations
3. How Opus 4.8 compares to Opus 4.7 on business strategy work
4. Why I’m still reaching for Opus 4.7 on data-heavy strategy and roadmap work
5. The new features shipping alongside the model: dynamic workflows with parallel subagents and effort control in Claude.ai and Cowork
6. The prompting and harness strategy I’d use to get the most out of it

*In this episode, we cover:*
(00:00) Introduction to Opus 4.8
(00:44) Benchmark performance and pricing
(01:53) First coding test: Building a prototyping tool
(03:00) Where it failed: The last 10% problem
(03:27) The hallucination problem
(04:23) Testing Opus 4.8 on existing codebases
(05:24) The ambition test: Building games for a 9-year-old
(07:03) Business strategy test: 4.7 vs 4.8
(08:23) The roadmap test
(09:17) Final verdict

*References:*
• System Card: Claude Opus 4.8: https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc0925c805a1a7ca77314ccbf4a6.pdf
• Introducing Claude Opus 4.8 on X: https://x.com/claudeai/status/2060042702150930686?s=20

*Where to find Claire Vo:*
ChatPRD: https://www.chatprd.ai/
Website: https://clairevo.com/
LinkedIn: https://www.linkedin.com/in/clairevo/
X: https://x.com/clairevo

_Production and marketing by https://penname.co/._
_For inquiries about sponsoring the podcast, email jordan@penname.co._

Claire Vo, host

May 27, 2026 · 13m

WHAT IT’S REALLY ABOUT

Hands-on Opus 4.8 review: strong one-shots, weak edge reliability

Opus 4.8 impressed in a greenfield, one-shot build by planning and shipping a working prototype that matched requested architecture.
Performance degraded in the “last 10%” as iteration continued, with repeated edge-case bugs and notable hallucinations during debugging.
When applied to existing codebases (rebases and branch fixes), the model struggled to orient itself and required many corrective cycles.
In business strategy comparisons, Opus 4.7 produced more data-anchored, structured analysis while Opus 4.8 skewed hand-wavy and over-weighted small data points.
Ergonomics were strong: the model was fast and token-efficient, with a pleasant voice, but the reviewer questions whether that efficiency came at the cost of grounding and accuracy.

IDEAS WORTH REMEMBERING

5 ideas

Opus 4.8 is best used for greenfield prototypes and first drafts.

In a one-shot build of a new prototyping capability, it planned and shipped working code quickly and followed architectural constraints, making it strong for rapid initial implementations.

Expect reliability issues during iterative polishing and edge-case handling.

Once the prototype moved from “works” to “make it robust,” the model began introducing bugs repeatedly, aligning with the reviewer’s core theme: it performs well until the final, detail-heavy stretch.

Hallucinations reappeared in practical debugging and business tasks.

Claire reports that the model made claims based on hypothesis rather than evidence, and that it admitted it had not actually validated bugs or searched sources, so its confidence cannot be treated as proof.

Existing codebases are a stress test where Opus 4.8 can lose its footing.

When asked to rebase and reconcile branches after a foundational PR, the model needed cycle after cycle of fixes and struggled to understand boundaries and “where to operate” within the code.

Agentic “ambition” may be lower than expected without heavy prompting and scaffolding.

Even when pushed to create impressive games for a nine-year-old (including a 3D follow-up), outputs were cool but not the “10x blow-my-mind” agentic leap the reviewer expected.

WORDS WORTH SAVING

5 quotes

It does really, really well until it doesn't do well, and I found it did not do well consistently over time with the same types of trouble.

— Claire Vo

Where it failed was this last 10%, and this is really gonna be my theme of this episode.

— Claire Vo

But over my experience early testing Opus 4.8, both on business use cases as well as coding use cases, it 100% made up things based on hypothesis, not data.

— Claire Vo

It's just over-tuned and has kind of narrow vision. So it's smart, it's fast, it's efficient, but it's overly confident absent true validation.

— Claire Vo

I would use it for greenfield prototypes. It's really impressive on a one-shot.

— Claire Vo

Anthropic benchmarks and pricing (SWE-bench Pro, cost)
Greenfield one-shot feature build in Claude Code
The “last 10%” problem in iterative refinement
Hallucinations and lack of grounding/validation
Working inside existing codebases (rebasing, edge cases)
Ambition ceiling in agentic/creative coding tasks
Strategy/roadmap comparison: Opus 4.7 vs 4.8
New harness features: dynamic workflows, effort controls

High quality AI-generated summary created from speaker-labeled transcript.
