No hype Claude Opus 4.8 review—my real experience

I got a few hours of early-access testing with Anthropic’s newly released model Opus 4.8. I walk through real coding, design, and strategy tasks across Claude Code and Claude Cowork, and give you my unfiltered view on what impressed me and what didn’t.

*What you’ll learn:*
1. Where Opus 4.8 excels: greenfield prototypes, one-shot features, and fast execution
2. Where it struggles: the last 10%, edge cases in existing codebases, and hallucinations
3. How Opus 4.8 compares to Opus 4.7 on business strategy work
4. Why I’m still reaching for Opus 4.7 on data-heavy strategy and roadmap work
5. The new features shipping alongside the model: dynamic workflows with parallel subagents and effort control in Claude.ai and Cowork
6. The prompting and harness strategy I’d use to get the most out of it

*In this episode, we cover:*
(00:00) Introduction to Opus 4.8
(00:44) Benchmark performance and pricing
(01:53) First coding test: Building a prototyping tool
(03:00) Where it failed: The last 10% problem
(03:27) The hallucination problem
(04:23) Testing Opus 4.8 on existing codebases
(05:24) The ambition test: Building games for a 9-year-old
(07:03) Business strategy test: 4.7 vs 4.8
(08:23) The roadmap test
(09:17) Final verdict

*References:*
• System Card: Claude Opus 4.8: https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc0925c805a1a7ca77314ccbf4a6.pdf
• Introducing Claude Opus 4.8 on X: https://x.com/claudeai/status/2060042702150930686?s=20

*Where to find Claire Vo:*
ChatPRD: https://www.chatprd.ai/
Website: https://clairevo.com/
LinkedIn: https://www.linkedin.com/in/clairevo/
X: https://x.com/clairevo

_Production and marketing by https://penname.co/._
_For inquiries about sponsoring the podcast, email jordan@penname.co._

Claire Vo, host

May 28, 2026 · 13 min

CHAPTERS

0:04 – 0:34
Opus 4.8 first impressions and what Claire tested in early access
Claire sets the context: she had a few hours of early access to Anthropic’s new Opus 4.8 and wants to share real-world impressions rather than hype. She frames the review around where it shines (especially quick, greenfield coding) and where it breaks down (edges, bugs, and grounding).
- Mini-episode recorded outside the usual studio due to early access timing
- Goal: identify intended strengths vs. real performance in practice
- Preview of recurring theme: strong start, weak last-mile execution
0:34 – 2:05
What Anthropic claims: benchmarks, agent focus, and pricing tradeoffs
Claire summarizes Anthropic’s positioning of Opus 4.8 as a step-change model for agents: more honest, longer-horizon autonomy, and more instruction-following for enterprise. She notes benchmark claims (SWE-bench Pro gains over 4.7 and other models) and highlights the high cost.
- Positioned as better for agents: honesty, less “designed flop,” autonomy, enterprise instruction-following
- Benchmark highlight: 69.2% on SWE-bench Pro, up ~5 points vs Opus 4.7
- Pricing called out as expensive: $5 per million input tokens, $25 per million output tokens
- Effort defaults to high; fast mode promises lower latency
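To make the quoted pricing concrete, here is a minimal sketch of the arithmetic. It assumes only the two per-million-token rates listed above; the function name and the token counts in the example session are hypothetical and do not come from the episode.

```python
# Rates as quoted in the summary above (USD per million tokens).
INPUT_USD_PER_MILLION = 5.00
OUTPUT_USD_PER_MILLION = 25.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost in USD of one session at the quoted rates."""
    input_cost = input_tokens / 1_000_000 * INPUT_USD_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_USD_PER_MILLION
    return input_cost + output_cost


# Hypothetical session sizes, chosen only for illustration.
session_cost = estimate_cost_usd(input_tokens=2_000_000, output_tokens=200_000)
print(f"Estimated session cost: ${session_cost:.2f}")
```

At these rates an output token costs five times as much as an input token, so long generated outputs can account for a large share of the bill even when input volume is much higher.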
2:05 – 2:36
Greenfield coding test: one-shot prototyping tool built successfully
She describes a first coding experiment in Claude Code: asking Opus 4.8 to build a full prototyping capability for ChatPRD with specified architecture choices. The model planned and coded autonomously for ~20 minutes, and the preview deployment worked—impressing her for a one-shot feature build.
- Task: build a prototyping tool inside ChatPRD with given architectural constraints
- Workflow: plan first, then autonomous coding session (~20 minutes)
- Result: shipped and worked when pushed to a preview branch
- Takeaway: strong performance on one-shot builds of brand-new surface area
2:36 – 3:06
The “last 10% problem”: edge cases, iteration, and bugs after the initial win
Once the feature worked at a basic level, Claire tried to push it further with successive improvements. That’s where Opus 4.8 began to struggle—shipping bugs and faltering on the details—setting the theme that it performs well until edge cases appear.
- Initial spec-to-feature success didn’t carry through iterative refinement
- Repeated failures show up in edge cases and implementation details
- Model struggles to consistently finish and polish beyond the first working version
3:06 – 4:37
Hallucinations during bug-hunting: hypotheses presented as facts
Claire reports seeing unusually direct hallucinations when Opus 4.8 was asked to debug. Even with high effort and scoped follow-ups, it made claims based on hypotheses rather than grounding in observed code behavior or validated data.
- Hallucinations surfaced specifically during bug investigation and follow-up prompts
- Issue occurred despite high-effort mode and scoped tasks
- Perceived lack of grounding/verification compared with her recent experiences in other models
4:37 – 5:07
Existing codebases test: rebases and integration work expose brittleness at the edges
When she pointed Opus 4.8 at in-flight branches that needed rebasing after a big underlying PR, it repeatedly introduced edge-case bugs. Claire characterizes the failure mode as difficulty understanding where to “insert itself” and what level of abstraction it should operate at inside real, messy code.
- Use case: rebase branches and fix issues caused by shifting base code
- Needed multiple cycles of rebase/fix due to recurring edge-case errors
- Model struggled with boundaries, context, and the right operational elevation in an existing codebase
5:07 – 6:38
Ambition test: building games for a 9-year-old wasn’t as “agentic” as expected
Claire tried a playful prompt to see how far the model could push agentic coding—creating a game, then iterating by “playing” and tuning it for fun. While the outputs were impressive relative to human effort, she felt the results didn’t reach the ambitious, boundary-pushing level she expected.
- Prompt: propose fun one-shot builds a nine-year-old would love
- Great idea from the model: build, playtest via screen, and auto-tune difficulty
- Actual shipped games were cool but not “10X blow-my-mind” agentic outcomes
- Even when asked for more (e.g., 3D), ambition plateaued
6:38 – 7:09
Coding takeaway: serviceable overall, but struggles with polish, context, and ambition
She synthesizes her coding results: Opus 4.8 is not bad—often strong on initial construction—but reliability drops on the final 10%, orientation in existing code is shaky, and the model seems less bold in scope than peers she’s used.
- Good at one-shot builds; weaker at sustained refinement
- Trouble operating inside existing, real-world codebases
- Perceived lack of ambition when pushed to do more creative/agentic builds
7:09 – 8:09
Business strategy showdown: Opus 4.7 stays data-anchored while 4.8 feels hand-wavy
Claire compares Opus 4.7 vs 4.8 on a strategy prompt using the same business context: analyzing how she spent time vs what would 10X the business, then writing a strategy prompt. She found 4.7 more structured and numbers-grounded, while 4.8 struggled to retrieve relevant evidence and over-weighted small signals.
- Same prompt and same data access for 4.7 vs 4.8
- 4.7: structured, table-driven, clearly grounded in real metrics
- 4.8: harder time finding relevant data; over-rotated on minor data points
- Result: large quality gap in strategic usefulness
8:09 – 9:10
Roadmap test and verification gap: 4.8 admits it didn’t check sources
In a follow-up request to create a roadmap, Claire again saw 4.7 produce specifics while 4.8 stayed vague. When challenged, 4.8 often acknowledged it didn’t actually search GitHub or validate information—reinforcing her concern that its confidence wasn’t backed by real checks.
- Roadmap prompt magnified the difference: 4.7 specific vs 4.8 vague
- Direct questioning revealed frequent non-verification (“No, I didn’t search/validate”)
- Pattern supports earlier hallucination/grounding concerns
- Claire’s preference shifts toward 4.7 for strategy work
9:10 – 10:11
What 4.8 gets right: pleasant voice, token efficiency, speed, and fewer “slop tells”
Despite output issues, Claire praises the user experience: the model’s writing voice is easy to read, not overbearing, and feels efficient. She also notes strong speed—especially in fast mode—plus improvements like removing annoying stylistic quirks.
- Readable, non-annoying voice; fewer stylistic tells
- Token-efficient responses: “enough, but not too much”
- Fast experience (especially expected with Fast Mode)
- Ergonomics strong even if content quality sometimes missed the mark
10:11 – 11:11
Claire’s theory: over-tuned efficiency leads to narrow vision and overconfidence
She hypothesizes that Opus 4.8’s speed and efficiency may come with a cost: it latches onto a few details, extrapolates, and becomes overly confident without validation. This shows up both in code and strategy as “missing the forest for the trees.”
- Model appears smart/fast but insufficiently grounded
- Tends to latch onto specific points and treat them as truth
- Potential tradeoff: efficiency vs accuracy and self-validation depth
- Didn’t fully match expectations of “more honest” long-horizon autonomy
11:11 – 13:39
Final verdict and what to use it for (plus new Claude features to watch)
Claire concludes Opus 4.8 is good but not mind-blowing in her early access testing. She’d use it for greenfield prototypes and tool-use-heavy work, but would be cautious in existing codebases and number-heavy strategy unless prompting/harnessing is improved; she also notes new Claude Code parallel workflows and effort controls across products.
- Best fit: greenfield prototypes and one-shot feature generation
- Caution zones: existing codebases, edge cases, and numeric strategy work
- Advice: double-check confident claims; validate grounding and sources
- Feature updates: dynamic workflows with parallel subagents; effort controls in Claude.ai/Cowork
- She’ll keep testing given strong benchmark claims
