CHAPTERS
What Opus 4.8 is and what Claire tested in early access
Claire Vo explains she had a few hours of early access to Anthropic’s new Claude Opus 4.8 and is sharing an experience-based review rather than hype. She frames the evaluation around where the model shines, where it breaks down, and what patterns repeatedly showed up across coding and business tasks.
Benchmarks, positioning, and cost: what Anthropic claims vs real-world expectations
She summarizes Anthropic’s positioning: a step-change model for agents that’s more honest, better for long-horizon autonomy, and more enterprise-ready in instruction following. She also highlights the benchmark claims and notes the high price, setting the expectation that the model looks impressive “on paper.”
Greenfield one-shot coding win: building a prototyping tool end-to-end
Claire’s first major test was a complex, brand-new feature request in Claude Code: building a prototyping capability inside ChatPRD. The model planned, coded autonomously for ~20 minutes, shipped to a preview branch, and the feature worked—impressing her for greenfield, one-shot delivery.
The “last 10%” problem: edge cases, iteration, and bug shipping
After the initial success, the model struggled as Claire tried to level up the feature through follow-on iterations. She describes a consistent failure mode: it performs very well initially, then begins introducing bugs and missing edge-case details when refining or extending work.
Hallucinations during debugging: confident hypotheses over grounded checks
While bug hunting, Claire observed what she calls clear hallucinations—claims made from hypothesis rather than evidence. She notes this surprised her because she hadn’t seen such direct hallucinations in a while, and it happened even under high-effort settings and scoped prompts.
Working in existing codebases: rebases, branch fixes, and “edges destroy it”
When Claire pointed Opus 4.8 at existing code and real maintenance tasks, it had trouble inserting itself correctly, especially around the boundaries of what it should touch. A concrete example was rebasing and fixing branches after a major underlying PR, which led to repeated cycles of rebase-and-fix due to edge-case bugs.
Ambition test: building games for a 9-year-old and pushing agentic creativity
Claire tried a playful agentic-coding test: generate “rad” one-shot projects for her nine-year-old and push for more interesting, boundary-pushing outputs. While the model produced functional games (including attempts at more advanced versions like 3D), she felt it lacked the ambition and wow-factor she’s seen elsewhere.
Coding verdict so far: serviceable, but weak on last-mile polish and codebase orientation
She consolidates her coding impressions: Opus 4.8 isn’t bad, and it is capable and often impressive, yet it repeatedly falters on the last 10%, struggles more in existing codebases, and doesn’t naturally escalate ambition. This makes it better suited to prototypes than to complex maintenance or deep integration without careful oversight.
Business strategy comparison: Opus 4.7 feels grounded; 4.8 feels hand-wavy
Switching to business tasks in Claude Cowork, Claire ran the same prompts on Opus 4.7 and 4.8: analyze how she spent her time versus her priorities to 10x the business, then write a strategy prompt. She found 4.7 more structured and data-anchored, while 4.8 struggled to surface relevant data and over-weighted minor points.
Roadmap test and tool-use honesty: “No, I didn’t search” admissions
When asked to produce a roadmap, Claire again found 4.7 more specific and strategically sound, while 4.8 stayed vague. She directly questioned whether 4.8 had searched GitHub or validated information, and it repeatedly admitted it had not—reinforcing her concern that it was asserting conclusions without real verification.
What she liked anyway: voice, ergonomics, speed, and “not annoying” outputs
Despite issues with grounding and ambition, Claire highlights user-experience positives. She found the writing voice clean and token-efficient, without the “slop tells” or irritating stylistic quirks. She also found the model fast, and she expects Fast Mode in particular to feel snappy in production.
Her theory: over-tuned efficiency causes narrow vision and overconfidence
Claire’s working hypothesis is that Opus 4.8 is optimized for speed and efficiency in a way that narrows context and encourages overconfident conclusions without validation. In both code and strategy, she felt it missed the forest for the trees—latching onto specific details and treating them as truth.
Final verdict and who should use it: great for prototypes, risky at real-world edges, plus new product features
Claire concludes Opus 4.8 is “good, not mind-blowing”: magical for rapid greenfield prototypes, design/ergonomics, tool use, and speed, but risky around existing codebases, edge cases, and number-heavy strategy without stronger prompting and verification. She also notes new Claude product capabilities (dynamic workflows, effort controls) that may change how teams harness the model.
