- 0:00 – 0:44
Introduction to Opus 4.8
- Claire Vo
[upbeat music] Welcome back to How I AI. I'm Claire Vo, product leader and AI obsessive, here on a mission to help you build better with these new tools. Today, we have a very special mini episode because Anthropic just dropped Opus 4.8, their latest state-of-the-art coding model. And I got a few hours of early access, and I'm here to share my very early thoughts about where this model is intended to perform well, where it did a great job and totally impressed me, and where there's still a little bit further to go. Let's get to it. As you can tell, I am not in my regular How I AI studio, and that's because I am so excited to give you my early thoughts on Opus 4.8, and couldn't wait between meetings to share
- 0:44 – 1:53
Benchmark performance and pricing
- Claire Vo
what I thought. So to get started, I wanna talk about what this model is, what Anthropic has told us about its benchmarks, performance, and what it's good at. So Anthropic is shipping Opus 4.8. It is supposed to be their step change model for agents, and there's a couple things they've called out that this model does particularly well. It's supposed to be more honest, less design slop, longer horizon autonomy on long-running tasks, and enterprise ready, so it means it follows its instructions. And they're saying that SWE-bench Pro, they're hitting 69.2%, which is almost five points higher than Opus 4.7, almost 10 points higher than GPT 5.5, and 15 points higher than Gemini 3.1. Now, this model is not cheap. It's $5 per million input tokens and $25 per million output tokens. And then same as 4.7, effort defaults to high, and fast mode can be a lot faster. This is what they say. This is what you're gonna read on the blog post. And so on paper, this is a very exciting model, but I wanna tell you my personal experience using this model, and where I thought it did a really good job, and again, where it did not
- 1:53 – 3:00
First coding test: Building a prototyping tool
- Claire Vo
do a perfect job. And so when I was giving feedback to the team, I said, "Surprise, surprise, LOL, it's a good coding model," in that when I opened up Claude Code and asked it to do a fairly complex one-shot brand-new surface area task, it did a pretty good job. So I asked, um, in Claude Code, Opus 4.8 to build a prototyping capability in ChatPRD. So we make PRDs. I said, "Let's just go whole hog. Let's, uh, compete with the big boys. Let's make an entire prototyping tool." And I gave it some architecture decisions I wanted to make, what platforms I wanted to use, how I wanted it to function. It went through plan, and then it autonomously coded for, I would say, about 20 minutes, and shipped it. And when I pushed this live to my preview branch, it worked. Um, and so I would say from a one-shot feature, it did quite a good job. The code was right, and it followed the architecture I want. Where it failed was this last 10%, and this is really gonna be my theme of this episode. It does really, really well until it doesn't do well, and I found it did not do well consistently over time with the same types of trouble.
- 3:00 – 3:27
Where it failed: The last 10% problem
- Claire Vo
So I'm curious, as you all get your hands on this model, if you have the same experience I did, where it does, like, really, really, really well, and then struggles in the edge cases and the details. So what it nailed here is it did take the spec, it planned the work, it shipped the feature. But then as soon as I got it live and started trying to take it to the next level, the next level, the next level, it really struggled and started to ship bugs. And even more than its inability to finish that last 10%, when it was bug
- 3:27 – 4:23
The hallucination problem
- Claire Vo
hunting, it hallucinated. And I am gonna tell you, I have not seen a straight-up hallucination in a very, very, very long time. But over my experience early testing Opus 4.8, both on business use cases as well as coding use cases, it 100% made up things based on hypothesis, not data. And this was really interesting to me. This was on high effort, and so I don't think it was effort or reasoning. There's something about this model where it's really not grounding itself as effectively as I've seen in other models. Again, this was a one-shot, but then very specifically prompted up on scoped surface areas for follow-ups. Like I saw a bug in, um, the preview branch and got these hallucinations, and so this is a really interesting reflection of this bug. I'm gonna have to run at it a little bit more to see if this holds over time with coding use cases, but it was kind of the theme of my test here.
- 4:23 – 5:24
Testing Opus 4.8 on existing codebases
- Claire Vo
Okay, this headline is a little dramatic. It says, "In real codebases, the edges destroy it." This is not Opus 4.8. This is just Claude Cowork fail here. It doesn't have the screen, um, the screenshot, so I'll have to show you the GitHub for this. But basically what I saw is when I pointed it at existing code, it also struggled to sort of insert itself and understand the edges of where it was supposed to work. So let me give you an example of this. I had a couple branches in flight that I needed to rebase, that I needed to bring up to base because we had shipped a big underlying PR, and it kind of messed up, um, the state of the code. And so I asked Opus 4.8 to rebase and check these branches for code. And as you can see here, I had to do cycle after cycle of rebase and fixes because it was continuing to ship really edge case bugs into the code. And again, this was my experience. I thought it did a really good job one-shot on a surface area, but then when you got into the specifics, it struggled to understand the elevation at which it should be operating.
- 5:24 – 7:03
The ambition test: Building games for a 9-year-old
- Claire Vo
The third Git coding use case that I tried was a fun one, which is I pulled up Claude Code and asked it, "Just what are some one-- fun things we can one-shot with Claude Code that my nine-year-old would think is rad?" And I really tried to push it to say, "Make it really interesting. Think about the edges of agentic coding." And aside from the code quality itself, which I struggled with, sort of had highs and lows, the other thing I reflected on when I was coding with Opus 4.8 is it just wasn't ambitious enough. And so it gave me this awesome prompt, which was, "Build a game, then play it yourself by watching the screen and tweaking the difficulty until it's fun for a nine-year-old." Amazing. This is state-of-the-art coding agent. It's gonna cook. Let me show you what it actually shipped. It shipped this, which is, like, fine. Of course magic. Like, I would've never been able to ship this by myself without a lot of effort. But not pushing the edges of agentic coding. And even when I said, "Great, let's make it 3D. Let's do something even more fun," it shipped something like this, which again, is super cool. I would've never been able to do this, but it's not 10X agentic coding blow my mind impressive. And so this is where I really struggled with Opus 4.8, is I kept saying, "More, more. Do better, do better," and it just wasn't as ambitious as I've seen other models be. So in terms of coding, I think it does a totally serviceable job. I wouldn't say it's bad at coding. I would just say my experience has been it struggles with the last 10%, it's not exceptional at orienting itself inside existing codebases, and that it's just not that ambitious.
- 7:03 – 8:23
Business strategy test: 4.7 vs 4.8
- Claire Vo
Now let's talk about business work. So I also tested Opus 4.8 in Claude Cowork, and I tested it on strategy. And I gave it this very broad prompt, and I tested 4.7 vs 4.8. And I basically said, "Based on what you can gather about my last three months, where am I spending my time versus where my priorities should be if I want to 10X my business?" I gave it access to all the same business context. And then once it did that analysis, I said, "Please write me a strategy prompt." And this is where the performance of Opus 4.7 vs Opus 4.8 really became apparent. Opus 4.7 was very numbers anchored. You can see this table here. I obfuscated some of the numbers, but, um, it was very numbers anchored. It was very structured and, and rooted in real data. While both of these exercises did have access to the same data, Opus 4.8 had a harder time discovering the relevant data, and it over-rotated on small data points and took them as truth, as opposed to what I experienced Opus 4.7 doing, which is it zoomed out a lot more and put everything in context. Now again, this is mutual one-shots. It was, like, basically two-shot. It was like, "Analyze my time, and then give me a strategy to grow my business." But the difference between these two were very,
- 8:23 – 9:17
The roadmap test
- Claire Vo
very high. I then asked it a follow-up prompt to build a roadmap, and again, 4.7, very anchored in specifics, very good strategy, and 4.8 was incredibly hand-wavy. And in fact, with Opus 4.8, it gave me a roadmap, and then I said, "We have all this. Did you search through GitHub? Did you look online?" And what's really funny, again, with the hallucination, is you see here, "No, I didn't." This is a common thing that I had Opus 4.8 say to me. "No, I didn't search GitHub. No, I didn't actually look up that data. No, I didn't actually validate that bug." Now again, this was early access, so I'm not 100% sure if this is prompting error, if it's the shape of the model, if it's the harness that needs to be tuned. But I consistently got this experience of the model hallucinating or over-rotating on a hypothesis it had, as opposed to being anchored in true code truth or in true business truth. And so
- 9:17 – 13:38
Final verdict
- Claire Vo
I honestly would continue to reach for Opus 4.7, which I think did an exceptional job on strategy, versus 4.8, which I think was a lot more hand-wavy and just over-rotated on things I didn't think was important. Now, that being said, what positively impressed me? Voice is great. Claude is not an annoying girlfriend, is what I would say. It was easy to read. It didn't have slop tells. It was token efficient. It felt like it was talking enough, but not too much. And it was fast. Now, I got early access. Who knows what the production latency is? But with Fast Mode, I anticipate you'll have this fast experience. So I think the ergonomics were very nice. Now, if we zoom out and I say the writing was very good, and then Opus 4.7 wrote this slide. I don't know if I love this slide that much, so hopefully 4.8 would've done a better job [laughs] with the voice and ergonomics. But I do think the experience of using the model was very nice. I had no complaints. It was not annoying. It did not have tics and tells. Just the outputs were not exactly what I wanted. So here's my theory, and this is what I saw. It's just over-tuned and has kind of narrow vision. So it's smart, it's fast, it's efficient, but it's overly confident absent true validation. That's what I would want you to walk away from in my review of Opus 4.8. It really latches onto specific data points, specific code points. It draws conclusions from them, and then says, "This must be truth." And so it sort of misses the forest for trees, both in coding and in strategy, and this might be part of its efficiency. Like, I thought it was super efficient, but does that come at the cost of accuracy? And would I rather a long-running, sort of relentless coding model really going deep and validating its own opinion before shipping? So I didn't quite experience, I would say, this more honest and long-horizon autonomy. I did see it was fast. I did find it was enjoyable to work with.
I think it followed instructions well, but it stayed too much in scope, if that makes sense, because it didn't zoom out and contextualize the work that it's doing. So my verdict, I mean, all these models are great. They're all magic, so, like, let's, let's be real. Every model is magic. [laughs] The fact that I could do any of this in just a couple hours is pretty, pretty genius. But I would use it for greenfield prototypes. It's really impressive on a one-shot. I think its design is better. It got rid of the italics emphasis words, which were driving me crazy from Claude Design. It's good at tool use. It's fast. It's not annoying. Um, where I would test it and really figure out the right prompting strategy and the right harness strategy is with existing codebases in branches with real edge cases, with strategy work that requires you to think about numbers. And again, you can probably prompt this, but I would just think about that prompting, and I would just double check where it's really confident, because my experience was its confidence was not rooted in fact. Again, I'm really excited to see this model come out. There's a couple more features as well in Claude Code, as well as Claude.ai and Cowork. In Claude Code, you now have dynamic workflows which can let you spin off hundreds of parallel subagents. And in Claude.ai and Cowork, you now can set effort control from low to max, which you were able to do in Claude Code. So these are all really interesting shifts in both the harness and the model. I would say it's a good model. It's not the most amazing model. It didn't blow my mind. It has some quirks to it, which I think you need to be aware of. But I'm definitely gonna keep testing it, because with the benchmarks, with the work that's gone into the product, I think it's a model worth keeping your eye on. So that's it. That's my quick review of Opus 4.8 that just came out today from Anthropic.
I'd love to hear your experience with it, especially how it does in coding, how it does in design, and whether or not it gives you strategy anchored in reality. Thanks for joining How I AI. [upbeat music] Thanks so much for watching. If you enjoyed the show, please like and subscribe here on YouTube, or even better, leave us a comment with your thoughts. You can also find this podcast on Apple Podcasts, Spotify, or your favorite podcast app. Please consider leaving us a rating and review, which will help others find the show. You can see all our episodes and learn more about the show at howiaipod.com. See you next time.
Episode duration: 13:39
Transcript of episode h0gZf1hL4D4
