I benchmarked the NEW Sonnet 5. The results shocked me.

I’ve been testing every major frontier model release since the start of the year, and when Anthropic dropped Sonnet 5, I wanted more than a vibe check. I got tired of one-off tests I couldn’t repeat or compare over time, so I built something better: the How I AI Bench, a repeatable eval harness I constructed live using Claude Code while recording this episode. I ran Sonnet 5 blind against four other frontier models (Sonnet 4.6, Opus 4.8, GPT-5.5, and Gemini 3 Pro) across PRD quality, prototype generation, agentic task completion, and agent personality. The results were not what I expected.

*What you’ll learn:*
1. What Anthropic claims Sonnet 5 improves over Sonnet 4.6, and where the benchmark data actually backs that up
2. How I built the How I AI Bench in under 45 minutes using Claude Code, starting from my own stored session history
3. Why I combined human vibe scoring (70%) with LLM-as-judge scoring (30%) instead of trusting either alone
4. How to set up a local HTML scoring page so you can rate AI outputs on gut feel and export those scores as JSON
5. Which model I recommend for PRDs, which for complex prototypes, and which for chatting with an agent daily

*Brought to you by:*
• Runway, the creative AI platform for images, video and more: https://runwayml.com/howIAI
• Hyperagent, deploy fleets of agents that handle real work: https://www.hyperagent.com/howiai

*In this episode, we cover:*
(00:00) Sonnet 5 is out
(01:55) What Anthropic claims
(04:02) Why I’m done with one-off vibe checks
(05:05) Building the How I AI Bench live with Claude Code
(07:42) The scoring system
(10:43) Agent voice eval
(11:57) Quick recap
(13:58) Results: The How I AI index leaderboard
(21:21) What I’m improving for the next run
(22:16) Generating a Claire-weighted index
(23:53) Model-by-task recommendations

*Tools referenced:*
• Claude Sonnet 5: https://www.anthropic.com/news/claude-sonnet-5
• Claude Opus 4.8: https://www.anthropic.com/news/claude-opus-4-8
• GPT-5.5 (OpenAI): https://openai.com/index/introducing-gpt-5-5/
• Gemini 3 Pro (Google DeepMind): https://deepmind.google/models/gemini/pro/
• Cursor: https://www.cursor.com/

*Other references:*
• SWE-bench Pro (agentic coding benchmark referenced): https://www.swebench.com/

*Where to find Claire Vo:*
• ChatPRD: https://www.chatprd.ai/
• Website: https://clairevo.com/
• LinkedIn: https://www.linkedin.com/in/clairevo/
• X: https://x.com/clairevo

_Production and marketing by https://penname.co/._
_For inquiries about sponsoring the podcast, email jordan@penname.co._

Claire Vo, host

Jun 30, 2026 · 25m

EVERY SPOKEN WORD

20 min read · 4,169 words

0:00 – 1:55
Sonnet 5 is out
Claire Vo
  We've got a new model, people, and it's from Anthropic. Now, is it Mythos? No. Is it Fable? No. But it is Claude Sonnet 5. Anthropic is claiming it's the most agentic Sonnet model yet, and we will get Opus-level tasks at Sonnet-level prices. Now, I've been testing a lot of models, and I'm starting to get bored of doing the vibe check. What I wanna start developing is a set of benchmarks we can regularly test these new models against that you'll care about. So today, I'm going to be introducing the How I AI Bench, a set of AI and Claire Vo-graded benchmarks that are gonna tell us if this model, and any model, is good at writing PRDs, solving bugs, and one-shotting designs. I'm gonna show you exactly how I built this benchmark using Claude Code, and we're gonna see on a blind test what comes out on top. Let's get to it. This episode is brought to you by Runway, a new kind of creative platform that has everything you need to generate any image, video, or piece of content you want, all in one place. With Runway, it's now possible to go from initial idea to a finished deliverable in a matter of minutes. From turning low-fidelity product shots into campaign-ready imagery all the way through putting together big brand films, Runway can help your team scale your creative ambitions while keeping your budgets and timelines from doing the same. Runway brings together the world's most advanced AI models, which is why enterprises like Microsoft, Robinhood, Amazon, and Adobe, along with studios like Lionsgate and Legendary, all use Runway to ship real work every day. Try it yourself at runwayml.com/howiai. Promo code, HOWIAI.
1:55 – 4:02
What Anthropic claims
Claire Vo
  Quickly, before we get to our evals, let's just talk about the headlines of Sonnet 5, this new model. Anthropic is pitching it as close to the performance of Opus 4.8, but much less expensive. So as you can see here, it's not quite at this 69% on agentic coding SWE-bench Pro, or the 82% on Terminal Bench 2.1, but it's not that far behind, and I suspect that most of us are not going to notice the difference. It's also supposed to be really good at computer work and knowledge work, and so this should be an everyday model that people reach for. In my episode with Felix from Anthropic, he says that we're all abusing Opus, and we should definitely be using the Sonnet models more, and we are going to put Sonnet 5 to the test against that proposition. Now, what do they say that Sonnet 5 is really good at? Well, it's really good at agentic tool use. So you're gonna get slightly longer running tool runs, longer running sessions than you would with Sonnet 4.6 at a lower cost than doing the same comparable task with Opus. So you're gonna see here, you know, Sonnet 4.6, a lower pass rate on these long-running tasks. Sonnet 5 getting pretty close when you have extra high reasoning on. And then Opus, of course, has the highest pass rate, but it's also much more expensive. That holds true also with computer use. So as you see, Sonnet 4.6, not bad, about 80% pass rate. But when you wanna get past 80% into really successful computer use, browser use, et cetera, which is what I've been doing a lot lately, you're gonna get a slightly cheaper experience, but almost as good as Opus 4.8 when you're using Sonnet. And then the headline seems to be it's much more affordable than Sonnet. So it's gonna be $2 per million input tokens and $10 per million output tokens, at least through the end of the summer, and then it's gonna go up a little bit. So if you wanna test this model and you wanna test it at launch prices, get that done
4:02 – 5:05
Why I’m done with one-off vibe checks
Claire Vo
  now. So as I said at the beginning of the episode, I'm a little tired of doing these sort of like one-off vibe checks. Sure, I can put this into Cursor, into Claude Code, one-shot a landing page, and kind of say, "What do I think?" And I've done this for a couple models. I've done it for GPT-5.5. I've done it for open weight models like GLM 5.2, but I've always felt like my feedback on these models is kind of soft. Yes, we put it against like specific workflows, but I don't like that it's not repeatable, and I don't like that we're not testing it over time. What do I like about this process, though? I do like that it is a Claire Vo benchmark. I have a perspective. I have a point of view of what's good and bad, and I don't wanna lose that Claire Vo taste by doing an LLM in the loop or an AI as judge on these benchmarks. So I'm gonna show you how I built and will build the How I AI Bench, and on a blind kind of taste test, how these models did across a couple
5:05 – 7:42
Building the How I AI Bench live with Claude Code
Claire Vo
  use cases. Okay. What's really fun is the evals are not quite done running, so they are running in a sub-agent right now for the final scores. So I will actually be surprised at the end of the episode about what I think of Sonnet 5 amongst all these other models. But I just wanna show you how you can build your own evals benchmark for you to assess whether or not these new models are really working in your favor. And so I have Claude Code up here, and I ask just a very simple question. [chuckles] "Based on our work together, can you help me brainstorm a How I AI benchmark and eval set we can test every time a new model comes out to consistently score different tasks that would be relevant to our podcast audience?" Now, this is something that I hope everybody takes advantage of. All your Claude Code sessions are stored on your desktop, so you can actually go through those. Claude can go through those and make recommendations on future work based on your past work. This also works for Codex. So you can have Codex look at your old sessions. You can even have Codex look at your Claude Code sessions and really use that in addition to its own memory to, like, come up with new ideas. So that's what I did here. And it sort of gave me kind of some good design principles about what makes a good benchmark in general: frozen inputs, blind scoring where possible, a rubric. And then it came up with a list of tasks, everything from taking messy notes and turning them into a PRD, to one-shotting a landing page or an app, to kind of going through lots of context and trying to come up with cited information. And I am not one to pick, um, because I want everything, so I said, "Build the whole thing. I love this." And it started, and then I corrected myself and I said, "Let's actually focus on tasks for builders: PRDs, prototypes, agentic multi-step, and agentic voice, basically does it pass the vibe check in my OpenClaw." 
I don't really care about long context and deep research, and then I said it could use my existing repos, some data sources, some things that we already did to build it. Now, what's interesting about how I built this is in addition to building the scored benchmarks where an LLM would actually score the outputs, I also said, "I want an HTML page at the end that I can give you vibe feedback." And then we will use my vibe feedback and the LLM scores to come up with the completely scientific How I AI Bench and see what it came up with. Now, this took
7:42 – 10:43
The scoring system
Claire Vo
  about, I don't know, 45 minutes to run. I actually recorded an episode while it was running, and I just wanna show you what it came up with and how I worked through it. What it did is it dropped all the outputs of the benchmark into one local HTML page where I could give it my own structured vibe check. And as you can see here, it says, "Just score each output one to five on pure gut feel. Would I ship this? Does it sound like me?" It's gonna save that to the browser. It actually downloaded a JSON file, and then I used that to check the scoring. And so you can see here I have a blind, I turned on blind, a blind set of models A through E. I believe we tested, although I should double-check 'cause I didn't really look, Opus 4.8, 5.5, Sonnet 4.6, um, Sonnet 5, and maybe GLM. I'm not actually sure what the, what the fifth one was. We'll see when we get the scores. And it made PRDs. And then I went through here, and I read the PRDs, and I gave it scores. And so, you know, I would look at these, and let's see if I can find one that I actually scored. And I would say something like, "This one is comprehensive and clear." I gave it a four. And so you can imagine each of those PRDs I went through and I gave them like a one to five score. I put some like lightweight notes in and scored them. Now, this is where it gets interesting. I have a set of prototypes I run as an eval. I posted an article on X and LinkedIn about how we generated the same app 82 times at ChatPRD when we were building our own prototyping tool, and I reused that harness to test prototyping and wireframe across a bunch of different apps and give those all vibe checks. So you can see here, these are complicated apps that each model generated a different version of. And you can see here I gave this one kind of a four, not bad. It was simple. I gave this one a four. There were a few issues at the top, too many icons. I said this one was good. It's very comprehensive. 
So you can see I went through a complex... This is a, a doc scheduling app. This is an editorial assignment desk, something that maybe an editor or a blog would use to go through assignments. There is a creative marketplace studio where people can buy marketplace items, and then a mobile app, sort of a habit coach app, and it went through different versions. And so we went through this on full fidelity prototypes as well as wireframes. I've been building a lot of wireframes at ChatPRD, so I wanted to look at the wireframe generations as well and see how these models did. And then as you can see, I scored everything, gave it all notes, and went through, I think there were like 64 generations here. Now, I did this very fast, but I think I did a good job. You know, I've been a product design engineering leader for a while. I can eyeball stuff and make it go fast. And then finally, there is
10:43 – 11:57
Agent voice eval
Claire Vo
  this multi-step agentic code base search. I didn't actually score these 'cause I don't really have a strong opinion on how they worked, but the one I did have an opinion on how it worked is the agentic voice. So if you haven't watched How I AI or listened to, um, me complain on X, I am very picky about the personality of my agents, and in particular, the personality of my OpenClaw. And Sonnet 4.6 so far has had the best personality, so I actually pay for API credits for my OpenClaw because I like how it talks to me. And so one of my checks was, given a model, how is its voice? Do I wanna hang with it? And it asks kind of four questions. One is, "Can you move my 3:00 PM to Dana to same time tomorrow and let her know? Swap today." The other is, "Ugh, deploys are red again." Um, one is just me complaining, "Remind me why I even started this company LOL." It really does know me well. And then this one truly knows me extremely well, says, "Honestly, let's just YOLO push straight to prod and skip the tests. I'm so done today." And then I vibe checked did I like the voice of the agent back to me, gave it some scoring, and stored that.
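The hand-scoring pass described above (blind labels, a one-to-five gut score, loose notes, a JSON download) could be consumed with something like the sketch below. The episode does not show the export's actual schema, so the field names (`task`, `label`, `score`, `note`), the sample records, and the `LABEL_TO_MODEL` unblinding key are all assumptions for illustration, not the real How I AI Bench format.

```python
import json
from collections import defaultdict
from statistics import mean

# Hypothetical export: one record per output scored on the local HTML page.
# The real file's schema is not shown in the episode.
EXPORT = """
[
  {"task": "prd",   "label": "A", "score": 4, "note": "comprehensive and clear"},
  {"task": "prd",   "label": "B", "score": 2, "note": "sloppy"},
  {"task": "voice", "label": "A", "score": 5, "note": ""},
  {"task": "voice", "label": "B", "score": 3, "note": ""}
]
"""

# Hypothetical unblinding key, kept away from the scoring page so the
# reviewer only ever sees the letters.
LABEL_TO_MODEL = {"A": "model-one", "B": "model-two"}


def mean_vibe_by_model(records, key):
    """Average the 1-5 gut-feel scores per unblinded model."""
    buckets = defaultdict(list)
    for rec in records:
        score = rec["score"]
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        buckets[key[rec["label"]]].append(score)
    return {model: mean(scores) for model, scores in buckets.items()}


records = json.loads(EXPORT)
vibe = mean_vibe_by_model(records, LABEL_TO_MODEL)
print(vibe)
```

Keeping the unblinding key out of the page itself is what makes the pass blind: scores are attached to letters first and mapped back to model names only at aggregation time.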
11:57 – 13:58
Quick recap
Claire Vo
  And so that is, so far that's V1 of the How I AI Bench. And just to like zoom back, I had Claude Code pick five models. I think I know four of them. I'm curious what the fifth was. Run some evals against a PRD, lots of prototype generation, an agentic bug hunting flow, and voice. I rated them all by hand, and then I had both GPT 5.5 and Opus 4.8 judge. And so in addition to my feedback, we had these two models also judge the output, and then I had it create a slide deck with the outcomes that I have not yet seen, and we're gonna go through live on this episode. This episode is brought to you by Hyperagent, the platform for deploying always-on agents that actually run your business. With Hyperagent, you build agents in the cloud and deploy them where your work already happens, like Slack, Telegram, or email. An agent will scan your inbox and draft replies to vendor follow-ups. Another monitors competitors and spins up rich ad kits and landing pages. A third notices a deal going cold in Salesforce and writes the save email with full account context. These aren't chatbots waiting for a perfect prompt. They're proactive, learning your preferences, retaining your playbooks, and getting better with every run. One user built four agents to run an outbound sales pipeline, prospecting, outreach, follow-ups, CRM updates, all in a single afternoon. No local setup, no VPS bills, no fragile permissions on your laptop, just powerful agents with full control over skills, tools, and guardrails. Hyperagent was built by the team behind Airtable, and How I AI listeners get $1,000 in free inference to start building. Claim yours at hyperagent.com/howiai.
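The two-judge setup in this recap can be sketched as a small panel. This is a minimal illustration under assumptions: the judge names, the nested `scores` layout, and every number below are invented, and the episode does not say how the real harness aggregates judge output. The sketch also computes a self-preference gap, one simple way to check whether a judge rates its own outputs higher than everyone else's.

```python
from statistics import mean

# Hypothetical 1-5 judge scores: scores[judge][author_model] -> list of scores.
# Both judges are also contestants, so each judge has an entry for itself.
scores = {
    "judge-x": {"judge-x": [4, 4], "judge-y": [3, 4], "other": [3, 3]},
    "judge-y": {"judge-x": [3, 3], "judge-y": [3, 3], "other": [2, 3]},
}


def panel_mean(scores, model):
    """Average every judge's scores for one model's outputs."""
    return mean(s for by_model in scores.values() for s in by_model.get(model, []))


def self_preference(scores, judge):
    """Judge's mean for its own outputs minus its mean for all other models."""
    own = mean(scores[judge][judge])
    others = mean(
        s
        for model, vals in scores[judge].items()
        if model != judge
        for s in vals
    )
    return own - others


for judge in scores:
    print(judge, "self-preference gap:", self_preference(scores, judge))
print("panel mean for 'other':", panel_mean(scores, "other"))
```

A positive gap does not prove bias on its own, since the judge's outputs may simply be better, which is one reason to compare it against what the second judge says about the same outputs.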
13:58 – 21:21
Results: The How I AI index leaderboard
Claire Vo
  So we're gonna go through this deck that the AI created for me that's going to give me a leaderboard. I have not seen this yet. We're gonna go through it live. It's even gonna surprise me. This is truly neutral, no bias. I'm excited to see what we get. This is our first model leaderboard, the How I AI Index world premiere. All right. So [chuckles] this is not at all what I was expecting. So again, here's the surprise. The model that I forgot we were testing scored the best. Gemini 3 Pro up here at the top of the leaderboard tied with the brand-new drop, Sonnet 5. GPT-5.5, my personal favorite, also in this three-horse race at the top of the leaderboard. And then poor Opus, the vibes are off at the bottom, as well as Sonnet 4.6, with lots of red flags on Sonnet 4.6. So Sonnet, I think we have a new version. That version is Sonnet 5. But hilariously, I was not expecting Gemini to be at the top of this leaderboard, yet here we are. So as you can see, we looked at quality, we looked at did it ship at all, and does it have good taste. And we are gonna see what the AI and I, the How I AI, said about these models. So what's interesting is the benchmark, the sort of like LLM model that came up, and I disagree on taste, which is quite funny. And in fact, I am the opposite of the automated benchmark. I sort of think the complete opposite. I think that 4.6 is the best and Gemini 3 Pro's the worst. And again, this is why we are gonna refine this benchmark over time. We are gonna keep doing these blind tests because what I thought was good, the model thought was bad, and what the model thought was good, I thought was bad. Why do we disagree? Well, every model's kind of an easy judge. Actually, I'm not really surprised about, about this. I am not surprised that every model sort of rates to the middle of the bell curve. This is one of the challenges that I have had with self-grading evals, is like humans, people always wanna give like a 7 out of 10. 
Agents wanna give a 7 out of 10. And so I don't think these models are spiky enough when it comes to how they evaluate output. And I think we all know that models are, like, pretty sloppy, and I don't think they have that vision of taste, uniqueness, what it looks like to the, quote-unquote, "human eye," which is why I put things inside. And what's interesting is because I put loose notes in with my feedback, you can see I said, "Oh, this is cute," or, "Oh, this is really sharp." And the agents did not see this. The rubrics did not see this in a way that I saw as a human. So what got flagged on the automated results? Well, these sort of things that I wasn't able to see on this, like, very first pass as a human. So it was really looked at broken working code, it ignored constraints, it was incomplete. Whereas I was just, like, eyeballing truly the first screenshot. So I wonder if I should take another pass at how I eval these wireframes. Again, I just did them on the visuals. I really didn't do them on the functionality, and that's maybe a gap for me. But you can see GPT-5.5, actually the thinkier ones, wrote broken code, and then a lot of them ignored the constraints around the wireframe styling. Now, let's see how it was graded by task. Gemini did a great job at the PRD writing, as did GPT 5.5. This might al- honestly be my bias, which is I hate Claude slop deeply, and I have, like, a big eye for Claude slop. And so I just see the tells of Claude-style writing, and it drives me crazy, and I think I scored those much lower. On the agentic code base, I'm, I'm... These all did great. I'm not surprised to see kind of 4.8, 5.5, 5, and Gemini all at the top. These are, like, pretty standard coding tasks that obviously all these models should be pretty good at. So I don't think that benchmark is as critical as it needs to be to show the difference between these models, because I think baseline coding tasks, all of them are good at. 
And then again, not surprised that 4.6 passed my voice test, because that is the model that I love in my actual OpenClaw. Um, but I am surprised to see Gemini 3 Pro at the top. And then in terms of the prototype matrix, seeing Opus and Sonnet winning in front end, again, not surprised, but this is, like, a very interesting mix of things. Okay, you can see what I say [laughs] about these models by hand. Again, I think this is quite funny, which is, let's see, on 4.6, what were the issues I said? Slop, not as functional, boring, okay, but not super cute. So 4.6, generic sloppy. 4.8, fancy. I really liked 4.8. So other than getting kind of dinged on one not being functional, I was really a big fan of 4.8. It seemed like 5.5 and Sonnet 5 had a lot of broken prototypes in it, and so when it worked, I really liked it, but it didn't work enough. And Gemini 3, very interesting. Bare bones, it seems like, but concise. And so I think, like, right, right to the point. So if I were to look at this from a qualitative perspective, I certainly like Opus, and I would love to see 5.5 and Sonnet work better because then I could judge it on its merits of taste. So, um, again, we had, um, model as a judge, and so we had Opus 4.8 and 5.5 judge itself. Um, I had the benchmark check if there was any inherent bias, like did Opus like Opus better and 5.5 like 5.5 better. I've consistently seen GPT 5.5 be the toughest judge, and so I actually prefer a 5.5 judge, but it judged itself lower than the other judge did. The judges overall agree, but they were overall generous, and sort of balancing these two judges is exactly why we ran this double bench. Okay, so
21:21 – 22:16
What I’m improving for the next run
Claire Vo
  takeaways and what changes next launch in terms of the How I AI Bench. Well, the model's gonna depend on the job and the strength of the model flip by task. I would say my taste actually matters, so maybe those vibe checks are not bad, and it really diverged hard from the metrics. So what I'm gonna try to do is encode more of my taste into the judgment. It says retire the saturated agentic task. That's really interesting. Again, I didn't read this before I presented it, but that's exactly the conclusion I came to, which was this, like, agentic bug tracking task is not a really good benchmark because all of them are pretty good at it, and I need to think about something else to test the agentic nature of these models. And so I don't re- [laughs] I don't really know what conclusion to draw from this. So let's go back to good old Claude
22:16 – 23:53
Generating a Claire-weighted index
Claire Vo
  and say, "Given the benchmark, and I agree, can you do a Claire-weighted index and generate a leaderboard page that strikes the right balance between my opinion and the back-end performance, and makes recommendations on model by task?" Okay, so we're going to have Claude Code summarize this benchmark, which is all over the place. Again, we do it live here at How I AI, and give you a ranking. Should we believe the AI leaderboard, or should we believe the Claire leaderboard, or somewhere in between, and come up with our definitive end of June, early July 2026 How I AI index of the paid frontier models. Let's see. Okay, Claude could not commit to making a decision itself, so it gave me ultimate power. It gave me a slider from 100% LLM judge to 100% Claire judged. It's my podcast. I'm going 70% Claire judged, 30% back end. At the top of the list, Sonnet 4.6, who would have thunk? And Gemini 3 Pro, followed by what I think is my favorite, 5.5, and at the bottom, poor brand new Sonnet 5, and really expensive
23:53 – 25:55
Model-by-task recommendations
Claire Vo
  4.8. What is Claire's recommendation model by task? If you're writing a PRD, use GPT 5.5 'cause it will give you something comprehensive and clear. If you are prototyping, guess what? Sonnet 4.6, pretty good. And if you wanna chit-chat with a model, again, Sonnet 4.6 has good vibes. If you're trying to knock down a code base, I actually did not score these, but the LLM judge thinks that Opus 4.8 and Sonnet 5 are pretty good at this. And then if you are doing prototypes, depending on what you're doing, different models can do better. I would say complex designs, again, what I saw in my ChatPRD benchmark is Opus 4.8 does really good at really dense, complicated UIs as well as consumer, and then you can use Sonnet for things that are just a little bit simpler to execute on. Okay, this was an adventure. This started out as a Sonnet 5 review. It ended up that Sonnet 5 is at the bottom of my personal preference list. Well, that's it. That's our first round of the How I AI Claire-weighted index. We are gonna be doing this every time a new model comes out. I'm gonna try to encode the benchmark and make it a little bit more critical, a little bit more aligned with my taste. I can't wait to see how it does on some of these new models, and I can't wait for this to be an industry standard benchmark that all the labs rely on. Thank you for joining How I AI, and see you next model release. [upbeat music] Thanks so much for watching. If you enjoyed the show, please like and subscribe here on YouTube, or even better, leave us a comment with your thoughts. You can also find this podcast on Apple Podcasts, Spotify, or your favorite podcast app. Please consider leaving us a rating and review which will help others find the show. You can see all our episodes and learn more about the show at howiaipod.com. See you next time.
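The slider from the Claire-weighted index segment, set to 70% Claire and 30% LLM judge, amounts to a weighted average of two per-model scores. The sketch below assumes both scores are already on the same 1-5 scale and averaged per model; the function name, model names, and numbers are hypothetical, and the actual page Claude Code generated may compute the blend differently.

```python
def weighted_index(human, judge, human_weight=0.7):
    """Blend per-model human and LLM-judge means into one ranked leaderboard.

    human_weight=1.0 is 100% human-judged; 0.0 is 100% LLM-judged.
    Only models present in both score sets are ranked.
    """
    if not 0.0 <= human_weight <= 1.0:
        raise ValueError("human_weight must be between 0 and 1")
    models = human.keys() & judge.keys()
    blended = {
        m: human_weight * human[m] + (1.0 - human_weight) * judge[m]
        for m in models
    }
    # Highest blended score first.
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical per-model means on a shared 1-5 scale.
human_means = {"model-a": 4.0, "model-b": 2.0, "model-c": 3.0}
judge_means = {"model-a": 3.0, "model-b": 4.0, "model-c": 3.0}

ranking = weighted_index(human_means, judge_means, human_weight=0.7)
for model, score in ranking:
    print(f"{model}: {score:.2f}")
```

Moving `human_weight` toward 0 reproduces the pure LLM-judge leaderboard, which is why the same data can put different models on top depending on where the slider sits.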

Episode duration: 25:56

Transcript of episode yJ-1LB2hF-Q
