How I AI

I benchmarked the NEW Sonnet 5. The results shocked me.

I’ve been testing every major frontier model release since the start of the year, and when Anthropic dropped Sonnet 5, I wanted more than a vibe check. I got tired of one-off tests I couldn’t repeat or compare over time, so I built something better: the How I AI Bench, a repeatable eval harness I constructed live using Claude Code while recording this episode. I ran Sonnet 5 blind against four other frontier models (Sonnet 4.6, Opus 4.8, GPT-5.5, and Gemini 3 Pro) across PRD quality, prototype generation, agentic task completion, and agent personality. The results were not what I expected.

*What you’ll learn:*
1. What Anthropic claims Sonnet 5 improves over Sonnet 4.6, and where the benchmark data actually backs that up
2. How I built the How I AI Bench in under 45 minutes using Claude Code, starting from my own stored session history
3. Why I combined human vibe scoring (70%) with LLM-as-judge scoring (30%) instead of trusting either alone
4. How to set up a local HTML scoring page so you can rate AI outputs on gut feel and export those scores as JSON
5. Which model I recommend for PRDs, which for complex prototypes, and which for chatting with an agent daily

*Brought to you by:*
• Runway, the creative AI platform for images, video and more: https://runwayml.com/howIAI
• Hyperagent, which lets you deploy fleets of agents that handle real work: https://www.hyperagent.com/howiai

*In this episode, we cover:*
(00:00) Sonnet 5 is out
(01:55) What Anthropic claims
(04:02) Why I’m done with one-off vibe checks
(05:05) Building the How I AI Bench live with Claude Code
(07:42) The scoring system
(10:43) Agent voice eval
(11:57) Quick recap
(13:58) Results: The How I AI index leaderboard
(21:21) What I’m improving for the next run
(22:16) Generating a Claire-weighted index
(23:53) Model-by-task recommendations

*Tools referenced:*
• Claude Sonnet 5: https://www.anthropic.com/news/claude-sonnet-5
• Claude Opus 4.8: https://www.anthropic.com/news/claude-opus-4-8
• GPT-5.5 (OpenAI): https://openai.com/index/introducing-gpt-5-5/
• Gemini 3 Pro (Google DeepMind): https://deepmind.google/models/gemini/pro/
• Cursor: https://www.cursor.com/

*Other references:*
• SWE-bench Pro (agentic coding benchmark referenced): https://www.swebench.com/

*Where to find Claire Vo:*
• ChatPRD: https://www.chatprd.ai/
• Website: https://clairevo.com/
• LinkedIn: https://www.linkedin.com/in/clairevo/
• X: https://x.com/clairevo

_Production and marketing by https://penname.co/._
_For inquiries about sponsoring the podcast, email jordan@penname.co._

Claire Vo, host
Jun 30, 2026 · 25m

At a glance

WHAT IT’S REALLY ABOUT

Claire benchmarks Sonnet 5, finds surprises, builds repeatable eval index

  1. Anthropic positions Claude Sonnet 5 as a more agentic, near-Opus-performance model at significantly lower cost, especially for long-running tool use and computer-use tasks.
  2. Claire argues that one-off “vibe checks” aren’t repeatable, so she builds a reusable benchmark harness (How I AI Bench) with Claude Code and scores outputs blind across multiple models.
  3. The benchmark combines Claire’s human “would I ship this / does it sound like me” ratings with LLM-as-judge scores, revealing major disagreements between human taste and automated rubric scoring.
  4. Initial leaderboard results shock Claire: Gemini 3 Pro and Sonnet 5 tie near the top on the automated index, while Opus 4.8 and Sonnet 4.6 rank lower due to rubric-detected issues like broken code or ignored constraints.
  5. Claire then creates a “Claire-weighted index” (70% her own scoring, 30% LLM-as-judge scoring), which flips the ranking and leads to task-specific recommendations (e.g., GPT-5.5 for PRDs, Sonnet 4.6 for prototyping and agent voice).
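
As a rough illustration of how a 70/30 blend can flip a ranking, here is a minimal Python sketch. The function names, the placeholder model names, and every score in it are assumptions made up for the example; they are not the episode's data or the harness's actual code, and the sketch assumes both score sets are on the same scale.

```python
HUMAN_WEIGHT = 0.7  # share given to the human ratings (from the episode)
JUDGE_WEIGHT = 0.3  # share given to the LLM-as-judge scores (from the episode)


def weighted_index(human, judge, human_weight=HUMAN_WEIGHT, judge_weight=JUDGE_WEIGHT):
    """Blend two per-model score dicts that are on the same scale."""
    models = human.keys() & judge.keys()  # only models scored by both sides
    return {m: human_weight * human[m] + judge_weight * judge[m] for m in models}


def leaderboard(index):
    """Return (model, score) pairs sorted from highest to lowest score."""
    return sorted(index.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores on a 0-10 scale, invented purely for illustration.
human_scores = {"model_a": 9, "model_b": 4}
judge_scores = {"model_a": 5, "model_b": 8}

judge_only = leaderboard(judge_scores)                           # model_b ranks first
blended = leaderboard(weighted_index(human_scores, judge_scores))  # model_a ranks first
```

With these invented numbers the judge alone prefers `model_b`, while the blended index prefers `model_a`, because the human rating carries more than twice the weight.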

IDEAS WORTH REMEMBERING

5 ideas

Sonnet 5’s value proposition is “near-Opus” at lower price, but taste may differ.

Anthropic highlights agentic tool use, computer use, and launch pricing, but Claire’s own preference ranking ultimately puts Sonnet 5 near the bottom once she weights for her subjective quality bar.

Repeatable evals beat ad hoc vibe checks for tracking models over time.

Claire’s core shift is from one-off demos to a standardized suite with frozen inputs, blind model labels, and a consistent rubric so new releases can be compared reliably.
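
One way to picture the blind-label step is the Python sketch below. It is an assumption about how such a harness could work, not a description of Claire's implementation: the function names, the "Output A" label format, and the shape of the exported JSON are all hypothetical.

```python
import json
import random
import string


def assign_blind_labels(models, seed):
    """Map neutral labels ("Output A", ...) to model names in shuffled order."""
    rng = random.Random(seed)  # seeded so the same key can be regenerated later
    shuffled = list(models)
    rng.shuffle(shuffled)
    # Supports up to 26 models, one uppercase letter each.
    return {f"Output {string.ascii_uppercase[i]}": name for i, name in enumerate(shuffled)}


def unblind(scores_json, key):
    """Translate exported scores keyed by blind label back to model names."""
    scores = json.loads(scores_json)
    return {key[label]: score for label, score in scores.items()}


models = ["model_a", "model_b", "model_c"]  # placeholder names
key = assign_blind_labels(models, seed=42)

# Stand-in for the JSON a scoring page might export (assumed shape).
exported = json.dumps({label: 3 for label in key})
by_model = unblind(exported, key)
```

Keeping the label key out of sight until scoring is finished is what makes the rating blind; the fixed seed only makes the mapping reproducible between runs.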

Human taste and automated scoring can diverge dramatically.

The LLM judges rewarded factors like functional correctness and constraint adherence, while Claire often scored from first-impression design/tone; this created near-opposite rankings for some models.

LLM judges tend to compress scores toward the middle.

Claire observes that model grading often avoids “spiky” judgments, producing generous, bell-curve outcomes that may fail to reflect strong preferences or aesthetic nuance.
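
A simple way to check for this kind of compression is to compare the spread of the two score sets. The Python sketch below uses population standard deviation as the spread measure, which is my choice for illustration and not something the episode specifies; the scores are invented.

```python
from statistics import pstdev


def score_spread(scores):
    """Population standard deviation: higher means more 'spiky' ratings."""
    return pstdev(scores)


# Hypothetical ratings of the same four outputs on a 0-10 scale.
human_scores = [2, 9, 3, 10]  # strong likes and dislikes
judge_scores = [6, 7, 6, 7]   # clustered around the middle

human_spread = score_spread(human_scores)
judge_spread = score_spread(judge_scores)
```

If the judge's spread is consistently much smaller than the human's across tasks, that is a hint the judge is clustering scores and not expressing strong preferences.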

Some benchmarks saturate when all frontier models perform similarly.

The agentic codebase/bug-hunt task didn’t differentiate models well because baseline coding competence is now high, motivating Claire to replace or redesign that eval to better test “agentic-ness.”

WORDS WORTH SAVING

5 quotes

I'm starting to get bored of doing the vibe check.

Claire Vo

I don't like that it's not repeatable, and I don't like that we're not testing it over time.

Claire Vo

Sonnet 4.6 so far has had the best personality, so I actually pay for API credits for my OpenClaw because I like how it talks to me.

Claire Vo

This is truly neutral, no bias.

Claire Vo

This started out as a Sonnet 5 review. It ended up that Sonnet 5 is at the bottom of my personal preference list.

Claire Vo

• Anthropic Sonnet 5 claims vs real-world performance
• Building repeatable benchmarks with Claude Code
• Blind evaluation via local HTML + JSON scoring
• PRD, prototype, wireframe, agentic debugging, and agent voice tests
• LLM-as-judge vs human taste disagreement
• Leaderboard construction and weighting (“Claire-weighted index”)
• Model-by-task recommendations and next-benchmark improvements

High quality AI-generated summary created from speaker-labeled transcript.
