
I benchmarked the NEW Sonnet 5. The results shocked me.

I’ve been testing every major frontier model release since the start of the year, and when Anthropic dropped Sonnet 5, I wanted more than a vibe check. I got tired of one-off tests I couldn’t repeat or compare over time, so I built something better: the How I AI Bench, a repeatable eval harness I constructed live using Claude Code while recording this episode. I ran Sonnet 5 blind against four other frontier models (Sonnet 4.6, Opus 4.8, GPT-5.5, and Gemini 3 Pro) across PRD quality, prototype generation, agentic task completion, and agent personality. The results were not what I expected.

*What you’ll learn:*

  1. What Anthropic claims Sonnet 5 improves over Sonnet 4.6, and where the benchmark data actually backs that up
  2. How I built the How I AI Bench in under 45 minutes using Claude Code, starting from my own stored session history
  3. Why I combined human vibe scoring (70%) with LLM-as-judge scoring (30%) instead of trusting either alone
  4. How to set up a local HTML scoring page so you can rate AI outputs on gut feel and export those scores as JSON
  5. Which model I recommend for PRDs, which for complex prototypes, and which for chatting with an agent daily

*Brought to you by:*

  • Runway, the creative AI platform for images, video and more: https://runwayml.com/howIAI
  • Hyperagent, deploy fleets of agents that handle real work: https://www.hyperagent.com/howiai

*In this episode, we cover:*

  (00:00) Sonnet 5 is out
  (01:55) What Anthropic claims
  (04:02) Why I’m done with one-off vibe checks
  (05:05) Building the How I AI Bench live with Claude Code
  (07:42) The scoring system
  (10:43) Agent voice eval
  (11:57) Quick recap
  (13:58) Results: The How I AI index leaderboard
  (21:21) What I’m improving for the next run
  (22:16) Generating a Claire-weighted index
  (23:53) Model-by-task recommendations

*Tools referenced:*

  • Claude Sonnet 5: https://www.anthropic.com/news/claude-sonnet-5
  • Claude Opus 4.8: https://www.anthropic.com/news/claude-opus-4-8
  • GPT-5.5 (OpenAI): https://openai.com/index/introducing-gpt-5-5/
  • Gemini 3 Pro (Google DeepMind): https://deepmind.google/models/gemini/pro/
  • Cursor: https://www.cursor.com/

*Other references:*

  • SWE-bench Pro (agentic coding benchmark referenced): https://www.swebench.com/

*Where to find Claire Vo:*

  • ChatPRD: https://www.chatprd.ai/
  • Website: https://clairevo.com/
  • LinkedIn: https://www.linkedin.com/in/clairevo/
  • X: https://x.com/clairevo

_Production and marketing by https://penname.co/._
_For inquiries about sponsoring the podcast, email jordan@penname.co._
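To make the 70/30 blend of human vibe scores and LLM-as-judge scores concrete, here is a minimal Python sketch. Only the two weights come from the episode description. The flat JSON shape, the function names `blended_score` and `leaderboard`, the model names, and the numbers are all assumptions for illustration, not the actual How I AI Bench code or results.

```python
import json

HUMAN_WEIGHT = 0.7  # human vibe score share, as stated in the description
JUDGE_WEIGHT = 0.3  # LLM-as-judge share, as stated in the description


def blended_score(human: float, judge: float) -> float:
    """Combine two scores on the same scale into one weighted score."""
    return HUMAN_WEIGHT * human + JUDGE_WEIGHT * judge


def leaderboard(human_json: str, judge_scores: dict[str, float]) -> list[tuple[str, float]]:
    """Rank models by blended score, highest first.

    human_json is assumed to be a flat export such as {"model-a": 8.0},
    the kind of file a local scoring page could produce (hypothetical schema).
    """
    human_scores = json.loads(human_json)
    rows = [
        (model, round(blended_score(human_scores[model], judge_scores[model]), 2))
        for model in human_scores
        if model in judge_scores  # skip models that have no judge score
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Placeholder models and made-up numbers, not results from the episode.
    exported = '{"model-a": 8.0, "model-b": 6.0}'
    judged = {"model-a": 6.0, "model-b": 9.0}
    for model, score in leaderboard(exported, judged):
        print(model, score)
```

With these weights, a model the judge prefers can still rank lower if the human score disagrees strongly, which is the point of weighting gut feel more heavily than the automated judge.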

Claire Vo, host
Jun 30, 2026 · 25m

Episode Details

EPISODE INFO

Released: June 30, 2026
Duration: 25m
Channel: How I AI


SPEAKERS

  • Claire Vo (host): Host of the "How I AI" show, covering practical AI tools and model evaluations.

EPISODE SUMMARY

In this episode of How I AI, "I benchmarked the NEW Sonnet 5. The results shocked me.", Claire Vo benchmarks Sonnet 5, finds some surprises, and builds a repeatable eval index. Anthropic positions Claude Sonnet 5 as a more agentic, near-Opus-performance model at significantly lower cost, especially for long-running tool use and computer-use tasks.

RELATED EPISODES

How to turn your company into AI builders | John Kim (Sendbird, co-founder and CEO)

How Mozilla Uses Claude Mythos to find Firefox bugs before hackers do

How Gusto’s CTO uses Claude Code to ship like a startup

The power user’s guide to Codex | Alexander Embiricos (product lead)

Vibe coding a 3D multiplayer game in 15 minutes, with no game dev experience | Cody De Arkland

I spent $200 on Claude Design so you don't have to
