I benchmarked the NEW Sonnet 5. The results shocked me.

I’ve been testing every major frontier model release since the start of the year, and when Anthropic dropped Sonnet 5, I wanted more than a vibe check. I got tired of one-off tests I couldn’t repeat or compare over time, so I built something better: the How I AI Bench, a repeatable eval harness I constructed live using Claude Code while recording this episode. I ran Sonnet 5 blind against four other frontier models (Sonnet 4.6, Opus 4.8, GPT-5.5, and Gemini 3 Pro) across PRD quality, prototype generation, agentic task completion, and agent personality. The results were not what I expected.

*What you’ll learn:*
1. What Anthropic claims Sonnet 5 improves over Sonnet 4.6, and where the benchmark data actually backs that up
2. How I built the How I AI Bench in under 45 minutes using Claude Code, starting from my own stored session history
3. Why I combined human vibe scoring (70%) with LLM as judge scoring (30%) instead of trusting either alone
4. How to set up a local HTML scoring page so you can rate AI outputs on gut feel and export those scores as JSON
5. Which model I recommend for PRDs, which for complex prototypes, and which for chatting with an agent daily

*Brought to you by:*
• Runway: The creative AI platform for images, video and more (https://runwayml.com/howIAI)
• Hyperagent: Deploy fleets of agents that handle real work (https://www.hyperagent.com/howiai)

*In this episode, we cover:*
(00:00) Sonnet 5 is out
(01:55) What Anthropic claims
(04:02) Why I’m done with one-off vibe checks
(05:05) Building the How I AI Bench live with Claude Code
(07:42) The scoring system
(10:43) Agent voice eval
(11:57) Quick recap
(13:58) Results: The How I AI index leaderboard
(21:21) What I’m improving for the next run
(22:16) Generating a Claire-weighted index
(23:53) Model-by-task recommendations

*Tools referenced:*
• Claude Sonnet 5: https://www.anthropic.com/news/claude-sonnet-5
• Claude Opus 4.8: https://www.anthropic.com/news/claude-opus-4-8
• GPT-5.5 (OpenAI): https://openai.com/index/introducing-gpt-5-5/
• Gemini 3 Pro (Google DeepMind): https://deepmind.google/models/gemini/pro/
• Cursor: https://www.cursor.com/

*Other references:*
• SWE-bench Pro (agentic coding benchmark referenced): https://www.swebench.com/

*Where to find Claire Vo:*
• ChatPRD: https://www.chatprd.ai/
• Website: https://clairevo.com/
• LinkedIn: https://www.linkedin.com/in/clairevo/
• X: https://x.com/clairevo

_Production and marketing by https://penname.co/._
_For inquiries about sponsoring the podcast, email jordan@penname.co._
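
The 70/30 blend of human vibe scores and LLM-as-judge scores mentioned in the description can be sketched roughly as below. This is a minimal illustration, not the actual How I AI Bench code: the JSON shape, the blind labels (`model_a`, `model_b`), the shared 0 to 10 scale, and the function names are all assumptions. Only the 70% and 30% weights come from the episode notes.

```python
import json

# Weights stated in the episode description.
HUMAN_WEIGHT = 0.7
JUDGE_WEIGHT = 0.3


def blended_score(human, judge):
    """Blend a human vibe score with an LLM-as-judge score.

    Assumes both scores are already on the same scale (hypothetically 0-10).
    """
    return HUMAN_WEIGHT * human + JUDGE_WEIGHT * judge


def leaderboard(human_json, judge_scores):
    """Rank blind-labelled models by blended score, highest first.

    human_json: JSON text as a scoring page might export it
                (hypothetical shape: {"label": score, ...}).
    judge_scores: dict mapping the same labels to judge scores.
    """
    human_scores = json.loads(human_json)
    rows = [
        (label, round(blended_score(human_scores[label], judge_scores[label]), 2))
        for label in human_scores
        if label in judge_scores  # skip models that lack a judge score
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up example data, for illustration only.
exported = '{"model_a": 8.0, "model_b": 6.0}'
judge = {"model_a": 6.0, "model_b": 9.0}
print(leaderboard(exported, judge))
```

Keeping the labels blind until after the blend is computed means the human scores cannot be swayed by knowing which model produced which output.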

Claire Vo, host

Jun 30, 2026 · 25m

EPISODE INFO

Released: June 30, 2026
Duration: 25m
Channel: How I AI

SPEAKERS

Claire Vo
host
Host of the "How I AI" show, covering practical AI tools and model evaluations.

EPISODE SUMMARY

In this episode of How I AI, "I benchmarked the NEW Sonnet 5. The results shocked me.", host Claire Vo benchmarks Sonnet 5, finds surprises, and builds a repeatable eval index. Anthropic positions Claude Sonnet 5 as a more agentic model with near-Opus performance at significantly lower cost, especially for long-running tool use and computer-use tasks.
