At a glance
WHAT IT’S REALLY ABOUT
Claire benchmarks Sonnet 5, finds surprises, builds repeatable eval index
- Anthropic positions Claude Sonnet 5 as a more agentic, near-Opus-performance model at significantly lower cost, especially for long-running tool use and computer-use tasks.
- Claire argues that one-off “vibe checks” aren’t repeatable, so she builds a reusable benchmark harness (How I AI Bench) with Claude Code and scores outputs blind across multiple models.
- The benchmark combines Claire’s human “would I ship this / does it sound like me” ratings with LLM-as-judge scores, revealing major disagreements between human taste and automated rubric scoring.
- Initial leaderboard results shock Claire: Gemini 3 Pro and Sonnet 5 tie near the top on the automated index, while Opus 4.8 and Sonnet 4.6 rank lower due to rubric-detected issues like broken code or ignored constraints.
- Claire then creates a “Claire-weighted index” (70% her scoring, 30% backend), which flips the ranking and leads to task-specific recommendations (e.g., GPT-5.5 for PRDs, Sonnet 4.6 for prototyping and agent voice).
IDEAS WORTH REMEMBERING
5 ideasSonnet 5’s value proposition is “near-Opus” at lower price, but taste may differ.
Anthropic highlights agentic tool use, computer use, and launch pricing, but Claire’s own preference ranking ultimately puts Sonnet 5 near the bottom once she weights for her subjective quality bar.
Repeatable evals beat ad hoc vibe checks for tracking models over time.
Claire’s core shift is from one-off demos to a standardized suite with frozen inputs, blind model labels, and a consistent rubric so new releases can be compared reliably.
Human taste and automated scoring can diverge dramatically.
The LLM judges rewarded factors like functional correctness and constraint adherence, while Claire often scored from first-impression design/tone; this created near-opposite rankings for some models.
LLM judges tend to compress scores toward the middle.
Claire observes that model grading often avoids “spiky” judgments, producing generous, bell-curve outcomes that may fail to reflect strong preferences or aesthetic nuance.
Some benchmarks saturate when all frontier models perform similarly.
The agentic codebase/bug-hunt task didn’t differentiate models well because baseline coding competence is now high, motivating Claire to replace or redesign that eval to better test “agentic-ness.”
WORDS WORTH SAVING
5 quotesI'm starting to get bored of doing the vibe check.
— Claire Vo
I don't like that it's not repeatable, and I don't like that we're not testing it over time.
— Claire Vo
Sonnet 4.6 so far has had the best personality, so I actually pay for API credits for my OpenClaw because I like how it talks to me.
— Claire Vo
This is truly neutral, no bias.
— Claire Vo
This started out as a Sonnet 5 review. It ended up that Sonnet 5 is at the bottom of my personal preference list.
— Claire Vo
High quality AI-generated summary created from speaker-labeled transcript.
