CHAPTERS
- 0:00 – 0:30
Sonnet 5 launch: agentic promise at Sonnet pricing
Claire introduces Claude Sonnet 5 and frames the big question: whether it is truly more agentic while delivering near-Opus performance at a much lower cost. She explains why she wants a repeatable benchmark instead of relying on subjective first impressions.
- •Sonnet 5 positioned as “most agentic Sonnet” with Opus-level capability claims
- •Goal: evaluate whether Sonnet 5 can replace Opus for many workflows
- •Motivation to move beyond ad-hoc “vibe checks”
- •Plan to build a repeatable benchmark suite relevant to builders
- 0:30 – 2:01
How I AI Bench: what it measures (PRDs, prototypes, bugs, voice)
She introduces the How I AI Bench: a set of Claire-graded, audience-relevant tasks aimed at product building. The benchmark is designed to compare models on PRD writing, design/prototyping, debugging flows, and agent personality/voice.
- •Bench tasks: PRDs, one-shot designs/prototypes, bug-solving, agentic voice
- •Human taste preserved: no purely “AI judge” scoring
- •Blind comparisons across multiple models
- •Objective: track model quality over time with consistent inputs
- 2:01 – 4:03
Sonnet 5 headline claims: performance, agentic coding, computer use, and price
Claire walks through Anthropic’s positioning and metrics: Sonnet 5 approaches Opus performance on agentic coding and computer-use benchmarks while being cheaper. She highlights launch pricing and why it matters for early testing.
- •Anthropic claim: close to Opus 4.8 performance at lower cost
- •Benchmarks referenced: SWE-bench Pro, Terminal Bench, computer-use pass rates
- •Expectation: many users won’t notice the gap vs Opus in practice
- •Pricing: $2/M input tokens and $10/M output tokens (launch window)
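The launch pricing above turns into a per-run cost with simple arithmetic. The following is a minimal sketch: the function name and the token counts are illustrative assumptions, not figures from the video, and only the two rates come from the chapter notes.

```python
# Launch-window rates quoted in the chapter notes, in USD per million tokens.
INPUT_USD_PER_M = 2.0
OUTPUT_USD_PER_M = 10.0


def run_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one model call at the quoted launch rates."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


# Hypothetical benchmark task: 50k tokens of context in, 8k tokens out.
print(f"${run_cost_usd(50_000, 8_000):.2f}")  # prints $0.18
```

Output tokens cost five times as much as input tokens at these rates, so generation-heavy tasks such as prototypes would likely dominate the bill for a benchmark run.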
- 4:03 – 5:04
Why she’s done with one-off vibe checks (and what “good” looks like)
Claire explains the shortcomings of informal model testing: it’s not repeatable and doesn’t allow tracking over time. She wants structured scoring while keeping her personal taste central to the evaluation.
- •One-off tests in tools (Cursor/Claude Code) feel “soft” and non-repeatable
- •Need for frozen inputs, rubrics, and consistent task harnesses
- •Keep “Claire Vo taste” without outsourcing judgment to LLMs
- •Use blind tests to reduce bias when comparing models
- 5:04 – 7:35
Building the benchmark live in Claude Code using past sessions
She shows how she used Claude Code to brainstorm and generate the benchmark suite, leveraging saved local sessions for context. The system proposes benchmark design principles and a menu of tasks before she narrows scope to builder-centric workflows.
- •Prompting Claude Code to design a recurring benchmark/eval set
- •Using stored desktop sessions for continuity and better recommendations
- •Benchmark principles: frozen inputs, blind scoring where possible, rubric
- •Scope narrowed to: PRDs, prototypes, multi-step agentic, and voice
- 7:35 – 9:06
Scoring workflow: blind HTML rater + structured gut-feel notes
The benchmark produces a local HTML page with outputs for blind human scoring. Claire rates each output 1–5 (“Would I ship this? Does it sound like me?”) and exports results as JSON for aggregation.
- •Local HTML page consolidates all outputs for fast review
- •Blind mode: models labeled A–E to reduce brand bias
- •Simple 1–5 rubric plus lightweight qualitative notes
- •Scores exported to JSON to combine with automated scoring later
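The blind-labeling and JSON-export steps above could look roughly like the sketch below. This is not Claire's actual harness: the function names, the record fields, and the sample ratings are all hypothetical, and the five-model list is an assumption based on the models named elsewhere in these notes.

```python
import json
import random


def assign_blind_labels(models, seed=None):
    """Shuffle model names and map them to letters A, B, C, ... (hypothetical helper)."""
    shuffled = list(models)
    random.Random(seed).shuffle(shuffled)
    return {chr(ord("A") + i): name for i, name in enumerate(shuffled)}


def export_scores(ratings, label_to_model):
    """Un-blind the ratings and serialize them as a JSON string."""
    records = []
    for label, task, score, note in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"score must be 1-5, got {score}")
        records.append(
            {"model": label_to_model[label], "task": task, "score": score, "note": note}
        )
    return json.dumps(records, indent=2)


# Assumed model list; the rater only ever sees the letters.
models = ["Sonnet 5", "Sonnet 4.6", "Opus 4.8", "GPT-5.5", "Gemini 3 Pro"]
labels = assign_blind_labels(models, seed=7)

# Made-up example ratings: (blind label, task, 1-5 score, gut-feel note).
ratings = [("A", "prd", 4, "would ship with edits"), ("B", "prd", 2, "generic")]
print(export_scores(ratings, labels))
```

Keeping the label-to-model mapping separate from the ratings means the un-blinding happens only at export time, after all scores are locked in.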
- 9:06 – 10:37
Prototype & wireframe eval harness: reusing ChatPRD’s 82-run setup
Claire reuses a prototyping harness she previously used at ChatPRD to generate many UI variants across multiple app scenarios. She rapidly scores visual quality and taste across dozens of generations, covering both full-fidelity prototypes and wireframes.
- •Prototype tasks across complex app types (scheduling, editorial desk, marketplace, habit coach)
- •Full-fidelity and wireframe generation both evaluated
- •High-volume review (dozens of generations) to stress-test consistency
- •Human scoring emphasizes visual judgment and “ship-ability”
- 10:37 – 12:08
Agent voice evaluation: the OpenClaw personality test
She explains a dedicated eval for assistant personality because she’s highly sensitive to agent voice. Sonnet 4.6 has been her favorite for OpenClaw, and the benchmark includes several realistic prompts to judge tone, helpfulness, and vibe.
- •Voice matters as a core product attribute for always-on agents
- •Prompts include scheduling changes, reacting to deploy failures, founder venting
- •Measures: does the assistant feel like someone she’d work with
- •Sonnet 4.6 noted as her current best “personality” baseline
- 12:08 – 14:10
Two-judge system + surprise reveal setup (Claire + LLM judges)
Claire combines her human ratings with automated evaluation by two strong models (GPT-5.5 and Opus 4.8). She has the system generate a slide deck and leaderboard she hasn’t seen yet to preserve the “blind reveal” moment.
- •Hybrid scoring: Claire ratings plus LLM-as-judge outputs
- •Two judges used to reduce single-judge bias (GPT-5.5 and Opus 4.8)
- •Automated deck creation for results presentation
- •Goal: make results surprising, even to the host
- 14:10 – 15:42
Leaderboard shock: Gemini 3 Pro ties Sonnet 5; Opus and Sonnet 4.6 bottom (automated)
The first leaderboard surprises Claire: Gemini 3 Pro and Sonnet 5 appear at the top alongside GPT-5.5, while Opus 4.8 and Sonnet 4.6 fall to the bottom in the automated view. This immediately raises questions about what “taste” and “quality” mean in the scoring system.
- •Unexpected top performers on the index: Gemini 3 Pro and Sonnet 5 (with GPT-5.5 close)
- •Opus 4.8 and Sonnet 4.6 show more “red flags” in the automated scoring
- •Metrics include quality, did it ship, and taste dimensions
- •Sets up tension between human taste and automated grading
- 15:42 – 18:15
Why Claire and the automated benchmark disagree (taste vs functionality)
Claire explains she often disagrees with LLM judging: models cluster toward middle-of-the-road scores and lack sharp taste discrimination. She also realizes her quick visual wireframe scoring may miss functional issues that automated checks catch (broken code, ignored constraints, incompleteness).
- •LLM judges tend to be overly generous/centered (the “7/10 problem”)
- •Taste cues (“cute,” “sharp”) weren’t captured well by rubrics/judges
- •Automated flags: broken code, ignored constraints, incomplete outputs
- •Claire’s gap: she rated visuals fast, not deeper functionality
- 18:15 – 20:48
Task-by-task breakdown: PRDs, codebase search, voice, prototypes
Results differ by category: Gemini and GPT-5.5 score well on PRDs; codebase search looks saturated because all models perform similarly; voice aligns with Claire’s prior preference for Sonnet 4.6. Prototype outcomes show a mixed picture across models and UI complexity.
- •PRD writing: Gemini 3 Pro and GPT-5.5 stand out; Claire notes bias against “Claude slop”
- •Agentic codebase search: not discriminative—most models do fine
- •Voice: Sonnet 4.6 performs best by Claire’s preference
- •Prototypes: Opus/Sonnet strong in front-end, but results vary by app complexity
- 20:48 – 21:18
Judge behavior and bias checks: GPT-5.5 as tougher grader
Claire reviews how the two LLM judges behaved and whether they favored themselves. She notes GPT-5.5 tends to be stricter and even judged itself lower, while overall both judges were generous, reinforcing the value of combining perspectives.
- •Bias check: did Opus prefer Opus or GPT prefer GPT?
- •GPT-5.5 is consistently the toughest judge and self-critical
- •Judges broadly agree but skew generous overall
- •Motivation for a “double bench” (two-judge balance)
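One way to run the judge checks described above is to compare each judge's average score (strictness) and the gap between what a judge gives its own model and what the other judges give it (self-preference). The sketch below assumes a simple judge-by-model score table; the judge and model names are generic placeholders and the numbers are invented for illustration, not results from the video.

```python
from statistics import mean

# Placeholder scores: judge -> judged model -> mean score.
# Invented numbers for illustration only.
scores = {
    "judge_x": {"model_x": 7.0, "model_y": 6.0, "model_z": 8.0},
    "judge_y": {"model_x": 8.0, "model_y": 7.5, "model_z": 8.5},
}
# Which model each judge "is" (hypothetical mapping).
own_model = {"judge_x": "model_x", "judge_y": "model_y"}


def strictness(scores):
    """Mean score each judge hands out; a lower mean indicates a tougher grader."""
    return {judge: mean(given.values()) for judge, given in scores.items()}


def self_preference(scores, own_model):
    """Judge's score for its own model minus the mean score other judges gave it."""
    result = {}
    for judge, model in own_model.items():
        others = [given[model] for j, given in scores.items() if j != judge]
        result[judge] = scores[judge][model] - mean(others)
    return result


print(strictness(scores))
print(self_preference(scores, own_model))
```

A negative self-preference value means the judge scored its own model below what the other judge gave it, which is the "self-critical" pattern the chapter attributes to GPT-5.5.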
- 21:18 – 25:56
Next iteration: refine benchmarks + build a Claire-weighted index and recommendations
Claire outlines improvements for the next run: better encode her taste into the evaluation and replace saturated tasks that don’t differentiate models. She then generates a “Claire-weighted” leaderboard (a slider between the LLM judges and Claire) and shares model-by-task recommendations. The twist at the end: Sonnet 5 lands near the bottom of her personal ranking.
- •Planned changes: make tasks more discriminative; retire saturated agentic bug benchmark
- •Introduce a weighting slider between LLM judge and Claire judge (she picks 70% Claire)
- •Claire-weighted ranking shifts: Sonnet 4.6 rises; Sonnet 5 and Opus 4.8 drop
- •Recommendations: GPT-5.5 for PRDs; Sonnet 4.6 for prototyping and chat; Opus 4.8 for dense/complex UIs; Opus/Sonnet 5 strong per LLM judge for codebase tasks
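The weighting slider described above amounts to a linear blend of the two score sources. The sketch below assumes Claire's ratings are on the 1-5 scale mentioned earlier and that LLM-judge scores are on a 1-10 scale (an assumption suggested by the "7/10 problem"); both are normalized to 0-1 before blending. The function names, model names, and scores are hypothetical placeholders.

```python
def normalize(score, low, high):
    """Map a score from the range [low, high] onto [0, 1]."""
    return (score - low) / (high - low)


def blended_index(claire_score, llm_score, claire_weight=0.7):
    """Blend a 1-5 human score with a 1-10 LLM-judge score (assumed scales)."""
    if not 0.0 <= claire_weight <= 1.0:
        raise ValueError("claire_weight must be between 0 and 1")
    human = normalize(claire_score, 1, 5)
    judge = normalize(llm_score, 1, 10)
    return claire_weight * human + (1 - claire_weight) * judge


# Invented placeholder data: model -> (Claire score, LLM-judge score).
placeholder = {"model_a": (4, 6.0), "model_b": (2, 9.0)}


def rank(data, claire_weight):
    """Order models from highest to lowest blended index."""
    return sorted(
        data,
        key=lambda m: blended_index(*data[m], claire_weight=claire_weight),
        reverse=True,
    )


print(rank(placeholder, claire_weight=0.7))  # human-heavy weighting
print(rank(placeholder, claire_weight=0.0))  # LLM judges only
```

With these placeholder numbers the order flips between the two weightings, which illustrates how a model can rise or fall on the Claire-weighted leaderboard relative to the automated one.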
