
Evals for taste: Hill-climbing a slide-generation agent

A rubric-driven, replayable eval system built from real user projects delivers quality, cost, latency, error, and token signals in under 6 hours per model change. It evolved into a development flywheel powered by real user dissatisfaction signals.

May 23, 2026 · 39m

Episode Details

EPISODE INFO

Released: May 23, 2026
Duration: 39m
Channel: Claude

EPISODE SUMMARY

In this episode from the Claude channel, “Evals for taste: Hill-climbing a slide-generation agent” explores how to build actionable evals that improve a slide-generation agent quickly and iteratively. Evals are positioned as the actionable bridge between subjective “vibes” and measurable signals for improving AI agents before issues reach production.
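As a rough illustration of the kind of system the description mentions, the sketch below replays saved projects through an agent, scores each output against a weighted rubric, and rolls the runs up into quality, cost, latency, error, and token signals. It is a minimal sketch under stated assumptions, not the implementation from the episode: `RunResult`, `RubricItem`, `evaluate`, `fake_agent`, and the rubric checks are all hypothetical names invented here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class RunResult:
    """One replayed project run (hypothetical shape, not from the episode)."""
    output: str
    cost_usd: float
    latency_s: float
    tokens: int
    error: bool


@dataclass
class RubricItem:
    """A named, weighted pass/fail check applied to the agent output."""
    name: str
    weight: float
    check: Callable[[str], bool]


def score(output: str, rubric: List[RubricItem]) -> float:
    """Return the weighted fraction of rubric items the output passes."""
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight for item in rubric if item.check(output))
    return earned / total if total else 0.0


def evaluate(projects: List[str],
             agent: Callable[[str], RunResult],
             rubric: List[RubricItem]) -> Dict[str, float]:
    """Replay every project through the agent and aggregate the signals."""
    if not projects:
        raise ValueError("need at least one project to replay")
    runs = [agent(p) for p in projects]
    ok = [r for r in runs if not r.error]  # quality is scored on successful runs only
    return {
        "quality": mean(score(r.output, rubric) for r in ok) if ok else 0.0,
        "cost_usd": sum(r.cost_usd for r in runs),
        "latency_s": mean(r.latency_s for r in runs),
        "error_rate": sum(r.error for r in runs) / len(runs),
        "tokens": sum(r.tokens for r in runs),
    }


def fake_agent(project: str) -> RunResult:
    """Stand-in for a real slide-generation agent; returns canned numbers."""
    return RunResult(output=f"# {project}\n- point", cost_usd=0.01,
                     latency_s=2.0, tokens=100, error=False)


# Illustrative rubric: real checks would likely be model-graded or human-written.
RUBRIC = [
    RubricItem("has_title", 2.0, lambda out: out.startswith("#")),
    RubricItem("has_bullets", 1.0, lambda out: "- " in out),
    RubricItem("very_short", 1.0, lambda out: len(out) < 10),
]

report = evaluate(["a", "b"], fake_agent, RUBRIC)
print(report)
```

Because the projects are stored and replayed rather than sampled live, the same set can be rerun after each model or prompt change and the reports compared side by side.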

RELATED EPISODES

How we Claude Code

Agent Battle: Mine the most diamonds in 45 minutes

Agents that remember

Tool, skill, or subagent? Decomposing an agent that outgrew its prompt

Making agentic workflows trustworthy and verifiable with a custom DSL

How AirOps chases friction to build AI products with Claude
