Evals for taste: Hill-climbing a slide-generation agent

The speaker built a rubric-driven, replayable eval system from real user projects that reports quality, cost, latency, error, and token signals in under 6 hours per model change. It evolved into a development flywheel driven by real user dissatisfaction signals.

May 23, 2026 · 39m

WHAT IT’S REALLY ABOUT

Building actionable evals to iteratively improve slide-generation agents fast

Evals are positioned as the actionable bridge between subjective “vibes” and measurable signals for improving AI agents before issues hit production.
The speaker contrasts grader types—code-based, model-based, and human—highlighting tradeoffs among determinism, nuance, cost, and calibration effort.
A slide-generation agent is used as a concrete case study to define domain-specific metrics (e.g., slide count, clutter, font size) and rubric-based judges (e.g., layout, color, text quality).
Iterative “hill-climbing” improvements are demonstrated by changing system prompts, adding requirements (like diagrams), and introducing a QA/self-critique loop to catch visual/layout defects.
The session warns that model-judge scores can be miscalibrated or misleading (e.g., giving a perfect image score with no images) and explains techniques to improve judge reliability, such as better anchors and ordering rationales before scoring.
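
One way to catch the "perfect image score with no images" failure is to cross-check judge scores against facts that code can verify. The sketch below is an illustration under assumptions, not the speaker's implementation: the dimension names, the `facts` dictionary, and the `reconcile` helper are all hypothetical.

```python
from typing import Callable, Optional

# Hypothetical facts extracted from the deck by code-based checks.
DeckFacts = dict[str, int]

# Each rubric dimension maps to a precondition that must hold before
# the judge's score for that dimension is trusted (assumed names).
PRECONDITIONS: dict[str, Callable[[DeckFacts], bool]] = {
    "image_quality": lambda facts: facts.get("image_count", 0) > 0,
    "diagram_quality": lambda facts: facts.get("diagram_count", 0) > 0,
}

def reconcile(judge_scores: dict[str, int], facts: DeckFacts) -> dict[str, Optional[int]]:
    """Replace a judge score with None when there was nothing to grade."""
    reconciled: dict[str, Optional[int]] = {}
    for dimension, score in judge_scores.items():
        precondition = PRECONDITIONS.get(dimension)
        applicable = precondition is None or precondition(facts)
        reconciled[dimension] = score if applicable else None
    return reconciled
```

With this guard, a judge that awards 5 for images on a deck with `image_count` of 0 yields `None` for that dimension instead of inflating the aggregate.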

IDEAS WORTH REMEMBERING

5 ideas

Build evals for your specific use case; do not rely only on public benchmarks.

Benchmarks like SWE-bench indicate general capability, but they rarely measure the exact behaviors your agent must satisfy; bespoke evals let you detect regressions and compare models/configs for your domain.
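
As a rough illustration of detecting regressions between two configurations, the sketch below compares mean per-metric scores from a baseline run and a candidate run. The function name, the score layout (metric name mapped to a list of normalized scores), and the `min_drop` threshold are assumptions made for this example.

```python
from statistics import mean

def find_regressions(
    baseline: dict[str, list[float]],
    candidate: dict[str, list[float]],
    min_drop: float = 0.05,
) -> dict[str, float]:
    """Return metrics whose mean score fell by at least min_drop."""
    regressions: dict[str, float] = {}
    for metric, base_scores in baseline.items():
        if metric not in candidate:
            continue  # metric not measured for the candidate run
        delta = mean(candidate[metric]) - mean(base_scores)
        if delta <= -min_drop:
            regressions[metric] = round(delta, 3)
    return regressions
```

A real comparison would likely also account for run-to-run variance before calling a drop a regression; this sketch only compares means.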

Use code-based graders for hard requirements and invariants.

Checks like “deck file exists,” “exactly 5 slides,” or “emoji count” are fast, cheap, and deterministic, making them ideal as guardrails even if they don’t capture nuanced quality.
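
The three checks named above can be sketched as a small deterministic grader. The `Deck` structure is a hypothetical stand-in for whatever the agent actually produces, and the emoji pattern is an approximation that covers common emoji code-point ranges rather than every emoji.

```python
import re
from dataclasses import dataclass

@dataclass
class Deck:
    """Hypothetical in-memory view of a generated deck."""
    path_exists: bool   # whether the deck file was written
    slides: list[str]   # plain text of each slide

# Approximate emoji matcher (assumption: these ranges are good enough).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def grade_deck(deck: Deck, expected_slides: int = 5, max_emojis: int = 0) -> dict[str, bool]:
    """Run cheap pass/fail checks on hard requirements."""
    emoji_count = sum(len(EMOJI_RE.findall(text)) for text in deck.slides)
    return {
        "deck_file_exists": deck.path_exists,
        "slide_count_ok": len(deck.slides) == expected_slides,
        "emoji_count_ok": emoji_count <= max_emojis,
    }
```

Because each check is a plain boolean, the results can gate a run before any model-based judging is attempted.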

Use model-based graders for “taste” dimensions, but expect calibration work.

Rubrics for layout, color contrast, or readability capture nuance, yet can produce inflated or inconsistent scores unless you add anchors (what a 0 vs 5 looks like), consensus judging, and periodic recalibration.
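
A minimal sketch of an anchored rubric with consensus judging follows. The anchor wording, the JSON reply shape, and the `judge` callable (any function that takes a prompt and returns the model's text reply) are assumptions for illustration; no specific model API is implied. The prompt asks for the rationale before the score, and the median of several votes is used as the consensus.

```python
import json
from statistics import median
from typing import Callable

# Hypothetical rubric text; the anchors describe what 0, 3, and 5 look like.
RUBRIC_PROMPT = """You are grading the LAYOUT of a slide deck on a 0-5 scale.
Anchors:
  0 = elements overlap or run off the slide
  3 = readable, but spacing and alignment are inconsistent
  5 = consistent alignment and spacing, nothing crowded
Write your rationale first, then the score.
Reply as JSON: {"rationale": "...", "score": <integer 0-5>}"""

def consensus_score(judge: Callable[[str], str], deck_description: str, n_votes: int = 3) -> float:
    """Ask the judge n_votes times and return the median score."""
    scores = []
    for _ in range(n_votes):
        reply = json.loads(judge(RUBRIC_PROMPT + "\n\nDeck:\n" + deck_description))
        score = int(reply["score"])
        if not 0 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        scores.append(score)
    return median(scores)
```

The median is less sensitive than the mean to a single outlier vote, which is one reason to prefer it for a small number of judges.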

Treat evals as living artifacts that evolve with your system.

As the agent improves, some evals saturate or stop providing actionable signal; you should revise metrics and rubrics to keep measuring failure modes that still matter.
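
One simple way to notice saturation is to flag metrics that have sat at their ceiling for several consecutive runs. This is a sketch under assumptions: scores are normalized to the range 0 to 1, and the `tolerance` and `min_runs` values are arbitrary defaults chosen for the example.

```python
def saturated_metrics(
    history: dict[str, list[float]],
    ceiling: float = 1.0,
    tolerance: float = 0.02,
    min_runs: int = 5,
) -> list[str]:
    """Return metrics whose last min_runs scores are all near the ceiling.

    history maps a metric name to its scores, oldest first.
    """
    flagged = []
    for name, scores in history.items():
        recent = scores[-min_runs:]
        if len(recent) < min_runs:
            continue  # not enough runs to judge
        if min(recent) >= ceiling - tolerance:
            flagged.append(name)
    return flagged
```

A flagged metric is a candidate for a stricter rubric or replacement, not necessarily for deletion, since it may still work as a regression guard.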

A QA loop (criticize → fix → re-render → re-check) is a general-purpose quality booster.

Making the agent adversarial—assuming problems exist and forcing at least one verified fix cycle—often improves outputs across domains, including visual artifacts like overlap, stretching, and inconsistent spacing.
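
The criticize, fix, re-render, re-check cycle can be sketched as a bounded loop. The `render`, `critique`, and `fix` callables are hypothetical placeholders for the agent's own tools, and `max_rounds` is an assumed safety limit; the adversarial framing itself would live in the critic's prompt, which is not shown here.

```python
from typing import Callable, List, Tuple, TypeVar

D = TypeVar("D")  # deck representation
R = TypeVar("R")  # rendered output, e.g. slide images

def qa_loop(
    deck: D,
    render: Callable[[D], R],
    critique: Callable[[R], List[str]],
    fix: Callable[[D, List[str]], D],
    max_rounds: int = 3,
) -> Tuple[D, List[str]]:
    """Critique the rendered deck and fix it until clean or out of rounds."""
    issues = critique(render(deck))
    rounds = 0
    while issues and rounds < max_rounds:
        deck = fix(deck, issues)
        issues = critique(render(deck))  # re-render, then re-check
        rounds += 1
    return deck, issues  # leftover issues are returned, not hidden
```

Returning the remaining issues lets the caller log decks that still fail after the last round instead of silently accepting them.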

WORDS WORTH SAVING

5 quotes

Evals are systematic tests that measure how well an AI system performs on a specific domain or use case, right?

— Unknown

Evals is also the bridge between things like it seems to work or, like, um, we know it works.

— Unknown

If you add evals, you have clarity. You need to define what does success look like, right?

— Unknown

It's not because you have set up your evals once that they are now, like, the ground truth, you know? Um, evals, over time, they can evolve. They need to be a living artifact.

— Unknown

Approach QA as a bug hunt, not a confirmation step.

— Unknown

Why bespoke evals matter vs generic benchmarks
Evals as an iteration loop for prompts and agents
Grader types: code, model-judge, human review
Slide-deck quality metrics (clutter, fonts, text density, emojis)
Rubric-based judging and calibration problems
QA/self-critique loops for quality improvements
Switching models as an optimization lever (Sonnet vs Opus)

High quality AI-generated summary created from speaker-labeled transcript.
