
Evals for taste: Hill-climbing a slide-generation agent

Built a rubric-driven, replayable eval system from real user projects that reports quality, cost, latency, error, and token signals in under 6 hours per model change. It evolved into a development flywheel powered by real user dissatisfaction signals.

May 23, 2026 · 39 min

CHAPTERS

  1. Why evals matter: turning “vibes” into actionable feedback

    The speaker frames the session as a practical guide to building evals that help you systematically improve AI agents, not just rely on subjective impressions. Evals are positioned as the bridge from “it seems worse today” to concrete, debuggable signals you can iterate on.

  2. What evals are—and why standard benchmarks aren’t enough for your app

    The talk defines evals as systematic tests + grading logic for expected behavior. It contrasts public benchmarks (SWE-bench, tool-use benchmarks, reasoning benchmarks) with the reality that product teams need custom evals tailored to their own use cases.
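
    As a rough illustration of "systematic tests + grading logic", the sketch below pairs each test prompt with a grading function and runs them against an agent. Everything here is hypothetical and not taken from the talk's repository: `EvalCase`, `run_evals`, and `toy_agent` are illustrative names, and the agent is a stand-in that returns a fixed text outline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One test: an input prompt plus the logic that grades the output."""
    name: str
    prompt: str
    grade: Callable[[str], bool]


def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case through the agent and record pass/fail per case."""
    return {case.name: case.grade(agent(case.prompt)) for case in cases}


def toy_agent(prompt: str) -> str:
    """Hypothetical stand-in for a real agent call."""
    return f"Title: {prompt}\n- point one\n- point two"


cases = [
    EvalCase("has_title", "Q3 review", lambda out: out.startswith("Title:")),
    EvalCase("max_three_bullets", "Q3 review", lambda out: out.count("\n- ") <= 3),
]

results = run_evals(toy_agent, cases)
pass_rate = sum(results.values()) / len(results)
print(results, pass_rate)
```

    Keeping the grading function next to the case makes a regression traceable to one named expectation rather than to a general impression.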

  3. Life without evals: reactive firefighting and hidden regressions

    The speaker explains common failure patterns when teams don’t have evals: issues are discovered only in production, fixes introduce regressions elsewhere, and feedback is hard to distinguish from noise. The result is a lack of confidence in whether changes helped or hurt.

  4. Where evals fit in the iteration loop (from prompts to agents)

    Evals are presented as the core loop for improving prompts and increasingly complex agents. As systems add tool calls, skills, and more configuration levers, evals become even more critical to understand what changes drive quality.

  5. Grader types: code-based, model-based, and human review trade-offs

    The talk breaks graders into three categories and compares their strengths and weaknesses. Code-based graders are fast and deterministic but brittle; model-based graders add nuance but need calibration; human graders are highest quality but expensive and slow.

  6. Repo walkthrough: the slide-generation agent scaffold (agent.yaml + environment)

    The speaker briefly tours the provided repository setup. The baseline agent is defined with a minimal system prompt and a Python environment with python-pptx available to generate PowerPoint files via shell access.

  7. Choosing evals for slides: brainstorming quality signals and failure modes

    The audience suggests candidate metrics (e.g., word count per slide, content overflow/overlap), illustrating how some checks are deterministic while others require model judgment. The speaker emphasizes that evals should produce information you can act on; otherwise, they don’t belong.
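
    The deterministic checks suggested by the audience could be sketched as below. This is a minimal sketch under stated assumptions: the `Box` type is hypothetical, coordinates are in inches, and the default slide size of 13.333 x 7.5 inches is an assumed 16:9 layout, not a value from the talk. A real grader would read these values from the generated file.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical text box: position and size in inches, plus its text."""
    left: float
    top: float
    width: float
    height: float
    text: str = ""


def words_per_slide(slides: list[list[Box]]) -> list[int]:
    """Total word count of all text boxes on each slide."""
    return [sum(len(box.text.split()) for box in boxes) for boxes in slides]


def overflows(box: Box, slide_w: float = 13.333, slide_h: float = 7.5) -> bool:
    """True if any edge of the box falls outside the slide (assumed size)."""
    return (
        box.left < 0
        or box.top < 0
        or box.left + box.width > slide_w
        or box.top + box.height > slide_h
    )


def overlaps(a: Box, b: Box) -> bool:
    """True if the two rectangles share any interior area."""
    return (
        a.left < b.left + b.width
        and b.left < a.left + a.width
        and a.top < b.top + b.height
        and b.top < a.top + a.height
    )


slides = [
    [Box(1, 1, 5, 2, "Revenue grew twelve percent"), Box(4, 2, 5, 2, "Churn fell")],
    [Box(10, 1, 5, 2, "Off the edge")],
]
print(words_per_slide(slides))
print(overlaps(slides[0][0], slides[0][1]), overflows(slides[1][0]))
```

    Checks like these are cheap to run on every change, while judgments such as "does this layout look balanced" are the ones that need a model or a human.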

  8. Implemented eval suite: code graders and LLM judges for slide decks

    The repo includes prebuilt graders in two directories: code graders (counts/structural properties) and judge graders (rubric-based 0–5 scores). Example code graders include emoji count, clutter heuristics, slide count, small-font detection, and text heaviness; judge graders cover color contrast, layout, text quality, and images.
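
    A few of the listed code graders might look roughly like this. It is a sketch, not the repository's implementation: the deck is modeled as a plain list of slides, each a list of `(text, font_size_pt)` runs, the emoji pattern covers only two common Unicode ranges, and the 12 pt threshold is an assumed cutoff.

```python
import re

# Rough heuristic: two common emoji/symbol ranges, not the full Unicode emoji set.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

MIN_FONT_PT = 12.0  # assumed threshold for "small font"

# Hypothetical deck shape: a list of slides, each a list of (text, font_size_pt).
Deck = list[list[tuple[str, float]]]


def grade_deck(deck: Deck) -> dict[str, int]:
    """Return simple structural counts for one deck."""
    runs = [run for slide in deck for run in slide]
    return {
        "slide_count": len(deck),
        "emoji_count": sum(len(EMOJI_RE.findall(text)) for text, _ in runs),
        "small_font_runs": sum(1 for _, pt in runs if pt < MIN_FONT_PT),
    }


deck = [
    [("Launch plan \U0001F680", 32.0), ("Ship by June", 18.0)],
    [("Risks \u26A0 and wins \u2705", 28.0), ("footnote text", 9.0)],
]
report = grade_deck(deck)
print(report)
```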

  9. Baseline results: human inspection reveals obvious quality issues

    The speaker shows the first generated decks and highlights visible problems: awkward layout, overlaps, inconsistent styling, and “AI-looking” artifacts. The eval report provides initial signals, but visual review is used to validate what the metrics actually mean.

  10. First hill-climb: tighten the system prompt (typography, density, “AI tells”)

    Using eval findings, the agent’s system prompt is expanded with explicit typography rules, layout/density constraints, and instructions to avoid common AI slide tropes (accent lines, decorative emojis). This demonstrates using eval outputs to guide targeted prompt edits.

  11. Calibration reality check: when evals disagree with human judgment

    After improvements, the scoring output surfaces surprising numbers (e.g., unexpected emoji counts, “text-heavy” flags that feel arguable). The speaker underscores that graders can be mis-specified and must be recalibrated—evals are not immutable ground truth.
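
    One way to make this recalibration concrete is to compare a grader's flags with human labels on the same decks. The sketch below assumes boolean flags (for example, "text-heavy: yes/no") and made-up labels; `calibration_report` is a hypothetical helper, not something shown in the talk.

```python
def calibration_report(grader_flags: list[bool], human_flags: list[bool]) -> dict[str, float]:
    """Compare grader flags with human labels for the same items."""
    if len(grader_flags) != len(human_flags):
        raise ValueError("grader and human labels must cover the same items")
    if not grader_flags:
        raise ValueError("need at least one labeled item")
    pairs = list(zip(grader_flags, human_flags))
    return {
        "agreement": sum(g == h for g, h in pairs) / len(pairs),
        # Grader flagged a deck that the human considered fine.
        "false_positives": sum(g and not h for g, h in pairs),
        # Human flagged a deck that the grader let through.
        "false_negatives": sum(h and not g for g, h in pairs),
    }


# Illustrative labels only: did each of five decks get flagged as text-heavy?
grader = [True, True, False, True, False]
human = [True, False, False, False, False]
report = calibration_report(grader, human)
print(report)
```

    A high false-positive count, as in this toy data, suggests the grader's threshold is stricter than the reviewers' judgment and is a candidate for adjustment.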

  12. Adding a new requirement: force diagrams/charts as images

    The agent is updated to ensure each slide includes at least one generated diagram/chart inserted as an image. The resulting deck looks more grounded and data-driven, but new issues emerge (e.g., stretched visuals), showing how changing requirements shifts what you must evaluate.

  13. QA loop as a general improvement lever: self-critique via render-and-inspect

    A QA loop is introduced where the agent must assume issues exist, render slides to images, inspect each slide, fix problems, and re-check until at least one fix/verify cycle is completed. This mirrors common code-review/self-critique loops and boosts consistency.
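
    The control flow of such a loop can be sketched as follows. In the talk the loop is driven by instructions to the agent; this sketch only shows the render, inspect, fix, and re-check structure in code. The `render`, `inspect`, and `fix` callables are hypothetical placeholders, and the toy versions below work on plain strings instead of rendered images.

```python
from typing import Any, Callable

Issue = tuple[int, str]  # (slide index, description)


def qa_loop(
    deck: Any,
    render: Callable[[Any], list[Any]],
    inspect: Callable[[Any], list[str]],
    fix: Callable[[Any, list[Issue]], Any],
    max_cycles: int = 3,
) -> tuple[Any, int, list[Issue]]:
    """Render, inspect every slide, fix, then re-check until clean or out of cycles."""

    def find_issues(current: Any) -> list[Issue]:
        return [
            (i, problem)
            for i, image in enumerate(render(current))
            for problem in inspect(image)
        ]

    cycles = 0
    issues = find_issues(deck)
    while issues and cycles < max_cycles:
        deck = fix(deck, issues)
        issues = find_issues(deck)  # verification pass after the fix
        cycles += 1
    return deck, cycles, issues


# Toy stand-ins: a "slide image" is just the slide text.
def toy_render(deck: list[str]) -> list[str]:
    return list(deck)


def toy_inspect(image: str) -> list[str]:
    return ["too many words"] if len(image.split()) > 5 else []


def toy_fix(deck: list[str], issues: list[Issue]) -> list[str]:
    flagged = {i for i, _ in issues}
    return [" ".join(s.split()[:5]) if i in flagged else s for i, s in enumerate(deck)]


fixed, cycles, remaining = qa_loop(
    ["one two three four five six seven", "short slide"],
    toy_render, toy_inspect, toy_fix,
)
print(fixed, cycles, remaining)
```

    The `max_cycles` cap is a design choice added here so the loop terminates even when a fix never fully resolves the reported issues.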

  14. Model choice as an optimization: switching to a smarter model (Opus)

    The speaker demonstrates that upgrading the underlying model can outperform extensive prompt engineering. With a simple baseline prompt, the smarter model produces cleaner decks (e.g., fewer emojis, better layout instincts), but judge graders still over-score, exposing evaluation weaknesses.

  15. Making model-based judging actionable: anchors, rationales, and ordering effects

    The talk closes by diagnosing why rubric judges can fail: they lack grounded anchors for what 0 vs 5 looks like, and they can rationalize arbitrary numeric scores. Techniques suggested include providing exemplar anchors, forcing pros/cons before scoring, and using multi-agent disagreement to reduce hallucinated grading.
