Evals for taste: Hill-climbing a slide-generation agent

Built a rubric-driven, replayable eval system from real user projects that reports quality, cost, latency, error, and token signals in under 6 hours per model change. It evolved into a development flywheel powered by real user dissatisfaction signals.

May 23, 2026 · 39m

CHAPTERS

0:19 – 1:20
Why evals matter: turning “vibes” into actionable feedback
The speaker frames the session’s goal: motivate attendees to build evals and use them to systematically improve AI agents. Evals are positioned as the bridge from subjective impressions to concrete, repeatable signals you can act on.
- Session objective: inspire building evals and using them to improve agents
- Evals as systematic tests tied to a specific domain/use case
- Replacing “it feels worse today” with actionable measurements
- Using evals to identify what’s good/bad and guide improvements
1:20 – 3:21
What evals are and how they encode expectations
Evals are defined as scenario-based tests plus grading logic that encode what the system must do and what “quality” looks like. The speaker emphasizes that failing evals should directly indicate misalignment with intended behavior.
- Evals = tests + scenarios + expectations encoded via grading logic
- Helpful for required invariants (e.g., output must include a slide deck)
- Designed to reveal failure modes and improvement opportunities
- Connects system behavior to measurable quality signals
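To make the "scenario plus grading logic" idea concrete, here is a minimal sketch in Python. The names `EvalCase`, `produced_slide_deck`, and `run_case` are hypothetical and not taken from the talk's repo; the agent call itself is left out, so the grader only sees the list of files a run produced.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One scenario plus the grading logic that encodes its expectation."""
    name: str
    prompt: str                          # scenario handed to the agent
    grader: Callable[[list[str]], bool]  # receives the paths the run wrote


def produced_slide_deck(output_paths: list[str]) -> bool:
    """Invariant from the talk: the output must include a slide deck (.pptx)."""
    return any(path.lower().endswith(".pptx") for path in output_paths)


def run_case(case: EvalCase, output_paths: list[str]) -> dict:
    """Grade one finished run; running the agent is out of scope here."""
    return {"case": case.name, "passed": case.grader(output_paths)}


# Hypothetical scenario and made-up output paths, for illustration only.
case = EvalCase(
    name="deck-exists",
    prompt="Create a slide deck about solar energy.",
    grader=produced_slide_deck,
)
result_ok = run_case(case, ["out/solar.pptx", "out/notes.txt"])
result_bad = run_case(case, ["out/notes.txt"])
```

A failing case such as `result_bad` points directly at the violated expectation, which is the property the speaker asks of evals.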
3:21 – 4:52
Why generic benchmarks aren’t enough for your app
The talk surveys popular public benchmarks (e.g., SWE-bench) and explains why they’re useful for broad comparisons but often irrelevant for a specific product. This leads to the core guidance: build and maintain your own evals tailored to your use case.
- Common benchmarks: SWE-bench, Terminal Bench, agent/tool benchmarks, reasoning benchmarks
- Benchmarks provide general capability comparisons across models
- Product needs differ; public evals rarely match your domain
- Recommendation: build bespoke evals to choose models and tune agents
4:52 – 7:57
Life without evals: reactive debugging and invisible regressions
The speaker outlines typical failure patterns when teams don’t use evals: relying on customer complaints, debugging from sparse anecdotes, and accidentally introducing regressions. Evals provide a way to verify whether changes truly improve the system.
- Without evals: issues discovered only in production
- Hard to separate genuine problems from noisy feedback
- Prompt tweaks can cause unexpected regressions elsewhere
- No reliable way to confirm improvements over versions
7:57 – 9:28
How evals fit into agent iteration loops
Evals are placed into an iterative workflow: create test cases, run prompts/agents, refine, and repeat until results are acceptable. As systems become more complex (tools, skills, context strategies), evals become even more critical to manage many tuning levers.
- Classic loop: test cases → prompt → run → refine → ship
- Agents add complexity: tools, skills, context optimization
- More levers increase the risk of unintended side effects
- Evals provide the concrete feedback needed for iteration
9:28 – 12:29
Grader types: code-based, model-based, and human evaluation
The speaker breaks down grader categories and tradeoffs. Code-based graders are cheap and deterministic but brittle; model-based graders add nuance but require calibration and can be nondeterministic; human grading is highest quality but costly and slow.
- Code graders: string/regex/tool-call checks; fast/cheap/deterministic but brittle
- Model graders: rubric scoring, pairwise preference, multi-judge consensus; nuanced but nondeterministic/costly
- Human graders: best nuance/quality; slow and expensive; useful for A/B and spot checks
- Choosing graders depends on what you’re measuring (hard constraints vs. taste/quality)
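The grader categories above can be sketched in a few lines of Python. This is an illustration under assumptions, not the talk's code: the function names are hypothetical, and the model call behind each pairwise vote is omitted, so `pairwise_consensus` only shows how votes from several judges might be combined.

```python
import re
from collections import Counter


def regex_grader(text: str, pattern: str) -> bool:
    """Code grader: cheap and deterministic, but brittle to rephrasing."""
    return re.search(pattern, text) is not None


def tool_call_grader(tool_calls: list[str], required: str) -> bool:
    """Code grader: did the agent invoke the required tool at least once?"""
    return required in tool_calls


def pairwise_consensus(votes: list[str]) -> str:
    """Combine 'A'/'B' preferences from several judges by simple majority."""
    counts = Counter(votes)
    if counts["A"] == counts["B"]:
        return "tie"
    return "A" if counts["A"] > counts["B"] else "B"


# The same outcome phrased two ways: the regex passes one and misses the other.
hit = regex_grader("Saved deck to out/deck.pptx", r"\.pptx\b")
miss = regex_grader("Saved the presentation", r"\.pptx\b")
winner = pairwise_consensus(["A", "B", "A"])
```

The `miss` case is the brittleness tradeoff in miniature: the agent may have done the right thing, but a string check cannot tell.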
12:29 – 14:01
Repo walkthrough: the slide-generation agent setup
The session shifts to the hands-on setup: a managed agent defined in an agent.yaml with a system prompt and a Python environment including python-pptx. The speaker explains the minimal initial instructions given to the agent: generate a PowerPoint file at a specified location.
- Agent is configured in agent.yaml with a system prompt
- Task: given a topic, create a PowerPoint at a target path
- Environment provides shell access and python-pptx installed
- This baseline enables running repeated eval-driven iterations
14:01 – 15:35
Defining slide-deck evals: audience brainstorming to concrete metrics
Attendees suggest potential evals (e.g., words per slide, visual overflow/overlap). The speaker maps these ideas to grader choices and introduces prebuilt evaluators in the repo: deterministic code graders and model-based judges for harder-to-encode qualities.
- Ideas proposed: word count per slide; detecting overflow/overlap
- Quantifiable checks suit code graders; visual/layout issues often need model graders
- Repo includes two evaluator directories: code and judge
- Goal: pick evals that yield actionable information for iteration
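The audience's ideas map naturally onto small code graders. The sketch below assumes slide text and element bounding boxes have already been extracted into plain Python values (for example with python-pptx, which is not used here); `Box` and all function names are hypothetical. Geometric checks like these are only a proxy: they cannot see text that overflows its box once rendered, which is one reason the talk reaches for model graders on layout.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box of one slide element (any consistent unit)."""
    left: float
    top: float
    width: float
    height: float


def words_per_slide(slide_texts: list[list[str]]) -> list[int]:
    """Count whitespace-separated words across the text blocks of each slide."""
    return [sum(len(block.split()) for block in blocks) for blocks in slide_texts]


def boxes_overlap(a: Box, b: Box) -> bool:
    """True when two boxes share interior area; touching edges do not count."""
    return (a.left < b.left + b.width and b.left < a.left + a.width
            and a.top < b.top + b.height and b.top < a.top + a.height)


def overlapping_pairs(boxes: list[Box]) -> list[tuple[int, int]]:
    """Index pairs of elements on one slide that overlap each other."""
    return [(i, j)
            for i in range(len(boxes))
            for j in range(i + 1, len(boxes))
            if boxes_overlap(boxes[i], boxes[j])]


def out_of_bounds(boxes: list[Box], slide_width: float, slide_height: float) -> list[int]:
    """Indices of elements that extend past the slide edges."""
    return [i for i, b in enumerate(boxes)
            if b.left < 0 or b.top < 0
            or b.left + b.width > slide_width
            or b.top + b.height > slide_height]


# Made-up slide content and geometry, for illustration only.
counts = words_per_slide([["Solar energy basics", "Costs fell sharply"], ["Thanks"]])
pairs = overlapping_pairs([Box(0, 0, 4, 2), Box(3, 1, 4, 2), Box(8, 0, 2, 2)])
```

Each function returns positions rather than a single pass/fail flag, so a failing eval says which slide or element to look at.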
15:35 – 19:38
Baseline output review: spotting real failure modes in generated slides
The speaker shows the first generated slide deck and highlights visible problems like overlaps, odd styling, and general lack of polish. This manual inspection motivates which eval signals should be tracked and improved first.
- Baseline deck exists and follows the “make a slide deck” instruction
- Visible quality issues: overlaps, awkward lines/boxes, inconsistent styling
- Examples of “things you never want” in a deck (e.g., clutter, unreadable elements)
- Human review helps decide what to measure and what to fix next
19:38 – 21:39
Running the scoring script: interpreting code metrics and judge scores
A scoring script aggregates metrics like slide count, clutter, small fonts, text heaviness, and emoji usage, plus judge scores for text/layout/color/image. The speaker notes that judge scores look suspiciously high, introducing the need for calibration and treating evals as a living artifact.
- Score output includes code metrics (emoji count, clutter, small font, etc.)
- Judge graders output 0–5 scores for text, layout, color, image, coherence
- Observed mismatch: high scores despite mediocre slides
- Evals must evolve; calibration and periodic re-validation are required
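One simple way to check whether a judge is over-scoring is to compare its scores with human scores on the same decks. The sketch below is an assumption about how such a calibration check could look, not the scoring script from the talk; `calibration_report` is a hypothetical name and the numbers are made up.

```python
from statistics import mean


def calibration_report(judge_scores: list[float], human_scores: list[float]) -> dict:
    """Compare judge and human scores given for the same decks (0-5 scale)."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    return {
        "mean_bias": mean(diffs),                      # > 0: judge scores higher than humans
        "mean_abs_error": mean(abs(d) for d in diffs)  # typical size of the disagreement
    }


# Made-up scores for four decks, for illustration only.
report = calibration_report(judge_scores=[5, 5, 4, 5], human_scores=[3, 2, 4, 3])
```

A clearly positive `mean_bias` is the numeric form of "high scores despite mediocre slides"; re-running the check after each rubric change is one way to treat the eval as a living artifact.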
21:39 – 25:40
Hill-climbing via prompt improvements: typography, layout, and “AI tells”
The agent prompt is expanded with explicit typography rules, layout/density guidance, and instructions to avoid common AI slide cues (e.g., decorative emojis, thin accent lines). The speaker demonstrates how failures found by evals directly inform prompt changes and yield a cleaner deck.
- Prompt adds font sizes for titles/headers/body/captions
- Layout guidance: concise text, breathing room, left alignment
- Avoid AI tells: no decorative emojis; avoid thin accent lines
- Iterative loop: observe eval failures → change prompt → rerun → compare
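The "rerun → compare" step of the loop can be sketched as a per-metric comparison of two score reports. The metric names below echo the ones mentioned in the talk, but the exact keys, the `LOWER_IS_BETTER` set, and `compare_runs` are assumptions made for illustration.

```python
# Metrics where a smaller value is the better outcome (assumed names).
LOWER_IS_BETTER = {"emoji_count", "small_font_count", "clutter"}


def compare_runs(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, str]:
    """Label each shared metric as improved, regressed, or unchanged."""
    verdicts = {}
    for metric in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[metric] - baseline[metric]
        if metric in LOWER_IS_BETTER:
            delta = -delta  # flip so that a positive delta always means "better"
        if delta > 0:
            verdicts[metric] = "improved"
        elif delta < 0:
            verdicts[metric] = "regressed"
        else:
            verdicts[metric] = "unchanged"
    return verdicts


# Made-up reports from before and after a prompt change.
before = {"emoji_count": 6, "small_font_count": 3, "layout_score": 3.0}
after = {"emoji_count": 0, "small_font_count": 4, "layout_score": 3.0}
verdicts = compare_runs(before, after)
```

Reporting every metric, not just the one the prompt change targeted, is what surfaces side effects such as the `small_font_count` regression in this toy example.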
25:40 – 27:42
New requirement: diagrams on every slide (and new measurement challenges)
The prompt is modified to require at least one generated diagram/chart per slide as an inserted image. The resulting deck looks more grounded, but metric outputs and judge scores still expose grader limitations and ambiguities (e.g., what counts as text-heavy, how to score images).
- Requirement: every slide must include a generated diagram/chart image
- Output improves perceived substance and readability in places
- Scores highlight unresolved issues (text-heavy/small font flags)
- Judge image score gives a number without actionable explanation; graders may need redesign
27:42 – 31:17
Adding a QA loop: self-critique through render–inspect–fix cycles
A QA loop is introduced where the agent assumes there are problems, renders slides to images, inspects each slide, fixes issues, and rechecks at least once. This mirrors coding review loops and improves judge scores, showing how structured self-verification can boost quality.
- QA prompt: treat QA as bug-hunt; assume issues exist
- Process: generate → render to images → inspect → fix → re-render → re-inspect
- Helps catch layout/readability problems before “shipping”
- Judge scores improve, indicating quality gains from verification loops
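In the talk the QA loop is expressed as prompt instructions to the agent, but the control flow can be sketched in Python. Everything here is a stand-in: `qa_loop` and the toy `render`, `inspect`, and `fix` functions are hypothetical, and a deck is modeled as a dict of slide name to word count so the example runs without any rendering tools.

```python
from typing import Callable


def qa_loop(deck: dict, render: Callable, inspect: Callable, fix: Callable,
            max_fixes: int = 3):
    """Render and inspect, then fix and re-check until clean or out of budget."""
    issues = inspect(render(deck))
    fixes = 0
    while issues and fixes < max_fixes:
        deck = fix(deck, issues)
        fixes += 1
        issues = inspect(render(deck))  # every fix is followed by a re-render and re-inspect
    return deck, issues


# Toy stand-ins so the sketch is runnable.
def render(deck: dict) -> dict:
    return dict(deck)  # a real version would produce one image per slide


def inspect(rendered: dict) -> list[str]:
    return [name for name, words in rendered.items() if words > 40]


def fix(deck: dict, issues: list[str]) -> dict:
    return {name: (words // 2 if name in issues else words)
            for name, words in deck.items()}


final_deck, remaining = qa_loop({"slide1": 80, "slide2": 20}, render, inspect, fix)
```

Returning the remaining issues alongside the deck keeps the loop honest: if the fix budget runs out, the caller can see what is still wrong instead of assuming the deck is clean.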
31:17 – 33:48
Model upgrade vs. prompt engineering: switching to Opus for better defaults
The speaker demonstrates that moving to a stronger model (Opus) with a simple baseline prompt can outperform a weaker model with heavier prompt tuning. However, the evals reveal another issue: judges may over-score or mis-score (e.g., giving perfect image scores when no images exist), reinforcing that evaluator design matters.
- Switch from Sonnet to Opus using the original minimal prompt
- Output quality improves: structure, readability, fewer “bad habits” (e.g., emojis)
- Judge anomalies: very high scores; image judge returns 5 despite no images
- Takeaway: better models help, but eval calibration remains essential
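One possible guard against the "5 despite no images" anomaly is a cheap code check in front of the model judge. This is a sketch of that idea, not how the talk's repo handles it; `gated_image_score` and the stand-in judge are hypothetical.

```python
from typing import Callable, Optional


def gated_image_score(image_count: int, judge: Callable[[], int]) -> Optional[int]:
    """Return None (not applicable) when there are no images to judge."""
    if image_count == 0:
        return None  # skip the judge rather than accept a score for nothing
    return judge()


def overconfident_judge() -> int:
    """Stand-in for a model judge that always answers 5."""
    return 5


no_images = gated_image_score(0, overconfident_judge)
with_images = gated_image_score(3, overconfident_judge)
```

Reporting "not applicable" instead of a number keeps a meaningless 5 out of the averages used to compare runs.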
33:48 – 37:51
Making model-based judging reliable: anchoring, explanations, and ordering effects
The speaker critiques rubric-only 0–5 scoring without exemplars or anchors and explains how LLMs rationalize chosen numbers. They recommend having the judge list pros and cons before assigning a score, using multi-judge approaches, and watching for hallucinations, especially in high-stakes domains like legal summarization.
- Rubrics need anchors: define what 0 vs. 5 looks like with examples
- Ordering matters: scoring first biases the rationale (LLM justifies the number)
- Prefer: generate reasons/pros/cons → then decide a final score
- Multi-judge/consensus and adversarial review can reduce noise and hallucinations
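The recommendations above (anchors, reasons before score, several judges) can be combined in a short sketch. The anchor wording, the `SCORE:` reply format, and all function names are assumptions for illustration; the model calls are omitted and the replies are hard-coded.

```python
import re
from statistics import median
from typing import Optional

# Hypothetical anchors describing what specific scores look like.
LAYOUT_ANCHORS = {
    0: "Elements overlap or run off the slide; text is unreadable.",
    3: "Readable, but spacing is uneven and some slides feel crowded.",
    5: "Consistent alignment, clear hierarchy, generous spacing throughout.",
}

SCORE_LINE = re.compile(r"^SCORE:\s*([0-5])\s*$", re.MULTILINE)


def build_judge_prompt(criterion: str, anchors: dict[int, str]) -> str:
    """Ask for pros and cons first and the score last, with anchored levels."""
    lines = [f"Evaluate the slide deck on: {criterion}.", "Anchored levels:"]
    lines += [f"  {score}: {text}" for score, text in sorted(anchors.items())]
    lines += ["First list PROS, then list CONS.",
              "Only after that, end with one final line: SCORE: <0-5>"]
    return "\n".join(lines)


def parse_score(reply: str) -> Optional[int]:
    """Read the last SCORE line; None if the judge did not follow the format."""
    matches = SCORE_LINE.findall(reply)
    return int(matches[-1]) if matches else None


def consensus(replies: list[str]) -> Optional[float]:
    """Median of the parseable scores from several judges."""
    scores = [s for s in map(parse_score, replies) if s is not None]
    return median(scores) if scores else None


prompt = build_judge_prompt("layout", LAYOUT_ANCHORS)
# Made-up judge replies; the third ignores the format and is dropped.
replies = [
    "PROS: clear titles\nCONS: cramped charts\nSCORE: 3",
    "PROS: good spacing\nCONS: small captions\nSCORE: 4",
    "I think it looks great.",
]
final_score = consensus(replies)
```

Putting the score on the last line means the written reasons are produced before the number, rather than generated afterward to justify it; the median keeps one outlier judge from dominating.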
37:51 – 39:15
Closing: evals as a continuous practice for building better agents
The talk concludes by connecting evals to how model providers build better systems: identify failures, iterate, and verify improvements. The same discipline applies to agentic applications—evals ensure changes are informed and measurably beneficial.
- Benchmarks are evals; the same principles apply to product development
- Evals help locate failure modes and track progress over iterations
- Key loop: measure → adjust → re-measure to confirm improvements
- Final message: make evals central to agent development and deployment