Lenny's Podcast
Hamel Husain & Shreya Shankar: How notes turn into AI evals
Manual error analysis on real traces, led by one "benevolent dictator": open coding produces quick notes, axial coding clusters those notes into failure-mode buckets, and narrow binary LLM judges then check for each failure mode.
CHAPTERS
- 0:00 – 1:07
Why evals are the highest-ROI lever in AI product building
The episode opens with a strong claim: building evals is the fastest path to improving AI products. Lenny, Hamel, and Shreya frame evals as practical, addictive, and fundamentally about making products better—not academic perfection.
- Evals as a systematic way to improve AI apps (not just theory)
- Why teams get “addicted” to doing error analysis
- The goal: actionable product improvement, not perfect measurement
- Acknowledgment of controversy and misconceptions around evals
- 1:07 – 5:07
Meet Hamel & Shreya—and why evals suddenly matter to everyone
Lenny introduces Hamel Husain and Shreya Shankar and explains why evals have become a recurring theme across top AI builders. He highlights their influence (and their course) in making evals approachable for PMs and engineers.
- Evals emerging as a must-have skill for product builders
- Context: AI labs and startups investing heavily in evals
- Hamel & Shreya’s role in popularizing a structured approach
- What the episode will cover: show-not-tell walkthrough + best practices
- 5:07 – 10:06
What evals actually are (and why they’re more than “unit tests for prompts”)
Hamel and Shreya define evals as systematic measurement of AI application quality, spanning everything from analytics to automated checks. They clarify that traditional unit tests are only a small slice of the evals spectrum.
- Evals = systematic measurement + iteration loop for LLM apps
- Why vibe checks help early but don’t scale
- Unit tests are part of evals, but don’t cover open-ended behavior
- Evals can include cohort analysis, monitoring metrics, and feedback loops
- 10:06 – 17:09
Live demo setup: traces, observability tools, and a real property-management assistant
Hamel introduces NurtureBoss, an AI assistant for property managers, and shows what modern AI app complexity looks like (channels, tool calls, RAG). He then opens an observability tool to inspect real production-like traces as the foundation for eval work.
- NurtureBoss use case: leasing + customer service + scheduling
- Why traces are essential for understanding AI behavior end-to-end
- Observability tooling is interchangeable (Braintrust, Phoenix/Arize, LangSmith)
- The system prompt and tool calls reveal what the model “saw” and did
- 17:09 – 23:55
Open coding in practice: writing quick error notes from real conversations
They walk through multiple traces and show the first core workflow: write simple notes on what went wrong. The emphasis is speed, sampling, and capturing the first upstream issue rather than cataloging everything in one trace.
- Write notes directly on traces to capture failure observations
- Sample traces; don’t try to review everything
- Focus on the first/most upstream problem, then move on
- Early notes can be informal, but should stay interpretable later
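The sampling-and-notes workflow above can be sketched in a few lines. This is a minimal illustration, not the tooling shown in the episode: the `Trace` class, the `sample_for_review` helper, the fixed seed, and the example note are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    """Hypothetical stand-in for one logged conversation."""
    trace_id: str
    note: str = ""  # free-form open-coding note, empty until reviewed


def sample_for_review(traces, k, seed=0):
    """Return a reproducible random sample of at most k traces."""
    rng = random.Random(seed)  # fixed seed so the same batch can be re-drawn
    return rng.sample(traces, min(k, len(traces)))


# Made-up data: 500 logged traces, of which 100 are pulled for manual review.
traces = [Trace(f"t{i}") for i in range(500)]
batch = sample_for_review(traces, k=100)

# Record only the first upstream problem seen in the trace, then move on.
batch[0].note = "Offered a tour time without checking availability"
```

Keeping the note as a plain string field matches the "informal but interpretable" advice: no taxonomy is needed at this stage.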
- 23:55 – 28:09
Why LLMs can’t replace humans during initial error analysis + “benevolent dictator” ownership
Shreya explains why asking an LLM to judge early-stage traces usually fails: it lacks crucial product and domain context. Hamel introduces the “benevolent dictator” model—assign one trusted domain expert to drive coding decisions so the process stays tractable.
- LLMs often miss product-context errors and say “looks good”
- Humans must do the first-pass sensemaking of failures
- “Benevolent dictator” avoids slow committee-driven coding
- The owner should be the domain expert (often the PM)
- 28:09 – 31:42
How many traces to review: ‘100’ as a starter and theoretical saturation as the real stop rule
They address the most common operational question: how many examples are enough. The practical recommendation is to start with ~100, but the principled answer is to stop when you’re no longer learning new failure modes (theoretical saturation).
- Why 100 is a useful psychological target (not a magic number)
- The real criterion: stop when new traces stop yielding new insights
- The term: theoretical saturation from qualitative research
- Your required sample size depends on app complexity and analyst skill
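One way to make the saturation stop rule concrete is to count how many consecutive traces have added no new failure mode. The sketch below assumes each reviewed trace has been reduced to a set of failure-mode names; the `window` threshold is an arbitrary illustrative parameter, not a number given in the episode.

```python
def traces_until_saturation(observed_modes, window=20):
    """Return how many traces were reviewed when `window` consecutive
    traces added no new failure mode, or None if that never happened.

    observed_modes: iterable of sets, one per reviewed trace, holding the
    failure modes noted for that trace (an empty set means no failure).
    """
    seen = set()
    since_new = 0  # consecutive traces without a new failure mode
    for reviewed, modes in enumerate(observed_modes, start=1):
        new_modes = set(modes) - seen
        if new_modes:
            seen |= new_modes
            since_new = 0
        else:
            since_new += 1
        if since_new >= window:
            return reviewed
    return None


# Made-up review log: failure modes "a" and "b" appear early, "c" appears late.
log = [{"a"}, {"b"}, set(), {"a"}, set(), {"c"}, set(), set(), set()]
stop_at = traces_until_saturation(log, window=3)
```

With `window=3` the rule fires after the fifth trace, before mode "c" shows up, which is why a small window is risky for complex apps.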
- 31:42 – 40:13
From messy notes to structured failure modes: axial coding with LLM help
With open-coded notes collected, they show how LLMs can help synthesize and cluster them into higher-level categories (axial codes). They stress that LLM output is a first draft—humans must refine labels into actionable failure modes.
- Axial codes = reusable failure-mode categories derived from open codes
- LLMs are strong at summarization/clustering once humans supply raw notes
- Prompting with “open codes” and “axial codes” leverages established terminology
- Human review is required to avoid overly generic or unactionable categories
- 40:13 – 44:32
Operationalizing categorization: spreadsheets, automated labeling, and ‘none of the above’
Hamel demonstrates turning refined categories into automated labeling in Google Sheets using LLM formulas. Shreya adds a practical robustness trick: include a “none of the above” bucket to detect missing categories and force taxonomy iteration.
- Use simple tooling (Sheets + AI formulas) to label at scale
- Open codes must be specific enough for reliable automated categorization
- Add “none of the above” to reveal gaps in your category set
- This workflow becomes fast and repeatable after the first setup
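The “none of the above” trick can be sketched outside a spreadsheet as well. The example below assumes some labeler (an LLM formula or otherwise) has already returned a raw string per note; it only shows how to force that string into a closed category set with an explicit fallback and how to measure the fallback rate. The category names and helper functions are hypothetical.

```python
NONE_OF_THE_ABOVE = "none of the above"


def normalize_label(raw, categories):
    """Map a raw labeler output onto a known category, else the fallback."""
    cleaned = raw.strip().lower()
    for category in categories:
        if cleaned == category.lower():
            return category
    return NONE_OF_THE_ABOVE


def none_share(labels):
    """Fraction of labels that fell into the fallback bucket."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == NONE_OF_THE_ABOVE) / len(labels)


# Hypothetical axial codes and raw labeler outputs.
categories = ["Human handoff issue", "Scheduling error"]
raw_outputs = [" human handoff issue ", "Scheduling error", "tone problem", "unclear"]
labels = [normalize_label(raw, categories) for raw in raw_outputs]
gap_rate = none_share(labels)
```

A high `gap_rate` suggests the category set is missing something and the taxonomy needs another pass.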
- 44:32 – 47:14
The results: pivot tables, prioritization, and deciding what deserves an eval
They reveal the payoff: once traces are categorized, a simple pivot table highlights the most common failure modes. From there, teams decide which issues are obvious fixes vs. candidates for automated evaluators (code-based or LLM judges).
- Counting (basic aggregation) is a surprisingly powerful analytics tool
- Frequency helps prioritize, but severity/risk can override counts
- Not every problem needs a formal eval; some are straightforward fixes
- Evals should be grounded in real observed failures, not hypothetical ones
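The pivot-table step is plain counting. A minimal equivalent using the Python standard library, with made-up labels, might look like this; the function name and the category names are illustrative only.

```python
from collections import Counter


def failure_mode_table(labels):
    """Return (failure_mode, count, share) rows, most frequent first."""
    total = len(labels)
    return [(mode, n, n / total) for mode, n in Counter(labels).most_common()]


# Made-up labels for six categorized traces.
labels = (
    ["Human handoff issue"] * 3
    + ["Scheduling error"] * 2
    + ["Formatting issue"]
)
table = failure_mode_table(labels)
```

The table gives frequency only. Severity or risk still has to be weighed by a person, since a rare failure can matter more than a common one.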
- 47:14 – 54:45
Choosing evaluator types: code-based checks vs. LLM-as-judge for complex failures
They introduce the two major evaluator approaches: deterministic code-based evaluators when possible, and LLM-as-judge for nuanced, subjective failure modes. The key is narrowing scope: judges should evaluate one specific failure and output a binary pass/fail.
- Prefer code-based evaluators when the signal is deterministic (format, length, schema)
- Use LLM-as-judge for complex, context-dependent failure modes
- Scope judges narrowly: one failure mode at a time
- Make outputs binary to avoid vague scores that slow teams down
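A code-based evaluator for a deterministic signal can be very small. The sketch below assumes, purely for illustration, that the assistant is expected to return a JSON object with a `reply` field and stay under a length limit; neither the field name nor the limit comes from the episode.

```python
import json


def passes_format_check(output, max_chars=600, required_keys=("reply",)):
    """Binary check: True only if the output is short enough, parses as a
    JSON object, and contains every required key."""
    if len(output) > max_chars:
        return False
    try:
        payload = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and all(key in payload for key in required_keys)


ok = passes_format_check('{"reply": "Tours are available Tuesday."}')
bad_json = passes_format_check("Tours are available Tuesday.")
missing_key = passes_format_check('{"text": "hi"}')
```

Like the narrow judges described above, the check returns a single pass/fail for one failure mode rather than a score.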
- 54:45 – 1:00:56
Building a binary LLM judge—and validating it against human judgment
Hamel shares a concrete judge prompt for a human-handoff failure mode and explains why teams must validate judges before trusting them. Shreya and Hamel show how to compare judge outputs to human labels and why raw “agreement %” can be misleading without error breakdowns.
- Judge prompt design: explicit rules + true/false output
- Don’t treat judge outputs as gospel; validate against human labels first
- Agreement % can hide rare-event failures; inspect false positives/negatives
- Use a confusion-matrix-style view to drive prompt iteration and alignment
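The point about agreement hiding rare failures is easy to see with a small worked example. The sketch below compares human labels with judge labels (both booleans, where `True` means "failure present") and reports the confusion-matrix cells alongside raw agreement. The function name and the data are made up for illustration.

```python
def judge_alignment(human, judge):
    """Compare binary judge outputs with human labels (True = failure present)."""
    if len(human) != len(judge) or not human:
        raise ValueError("human and judge must be non-empty and the same length")
    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h and j)          # judge caught a real failure
    tn = sum(1 for h, j in pairs if not h and not j)  # both say no failure
    fp = sum(1 for h, j in pairs if not h and j)      # judge flagged a good trace
    fn = sum(1 for h, j in pairs if h and not j)      # judge missed a real failure
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "agreement": (tp + tn) / len(pairs),
        "true_positive_rate": tp / (tp + fn) if (tp + fn) else None,
        "true_negative_rate": tn / (tn + fp) if (tn + fp) else None,
    }


# Made-up case: 1 real failure in 10 traces, and a judge that always says "pass".
human = [True] + [False] * 9
judge = [False] * 10
report = judge_alignment(human, judge)
```

Here the judge agrees with the human on 90% of traces while catching none of the actual failures, which is the situation the error breakdown is meant to expose.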
- 1:00:56 – 1:09:57
Evals as ‘new PRDs’ + why rubrics drift (and must evolve)
Lenny connects judge prompts to PRDs: they encode product expectations in a runnable, continuously enforceable form. Shreya adds research context on “criteria drift”—teams can’t fully define rubrics upfront because their understanding changes as they see more model behavior.
- LLM judge prompts resemble executable product requirements
- Error analysis uncovers expectations you didn’t know you had
- Criteria/rubrics shift as people review more outputs (criteria drift)
- PRDs remain useful, but eval definitions must evolve with real-world data
- 1:09:57 – 1:24:13
The evals debate: vibes, dogfooding, A/B tests, and why definitions cause confusion
They unpack the social-media controversy: some teams claim to ship on vibes, while others insist on eval rigor. The guests argue much of the disagreement comes from narrow definitions of evals, plus negative experiences from poorly built (especially Likert-scale) LLM judges, and they explain when dogfooding works (and when it doesn’t).
- Most debates stem from people defining “evals” too narrowly
- Teams get burned by misaligned LLM judges and then reject evals entirely
- Dogfooding can work for coding agents (users are domain experts) but often fails elsewhere
- A/B tests are part of evals, but should be grounded in real error analysis, not guesses
- Foundation-model benchmarks don’t reliably predict product-specific failure modes
- 1:24:13 – 1:46:32
Misconceptions, practical tips, time cost, course overview, and lightning round wrap-up
They close with the most common misconceptions (e.g., “just buy a tool” or “AI can eval itself”), then offer pragmatic advice: don’t fear looking at data, use LLMs to organize work (not replace judgment), and reduce friction by building lightweight internal tools. The episode ends with time expectations, a walkthrough of their course resources, and a quick lightning round before final goodbyes.
- Misconception: tools/AI can fully automate evals without human context
- Tip: the point isn’t perfect evals; it’s improving the product quickly
- Use LLMs heavily for synthesis and organization, but keep humans in the loop
- Practical time model: a few days upfront, then ~30 minutes/week maintenance
- Course highlights: end-to-end lifecycle, custom interfaces, cost optimization, book + bot + community