Lenny's Podcast
Hamel Husain & Shreya Shankar: How notes turn into AI evals
Manual error analysis on real traces, with one 'benevolent dictator' doing the labeling: open coding clusters notes into failure-mode buckets, then narrow, binary LLM judges check each one.
At a glance
WHAT IT’S REALLY ABOUT
AI evals: The new must-have superpower for serious product builders
- The episode argues that systematic AI evals—structured ways to measure and improve LLM applications—are becoming a core skill for PMs and engineers, comparable to knowing how to write PRDs or run A/B tests.
- Hamel Husain and Shreya Shankar walk through a concrete, end‑to‑end eval workflow using a real-world real-estate assistant: manual error analysis on traces, open coding, clustering failures with LLMs, and then building focused automated evaluators (code-based and LLM-as-judge).
- They emphasize that good evals start with looking at real product data, not with abstract benchmarks or generic tools, and that you only need a small number of well-chosen evals to unlock large product gains.
- The conversation also unpacks common misconceptions and Twitter drama around evals, arguing that ‘vibes’ and A/B tests are not alternatives but sit inside a broader, data-science‑driven eval practice that top AI teams quietly rely on.
IDEAS WORTH REMEMBERING
5 ideas
Start with manual error analysis, not with writing tests.
Before building any evals, inspect real traces from your AI product, write quick notes about what went wrong (open coding), and look for upstream errors. This surfaces the real failure modes, rather than what you imagine might be wrong.
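A minimal sketch of what this first pass can look like, assuming traces are stored as one JSON file per conversation and notes are collected in a CSV; the file layout and field names here are illustrative, not from the episode:

```python
import csv
import json
from pathlib import Path

# Assumed layout: one JSON file per trace, each holding the full
# conversation between the user and the real-estate assistant.
TRACE_DIR = Path("traces")
NOTES_FILE = Path("open_coding_notes.csv")

def review_traces():
    """Walk through traces one at a time and capture free-form open-coding notes."""
    with NOTES_FILE.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["trace_id", "note"])
        for trace_path in sorted(TRACE_DIR.glob("*.json")):
            trace = json.loads(trace_path.read_text())
            # Print the conversation so the reviewer can read it end to end.
            for turn in trace.get("messages", []):
                print(f"{turn['role']}: {turn['content']}")
            # The reviewer (the 'benevolent dictator') writes a short,
            # unstructured note about the first upstream error they see.
            note = input("Note on what went wrong (blank = looks fine): ")
            writer.writerow([trace_path.stem, note.strip()])
            print("-" * 60)

if __name__ == "__main__":
    review_traces()
```

The point of the tooling is deliberately modest: force yourself to read real traces in order and write one honest note per trace before any test code exists.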
Use a ‘benevolent dictator’ to own qualitative judgments.
Avoid design-by-committee for labeling and error notes; appoint a single domain expert—often the PM—to make final calls on what counts as ‘good’ or ‘bad’. This keeps the process fast, consistent, and tractable.
Cluster your notes into a small set of failure modes.
Feed your open-coded notes into an LLM to synthesize axial codes (categories like ‘human handoff issues’ or ‘conversational flow problems’), then refine them into specific, actionable buckets and count their frequency to prioritize what to fix.
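One way this step might look in code, assuming the official openai Python client and the notes CSV from the sketch above; the model name and prompt wording are illustrative:

```python
import csv
from collections import Counter

from openai import OpenAI  # assumes the official openai package is installed

client = OpenAI()
MODEL = "gpt-4o"  # illustrative; any capable model works

def load_notes(path="open_coding_notes.csv"):
    with open(path, newline="") as f:
        return [row["note"] for row in csv.DictReader(f) if row["note"]]

def synthesize_failure_modes(notes):
    """Ask the LLM to group free-form notes into a small set of axial codes."""
    prompt = (
        "Here are open-coded notes about failures of a real-estate AI assistant.\n"
        "Group them into at most 8 specific, actionable failure-mode categories.\n"
        "Return one category name per line.\n\n"
        + "\n".join(f"- {n}" for n in notes)
    )
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    lines = resp.choices[0].message.content.splitlines()
    return [line.strip("- ").strip() for line in lines if line.strip()]

def count_by_category(notes, categories):
    """Assign each note to one category, then count frequency to prioritize fixes."""
    counts = Counter()
    for note in notes:
        prompt = (
            "Assign this failure note to exactly one of these categories, "
            "answering with the category name only.\n"
            f"Categories: {', '.join(categories)}\nNote: {note}"
        )
        resp = client.chat.completions.create(
            model=MODEL, messages=[{"role": "user", "content": prompt}]
        )
        counts[resp.choices[0].message.content.strip()] += 1
    return counts

if __name__ == "__main__":
    notes = load_notes()
    categories = synthesize_failure_modes(notes)
    for category, n in count_by_category(notes, categories).most_common():
        print(f"{n:3d}  {category}")
```

The frequency counts are what make the buckets actionable: the most common failure mode is usually the first one worth an automated evaluator.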
Reserve LLM-as-judge evals for complex, subjective failures.
Simple issues (JSON format, length, presence of a field) should be checked with code; use LLM judges only for nuanced behaviors (e.g., whether a handoff to a human was warranted), keeping each judge narrowly scoped and binary (pass/fail).
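A hedged sketch of the split between the two evaluator types, again assuming the openai client; the judge prompt and the handoff criterion wording are illustrative:

```python
import json

from openai import OpenAI  # assumed client; prompt wording is illustrative

client = OpenAI()

def check_json_and_length(output: str, max_chars: int = 2000) -> bool:
    """Simple failure modes belong in plain code: valid JSON, within length."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return len(output) <= max_chars

HANDOFF_JUDGE_PROMPT = """You are evaluating a real-estate AI assistant.
Question: did the assistant hand the conversation off to a human agent
when (and only when) a handoff was actually warranted?
Answer with exactly one word: PASS or FAIL.

Conversation:
{conversation}
"""

def judge_handoff(conversation: str) -> bool:
    """Nuanced, subjective failure mode: a narrowly scoped, binary LLM judge."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice
        messages=[{
            "role": "user",
            "content": HANDOFF_JUDGE_PROMPT.format(conversation=conversation),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")
```

Keeping each judge scoped to a single question with a pass/fail answer is what makes it possible to validate the judge itself, as the next idea describes.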
Always validate your LLM judges against human labels.
Don’t trust a judge just because the prompt looks good: compare its outputs to your human-coded labels via a confusion matrix, and iterate until misalignments (false positives/negatives) are acceptably low instead of relying on a single ‘agreement’ percentage.
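A small sketch of that validation step; the example labels are made up, but the structure (separate counts for false positives and false negatives rather than one agreement number) is the point:

```python
def confusion_matrix(human_labels, judge_labels):
    """Compare binary judge outputs against the human 'ground truth' labels.

    Both inputs are lists of booleans (True = pass). Inspecting false
    positives and false negatives separately is more informative than a
    single agreement percentage.
    """
    tp = sum(h and j for h, j in zip(human_labels, judge_labels))
    tn = sum(not h and not j for h, j in zip(human_labels, judge_labels))
    fp = sum(not h and j for h, j in zip(human_labels, judge_labels))  # judge too lenient
    fn = sum(h and not j for h, j in zip(human_labels, judge_labels))  # judge too harsh
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Example: 10 traces labeled by the domain expert vs. the LLM judge.
human = [True, True, False, True, False, True, True, False, True, True]
judge = [True, False, False, True, True, True, True, False, True, True]
print(confusion_matrix(human, judge))  # {'TP': 6, 'TN': 2, 'FP': 1, 'FN': 1}
```

If the judge is too lenient (high FP) or too harsh (high FN) on the cases you care about, iterate on its prompt and re-run this comparison before trusting it at scale.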
WORDS WORTH SAVING
5 quotes
To build great AI products, you need to be really good at building evals. It's the highest ROI activity you can engage in.
— Hamel Husain
The goal is not to do evals perfectly. It's to actionably improve your product.
— Shreya Shankar
You can appoint one person whose taste you trust. It should be the person with domain expertise. Oftentimes, it is the product manager.
— Hamel Husain
People have been burned by evals in the past... They did evals badly, then they didn't trust it anymore, and then they're like, 'Oh, I'm anti-evals.'
— Shreya Shankar
There’s no world in which they are just being like, 'I made Claude Code. I'm never looking at anything.' All of this is evals.
— Shreya Shankar