Lenny's Podcast
Hamel Husain & Shreya Shankar: How notes turn into AI evals
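Manual error analysis on real traces, with a single "benevolent dictator" doing the labeling: open coding produces free-form notes on errors, axial coding clusters those notes into failure buckets, and narrow binary LLM judges then check for each failure mode.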
Lenny Rachitsky (host), Hamel Husain (guest), Shreya Shankar (guest)
CHAPTERS
- 0:00 – 4:57: Introduction to Hamel and Shreya
- 4:57 – 9:56: What are evals?
- 9:56 – 16:51: Demo: Examining real traces from a property management AI assistant
- 16:51 – 23:54: Writing notes on errors
- 23:54 – 25:16: Why LLMs can’t replace humans in the initial error analysis
- 25:16 – 28:07: The concept of a “benevolent dictator” in the eval process
- 28:07 – 31:39: Theoretical saturation: when to stop
- 31:39 – 44:39: Using axial codes to help categorize and synthesize error notes
- 44:39 – 46:06: The results
- 46:06 – 48:31: Building an LLM-as-judge to evaluate specific failure modes
- 48:31 – 52:10: The difference between code-based evals and LLM-as-judge
- 52:10 – 54:45: Example: LLM-as-judge
- 54:45 – 1:00:51: Testing your LLM judge against human judgment
- 1:00:51 – 1:05:09: Why evals are the new PRDs for AI products
- 1:05:09 – 1:07:41: How many evals you actually need
- 1:07:41 – 1:09:57: What comes after evals
- 1:09:57 – 1:15:15: The great evals debate
- 1:15:15 – 1:18:23: Why dogfooding isn’t enough for most AI products
- 1:18:23 – 1:22:28: OpenAI’s Statsig acquisition
- 1:22:28 – 1:23:02: Tips and tricks for implementing evals effectively
- 1:23:02 – 1:24:13: The Claude Code controversy and the importance of context
- 1:24:13 – 1:30:37: Common misconceptions around evals
- 1:30:37 – 1:33:38: The time investment
- 1:33:38 – 1:37:57: Overview of their comprehensive evals course
- 1:37:57 – 1:46:32: Lightning round and final thoughts