Hamel Husain & Shreya Shankar: How notes turn into AI evals

Manual error analysis on real traces, with one benevolent dictator doing the labeling: open coding produces free-form notes, axial coding clusters them into failure-mode buckets, and narrow binary LLM judges then check for each failure mode.

Lenny Rachitsky (host), Hamel Husain (guest), Shreya Shankar (guest)

Sep 25, 2025 · 1h 46m

CHAPTERS

0:00 – 4:57
Introduction to Hamel and Shreya
4:57 – 9:56
What are evals?
9:56 – 16:51
Demo: Examining real traces from a property management AI assistant
16:51 – 23:54
Writing notes on errors
23:54 – 25:16
Why LLMs can’t replace humans in the initial error analysis
25:16 – 28:07
The concept of a “benevolent dictator” in the eval process
28:07 – 31:39
Theoretical saturation: when to stop
31:39 – 44:39
Using axial codes to help categorize and synthesize error notes
44:39 – 46:06
The results
46:06 – 48:31
Building an LLM-as-judge to evaluate specific failure modes
48:31 – 52:10
The difference between code-based evals and LLM-as-judge
52:10 – 54:45
Example: LLM-as-judge
54:45 – 1:00:51
Testing your LLM judge against human judgment
1:00:51 – 1:05:09
Why evals are the new PRDs for AI products
1:05:09 – 1:07:41
How many evals you actually need
1:07:41 – 1:09:57
What comes after evals
1:09:57 – 1:15:15
The great evals debate
1:15:15 – 1:18:23
Why dogfooding isn’t enough for most AI products
1:18:23 – 1:22:28
OpenAI’s Statsig acquisition
1:22:28 – 1:23:02
Tips and tricks for implementing evals effectively
1:23:02 – 1:24:13
The Claude Code controversy and the importance of context
1:24:13 – 1:30:37
Common misconceptions around evals
1:30:37 – 1:33:38
The time investment
1:33:38 – 1:37:57
Overview of their comprehensive evals course
1:37:57 – 1:46:32
Lightning round and final thoughts
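
The workflow named in the summary and chapter titles (notes grouped into failure-mode buckets, then a binary LLM judge tested against human judgment) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the trace records, the bucket names, the `judge_fail` field, and the helper functions are hypothetical and are not taken from the episode. It only shows the bookkeeping: count how often each bucket appears, then compare a judge's pass/fail verdicts with the human labels for one failure mode.

```python
from collections import Counter

# Hypothetical hand-labeled traces (made-up data, not from the episode).
# "bucket" is the failure-mode category a human assigned after reading the
# trace (None means no error was found). "judge_fail" is the binary verdict
# of an LLM judge that checks only for the "missed_handoff" failure mode.
TRACES = [
    {"id": 1, "bucket": "missed_handoff", "judge_fail": True},
    {"id": 2, "bucket": "missed_handoff", "judge_fail": False},
    {"id": 3, "bucket": "formatting", "judge_fail": False},
    {"id": 4, "bucket": None, "judge_fail": False},
    {"id": 5, "bucket": "missed_handoff", "judge_fail": True},
    {"id": 6, "bucket": None, "judge_fail": True},
]


def bucket_counts(traces):
    """Count how many traces fall into each failure-mode bucket."""
    return Counter(t["bucket"] for t in traces if t["bucket"] is not None)


def judge_agreement(traces, mode):
    """Compare the judge's verdicts with human labels for one failure mode.

    Returns the true positive rate (share of human-labeled failures the judge
    caught) and the true negative rate (share of human-labeled passes the
    judge also passed). A rate is None when its denominator is zero.
    """
    tp = fn = tn = fp = 0
    for t in traces:
        human_fail = t["bucket"] == mode
        if human_fail and t["judge_fail"]:
            tp += 1
        elif human_fail:
            fn += 1
        elif t["judge_fail"]:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else None
    tnr = tn / (tn + fp) if (tn + fp) else None
    return {"tpr": tpr, "tnr": tnr}


if __name__ == "__main__":
    print(bucket_counts(TRACES).most_common())
    print(judge_agreement(TRACES, "missed_handoff"))
```

Reporting the two rates separately, rather than a single agreement percentage, is a design choice in this sketch: when failures are rare, a judge that always says "pass" can still score high on raw agreement.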
