Lenny's Podcast

Hamel Husain & Shreya Shankar: How notes turn into AI evals

Manual error analysis on real traces, with one benevolent dictator doing the labeling: open coding produces notes on each error, axial codes cluster those notes into buckets, and narrow binary LLM judges then check for each failure mode.

Lenny Rachitsky (host), Hamel Husain (guest), Shreya Shankar (guest)
Sep 25, 2025 · 1h 46m

CHAPTERS

  1. 0:00 – 4:57

    Introduction to Hamel and Shreya

  2. 4:57 – 9:56

    What are evals?

  3. 9:56 – 16:51

    Demo: Examining real traces from a property management AI assistant

  4. 16:51 – 23:54

    Writing notes on errors

  5. 23:54 – 25:16

    Why LLMs can’t replace humans in the initial error analysis

  6. 25:16 – 28:07

    The concept of a “benevolent dictator” in the eval process

  7. 28:07 – 31:39

    Theoretical saturation: when to stop

  8. 31:39 – 44:39

    Using axial codes to help categorize and synthesize error notes

  9. 44:39 – 46:06

    The results

  10. 46:06 – 48:31

    Building an LLM-as-judge to evaluate specific failure modes

  11. 48:31 – 52:10

    The difference between code-based evals and LLM-as-judge

  12. 52:10 – 54:45

    Example: LLM-as-judge

  13. 54:45 – 1:00:51

    Testing your LLM judge against human judgment

  14. 1:00:51 – 1:05:09

    Why evals are the new PRDs for AI products

  15. 1:05:09 – 1:07:41

    How many evals you actually need

  16. 1:07:41 – 1:09:57

    What comes after evals

  17. 1:09:57 – 1:15:15

    The great evals debate

  18. 1:15:15 – 1:18:23

    Why dogfooding isn’t enough for most AI products

  19. 1:18:23 – 1:22:28

    OpenAI’s Statsig acquisition

  20. 1:22:28 – 1:23:02

    Tips and tricks for implementing evals effectively

  21. 1:23:02 – 1:24:13

    The Claude Code controversy and the importance of context

  22. 1:24:13 – 1:30:37

    Common misconceptions around evals

  23. 1:30:37 – 1:33:38

    The time investment

  24. 1:33:38 – 1:37:57

    Overview of their comprehensive evals course

  25. 1:37:57 – 1:46:32

    Lightning round and final thoughts
