Lenny's Podcast

Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar

Lenny Rachitsky, Hamel Husain, and Shreya Shankar on AI evals: The new must-have superpower for serious product builders.

Lenny Rachitsky (host), Hamel Husain (guest), Shreya Shankar (guest)
Sep 25, 2025 · 1h 46m

Definition and purpose of AI evals for LLM applications
Error analysis and open coding of traces (manual, human-led review)
Using LLMs to cluster errors into actionable failure modes (axial codes)
Designing automated evaluators: code-based tests vs. LLM-as-judge
Aligning LLM judges with human judgments and avoiding bad metrics
Operationalizing evals: unit tests, monitoring, dashboards, and flywheels
Debates and misconceptions around evals, vibes, and A/B testing
AI-generated summary based on the episode transcript.

In this episode of Lenny's Podcast, host Lenny Rachitsky talks with Hamel Husain and Shreya Shankar about AI evals, the new must-have superpower for serious product builders. The episode argues that systematic AI evals (structured ways to measure and improve LLM applications) are becoming a core skill for PMs and engineers, comparable to knowing how to write PRDs or run A/B tests.

At a glance

WHAT IT’S REALLY ABOUT

AI evals: The new must-have superpower for serious product builders

  1. The episode argues that systematic AI evals—structured ways to measure and improve LLM applications—are becoming a core skill for PMs and engineers, comparable to knowing how to write PRDs or run A/B tests.
  2. Hamel Husain and Shreya Shankar walk through a concrete, end‑to‑end eval workflow using a real real-estate assistant: manual error analysis on traces, open coding, clustering failures with LLMs, and then building focused automated evaluators (code-based and LLM-as-judge).
  3. They emphasize that good evals start with looking at real product data, not with abstract benchmarks or generic tools, and that you only need a small number of well-chosen evals to unlock large product gains.
  4. The conversation also unpacks common misconceptions and Twitter drama around evals, arguing that ‘vibes’ and A/B tests are not alternatives but sit inside a broader, data-science‑driven eval practice that top AI teams quietly rely on.

IDEAS WORTH REMEMBERING

5 ideas

Start with manual error analysis, not with writing tests.

Before building any evals, inspect real traces from your AI product, write quick notes about what went wrong (open coding), and look for upstream errors. This surfaces the real failure modes, rather than what you imagine might be wrong.

Use a ‘benevolent dictator’ to own qualitative judgments.

Avoid design-by-committee for labeling and error notes; appoint a single domain expert—often the PM—to make final calls on what counts as ‘good’ or ‘bad’. This keeps the process fast, consistent, and tractable.

Cluster your notes into a small set of failure modes.

Feed your open-coded notes into an LLM to synthesize axial codes (categories like ‘human handoff issues’ or ‘conversational flow problems’), then refine them into specific, actionable buckets and count their frequency to prioritize what to fix.
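
The counting step can be sketched in a few lines of Python. This is a minimal illustration, not the guests' tooling: the trace IDs, notes, and the `rank_failure_modes` helper are hypothetical, and it assumes each note has already been assigned one failure mode (whether by an LLM or by hand).

```python
from collections import Counter

# Hypothetical open-coded notes, each already tagged with one failure mode.
# All IDs, notes, and category names below are made up for illustration.
coded_notes = [
    {"trace_id": "t1", "note": "User asked for a person, got no handoff", "failure_mode": "human handoff issue"},
    {"trace_id": "t2", "note": "Handoff offered too late", "failure_mode": "human handoff issue"},
    {"trace_id": "t3", "note": "Assistant repeated the same question", "failure_mode": "conversational flow problem"},
    {"trace_id": "t4", "note": "Never escalated a complaint", "failure_mode": "human handoff issue"},
    {"trace_id": "t5", "note": "Abrupt topic change mid-answer", "failure_mode": "conversational flow problem"},
    {"trace_id": "t6", "note": "Reply was not valid JSON", "failure_mode": "output format error"},
]


def rank_failure_modes(notes):
    """Return (failure_mode, count) pairs, most frequent first."""
    counts = Counter(note["failure_mode"] for note in notes)
    return counts.most_common()


for mode, count in rank_failure_modes(coded_notes):
    print(f"{count:>3}  {mode}")
```

Sorting by frequency gives a first-pass priority list; it does not capture severity, which still takes human judgment.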

Reserve LLM-as-judge evals for complex, subjective failures.

Simple issues (JSON format, length, presence of a field) should be checked with code; use LLM judges only for nuanced behaviors (e.g., whether a handoff to a human was warranted), keeping each judge narrowly scoped and binary (pass/fail).
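
As a rough sketch of what a code-based check might look like, the function below tests the three simple properties named above using only the Python standard library. The function name, the `answer` field, and the 500-character limit are assumptions chosen for illustration; they are not from the episode.

```python
import json


def check_response(raw, required_fields=("answer",), max_chars=500):
    """Run simple deterministic checks on one raw model response.

    Returns a dict of pass/fail booleans. The field names and the
    length limit are illustrative defaults, not recommended values.
    """
    results = {
        "valid_json": False,
        "has_required_fields": False,
        "within_length": len(raw) <= max_chars,
    }
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        # Not parseable, so the field check cannot pass either.
        return results
    results["valid_json"] = True
    results["has_required_fields"] = isinstance(payload, dict) and all(
        field in payload for field in required_fields
    )
    return results


print(check_response('{"answer": "The open house is on Saturday."}'))
print(check_response("Sure! Here is your answer..."))
```

Checks like these are cheap and deterministic, which is why they are a better fit than an LLM judge for format-level failures.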

Always validate your LLM judges against human labels.

Don’t trust a judge just because the prompt looks good: compare its outputs to your human-coded labels via a confusion matrix, and iterate until misalignments (false positives/negatives) are acceptably low instead of relying on a single ‘agreement’ percentage.
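
To make the point about a single agreement percentage concrete, here is a minimal sketch that builds the four confusion-matrix cells from human and judge labels. The function names and the toy labels are hypothetical; it assumes both label lists use the strings "pass" and "fail" and are aligned trace by trace.

```python
from collections import Counter


def judge_confusion(human_labels, judge_labels):
    """Count the four (human, judge) outcome cells for pass/fail labels."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    cells = Counter(zip(human_labels, judge_labels))
    return {
        "both_pass": cells[("pass", "pass")],
        "both_fail": cells[("fail", "fail")],
        "judge_missed_failure": cells[("fail", "pass")],
        "judge_false_alarm": cells[("pass", "fail")],
    }


def agreement_rates(m):
    """Overall agreement plus agreement split by the human label."""
    total = sum(m.values())
    human_pass = m["both_pass"] + m["judge_false_alarm"]
    human_fail = m["both_fail"] + m["judge_missed_failure"]
    return {
        "overall_agreement": (m["both_pass"] + m["both_fail"]) / total if total else None,
        "agreement_on_human_pass": m["both_pass"] / human_pass if human_pass else None,
        "agreement_on_human_fail": m["both_fail"] / human_fail if human_fail else None,
    }


# Toy data: the human marked 1 of 10 traces as a failure,
# and a lazy judge passes everything.
human = ["pass"] * 9 + ["fail"]
judge = ["pass"] * 10
matrix = judge_confusion(human, judge)
print(matrix)
print(agreement_rates(matrix))
```

In this toy case the judge agrees with the human on 90% of traces overall yet catches none of the failures, which is exactly what a lone agreement number would hide when failures are rare.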

WORDS WORTH SAVING

5 quotes

To build great AI products, you need to be really good at building evals. It's the highest ROI activity you can engage in.

Hamel Husain

The goal is not to do evals perfectly. It's to actionably improve your product.

Shreya Shankar

You can appoint one person whose taste you trust. It should be the person with domain expertise. Oftentimes, it is the product manager.

Hamel Husain

People have been burned by evals in the past... They did evals badly, then they didn't trust it anymore, and then they're like, 'Oh, I'm anti-evals.'

Shreya Shankar

There’s no world in which they are just being like, 'I made Claude Code. I'm never looking at anything.' All of this is evals.

Shreya Shankar

QUESTIONS ANSWERED IN THIS EPISODE

5 questions

How do I practically choose which 4–7 failure modes deserve their own LLM-as-judge eval in my specific product?

What’s a good rule of thumb for when a failure should be addressed with a prompt change versus building a dedicated evaluator?

How can smaller teams without dedicated data scientists build lightweight tools and workflows for error analysis and trace review?

In products where ‘quality’ is highly subjective (e.g., creativity, tone), how do you define a binary pass/fail rubric without overconstraining the model?

How should eval strategies evolve as a product matures from MVP with low traffic to a scaled product serving millions of users?
