How I AI

Evals, error analysis, and better prompts: A systematic approach to improving your AI products

Hamel Husain, an AI consultant and educator, shares his systematic approach to improving AI product quality through error analysis, evaluation frameworks, and prompt engineering. In this episode, he demonstrates how product teams can move beyond “vibe checking” their AI systems to implement data-driven quality improvement processes that identify and fix the most common errors. Using real examples from client work with Nurture Boss (an AI assistant for property managers), Hamel walks through practical techniques that product managers can implement immediately to dramatically improve their AI products.

*What you’ll learn:*

1. A step-by-step error analysis framework that helps identify and categorize the most common AI failures in your product
2. How to create custom annotation systems that make reviewing AI conversations faster and more insightful
3. Why binary evaluations (pass/fail) are more useful than arbitrary quality scores for measuring AI performance
4. Techniques for validating your LLM judges to ensure they align with human quality expectations
5. A practical approach to prioritizing fixes based on frequency counting rather than intuition
6. Why looking at real user conversations (not just ideal test cases) is critical for understanding AI product failures
7. How to build a comprehensive quality system that spans from manual review to automated evaluation

*Brought to you by:*

• GoFundMe Giving Funds—One account. Zero hassle: https://gofundme.com/howiai
• Persona—Trusted identity verification for any use case: https://withpersona.com/lp/howiai

*Where to find Hamel Husain:*

• Website: https://hamel.dev/
• Twitter: https://twitter.com/HamelHusain
• Course: https://maven.com/parlance-labs/evals
• GitHub: https://github.com/hamelsmu

*Where to find Claire Vo:*

• ChatPRD: https://www.chatprd.ai/
• Website: https://clairevo.com/
• LinkedIn: https://www.linkedin.com/in/clairevo/
• X: https://x.com/clairevo

*In this episode, we cover:*

(00:00) Introduction to Hamel Husain
(03:05) The fundamentals: why data analysis is critical for AI products
(06:58) Understanding traces and examining real user interactions
(13:35) Error analysis: a systematic approach to finding AI failures
(17:40) Creating custom annotation systems for faster review
(22:23) The impact of this process
(25:15) Different types of evaluations
(29:30) LLM-as-a-Judge
(33:58) Improving prompts and system instructions
(38:15) Analyzing agent workflows
(40:38) Hamel’s personal AI tools and workflows
(48:02) Lightning round and final thoughts

*Tools referenced:*

• Claude: https://claude.ai/
• Braintrust: https://www.braintrust.dev/docs/start
• Phoenix: https://phoenix.arize.com/
• AI Studio: https://aistudio.google.com/
• ChatGPT: https://chat.openai.com/
• Gemini: https://gemini.google.com/

*Other references:*

• Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences: https://dl.acm.org/doi/10.1145/3654777.3676450
• Nurture Boss: https://nurtureboss.io
• Rechat: https://rechat.com/
• Your AI Product Needs Evals: https://hamel.dev/blog/posts/evals/
• A Field Guide to Rapidly Improving AI Products: https://hamel.dev/blog/posts/field-guide/
• Creating a LLM-as-a-Judge That Drives Business Results: https://hamel.dev/blog/posts/llm-judge/
• Lenny’s List on Maven: https://maven.com/lenny

_Production and marketing by https://penname.co/._
_For inquiries about sponsoring the podcast, email jordan@penname.co._

Claire Vo (host) · Hamel Husain (guest)
Oct 13, 2025 · 54m · Watch on YouTube ↗

At a glance

WHAT IT’S REALLY ABOUT

Systematically improve AI products with traces, error analysis, and evals

  1. Hamel argues the biggest unlock for higher-quality AI products is the same as classic product work: look at real data—especially real user interactions—rather than relying on “vibe checks.”
  2. He demonstrates how “traces” (logged multi-step AI interactions including prompts, tool calls, and outputs) make failures debuggable and reveal surprising user behavior (vague, typo-heavy, ambiguous inputs).
  3. The core method is error analysis: sample ~100 traces, write brief human notes on the first upstream failure, then categorize and count issues to create a prioritized defect backlog.
  4. Only after this grounding should teams write evals: use code-based checks for objective issues and LLM-as-a-judge for subjective ones, designed as specific binary pass/fail tests and validated against human labels to avoid misleading dashboards (see the sketch after this list).
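As a concrete picture of that validation step, here is a minimal sketch (the judge, data, and field names are hypothetical, not from the episode): measure how often a judge's binary verdict matches a human's pass/fail label before putting the judge on a dashboard.

```python
from typing import Callable

def agreement(labeled: list[dict], judge: Callable[[dict], bool]) -> float:
    """Share of traces where the judge's binary verdict matches the
    human pass/fail label recorded during manual review."""
    hits = sum(judge(t) == t["human_pass"] for t in labeled)
    return hits / len(labeled)

# Stand-in judge for illustration only; a real judge would call an LLM
# with a rubric prompt and parse its pass/fail answer.
fake_judge = lambda t: "confirmed" in t["output"].lower()

sample = [
    {"output": "Your tour is confirmed for 2 PM.", "human_pass": True},
    {"output": "I can't help with that.", "human_pass": True},
]
print(f"judge/human agreement: {agreement(sample, fake_judge):.0%}")  # 50%
```

If agreement is low, the judge prompt (or the rubric itself) gets iterated until it tracks human judgment, then re-checked on fresh samples.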

IDEAS WORTH REMEMBERING

5 ideas

Real user traces are the foundation of AI quality work.

Before you optimize prompts or add evals, you need logs of what users actually do (including messy, ambiguous language) and what the system actually executed (system prompt, tool calls, retrieved context).
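As a rough sketch of what one such log record might hold (field names here are illustrative, not any particular observability tool's schema):

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str        # e.g. "check_availability"
    arguments: dict  # what the model asked the tool to do
    result: str      # what the tool returned

@dataclass
class Trace:
    """One logged interaction: enough to replay exactly what happened."""
    system_prompt: str
    user_messages: list[str]  # real inputs, typos and ambiguity included
    retrieved_context: list[str] = field(default_factory=list)
    tool_calls: list[ToolCall] = field(default_factory=list)
    final_output: str = ""
```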

Error analysis makes an intractable problem tractable.

Randomly sample a manageable set (e.g., ~100 traces), write one-sentence notes, and stop at the most upstream failure to avoid getting lost in downstream artifacts of earlier mistakes.
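A minimal sketch of that loop, assuming traces are plain dicts with a `final_output` field and review happens in a console:

```python
import random

def open_code(traces: list[dict], k: int = 100, seed: int = 0) -> list[dict]:
    """Randomly sample k traces and record a one-sentence note on the
    first upstream failure in each; an empty note means no issue found."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    notes = []
    for trace in rng.sample(traces, min(k, len(traces))):
        print(trace["final_output"])  # in practice, render the full trace
        note = input("first upstream failure (enter if none): ").strip()
        if note:
            notes.append({"trace": trace, "note": note})
    return notes
```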

Categorize and count failures to build a prioritized roadmap.

After the open-coding pass, bucket the noted issues (manually or with an LLM) and count their frequency; this turns “the model is weird” into a ranked list like “handoff failures” and “tour scheduling errors.”
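The counting step needs nothing fancy; a sketch with hypothetical note buckets:

```python
from collections import Counter

# Hypothetical open-coding notes already mapped to buckets (an LLM can
# propose the buckets, but a human should sanity-check the mapping).
categorized = [
    "handoff failure",
    "tour scheduling error",
    "tour scheduling error",
    "repeated question",
    "handoff failure",
    "tour scheduling error",
]

for category, n in Counter(categorized).most_common():
    print(f"{n:>3}  {category}")
# Output (a ranked defect backlog):
#   3  tour scheduling error
#   2  handoff failure
#   1  repeated question
```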

Custom annotation UIs can be worth it to remove friction.

Off-the-shelf observability tools work, but lightweight internal tools tailored to your channels and filters can speed human review, standardize labels, and improve throughput.
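Even a console loop captures the spirit: show a trace, take a one-keystroke label from a fixed menu, append to a file. The `id` and `final_output` fields below are assumptions for illustration, not any specific product's schema:

```python
import json

LABELS = ["pass", "handoff failure", "tour scheduling error", "other"]

def annotate(traces: list[dict], out_path: str = "labels.jsonl") -> None:
    """Tiny console annotator: standardized labels, minimal friction,
    append-only JSONL output ready for later counting."""
    with open(out_path, "a") as f:
        for trace in traces:
            print("\n" + trace["final_output"])
            for i, label in enumerate(LABELS):
                print(f"  [{i}] {label}")
            choice = int(input("label> "))
            record = {"id": trace["id"], "label": LABELS[choice]}
            f.write(json.dumps(record) + "\n")
```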

Write evals only after you know what to measure.

There are infinite possible evals; error analysis tells you which ones matter. In the Nurture Boss example, they wrote evals specifically for tour scheduling and transfer/handoff behaviors.
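A hedged sketch of what one such binary eval might look like for the tour-scheduling failure mode (field and tool names are invented for illustration, not Nurture Boss's real schema):

```python
def tour_slot_was_real(trace: dict) -> bool:
    """Binary eval: if the assistant confirmed a tour slot, that slot
    must appear in what the availability tool actually returned."""
    confirmed = trace.get("confirmed_slot")
    if confirmed is None:
        return True  # no tour confirmed in this trace; nothing to check
    offered = {
        slot
        for call in trace["tool_calls"]
        if call["name"] == "check_availability"
        for slot in call["result"]
    }
    return confirmed in offered
```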

WORDS WORTH SAVING

5 quotes

The most important thing is looking at data.

Hamel Husain

Just spend three hours of your afternoon, go through, read some of these chats, look at some of them with your human eyes... and get to work.

Claire Vo

Error analysis has two steps. The first step is writing notes... basically like journaling what is wrong.

Hamel Husain

The last thing you wanna do is throw up a judge on the dashboard... and people don't know if they can trust it.

Hamel Husain

If you do all this eval stuff, fine-tuning is basically free... those difficult examples... are exactly the stuff you want to fine-tune on.

Hamel Husain

TOPICS COVERED

• Traces and observability for LLM apps
• Real-user vs. synthetic data distributions
• Error analysis (open coding) workflow
• Custom annotation tooling to reduce review friction
• Issue bucketing, counting, and prioritization
• Evaluation types: code-based, reference-based, LLM judge
• Validating LLM-as-a-judge with human-labeled agreement
• Prompt/system instruction iteration and common pitfalls
• Agent workflow analytics (handoffs, transition matrices)
• Personal AI workflows: Claude Projects, GitHub prompt repo, Gemini for video-to-notes

High-quality AI-generated summary created from a speaker-labeled transcript.
