How I AI

Evals, error analysis, and better prompts: A systematic approach to improving your AI products

Hamel Husain, an AI consultant and educator, shares his systematic approach to improving AI product quality through error analysis, evaluation frameworks, and prompt engineering. In this episode, he demonstrates how product teams can move beyond “vibe checking” their AI systems to implement data-driven quality improvement processes that identify and fix the most common errors. Using real examples from client work with Nurture Boss (an AI assistant for property managers), Hamel walks through practical techniques that product managers can implement immediately to dramatically improve their AI products.

*What you’ll learn:*
1. A step-by-step error analysis framework that helps identify and categorize the most common AI failures in your product
2. How to create custom annotation systems that make reviewing AI conversations faster and more insightful
3. Why binary evaluations (pass/fail) are more useful than arbitrary quality scores for measuring AI performance
4. Techniques for validating your LLM judges to ensure they align with human quality expectations
5. A practical approach to prioritizing fixes based on frequency counting rather than intuition
6. Why looking at real user conversations (not just ideal test cases) is critical for understanding AI product failures
7. How to build a comprehensive quality system that spans from manual review to automated evaluation

*Brought to you by:*
• GoFundMe Giving Funds - One account. Zero hassle: https://gofundme.com/howiai
• Persona - Trusted identity verification for any use case: https://withpersona.com/lp/howiai

*Where to find Hamel Husain:*
• Website: https://hamel.dev/
• Twitter: https://twitter.com/HamelHusain
• Course: https://maven.com/parlance-labs/evals
• GitHub: https://github.com/hamelsmu

*Where to find Claire Vo:*
• ChatPRD: https://www.chatprd.ai/
• Website: https://clairevo.com/
• LinkedIn: https://www.linkedin.com/in/clairevo/
• X: https://x.com/clairevo

*In this episode, we cover:*
(00:00) Introduction to Hamel Husain
(03:05) The fundamentals: why data analysis is critical for AI products
(06:58) Understanding traces and examining real user interactions
(13:35) Error analysis: a systematic approach to finding AI failures
(17:40) Creating custom annotation systems for faster review
(22:23) The impact of this process
(25:15) Different types of evaluations
(29:30) LLM-as-a-Judge
(33:58) Improving prompts and system instructions
(38:15) Analyzing agent workflows
(40:38) Hamel’s personal AI tools and workflows
(48:02) Lightning round and final thoughts

*Tools referenced:*
• Claude: https://claude.ai/
• Braintrust: https://www.braintrust.dev/docs/start
• Phoenix: https://phoenix.arize.com/
• AI Studio: https://aistudio.google.com/
• ChatGPT: https://chat.openai.com/
• Gemini: https://gemini.google.com/

*Other references:*
• Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences: https://dl.acm.org/doi/10.1145/3654777.3676450
• Nurture Boss: https://nurtureboss.io
• Rechat: https://rechat.com/
• Your AI Product Needs Evals: https://hamel.dev/blog/posts/evals/
• A Field Guide to Rapidly Improving AI Products: https://hamel.dev/blog/posts/field-guide/
• Creating a LLM-as-a-Judge That Drives Business Results: https://hamel.dev/blog/posts/llm-judge/
• Lenny’s List on Maven: https://maven.com/lenny

_Production and marketing by https://penname.co/._
_For inquiries about sponsoring the podcast, email jordan@penname.co._

Claire Vo (host), Hamel Husain (guest)
Oct 13, 2025 · 54m

CHAPTERS

  1. Hamel Husain’s core premise: quality comes from looking at data

    Hamel frames AI product quality as a data analysis problem, not a “prompt magic” problem. The twist with LLMs is that the data is messy, stochastic, and often multi-step, but the fundamentals of product analytics still apply.

    • High-quality AI products start with examining real data, not vibes
    • Traditional PM skills (SQL, metrics, spreadsheets) still matter
    • AI adds complexity: non-determinism and more unstructured inputs/outputs
    • Systematic improvement requires instrumentation and review loops
  2. Case study setup: Nurture Boss virtual leasing assistant and why it’s hard to scale

    Hamel introduces a real client, Nurture Boss, an AI assistant handling inbound leasing communications across channels (SMS, email, web chat). The team’s pain: prompt tweaks felt risky because they couldn’t tell if changes improved overall behavior or broke something else.

    • Product context: AI assistant for property managers/leasing workflows
    • Multiple channels and tasks (questions, appointments, tours, handoffs)
    • Early prototype worked in demos but failed in real-world edge cases
    • Main challenge: no reliable way to measure improvement beyond gut feel
  3. Traces and observability: capturing what actually happened in an AI interaction

    Hamel explains “traces” as the log of end-to-end AI behavior: system prompt, user message(s), tool calls, tool responses, and final assistant output. He shows how observability tools (e.g., Braintrust, Arize Phoenix) make these sequences inspectable.

    • A trace includes prompts, multi-turn context, retrieval/tool calls, and outputs
    • Essential for debugging chatbots/agents where behavior is a chain of events
    • Tools differ, but the goal is the same: reviewable, queryable interaction logs
    • Traces turn fuzzy failures into concrete artifacts you can analyze
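
To make the idea of a trace concrete, here is a minimal sketch of how one might be represented in code. The `Event` and `Trace` classes, the role names, and the `search_units` tool are illustrative assumptions, not the schema of Braintrust, Phoenix, or Nurture Boss.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Event:
    """One step in a trace (hypothetical schema)."""
    role: str  # "system", "user", "tool_call", "tool_response", or "assistant"
    content: str
    tool_name: Optional[str] = None  # set only for tool calls and responses


@dataclass
class Trace:
    """One end-to-end interaction, stored as an ordered list of events."""
    trace_id: str
    channel: str  # e.g. "sms", "email", "web_chat"
    events: list[Event] = field(default_factory=list)

    def final_output(self) -> Optional[str]:
        """Return the last assistant message, or None if there is none."""
        for event in reversed(self.events):
            if event.role == "assistant":
                return event.content
        return None

    def tool_calls(self) -> list[str]:
        """Return the names of the tools invoked, in order."""
        return [
            e.tool_name
            for e in self.events
            if e.role == "tool_call" and e.tool_name
        ]


# Made-up example data, for illustration only.
example_trace = Trace(
    trace_id="t-001",
    channel="sms",
    events=[
        Event("system", "You are a leasing assistant."),
        Event("user", "do u have any 1 bed avail"),
        Event("tool_call", '{"bedrooms": 1}', tool_name="search_units"),
        Event("tool_response", '[{"unit": "A12"}]', tool_name="search_units"),
        Event("assistant", "Yes, unit A12 is available."),
    ],
)
```

Keeping every step in one ordered list is what lets a reviewer read the interaction top to bottom and see where it first went wrong.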
  4. Reality check from user logs: vague, typo-filled, ambiguous prompts

    Reviewing real conversations shows that users interact very differently from what builders expect. Hamel and Claire highlight an example where the user’s message is unclear, and the assistant responds with something plausibly helpful but likely wrong for the user’s intent.

    • Real users write ambiguous text (typos, slang, incomplete questions)
    • Builder testing is biased: we ask “good” questions with clear intent
    • Seeing real distributions changes what you should test and simulate
    • Synthetic data may miss messy real-world patterns unless guided by logs
  5. Error analysis, step 1 (open coding): write quick notes on the first upstream failure

    Hamel introduces error analysis as a simple, high-leverage process borrowed from classical ML practice. The first step is “open coding”: sample ~100 traces and write short notes describing what went wrong, stopping at the earliest (most causal) error in the chain.

    • Process is accessible: manually review a sample and jot one-sentence notes
    • Stop at the most upstream issue to avoid downstream noise
    • Upstream errors propagate (bad intent → bad tool call → bad response)
    • This makes the problem tractable and yields fast insight
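
The open-coding step can be sketched in a few lines, assuming traces are identified by string IDs and a reviewer jots an optional note per step. The function names, step names, and notes below are hypothetical and only illustrate the "sample, then stop at the first upstream error" idea.

```python
import random
from typing import Optional, Sequence


def sample_traces(trace_ids: Sequence[str], n: int = 100, seed: int = 0) -> list[str]:
    """Draw a reproducible random sample of up to n trace IDs for manual review."""
    rng = random.Random(seed)
    return rng.sample(list(trace_ids), min(n, len(trace_ids)))


def first_upstream_failure(
    step_notes: Sequence[tuple[str, Optional[str]]],
) -> Optional[tuple[str, str]]:
    """Return the earliest step that has a reviewer note, ignoring later ones.

    step_notes is a chronological list of (step_name, note_or_None).
    """
    for step, note in step_notes:
        if note:
            return step, note
    return None


# Made-up review of a single trace: two steps have notes, but only the
# earliest one is recorded because the later problem follows from it.
reviewed = [
    ("intent_detection", None),
    ("tool_call", "searched 2-bed units; user asked for 1-bed"),
    ("final_response", "offered a unit the user did not want"),
]
```

Fixing the seed keeps the sample stable, so two reviewers can look at the same traces.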
  6. Error analysis, step 2: categorize notes and count—turn observations into priorities

    After collecting notes, you bucket them into issue categories (optionally with LLM help) and then quantify frequency. Counting transforms qualitative review into a prioritized roadmap of fixes, replacing paralysis with clarity.

    • Use LLMs to help cluster notes, but iterate and correct categories
    • Quantify by counting category frequency—classic PM leverage
    • Outcome: a ranked list of the biggest failure modes to tackle first
    • Creates confidence in what evals and fixes are worth doing next
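
Counting is simple enough to do in a spreadsheet, but a short sketch shows the shape of the output. It assumes each note has already been assigned a category; the notes and category names here are invented for illustration.

```python
from collections import Counter


def rank_failure_modes(labeled_notes):
    """Rank categories by frequency.

    labeled_notes is a list of (note, category) pairs.
    Returns a list of (category, count, share_of_total), most frequent first.
    """
    counts = Counter(category for _, category in labeled_notes)
    total = sum(counts.values())
    return [(cat, n, n / total) for cat, n in counts.most_common()]


# Hypothetical categorized notes from a review session.
notes = [
    ("did not hand off to a human", "handoff"),
    ("wrong tour date", "scheduling"),
    ("tour booked for a past date", "scheduling"),
    ("misread 1-bed request", "intent"),
    ("scheduled on a closed day", "scheduling"),
    ("no handoff on complaint", "handoff"),
]

for category, count, share in rank_failure_modes(notes):
    print(f"{category:<12} {count:>3}  {share:.0%}")
```

The ranked list is the prioritized roadmap: the top category is the first candidate for a fix and for a dedicated eval.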
  7. Building custom annotation UIs: reducing friction for faster review and labeling

    Hamel explains that off-the-shelf observability UIs are helpful but sometimes too generic or slow for high-throughput review. For Nurture Boss, they quickly “vibe coded” a tailored annotation tool with filters, channel views, and lightweight labeling to speed analysis.

    • Goal: remove friction so humans can annotate quickly and consistently
    • Custom UI can be simple: filters by channel, annotated/unannotated, stats
    • Annotation needs to be human-readable and workflow-aligned
    • Fast labeling becomes the foundation for better evals and iteration
  8. Impact of the process: clients get immediate quality gains and clearer next steps

    Hamel notes many clients find error analysis alone transformative—sometimes enough to ship meaningful improvements. It also prevents premature obsession with eval tooling by clarifying which evals matter and what “good” looks like for real failures.

    • Even a few hours of review can surface major systemic issues
    • Teams stop guessing and start targeting the biggest sources of failure
    • Error analysis informs what evals to write (instead of infinite possibilities)
    • Reduces reliance on user thumbs-up/down feedback as the main signal
  9. Choosing evaluation types: code-based checks vs reference-based vs subjective judging

    With prioritized failures in hand, the next step is writing evaluations that match the problem type. Hamel distinguishes deterministic/code-based evals (unit-test-like), reference-based evals (known right answers), and LLM-judge evals for subjective or nuanced criteria.

    • Code-based evals catch concrete issues (formatting, leakage, policy rules)
    • Reference-based evals work when “correct answer” is known/derivable
    • LLM-as-judge is for subjective tasks like handoff quality or tone constraints
    • You still need representative test cases from traces or synthetic generation
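
A code-based eval is just a deterministic function that returns pass or fail. The sketch below shows two hypothetical checks, one for formatting (an unfilled `{{placeholder}}`) and one for leakage (a verbatim chunk of the system prompt in the reply). The check names, the placeholder syntax, and the 30-character window are assumptions chosen for illustration.

```python
import re

# Matches an unfilled template variable such as "{{first_name}}".
PLACEHOLDER = re.compile(r"\{\{\s*\w+\s*\}\}")


def has_unfilled_placeholder(response: str) -> bool:
    """True if the response still contains a template placeholder."""
    return PLACEHOLDER.search(response) is not None


def leaks_system_prompt(response: str, system_prompt: str, window: int = 30) -> bool:
    """True if any window-sized chunk of the system prompt appears verbatim
    (case-insensitively) in the response."""
    resp = response.lower()
    prompt = system_prompt.lower()
    if len(prompt) < window:
        return bool(prompt) and prompt in resp
    return any(
        prompt[i:i + window] in resp
        for i in range(len(prompt) - window + 1)
    )


def run_code_evals(response: str, system_prompt: str) -> dict[str, bool]:
    """Run all checks; each value is True when the check passes."""
    return {
        "no_unfilled_placeholder": not has_unfilled_placeholder(response),
        "no_prompt_leak": not leaks_system_prompt(response, system_prompt),
    }
```

Checks like these are cheap to run on every trace, which leaves the LLM judge for criteria that cannot be expressed in code.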
  10. LLM-as-a-Judge done right: binary, task-specific, and validated against human labels

    Hamel critiques vague dashboards (helpfulness/truthfulness scores) as hard to interpret and easy to misuse. Instead, he recommends judge prompts that output binary pass/fail for specific failure modes, and validating judge agreement with hand-labeled examples to build trust.

    • Avoid abstract scalar scores that don’t map to actionable fixes
    • Create problem-specific binary outcomes (e.g., ‘handoff correct: yes/no’)
    • Validate the judge by comparing to human labels to measure agreement
    • Without validation, dashboards can destroy stakeholder trust in evals
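
Validating a binary judge comes down to comparing its verdicts with human labels on the same examples. The following sketch assumes both are lists of booleans (True meaning pass). Reporting agreement separately on human-passed and human-failed examples is an addition here, included because overall agreement can look high when one class dominates.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict:
    """Compare judge verdicts with human labels on the same examples.

    Returns overall agreement plus agreement restricted to examples the
    human marked pass and to examples the human marked fail. A rate is
    None when there are no examples to compute it from.
    """
    if len(human) != len(judge):
        raise ValueError("human and judge must label the same examples")

    pairs = list(zip(human, judge))
    n_pass = sum(1 for h, _ in pairs if h)
    n_fail = len(pairs) - n_pass
    both_pass = sum(1 for h, j in pairs if h and j)
    both_fail = sum(1 for h, j in pairs if not h and not j)

    def rate(hits: int, total: int):
        return hits / total if total else None

    return {
        "agreement": rate(both_pass + both_fail, len(pairs)),
        "agreement_on_human_pass": rate(both_pass, n_pass),
        "agreement_on_human_fail": rate(both_fail, n_fail),
    }


# Made-up labels for six traces on a "handoff correct: yes/no" criterion.
human_labels = [True, True, False, False, True, False]
judge_labels = [True, False, False, False, True, True]
print(judge_agreement(human_labels, judge_labels))
```

If agreement is low, the usual next step is to revise the judge prompt and re-measure before putting its output on a dashboard.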
  11. Improving prompts and system instructions: fix obvious gaps, then iterate systematically

    Once evals reveal failure clusters, teams decide what to change—prompting, retrieval, examples, tool specs, or eventually fine-tuning. Hamel emphasizes there’s no universal prompt trick; progress comes from targeted experimentation guided by measured errors (e.g., missing today’s date causing scheduling mistakes).

    • Use eval results to decide whether it’s prompting, retrieval, tools, or data
    • Common quick wins: missing context (like current date) and unclear rules
    • Prompt edits are “bug surface area”—small wording can cause big behavior shifts
    • Fine-tuning becomes easier/cheaper once eval + data curation pipelines exist
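
The missing-date example is the kind of gap that takes one line to close. As a rough sketch, assuming the system prompt is assembled in code before each request (the function name and wording are hypothetical):

```python
from datetime import date
from typing import Optional

WEEKDAYS = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]


def with_current_date(base_prompt: str, today: Optional[date] = None) -> str:
    """Append today's date to a system prompt so the model can resolve
    relative phrases such as "tomorrow" or "next Friday".

    today can be passed explicitly to keep tests deterministic.
    """
    today = today or date.today()
    weekday = WEEKDAYS[today.weekday()]  # date.weekday(): Monday is 0
    return f"{base_prompt}\n\nToday is {weekday}, {today.isoformat()}."


print(with_current_date("You are a leasing assistant.", date(2025, 10, 13)))
```

Whether the change actually reduces scheduling errors is then a question for the evals, not for intuition.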
  12. Evaluating and analyzing agents: tool-to-tool handoffs, transition matrices, and workflow insight

    For agentic systems, Hamel highlights advanced analysis techniques like mapping transitions between steps and identifying where failures cluster. Claire adds that the same telemetry helps product discovery—seeing where users seek value and where workflows bottleneck.

    • Agent evaluation must consider sequences, not just single responses
    • Transition matrices help locate where chains break (handoff points)
    • Debugging overlaps with product analytics: what are users trying to do?
    • Notebook-style, AI-assisted analysis can speed iteration for PMs and teams
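
One way to approximate the transition-matrix idea is to count, for each failed trace, the pair (last step that succeeded, step that failed). This is a minimal sketch under that assumption. The step names and the input format are invented, and a real analysis might be built differently.

```python
from collections import Counter


def failure_transitions(traces):
    """Count (previous_step, failing_step) pairs across failed traces.

    traces is an iterable of (steps, failed_index), where steps is the
    ordered list of step names and failed_index is the position of the
    first failing step, or None if the trace succeeded.
    """
    counts = Counter()
    for steps, failed_index in traces:
        if failed_index is None:
            continue
        previous = steps[failed_index - 1] if failed_index > 0 else "START"
        counts[(previous, steps[failed_index])] += 1
    return counts


# Hypothetical agent runs, annotated with where each one first went wrong.
runs = [
    (["classify_intent", "search_units", "schedule_tour", "reply"], 2),
    (["classify_intent", "search_units", "reply"], None),
    (["classify_intent", "schedule_tour", "reply"], 1),
    (["classify_intent", "search_units", "schedule_tour", "reply"], 2),
    (["classify_intent", "reply"], 0),
]

for (src, dst), n in failure_transitions(runs).most_common():
    print(f"{src} -> {dst}: {n}")
```

Sorting the pairs by count points at the handoff where the chain breaks most often, which is where to look first.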
  13. Hamel’s personal AI workflows: Claude Projects, Gemini for video, and a monorepo for prompts

    Hamel shares how he runs his business with AI: Claude Projects for repeated tasks (proposals, copywriting, course FAQs, legal), Gemini for turning videos into consumable notes, and a GitHub monorepo that centralizes prompts, rules, content, and tooling to avoid vendor lock-in.

    • Claude Projects as reusable ‘work cells’ with instructions + examples
    • Proposal automation: feed call transcripts, then lightly edit outputs
    • Gemini excels at video/transcript-to-text artifacts (slide summaries, notes)
    • GitHub monorepo stores prompts/tools/content + agent rules for portability
  14. Who should do annotation, plus a practical prompting tip for writing: outline → draft → edit inline

    In the closing lightning round, Hamel argues subject matter expertise is central—often PMs, sometimes ops/function experts, and data scientists as analysis scales. For writing quality, he recommends stepwise workflows and tools that support inline editing to create better examples and preserve human voice.

    • Best annotators are SMEs with ‘taste’ for what good looks like
    • PMs often act as SMEs; ops teams can contribute domain expertise
    • Data scientists become useful as evaluation/analysis sophistication grows
    • Writing tip: outline first, draft small sections, edit inline to teach the model
