How I AI
Evals, error analysis, and better prompts: A systematic approach to improving your AI products
Claire Vo and Hamel Husain on systematically improving AI products with traces, error analysis, and evals.
This episode of How I AI, featuring Claire Vo and Hamel Husain, explores how to systematically improve AI products with traces, error analysis, and evals. Hamel argues the biggest unlock for higher-quality AI products is the same as in classic product work: look at real data, especially real user interactions, rather than relying on “vibe checks.”
At a glance
WHAT IT’S REALLY ABOUT
Systematically improve AI products with traces, error analysis, and evals
- Hamel argues the biggest unlock for higher-quality AI products is the same as classic product work: look at real data—especially real user interactions—rather than relying on “vibe checks.”
- He demonstrates how “traces” (logged multi-step AI interactions including prompts, tool calls, and outputs) make failures debuggable and reveal surprising user behavior (vague, typo-heavy, ambiguous inputs).
- The core method is error analysis: sample ~100 traces, write brief human notes on the first upstream failure, then categorize and count issues to create a prioritized defect backlog.
- Only after this grounding should teams write evals: use code-based checks for objective issues, and LLM-as-a-judge for subjective ones—designed as specific binary pass/fail tests and validated against human labels to avoid misleading dashboards.
IDEAS WORTH REMEMBERING
5 ideas
Real user traces are the foundation of AI quality work.
Before you optimize prompts or add evals, you need logs of what users actually do (including messy, ambiguous language) and what the system actually executed (system prompt, tool calls, retrieved context).
Error analysis makes an intractable problem tractable.
Randomly sample a manageable set (e.g., ~100 traces), write one-sentence notes, and stop at the most upstream failure to avoid getting lost in downstream artifacts of earlier mistakes.
Categorize and count failures to build a prioritized roadmap.
After open-coded notes, bucket issues (manually or with an LLM) and count frequency; this turns “the model is weird” into a ranked list like “handoff failures” and “tour scheduling errors.”
Custom annotation UIs can be worth it to remove friction.
Off-the-shelf observability tools work, but lightweight internal tools tailored to your channels and filters can speed human review, standardize labels, and improve throughput.
Write evals only after you know what to measure.
There are infinite possible evals; error analysis tells you which ones matter. In the Nurture Boss example, they wrote evals specifically for tour scheduling and transfer/handoff behaviors.
WORDS WORTH SAVING
5 quotes
The most important thing is looking at data.
— Hamel Husain
Just spend three hours of your afternoon, go through, read some of these chats, look at some of them with your human eyes... and get to work.
— Claire Vo
Error analysis has two steps. The first step is writing notes... basically like journaling what is wrong.
— Hamel Husain
The last thing you wanna do is throw up a judge on the dashboard... and people don't know if they can trust it.
— Hamel Husain
If you do all this eval stuff, fine-tuning is basically free... those difficult examples... are exactly the stuff you want to fine-tune on.
— Hamel Husain
QUESTIONS ANSWERED IN THIS EPISODE
5 questions
In the Nurture Boss trace (“What’s up to four-month rent”), what would an ideal clarification policy look like (when to ask a follow-up vs. attempt an answer)?
How do you decide the right sample size for manual error analysis (100 traces vs. 30 vs. 500), and how often should teams repeat the exercise as the product evolves?
What’s your recommended schema for annotations (labels, severity, upstream/downstream flags) so the resulting buckets translate directly into evals and engineering work?
For transfer/handoff issues, what are examples of high-quality binary LLM-judge criteria that capture “good handoff” without being overly brittle?
When you validate an LLM-as-a-judge against human labels, what agreement threshold is “good enough,” and how do you handle systematic judge bias?
Chapter Breakdown
Hamel Husain’s core premise: quality comes from looking at data
Hamel frames AI product quality as a data analysis problem, not a “prompt magic” problem. The twist with LLMs is that the data is messy, stochastic, and often multi-step, but the fundamentals of product analytics still apply.
Case study setup: Nurture Boss virtual leasing assistant and why it’s hard to scale
Hamel introduces a real client, Nurture Boss, an AI assistant handling inbound leasing communications across channels (SMS, email, web chat). The team’s pain: prompt tweaks felt risky because they couldn’t tell if changes improved overall behavior or broke something else.
Traces and observability: capturing what actually happened in an AI interaction
Hamel explains “traces” as the log of end-to-end AI behavior: system prompt, user message(s), tool calls, tool responses, and final assistant output. He shows how observability tools (e.g., Braintrust, Arize Phoenix) make these sequences inspectable.
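A trace as described above can be pictured as an ordered list of typed steps. The following is a minimal sketch, not the schema of Braintrust, Arize Phoenix, or Nurture Boss: the `Step` and `Trace` classes, the `search_units` tool name, and the sample messages are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Literal

# Hypothetical step kinds, mirroring the parts of a trace named in the summary.
StepKind = Literal[
    "system_prompt", "user_message", "tool_call", "tool_response", "assistant_output"
]

@dataclass
class Step:
    kind: StepKind
    content: str

@dataclass
class Trace:
    trace_id: str
    channel: str  # e.g. "sms", "email", "web_chat"
    steps: list[Step] = field(default_factory=list)

    def render(self) -> str:
        """Return the steps in order as plain text for human review."""
        return "\n".join(
            f"[{i}] {s.kind}: {s.content}" for i, s in enumerate(self.steps)
        )

# Invented example data, not a real conversation.
example = Trace(
    trace_id="t-001",
    channel="sms",
    steps=[
        Step("system_prompt", "You are a leasing assistant."),
        Step("user_message", "do u have 2 bed avail"),
        Step("tool_call", "search_units(bedrooms=2)"),
        Step("tool_response", "[]"),
        Step("assistant_output", "We have no two-bedroom units right now."),
    ],
)
print(example.render())
```

Keeping every step in one ordered record is what lets a reviewer see which step went wrong first, rather than only the final reply.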
Reality check from user logs: vague, typo-filled, ambiguous prompts
Reviewing real conversations reveals that users interact very differently from what builders expect. Hamel and Claire highlight an example where the user’s message is unclear, and the assistant responds with something plausibly helpful but likely wrong for the user’s intent.
Error analysis, step 1 (open coding): write quick notes on the first upstream failure
Hamel introduces error analysis as a simple, high-leverage process borrowed from classical ML practice. The first step is “open coding”: sample ~100 traces and write short notes describing what went wrong, stopping at the earliest (most causal) error in the chain.
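The open-coding step can be sketched in a few lines: draw a random sample of about 100 trace IDs, then record one short note per trace along with the earliest step that failed. This is an illustrative sketch only; the `Note` fields, the `sample_traces` helper, and the ID format are assumptions, not something shown in the episode.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Note:
    trace_id: str
    first_failure_step: Optional[int]  # index of the earliest failing step, None if no failure
    text: str                          # one-sentence observation

def sample_traces(trace_ids, n=100, seed=0):
    """Draw up to n trace IDs uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the same batch can be re-drawn
    return rng.sample(trace_ids, min(n, len(trace_ids)))

# Stand-in for IDs pulled from your own logs.
all_ids = [f"t-{i:04d}" for i in range(2500)]
batch = sample_traces(all_ids)

# Notes are written by a human reviewer; these two are invented examples.
notes = [
    Note(batch[0], 1, "User message was ambiguous; assistant answered without a follow-up."),
    Note(batch[1], None, "Looks fine."),
]
```

Recording only the first failing step keeps the reviewer from cataloguing downstream symptoms of the same upstream mistake.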
Error analysis, step 2: categorize notes and count—turn observations into priorities
After collecting notes, you bucket them into issue categories (optionally with LLM help) and then quantify frequency. Counting transforms qualitative review into a prioritized roadmap of fixes, replacing paralysis with clarity.
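Once each note has a category, the counting step is a frequency table. A minimal sketch, assuming the notes have already been bucketed by hand or with LLM help; the trace IDs and counts below are invented, and only the category names "handoff failure" and "tour scheduling error" echo the episode.

```python
from collections import Counter

# (trace_id, category) pairs produced by bucketing the open-coded notes.
# Invented data for illustration.
coded = [
    ("t-0007", "handoff failure"),
    ("t-0031", "tour scheduling error"),
    ("t-0042", "handoff failure"),
    ("t-0108", "handoff failure"),
    ("t-0250", "tour scheduling error"),
    ("t-0311", "other"),
]

def prioritize(pairs):
    """Return (category, count) tuples sorted from most to least frequent."""
    return Counter(category for _, category in pairs).most_common()

ranked = prioritize(coded)
for category, n in ranked:
    print(f"{category:<24}{n:>3}  {n / len(coded):.0%}")
```

The ranked list is the prioritized backlog: the top rows are the failure modes most worth fixing or writing evals for first.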
Building custom annotation UIs: reducing friction for faster review and labeling
Hamel explains that off-the-shelf observability UIs are helpful but sometimes too generic or slow for high-throughput review. For Nurture Boss, they quickly “vibe coded” a tailored annotation tool with filters, channel views, and lightweight labeling to speed analysis.
Impact of the process: clients get immediate quality gains and clearer next steps
Hamel notes many clients find error analysis alone transformative—sometimes enough to ship meaningful improvements. It also prevents premature obsession with eval tooling by clarifying which evals matter and what “good” looks like for real failures.
Choosing evaluation types: code-based checks vs reference-based vs subjective judging
With prioritized failures in hand, the next step is writing evaluations that match the problem type. Hamel distinguishes deterministic/code-based evals (unit-test-like), reference-based evals (known right answers), and LLM-judge evals for subjective or nuanced criteria.
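The first two eval types are ordinary functions that return pass or fail. The sketch below shows one of each under stated assumptions: `tour_date_is_valid` is a hypothetical code-based check for a scheduling failure, and `matches_reference` is a hypothetical reference-based check using a normalized exact match. Neither is taken from the Nurture Boss codebase.

```python
from datetime import date

def tour_date_is_valid(proposed: date, conversation_date: date) -> bool:
    """Code-based check: pass if the proposed tour is not in the past
    relative to the day the conversation took place."""
    return proposed >= conversation_date

def matches_reference(answer: str, reference: str) -> bool:
    """Reference-based check: pass if the answer equals the known right
    answer after trimming whitespace and ignoring case."""
    return answer.strip().lower() == reference.strip().lower()

# Invented examples.
print(tour_date_is_valid(date(2025, 6, 2), date(2025, 6, 1)))
print(matches_reference(" $1,450 ", "$1,450"))
```

Checks like these run cheaply on every trace, which leaves the slower LLM judge for criteria that cannot be expressed as a deterministic rule.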
LLM-as-a-Judge done right: binary, task-specific, and validated against human labels
Hamel critiques vague dashboards (helpfulness/truthfulness scores) as hard to interpret and easy to misuse. Instead, he recommends judge prompts that output binary pass/fail for specific failure modes, and validating judge agreement with hand-labeled examples to build trust.
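Validating a binary judge against hand labels amounts to comparing two lists of pass/fail values. The following sketch assumes `True` means "pass" in both lists; the function name, the metric names, and the labels are illustrative. It reports the pass and fail classes separately because overall agreement alone can look high when one class is rare.

```python
def judge_agreement(human, judge):
    """Compare judge verdicts with human labels (True = pass, False = fail)."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    both_pass = sum(h and j for h, j in zip(human, judge))
    both_fail = sum((not h) and (not j) for h, j in zip(human, judge))
    n_pass = sum(human)
    n_fail = len(human) - n_pass
    return {
        "agreement": (both_pass + both_fail) / len(human),
        # Share of human-labeled passes that the judge also passed.
        "pass_recall": both_pass / n_pass if n_pass else float("nan"),
        # Share of human-labeled failures that the judge also failed.
        "fail_recall": both_fail / n_fail if n_fail else float("nan"),
    }

# Invented labels for eight traces.
human = [True, True, True, False, False, False, True, False]
judge = [True, True, False, False, False, True, True, False]
print(judge_agreement(human, judge))
```

If `fail_recall` is low, the judge is missing real failures, and its dashboard number should not be trusted until the judge prompt is revised and re-checked.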
Improving prompts and system instructions: fix obvious gaps, then iterate systematically
Once evals reveal failure clusters, teams decide what to change—prompting, retrieval, examples, tool specs, or eventually fine-tuning. Hamel emphasizes there’s no universal prompt trick; progress comes from targeted experimentation guided by measured errors (e.g., missing today’s date causing scheduling mistakes).
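The missing-date example suggests the kind of small, targeted fix that error analysis surfaces. As a rough sketch, the current date can be added to the system prompt when each conversation starts. `BASE_PROMPT` and `build_system_prompt` are placeholder names, and the prompt wording is invented rather than taken from the Nurture Boss system.

```python
from datetime import date

# Placeholder text; a real system prompt would be much longer.
BASE_PROMPT = "You are a virtual leasing assistant."

def build_system_prompt(today: date) -> str:
    """Append today's date so the model can resolve phrases like 'tomorrow'."""
    return f"{BASE_PROMPT}\nToday's date is {today.isoformat()}."

print(build_system_prompt(date(2025, 6, 2)))
```

Passing `today` in as an argument, instead of calling `date.today()` inside the function, makes the prompt reproducible when replaying old traces through an eval.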
Evaluating and analyzing agents: tool-to-tool handoffs, transition matrices, and workflow insight
For agentic systems, Hamel highlights advanced analysis techniques like mapping transitions between steps and identifying where failures cluster. Claire adds that the same telemetry helps product discovery—seeing where users seek value and where workflows bottleneck.
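One way to picture a transition analysis is to count, for each failing trace, which step preceded the first failure. This is a simplified sketch of the idea, not the method shown in the episode; the step names (`search_units`, `schedule_tour`, `transfer_to_human`) and the data are hypothetical.

```python
from collections import Counter

def failure_transitions(failing_traces):
    """Count (last successful step, first failing step) pairs.

    Each item is (steps, fail_idx): the ordered step names of a trace and
    the index of its first failing step, as recorded during error analysis.
    """
    counts = Counter()
    for steps, fail_idx in failing_traces:
        previous = steps[fail_idx - 1] if fail_idx > 0 else "START"
        counts[(previous, steps[fail_idx])] += 1
    return counts

# Invented failing traces.
failing = [
    (["user_message", "search_units", "schedule_tour", "assistant_output"], 2),
    (["user_message", "schedule_tour", "assistant_output"], 1),
    (["user_message", "search_units", "schedule_tour", "assistant_output"], 2),
    (["user_message", "transfer_to_human", "assistant_output"], 1),
]

for (src, dst), n in failure_transitions(failing).most_common():
    print(f"{src} -> {dst}: {n}")
```

Laid out as a grid of source steps by destination steps, these counts form the transition matrix, and the largest cells show where failures cluster.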
Hamel’s personal AI workflows: Claude Projects, Gemini for video, and a monorepo for prompts
Hamel shares how he runs his business with AI: Claude Projects for repeated tasks (proposals, copywriting, course FAQs, legal), Gemini for turning videos into consumable notes, and a GitHub monorepo that centralizes prompts, rules, content, and tooling to avoid vendor lock-in.
Who should do annotation, and a practical prompting tip for writing: outline → draft → edit inline
In the closing lightning round, Hamel argues subject matter expertise is central—often PMs, sometimes ops/function experts, and data scientists as analysis scales. For writing quality, he recommends stepwise workflows and tools that support inline editing to create better examples and preserve human voice.