Aakash Gupta

How to Build AI Evals in 2026 (Step-by-Step, No Hype)

Hamel Husain and Shreya Shankar are back with the definitive guide to AI evals. Step-by-step walkthrough using real production data from Nurture Boss. Error analysis, LLM judges, and the mistakes 90% of teams make.

Full Writeup: https://www.news.aakashg.com/p/hamel-shreya-podcast-2
Transcript: https://www.aakashg.com/how-to-master-ai-evals-a-step-by-step-guide-with-hamel-husain-shreya-shankar/

----

Timestamps:
0:00 - Intro
2:09 - Why Every AI Product Needs Evals
3:11 - Real Example: Nurture Boss Case Study
5:26 - Starting with Observability
11:24 - Ad Start
13:05 - Ad End: Analyzing Traces
24:55 - Error Analysis Introduction
27:00 - Axial Coding Explained
30:53 - Ad Start
32:40 - Ad End: Counting Issues
42:26 - Building Your LLM Judge
48:02 - Measuring the Judge
56:38 - PM vs AI Engineer Roles
1:01:29 - Common Mistakes to Avoid
1:06:31 - Outro

----

🏆 Thanks to our sponsors:
1. The AI Evals Course for PMs & Engineers: You get $800 off with this link: https://maven.com/parlance-labs/evals?promoCode=ag-product-growth
2. Vanta: Automate compliance. Get $1,000 off with my link: https://www.vanta.com/lp/demo-1k?utm_campaign=1k_offer&utm_source=product-growth&utm_medium=podcast
3. Jira Product Discovery: Plan with purpose, ship with confidence - https://www.atlassian.com/software/jira/product-discovery
4. Land PM Job: A 12-week experience to master getting a PM job - https://www.landpmjob.com/
5. Pendo: The #1 Software Experience Management Platform - http://www.pendo.com/aakash

----

Key Takeaways:

1. AI evals are the #1 most important new skill for PMs in 2025 - Even Claude Code teams do evals upstream. For custom applications, systematic evaluation is non-negotiable. Dogfooding alone isn't enough at scale.
2. Error analysis is the secret weapon most teams skip - Looking at 100 traces teaches you more than any generic metric. Hamel: "If you try to use helpfulness scores, the LLM won't catch the real product issues."
3. Use observability tools, but don't depend on them completely - Braintrust, LangSmith, and Arize all work, but Shreya and Hamel teach students to vibe-code their own trace viewers. Sometimes CSV files are enough to start.
4. Never use agreement as your eval metric - It's a trap. A judge that always says "pass" can have 90% accuracy if failures are rare. Use TPR (true positive rate) and TNR (true negative rate) instead.
5. Open coding, then axial coding, reveals patterns - Write notes on 100 traces without root-cause analysis. Then categorize into 5-6 actionable themes. Use LLMs to help, but refine manually.
6. Product managers must do the error analysis themselves - Don't outsource it to developers; engineers lack domain context. Hamel: "It's almost a tragedy to separate the prompt from the product manager because it's English."
7. Real traces reveal what demos hide - ChatGPT said the assistant was correct but missed: wrong bathroom configuration, markdown in SMS, double-booked tours, ignored handoff requests.
8. Binary scores beat 1-5 scales for LLM judges - Alignment is easier to validate, business decisions are binary anyway, and LLMs struggle with nuanced numerical scoring.
9. Code-based evals for formatting, LLM judges for subjective calls - Markdown in text messages? Write a simple assertion. Human handoff quality? You need an LLM judge with a proper rubric.
10. Start with traces even before launch - Dogfood your own app. Recruit friends as beta testers. Generate synthetic inputs only as a last resort. Error analysis works best with real user behavior.
----

👨‍💻 Where to find Hamel Husain:
Website: https://hamel.dev
Twitter/X: https://x.com/HamelHusain
Course: https://evals.info

👨‍💻 Where to find Shreya Shankar:
Website: https://www.shreya-shankar.com
Twitter/X: https://x.com/sh_reya
Course: https://evals.info

👨‍💻 Where to find Aakash:
Twitter: https://www.x.com/aakashg0
LinkedIn: https://www.linkedin.com/in/aagupta/
Newsletter: https://www.news.aakashg.com

#aievals #aipm #productmanagement

----

🧠 About Product Growth: The world's largest podcast focused solely on product + growth, with over 200K listeners.

🔔 Subscribe and turn on notifications to get more videos like this.

Aakash Gupta (host) · Hamel Husain (guest) · Shreya Shankar (guest)
Jan 15, 2026 · 1h 7m · Watch on YouTube ↗

CHAPTERS

  1. What “AI evals” actually means (and why this episode is different)

    Aakash frames evals as a production necessity rather than a demo-time nice-to-have, and sets up a step-by-step walkthrough on real data. Hamel and Shreya preview the core approach: start from real traces, do error analysis, then build targeted evals—no hype, no vanity metrics.

  2. Why every AI product needs evals (even if you’re dogfooding)

    Shreya unpacks the misconception that "Claude Code doesn't use evals": even that team does eval work upstream, and most real applications still require application-specific evaluation on top. The group positions evals as the mechanism to improve real user outcomes, not to chase abstract model quality.

  3. Case study setup: Nurture Boss and why it’s a ‘messy’ ideal example

    Hamel introduces Nurture Boss, a property-management AI assistant handling multi-turn tenant conversations across channels (text, voice, chatbot). The product’s real-world complexity—tool calls, RAG, scheduling flows, and noisy inputs—makes it a strong example for how evals should be built from reality.

  4. Start with observability: traces over dashboards

    They argue the first step is capturing traces of what the model saw and did—not just aggregate APM metrics. Hamel notes you don’t need a fancy tool to start (CSV/JSON logs work), but you must be able to inspect and annotate interactions to understand failures.
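To make this concrete, here is a minimal sketch of the kind of file-based trace log they describe, in Python. The JSONL schema (conversation_id, messages, tool_calls, notes) is an illustrative assumption, not something from the episode:

```python
import json
from datetime import datetime, timezone

def log_trace(path, conversation_id, messages, tool_calls=None):
    """Append one interaction as a JSON line: a minimal, file-based trace log."""
    record = {
        "conversation_id": conversation_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "messages": messages,        # the full list of {role, content} dicts the model saw
        "tool_calls": tool_calls or [],
        "notes": "",                 # left blank for later human annotation
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

The point is not the format: any store you can inspect trace-by-trace and annotate works as a starting point.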

  5. Reading a trace like a PM: concrete failures hidden in plain sight

    Using a real text-message trace, they surface multiple product-impacting problems: misunderstanding constraints, failing to follow up, and output formatting mismatches (markdown sent as SMS). The segment emphasizes that humans must interpret nuance; generic “helpfulness” style metrics miss what matters.

  6. Why ‘just ask ChatGPT’ isn’t enough for evaluation

    They demonstrate how LLMs can catch some issues but miss critical product nuances (like whether the tool even supports a requested filter or whether brevity is desirable in SMS). The takeaway: LLMs can assist, but you still need structured human review and domain context.

  7. Open coding: fast, lightweight annotation of 100 traces

    Hamel introduces the core workflow: scan traces quickly and write short notes about what went wrong, without overthinking root cause. Shreya warns against getting stuck debating each trace; the goal is momentum and coverage, capturing the most important failures.
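A lightweight way to do this is a terminal loop that shows one trace at a time and captures a one-line note. A sketch, assuming the JSONL trace format from the observability example above:

```python
import json

def open_code(traces_path, notes_path, limit=100):
    """Step through traces and capture one short, unstructured note per trace."""
    with open(traces_path, encoding="utf-8") as f, \
         open(notes_path, "a", encoding="utf-8") as out:
        for i, line in enumerate(f):
            if i >= limit:
                break
            trace = json.loads(line)
            for msg in trace["messages"]:
                print(f"[{msg['role']}] {msg['content']}")
            # Momentum over precision: a quick gut-level note, no root-cause debate
            note = input("note (enter to skip) > ").strip()
            out.write(json.dumps({"conversation_id": trace["conversation_id"],
                                  "note": note}) + "\n")
```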

  8. Error analysis begins: turning messy notes into actionable categories (axial coding)

    They move from raw notes to categorization using axial coding, optionally bootstrapped by an LLM but refined by humans. Shreya emphasizes categories must be specific and labelable—vague buckets like “temporal issues” aren’t useful unless made concrete.
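As a sketch of the LLM-bootstrapped step, the snippet below asks a model to propose categories from the open-coding notes, which the humans then refine. It assumes the official openai Python client; the model name is illustrative:

```python
import json
from openai import OpenAI  # assumes the official openai package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def propose_categories(notes_path, max_notes=100):
    """Ask an LLM to draft failure categories from open-coding notes (humans refine after)."""
    notes = []
    with open(notes_path, encoding="utf-8") as f:
        for line in f:
            note = json.loads(line).get("note", "").strip()
            if note:
                notes.append(note)
    prompt = (
        "Here are free-form notes from reviewing traces of an AI assistant:\n"
        + "\n".join(f"- {n}" for n in notes[:max_notes])
        + "\n\nGroup these into 5-6 failure categories. Each category name must be "
        "specific and labelable; avoid vague buckets like 'temporal issues'."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```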

  9. Counting issues with pivot tables: prioritization with evidence

    Once categories exist, they quantify frequency via pivot tables to identify dominant failure modes and unblock roadmap decisions. They also note you can introduce hierarchy (subcategories) and prioritize not only by frequency but by severity/impact.
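In pandas this is a few lines. The column names (category, subcategory, conversation_id) are assumptions about how the annotations were stored:

```python
import pandas as pd

# labeled.csv: one row per annotated trace, with a "category" column from axial coding
df = pd.read_csv("labeled.csv")

# Frequency of each failure mode, most common first
print(df["category"].value_counts())

# With a subcategory column, a pivot table surfaces the hierarchy
pivot = pd.pivot_table(df, index="category", columns="subcategory",
                       values="conversation_id", aggfunc="count", fill_value=0)
print(pivot)
```

Frequency is the starting point; as they note, a rare but severe failure mode can still outrank a common cosmetic one.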

  10. From issues to eval types: code-based checks vs LLM-as-judge

    They explain not every issue needs an LLM judge—some are cheaply caught with deterministic rules (e.g., markdown in SMS). LLM judges are reserved for subjective judgments (e.g., when a human handoff is needed) and should be created for problems you expect to iterate on repeatedly.
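The markdown-in-SMS case shows how cheap the deterministic path can be. A hedged sketch (the patterns are illustrative, not an exhaustive markdown detector):

```python
import re

# Markdown constructs that should never appear in an SMS reply
MARKDOWN_PATTERNS = [
    r"\*\*[^*]+\*\*",        # bold
    r"(?m)^#{1,6}\s",        # headings
    r"\[[^\]]+\]\([^)]+\)",  # links
    r"(?m)^\s*[-*]\s",       # bullet lists
]

def contains_markdown(text):
    """Deterministic check: flag SMS outputs that contain markdown formatting."""
    return any(re.search(p, text) for p in MARKDOWN_PATTERNS)

assert contains_markdown("**Great!** Your tour is booked.")
assert not contains_markdown("Great! Your tour is booked for 2pm Tuesday.")
```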

  11. Building an LLM judge: rubrics, binary outputs, and iteration

    Hamel shares a simple rubric-driven judge prompt for “handoff failure,” designed to return only true/false. Shreya argues binary is easier to align than numeric scales and matches how product decisions actually get made (act vs don’t act).
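A minimal sketch of such a binary judge, assuming the openai client; the rubric wording and model name are illustrative, not Hamel's actual prompt:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating a property-management AI assistant.
Rubric: the conversation is a FAILURE if the user asked for a human (or clearly
needed one) and the assistant did not hand the conversation off.

Conversation:
{conversation}

Answer with exactly one word: true if this is a handoff failure, false otherwise."""

def judge_handoff_failure(conversation):
    """Binary LLM judge: returns True only if the judge answers 'true'."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(conversation=conversation)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("true")
```

Constraining the output to true/false is what makes the next step, checking the judge against human labels, tractable.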

  12. Measuring the judge: why agreement is a trap (TPR/TNR mindset)

    They warn that stakeholders will lose trust if judges aren’t validated against human labels. Simple accuracy/agreement can be misleading in imbalanced cases, so you should evaluate positive and negative performance separately (e.g., ability to catch true failures vs avoid false alarms).
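The arithmetic is easy to sanity-check in code. In this sketch True means "failure present", and the example reproduces the trap from the takeaways: a judge that always says "pass" scores 90% accuracy on a 90/10 split yet catches zero real failures:

```python
def judge_metrics(human_labels, judge_labels):
    """Compute TPR and TNR against human labels; True means 'failure present'."""
    tp = sum(h and j for h, j in zip(human_labels, judge_labels))
    fn = sum(h and not j for h, j in zip(human_labels, judge_labels))
    tn = sum(not h and not j for h, j in zip(human_labels, judge_labels))
    fp = sum(not h and j for h, j in zip(human_labels, judge_labels))
    tpr = tp / (tp + fn) if tp + fn else float("nan")  # ability to catch true failures
    tnr = tn / (tn + fp) if tn + fp else float("nan")  # ability to avoid false alarms
    return tpr, tnr

labels = [True] * 10 + [False] * 90   # 10% real failures
always_pass = [False] * 100           # judge that never flags anything
print(judge_metrics(labels, always_pass))  # (0.0, 1.0): 90% "accurate", 0% useful
```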

  13. Operating the eval suite: CI vs monitoring, sampling, and evolving data

    They discuss what an end-state eval setup looks like in practice: a mix of lightweight code checks in CI and occasional LLM-powered monitoring on sampled production traces. Shreya notes evals must evolve with distribution shifts—new user cohorts and new document types create new failure modes.
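A sketch of the sampling side, reusing the JSONL trace log from earlier: deterministic checks can run on every trace in CI, while the costlier LLM judge runs only on a random slice of production traffic. The 5% rate is an arbitrary placeholder:

```python
import json
import random

def sample_for_monitoring(traces_path, rate=0.05, seed=42):
    """Draw a random sample of production traces for periodic LLM-judge review."""
    random.seed(seed)
    with open(traces_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if random.random() < rate]

# Cheap checks (e.g., the contains_markdown assertion above) run on everything;
# the LLM judge runs only on sample_for_monitoring("traces.jsonl").
```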

  14. Roles, workflows, and common mistakes: keep PMs in the loop

    They outline collaboration patterns between PMs and AI engineers, emphasizing PM/domain experts should lead error analysis because it encodes product taste and becomes the product moat. Key pitfalls include skipping error analysis, relying on vendor metrics, and outsourcing the core judgment work away from domain experts.
