Aakash Gupta

How to Build AI Evals in 2026 (Step-by-Step, No Hype)

Hamel Husain and Shreya Shankar are back with the definitive guide to AI evals. Step-by-step walkthrough using real production data from Nurture Boss. Error analysis, LLM judges, and the mistakes 90% of teams make.

Full Writeup: https://www.news.aakashg.com/p/hamel-shreya-podcast-2
Transcript: https://www.aakashg.com/how-to-master-ai-evals-a-step-by-step-guide-with-hamel-husain-shreya-shankar/

----

Timestamps:
0:00 - Intro
2:09 - Why Every AI Product Needs Evals
3:11 - Real Example: Nurture Boss Case Study
5:26 - Starting with Observability
11:24 - Ad Start
13:05 - Ad End: Analyzing Traces
24:55 - Error Analysis Introduction
27:00 - Axial Coding Explained
30:53 - Ad Start
32:40 - Ad End: Counting Issues
42:26 - Building Your LLM Judge
48:02 - Measuring the Judge
56:38 - PM vs AI Engineer Roles
1:01:29 - Common Mistakes to Avoid
1:06:31 - Outro

----

🏆 Thanks to our sponsors:
1. The AI Evals Course for PMs & Engineers: You get $800 with this link: https://maven.com/parlance-labs/evals?promoCode=ag-product-growth
2. Vanta: Automate compliance. Get $1,000 with my link: https://www.vanta.com/lp/demo-1k?utm_campaign=1k_offer&utm_source=product-growth&utm_medium=podcast
3. Jira Product Discovery: Plan with purpose, ship with confidence - https://www.atlassian.com/software/jira/product-discovery
4. Land PM job: 12-week experience to master getting a PM job - https://www.landpmjob.com/
5. Pendo: the #1 Software Experience Management Platform - http://www.pendo.com/aakash

----

Key Takeaways:
1. AI evals are the #1 most important new skill for PMs in 2025 - Even Claude Code teams do evals upstream. For custom applications, systematic evaluation is non-negotiable. Dogfooding alone isn't enough at scale.
2. Error analysis is the secret weapon most teams skip - Looking at 100 traces teaches you more than any generic metric. Hamel: "If you try to use helpfulness scores, the LLM won't catch the real product issues."
3. Use observability tools but don't depend on them completely - Braintrust, LangSmith, and Arize all work. But Shreya and Hamel teach students to vibe code their own trace viewers. Sometimes CSV files are enough to start.
4. Never use agreement as your eval metric - It's a trap. A judge that always says "pass" can have 90% accuracy if failures are rare. Use TPR (true positive rate) and TNR (true negative rate) instead.
5. Open coding then axial coding reveals patterns - Write notes on 100 traces without root cause analysis. Then categorize into 5-6 actionable themes. Use LLMs to help but refine manually.
6. Product managers must do the error analysis themselves - Don't outsource to developers. Engineers lack domain context. Hamel: "It's almost a tragedy to separate the prompt from the product manager because it's English."
7. Real traces reveal what demos hide - ChatGPT said the assistant was correct but missed: wrong bathroom configuration, markdown in SMS, double-booked tours, ignored handoff requests.
8. Binary scores beat 1-5 scales for LLM judges - Easier to validate alignment. Business decisions are binary anyway. LLMs struggle with nuanced numerical scoring.
9. Code-based evals for formatting, LLM judges for subjective calls - Markdown in text messages? Write a simple assertion. Human handoff quality? Need an LLM judge with a proper rubric.
10. Start with traces even before launch - Dogfood your own app. Recruit friends as beta testers. Generate synthetic inputs only as a last resort. Error analysis works best with real user behavior.

----

👨‍💻 Where to find Hamel Husain:
Website: https://hamel.dev
Twitter/X: https://x.com/HamelHusain
Course: https://evals.info

👨‍💻 Where to find Shreya Shankar:
Website: https://www.shreya-shankar.com
Twitter/X: https://x.com/sh_reya
Course: https://evals.info

👨‍💻 Where to find Aakash:
Twitter: https://www.x.com/aakashg0
LinkedIn: https://www.linkedin.com/in/aagupta/
Newsletter: https://www.news.aakashg.com

#aievals #aipm #productmanagement

----

🧠 About Product Growth: The world's largest podcast focused solely on product + growth, with 200K+ listeners.
🔔 Subscribe and turn on notifications to get more videos like this.

Aakash Gupta (host), Hamel Husain (guest), Shreya Shankar (guest)
Jan 15, 2026 · 1h 7m

CHAPTERS

  1. What “AI evals” actually means (and why this episode is different)

    Aakash frames evals as a production necessity rather than a demo-time nice-to-have, and sets up a step-by-step walkthrough on real data. Hamel and Shreya preview the core approach: start from real traces, do error analysis, then build targeted evals—no hype, no vanity metrics.

    • Evals as a practical skill for shipping AI features, especially for PMs
    • The plan: real company example, real production traces, concrete workflow
    • Pushback/controversy: “some products don’t need evals” vs reality
    • Theme: move beyond vibes and generic scores into systematic improvement
  2. Why every AI product needs evals (even if you’re dogfooding)

    Shreya explains the misconception behind “Claude Code doesn’t use evals,” arguing many apps benefit from upstream eval work—but most real applications still require application-specific evaluation. The group positions evals as the mechanism to improve real user outcomes, not to chase abstract model quality.

    • Coding agents may rely on upstream model testing + heavy dogfooding, but most apps can’t
    • Application-specific behavior requires application-specific evals
    • Evals are about iterative product improvement, not proving intelligence
    • Dogfooding helps, but doesn’t replace a disciplined measurement loop
  3. Case study setup: Nurture Boss and why it’s a ‘messy’ ideal example

    Hamel introduces Nurture Boss, a property-management AI assistant handling multi-turn tenant conversations across channels (text, voice, chatbot). The product’s real-world complexity—tool calls, RAG, scheduling flows, and noisy inputs—makes it a strong example for how evals should be built from reality.

    • What Nurture Boss does: leasing/tenant interactions, listings, tours, applications
    • Real-world complexity: tool calls, RAG, multi-turn, multiple channels
    • Goal: identify what’s going wrong and improve systematically beyond vibe checks
    • Using anonymized production data to teach the process
  4. Start with observability: traces over dashboards

    They argue the first step is capturing traces of what the model saw and did—not just aggregate APM metrics. Hamel notes you don’t need a fancy tool to start (CSV/JSON logs work), but you must be able to inspect and annotate interactions to understand failures.

    • Traces show prompts, tool calls, retrieved context, and outputs across turns
    • AI observability tools are optional; simplest logging that supports review is fine
    • Difference vs traditional APM: need model-context visibility, not just latency/errors
    • Key requirement: ability to take notes directly on traces
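
    A minimal sketch of the "simplest logging that supports review" idea, assuming traces are stored one JSON object per line. The `Trace` fields, the file layout, and the helper names are all hypothetical and not taken from Nurture Boss or any observability tool.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class Trace:
    """One logged interaction. All field names are illustrative assumptions."""
    trace_id: str
    channel: str                  # e.g. "sms", "voice", "chat"
    messages: list                # prompts and replies across turns
    tool_calls: list = field(default_factory=list)
    retrieved_context: list = field(default_factory=list)
    note: str = ""                # reviewer's free-text annotation


def append_trace(path: Path, trace: Trace) -> None:
    """Append one trace to a JSON Lines file."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(trace)) + "\n")


def load_traces(path: Path) -> list:
    """Read every trace back for review."""
    with path.open(encoding="utf-8") as f:
        return [Trace(**json.loads(line)) for line in f if line.strip()]


def annotate(path: Path, trace_id: str, note: str) -> None:
    """Attach a reviewer note to one trace by rewriting the file."""
    traces = load_traces(path)
    for t in traces:
        if t.trace_id == trace_id:
            t.note = note
    with path.open("w", encoding="utf-8") as f:
        for t in traces:
            f.write(json.dumps(asdict(t)) + "\n")
```

    Rewriting the whole file on every note is fine for a few hundred traces; a larger log would likely want a separate annotations file keyed by `trace_id`.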
  5. Reading a trace like a PM: concrete failures hidden in plain sight

    Using a real text-message trace, they surface multiple product-impacting problems: misunderstanding constraints, failing to follow up, and output formatting mismatches (markdown sent as SMS). The segment emphasizes that humans must interpret nuance; generic “helpfulness” style metrics miss what matters.

    • Identify mismatched requirement: bathroom configuration misunderstood
    • Model says it will check on something but never follows through
    • Channel mismatch: markdown formatting in a text message context
    • Lesson: PM taste/UX context is required to judge quality accurately
  6. Why ‘just ask ChatGPT’ isn’t enough for evaluation

    They demonstrate how LLMs can catch some issues but miss critical product nuances (like whether the tool even supports a requested filter or whether brevity is desirable in SMS). The takeaway: LLMs can assist, but you still need structured human review and domain context.

    • LLMs may flag obvious errors but miss product-specific constraints
    • They may invent assumptions (e.g., tool supports bathroom filter)
    • They may misjudge UX tradeoffs (e.g., flagging a reply that lists only 3 apartments, when brevity is fine in SMS)
    • Human-in-the-loop review remains essential for grounding evals
  7. Open coding: fast, lightweight annotation of 100 traces

    Hamel introduces the core workflow: scan traces quickly and write short notes about what went wrong, without overthinking root cause. Shreya warns against getting stuck debating each trace; the goal is momentum and coverage, capturing the most important failures.

    • Write simple notes (open codes) per trace—what’s wrong, in plain language
    • Speed matters: ~30 seconds per trace is the target, perfection not required
    • Avoid root-cause analysis at this stage; just observe and record
    • Skip clean traces; focus attention on failures and friction
  8. Error analysis begins: turning messy notes into actionable categories (axial coding)

    They move from raw notes to categorization using axial coding, optionally bootstrapped by an LLM but refined by humans. Shreya emphasizes categories must be specific and labelable—vague buckets like “temporal issues” aren’t useful unless made concrete.

    • Axial coding = grouping open codes into specific, actionable error categories
    • LLMs can propose initial categories, but humans must refine names and scope
    • Avoid vague categories; name each one clearly enough that someone else could apply the label
    • Iterate on category taxonomy as you see more examples
  9. Counting issues with pivot tables: prioritization with evidence

    Once categories exist, they quantify frequency via pivot tables to identify dominant failure modes and unblock roadmap decisions. They also note you can introduce hierarchy (subcategories) and prioritize not only by frequency but by severity/impact.

    • Counting converts qualitative chaos into a prioritized list of failure modes
    • Pivot tables quickly show top categories and enable drill-down to examples
    • Consider hierarchical breakdowns (category → subcategory) for clarity
    • Prioritize by impact as well as frequency (rare but catastrophic failures)
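
    The counting step can be sketched without a spreadsheet. The labels, category names, and severity weights below are made up for illustration, and ranking by frequency times severity is one assumed way to combine the two signals, not a formula from the episode.

```python
from collections import Counter

# Hypothetical annotated traces: (trace_id, category, subcategory).
labels = [
    ("t1", "handoff", "ignored request"),
    ("t2", "formatting", "markdown in sms"),
    ("t3", "handoff", "ignored request"),
    ("t4", "scheduling", "double-booked tour"),
    ("t5", "handoff", "late handoff"),
    ("t6", "formatting", "markdown in sms"),
]

# Hypothetical severity weights (1 = cosmetic, 5 = catastrophic).
severity = {"handoff": 3, "formatting": 1, "scheduling": 5}

# Frequency per category, and per (category, subcategory) for drill-down.
by_category = Counter(cat for _, cat, _ in labels)
by_subcategory = Counter((cat, sub) for _, cat, sub in labels)


def priority(category: str) -> int:
    """Assumed scoring rule: frequency multiplied by severity."""
    return by_category[category] * severity.get(category, 1)


ranked = sorted(by_category, key=priority, reverse=True)

for cat in ranked:
    print(cat, by_category[cat], priority(cat))
```

    With these made-up numbers, the single scheduling failure outranks the two formatting failures, which is the "rare but catastrophic" case the bullet above describes.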
  10. From issues to eval types: code-based checks vs LLM-as-judge

    They explain not every issue needs an LLM judge—some are cheaply caught with deterministic rules (e.g., markdown in SMS). LLM judges are reserved for subjective judgments (e.g., when a human handoff is needed) and should be created for problems you expect to iterate on repeatedly.

    • Two eval classes: deterministic/code-based vs LLM-based evaluators
    • Use code-based evals when possible (formatting, policy rules, invariants)
    • Reserve LLM judges for subjective or contextual product decisions
    • Write evals for recurring/iteration-worthy problems, not for everything
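
    A deterministic check for the markdown-in-SMS example might look like the sketch below. The regex list is an assumption about what counts as markdown and is not exhaustive; the function names and the `"sms"` channel value are hypothetical.

```python
import re

# Assumed patterns for common markdown; extend for your own channel rules.
MARKDOWN_PATTERNS = [
    re.compile(r"\*\*[^*\n]+\*\*"),           # **bold**
    re.compile(r"^\s{0,3}#{1,6}\s", re.M),    # "# heading" at line start
    re.compile(r"^\s*[-*]\s+\S", re.M),       # "- item" or "* item" bullets
    re.compile(r"\[[^\]\n]+\]\([^)\n]+\)"),   # [text](url)
    re.compile(r"`[^`\n]+`"),                 # `inline code`
]


def contains_markdown(text: str) -> bool:
    """True if any of the assumed markdown patterns appears in the text."""
    return any(p.search(text) for p in MARKDOWN_PATTERNS)


def check_sms_reply(channel: str, reply: str) -> bool:
    """Return True when the reply passes. Markdown only fails on SMS."""
    if channel != "sms":
        return True
    return not contains_markdown(reply)
```

    Because the check is pure string matching, it costs nothing to run on every trace or in CI.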
  11. Building an LLM judge: rubrics, binary outputs, and iteration

    Hamel shares a simple rubric-driven judge prompt for “handoff failure,” designed to return only true/false. Shreya argues binary is easier to align than numeric scales and matches how product decisions actually get made (act vs don’t act).

    • Judge prompt should define what counts as failure vs non-failure (rubric)
    • Prefer binary outputs (true/false) to reduce alignment complexity
    • Examples help but aren’t strictly required to start; iterate over time
    • Don’t copy prompts blindly—tailor to your product’s policies and tools
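
    A rough sketch of a rubric-driven binary judge follows. The rubric wording is invented for illustration and is not the prompt shown in the episode, and `call_llm` is a placeholder for whatever model client you use; no real API is assumed.

```python
from typing import Callable

# Hypothetical rubric; tailor it to your own product's policies and tools.
JUDGE_PROMPT = """You are reviewing a conversation between a leasing assistant and a prospect.

Failure (handoff failure): the user asked for a human, or product policy
required a human, and the assistant did not hand off.

Not a failure: the assistant handed off, or no handoff was requested or required.

Conversation:
{conversation}

Answer with exactly one word: true if this is a handoff failure, false otherwise."""


def parse_verdict(raw: str) -> bool:
    """Accept only 'true' or 'false'; anything else is an error, not a guess."""
    answer = raw.strip().lower().rstrip(".")
    if answer == "true":
        return True
    if answer == "false":
        return False
    raise ValueError(f"Judge returned a non-binary answer: {raw!r}")


def judge_handoff_failure(conversation: str, call_llm: Callable[[str], str]) -> bool:
    """Fill in the rubric prompt, call the model, and return a strict boolean."""
    prompt = JUDGE_PROMPT.format(conversation=conversation)
    return parse_verdict(call_llm(prompt))
```

    Raising on anything other than true/false keeps malformed judge output visible instead of silently counting it as a pass.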
  12. Measuring the judge: why agreement is a trap (TPR/TNR mindset)

    They warn that stakeholders will lose trust if judges aren’t validated against human labels. Simple accuracy/agreement can be misleading in imbalanced cases, so you should evaluate positive and negative performance separately (e.g., ability to catch true failures vs avoid false alarms).

    • Validate LLM judge against human-labeled traces to earn trust
    • Accuracy/agreement can be high even for a useless always-pass judge
    • Track performance on positives and negatives separately (catch vs avoid)
    • They acknowledge deeper topics: dataset splits, overfitting, agent-specific nuance
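
    The always-pass trap is easy to reproduce with made-up numbers. The sketch below assumes `True` means "failure present" (so a positive is a real failure) and uses 100 hypothetical traces with 10 human-labeled failures.

```python
def judge_rates(human: list, judge: list) -> dict:
    """Compare judge verdicts with human labels. True means 'failure present'."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    pairs = list(zip(human, judge))
    tp = sum(h and j for h, j in pairs)              # real failures caught
    tn = sum((not h) and (not j) for h, j in pairs)  # clean traces passed
    fp = sum((not h) and j for h, j in pairs)        # false alarms
    fn = sum(h and (not j) for h, j in pairs)        # real failures missed
    return {
        "agreement": (tp + tn) / len(pairs),
        "tpr": tp / (tp + fn) if (tp + fn) else None,
        "tnr": tn / (tn + fp) if (tn + fp) else None,
    }


# 100 hypothetical traces with 10 real failures, scored by a judge
# that always says "pass" (it never flags a failure).
human = [True] * 10 + [False] * 90
always_pass = [False] * 100

rates = judge_rates(human, always_pass)
print(rates)  # agreement is 0.9, yet tpr is 0.0: no real failure is caught
```

    Agreement looks healthy at 90% while the true positive rate is zero, which is why the two rates are reported separately.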
  13. Operating the eval suite: CI vs monitoring, sampling, and evolving data

    They discuss what an end-state eval setup looks like in practice: a mix of lightweight code checks in CI and occasional LLM-powered monitoring on sampled production traces. Shreya notes evals must evolve with distribution shifts—new user cohorts and new document types create new failure modes.

    • Typical suite: many code-based checks, few LLM judges in CI due to cost/latency
    • Run LLM-powered monitoring periodically (weekly) on sampled production traces
    • Watch for distribution shift: new cohorts, new doc/contract types, new behavior
    • Use evals to iterate quickly and prevent regressions across multiple goals
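
    Periodic monitoring on sampled traces could be sketched as below. The sample size, the seed, and the function names are assumptions; the judge itself is left out, so `verdicts` stands in for whatever an LLM judge returned on the sampled traces.

```python
import random


def sample_for_monitoring(trace_ids: list, k: int, seed: int) -> list:
    """Pick up to k traces for a periodic judge run; the seed makes the draw reproducible."""
    rng = random.Random(seed)
    return rng.sample(trace_ids, min(k, len(trace_ids)))


def failure_rate(verdicts: list) -> float:
    """Share of sampled traces flagged as failures (True = failure)."""
    if not verdicts:
        raise ValueError("no verdicts to summarize")
    return sum(verdicts) / len(verdicts)


# Hypothetical week of production traffic.
week_of_traces = [f"trace-{i}" for i in range(1000)]
sampled = sample_for_monitoring(week_of_traces, k=50, seed=7)
```

    Tracking `failure_rate` across successive samples is one simple way to notice the distribution shifts mentioned above, since a new cohort or document type tends to show up as a jump in one failure category.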
  14. Roles, workflows, and common mistakes: keep PMs in the loop

    They outline collaboration patterns between PMs and AI engineers, emphasizing PM/domain experts should lead error analysis because it encodes product taste and becomes the product moat. Key pitfalls include skipping error analysis, relying on vendor metrics, and outsourcing the core judgment work away from domain experts.

    • PM/domain expert should drive error analysis; engineers may lack UX/domain context
    • Make prompts editable by domain experts (admin views), not locked in code
    • Build/“vibe code” lightweight trace viewers to remove analysis friction
    • Common mistakes: skipping error analysis, using generic vendor scores, outsourcing the moat
