Aakash Gupta
How to Build AI Evals in 2026 (Step-by-Step, No Hype)
CHAPTERS
What “AI evals” actually means (and why this episode is different)
Aakash frames evals as a production necessity rather than a demo-time nice-to-have, and sets up a step-by-step walkthrough on real data. Hamel and Shreya preview the core approach: start from real traces, do error analysis, then build targeted evals—no hype, no vanity metrics.
- •Evals as a practical skill for shipping AI features, especially for PMs
- •The plan: real company example, real production traces, concrete workflow
- •Pushback/controversy: “some products don’t need evals” vs reality
- •Theme: move beyond vibes and generic scores into systematic improvement
Why every AI product needs evals (even if you’re dogfooding)
Shreya addresses the misconception behind “Claude Code doesn’t use evals”: coding agents can lean on eval work done upstream on the underlying model plus heavy dogfooding, but most real applications still require application-specific evaluation. The group positions evals as the mechanism to improve real user outcomes, not to chase abstract model quality.
- •Coding agents may rely on upstream model testing + heavy dogfooding, but most apps can’t
- •Application-specific behavior requires application-specific evals
- •Evals are about iterative product improvement, not proving intelligence
- •Dogfooding helps, but doesn’t replace a disciplined measurement loop
Case study setup: Nurture Boss and why its messiness makes it an ideal example
Hamel introduces Nurture Boss, a property-management AI assistant handling multi-turn tenant conversations across channels (text, voice, chatbot). The product’s real-world complexity—tool calls, RAG, scheduling flows, and noisy inputs—makes it a strong example for how evals should be built from reality.
- •What Nurture Boss does: leasing/tenant interactions, listings, tours, applications
- •Real-world complexity: tool calls, RAG, multi-turn, multiple channels
- •Goal: identify what’s going wrong and improve systematically beyond vibe checks
- •Using anonymized production data to teach the process
Start with observability: traces over dashboards
They argue the first step is capturing traces of what the model saw and did—not just aggregate APM metrics. Hamel notes you don’t need a fancy tool to start (CSV/JSON logs work), but you must be able to inspect and annotate interactions to understand failures.
- •Traces show prompts, tool calls, retrieved context, and outputs across turns
- •AI observability tools are optional; simplest logging that supports review is fine
- •Difference vs traditional APM: need model-context visibility, not just latency/errors
- •Key requirement: ability to take notes directly on traces
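As a rough illustration of the "simplest logging that supports review" idea, the sketch below stores each conversation as one JSON line, keeping tool calls and retrieved context next to the text and leaving a free-text field for reviewer notes. The `Trace` class, its field names, and the helper functions are all hypothetical and are not taken from the episode or from any observability tool.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Trace:
    """One logged conversation. All field names here are illustrative."""
    trace_id: str
    channel: str                  # e.g. "sms", "voice", "chat"
    turns: list = field(default_factory=list)
    note: str = ""                # free-text reviewer note, filled in during review


def add_turn(trace, role, content, tool_calls=None, retrieved_context=None):
    """Append one turn, keeping what the model saw and did alongside the text."""
    trace.turns.append({
        "role": role,
        "content": content,
        "tool_calls": tool_calls or [],
        "retrieved_context": retrieved_context or [],
    })


def dumps_jsonl(traces):
    """Serialize traces as JSON Lines: one trace per line."""
    return "\n".join(json.dumps(asdict(t)) for t in traces)


def loads_jsonl(text):
    """Parse JSON Lines back into Trace objects, skipping blank lines."""
    return [Trace(**json.loads(line)) for line in text.splitlines() if line.strip()]
```

A reviewer can then load the file, read each trace in full, and set `note` on any trace that shows a problem. A spreadsheet with the same columns would serve equally well.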
Reading a trace like a PM: concrete failures hidden in plain sight
Using a real text-message trace, they surface multiple product-impacting problems: misunderstanding constraints, failing to follow up, and output formatting mismatches (markdown sent as SMS). The segment emphasizes that humans must interpret nuance; generic “helpfulness” style metrics miss what matters.
- •Identify mismatched requirement: bathroom configuration misunderstood
- •Model says it will do something (check) but never follows through
- •Channel mismatch: markdown formatting in a text message context
- •Lesson: PM taste/UX context is required to judge quality accurately
Why ‘just ask ChatGPT’ isn’t enough for evaluation
They demonstrate how LLMs can catch some issues but miss critical product nuances (like whether the tool even supports a requested filter or whether brevity is desirable in SMS). The takeaway: LLMs can assist, but you still need structured human review and domain context.
- •LLMs may flag obvious errors but miss product-specific constraints
- •They may invent assumptions (e.g., tool supports bathroom filter)
- •They may misjudge UX tradeoffs (e.g., not recognizing that listing only 3 apartments is fine in an SMS, where brevity is desirable)
- •Human-in-the-loop review remains essential for grounding evals
Open coding: fast, lightweight annotation of 100 traces
Hamel introduces the core workflow: scan traces quickly and write short notes about what went wrong, without overthinking root cause. Shreya warns against getting stuck debating each trace; the goal is momentum and coverage, capturing the most important failures.
- •Write simple notes (open codes) per trace—what’s wrong, in plain language
- •Speed matters: aim for ~30 seconds per trace; perfection is not required
- •Avoid root-cause analysis at this stage; just observe and record
- •Skip clean traces; focus attention on failures and friction
Error analysis begins: turning messy notes into actionable categories (axial coding)
They move from raw notes to categorization using axial coding, optionally bootstrapped by an LLM but refined by humans. Shreya emphasizes categories must be specific and labelable—vague buckets like “temporal issues” aren’t useful unless made concrete.
- •Axial coding = grouping open codes into specific, actionable error categories
- •LLMs can propose initial categories, but humans must refine names and scope
- •Avoid vague categories; name and define each one clearly enough that someone else could apply the labels
- •Iterate on category taxonomy as you see more examples
Counting issues with pivot tables: prioritization with evidence
Once categories exist, they quantify frequency via pivot tables to identify dominant failure modes and unblock roadmap decisions. They also note you can introduce hierarchy (subcategories) and prioritize not only by frequency but by severity/impact.
- •Counting converts qualitative chaos into a prioritized list of failure modes
- •Pivot tables quickly show top categories and enable drill-down to examples
- •Consider hierarchical breakdowns (category → subcategory) for clarity
- •Prioritize by impact as well as frequency (rare but catastrophic failures)
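The counting step can be done in any spreadsheet, but a minimal Python sketch of the same pivot-table idea looks like the following. It assumes each labeled trace is a dict with `trace_id`, `category`, and `subcategory` keys; those names and the helper functions are illustrative assumptions, not something specified in the episode.

```python
from collections import Counter


def count_by(rows, *keys):
    """Count labeled traces grouped by one or more keys, most frequent first.

    Returns a list of ((key values...), count) pairs, like a pivot table.
    """
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return counts.most_common()


def examples_for(rows, category):
    """Drill down: return the trace ids labeled with a given category."""
    return [row["trace_id"] for row in rows if row["category"] == category]
```

Calling `count_by(rows, "category")` gives the top-level ranking, and `count_by(rows, "category", "subcategory")` gives the hierarchical breakdown. Frequency is only one input: a rare but catastrophic category may still deserve to be ranked first, and that judgment stays with the reviewer.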
From issues to eval types: code-based checks vs LLM-as-judge
They explain not every issue needs an LLM judge—some are cheaply caught with deterministic rules (e.g., markdown in SMS). LLM judges are reserved for subjective judgments (e.g., when a human handoff is needed) and should be created for problems you expect to iterate on repeatedly.
- •Two eval classes: deterministic/code-based vs LLM-based evaluators
- •Use code-based evals when possible (formatting, policy rules, invariants)
- •Reserve LLM judges for subjective or contextual product decisions
- •Write evals for recurring/iteration-worthy problems, not for everything
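The markdown-in-SMS case is a good candidate for a deterministic check. The sketch below is one possible heuristic, not the check used in the episode: the pattern list is an assumption and will need tuning, since a plain-text message can legitimately start a line with a dash.

```python
import re

# Heuristic patterns for common markdown constructs (illustrative, not exhaustive).
MARKDOWN_PATTERNS = [
    re.compile(r"\*\*[^*\n]+\*\*"),              # **bold**
    re.compile(r"^\s{0,3}#{1,6}\s", re.M),       # "# heading" at the start of a line
    re.compile(r"^\s*[-*]\s+\S", re.M),          # "- item" or "* item" bullets
    re.compile(r"\[[^\]\n]+\]\([^)\n]+\)"),      # [text](url) links
    re.compile(r"`[^`\n]+`"),                    # `inline code`
]


def has_markdown(text):
    """Return True if the text appears to contain markdown formatting."""
    return any(pattern.search(text) for pattern in MARKDOWN_PATTERNS)


def sms_formatting_failure(channel, message):
    """Flag a failure only when markdown shows up in an SMS reply."""
    return channel == "sms" and has_markdown(message)
```

Because the check is cheap and has no model dependency, it can run on every assistant message in tests without adding cost or latency.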
Building an LLM judge: rubrics, binary outputs, and iteration
Hamel shares a simple rubric-driven judge prompt for “handoff failure,” designed to return only true/false. Shreya argues binary is easier to align than numeric scales and matches how product decisions actually get made (act vs don’t act).
- •Judge prompt should define what counts as failure vs non-failure (rubric)
- •Prefer binary outputs (true/false) to reduce alignment complexity
- •Examples help but aren’t strictly required to start; iterate over time
- •Don’t copy prompts blindly—tailor to your product’s policies and tools
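To make the shape of a binary, rubric-driven judge concrete, here is a minimal sketch. The rubric text is a made-up placeholder and is not the prompt Hamel shows; real criteria should come from your own error analysis. `call_model` is a hypothetical callable that takes a prompt string and returns the model's text reply, so no particular vendor API is assumed.

```python
from typing import Callable

# Placeholder rubric: the criteria below are illustrative assumptions only.
JUDGE_PROMPT = """You are reviewing a conversation between a leasing assistant and a user.
Decide whether the assistant failed to hand off to a human when it should have.

Counts as a handoff failure:
- The user explicitly asks for a human and the assistant does not transfer them.
- The request is outside what the assistant can do and it neither helps nor escalates.

Does not count as a handoff failure:
- The assistant resolves the request itself.
- The assistant offers a handoff and the user declines.

Conversation:
{conversation}

Answer with exactly one word: true (handoff failure) or false (no handoff failure)."""


def parse_verdict(raw: str) -> bool:
    """Map the judge's reply to a bool, rejecting anything that is not true/false."""
    answer = raw.strip().strip(".\"'").lower()
    if answer == "true":
        return True
    if answer == "false":
        return False
    raise ValueError(f"Judge returned a non-binary answer: {raw!r}")


def judge_handoff_failure(conversation: str, call_model: Callable[[str], str]) -> bool:
    """Fill in the rubric prompt, call the model, and return True on failure."""
    prompt = JUDGE_PROMPT.format(conversation=conversation)
    return parse_verdict(call_model(prompt))
```

Raising on anything other than true or false is a deliberate choice in this sketch: a judge that drifts into explanations or scores should surface as an error rather than be silently coerced into a verdict.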
Measuring the judge: why agreement is a trap (TPR/TNR mindset)
They warn that stakeholders will lose trust if judges aren’t validated against human labels. Simple accuracy/agreement can be misleading in imbalanced cases, so you should evaluate positive and negative performance separately (e.g., ability to catch true failures vs avoid false alarms).
- •Validate LLM judge against human-labeled traces to earn trust
- •When failures are rare, accuracy/agreement can be high even for a useless always-pass judge
- •Track performance on positives and negatives separately (catch vs avoid)
- •They acknowledge deeper topics left for later: dataset splits, overfitting, agent-specific nuance
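The always-pass trap is easy to see with a small calculation. The sketch below assumes the convention that `True` means "failure present" (the positive class) for both the human label and the judge verdict; the function name and return format are illustrative.

```python
def judge_metrics(human_labels, judge_labels):
    """Compare judge verdicts with human labels (True = failure present).

    Returns accuracy, TPR (share of real failures the judge catches) and
    TNR (share of clean traces the judge correctly passes). A rate is None
    when its denominator is zero.
    """
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(1 for h, j in pairs if h and j)
    tn = sum(1 for h, j in pairs if not h and not j)
    fp = sum(1 for h, j in pairs if not h and j)
    fn = sum(1 for h, j in pairs if h and not j)

    def rate(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "accuracy": rate(tp + tn, len(pairs)),
        "tpr": rate(tp, tp + fn),
        "tnr": rate(tn, tn + fp),
    }
```

With ten human-labeled traces of which one is a real failure, a judge that always answers "no failure" scores 0.9 accuracy, 1.0 TNR, and 0.0 TPR: it agrees with humans 90% of the time while catching nothing.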
Operating the eval suite: CI vs monitoring, sampling, and evolving data
They discuss what an end-state eval setup looks like in practice: a mix of lightweight code checks in CI and occasional LLM-powered monitoring on sampled production traces. Shreya notes evals must evolve with distribution shifts—new user cohorts and new document types create new failure modes.
- •Typical suite: many code-based checks, few LLM judges in CI due to cost/latency
- •Run LLM-powered monitoring periodically (e.g., weekly) on sampled production traces
- •Watch for distribution shift: new cohorts, new doc/contract types, new behavior
- •Use evals to iterate quickly and prevent regressions across multiple goals
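A periodic monitoring run can be as simple as drawing a random sample of recent traces and reporting the share that an evaluator flags. The following sketch assumes traces are held in a list and that `evaluator` is any function returning `True` for a failure, whether a code check or an LLM judge; the sample size and cadence are yours to choose.

```python
import random


def sample_traces(traces, k, seed=None):
    """Draw up to k traces uniformly at random, without replacement."""
    rng = random.Random(seed)   # a fixed seed makes a run reproducible
    return rng.sample(traces, min(k, len(traces)))


def failure_rate(sample, evaluator):
    """Share of sampled traces the evaluator flags; None for an empty sample."""
    if not sample:
        return None
    failures = sum(1 for trace in sample if evaluator(trace))
    return failures / len(sample)
```

Tracking this rate per category over time is one way to notice distribution shift: a jump after onboarding a new cohort or document type is a prompt to go back and read traces, not just a number to report.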
Roles, workflows, and common mistakes: keep PMs in the loop
They outline collaboration patterns between PMs and AI engineers, emphasizing PM/domain experts should lead error analysis because it encodes product taste and becomes the product moat. Key pitfalls include skipping error analysis, relying on vendor metrics, and outsourcing the core judgment work away from domain experts.
- •PM/domain expert should drive error analysis; engineers may lack UX/domain context
- •Make prompts editable by domain experts (admin views), not locked in code
- •Build/“vibe code” lightweight trace viewers to remove analysis friction
- •Common mistakes: skipping error analysis, using generic vendor scores, outsourcing the moat