Aakash Gupta
How to Build AI Evals in 2026 (Step-by-Step, No Hype)
CHAPTERS
What “AI evals” actually means (and why this episode is different)
Aakash frames evals as a production necessity rather than a demo-time nice-to-have, and sets up a step-by-step walkthrough on real data. Hamel and Shreya preview the core approach: start from real traces, do error analysis, then build targeted evals—no hype, no vanity metrics.
Why every AI product needs evals (even if you’re dogfooding)
Shreya explains the misconception behind "Claude Code doesn't use evals," arguing that while many apps benefit from eval work done upstream, most real applications still require their own application-specific evaluation. The group positions evals as the mechanism to improve real user outcomes, not to chase abstract model quality.
Case study setup: Nurture Boss and why it’s a ‘messy’ ideal example
Hamel introduces Nurture Boss, a property-management AI assistant handling multi-turn tenant conversations across channels (text, voice, chatbot). The product’s real-world complexity—tool calls, RAG, scheduling flows, and noisy inputs—makes it a strong example for how evals should be built from reality.
Start with observability: traces over dashboards
They argue the first step is capturing traces of what the model saw and did—not just aggregate APM metrics. Hamel notes you don’t need a fancy tool to start (CSV/JSON logs work), but you must be able to inspect and annotate interactions to understand failures.
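To make this concrete, here is a minimal sketch (not from the episode) of what "just log the traces" can look like before any observability tooling is involved: append each interaction as one JSON line. The file name and field names are illustrative assumptions.

```python
import json
from datetime import datetime, timezone

TRACE_FILE = "traces.jsonl"  # hypothetical path; any append-only store works

def log_trace(conversation_id, messages, tool_calls, response):
    """Append one model interaction as a single JSON line for later review."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "conversation_id": conversation_id,
        "messages": messages,      # exactly what the model saw
        "tool_calls": tool_calls,  # what it did (RAG lookups, scheduling, etc.)
        "response": response,      # what it sent back to the user
        "annotation": None,        # left empty for human notes during review
    }
    with open(TRACE_FILE, "a") as f:
        f.write(json.dumps(record) + "\n")
```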
Reading a trace like a PM: concrete failures hidden in plain sight
Using a real text-message trace, they surface multiple product-impacting problems: misunderstanding constraints, failing to follow up, and output formatting mismatches (markdown sent as SMS). The segment emphasizes that humans must interpret nuance; generic “helpfulness” style metrics miss what matters.
Why ‘just ask ChatGPT’ isn’t enough for evaluation
They demonstrate how LLMs can catch some issues but miss critical product nuances (like whether the tool even supports a requested filter or whether brevity is desirable in SMS). The takeaway: LLMs can assist, but you still need structured human review and domain context.
Open coding: fast, lightweight annotation of 100 traces
Hamel introduces the core workflow: scan traces quickly and write short notes about what went wrong, without overthinking root cause. Shreya warns against getting stuck debating each trace; the goal is momentum and coverage, capturing the most important failures.
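A rough sketch of that open-coding pass, assuming traces were logged as JSON lines as above: show each trace, capture one short free-form note, and move on. The field names are assumptions.

```python
import json

# Read the logged traces and attach one short, free-form note per trace.
# Kept deliberately simple: the goal is coverage and momentum, not root cause.
with open("traces.jsonl") as f:
    traces = [json.loads(line) for line in f][:100]  # roughly 100 traces

annotated = []
for i, trace in enumerate(traces):
    print(f"\n--- Trace {i + 1}/{len(traces)} ---")
    print("Last message:", trace["messages"][-1])
    print("Response:", trace["response"])
    note = input("What went wrong (blank if nothing)? ").strip()
    annotated.append({**trace, "annotation": note or None})

with open("traces_annotated.jsonl", "w") as f:
    for t in annotated:
        f.write(json.dumps(t) + "\n")
```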
Error analysis begins: turning messy notes into actionable categories (axial coding)
They move from raw notes to categorization using axial coding, optionally bootstrapped by an LLM but refined by humans. Shreya emphasizes categories must be specific and labelable—vague buckets like “temporal issues” aren’t useful unless made concrete.
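One way to bootstrap that categorization, sketched under assumptions (the OpenAI client and model name are stand-ins, and the prompt is illustrative): feed the raw notes to an LLM for a first draft of categories, then have humans merge, rename, and discard.

```python
import json
from openai import OpenAI  # any LLM client works; this one is an assumption

notes = [
    t["annotation"]
    for t in (json.loads(line) for line in open("traces_annotated.jsonl"))
    if t.get("annotation")
]

prompt = (
    "Below are free-form notes from reviewing AI assistant conversations.\n"
    "Propose 5-10 specific, labelable failure categories. Avoid vague buckets\n"
    "like 'temporal issues'; each category should be concrete enough that a\n"
    "reviewer could answer yes/no for any single trace.\n\n"
    + "\n".join(f"- {n}" for n in notes)
)

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",  # model name is an assumption
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)  # a draft only; humans refine it
```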
Counting issues with pivot tables: prioritization with evidence
Once categories exist, they quantify frequency via pivot tables to identify dominant failure modes and unblock roadmap decisions. They also note you can introduce hierarchy (subcategories) and prioritize not only by frequency but by severity/impact.
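A minimal pandas sketch of the counting step; the "category" and "subcategory" field names are assumptions about how labels were recorded during axial coding.

```python
import json
import pandas as pd

rows = [json.loads(line) for line in open("traces_annotated.jsonl")]
df = pd.DataFrame(rows)
labeled = df[df["category"].notna()]

# Dominant failure modes first: this is the list that unblocks the roadmap.
print(labeled["category"].value_counts())

# Optional hierarchy: counts broken out by subcategory.
print(labeled.pivot_table(index="category", columns="subcategory",
                          aggfunc="size", fill_value=0))
```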
From issues to eval types: code-based checks vs LLM-as-judge
They explain not every issue needs an LLM judge—some are cheaply caught with deterministic rules (e.g., markdown in SMS). LLM judges are reserved for subjective judgments (e.g., when a human handoff is needed) and should be created for problems you expect to iterate on repeatedly.
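For instance, the markdown-in-SMS failure can be caught with a few regexes rather than a judge. The patterns below are an illustrative sketch, not an exhaustive check.

```python
import re

# A cheap, deterministic check: markdown artifacts should never appear in an SMS.
MARKDOWN_PATTERNS = [
    r"\*\*[^*]+\*\*",        # bold
    r"^#{1,6}\s",            # headings
    r"\[[^\]]+\]\([^)]+\)",  # links
]

def contains_markdown(sms_text: str) -> bool:
    return any(re.search(p, sms_text, flags=re.MULTILINE) for p in MARKDOWN_PATTERNS)

assert contains_markdown("**Great!** See [listings](https://example.com)")
assert not contains_markdown("Great! I can show you the 2-bedroom at 3pm tomorrow.")
```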
Building an LLM judge: rubrics, binary outputs, and iteration
Hamel shares a simple rubric-driven judge prompt for “handoff failure,” designed to return only true/false. Shreya argues binary is easier to align than numeric scales and matches how product decisions actually get made (act vs don’t act).
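A hedged sketch of what a rubric-driven, binary judge might look like; the rubric wording, model choice, and client below are assumptions rather than the actual prompt shared in the episode.

```python
from openai import OpenAI

client = OpenAI()  # client and model name are assumptions

JUDGE_PROMPT = """You are reviewing a property-management assistant's conversation.
Rubric: a "handoff failure" occurs when the user needed a human (complex complaint,
request outside the assistant's abilities, explicit ask for a person) and the
assistant neither escalated nor acknowledged that it should.
Answer with exactly one word: true or false.

Conversation:
{conversation}
"""

def judge_handoff_failure(conversation: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(conversation=conversation)}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("true")
```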
Measuring the judge: why agreement is a trap (TPR/TNR mindset)
They warn that stakeholders will lose trust if judges aren’t validated against human labels. Simple accuracy/agreement can be misleading in imbalanced cases, so you should evaluate positive and negative performance separately (e.g., ability to catch true failures vs avoid false alarms).
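A small sketch of scoring a judge against human labels by true-positive and true-negative rate rather than raw agreement:

```python
def judge_quality(human_labels, judge_labels):
    """Report TPR and TNR separately instead of a single agreement number."""
    tp = sum(h and j for h, j in zip(human_labels, judge_labels))
    tn = sum((not h) and (not j) for h, j in zip(human_labels, judge_labels))
    positives = sum(human_labels)
    negatives = len(human_labels) - positives
    return {
        "tpr": tp / positives if positives else None,  # catches true failures
        "tnr": tn / negatives if negatives else None,  # avoids false alarms
    }

# A judge that always says "no failure" looks 90% "accurate" on this
# imbalanced sample but has a TPR of 0.
print(judge_quality([True] + [False] * 9, [False] * 10))
```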
Operating the eval suite: CI vs monitoring, sampling, and evolving data
They discuss what an end-state eval setup looks like in practice: a mix of lightweight code checks in CI and occasional LLM-powered monitoring on sampled production traces. Shreya notes evals must evolve with distribution shifts—new user cohorts and new document types create new failure modes.
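A rough sketch of the monitoring half, reusing the judge function sketched above and sampling a fraction of production traces; the sample rate and file name are arbitrary assumptions.

```python
import json
import random

# CI: run cheap deterministic checks (e.g., contains_markdown) on every build.
# Monitoring: run the pricier LLM judge on a random sample of production traces.
SAMPLE_RATE = 0.05  # judge ~5% of traces; tune to traffic and cost (assumption)

recent = [json.loads(line) for line in open("traces.jsonl")]
sampled = [t for t in recent if random.random() < SAMPLE_RATE]

# judge_handoff_failure is the binary judge sketched earlier.
flagged = [t for t in sampled if judge_handoff_failure(json.dumps(t["messages"]))]
print(f"{len(flagged)}/{len(sampled)} sampled traces flagged for handoff failure")
```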
Roles, workflows, and common mistakes: keep PMs in the loop
They outline collaboration patterns between PMs and AI engineers, emphasizing that PMs and domain experts should lead error analysis because it encodes product taste and becomes part of the product's moat. Key pitfalls include skipping error analysis, relying on vendor metrics, and outsourcing the core judgment work away from domain experts.