Aakash Gupta

How to Build AI Evals in 2026 (Step-by-Step, No Hype)

Aakash Gupta and Hamel Husain on step-by-step evals workflow: traces, error analysis, and LLM judges.

Host: Aakash Gupta · Guests: Hamel Husain, Shreya Shankar
Jan 15, 2026 · 1h 7m · Watch on YouTube ↗
Why evals are necessary beyond demos and vibe checks
Tracing and observability (tools vs DIY logging)
Open coding: fast note-taking on trace failures
Axial coding: categorizing issues into actionable buckets
Counting and prioritization with pivot tables and subcategories
Code-based evals vs LLM-as-judge evals
Validating judges: TPR/TNR vs accuracy/overall agreement
PM vs AI engineer responsibilities and prompt ownership
Common mistakes: skipping error analysis, outsourcing judgment, generic metrics

In this episode, host Aakash Gupta is joined by Hamel Husain and Shreya Shankar to walk through a step-by-step evals workflow: traces, error analysis, and LLM judges. The speakers argue that most real AI products need evals, and that “no evals” claims often rely on upstream testing or informal dogfooding rather than rigorous measurement.

At a glance

WHAT IT’S REALLY ABOUT

Step-by-step evals workflow: traces, error analysis, and LLM judges

  1. The speakers argue that most real AI products need evals, and “no evals” claims often rely on upstream testing or informal dogfooding rather than rigorous measurement.
  2. They demonstrate starting with observability by collecting and reviewing real production traces (even via simple logging) to see what users actually experience beyond polished demos.
  3. They emphasize error analysis as the core leverage point: manually “open code” trace issues, then “axial code” them into actionable categories and count frequency to prioritize work.
  4. They show how to turn high-impact error categories into automated evaluators, including code-based checks for objective issues and LLM-as-judge prompts for subjective product failures.
  5. They stress that LLM judges must be validated against human labels using metrics beyond simple agreement (e.g., true positive/true negative rates) to avoid misleading confidence and stakeholder mistrust.

IDEAS WORTH REMEMBERING

7 ideas

Start with traces, not abstract metrics.

Review real production conversations (including tool calls/RAG/multi-turn) to see the messy failures that “helpfulness” scores and generic dashboards routinely miss.

You don’t need fancy observability to begin.

An observability platform can help, but logging to CSV/JSON/DataDog is sufficient if you can reliably inspect traces and attach notes to them.
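
As a minimal sketch of the DIY approach, assuming a chat-style app (the `log_trace` helper, field names, and file path are illustrative, not from the episode):

```python
import json
import time
import uuid

TRACE_LOG = "traces.jsonl"  # illustrative path; any durable, inspectable store works

def log_trace(user_input: str, model_output: str, tool_calls: list | None = None) -> str:
    """Append one trace as a JSON line so it can be inspected and annotated later."""
    trace = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user_input": user_input,
        "model_output": model_output,
        "tool_calls": tool_calls or [],
        "notes": "",  # left empty; filled in later during open coding
    }
    with open(TRACE_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(trace) + "\n")
    return trace["trace_id"]
```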

Do “open coding” quickly to build intuition and a dataset.

Scan ~100 traces and write brief notes on what went wrong (or skip if fine) without debating root cause; speed and coverage matter more than perfection.
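
A sketch of that sampling step, assuming traces are stored as JSON lines as in the logging sketch above (field names are illustrative); notes collected this way should be written back to the store or a spreadsheet:

```python
import json
import random

def sample_for_open_coding(path: str = "traces.jsonl", n: int = 100) -> list[dict]:
    """Draw a random sample of traces to read and annotate with brief notes."""
    with open(path, encoding="utf-8") as f:
        traces = [json.loads(line) for line in f]
    return random.sample(traces, min(n, len(traces)))

for trace in sample_for_open_coding():
    print(trace["user_input"][:120])
    trace["notes"] = input("What went wrong? (leave blank if fine) > ")
```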

Convert notes into categories (axial codes) that are specific and labelable.

Vague buckets like “quality” or “temporal issues” don’t help teams label consistently; use concrete, actionable categories (and a “none of the above” option) to discover missing buckets.
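
One way to make the categories explicit is a small, fixed schema that labelers cannot drift from; the bucket names below are illustrative, loosely echoing failure modes mentioned in the episode:

```python
from enum import Enum

class FailureCategory(str, Enum):
    """Axial codes: concrete, labelable failure buckets (illustrative names)."""
    HUMAN_HANDOFF_MISSED = "should have handed off to a human"
    FORMATTING_ARTIFACT = "markdown or other artifacts in channel output"
    SCHEDULING_CONFLICT = "double-booked or conflicting tour times"
    VIRTUAL_TOUR_ERROR = "mishandled a virtual tour request"
    NONE_OF_THE_ABOVE = "real failure, but fits no existing bucket"
```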

Count issues to escape prioritization paralysis.

Once errors are categorized, pivot tables (and optional hierarchical subcategories) reveal the most frequent failure modes and enable PM-driven prioritization by frequency and severity.
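
A minimal pandas sketch of the counting step, assuming each label lives in a flat table (column names and data are illustrative):

```python
import pandas as pd

# Illustrative labeled traces: one row per (trace, category) label
labels = pd.DataFrame({
    "trace_id": ["t1", "t2", "t3", "t4", "t5", "t6"],
    "category": [
        "formatting_artifact", "human_handoff_missed", "formatting_artifact",
        "scheduling_conflict", "formatting_artifact", "human_handoff_missed",
    ],
})

# Count failures per category, most frequent first: the prioritization view
counts = labels.pivot_table(index="category", values="trace_id", aggfunc="count")
print(counts.sort_values("trace_id", ascending=False))
```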

Choose the cheapest evaluator that can work.

Use code-based evals for objective checks (e.g., markdown artifacts in SMS), and reserve LLM-as-judge evaluators for subjective judgments (e.g., whether a human handoff was warranted).
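
The markdown-in-SMS case lends itself to a pure-code check. A hedged sketch (the exact patterns are assumptions, not from the episode):

```python
import re

# Markdown constructs that should not appear in a plain-text SMS (illustrative set)
MARKDOWN_ARTIFACTS = [
    r"\*\*[^*]+\*\*",        # bold
    r"(?m)^#{1,6}\s",        # headings
    r"\[[^\]]+\]\([^)]+\)",  # links
    r"(?m)^\s*[*-]\s",       # bullet lists
]

def sms_is_clean(message: str) -> bool:
    """Code-based eval: False if the SMS contains markdown artifacts."""
    return not any(re.search(p, message) for p in MARKDOWN_ARTIFACTS)

assert sms_is_clean("Your tour is confirmed for 3pm Tuesday.")
assert not sms_is_clean("**Confirmed!** See [details](https://example.com)")
```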

LLM judges should be binary and must be measured against human labels.

Binary outputs (true/false) are easier to align than 1–5 scales, and validation should examine positives and negatives separately (TPR/TNR) rather than relying on overall agreement, which can be gamed by always predicting the majority class.
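
A minimal sketch of that validation, assuming the judge's binary outputs and the human labels are aligned lists (True = failure present):

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Compare a binary LLM judge against human labels, reporting TPR and TNR
    separately, since overall accuracy alone can be gamed by always
    predicting the majority class."""
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum(not h and not j for h, j in zip(human, judge))
    pos = sum(human)
    neg = len(human) - pos
    return {
        "tpr": tp / pos if pos else float("nan"),  # share of real failures caught
        "tnr": tn / neg if neg else float("nan"),  # share of good traces passed
        "accuracy": (tp + tn) / len(human),
    }

# A judge that always answers "no failure" scores 80% accuracy here, but its TPR is 0.0
print(judge_agreement(human=[True, False, False, False, False], judge=[False] * 5))
```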

WORDS WORTH SAVING

5 quotes

“This is what your AI agents are actually doing out there in production.”

Aakash Gupta

“If you try to put helpfulness score… it’s not gonna catch stuff like this very well at all.”

Hamel Husain

“ChatGPT will say, ‘Yeah, absolutely,’ but it will miss all of this nuance.”

Shreya Shankar

“The main thing that’s inhibiting people is not doing the error analysis.”

Hamel Husain

“It’s almost a tragedy to separate the prompt from the product manager ’cause it’s English.”

Hamel Husain

QUESTIONS ANSWERED IN THIS EPISODE

5 questions

In the Nurture Boss examples, which failures were highest severity even if low frequency (e.g., double-booking tours), and how would you weight severity vs counts in prioritization?

The speakers argue that most real AI products need evals, and “no evals” claims often rely on upstream testing or informal dogfooding rather than rigorous measurement.

What’s your recommended minimum trace sample size for initial open coding, and how does it change for highly seasonal or multi-channel products (SMS/voice/chat)?

They demonstrate starting with observability by collecting and reviewing real production traces (even via simple logging) to see what users actually experience beyond polished demos.

How do you design axial code categories so multiple reviewers label consistently—what tests or “definition of done” do you apply before scaling labeling?

They emphasize error analysis as the core leverage point: manually “open code” trace issues, then “axial code” them into actionable categories and count frequency to prioritize work.

For the ‘virtual tour’ failure, would you fix it via prompt constraints, tool schema/guardrails, or product UX changes—and how would evals detect regression across all three?

They show how to turn high-impact error categories into automated evaluators, including code-based checks for objective issues and LLM-as-judge prompts for subjective product failures.

When building an LLM judge, what’s your concrete workflow for iterating the rubric (including adding examples) without overfitting to the labeled traces?

They stress that LLM judges must be validated against human labels using metrics beyond simple agreement (e.g., true positive/true negative rates) to avoid misleading confidence and stakeholder mistrust.
