Aakash Gupta

If You Don’t Understand AI Evals, Don’t Build AI

Aakash Gupta and Ankur Goyal on why evals are the durable moat behind reliable AI products.

Host: Aakash Gupta · Guest: Ankur Goyal
Mar 20, 2026 · 52m · Watch on YouTube ↗
Chapters:

  1. Why evals matter beyond “vibes”
  2. LLMs: imperfect, non-deterministic, rapidly changing
  3. Evals as the new PRD and the PM’s role
  4. Claude Code “no evals” controversy and what counts as evals
  5. Core eval components: data, task, scores
  6. Live build: Linear MCP + prompt iteration + scorer iteration
  7. Offline vs online evals and maintaining eval culture

In this episode, host Aakash Gupta and guest Ankur Goyal explore why evals are the durable moat behind reliable AI products: “vibe checks” are an early form of eval, but they stop scaling as usage, complexity, and stakeholders grow.

At a glance

WHAT IT’S REALLY ABOUT

Evals are the durable moat behind reliable AI products

  1. “Vibe checks” are an early form of eval, but they stop scaling as usage, complexity, and stakeholders grow.
  2. Because LLMs are capable yet imperfect and fast-changing, evals become a durable investment that outlives specific models, prompts, and agent frameworks.
  3. Product managers play a central role in defining evals, which Ankur frames as the modern, quantifiable evolution of the PRD.
  4. A live demo shows the core eval loop—data, task, scores—then iterating prompt, tool selection (MCP), and scoring criteria to improve measured performance.
  5. Offline evals validate changes quickly in a controlled dataset, while online evals score real production logs to detect gaps and feed new failing cases back into offline suites.

IDEAS WORTH REMEMBERING

7 ideas

Treat vibe checks as a starting eval, not the end state.

Manual qualitative testing is effectively your “brain as scoring function,” but it breaks once multiple people, higher stakes, and frequent changes demand repeatable, comparable measurement.

Your moat is the harness: evals, data, and feedback loops—not today’s prompt or model choice.

Models and agent stacks change quickly, but a well-constructed eval suite encodes user reality into durable artifacts that keep guiding iteration as components evolve.

PMs should own eval definitions the way they once owned PRDs.

Ankur argues evals operationalize product intent into quantifiable criteria; when something “passes” but still feels bad, it’s often the eval that must be updated—creating new PM leverage.

Build evals with a simple framework: data → task → scores.

Start with representative inputs (optionally with ground truth), define the generation process (prompt, model, agent, tools), and score outputs against clear criteria normalized to a consistent range (often categorical judgments mapped to 0–1).
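
For concreteness, here is a minimal sketch of that data → task → scores shape in plain Python. The dataset, the stubbed run_task, and the scoring rules are hypothetical placeholders rather than the stack from the demo; in a real eval the task would call your actual prompt, model, agent, and tools.

```python
# Minimal sketch of the data -> task -> scores eval loop.
# Dataset, task, and scorer are hypothetical placeholders.

# Data: representative inputs, optionally with ground truth ("expected").
dataset = [
    {"input": "Summarize issue LIN-42 in one sentence.", "expected": "summary"},
    {"input": "List the open bugs assigned to me.", "expected": "bugs"},
]

def run_task(input_text: str) -> str:
    """Task: the generation process under test, stubbed so the sketch runs offline."""
    return f"[model output for] {input_text}"

def score(output: str, expected: str) -> float:
    """Score: map a categorical judgment onto a consistent 0-1 range."""
    if not output.strip():
        return 0.0                      # empty answer
    if expected.lower() in output.lower():
        return 1.0                      # mentions the expected content
    return 0.5                          # produced something, but off target

results = [score(run_task(case["input"]), case["expected"]) for case in dataset]
print(f"avg score: {sum(results) / len(results):.2f} over {len(results)} cases")
```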

Iterate on the dataset and the scorer, not just the prompt.

The demo highlights that failures can come from weak test questions, overly harsh scoring rules (e.g., citations), or tool overload—improvement often requires adjusting all three eval components.
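
As a hypothetical illustration of scorer iteration, the sketch below contrasts an all-or-nothing citation rule with a version that reports answer quality and citation presence as separate dimensions on the 0–1 scale; the rules, strings, and citation marker are assumptions for illustration, not the scorer from the demo.

```python
# Hypothetical illustration of iterating on the scorer, not just the prompt.
# v1 is all-or-nothing on citations; v2 splits answer quality and citation
# presence into separate dimensions, each mapped to the 0-1 range.

def score_v1(output: str) -> float:
    # Overly harsh: a correct answer with no citation still scores 0.
    return 1.0 if "[source:" in output else 0.0

def score_v2(output: str, expected_keyword: str) -> dict:
    # Separate dimensions make it visible *why* a case fails.
    return {
        "answer_quality": 1.0 if expected_keyword.lower() in output.lower() else 0.0,
        "has_citation": 1.0 if "[source:" in output else 0.0,
    }

output = "The login outage was fixed in release 2.3."
print(score_v1(output))                 # 0.0 -- reads as a total failure
print(score_v2(output, "release 2.3"))  # quality 1.0, citation 0.0
```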

Have evals that fail on purpose to track what’s impossible today.

If everything passes, you’re blind to user pain and to model progress; failing evals become the first suite you rerun when new models ship to spot meaningful capability jumps.

Use online evals to detect real-world drift and refill your offline ‘golden set.’

Running the same scorer on production logs reveals when offline success doesn’t translate, and it surfaces concrete failing examples you can add back into the offline dataset to harden coverage.
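
A rough sketch of that feedback loop, under assumed log shapes and a placeholder reference-free check (a real setup might use an LLM judge or task-specific heuristics): score production logs with the online scorer and queue low-scoring cases as candidates for the offline golden set.

```python
# Sketch of the online -> offline feedback loop; the log format is an assumption.

def online_score(output: str) -> float:
    """Reference-free check run on production outputs (placeholder heuristic)."""
    refused = "i'm not sure" in output.lower()
    return 0.0 if refused or not output.strip() else 1.0

production_logs = [
    {"input": "Create a ticket for the login outage",
     "output": "Created LIN-101: Login outage"},
    {"input": "What changed in the billing service last week?",
     "output": "I'm not sure."},
]

offline_candidates = []
for log in production_logs:
    if online_score(log["output"]) < 1.0:
        # A real failure from production becomes a candidate offline case,
        # to be reviewed and given an expected answer before joining the golden set.
        offline_candidates.append({"input": log["input"], "expected": None})

print(f"{len(offline_candidates)} failing case(s) queued for the offline suite")
```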

WORDS WORTH SAVING

5 quotes

I think vibe checks are evals.

Ankur Goyal

I think the modern PRD is an eval.

Ankur Goyal

LLMs are imperfect, yet very capable.

Ankur Goyal

One of the most important things is to have evals that fail.

Ankur Goyal

If you believe that the way that you've wired together your agent today is your differentiator, you're… highly likely to fail.

Ankur Goyal

QUESTIONS ANSWERED IN THIS EPISODE

5 questions

In the Linear MCP demo, what were the specific prompt changes (wording + structure) that most improved tool-use behavior and reduced clarifying questions?

How should teams decide when a scoring dimension should be binary versus categorical versus truly continuous (and what failure modes appear with each)?

What’s your recommended minimum viable offline eval dataset size before teams should trust it for iteration, and how does that change by domain risk (e.g., fintech vs consumer)?

The Claude Code controversy: where do you draw the line between “informal evals” (internal dogfooding) and “eval discipline” that actually scales across orgs?

How do you prevent “eval gaming,” where teams optimize for the scorer rather than user value—especially when the scorer is LLM-based?
