At a glance
WHAT IT’S REALLY ABOUT
Evals are the durable moat behind reliable AI products
- “Vibe checks” are an early form of eval, but they stop scaling as usage, complexity, and stakeholders grow.
- Because LLMs are capable yet imperfect and fast-changing, evals become a durable investment that outlives specific models, prompts, and agent frameworks.
- Product managers play a central role in defining evals, which Ankur frames as the modern, quantifiable evolution of the PRD.
- A live demo walks through the core eval loop (data, task, scores), then iterates on the prompt, tool selection (MCP), and scoring criteria to improve measured performance.
- Offline evals validate changes quickly in a controlled dataset, while online evals score real production logs to detect gaps and feed new failing cases back into offline suites.
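The online-to-offline feedback loop described above can be sketched in a few lines of Python. All names here (`harvest_failures`, `FAIL_THRESHOLD`, the log schema) are illustrative assumptions, not any particular eval tool's API; real systems would score production logs asynchronously and route failures through human review.

```python
# Sketch of the online -> offline feedback loop (illustrative names only):
# score real production logs, then promote failing cases into the offline suite.

FAIL_THRESHOLD = 0.5  # assumed cutoff; scores are normalized to [0, 1]

def harvest_failures(production_logs, scorer, offline_dataset):
    """Append low-scoring production cases to the offline eval dataset."""
    for log in production_logs:
        score = scorer(log["input"], log["output"])
        if score < FAIL_THRESHOLD:
            offline_dataset.append({"input": log["input"],
                                    "expected": None,  # to be labeled by a human
                                    "source": "production"})
    return offline_dataset

# Toy usage: a scorer that flags empty or very short answers as failures.
logs = [{"input": "q1", "output": "a detailed answer"},
        {"input": "q2", "output": ""}]
scorer = lambda inp, out: 1.0 if len(out) > 5 else 0.0
suite = harvest_failures(logs, scorer, offline_dataset=[])
```

The point of the sketch is the direction of flow: production logs are scored online, and anything that scores poorly becomes a new failing case in the offline suite, which is exactly what keeps the offline dataset representative.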
IDEAS WORTH REMEMBERING
5 ideas
Treat vibe checks as a starting eval, not the end state.
Manual qualitative testing is effectively your “brain as scoring function,” but it breaks once multiple people, higher stakes, and frequent changes demand repeatable, comparable measurement.
Your moat is the harness: evals, data, and feedback loops—not today’s prompt or model choice.
Models and agent stacks change quickly, but a well-constructed eval suite encodes user reality into durable artifacts that keep guiding iteration as components evolve.
PMs should own eval definitions the way they once owned PRDs.
Ankur argues evals operationalize product intent into quantifiable criteria; when something “passes” but still feels bad, it’s often the eval that must be updated—creating new PM leverage.
Build evals with a simple framework: data → task → scores.
Start with representative inputs (optionally with ground truth), define the generation process (prompt, model, agent, tools), and score outputs against clear criteria normalized to a consistent range (often categorical labels mapped onto 0–1).
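The data → task → scores framework above can be expressed as a minimal harness. Everything below is a hypothetical sketch (the function names, row schema, and categorical mapping are assumptions, not a real library's API):

```python
# Minimal eval loop: data -> task -> scores.
# All names are illustrative; real eval tooling adds logging, concurrency,
# and run-to-run diffing on top of this shape.

CATEGORY_SCORES = {"good": 1.0, "partial": 0.5, "bad": 0.0}  # categorical -> 0-1

def run_eval(dataset, task, scorers):
    """dataset: list of {"input": ..., "expected": ...} rows.
    task: callable producing an output for an input (prompt/model/agent/tools).
    scorers: dict of name -> callable(input, output, expected) -> float in [0, 1]."""
    results = []
    for row in dataset:
        output = task(row["input"])
        scores = {name: fn(row["input"], output, row.get("expected"))
                  for name, fn in scorers.items()}
        results.append({"input": row["input"], "output": output, "scores": scores})
    return results

# Toy example: an exact-match scorer normalized via the categorical mapping.
def exact_match(inp, out, expected):
    return CATEGORY_SCORES["good" if out == expected else "bad"]

dataset = [{"input": "2+2", "expected": "4"}, {"input": "3+3", "expected": "6"}]
task = lambda q: str(eval(q))  # stand-in for a model call, not for production use
results = run_eval(dataset, task, {"exact_match": exact_match})
```

Keeping the three components separate is what makes the later point actionable: you can swap the dataset, the task, or the scorer independently when iterating.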
Iterate on the dataset and the scorer, not just the prompt.
The demo highlights that failures can come from weak test questions, overly harsh scoring rules (e.g., citations), or tool overload—improvement often requires adjusting all three eval components.
WORDS WORTH SAVING
5 quotes
I think vibe checks are evals.
— Ankur Goyal
I think the modern PRD is an eval.
— Ankur Goyal
LLMs are imperfect, yet very capable.
— Ankur Goyal
One of the most important things is to have evals that fail.
— Ankur Goyal
If you believe that the way that you've wired together your agent today is your differentiator, you're… highly likely to fail.
— Ankur Goyal
High quality AI-generated summary created from speaker-labeled transcript.