CHAPTERS
Why evals are a core AI product skill (and why “vibes” still count)
Aakash frames evals as one of the most important skills for building AI products, then Ankur argues that “vibe checks” are simply the earliest, non-scalable form of evaluation. The core issue is that LLM behavior is probabilistic, so teams need a repeatable feedback loop to improve quality over time.
LLMs are imperfect-but-capable: turning uncertainty into an engineering challenge
Ankur explains that teams often can’t tell whether failures are model limits or product/prompt shortcomings. Top builders assume imperfection and design around it, using evals to systematically convert ‘mystery behavior’ into measurable product work.
PMs as eval authors: the PRD evolves into a measurable test suite
The conversation shifts to product managers’ role in defining evals. Ankur argues the modern PRD is effectively an eval: a quantifiable artifact engineers can use to validate whether a system meets user needs—and a mechanism for PM leverage when a system ‘meets spec’ but still feels bad.
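To make the “PRD as eval” idea concrete, a single acceptance criterion can be written down as a runnable test case that engineers check on every change. The sketch below is a hypothetical Python illustration; the field names and wording are assumptions, not an artifact from the episode.

```python
# Hypothetical example: one PRD acceptance criterion expressed as an eval case.
# A PM can author cases like this; engineers run them against the system automatically.
prd_eval_case = {
    "requirement": "Users can ask about ticket status in plain language",
    "input": "What's the status of the checkout-bug ticket?",
    "expected_behavior": (
        "Looks up the ticket and answers with its current status; "
        "does not ask the user for an ID they already implied."
    ),
    "pass_criteria": [
        "Mentions the ticket's current workflow state",
        "Does not respond with a clarifying question",
    ],
}
```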
Claude Code controversy: “no evals” vs implicit evals in verticalized teams
Aakash raises the viral claim that Claude Code was built without evals and asks what it means for PM credibility. Ankur counters that internal feedback and iteration are still evaluation; the difference is whether you need a structured, shareable, multidisciplinary process.
When distance from end users grows, evals become your coordination mechanism
They generalize a key heuristic: the more organizational or domain distance between builders and end users, the more you need formalized evals. Braintrust customers also use evals as a “ledger” to communicate needs back to frontier labs.
Braintrust today: scale, usage growth, and why top AI companies invest in evals
Ankur shares Braintrust’s size and growth dynamics: more customers, more logged data, and rapidly increasing eval volume. He explains why companies like Zapier, Ramp, Airtable, Replit, and Vercel focus on evals: they must ship high-quality AI at production scale.
Why offline experimentation is exploding: from quarterly A/B tests to daily eval runs
Aakash contrasts old-school experimentation cadence with today’s pace. Ankur explains that evals let teams run many experiments offline on a laptop, iterating quickly without the cost and latency of production A/B testing.
Eval anatomy: data, task, and scores (and why normalization matters)
Ankur lays out a simple framework: an eval consists of a dataset (inputs and, optionally, ground truth), a task (an LLM call or agent workflow), and scorers that return values from 0 to 1. Normalizing scores to that range keeps results comparable over time as the system evolves.
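A minimal sketch of that anatomy, shaped like the Braintrust Python SDK quickstart (the exact signatures are an assumption; check the current docs, and note it expects an API key configured in your environment):

```python
# A minimal eval: a dataset (inputs plus optional expected outputs), a task,
# and scorers that return normalized 0-1 values.
from braintrust import Eval
from autoevals import Levenshtein

def answer_question(question: str) -> str:
    # Placeholder task: in a real eval this would call your LLM or agent workflow.
    return "I can help with Linear tickets, sprints, and projects."

Eval(
    "linear-qa-assistant",  # hypothetical project name
    data=lambda: [
        {"input": "What's the status of ticket ENG-123?", "expected": "ENG-123 is In Progress."},
    ],
    task=answer_question,
    scores=[Levenshtein],  # any scorer that outputs a normalized 0-1 value works here
)
```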
Live build: generating a dataset, running a baseline, and seeing failure clearly
They begin a fully live eval build for a Linear QA assistant. The initial dataset generation produces the wrong kinds of questions, so they refine it toward real workload queries, then run a baseline and confirm the outputs are unhelpful, turning a vibe check into quantified failure.
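One way to steer dataset generation toward workload-like queries is to prompt a model explicitly for questions that require real workspace data. The sketch below is a hypothetical illustration using the OpenAI Python client; the model name and prompt wording are assumptions, not the episode's exact setup.

```python
# Hypothetical sketch of synthesizing workload-like eval questions with an LLM.
from openai import OpenAI

client = OpenAI()

GENERATION_PROMPT = """You are helping build an eval dataset for a Linear QA assistant.
Write 10 questions a real teammate would ask about an existing Linear workspace,
e.g. ticket status, sprint progress, or who owns an issue.
Avoid generic trivia about what Linear is; every question should require
looking at actual workspace data to answer."""

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": GENERATION_PROMPT}],
)

candidate_questions = [
    line.strip(" -0123456789.")  # drop list markers the model may add
    for line in resp.choices[0].message.content.splitlines()
    if line.strip()
]
print(candidate_questions)
```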
Scoring design: categorical rubrics, avoiding fake precision, aligning with intuition
They create an LLM-based scoring function with clear criteria and a limited set of categories rather than arbitrary decimals. Ankur discusses why scores don't have to be binary, but the criteria should be crisp; they validate the scorer by confirming it outputs zeros on obviously bad responses.
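A hedged sketch of that style of scorer: the judge model picks one of a few named categories, and the code maps each category to a fixed 0-1 value instead of asking for an arbitrary decimal. The categories, prompt wording, and model name below are illustrative assumptions.

```python
# Hypothetical LLM-as-judge scorer with a categorical rubric mapped to fixed scores.
from openai import OpenAI

client = OpenAI()

RUBRIC = {"unhelpful": 0.0, "partially_helpful": 0.5, "fully_helpful": 1.0}

def helpfulness_scorer(question: str, answer: str) -> float:
    """Ask a judge model to pick exactly one category, then map it to a 0-1 score."""
    prompt = (
        "Classify the answer to the question as exactly one of: "
        "unhelpful, partially_helpful, fully_helpful.\n"
        "unhelpful: does not address the question or just lists capabilities.\n"
        "partially_helpful: addresses the question but is incomplete or uncited.\n"
        "fully_helpful: answers the question with specifics from the workspace.\n\n"
        f"Question: {question}\nAnswer: {answer}\n\n"
        "Respond with only the category name."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    category = resp.choices[0].message.content.strip().lower()
    return RUBRIC.get(category, 0.0)  # any unrecognized output counts as a failure
```

Running a scorer like this over a handful of obviously bad responses and confirming it returns zeros is a quick sanity check before trusting it.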
Adding MCP tooling (Linear): tool overload, prompt fixes, and iteration loops
They connect Braintrust to Linear via MCP and observe that the model still fails, often listing its capabilities instead of using its tools. They iterate on the system prompt instructions (don't ask clarifying questions; use the tools), adjust tool availability, and refine the scorer to better match real citations.
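The shape of that iteration can be sketched as two plain configuration choices: a system prompt that pushes the model to act through tools rather than describe itself, and a trimmed tool allowlist to reduce tool overload from a large MCP surface. The wording and tool names below are illustrative assumptions, not the episode's exact setup.

```python
# Hypothetical prompt/tooling iteration for the Linear QA assistant.
SYSTEM_PROMPT = (
    "You answer questions about this Linear workspace.\n"
    "Always use the available tools to look up real data before answering.\n"
    "Do not ask clarifying questions and do not list your capabilities; "
    "if details are missing, make a reasonable assumption and state it.\n"
    "Cite the specific issues or projects your answer is based on."
)

# Restricting the exposed tools keeps the model from drowning in options.
ALLOWED_TOOLS = ["list_issues", "get_issue", "search_issues"]
```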
Why you need evals that fail: tracking model progress and spotting benchmark illusions
Ankur emphasizes that failing evals are strategic: they reveal what's impossible today and where users struggle. When new models drop, rerun the failing evals to discover new capabilities or regressions, and stay skeptical of benchmark gains until they're validated against real data.
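A minimal sketch of that habit: keep the failing cases around and compare mean scores across model versions instead of trusting headline benchmarks. The functions below are stand-in stubs for your own task and scorer, and the model names are hypothetical.

```python
# Hypothetical sketch: rerun a failing eval set whenever a new model ships.
failing_cases = [
    {"input": "Which issues slipped out of the last cycle?"},
    {"input": "Who owns the open checkout-bug ticket?"},
]

def run_assistant(question: str, model: str) -> str:
    # Stub: replace with the real call to your assistant using the given model.
    return f"[{model}] I can help with Linear tickets and projects."

def score_answer(question: str, answer: str) -> float:
    # Stub: replace with your real 0-1 scorer (e.g., the categorical judge above).
    return 0.0

def mean_score(model: str) -> float:
    scores = [score_answer(c["input"], run_assistant(c["input"], model)) for c in failing_cases]
    return sum(scores) / len(scores)

for model in ["current-production-model", "newly-released-model"]:
    print(model, round(mean_score(model), 2))
```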
Offline vs online evals: deploying scorers to production logs to close the loop
They distinguish offline evals (golden-ish datasets) from online evals (running scorers on real user interactions). Online scoring reveals whether offline gains translate to production and provides a pipeline for harvesting low-scoring real examples back into the offline dataset.
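Closing that loop can be sketched as a small pipeline: run the same scorer used offline against production interactions, then pull the low-scoring ones back into the offline dataset. The functions and threshold below are stand-ins for your own logging, scorer, and judgment, not a documented API.

```python
# Hypothetical sketch of harvesting low-scoring production examples.
LOW_SCORE_THRESHOLD = 0.5

def fetch_recent_logs() -> list[dict]:
    # Stub: replace with a query against your production logs.
    return [{"question": "What's blocking the mobile release?", "answer": "I'm not sure."}]

def helpfulness_scorer(question: str, answer: str) -> float:
    # Stub: replace with the categorical LLM judge or another 0-1 scorer.
    return 0.0

def harvest_hard_examples(offline_dataset: list[dict]) -> list[dict]:
    for log in fetch_recent_logs():
        score = helpfulness_scorer(log["question"], log["answer"])
        if score < LOW_SCORE_THRESHOLD:
            # A case where offline gains didn't transfer; feed it back into the offline set.
            offline_dataset.append(
                {"input": log["question"], "metadata": {"source": "production", "score": score}}
            )
    return offline_dataset
```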
Maintaining an eval culture: rituals, shared ownership, and not treating evals as a gate
Ankur explains how teams keep evals trusted and used: integrate them into daily work rather than treating them as a late-stage shipping gate. The best teams review production examples regularly, update datasets to reflect new patterns, and iterate with evals as the steering wheel for priorities.
Wrap-up: where to learn more and why PMs should master evals
They close with resources to explore Braintrust and a practitioner conference, and Aakash reinforces the career angle: eval literacy is becoming a baseline PM skill. The episode positions evals as the durable moat and the practical method to ship reliable AI features.