Aakash Gupta

If You Don’t Understand AI Evals, Don’t Build AI

Ankur Goyal is the Founder and CEO of Braintrust, the AI eval platform used by Replit, Vercel, Airtable, Ramp, Zapier, and Notion, valued at $800 million. In this episode, we break down why evals are the new PRD, build an eval from scratch using Linear's MCP server, and walk through the data-task-scores framework every PM needs to master.

Full Writeup: https://www.news.aakashg.com/p/ankur-goyal-podcast
Transcript: https://www.aakashg.com/ankur-goyal-podcast/

---

Timestamps:
0:00 - Intro
1:43 - Why should anyone care about evals
3:21 - LLMs are imperfect yet capable
6:35 - The role of the PM in defining evals
8:45 - The Claude Code evals controversy
11:34 - Ads
13:05 - Distance from the end user determines eval need
14:27 - How big is Braintrust today
18:48 - Building an eval from scratch (live demo)
20:20 - Ads
22:15 - Creating the data set and scoring function
30:20 - Ads
33:01 - Iterating on prompt and MCP tools
39:12 - Why you need evals that fail
43:36 - Offline vs online evals
47:40 - How to maintain eval culture
50:00 - Outro

---

🏆 Thanks to our sponsors:
1. Kameleoon: Leading AI experimentation platform - http://www.kameleoon.com/
2. Testkube: Leading test orchestration platform - http://testkube.io/
3. Pendo: The #1 software experience management platform - http://www.pendo.io/aakash
4. Bolt: Ship AI-powered products 10x faster - https://bolt.new/solutions/product-manager?utm_source=Promoted&utm_medium=email&utm_campaign=aakash-product-growth
5. Product Faculty: Get $550 off their #1 AI PM Certification with my link - https://maven.com/product-faculty/ai-product-management-certification?promoCode=AAKASH550C7

---

Key Takeaways:
1. Vibe checks are evals - When you look at an AI output and intuit whether it is good or bad, you are using your brain as a scoring function. It is evaluation. It just does not scale past one person and a handful of examples.
2. Every eval has three parts - Data (a set of inputs), Task (generates an output), and Scores (rates the output between 0 and 1). That normalization forces comparability across time.
3. Evals are the new PRD - In 2015, a PRD was an unstructured document nobody followed. In 2026, the modern PRD is an eval the whole team can run to quantify product quality.
4. Start with imperfect data - Auto-generate test questions with a model. Do not spend a month building a golden data set. Jump in and iterate from your first experiment.
5. The distance principle - The farther you are from the end user, the more critical evals become. Anthropic can vibe check Claude Code because engineers are the users. Healthcare AI teams cannot.
6. Use categorical scoring, not freeform numbers - Give the scorer three clear options (full answer, partial, no answer) instead of asking an LLM to produce an arbitrary number.
7. Evals compound, prompts do not - Models and frameworks change every few months. If you encode what your users need as evals, that investment survives every model swap.
8. Have evals that fail - If everything passes, you have blind spots. Keep failing evals as a roadmap and rerun them every time a new model drops.
9. Build the offline-to-online flywheel - Offline evals test your hypothesis. Online evals run the same scorers on production logs. The gap between them is your improvement roadmap.
10. The best teams review production logs every morning - They find novel patterns, add them to the data set, and iterate all day. That morning ritual is what separates teams that ship blind from teams that ship with confidence.
---

👨‍💻 Where to find Ankur Goyal:
LinkedIn: https://www.linkedin.com/in/ankrgyl/
Braintrust: https://www.braintrust.dev
X: https://x.com/ankrgyl

👨‍💻 Where to find Aakash:
Twitter: https://x.com/aakashgupta
LinkedIn: https://www.linkedin.com/in/aakashgupta/
Newsletter: https://www.news.aakashg.com

#aievals #aipm

---

🧠 About Product Growth:
The world's largest podcast focused solely on product + growth, with over 200K listeners.

🔔 Subscribe and turn on notifications to get more videos like this.

Aakash Gupta (host) · Ankur Goyal (guest)
Mar 20, 2026 · 52m · Watch on YouTube ↗

CHAPTERS

  1. Why evals are a core AI product skill (and why “vibes” still count)

    Aakash frames evals as one of the most important skills for building AI products, then Ankur argues that “vibe checks” are simply the earliest, non-scalable form of evaluation. The core issue is that LLM behavior is probabilistic, so teams need a repeatable feedback loop to improve quality over time.

  2. LLMs are imperfect-but-capable: turning uncertainty into an engineering challenge

    Ankur explains that teams often can’t tell whether failures are model limits or product/prompt shortcomings. Top builders assume imperfection and design around it, using evals to systematically convert ‘mystery behavior’ into measurable product work.

  3. PMs as eval authors: the PRD evolves into a measurable test suite

    The conversation shifts to product managers’ role in defining evals. Ankur argues the modern PRD is effectively an eval: a quantifiable artifact engineers can use to validate whether a system meets user needs—and a mechanism for PM leverage when a system ‘meets spec’ but still feels bad.

  4. Claude Code controversy: “no evals” vs implicit evals in verticalized teams

    Aakash raises the viral claim that Claude Code was built without evals and asks what it means for PM credibility. Ankur counters that internal feedback and iteration are still evaluation; the difference is whether you need a structured, shareable, multidisciplinary process.

  5. When distance from end users grows, evals become your coordination mechanism

    They generalize a key heuristic: the greater the organizational or domain distance between builders and end users, the more you need formalized evals. Braintrust customers also use evals as a “ledger” to communicate needs back to frontier labs.

  6. Braintrust today: scale, usage growth, and why top AI companies invest in evals

    Ankur shares Braintrust’s size and growth dynamics: more customers, more logged data, and rapidly increasing eval volume. He explains why companies like Zapier, Ramp, Airtable, Replit, and Vercel focus on evals: they must ship high-quality AI at production scale.

  7. Why offline experimentation is exploding: from quarterly A/B tests to daily eval runs

    Aakash contrasts old-school experimentation cadence with today’s pace. Ankur explains that evals let teams run many experiments offline on a laptop, iterating quickly without the cost and latency of production A/B testing.

  8. Eval anatomy: data, task, and scores (and why normalization matters)

    Ankur lays out a simple framework: an eval consists of a dataset (inputs, optionally with ground truth), a task (an LLM call or agent workflow), and scorers that rate each output between 0 and 1. Normalizing scores keeps results comparable over time as the system evolves.
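
    A minimal sketch of that data-task-scores loop in plain Python. The Linear-assistant dataset, toy task, and keyword scorer below are illustrative stand-ins under assumed names, not Braintrust's SDK:

    # Data: a set of inputs, optionally paired with an expected answer.
    dataset = [
        {"input": "Which Linear issues are blocked on design review?",
         "expected": "design review"},
        {"input": "Summarize open bugs in the mobile project.",
         "expected": "mobile"},
    ]

    def task(question: str) -> str:
        # Task: in a real system this is an LLM call or agent workflow.
        return f"I can help you query Linear. You asked: {question}"

    def scorer(output: str, expected: str) -> float:
        # Scores: always normalized to 0-1 so runs stay comparable over time.
        return 1.0 if expected.lower() in output.lower() else 0.0

    results = [scorer(task(ex["input"]), ex["expected"]) for ex in dataset]
    print(f"avg score: {sum(results) / len(results):.2f} over {len(results)} examples")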

  9. Live build: generating a dataset, running a baseline, and seeing failure clearly

    They begin a fully live eval build for a Linear QA assistant. The initial dataset generation produces the wrong kinds of questions, so they refine it toward real workload queries, then run a baseline and confirm the outputs are unhelpful—turning a vibe check into quantified failure.
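
    A hedged sketch of that dataset-generation step. generate() is a hypothetical placeholder for whichever model call you use; the point is iterating the generation prompt from generic questions toward realistic workload queries:

    # generate(prompt) is a placeholder for an LLM call; the two prompts show
    # the refinement from generic questions to realistic workload queries.
    NAIVE_PROMPT = "Write 20 questions about Linear."

    REFINED_PROMPT = (
        "Write 20 questions a project manager would ask an assistant connected "
        "to their Linear workspace, e.g. about issue status, assignees, blocked "
        "work, and cycle progress. One question per line, no numbering."
    )

    def generate(prompt: str) -> str:
        # Placeholder output; swap in your model client of choice.
        return (
            "Which issues in the current cycle have no assignee?\n"
            "What is blocking LIN-142?"
        )

    dataset = [{"input": q} for q in generate(REFINED_PROMPT).splitlines() if q.strip()]
    print(dataset)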

  10. Scoring design: categorical rubrics, avoiding fake precision, aligning with intuition

    They create an LLM-based scoring function with clear criteria and limited categories rather than arbitrary decimals. Ankur discusses why binary isn’t always necessary, but criteria should be crisp; they validate the scorer by confirming it outputs zeros on obviously bad responses.
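
    One way to express that categorical rubric, sketched with a hypothetical judge() standing in for the LLM-as-judge call:

    # The rubric constrains the judge to three labels; each label maps to a
    # fixed score rather than letting the model invent arbitrary decimals.
    RUBRIC = """Compare the response to the question and answer with one word:
    - FULL    if the response fully answers the question
    - PARTIAL if it answers only part of the question
    - NONE    if it does not answer the question"""

    CATEGORY_SCORES = {"FULL": 1.0, "PARTIAL": 0.5, "NONE": 0.0}

    def judge(rubric: str, question: str, response: str) -> str:
        # Placeholder for a model call constrained to the three labels.
        return "NONE" if "I can help" in response else "FULL"

    def categorical_scorer(question: str, response: str) -> float:
        label = judge(RUBRIC, question, response).strip().upper()
        # Unknown labels fall back to 0 instead of a made-up number.
        return CATEGORY_SCORES.get(label, 0.0)

    # Sanity-check the scorer the way they do in the episode: an obviously
    # unhelpful response should come back as 0.
    print(categorical_scorer("Which issues are blocked?", "I can help you query Linear."))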

  11. Adding MCP tooling (Linear): tool overload, prompt fixes, and iteration loops

    They connect Braintrust to Linear via MCP and observe the model still fails—often listing capabilities instead of using tools. They iterate on system prompt instructions (don’t ask clarifying questions; use tools), adjust tool availability, and refine the scorer to better match real citations.
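
    The iteration in that chapter comes down to two knobs, sketched schematically below; the tool names are hypothetical, and wiring them up to Linear's MCP server is left to whichever client you use:

    # Knob 1: system prompt instructions that push the model to act.
    SYSTEM_PROMPT = (
        "You are a QA assistant for our Linear workspace. "
        "Use the available tools to answer. Do not ask clarifying questions "
        "and do not describe your capabilities instead of answering. "
        "Cite the issues you relied on."
    )

    # Knob 2: expose a focused subset of tools rather than everything the MCP
    # server offers, so the model is less likely to drown in options.
    ALLOWED_TOOLS = ["list_issues", "get_issue", "search_issues"]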

  12. Why you need evals that fail: tracking model progress and spotting benchmark illusions

    Ankur emphasizes that failing evals are strategic—they reveal what’s impossible today and where users struggle. When new models drop, rerun your failing evals to discover new capabilities or regressions, and stay skeptical of benchmark gains until they are validated with real data.
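
    A sketch of the “rerun failing evals on every new model” habit. run_eval() is a hypothetical helper wrapping the data-task-scores loop above, and both the model names and the hard-coded scores are placeholders:

    def run_eval(model: str, dataset: list[dict]) -> float:
        # Placeholder: run the task with `model`, score every example, average.
        # The numbers below are illustrative, not real results.
        return {"last-gen-model": 0.35, "new-model": 0.62}.get(model, 0.0)

    FAILING_SET = [{"input": "What is blocking LIN-142?"}]

    baseline = run_eval("last-gen-model", FAILING_SET)
    candidate = run_eval("new-model", FAILING_SET)

    if candidate > baseline:
        print(f"failing set moved {baseline:.2f} -> {candidate:.2f}; "
              "validate against real data before trusting the jump")
    else:
        print("still failing; keep it on the roadmap")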

  13. Offline vs online evals: deploying scorers to production logs to close the loop

    They distinguish offline evals (golden-ish datasets) from online evals (running scorers on real user interactions). Online scoring reveals whether offline gains translate to production and provides a pipeline for harvesting low-scoring real examples back into the offline dataset.
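
    A sketch of closing that loop: run the same scorer over production logs and harvest the low scorers back into the offline dataset. The production_logs list, the inline scorer, and the 0.5 threshold are all illustrative assumptions:

    # Sampled production interactions (illustrative).
    production_logs = [
        {"question": "Which issues slipped out of the last cycle?",
         "response": "I can help you query Linear."},
        {"question": "Who owns LIN-142?",
         "response": "LIN-142 is assigned to Priya (source: LIN-142)."},
    ]

    def score_log(log: dict) -> float:
        # Stand-in for the same categorical scorer used offline; reusing one
        # scorer in both places is what keeps offline and online numbers comparable.
        return 0.0 if "I can help" in log["response"] else 1.0

    offline_dataset = []
    for log in production_logs:
        if score_log(log) < 0.5:
            # Low-scoring real examples become tomorrow's offline test cases.
            offline_dataset.append({"input": log["question"]})

    print(f"harvested {len(offline_dataset)} new offline examples")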

  14. Maintaining an eval culture: rituals, shared ownership, and not treating evals as a gate

    Ankur explains how teams keep evals trusted and used: integrate them into daily work rather than treating them as a late-stage shipping gate. The best teams review production examples regularly, update datasets to reflect new patterns, and iterate with evals as the steering wheel for priorities.

  15. Wrap-up: where to learn more and why PMs should master evals

    They close with resources to explore Braintrust and a practitioner conference, and Aakash reinforces the career angle: eval literacy is becoming a baseline PM skill. The episode positions evals as the durable moat and the practical method to ship reliable AI features.
