
If You Don’t Understand AI Evals, Don’t Build AI

Ankur Goyal is the Founder and CEO of Braintrust, the $800 million AI eval platform used by Replit, Vercel, Airtable, Ramp, Zapier, and Notion. In this episode, we break down why evals are the new PRD, build an eval from scratch using Linear's MCP server, and walk through the data-task-scores framework every PM needs to master.

Full Writeup: https://www.news.aakashg.com/p/ankur-goyal-podcast
Transcript: https://www.aakashg.com/ankur-goyal-podcast/

---

Timestamps:
0:00 - Intro
1:43 - Why should anyone care about evals
3:21 - LLMs are imperfect yet capable
6:35 - The role of the PM in defining evals
8:45 - The Claude Code evals controversy
11:34 - Ads
13:05 - Distance from the end user determines eval need
14:27 - How big is Braintrust today
18:48 - Building an eval from scratch (live demo)
20:20 - Ads
22:15 - Creating the data set and scoring function
30:20 - Ads
33:01 - Iterating on prompt and MCP tools
39:12 - Why you need evals that fail
43:36 - Offline vs online evals
47:40 - How to maintain eval culture
50:00 - Outro

---

🏆 Thanks to our sponsors:
1. Kameleoon: Leading AI experimentation platform - http://www.kameleoon.com/
2. Testkube: Leading test orchestration platform - http://testkube.io/
3. Pendo: The #1 software experience management platform - http://www.pendo.io/aakash
4. Bolt: Ship AI-powered products 10x faster - https://bolt.new/solutions/product-manager?utm_source=Promoted&utm_medium=email&utm_campaign=aakash-product-growth
5. Product Faculty: Get $550 off their #1 AI PM Certification with my link - https://maven.com/product-faculty/ai-product-management-certification?promoCode=AAKASH550C7

---

Key Takeaways:
1. Vibe checks are evals - When you look at an AI output and intuit whether it is good or bad, you are using your brain as a scoring function. That is evaluation. It just does not scale past one person and a handful of examples.
2. Every eval has three parts - Data (a set of inputs), Task (generates an output), and Scores (rate the output between 0 and 1). That normalization forces comparability across time.
3. Evals are the new PRD - In 2015, a PRD was an unstructured document nobody followed. In 2026, the modern PRD is an eval the whole team can run to quantify product quality.
4. Start with imperfect data - Auto-generate test questions with a model. Do not spend a month building a golden data set. Jump in and iterate from your first experiment.
5. The distance principle - The farther you are from the end user, the more critical evals become. Anthropic can vibe check Claude Code because engineers are the users. Healthcare AI teams cannot.
6. Use categorical scoring, not freeform numbers - Give the scorer three clear options (full answer, partial, no answer) instead of asking an LLM to produce an arbitrary number.
7. Evals compound, prompts do not - Models and frameworks change every few months. If you encode what your users need as evals, that investment survives every model swap.
8. Have evals that fail - If everything passes, you have blind spots. Keep failing evals as a roadmap and rerun them every time a new model drops.
9. Build the offline-to-online flywheel - Offline evals test your hypothesis. Online evals run the same scorers on production logs. The gap between them is your improvement roadmap.
10. The best teams review production logs every morning - They find novel patterns, add them to the data set, and iterate all day. That morning ritual is what separates teams that ship blind from teams that ship with confidence.
---

👨‍💻 Where to find Ankur Goyal:
LinkedIn: https://www.linkedin.com/in/ankrgyl/
Braintrust: https://www.braintrust.dev
X: https://x.com/ankrgyl

👨‍💻 Where to find Aakash:
Twitter: https://x.com/aakashgupta
LinkedIn: https://www.linkedin.com/in/aakashgupta/
Newsletter: https://www.news.aakashg.com

#aievals #aipm

---

🧠 About Product Growth:
The world's largest podcast focused solely on product + growth, with over 200K listeners.

🔔 Subscribe and turn on notifications to get more videos like this.

Aakash Gupta (host) · Ankur Goyal (guest)
Mar 20, 2026 · 52m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00–1:43

    Intro

    1. AG

      Evals are one of the most important skills for building effective AI products

    2. AG

      The failure and success of AI products is driven by how good the evals they write are, how well they use them, and of course, how much they improve them

    3. AG

      One of the top eval companies used by Replit, Vercel, and Airtable is Braintrust

    4. AG

      I think all the top AI companies understand that building a really good feedback loop from what their users are doing in production all the way through their evals that they can run offline is really, really important

    5. AG

      Ankur Goyal is the founder and CEO of Braintrust. Just announced its Series B round at an $800 million valuation. This tweet blew up. This is literally affecting people's jobs who are product managers and heads of product. How should they be dealing with this controversy?

    6. AG

      I think vibe checks are evals. I actually think this tweet is... I think one of the most important things is to have evals that fail. If you only have evals that succeed, then you don't know what problems there are

    7. AG

      Yeah. My brain immediately went to, "Well, let's improve the system prompt." So why should anyone care about evals to begin with? Before we go any further, do me a favor and check that you are subscribed on YouTube and following on Apple and Spotify podcasts. And if you wanna get access to amazing AI tools, check out my bundle, where if you become an annual subscriber to my newsletter, you get a full year free of the paid plans of Mobbin, Arise, Relayapp, Dovetail, Linear, Magic Patterns, DeepSky, Reforge Build, Descript, and Speechify. So be sure to check that out at bundle.aakashg.com. And now into today's episode.

  2. 1:43–3:21

    Why should anyone care about evals

    1. AG

      Ankur, welcome to the podcast.

    2. AG

      Thank you so much for having me. I'm really excited to be here.

    3. AG

      When I think of experts in the eval space, you have to be right at the very top of the list. But some people, they just rely on vibe checks. I've had some product leaders on this podcast who have created amazing AI features that have helped their company bag the next 500, $1 billion valuation just on vibes. So why should anyone care about evals to begin with?

    4. AG

      You know, I actually think vibe checks are a form of eval. And, uh, there's this really popular Paul Graham essay that, um, I think is very true. It's, "Do things that don't scale." And vibe checks are like the, you know, do things that don't scale analog for evals. When you do a vibe check, you are using your AI product and then using a scoring function, which is your brain, to try to intuit whether the result is good or bad. And if it's not very good, then you might tweak the prompt, or you might try a different model, or adjust how your agent is architected, whatever it may be, and then try again. What happens is that once your product gets into production, more people start using it, you have more subject matter experts and engineers at your company that are actually contributing to its quality, then the vibe check version of an eval stops scaling. And you need a little bit more software and process and tooling to help you, um, execute at a higher scale and with more predictable performance. And that's where, you know, what we normally think of as evals start to come in. But I, I actually think it's, it's a whole journey, and vibe checking is great. It's, it's just one type of eval.

  3. 3:21–6:35

    LLMs are imperfect yet capable

    1. AG

      You know, I think one of the really new things about AI development is there's this kind of magical thing, ether, that we have to deal with, which is, uh, an LLM. And an LLM, you know, not, not unlike a, a person that you might hire or work with, is somewhat unpredictable. You don't know if your app isn't working, whether it's because the LLM inherently doesn't understand, uh, your task, or maybe you haven't prompted or, or built around it well. I remember a couple years ago, right around when Braintrust started, um, a lot of my really smart friends who were LLM skeptics would say, like, "Hey, this LLM doesn't understand C++," or, "It doesn't understand my, you know, specific task, even though it works for demos." I think nowadays people have mostly moved past that, but it illustrates the idea that it's hard to know where the responsibility lies. Um, and, and I actually think that what the most clever, successful AI builders have, have proven, you know, Manus is a really good recent example of this, is that the alpha in building a good AI product is kind of understanding that LLMs are imperfect, yet very capable, and figuring out how to work your way around that and make the most of what LLMs are able to do today with an eye towards what they can do in the future. And that is really where evals come into play. They are a good way for you to treat, you know, the imperfection of LLMs from being kind of a mystery or a burden into a really fun and engaging product and engineering challenge that you can actually overcome. Um, I, I think a lot of people are starting to recognize this, you know, especially as, uh, software and models and agents are changing constantly. One of the things I come back to is that an eval is, is a relatively durable thing that you can invest in. So let's say that you're working on a new product area and, uh, you, you know, use the latest agent framework and use, uh, Opus 4.5, which is the cool model right now. All of that might change, and, you know, you and I were just joking about this a few minutes ago. Like, all of that might change in, in a couple weeks or a couple months. But if you, um, invest in evals, and, and by that I mean you do a good job of understanding what your users are actually trying to do with the product, and then you encode that as data and, uh, scores and, um, an eval flow, then even as the models and agents and, and everything change, you've, you've actually set yourself up to continue iterating and build on an investment that you make. Um, and, and so I think the, the companies that have started doing that effectively, they're actually building true differentiation. If you believe that the way that you've wired together your agent today is your differentiator, you're, you're actually highly likely to fail because that's probably gonna change in, in a couple months. On the other hand, if you build really good evals, then you've built something that has a little bit more durability to it.

    2. AG

      I've been s- preaching this message to everybody, which is, like, the harness around your LLM, everything from the memory to the evals, that is actually your more durable moat because the model underneath, that continues to evolve. One of the interesting things you have here on this slide is you have the quotes from Mike Krieger and Kevin Weil. And I think it's super notable because they're the-... product leaders at these

  4. 6:35–8:45

    The role of the PM in defining evals

    1. AG

      companies, what do you see as the role of the product manager in defining the evals? You're running one of the most used eval platforms out there. Are product managers the main user of it, or how do they interact with whoever the main user actually is?

    2. AG

      Oh, yeah. I think this is honestly not something I anticipated when we started. And, uh, just a little bit of backstory, I know there's some controversy about evals and coding products and stuff, and I actually think it relates to this point, so I'd love to, to talk about that, too. But what we've seen is that, uh, if you're building an AI product, you are now able to solve problems that software couldn't really solve before. And the people who really, really drive that level of, uh, creative thinking and software application outside of the sort of four lines of, of what software did before are product managers. And evals are core to product managers' ability to do that. I actually have sort of shifted my thinking. I think of evals as kind of like the natural evolution of a PRD. So if you look at a PRD in 2015, it's an unstructured document that is a spec that is meant to communicate how you should build something and what maybe the success criteria are for the product working. Fast-forward to now, uh, I think the modern PRD is an eval, and, um, it's actually something that an engineering team who maybe doesn't know everything about the product or the problem that they're solving can use to quantify how well the software that they're building is able to solve the problem. And I think that actually means product managers are able to be a lot more effective because they go from providing kind of a qualitative spec that no one really follows, and it's always kind of annoying to reconcile the PRD with the actual product, into something that's very quantifiable. You can look at an eval and say, "Does this, uh, piece of software fit the eval or not?" And, and oftentimes it will fit the eval, and the product will still suck, and that means that it's actually on the product manager to then go and improve the evals. And I think that's an area of leverage that product managers actually didn't have before.

    3. AG

      Yeah. I would love to talk about this coding controversy [laughs] that you referred to.

  5. 8:45–11:34

    The Claude Code evals controversy

    1. AG

      This tweet blew up. I actually had somebody ping me about this, like, the day it happened because they were like, "My boss was telling me about this because I've been championing evals in my own company, but Claude Code is not using evals. Have we been doing it all wrong?" So this is literally affecting people's jobs who are product managers and heads of product. How should they be dealing with this controversy?

    2. AG

      Yeah, for sure. I mean, I think a lot of coding tools also don't have product managers, and the reason is that the software engineers who work on the coding tools have relatively good intuition about what other software engineers want to do. And I actually think the same principle applies here. A- as I mentioned earlier, I think vibe checks are eval. So I just think this ... I love Swyx. He's a good friend, but I, I actually think this tweet is factually incorrect. Like, the fact that other people, you know, Boris and other folks at Anthropic are using Claude Code and likely providing feedback about whether the model or Claude Code itself is solving their problem, that is a form of eval. Uh, sure, they don't have it necessarily as a quantifiable process. Maybe they do by now. We, we, we don't know, or I certainly don't know. Or maybe they don't use a tool or whatever and follow, you know, a, what someone might think of as an eval. But I think they are doing evals. If someone is trying out the product and they're providing feedback, and then they're incorporating that feedback into iterating on the product, which I, I think they are, to me, that, that's doing evals. Now, why are they able to get away without a structured process that is somewhat multidisciplinary with product managers and engineers? I think it's likely because the engineers are solving problems for other engineers, and they're doing it at a company that's training the models that are also able to solve that problem. So it's totally verticalized, and you don't really need any third-party intuition to solve the problem. If you go into another domain, like an AI company that is applying an LLM to solve healthcare problems, I think you're in a totally different world, 'cause they're probably not making the LLM themselves. They probably have great software engineers who are passionate about healthcare but are not necessarily healthcare subject matter experts. And then, of course, there are product managers who are able to bridge from what engineers are working on to what, you know, patients or doctors, whoever the end user is, um, is actually experiencing. I ... My parents are both doctors, so I have a little bit of a, a soft spot for this use case. When I hang out with them and talk to them, I have almost no idea what they're talking about, right? They're using very specific jargon. They're talking about, you know, uh, medical issues that are, are obviously very important and maybe can be assisted by software. But I, I just don't have the intuition for that. And so evals become a mechanism for, uh, product managers in, in this scenario to help glue together the unknowns of how a, an end user might actually interact with a piece of software into something that is tangible that an engineering and product team can use to iterate on and improve the quality of their product.

  6. 11:34–13:05

    Ads

    1. AG

      Today's episode is brought to you by the experimentation platform Kameleoon. Nine out of 10 companies that see themselves as industry leaders and expect to grow this year say experimentation is critical to their business, but most companies still fail at it. Why? Because most experiments require too much developer involvement. Kameleoon handles experimentation differently. It enables product and growth teams to create and test prototypes in minutes with prompt-based experimentation. You describe what you want, Kameleoon builds a variation of your webpage, lets you target a cohort of users, choose KPIs, and runs the experiment for you. Prompt-based experimentation makes what used to take days of developer time turn into minutes. Try prompt-based experimentation on your own web apps. Visit kameleoon.com/prompt to join the wait list. That's K-A-M-E-L-E-O-O-N.com/prompt.

      AI is writing code faster than ever, but can your testing keep up? Testkube is the Kubernetes native platform that scales testing at the pace of AI-accelerated development. One dashboard, all your tools, full oversight. Run functional and load tests in minutes, not hours, across any framework, any environment. No vendor lock-in, no bottlenecks, just confidence that your AI-driven releases are tested, reliable, and ready to ship. Testkube, scale testing for the AI era. See more at testkube.io/aakash. That's T-E-S-T-K-U-B-E.I-O/A-A-K-A-S-H.

  7. 13:05–14:27

    Distance from the end user determines eval need

    1. AG

      Nailed it. In my opinion, which is that if you're not the end user, it becomes more and more important. And the more distance you have from that end user, like in a healthcare setting, the more and more important it is to create the evals. I think also one thing that probably Claude Code benefits from is that Anthropic, in their post-training, is using a bunch of [chuckles] evals around coding. So even if Claude Code doesn't have formalized evals, we know Anthropic does.

    2. AG

      Right. I think distance is the perfect way to think about it. If you imagine, uh, Anthropic, which is just an amazing organization bubbling with talent, you have the people training the models do, you know, pre-training, post-training, building the harness, building the product, the UI, which is Claude Code for the harness, and the end users all inside of, you know, one set of four walls. And so the efficiency with which they're able to circulate feedback is very high and, and therefore it may not need a-additional process or, or whatever to help facilitate, uh, the feedback. As any of those points of distance start to increase, you actually need a little bit more structure. Like, one of the big use cases for Braintrust has actually been helping our customers collect evals that they can share with labs so that labs can do a better job of implementing support for their use case. Uh, so they, you know, they need something. They need some ledger to be able to capture that information, otherwise how are they gonna communicate it?

    3. AG

      Makes sense. And you mentioned

  8. 14:27–18:48

    How big is Braintrust today

    1. AG

      Braintrust. I wanted to ask you, how big is Braintrust today? What can you share, whether that's users, revenue, valuation?

    2. AG

      Yeah. Braintrust is about 100 people. We have many hundreds of customers and many tens of thousands of organizations using the product. We actually have a pretty generous free plan, um, which we intend to make even more generous over time. If you're a product manager or an engineer and you're working on a hobby project, we want you to be able to use Braintrust without having to really think about it. And growth has just been absurd. Nowadays, people are running about 10 times as many evals as they were this time last year. Today, people log about twice as much data per day as they did the entire first year of Braintrust being in existence, and it's just been incredible. I think what we've seen is that everything is growing in AI. Every individual LLM call is getting bigger. People are creating larger prompts. They're putting more context into their prompts. There are more LLM calls in every, uh, request that comes through because people are building agents, and agents are doing research and interacting more frequently with users and doing much richer work. And then AI products are actually achieving real product market fit, and so the number of requests is also growing very rapidly. And if you multiply those three things together, you get this incredible explosion of interesting data that we see flowing through Braintrust.

    3. AG

      Wow. So as of, I believe, October 2024, so a year and change ago, it was reported that you were valued at $150 million in that fundraise. Can you give us a sense of what the scale of growth has been since then?

    4. AG

      Yeah. I think we have been very fortunate. We were, uh, cash flow positive for a very long time, and so, uh, we've been able to utilize capital, uh, actually very, very effectively. I think, uh, I can't share the very, very specific numbers now, but if you look at our revenue metrics and growth metrics, we are more than an order of magnitude in growth, uh, on, you know, literally every axis. And if you look at just consumption growth, it's multiple orders of magnitude of growth. So, um, it's been, uh, a pretty wild, w- I don't know what it is, 15 months, [chuckles] uh, since then.

    5. AG

      That's crazy, and I think it's a testament to companies that you have as customers, like Vercel and Replit and Airtable, being so keen on evals. Why are all the hottest companies so focused on evals?

    6. AG

      You know, I think when we started Braintrust, we wanted to partner with entrepreneurs and, and, uh, builders who had companies that had preexisting product market fit and were earnestly investing in AI. Um, I'll just highlight Brian from Zapier, for example. Zapier was our first customer. Brian is the CTO. He's been working on Zapier for a long time. And, uh, when I met him, he basically introduced himself as a full-time AI engineer. Now, this guy's, like, super successful. He probably doesn't have to work, but I haven't seen anyone nerd out about AI as much as Brian does. And the reason that we wanted to partner with these companies is that we knew that they would only build and ship products that met a certain level of quality, and they would hold themselves to a rigorous product market fit bar. But they were very, uh, earnestly adopting AI, and that has very much turned out to be true with, with all of the companies on, on this list and, you know, most of the companies that, that we work with. And I think if you consider that, like, that these companies have preexisting product market fit, so they have to do things at some level of scale. They can't simply rely on, on vibe checks. Although they do a lot of vibe checking, as, as everyone should, they can't simply rely on it. They have enough product market fit to actually drive real scale, and then they have products... Like, if Ramp doesn't work, it's very bad. You know? They can't-- They don't really have the leeway to screw things up, and so the standard for the quality of the products that they're shipping is very high. You kind of mix those ingredients together, and it's very, very obvious from first principles that you need to run evals and take observability very seriously to, to implement a good product. And so, uh, honestly, it's been no surprise to me. I, I built Braintrust as an internal tool when I led the AI team at Figma and, y-you know, intuitively, I, I've known for a long time just how critical evals are to being able to execute product well, and we've very much seen that play out with, with these companies who I think are very much on the leading edge of building great AI products.

    7. AG

      So I wanna get a little bit more

  9. 18:48–20:20

    Building an eval from scratch (live demo)

    1. AG

      tactical for everybody. You have on there the stat, which I think is pretty crazy, 12.8 experiments per day. What exactly are those tangibly? Like, what are people doing that they are running this many experiments per day? I remember 10 years ago, we were talking about 2015 PRDs, for instance. We might run 4.8 experiments total in a quarter, let alone just on our evals in a day.

    2. AG

      Yeah. A- um, of AI is that experimentation, which used to be something that you would only run in production, um, is now something that you can do offline as well. And that is, uh, actually one of the things that I think contributes to so much rapid evolution of AI products. Uh, y-you're absolutely right. Like, if you had this nondeterministic problem that you had to solve, then you might have to A/B test it, and doing an A/B test is a very, very high fidelity but very expensive way to get feedback about whether a nondeterministic thing works or not. In AI, because you're able to do evals and actually, um, iterate offline, you can do those experiments just on your laptop. In fact, in a few minutes, we're gonna run some experiments with, um, a prompt and an MCP server and try and, and improve some stuff. And w- I don't know if we'll run 12.8, but we're certainly gonna run more than one experiment and, and iterate, you know, just, just live.

    3. AG

      What are the steps that we need to go through in order to define an experiment like this?

    4. AG

      So-Uh, this is straight from our docs. Uh, an eval consists of three things, and I think this is a very helpful framework 'cause it allows you to simplify what might otherwise be kind of an overly complex or infinitely complex

  10. 20:20–22:15

    Ads

    1. AG

      topic. But an eval is, is literally three things. Data, which is a set of inputs, so we're gonna play with Linear's MCP in a moment. Uh, an example of a piece of data could be how many tasks do I have assigned to me? That could be the input question, and then optionally you might have a ground truth answer, like 12. You might not, which is totally fine. Actually, we're not going to have ground truth in the eval that we run, but if you do, you might be able to use it. The next part is a task. A task is something that takes an input and then generates an output. And a task could be as simple as a single LLM call, like you could just take the question and paste it as a message into GPT-5-nano and then get a response, or it could be as complex as an agent. It might do some research or call an MCP server. It might call other LLMs. It might call APIs or vector databases, whatever it may be. At the end of the day, though, it's gonna produce some kind of output, and that's the thing that you evaluate. And then the last thing is scores. Scores take the data, so they, they know about the input, they know about maybe the expected output. They take the output of the task, and then their job is to produce a number between zero and one. I think it's actually really, really important you normalize things between a fixed range, uh, zero and one, and the reason that's important is that it forces you to make everything comparable. So no matter what, like a week from now or a month from now, when you run a new eval, you'll be able to produce a score that's within the same range. And, uh, when you do that, that means that you'll be able to compare how the thing that you did today performs against the thing that you do tomorrow. Uh, so it's kind of this forcing function for you to really simplify how you want to assess whether the thing is working or not. And then once you are creative and you figure out how to do that, you have a really nice artifact that allows you to continue testing and evaluating things.

    2. AG

      Okay. So data task scores. I think I got it. Let's see it in action.
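To make the data, task, scores framing above concrete, here is a minimal, framework-agnostic sketch in Python. The function names and the toy keyword scorer are illustrative assumptions, not Braintrust's SDK; in a real eval the task would call an LLM or agent and the scorer would typically be an LLM judge or a code check.

```python
from statistics import mean

# Data: a set of inputs (ground-truth expected answers are optional).
data = [
    {"input": "How many tasks are assigned to me?"},
    {"input": "Are there any overdue tasks?"},
]

def task(input_text: str) -> str:
    """Task: anything that turns an input into an output.
    In a real eval this would call an LLM or an agent; stubbed here."""
    return f"(model output for: {input_text})"

def answers_the_question(input_text: str, output_text: str) -> float:
    """Score: rates the output between 0 and 1 so runs stay comparable over time.
    A trivial keyword heuristic stands in for an LLM-as-judge scorer."""
    return 1.0 if "task" in output_text.lower() else 0.0

scores = [answers_the_question(row["input"], task(row["input"])) for row in data]
print(f"mean score: {mean(scores):.2f}")  # one comparable number per experiment
```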

  11. 22:15–30:20

    Creating the data set and scoring function

    1. AG

      Awesome. Uh, so we are going to create an eval entirely from scratch. There's no pre-written prompts. There's no pre-written data set. There's no pre-written scoring functions. This is gonna be 100% live. Expect some fun, uh, nuances along the way, and, uh, let's have some fun. And, and by the way, I actually haven't done this demo before, so Aakash, if you have ideas or feedback about how we can evaluate this together, I'm, I'm all ears.

    2. AG

      All right. For those who don't know Linear, if you're just on a Jira ecosystem, Linear is a competitor to Jira, so it's your task management tool. It's where you're putting down, "Hey, these are all the things our engineers are gonna build."

    3. AG

      Um, we use Linear. Uh, it's been a fantastic piece of software for us, and Linear also uses Braintrust, so, uh, they're a good friend o- of ours, and I think they have a really nice MCP server, which is, uh, super cool, um, and we're gonna use it actually as part of this. So let's say that we're building a tool that allows us to ask questions, uh, about our task workload and understand, you know, what, what work do we have to do. So let's just write a really simple system prompt. "You are a helpful assistant who answers questions from Linear." Okay. And let's create a data set, and instead of creating it from scratch, let's just use Opus to help us create the data set. So it's gonna look at-- It's gonna know that we're working on something related to Linear, and it's gonna generate some test data.

    4. AG

      Okay. And for those of you who are wondering, you just said MCP. What, what is that usage? So Model Context Protocol, it's just the standard definition, basically. It's like the API that LLMs can use, so it's allowing the tool we're looking at, Braintrust, to get access to the data inside of Linear. And Ankur mentioned Brian. Brian is the one who did the MCP primer podcast episode nearly a year ago on this podcast itself, so if you want more details-

    5. AG

      Oh, great

    6. AG

      ... you can check that out. [chuckles]

    7. AG

      That's awesome.

    8. AG

      But it looks like we've got the initial test data from Opus. This is not the one from the MCP yet. This is just test data that-

    9. AG

      Yeah. There's no MCP connection yet. Um-

    10. AG

      Yeah

    11. AG

      ... and, and actually, I don't love this data. So this is asking questions about what Linear is. Let, let, let's try to improve it. So I, I don't really... I, I, we know what Linear is. We're, we're trying to build a bot that helps us ask questions about the workload, so let's, let's actually tweak it. Actually, I want questions about my Linear project. For example, what tasks are assigned to me?

    12. AG

      So creating stronger test data, in this case, making it more about tasks and kinds of tasks instead of just the high level it was before.

    13. AG

      Okay, great. Last but not least, remove the expected answers since we don't know them. Models still love to hallucinate.

    14. AG

      [chuckles]

    15. AG

      Even Opus 4.5. Okay, great. So now it's going to create this data. Of course, I can always edit it. So let's see, like we don't do sprints at Braintrust, so let's say like how many tasks need to be triaged.

    16. AG

      Mm.

    17. AG

      And now what we can do is just hit Run. So it's gonna use GPT-5-nano, which is one of my favorite models. It's super cheap and relatively fast, and let's see what it comes up with. Okay, so this doesn't seem like a great answer. "What tasks are assigned to me? Happy to help with Linear. What would you like me to do?" Let's see. "Are there any overdue tasks? I can help with questions about Linear's usage." Um, okay. Well, what we just did is a vibe check, and that means that we looked at some of these questions. We looked at the answers. Aakash, feel free to disagree. I think these answers are pretty bad.

    18. AG

      Yeah.

    19. AG

      Um, so now, before we actually try to improve them, what I'd like to do is be able to quantify that, and that is where scoring comes in. The benefit of quantifying it is that we're of course gonna vibe check the improved results as well, but the artifact that we'll produce by actually running these evals is something that our team could continue to use so that as we add more data, as we evolve the prompts, we'll have a quantitative signal about whether we're improving the thing that we're trying to improve. So now let's go back to Loop. Uh, by the way, Loop is the agent that's built into Braintrust, and it works kind of like Claude Code or Cursor. It has tools that are plugged into all of the nooks and crannies of our product, and so it can interact with data and prompts and run evals and stuff for you. So anyway, we have these tasks, and we, we know they kinda suck. Let's just see if we can write a scoring function using Loop so we don't have to create it from scratch, so-

    20. AG

      Yeah

    21. AG

      ... these answers aren't great. They are vague and introductory. Can you create a scoring function that makes sure that, A, the answers actually answer the question, and, B, if they cite any information or include any facts about tasks, they cite a source.

    22. AG

      Mm.

    23. AG

      Okay.

    24. AG

      And while this is coming up here, in the lore of the podcast, the prior few eval episodes that we had from Hamel Husain, Shreya Shankar, Aman Khan, they all warned against numerical scores. They said that we need to go for more of, like, a binary yes/no. Here, we're going for a score there. Can you talk to us about that?

    25. AG

      Yeah, I think the simple way to think about it is that jumping into scores like 0.2, uh, or 0.4 before you have really justified the need to do that is not a good idea. And in fact, even though we are going to create numbers here, we're actually only gonna create scores that fit a specific set of values. Uh, let's see what the model came up with here. So it, it only has three options.

    26. AG

      Mm.

    27. AG

      Um, and if we look at what the definition of B is, it's partial. So it's saying it's missing citations for tasks, but it has some sort of answer. We can s- change that. Like, we can say that, "Hey, actually, I don't like the fact that it's doing that. I, I don't want, I don't wanna give you any partial credit in that case."

    28. AG

      Mm-hmm.

    29. AG

      Um, so I think it's important not to overcomplicate your scores, and I think if you're creating LLM-based scores, you shouldn't ask the LLM to generate a number, 'cause that's not very clear. It's useful to have clear criteria. But I actually disagree that every score needs to be binary. I don't think there's any real justification behind that. In fact, I worked with the OpenAI team about a year ago and published a research cookbook that I can, uh, send a link to you to that walks through, somewhat scientifically, what is a good thing to do and what is a bad thing to do and why. And so that might be helpful reading if anyone wants to go, you know, one level deeper.

    30. AG

      So here we've gone with categorical, and that should be all right.
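For reference, a categorical scorer along the lines described above might look like this in plain Python, calling the OpenAI chat completions API directly. The prompt wording, the choice-to-score mapping, and the model name are illustrative assumptions, not the scorer Loop generated in the demo.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an assistant's answer about Linear tasks.

Question: {question}
Answer: {answer}

Choose exactly one option:
A) Fully answers the question and cites the tasks it refers to.
B) Partially answers the question (some answer, but missing citations).
C) Does not answer the question.

Reply with only the letter."""

# Fixed, categorical choices mapped to scores in [0, 1]; the judge is never
# asked to invent an arbitrary number on its own.
CHOICE_SCORES = {"A": 1.0, "B": 0.5, "C": 0.0}

def linear_answer_quality(question: str, answer: str, model: str = "gpt-5-nano") -> float:
    """LLM-as-judge scorer that returns one of a fixed set of values."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    choice = response.choices[0].message.content.strip().upper()[:1]
    return CHOICE_SCORES.get(choice, 0.0)  # unrecognized replies score 0 rather than guessing
```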

  12. 30:20–33:01

    Ads

    1. AG

      tool. Today's podcast is brought to you by Pendo, the leading software experience management platform. McKinsey found that 78% of companies are using gen AI, but just as many have reported no bottom-line improvements. So how do you know if your AI agents are actually working? Are they giving users the wrong answers, creating more work instead of less, improving retention or hurting it? When your software data and AI data are disconnected, you can't answer these questions. But when you bring all your usage data together in one place, you can see what users do before, during, and after they use AI, showing you when agents work, how they help you grow, and when to prioritize on your roadmap. Pendo Agent Analytics is the only solution built to do this for product teams. Start measuring your AI's performance with Agent Analytics at pendo.io/aakash. That's P-E-N-D-O.I-O/A-A-K-A-S-H.

      Here's the dirty secret about prototyping. You spend two weeks building a prototype. You validate your assumptions. Engineering loves the direction. Then what happens? You throw the whole thing away. Bolt changes this completely. When you prototype in Bolt, you're not building a throwaway mockup. You're building real front-end code that integrates with your existing design system. So when you hand it to engineering, they don't throw it away, they ship on top of what you've built. I use Bolt every single day. I host my LAN PM job cohort on it, and honestly, I'm up till 2:00 AM some days just vibing in the tool, having fun, and building. That's when you know a product is good, when you're using it past midnight, not because you need to, but because you want to. Check out Bolt at bolt.new/aakash. That's B-O-L-T.N-E-W/A-A-K-A-S-H. Link in the show notes.

      I hope you're enjoying today's episode. Are you interested in becoming an AI product manager, making hundreds of thousands of dollars more, joining OpenAI and Anthropic? Then you might want to do a course that I've taken myself, the AI PM certificate run by OpenAI Product Leader Miqdad Jaffer. If you use my code and my link, you get a special discount on this course. It is a course that I highly recommend. We have done a lot of collaborations together on things like AI product strategy, so check out our newsletter articles if you want to see the quality of the type of thinking you'll get. One of my frequent collaborators, Pawel Huryn, is the Build Labs leader, so you're gonna live build an AI product with Pawel's feedback if you take this AI PM certificate. So be sure to check that out. Be sure to use my code and my link in order to get a special discount. And now back into today's episode.

    2. AG

      And by the way, you can always do that later and actually evaluate that. So one of the things that you could do here is actually keep all of them enabled, run it, and then, um, maybe you don't get great performance. You could duplicate the prompt again and then try disabling some of the tools and see if you get better performance.

    3. AG

      Mm-hmm. Makes sense.

  13. 33:01–39:12

    Iterating on prompt and MCP tools

    1. AG

      Okay, great. So we'll save that, and let's try running it again.

    2. AG

      It's fun how fast you can iterate here, and so this might be, like, an example of those 13 experiments. Like, they're constantly improving what they're working on, and you just get the results so quickly.

    3. AG

      Exactly. Every time I click Run, actually, it is essentially running an experiment. Okay, great. So it didn't actually do that well. Welcome to AI. Um, so let's, let's see what happened. Here it said, "Are there any overdue tasks?" And this model said, "I'm ready to help with linear tasks," but it doesn't actually do anything. It just says what it can do, and it doesn't really solve the problem.

    4. AG

      Wow.

    5. AG

      Now, there's a few things we can do. One thing we could do is we could try a different model. So we could say maybe let's try GPT-5 or GPT-5 mini and see if we get better performance.

    6. AG

      Mm-hmm.

    7. AG

      Uh, another thing we could do is we could try to improve the system prompt. So we could say, "Don't ask clarifying questions, please. Just use the tools and figure it out."

    8. AG

      Oh, okay.

    9. AG

      Let's try it, actually. And, uh, you know, a third thing we could do is we could go and actually edit the questions. Maybe the questions are not great, and, uh, maybe if we made them a little bit more specific, we'd get better results. Um, and then of course, the fourth thing we can do is edit the scoring function, but I agree. I think my vibe check on the score, which was zero, is consistent with what the score actually was. So I, I wouldn't advise that we do that.

    10. AG

      Yeah. My brain immediately went to, well, let's improve the system prompt. Maybe let's add a few few-shot examples of how to run it. Maybe let's specify the tools. But we didn't go to that level quite yet.

    11. AG

      Exactly. And, and as you can see, it, it actually didn't solve all of the problems yet, and OpenAI returned an error for one of them. But it seems like this one is actually pretty good.

    12. AG

      Mm-hmm.

    13. AG

      So let's take a look. Here's a quick digest of the 20 issues assigned to you. Uh, so it's actually talking to Linear. And then if we go and look at the score here, it says it doesn't include a citation, so it just mentioned that it got the digest, but it did answer the question. So it gave us partial credit.

    14. AG

      Yeah, and it did a pretty good job citing its sources, so...

    15. AG

      So yeah, maybe that means we should improve the scoring function.

    16. AG

      Yeah.

    17. AG

      Yeah.

    18. AG

      So it looks like we probably want to iterate both on our system prompt and our scoring function.

    19. AG

      Exactly. And by the way, I'm hand doing this just for the purpose of showing you that, but one of the things that I really love about Loop is that I can say things like, "I think the scoring function is too harsh. If the response contains any references to BRA tasks, then it has cited its sources."

    20. AG

      So this will go update the scoring function. And could we do the same for the system prompt? Could we say, "Right now, we're still failing on four out of the five, so can we add a few few-shot examples and specify which MCP tools in Linear to use?"

    21. AG

      Absolutely. So here you can see it's edited the criteria for the, the scorer. We can hit Accept, and it will update it. Here's the updated one with the new criteria.

    22. AG

      Nice.

    23. AG

      And we can also say, "Scores still..." Oh, yes, please. It's offering to actually run the task and see how it scores.

    24. AG

      Oh, yeah, so we can do one change at a time and check it first.

    25. AG

      Yeah. Let's just have it do that.

    26. AG

      Yep. Now that I'm back to becoming a coder, thanks to vibe coding tools, after 16 years of being away, I'm like, "One step at a time." [chuckles] All right. Hmm.

    27. AG

      Yes, please. Improve...

    28. AG

      Yep, the prompt. Yep, but not just with citations, right?

    29. AG

      Yeah.

    30. AG

      Okay. And if you wanted, you didn't need to use AI to do this. I think, like a lot of people, this might be a process also that maybe the PM isn't necessarily controlling at this point, improving the system prompt. It might be something that an AI engineer or an engineer involved with it is. But usually, the PMs are pretty involved in the scoring function, going back to your point around evals are the new PRD.

  14. 39:12–43:36

    Why you need evals that fail

    1. AG

      release it.

    2. AG

      Absolutely. I think one of the most important things is to have evals that fail. If you only have evals that succeed, then you don't know what problems, uh, there are, and that means that you either don't have a clear understanding of what problems your users are hitting, or you don't have a clear understanding of what is impossible today. And I think it's very, very important to have both. If you have evals that are failing, then when a new model comes out, the first thing you should do is just rerun those evals. And you'll be surprised that every time a new model comes out, something interesting is gonna happen.
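One lightweight way to act on this advice is to keep the unsolved cases in their own data set and sweep new models over it as they ship. Everything below (the cases, the stubbed task, the placeholder scorer, the model names) is a hypothetical sketch rather than any specific vendor's API.

```python
from statistics import mean
from typing import Callable

# Cases no current setup passes yet: kept on purpose, as a roadmap.
failing_cases = [
    {"input": "Summarize every task that has been blocked for more than a week, grouped by team."},
    {"input": "Which projects have had no issues triaged in the last month?"},
]

def make_task(model_name: str) -> Callable[[str], str]:
    def task(input_text: str) -> str:
        # In practice this calls the model/agent under test; stubbed so the sketch runs offline.
        return f"[{model_name}] answer to: {input_text}"
    return task

def scorer(question: str, answer: str) -> float:
    # Plug in the same categorical judge used elsewhere; a constant keeps the stub honest.
    return 0.0

def run_eval(task: Callable[[str], str], data: list[dict]) -> float:
    return mean(scorer(row["input"], task(row["input"])) for row in data)

# Rerun the failing set whenever a new model ships and watch for movement.
for model in ["gpt-5-nano", "gpt-5-mini", "next-new-model"]:
    print(f"{model}: {run_eval(make_task(model), failing_cases):.2f}")
```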

    3. AG

      I heard, like, uh, some people who are running a coding tool, 3-Flash was somehow performing better on, like, a lot of coding benchmarks than 3 Pro, Gemini 3 Pro, but it was hallucinating more.

    4. AG

      Oh, yeah.

    5. AG

      So you'll-- These are, like, these nuances where you need to have a full eval testing suite to really understand which metric's improving versus which it's hurting.

    6. AG

      For sure, and I think as with any benchmark, an up does not necessarily mean good. An up just means that something interesting happened. And I think more often than, uh, not, when you see something interesting happen in a benchmark, including an improvement, it means that the benchmark itself is broken. But you should not necessarily hypothesize whether a benchmark is broken until you're able to reproduce it with some real data. So I, I'm a big believer in doing really dumb, seemingly obvious things like just auto-generating silly questions about linear tasks or whatever it is, and then running stuff and confronting, you know, the actual generated outputs with your intuition and using that moment as the opportunity to improve things, as opposed to spending a month creating a perfect golden data set that you think represents the problem that you're trying to solve and doing all this other prep work. I think you should just jump in and then start iterating.

    7. AG

      So that's a real case for don't silo your Braintrust licenses and user accounts to the AI engineers. Make sure that the PMs, maybe even the right go-to-market domain experts who really understand it, have access to the tool.

    8. AG

      Yeah, I mean, we about, uh, I don't-- I, I actually don't remember, maybe three or six months ago, we sort of realized that Braintrust should not be constrained to the AI engineering team, and we removed user-based pricing. So there's no, no user-based pricing. It's just based on how many evals you run and how much data you log. You should just not worry about that.

    9. AG

      Mm-hmm.

    10. AG

      Looks like this thing is cooking, and it's made some serious progress. I've been watching along as we've been talking, and it's solved some problems, like telling the model to use the tool. It's also solved the problem of the model asking for clarification. So I think in chat-based use cases, models are post-trained to ask for clarifying questions. In the context of this demo, we're not giving it the opportunity to do that. We're just hoping that it generates a response from one question. And so it's really important that we tell it not to do that.

    11. AG

      Mm-hmm. Yeah, it's very cool. I think it, like, started with a partial score, then it moved. It said, "Okay, let me iterate on the system prompt again to get a full score." So it's really working through the problems.

    12. AG

      Um, and yeah, I mean, this is evaluation. I think a few things I'd highlight are that we touched all three parts of the workflow. We worked on the dataset. We iterated it a little bit. From here, uh, you might add more examples to it. You might tweak the ones that you have. You could use Loop to help you think about more examples to add. A second thing that we did is we actually worked on the task function itself. So we wrote prompt, we picked a model, we changed the MCP tools that were available to it. We could do more work there. Like, I think maybe switching to a better model might help us consistently get a better score. Oh, wow. It looks like we're now at, uh, zero point seven five across the board, uh, which is a huge improvement from where we were before.

    13. AG

      Yeah.

    14. AG

      And then the third thing that we did is we actually iterated, we created and then iterated on a scoring function. So we made an initial one. You pointed out that the scoring function was being a little bit too nitpicky, so even though the response was citing the specific issues, it wasn't really giving it credit for that because it didn't have a link or something like that. So we also improved and iterated on the scoring function to better represent what our vibe check, or in this case, your vibe check, was indicating was a little bit off about how it was working. And I think that process that we just ran is, is very, very representative of how people do evals.

  15. 43:36–47:40

    Offline vs online evals

    1. AG

      So what is the distinction between offline and online evals, and when should people be doing which?

    2. AG

      Yeah. So one of the cool things about the work that we did is we created a scorer, and even though we're using it in this playground, this isn't the only place that you could use the scorer. So if we go into the scorer list in Braintrust, you'll see that we have the scorer right here, and we can actually run it on real live logs and deploy it into production so that, let's say we take this app that we built and we start using it, every time we ask a question, it will actually run the scorer online. In fact, we can do that right now. If we go back to the playground, we can save this prompt. Oh, it's right here.

    3. AG

      And I'm loving this prompt.

    4. AG

      Yeah.

    5. AG

      You can see the tool usage patterns.

    6. AG

      Hardcore.

    7. AG

      Great.

    8. AG

      Yeah, exactly.

    9. AG

      And it's really quite nice.

    10. AG

      And, and again, uh, you're a product manager, so I think you probably, correct me if I'm wrong, but you see this and you get some PRD vibes, right?

    11. AG

      Yeah. [chuckles]

    12. AG

      And, and that's what I mean, like this is a much more quantifiable version of thinking about what a product should be, and it, it's really fun, I think, to actually be able to take, uh, product intuition and quantify it and turn it into something really tangible. So if we go, we can actually take that scorer that we created and run it online. So we have linear answer quality. We can run it on, uh, every LLM span, and we'll run it on 100%.

    13. AG

      We'll give it a name.

    14. AG

      Um, and then it's super easy to actually test these things out in Braintrust. So we had the prompt that we created here. There's a little built-in chat interface, so I can say like, um, "What tasks are assigned to me?" And you can see it's calling this tool, and it's gonna generate an answer.

    15. AG

      Mm-hmm. And what makes this online is it's accessing the real data? Or what... When people talk about that distinction, what should they understand?

    16. AG

      Yeah. So what's happening here is, we'll go to it in a minute, but every time I use a prompt in Braintrust or whatever my app is, I'm gonna be generating real live logs, uh, of my production application. And online evals are, um, taking these scorers that we build and running them on your real live user logs. And I think that's helpful for two reasons. The first is that it gives you insight into how well the same eval functions that you're using to test things offline are actually translating into real world performance. So let's say that offline we are able to achieve a score of 0.75, which is not bad, and then we run the same scorer online and we consistently see the result is 0.3, that means that maybe it's not actually working as well in the real world as we think it's working in our little simulation environment. And then the second thing is that it becomes a really good flywheel for you to find, uh, examples that are worth including in your offline eval. So when you see that the score is 0.3, then you can actually filter down to the examples that are not performing very well, and then grab them and add them into that same data set that we were using to assess things. Wow, I have a lot of tasks assigned to me. I am, I'm gonna have to do some coding work after this, uh-

    17. AG

      [chuckles]

    18. AG

      ... chat we had. Um, so if we go to our logs page, you'll see that right here, this is exactly the chat thing that we had and then the eval running, and it looks like we didn't do a... Oh, it looks like we actually in the end did a good job. So I set it up to evaluate every step. Maybe you only want to evaluate the last step.

    19. AG

      Mm-hmm.

    20. AG

      But it looks like at the end it actually scored pretty well.

    21. AG

      Nice. So to summarize for people, your offline eval is based on that golden data set, and you can continue to improve that golden data set when you see a discrepancy between your performance on your offline and your online evals. You say, "Okay, everything we failed online, let's potentially add that back in," or those are candidates to add back in to your golden data set for what's running offline. Did I get that right?

    22. AG

      Exactly. You can actually do that directly in our UI. So you just find examples that you think are interesting, and then you can add them to the data set.

    23. AG

      Okay.
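A rough sketch of that offline-to-online flywheel: run the same scorer over production logs, flag the low scorers, and fold them back into the offline data set. The log shape, threshold, and file name are illustrative assumptions; in Braintrust this happens through online scoring and the logs UI rather than hand-rolled code.

```python
import json

# Production logs: real user inputs plus the outputs your app actually produced.
production_logs = [
    {"input": "What tasks are assigned to me?", "output": "Happy to help with Linear! What would you like me to do?"},
    {"input": "Are there any overdue tasks?", "output": "You have 3 overdue tasks: BRA-12, BRA-19, BRA-27."},
]

def linear_answer_quality(question: str, answer: str) -> float:
    # Same scorer used offline; a crude citation check stands in for the LLM judge.
    return 1.0 if "BRA-" in answer else 0.0

THRESHOLD = 0.5

# The gap between offline and online scores lives in these examples.
hard_cases = [
    {"input": log["input"]}
    for log in production_logs
    if linear_answer_quality(log["input"], log["output"]) < THRESHOLD
]

# Fold the newly found hard cases back into the offline eval data set.
with open("eval_dataset.jsonl", "a") as f:
    for case in hard_cases:
        f.write(json.dumps(case) + "\n")

print(f"added {len(hard_cases)} hard case(s) to the offline data set")
```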

  16. 47:40–50:00

    How to maintain eval culture

    1. AG

      Very cool. So how do you maintain trust in your eval system so people don't bypass it when they're shipping new features?

    2. AG

      Yeah, I mean, I think that the best teams don't think of evals as a gate. They think of it as a core part of their iterative loop of actually improving things. And I think that the best workflow looks like looking at real production examples. Um, in fact, some of our customers have kind of like a ritual where every morning in standup they'll look at some examples from the previous day's usage of their product. A- and then what they'll do is they'll reconcile what they see with those examples with what their evals, uh, have. So let's say that the scores are very low for, uh, let's just use this linear example, questions related to our UI. It's like, huh, maybe we don't have that many questions related to UI tasks in our eval data set. So what they'll do is find these novel patterns that have emerged from their logs and then add them to the data set, and maybe you do that in the morning. And then what they'll do is they'll grind that day and actually try to improve the eval performance on the things that they noticed. And that becomes a really helpful way to prioritize what you should actually work on and what it means to actually succeed on a particular endeavor. Like, hey, it clearly looks like we're not doing well on questions related to UIs. Let's bring in a bunch of those tasks, add them to our data sets, reproduce that problem in our evals, and then go and iterate on it until we're able to produce a better result. And I think that's the best way to think about evals. If you think about evals as instead, which a lot of people do, and, and I try to discourage, uh, folks from thinking about it this way. Instead, what you might do is, "I think there's a problem. Let me edit my prompt to try to fix the problem and play with it on three examples." And then, "Okay, it seems like it's better. Now let me go run a full eval run and see if I can ship this thing." I think you're, you're not going to be as efficient because you're not thinking about the broader problem which is represented in the data set, uh, while you're actually making those iterations in the first place.

    3. AG

      Amazing. Couldn't agree more. This, you guys, was a less than one-hour masterclass into eval, so there is much, much more out there that you can go deeper. If people wanna go deeper, Ankur, where should they be going?

  17. 50:00–51:56

    Outro

    1. AG

      Well, you can reach out to us, www.braintrust.dev, um, is our website. Um, you can email me, A-N-K-U-R @braintrust.dev, or reach out to us on, um, X or Discord. We also have a user conference coming up in February called Trace, so if you go to braintrust.dev/trace, you can see information about signing up. It's a zero bullshit practitioner-led conference, so a bunch of talks from people, like companies, uh, from companies like Dropbox, Ramp, uh, Notion, um, other folks that we talked about earlier, who are just gonna talk about how they're solving these problems, and would love to meet you.

    2. AG

      All right, guys. For my money, in 2026, whether or not you're building an AI feature or not right now, every PM should be learning this skill. I hope we got you excited enough to go out there and try this out, maybe with a free Braintrust account or something else, whatever platform you are using. Get out there, start iterating. You saw how fun it was. You saw how I was jumping in on how I wanted to do more of the system prompt. I think you'll feel that same excitement once you get your hands into a tool like this. So I hope we've removed that barrier to entry for you guys, and we'll see you in the next episode.

    3. AG

      Thanks for having me.

    4. AG

      I hope you enjoyed that episode. If you could take a moment to double-check that you have followed on Apple and Spotify Podcasts, subscribed on YouTube, left a rating or review on Apple or Spotify, and commented on YouTube, all these things will help the algorithm distribute the show to more and more people. As we distribute the show to more people, we can grow the show, improve the quality of the content and the production to get you better insights to stay ahead in your career. Finally, do check out my bundle at bundle.aakashg.com to get access to nine AI products for an entire year for free. This includes Dovetail, Maven, Linear, Reforge Build, Descript, and many other amazing tools that will help you as an AI product manager or builder succeed. I'll see you in the next episode

Episode duration: 52:06
