Braintrust CEO: Evals are the new PRD for AI products

In this episode, I sit down with Ankur Goyal, founder and CEO of Braintrust, the AI evals and observability platform used by teams like Notion, Stripe, Vercel, and Zapier. This one is for the senior engineers, staff engineers, VPs of engineering, and CTOs in my audience. We get into how coding agents can take on deeply technical architecture and infrastructure work that no single human engineer could tackle before, and then we demystify evals so you can use them to make your AI products better without touching the implementation.

*What you’ll learn:*
1. How Ankur uses Codex to run week-long benchmark experiments across database indexes, column store formats, and execution engines to speed up slow queries
2. Why he argues there’s no excuse to skip rigorous benchmarking now that agents can run them tirelessly
3. The “agent line” framework: how to decide which decisions, directions, and interactions you can hand off to an agent
4. How I think about the practical vs. theoretical quality of AI on hard technical problems, and why human attention decays on tedious work
5. Why evals are the modern version of a PRD, and how to encode “what good looks like” so a model can figure out the “how”
6. How to build a scoring function live and let an agent improve your prompt inside a safe playground
7. How Ankur turned his designer David’s taste into a repeatable eval so quality scales beyond one person
8. Why fixing your CI is the highest-leverage way to speed up engineering velocity

*Brought to you by:*
Guru, the AI layer of truth: http://getguru.com/
Persona, trusted identity verification for any use case: https://withpersona.com/lp/howiai

*In this episode, we cover:*
(00:00) Introduction to Ankur Goyal
(03:00) Using AI agents for database optimization
(06:10) Running exhaustive benchmarks with coding agents
(09:03) Why staff engineers are wrong about AI limitations
(11:30) The “agent line” framework for delegation
(14:00) Ankur’s workflow: running 4 to 6 concurrent agents
(17:16) Technical setup: foreground agents, background agents, and cloud environments
(20:32) Spending time with AI tools
(23:06) Demystifying evals
(26:02) Live demo: Building an eval for documentation answers
(30:20) The alternative to evals: vibe checks and whack-a-mole
(32:09) Capturing designer taste in scoring functions
(33:13) Quick recap
(33:44) Managing velocity and throughput
(35:40) Why CI/CD investment is critical for AI-accelerated teams
(37:30) Ankur’s prompting strategy when agents fail
(39:10) Closing thoughts and how to connect

*Blog & detailed workflow walkthroughs from this episode:*
Blog:
↳ Ankur Goyal's Playbook for Agent-Driven Benchmarking and AI Evals https://www.chatprd.ai/how-i-ai/ankur-goyals-playbook-for-agent-driven-benchmarking-and-ai-evals
Workflows:
↳ How to Scale Expert Judgment in AI Systems with a Human Feedback Loop https://www.chatprd.ai/how-i-ai/workflows/how-to-scale-expert-judgment-in-ai-systems-with-a-human-feedback-loop
↳ How to Use AI Coding Agents for Exhaustive Infrastructure Benchmarking https://www.chatprd.ai/how-i-ai/workflows/how-to-use-ai-coding-agents-for-exhaustive-infrastructure-benchmarking

*Tools referenced:*
• Braintrust: https://www.braintrust.dev/
• Codex: https://openai.com/codex/
• GPT 5.4: https://developers.openai.com/api/docs/models/gpt-5.4
• Claude: https://claude.ai/

*Other references:*
• GPT 5.5 just did what no other model could: https://www.lennysnewsletter.com/p/gpt-55-just-did-what-no-other-model
• Paul Graham’s Maker vs. Manager Schedule: http://www.paulgraham.com/makersschedule.html
• tmux: https://github.com/tmux/tmux
• Chris Tate at Vercel: https://www.linkedin.com/in/ctatedev/

*Where to find Ankur Goyal:*
LinkedIn: https://www.linkedin.com/in/ankrgyl/

*Where to find Claire Vo:*
ChatPRD: https://www.chatprd.ai/
Website: https://clairevo.com/
LinkedIn: https://www.linkedin.com/in/clairevo/
X: https://x.com/clairevo

_Production and marketing by https://penname.co/._
_For inquiries about sponsoring the podcast, email jordan@penname.co._

Claire Vo (host), Ankur Goyal (guest)

Jun 15, 2026 · 40m

EVERY SPOKEN WORD

40 min read · 7,867 words

0:00 – 3:00
Introduction to Ankur Goyal
1. CVClaire Vo
  And still in, as I say, the year of our Claude 2026, I still talk to engineers that say, "AI on our most complicated things cannot do a good job."
2. AGAnkur Goyal
  I so viscerally disagree with. There's no staff engineer who is running as many rigorous benchmarks and trying out different algorithms and analyzing ideas manually than someone who's using an agent. Everyone should take a hard look in the mirror and reevaluate how they spend their time. There's a lot of interactions that you have or direction that you're giving or decisions that you're making, and I think, like, many of these things, to me, fit below the agent line. I think the agent line keeps going up.
3. CVClaire Vo
  Why do you think this concept is so important to understand? How can you just demystify it for folks who are a little intimidated by it?
4. AGAnkur Goyal
  Now that models are so good at actually writing code, one of the best things that we can do is create really hard evals. And if you create the right tests and success criteria for a model, then it can be really creative, and it can work on this stuff in the background and actually try to improve a bunch of things.
5. CVClaire Vo
  I have a lot of people saying, "Wow, if I go as so far as to turn my own taste or my own skills or my own expertise into a system, I'm functionally just building my own replacement."
6. AGAnkur Goyal
  We're able to have David's palate applied to more things. I think the quality bar that we're able to hit is higher because we're able to get more things to that bar. [upbeat music]
7. CVClaire Vo
  Welcome back to How I AI. I'm Claire Vo, product leader and AI obsessive here on a mission to help you build better with these new tools. Today I have Ankur Goyal, the CEO of Braintrust, and this is a technical one, so if you're a senior or staff engineer or a VP of engineering or a CTO, this is one you're really gonna wanna pay attention to. And we're gonna talk about how coding agents can help you bite off really technical architecture and infrastructure work in a way that no other human engineer could before. We're also gonna demystify evals for folks and just show you exactly how you can use them to make your AI products better without having to touch a thing. Let's get to it. This episode is brought to you by Guru, the AI layer of truth for your company's knowledge. Here's the problem: your AI is only as good as the information you feed it. Most companies are getting confident but wrong answers from AI because their underlying knowledge is outdated, incomplete, or just plain incorrect. Bad information doesn't just slow you down. It costs you money and puts you at risk. Guru solves this by adding a verification layer between your company's knowledge and AI tools. Instead of just hoping your AI gets it right, Guru automatically scores content for accuracy, flags outdated information, and ensures your team gets trustworthy answers every time. It works with the tools you already use, so you don't have to change how you work. Thousands of companies trust Guru to keep their AI accurate and compliant. Ready to stop playing Russian roulette with your company's knowledge? Visit getguru.com
3:00 – 6:10
Using AI agents for database optimization
1. CVClaire Vo
  to learn more. Welcome to How I AI. I'm excited to have you here.
2. AGAnkur Goyal
  I'm super excited to be here. Thanks for having me.
3. CVClaire Vo
  So I'm gonna make you laugh, but I recently did an episode about the recent GPT 5.5 model release, and I know you and I use Codex, and one of the funniest comments in that post was, "Claire, can you do an entire episode about tech debt?" And we were talking before we got on the recording, you were like, "How technical and how nerdy is this audience?" And I'm like, "Bring it on." So we are gonna talk a little bit about how you approach engineering and then how you use AI to do things like optimize slow queries. So let's, let's hop in. Tell me, tell me about your approach to, to software engineering in the age of AI.
4. AGAnkur Goyal
  You know, I spend a lot of time working on software for doing evals and observability, and that's kind of shaped my own perspective about software engineering. Like, now, now that models are so good at actually writing code, one of the best things that we can do is create really hard evals. And not, I'm not talking about, like, AI evals. I mean things like, "Why is this query so slow?" And if you create the right tests and success criteria for a model, then it can be really creative, and it can work on this stuff in the background and actually try to improve a bunch of things. So, um, one of the things that I spend a lot of time on right now is making the queries that people run in our product faster. And people can just write arbitrary queries. Like, you know, they can... There's, uh, an example of someone who's trying to find, like, needle in a haystack of some, um, specific kind of interaction someone had in their product, and they're looking at, like, billions and billions of traces, and they wanna find, like, the 5,000 or something that match. And this is over, like, a 90-day period or something, like a lot of, a lot of data. And that's one example of, of a query. And, like, okay, there are all these things that you can do in database, uh, literature, like different indexes you can build and different ways you can prefetch data and blah, blah, blah, all this stuff. But how do you try all those things, and, and, and how do you, um, run all the experiments required to, to actually do something like this? So what we do, and what I've personally spent a lot of time working on, is trying to figure out, you know, manually is fine, but automatically is even better. Like, what are the patterns of queries that people are running that are slow? And then we will reproduce those things and use, um, a coding agent, uh, to try out a bunch of ideas from database literature. So, like, download a bunch of data locally, and then maybe try different... 
I- in this case right now, I'm trying out different column store formats. So we use an index underneath the scenes called Tantivy, which has a built-in column store, um, but it's not that great. Like, the thing overall is great, but their column store is not, like, that great. And so what we're doing right now is, like, exhaustively trying every open source column store format out there and then exhaustively trying every column store execution engine out there and sort of computing the matrix of this. And, uh, it's, you know, it's like, it's amazing.
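The "matrix" Ankur describes, every column store format crossed with every execution engine, can be sketched as a small harness. This is a minimal illustration, not Braintrust's benchmark code: the format and engine labels are hypothetical placeholders (the episode does not list which ones were tried), and `run_query` is a stand-in workload where a real harness would load the data and run the slow query.

```python
import itertools
import statistics
import time

# Hypothetical labels: the episode does not name the formats or engines tested.
FORMATS = ["format_a", "format_b", "format_c"]
ENGINES = ["engine_x", "engine_y"]


def run_query(fmt: str, engine: str) -> int:
    """Stand-in workload. A real harness would read data stored as `fmt`
    and execute the slow query through `engine`."""
    return sum(i * i for i in range(20_000))


def benchmark(fmt: str, engine: str, repeats: int = 5) -> float:
    """Return the median wall-clock seconds over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(fmt, engine)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def benchmark_matrix() -> dict[tuple[str, str], float]:
    """Time every (format, engine) pair, i.e. the full matrix."""
    return {
        (fmt, engine): benchmark(fmt, engine)
        for fmt, engine in itertools.product(FORMATS, ENGINES)
    }


if __name__ == "__main__":
    results = benchmark_matrix()
    # Print the fastest combinations first.
    for (fmt, engine), seconds in sorted(results.items(), key=lambda kv: kv[1]):
        print(f"{fmt:10s} {engine:10s} {seconds * 1000:8.2f} ms")
```

Taking the median over several repeats is one simple way to damp noise; a coding agent can then extend the two lists and rerun the whole grid without anyone retyping benchmark code.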
5. CVClaire Vo
  I completely agree.
6:10 – 9:03
Running exhaustive benchmarks with coding agents
1. CVClaire Vo
  As somebody who has led engineering organizations for a really long time, when you're trying to make infrastructure platform core component changes in your application, because of both the cost of implementing those being very high and then the unknown unknowns being quite risky, teams are actually pretty risk-averse in terms of making big platform shifted, shifts or changes to their core implementa- It's like the thing that you shipped is the thing that you get stuck with certainly on, on the engineering side. And what I love about AI right now, and these coding agents in particular, and then Codex in particular, particular, is it has been the only setup, Codex plus these GPT models, has been the only setup where I have been able to set up a very similar process, which is the outcome I want is X, Y, Z. We need to programmatically test against pretty long tail data structures to figure out which of these potential solutions are gonna get us closer to the outcome we want. In your instance, it's database query speed and latency. In my instance, I was doing a very co- you can appreciate this, very complex data, data migration of-
2. AGAnkur Goyal
  Oh, right
3. CVClaire Vo
  ... door, like stored, structured, and unstructured data generated by AI, so it was all, like, messed up to begin with, and then I had to migrate it to a schema. And so it was like schema to schema migration, millions and millions and millions and millions of rows, and lots of edge cases. And doing that as a human takes forever. N- you know, you could script it and you can, like, bang, um, some systems against it, but then your human ability to manage those cycles and say, "Yes, that's right," or, "No, that's wrong," or, "This gives us indication that we should go left," or, "That gives us indication we should go right." And so I do feel like this combination of, like, a very precise outcome and an agent that's smart enough to bang its head against a really, really long tail of problems with a guided sense of the technical space, it does really well. And I have not heard this on the kinda like data store side. It's really interesting. But I just think, hey, engineering leaders out there, I've had s- I've been in so many debates about what we're using for our data store, how we optimize performance, what technologies we should bring into the stack versus not, and you can run those, like, very, very iterative loops on... I'm, I'm presuming you're using production-like data, um, or, or real representative queries to, to test that. Is that right?
4. AGAnkur Goyal
  You can actually use production data too, but for some subset of things, and with the right engineering in place, you can just run on production data.
5. CVClaire Vo
  Yeah. Yeah.
6. AGAnkur Goyal
  Um, and in many, in, in many ways, it's a lot safer than having humans test on the production data 'cause no one, no one's looking at it.
9:03 – 11:30
Why staff engineers are wrong about AI limitations
1. CVClaire Vo
  Yeah. And this is where I have so many staff engineers be really, really, um, cynical about does AI have a place in their, their, their coding tools. I'm still in, as I say, the year of our Claude 2026, we still, I still talk to engineers that say AI on our most complicated things cannot do a good job.
2. AGAnkur Goyal
  Oh, I, I, I so viscerally disagree with that.
3. CVClaire Vo
  Same.
4. AGAnkur Goyal
  Yeah.
5. CVClaire Vo
  Tell me why you disagree.
6. AGAnkur Goyal
  Well, I, I mean, I think... So I've been working on databases for almost two decades. There's n- not many things that staff, whatever, risk-averse, blah, blah, blah, all that stuff you could app- apply to than, like, literally building a database. If you work on a database, uh, uh... We recently added this, like, fancy index thing into Braintrust that uses bloom filters. And by the way, we discovered that that would be a practical solution to the problem after running, like, a week of continuous experiments with different types of indexes. Bloom filters kind of have a bad reputation, but they, they worked out to be very effective in this case. So if you, if you, if you build something like that, usually what happens is the very best engineers will run a few benchmarks, and then you'll send it to your peers, and then your peers will shit all over it and rip it apart and say, "You didn't benchmark this. You didn't benchmark that." And what you do is you prioritize the f- top few benchmarks, and then you probably bullshit the rest. Like, "Oh, I, you know, I didn't benchmark this. However, if you read the code, you'll see it's not N squared, it's log N, and so this is not gonna happen." And, and, like, half the time you're wrong. Now, there is no excuse to not do those benchmarks. So now, I, I love it. Like, we don't... I'm not spending my time, like, sitting and, like, typing the benchmark code, but I'm talking to people. We're looking at the code. We're looking at the thing. We're like, "Okay, well, we've inda- we, we've benchmarked how f- much faster it makes the queries. Did we actually do a good job of benchmarking how much slower it makes indexing? Oh, shit, no, we didn't do that." And so we actually spent some time doing that, and we discovered that we were doing a terrible job at indexing it efficiently, and so we spent a lot of time on that. And I, I could sort of s- I, I don't agree with this. I could... 
Some- I, I can empathize with the argument that models aren't good at writing highly concurrent code, or they're not good at writing very performance sensitive code. But the... I- There's no staff engineer who is running as many rigorous benchmarks and trying out different algorithms and analyzing ideas manually than someone who's using,
11:30 – 14:00
The “agent line” framework for delegation
1. AGAnkur Goyal
  uh, an agent, and even that baseline is just incredible.
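For readers who have not used the bloom filters Ankur mentions above, here is a minimal sketch of the data structure itself. The episode does not describe how Braintrust's index uses them, so the class, its default sizes, and the hashing scheme below are all illustrative assumptions.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: answers may be false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive `num_hashes` bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )
```

The trade-off Ankur raises shows up directly here: `might_contain` lets a query skip data that definitely lacks a value, while every `add` is extra work at indexing time, which is why both sides need benchmarking.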
2. CVClaire Vo
  I agree. And I think there's this theoretical quality, and then there's this practical quality, right? W- in, in a theoretical ideal world in which we don't sleep, and every time we sit down at our laptop, we, we end up writing perfect code. And in a theoretical world in which those benchmarks are, all of them are run, not just the BS ones. Like, in that theoretical world, you could theoretically say perhaps in some untested case you get better quality when, when humans are hands-on. But the practical application is you lose context over... Like, the humans lose context on the problem-
3. AGAnkur Goyal
  Mm-hmm, mm
4. CVClaire Vo
  ... over days. You have a decaying attention span towards hard but tedious problems. And so I do think the practical quality goes down. And, and so I tell people, like, the practical quality of integrating AI into your engineering process on very hard technical problems goes up simply because of how hard you can run at the problem and how long and consistently you can run a- a- against the problem. And then, you know, what I was gonna go back to saying, which is you can bite off much more interesting technical challenges with AI as your, you know, sidecar than you could before. Again, practically because your company can support the cost of doing so.
5. AGAnkur Goyal
  Oh, yeah.
6. CVClaire Vo
  Right?
7. AGAnkur Goyal
  Yeah.
8. CVClaire Vo
  If you're like, "I wanna sequester all my staff engineers to solve our database indexing problems for the next year, and we're just, like, really gonna go, go deep in the weeds, and we're gonna test these six different open source, you know, solutions to this, and we'll come back in a year, and we'll tell you if we figured it out, not-
9. AGAnkur Goyal
  Yeah
10. CVClaire Vo
  ... that we figured it out," business is like, "No." You know? You're, you're... They're a CEO. Like, "No. No thank you." But if you say, "Hey, we're gonna have this thing in the background, and we're gonna check on it, and we're gonna make expedient progress." Sure. "And we can ship other stuff while we're at it," I think that's a really easy yes.
11. AGAnkur Goyal
  Absolutely. Yeah, I mean, I think the, the motto that we have now is there's just no excuse to not have rigor. Like, if... A- and there's no excuse to not have performance. If someone complains about something, if someone complains about a, a paper cut in the UI, you know, whatever it is, there's just no, there's no... We don't really have a backlog. Like, there's-
12. CVClaire Vo
  Yeah
13. AGAnkur Goyal
  ... there's no excuse to, to just not improve these things.
14. CVClaire Vo
  Yeah, and, and for folks looking for we don't have a backlog inspiration, we just interviewed Brian from Intercom, who said their goal is, like, backlog zero. Nothing in the backlog so that everything can get shipped.
14:00 – 17:16
Ankur’s workflow: running 4 to 6 concurrent agents
1. CVClaire Vo
  Okay, so we're solving really technical problems. I think this is a great approach. How are you engineering with AI? Because I love that you're still... You know, you're writing code. You're spending time on this. Any tips or tricks for how you're managing your, your fleet of agents that you think are unique?
2. AGAnkur Goyal
  I think that everyone should take a, a hard look in the mirror and reevaluate how they spend their time. There's a lot of interactions that you have or direction that you're giving or decisions that you're making, and I think, like, many of these things to me fit below the agent line. And to me, the agent line is, like, if I or whoever it would be at the meeting or whatever, like, if we equivalently took the information that we're discussing and we just gave it to an agent, would it solve the same problem? And, and I think the agent line keeps going up. And also, I think the best people are pushing the agent line inside of their company by being smart about what skills they're writing and what integrations they're building and so on. So o- once you do that, you likely have a lot more time than you thought you did. I don't take any meetings after 12:00. This is the last meeting of the day for me. And, uh, that means that every day I am able to, in the Paul Graham framework of Maker vs. Manager Schedule, every day I'm able to enter the level of focus that's required to be in the maker schedule. And so I, I personally write a lot of code, and, and I spend a lot of time writing code, and I, I've, I haven't spent as much time writing code in, in a while, and I, I really love it. So that's number one, is, like, make the time. Um, my, my workflow is, is very simple right now. We don't have a great background agent setup yet. I think that we are, uh, exploring various things and trying to get there, but I, I have usually f- five or six foreground agents running on my computer. Each one is a tmux session. Right now I have four things I'm working on, so each one is a tmux session. They're named Braintrust 1 through Braintrust 4. [chuckles] And, and, you know, each of these has, like, um, some UI running, and it has some services running. There are problems like port collisions, like, I can't isolate everything as much as I'd like to. 
And, and I think that there are a lot of solutions for trivial software that do this. There's not a lot of solutions for complicated software yet, and I'm excited. I mean, everyone I talk to is building their own thing. I just met a startup that's, like, two months old, and they built their own internal tool for doing background agent PRs, which, uh, is... I'm, I don't judge them for it. Like, I don't know what else they would do, but it's kind of crazy. And then I also have r- remote one. So here's one where I'm, I'm working on trying to improve our column store performance, and this is running on not real data, but close to real data, and it's running remotely, and it's, you know, it's, it's, it's running, like, much more scale and many... And I mean, if I ran this on my computer, it would, it, it would, uh, probably die from [chuckles] just how much, uh, compute it's using. But, but I'm, I'm able to, to, uh, in this case test, like, what's the real latency between EC2 and S3 if I'm trying to do, like, 4,000 concurrent reads. Is it enough? Is it not enough for this workfl- workload? Can I interleave things whatever properly? And I've been running this experiment for, for several days just trying to figure out, like, what's the best... You know, right now I'm, I'm, I'm talking to it about what the indexing life cycle should be because it, I think we figured out how to make the
17:16 – 20:32
Technical setup: foreground agents, background agents, and cloud environments
1. AGAnkur Goyal
  queries fast enough.
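The kind of experiment Ankur describes, measuring latency under thousands of concurrent reads, might be shaped roughly like the sketch below. It is an assumption-heavy illustration: `fake_read` only sleeps to simulate network latency and would need to be replaced with a real object-store client call, and the percentile math is deliberately simple.

```python
import asyncio
import random
import time


async def fake_read(key: str) -> bytes:
    """Stand-in for an object-store read; sleeps to simulate network latency."""
    await asyncio.sleep(random.uniform(0.001, 0.005))
    return b"x"


async def timed_read(key: str, limiter: asyncio.Semaphore) -> float:
    """Return seconds spent in the read itself (time queued on the limiter is excluded)."""
    async with limiter:
        start = time.perf_counter()
        await fake_read(key)
        return time.perf_counter() - start


async def measure(num_reads: int, concurrency: int) -> dict[str, float]:
    """Issue `num_reads` reads with at most `concurrency` in flight at once."""
    limiter = asyncio.Semaphore(concurrency)
    latencies = sorted(
        await asyncio.gather(
            *(timed_read(f"obj-{i}", limiter) for i in range(num_reads))
        )
    )
    return {
        "p50": latencies[len(latencies) // 2],
        "p99": latencies[min(len(latencies) - 1, int(len(latencies) * 0.99))],
        "max": latencies[-1],
    }


if __name__ == "__main__":
    print(asyncio.run(measure(num_reads=4000, concurrency=4000)))
```

Sweeping `concurrency` across a few values and comparing the tail latencies is one way to answer the "is it enough for this workload" question with data rather than intuition.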
2. CVClaire Vo
  Some people are gonna be listening to this and being like, "Oh my gosh, this is so technical. I don't have these problems." Let me take a step back for folks and tell you what I, I think I'm seeing here, which is, one, you're using Codex, right?
3. AGAnkur Goyal
  Yeah.
4. CVClaire Vo
  Codex for hard problems, people. I'm telling you. Just, uh, that, that-
5. AGAnkur Goyal
  I, I think it's currently the only model that will disagree with you regularly.
6. CVClaire Vo
  Yes.
7. AGAnkur Goyal
  And I think if you're working on hard problems, it's very important.
8. CVClaire Vo
  And then for you, what I am also hearing is you're using foreground agents. You can bas- you basically have a personal concurrency limit of, like, let's call it four.
9. AGAnkur Goyal
  Sure, yeah.
10. CVClaire Vo
  Um, which is about what, about what I can do as well. So I think people ask me all the time, "How do you handle all this context?" I'm like, "I don't do more than I think I can do at any one time." And I also, I, I have more trivial problems than you, so I think you're right in that the current sort of commercial background agents, I would call them, that you can buy off the shelf, uh, work very well for web a- like standard web apps. I'm very happy with them. If you are not using one of them as an engineering organization, maybe it's, like, doing classic SaaS, highly, highly, highly recommend. But I am hearing more and more from teams two things that you called out. I am hearing more and more people are just building their own background agents. So it's happening. It's happening in teams very, very big and very, very small. I think the primitives are there to start experimenting with it, and so I don't think it's gonna be as surprising to us to hear about people building their own internal coding background agents, even if, like, core infrastructure is something from the big, the big models, model providers. I think the second thing that I'm hearing a lot, and we heard this from the Stripe team, is investment in cloud, um, development, uh, environments and remote, remote computing. Again, because if you were to run some of the stuff, especially the, the data-heavy stuff on your computer, it starts to sound like an airplane taking off. It's no good. And then the last thing I heard you say, which is, like, ports. I, I joke with everybody, I say, "Worktrees everywhere, ports 3000 through 3009 accounted for." Like, I am just, like, every- everything... And I have to call out, uh, Chris Tate at Vercel released a thing called Portless, which just makes managing multiple ports, [laughs] localhost ports on your, uh, local machine a little nicer. So for simple things, I would go look that up. We'll link it in the, in the GitHub show notes. 
But, uh, you know, common problems that I think people have running concurrent engineering processes on their own machine. And then the, like, meta thing, which is just, like, make time to code.
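The port-collision problem Claire jokes about can be handled with a few lines of standard-library code. This sketch is not how Portless works (the episode does not describe its internals); it simply probes a range, here the 3000 through 3009 range from the joke, and returns the first port that can be bound.

```python
import socket


def find_free_port(start: int = 3000, end: int = 3009, host: str = "127.0.0.1") -> int:
    """Return the first port in [start, end] that can currently be bound on `host`."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # Port is already in use; try the next one.
            return port
    raise RuntimeError(f"no free port between {start} and {end}")


if __name__ == "__main__":
    print(find_free_port())
```

The probe socket is closed before the function returns, so another process could still grab the port before a dev server binds it. For a handful of local agent sessions that race is usually tolerable, but it is worth knowing about.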
11. AGAnkur Goyal
  You need it.
12. CVClaire Vo
  Every-
13. AGAnkur Goyal
  Yeah, yeah
14. CVClaire Vo
  ... everyone. I also don't take meetings after 1:00. Sometimes I'll do podcasts in the early afternoon for folks, but all afternoon I'm just, like, in my real state, which is hoodie on, bad posture-
15. AGAnkur Goyal
  Yeah, and I, I think that-
16. CVClaire Vo
  ... fully coding
17. AGAnkur Goyal
  ... I'm, I'm sure you feel this too, but, like, there- when I was handwriting most of my code, um, I would enter this sort of, like, euphoric flow state where I-
18. CVClaire Vo
  Mm-hmm
19. AGAnkur Goyal
  ... you know, I just sort of completely focused on a problem. And then when I started doing a lot of agent coding, I lost that for a little bit. But now when I'm writing code, you know, Lane 8 just released a new album yesterday. You should listen to it. [laughs] Put on, put on your h- your hood and your, your headphones. I'm, like, way... I'm, like, totally back in that state now, um, just doing a different workflow.
20:32 – 23:06
Spending time with AI tools
1. CVClaire Vo
  Yeah. And I'll give folks the sort of, uh, you know, AI mom of the internet that I try to be, which is I do feel like a lot of people are pr- they, they kind of go into two camps. They are having more fun than they've ever had before, and they're back in the flow state of, like, what got them into software engineering or building or technology or whatever. Or they're approaching, like, Claude anxiety, burnout breakdown, because they feel this, like, productivity anxiety.
2. AGAnkur Goyal
  Mm-hmm.
3. CVClaire Vo
  And they're not... I, I think, I think what I see is that people feel like if they're in a meeting and they're not kicking off agents, they're doing something wrong. Or f- or if they're-
4. AGAnkur Goyal
  Oh, yeah. Yeah. Yeah
5. CVClaire Vo
  ... talking to somebody and they're not kicking off agents, they're doing s- And I just say, like, I like the idea of chunking your t- [laughs] chunking your time with AI a little bit more.
6. AGAnkur Goyal
  Yeah, yeah, yeah.
7. CVClaire Vo
  Um, I think it just narrows you on the more, more productive, um, pieces of it, and is al- also just a more enjoyable way to get stuff done.
8. AGAnkur Goyal
  Yeah. I had a phase, which I think I'm over. Um, you know my wife, Alana.
9. CVClaire Vo
  Yep.
10. AGAnkur Goyal
  Um, uh, where I, we would have... We alwa- we have dinner together usually e- like, pretty much every night. Uh-
11. CVClaire Vo
  Yeah
12. AGAnkur Goyal
  ... um, and so I had a phase where my laptop was not at the table, but open and on the couch.
13. CVClaire Vo
  Oh. [laughs]
14. AGAnkur Goyal
  And, and I, I think I've progressed beyond that phase now. So now the laptop is closed, and it, I think it's, it's an important, it's an important thing.
15. CVClaire Vo
  I, I agree. I, uh, when I was first using OpenClaw, I installed it on an old MacBook, and it would, like, stay open on our kitchen island, which is where all our plugs are, and it would, like, hover over us at dinner and hover over us at, at breakfast. And if it got moved, I was like, "Where is Polly? Is she alive? Is she open? Is she closed?"
16. AGAnkur Goyal
  Yeah.
17. CVClaire Vo
  So yes, close your laptop, people. Close your laptop. This episode is brought to you by Persona. You're learning to build with AI, but there's an important question you need to ask. Who is actually using your product? Is it a legitimate user, a bot, or a fraudster? Brex, Figma, Etsy, and Twilio trust Persona to answer that question. With Persona's identity verification platform, you can create branded experiences, automate fraud prevention, and know who is human online. That makes it easy to give good users an experience that makes them feel welcome, and to stop bad actors from causing damage. And for those of you building in the AI agent space, Persona helps you verify the identities of people, businesses, and developers behind agents. It's how companies like Lithic and Skyfire are pushing the frontier of agentic commerce. Learn more at withpersona.com.
23:06 – 26:02
Demystifying evals
1. CVClaire Vo
  All right. So, uh, you know, we covered the first half of this episode, which I think is very interesting for technical folks, how to have kind of, like, long-running or just really diligent agents run against technical problems to give you real benchmarks about performance on changing things. I love that. Second thing is just your core workflow on how you do coding, both how you dedicate time, and then technically just what your workflow looks like. Let's talk about evals, because I feel like this is something that's very intimidating to a lot of people. And obviously you build a product that supports this, but taking a step back, why do you think this concept is so important to understand, and how can you just demystify it for folks who are a little intimidated by it?
2. AGAnkur Goyal
  Machine learning specifically shifts the task of programming from being about the how to being about the what.
3. CVClaire Vo
  Mm-hmm.
4. AGAnkur Goyal
  And this is true, like, forget about LLMs. Like, you know, it's true with, let's say, like, you're, uh, back in, like, middle school, you're doing, like, remember statistical regression? You're not defining the... You're, you're computing what the slope and the y-intercept should be. You're not defining it, but you give it all the points, which are the, you know, the what, not the, the how, which is the slope and the y-intercept. And I think that, you know, the, the cool innovation around, like, transformers and, um, the next token prediction task which lets you, you know, ablate tokens and do all this cool stuff, it's all about saying like, "Okay, um, here's like the compute substrate, and here's the what," which is the outcome. It's predicting the next token. Can you go and use, uh, a lot of GPUs and figure out how to achieve that? And I think that, uh, if you take that as inspiration for anything you do with AI, then you're able to be more productive. And I think that applies to traditional programming, like what we just talked about. I'm not dictating exactly the implementation or even the s- set of algorithms that we're using to solve problems. I'm just trying to define very succinctly what the problem is and why it is a, a problem, and how to assess the solutions to the problem. It also applies to building AI software, and that's what evals are all about. Evals are a methodology for you to say, "This is what success looks like." In my opinion, evals are actually the modern version of a PRD. So a PRD, you would say, "Hey, in prose, this is what success looks like." Um, evals are also often written in prose, but you, um, supplement that with, with, uh, examples. Uh, so you know the best PRDs, they have good examples. [chuckles] Like they, maybe someone's made a demo or, uh, written out like a user story or something. It's, it's the same thing. 
Um, it, just the difference with evals is you encode those user stories in a way that can be quantified to some extent, and then you, and then, and then you let a model or whatever figure out the how, and, and you are really focused on
26:02 – 30:20
Live demo: Building an eval for documentation answers
1. AGAnkur Goyal
  the what.
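[Editor's sketch] The regression analogy above can be made concrete. In this minimal sketch you supply only the points (the "what"), and ordinary least squares computes the slope and y-intercept (the "how"). The function name `fit_line` and the sample points are made up for illustration and do not come from the episode.

```python
from statistics import mean


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    # Spread of x around its mean; zero means every x is identical.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
points = [(0, 1), (1, 3), (2, 5), (3, 7)]
slope, intercept = fit_line(points)
print(slope, intercept)
```

Nothing in the call site says what the slope should be; the data defines the outcome and the procedure finds the parameters.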
2. CVClaire Vo
  Give an example of how you use this in, in product development, just to make it a little bit more tangible for folks.
3. AGAnkur Goyal
  Yeah. Let's, let's start with something that I think is quite straightforward, and then we can venture into the less straightforward stuff as we go.
4. CVClaire Vo
  Perfect.
5. AGAnkur Goyal
  So this is our UI, um, and, uh, like I'm working on a very simple task here, which is I'm trying to create a prompt that will be part of an agent that is good at answering questions about Braintrust documentation. So we looked at a few questions that people are asking in our docs, and we just put them into a dataset. You can, like, upload a CSV file. Like, it doesn't matter. It's just come up with a list of some questions, or you can auto-generate them, what, you know, whatever. Just start somewhere. And wrote like a very basic prompt. Uh, we're gonna use GPT 5.4 Mini, and, uh, I attached an MCP server. So I attached the Braintrust MCP server. We were also playing around with Context7, which-
6. CVClaire Vo
  Mm-hmm
7. AGAnkur Goyal
  ... indexes docs for you. Um, you could also turn off the MCP and just see what the model already knows about your product.
8. CVClaire Vo
  Yeah.
9. AGAnkur Goyal
  They're getting pretty good at knowing about every product now as well. And, uh, here I just, I just ran it, and so you can see some of the answers. I, I'm gonna be honest though, I don't really want to, like, read all of these manually. And so what I would usually do is I just start by saying like, "Hey, can you come up with a good, uh, scoring function for these outputs? I care about having concise code snippets, only using one language, and, um, let's say avoiding em dashes."
10. CVClaire Vo
  [chuckles] Always.
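[Editor's sketch] The three criteria Ankur names (concise code snippets, only one language, no em dashes) could be written as a plain heuristic scorer along these lines. This is an assumption-laden illustration, not the scoring function the agent generated in the demo and not Braintrust's API: the name `score_answer`, the 15-line threshold for "concise", and the sample answers are all hypothetical.

```python
import re

TICKS = "`" * 3  # markdown code-fence delimiter
FENCE = re.compile(TICKS + r"(\w*)\n(.*?)" + TICKS, re.DOTALL)
MAX_SNIPPET_LINES = 15  # assumed cutoff for "concise"; tune to taste


def score_answer(output: str) -> dict:
    """Return a 0.0 or 1.0 score per criterion for one model answer."""
    blocks = FENCE.findall(output)  # list of (language tag, code body)
    languages = {lang.lower() for lang, _ in blocks if lang}
    longest = max(
        (len(body.strip().splitlines()) for _, body in blocks), default=0
    )
    return {
        "concise_snippets": 1.0 if longest <= MAX_SNIPPET_LINES else 0.0,
        "single_language": 1.0 if len(languages) <= 1 else 0.0,
        "no_em_dashes": 0.0 if "\u2014" in output else 1.0,
    }


# Hypothetical answers for illustration.
good = "Use the SDK:\n" + TICKS + "python\nprint('hi')\n" + TICKS
bad = (
    "Try this \u2014 or that:\n"
    + TICKS + "python\nx = 1\n" + TICKS + "\n"
    + TICKS + "typescript\nconst x = 1;\n" + TICKS
)
print(score_answer(good))
print(score_answer(bad))
```

Deterministic checks like these are cheap to run on every output; fuzzier criteria are usually handled by a model-graded prompt like the one shown later in the demo.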
11. AGAnkur Goyal
  Um, yeah, [chuckles] of course. Uh, and so now, um, in this case, GPT 5.4 is gonna go and actually, like, look at all this stuff for me, and it's gonna look at some of the outputs, and it's gonna rerun stuff, and, um, it'll kinda do its thing. And it's gonna come up with a new scoring function. One of the things I think, by the way, that's kinda cool about this workflow in general, and I expect to see this in more products over time, is that y- you'll notice, like, I have this in the equivalent of, like, unhinged mode of, uh, a coding agent, which is sometimes dangerous to run on your machine. But this agent is running inside of this playground, and it's using, like, data and, and some prompts and stuff, so the risk of letting it just go-
12. CVClaire Vo
  Yeah
13. AGAnkur Goyal
  ... and try stuff out is actually very low. And so I, I, I think, uh, I, I'm excited just generally about seeing agents in more environments outside of my local computer with Bash and something that's, like, very dangerous and, you know, could screw up my, my life if it goes wrong.
14. CVClaire Vo
  Yeah.
15. AGAnkur Goyal
  Um, I'm excited about just having more agents that sort of run in, in these types of environments and do whatever they want. Like, I, I don't even know what this is doing right now, but we'll find out in a few minutes.
16. CVClaire Vo
  I'm, I'm really excited about this. And just for people that are not watching or need just a, a, another set of context, basically what you did is you took these questions that people are asking in your doc site or search or whatever, chat bot, about, um, how the product worked. You built a little prompt to answer those questions, and then right now you're building, you're having AI build a scorer that tells you how well these questions are getting answered based on, like, a very loose definition of what you want it to do. And then does that scoring mechanism get applied across all of these so you can actually rank them?
17. AGAnkur Goyal
  Yes. Yeah, yeah. I think it's going a little bit awry, actually, so I'm gonna switch-
18. CVClaire Vo
  Okay
19. AGAnkur Goyal
  ... to this one, which is a little bit better. Um-
20. CVClaire Vo
  You do it live. We love a live demo.
21. AGAnkur Goyal
  [chuckles] I know. Uh, and, and let's use, uh, let's use Claude and give it a shot. So, uh, this one is a little bit cleaner, and it, it actually wrote a prompt. Um, well, let's use a smarter model. It didn't pick the smartest model. It wrote a prompt which, uh, takes the input and the output, and then it evaluates it on these criteria.
22. CVClaire Vo
  Mm-hmm.
23. AGAnkur Goyal
  It is a pain in the ass to write these criteria out by hand, so it's really nice to just let a model do it for you.
24. CVClaire Vo
  Yep.
25. AGAnkur Goyal
  And what we can do is, is run it, and it will, it will quantify how well, um, how well the model does, uh, on, um, on these criteria.
26. CVClaire Vo
  Mm-hmm.
27. AGAnkur Goyal
  And then we can look at it, like, one by one, or, or what I actually tend to do nowadays is, um, look at it in aggregate, and so the scores
30:20 – 32:09
The alternative to evals: vibe checks and whack-a-mole
1. AGAnkur Goyal
  will start coming in here.
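[Editor's sketch] Looking at scores "in aggregate" rather than one by one amounts to averaging each criterion across the dataset. A minimal sketch, assuming each row is a dict of per-criterion scores between 0 and 1; the rows below are invented, not results from the demo.

```python
from collections import defaultdict
from statistics import mean


def aggregate(rows):
    """Average each criterion's score across all evaluated examples."""
    by_criterion = defaultdict(list)
    for row in rows:
        for name, value in row.items():
            by_criterion[name].append(value)
    return {name: mean(values) for name, values in by_criterion.items()}


# Hypothetical per-example scores.
rows = [
    {"concise_snippets": 1.0, "single_language": 1.0, "no_em_dashes": 1.0},
    {"concise_snippets": 1.0, "single_language": 0.0, "no_em_dashes": 1.0},
    {"concise_snippets": 0.0, "single_language": 0.0, "no_em_dashes": 1.0},
    {"concise_snippets": 1.0, "single_language": 1.0, "no_em_dashes": 1.0},
]
summary = aggregate(rows)
# Print the weakest criterion first, since that is where to look next.
for name, avg in sorted(summary.items(), key=lambda kv: kv[1]):
    print(f"{name}: {avg:.2f}")
```

Sorting by the lowest average points you at the criterion worth inspecting row by row.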
2. CVClaire Vo
  What's the alternative that people are... What do you see people doing as an alternative to this that you think is less effective? One is just not doing it. I know a lot of people-
3. AGAnkur Goyal
  Yeah, I mean, I think that a lot of people, and I fall into this trap myself despite working on this product, so it's not-
4. CVClaire Vo
  Mm-hmm
5. AGAnkur Goyal
  ... there's no judgment for, [chuckles] for doing this. But I think what a lot of people do is they just try stuff out on one or two examples, and they try to generalize from that.
6. CVClaire Vo
  Yeah.
7. AGAnkur Goyal
  And frankly, I don't think that's a bad idea. Like, I think that vibe checks are extremely important. But, uh, what's, what, what happens is that if you do this, you end up playing kind of like a whack-a-mole game.
8. CVClaire Vo
  Yeah.
9. AGAnkur Goyal
  So you might make it really good at one or two things, then you ship it, and then it's not good at something else. And what we do, we have this designer named David, uh, and David is really cool. Like, he dresses well.
10. CVClaire Vo
  [laughs]
11. AGAnkur Goyal
  He has, like, like he's into the latest music. He, he, he, like, he likes music before other people do. He told me that when he was a kid, he ha- he played soccer, and everyone had black shoes, and he wanted the orange ones, and then the next year everyone wanted the orange shoes.
12. CVClaire Vo
  Okay.
13. AGAnkur Goyal
  Um, so he's, like, that kind of person, right?
14. CVClaire Vo
  Yeah.
15. AGAnkur Goyal
  And we have a lot of AI stuff going on. So it's not practical for David, who has, like, the ultimate, who's the ultimate Braintrust tastemaker-
16. CVClaire Vo
  Yeah
17. AGAnkur Goyal
  ... to look at everything manually. And what I actually do is I run a shit ton of evals to try to quant- quantitatively improve things, and then when I feel like the evals are good and my own less sophisticated palate thinks that the results are good, I will go to David and ask him for a vibe check, and I probably do that, like, once every few days. And then David gives me the vibe check, and, like, half the time he just completely destroys everything that I've said. Like, "Hey, you know, you think it's good, but it's actually not very good." And then what, what, what happens is I will go back and try to capture what David said, and
32:09 – 33:13
Capturing designer taste in scoring functions
1. AGAnkur Goyal
  I'll say, like, you know, "Hey, David actually thinks, um, it's okay to show both languages as long as, you know, blah, blah, blah, blah, blah." And then I will... So I'll try to sort of capture David and then improve the scores, and then attempt to quantify David. And then the next time I go to him, I don't, like, repeat the same mistake, but I still get his vibe check.
2. CVClaire Vo
  Well, and I just have to call out the, the meta thing here in this David story, which I love, which is I have a lot of people saying, "Wow, if I go so far as to turn my own taste or my own skills or my own expertise into a system, whether that system is like the David eval, the David-in-a-loop judge, or something else, I'm functionally just building my own replacement." And I am presuming, because I do, and it sounds like you do, too, you value David more in this system.
3. AGAnkur Goyal
  Oh, yeah. Yeah, yeah. We're able to have David's palate applied to more things. Like, the, the, I think the quality bar is, we're, that we're able to hit is higher because we're able to get more things to that, to that
33:13 – 33:44
Quick recap
1. AGAnkur Goyal
  bar.
2. CVClaire Vo
  Ah, I love it. Okay, so this has been a powerhouse episode, one of my favorites. We've talked about, a lot about, you know, solving really technical problems with AI. We've demystified evals a little bit for folks and shown how in, in a safe space, you can actually let AI... I think that's one of the meta themes of this, is in a safe space, you can let AI run with a lot of autonomy, and you'll, you know, throw a lot of data at it, and you can get higher quality outcomes, much more so than if you were to manually fix things or even manually
33:44 – 35:40
Managing velocity and throughput
1. CVClaire Vo
  evaluate things. I'm gonna do a quick lightning round, and then we'll get you back to, I mean, it's almost noon, so back-
2. AGAnkur Goyal
  It's time to code
3. CVClaire Vo
  ... back to coding.
4. AGAnkur Goyal
  Yeah. Yeah.
5. CVClaire Vo
  Time to code. One, I have a question. When you say there is no excuse, there's no excuse for bugs, there's no excuse for little design nits, there's no excuse for that, how do you feel like you practically... I maybe have two questions that you can answer. They'll be our two lightning round. How do you practically manage the velocity to customers, which is, do you ever get customers being like, "Wait, what's this? Wait, what's that?" Like, too much features just consumed as a customer? And then two, how do you technically manage the throughput into the system?
6. AGAnkur Goyal
  Product building and code writing is, now looks like carving rather than constructing. So it's very fast to create something that has too many features and too many buttons and too much code, and you need to spend a lot of time removing stuff. And so we actually, n- I would say 90% of the time someone complains about something, we remove the thing that was causing confusion and just make the system work better. 'Cause we understand, now that the person complained, their point of view, and we're able to build a product that doesn't even need the complexity that led them to the confusion in the first place. I'll give you an example. If you load a trace and you imagine hitting Command + F, you might in your brain think that that's just searching what's on the page, but what's on the page might be hundreds of megabytes of text, and it's virtualized, and then there's, it's across spans, and there's also a table. So we had a very powerful search implementation that would search across the spans and rank everything and, you know, blah, blah, blah, all this cool stuff. And then w- a lot of people complained, and they were just like, "Why is this? You know, I just hit Command + F. I just want it to show the, the thing." And we've just, we've, we've really simplified it over time. So I think, I think we, we try to carve.
35:40 – 37:30
Why CI/CD investment is critical for AI-accelerated teams
1. AGAnkur Goyal
  And then r- in terms of technically managing it, we spend a lot more time working on CI than we used to. Uh, and so I think that a lot of, um, platform effort has shifted so that if we are really good at CI, then we're able to move faster. And if we feel like we're constrained, then instead of shipping a bunch of crappy stuff, we're like, "Okay, let's pause and improve CI so that we earn the ability to move faster."
2. CVClaire Vo
  Okay. Again, for the VP of engineering in the back, invest in C... I've, I've told everyb- They're like, "How do I accelerate my engineering velocity with AI?" I was like, "Fix your CI."
3. AGAnkur Goyal
  Yeah. Yeah. I mean, I think-
4. CVClaire Vo
  Start there
5. AGAnkur Goyal
  ... every engineer is now building a platform, and upon the platform, agents are doing the work that the engineers were doing manually, right? And I think that applies to evals. Like, if you're an engineering team and you're building an AI product, the number one job for you is to build a feedback loop. Meaning you have a, a pipeline that allows you to summon from the ether of real world data and turn that into evals. And as an engineering team, that is your number one job. It is not prompt engineering. It's not picking an agent framework. It's not rewriting your database, whatever. It's creating that pipeline. And the same is true, C- CI is that same idea, but applied to software engineering.
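[Editor's sketch] The feedback loop described here, a pipeline that turns real-world data into evals, can be pictured as a small transform from production logs to new eval cases. Everything in this sketch is hypothetical: the `LogRecord` and `EvalCase` shapes, the thumbs-down signal, and the dedup rule are assumptions for illustration, not Braintrust's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogRecord:
    question: str
    answer: str
    thumbs_up: Optional[bool]  # None when the user left no feedback


@dataclass(frozen=True)
class EvalCase:
    input: str
    bad_output: str  # the answer users rejected, kept as a known failure


def logs_to_eval_cases(logs, existing_inputs):
    """Promote thumbs-down interactions to eval cases, skipping duplicates."""
    seen = {q.strip().lower() for q in existing_inputs}
    cases = []
    for rec in logs:
        key = rec.question.strip().lower()
        if rec.thumbs_up is False and key not in seen:
            seen.add(key)
            cases.append(EvalCase(input=rec.question, bad_output=rec.answer))
    return cases


# Hypothetical production logs.
logs = [
    LogRecord("How do I log a span?", "Use print().", False),
    LogRecord("How do I create a dataset?", "Upload a CSV.", True),
    LogRecord("how do i log a span? ", "Try logging.", False),  # duplicate
    LogRecord("What is a scorer?", "Not sure.", None),
]
new_cases = logs_to_eval_cases(logs, existing_inputs=["What is a playground?"])
print(new_cases)
```

Run on a schedule, a transform like this keeps the eval dataset growing from the failures users actually hit.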
6. CVClaire Vo
  Well, and I'll give one other tip, which is you think that those evals, people are always like, "Oh yeah, for my AI product I need that." I have seen, again, I think the Intercom team has run a bunch of evals on their internal use of Claude Code to figure out where engineers are hitting pain points, where people are giving up, where the agents are asking for permissions that have to be escalated. And I think that sort of analysis on your team is very, very important and ultimately
37:30 – 39:10
Ankur’s prompting strategy when agents fail
1. CVClaire Vo
  gets you to these, these better outcomes. Okay, last question. You seem like a very reasoned person, so I'm, I'm presuming I'm gonna get a very reasonable answer, but I ask everybody. When a, one of, one of your four tabs is not doing what you want, when the evals are failing the David test, what is a in your back pocket prompting strategy that you, that, that you rely on? Do you yell? Do you bribe?
2. AGAnkur Goyal
  Close the session, um, and then I improve the evals, and then I try from scratch a- again. Yeah, yeah. Um-
3. CVClaire Vo
  This is a man who is on message.
4. AGAnkur Goyal
  Yeah, I mean, I'll, I'll give you like a, a, a, an example. Um, we have this open source use case, uh, I'm, uh, sorry, a mo- a use case where we run open source models, and we're running, like, millions of tokens per second. It's very, very high scale, so every cent matters and e- every bit of optimization matters. We are trying to change right now from model A to model B, and I, again, I'm someone who builds software to write evals. I, uh, vibe coded an eval script, and it went... It just was getting stuck. And then I read the code, and it's, like, 3,000 lines of complete trash. And it had, like, all these scoring functions and all this crap, and it was getting confused. And so I, on Saturday, I hand wrote, like, no, uh, no copilot, no autocomplete. I just, uh, b- partly to improve my own understanding of the problem, I hand wrote the eval, and then by the end of Sunday, the problem was solved. Yeah.
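[Editor's sketch] A hand-written eval for a model swap can be very small. The sketch below is not Ankur's script; it only illustrates the shape of one: a dataset, one scorer, and a loop that runs both candidates. The two "models" are stand-in functions with canned answers, and all names and data are hypothetical.

```python
def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the answers match after trimming and lowercasing."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(model, dataset, scorer) -> float:
    """Average the scorer over every case in the dataset."""
    scores = [scorer(model(case["input"]), case["expected"]) for case in dataset]
    return sum(scores) / len(scores)


# Stand-ins for real model calls; replace with actual inference clients.
def model_a(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "")


def model_b(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "paris ", "3*3": "9"}.get(prompt, "")


dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3*3", "expected": "9"},
]
score_a = run_eval(model_a, dataset, exact_match)
score_b = run_eval(model_b, dataset, exact_match)
print(f"model A: {score_a:.2f}  model B: {score_b:.2f}")
```

Keeping the eval to one scorer and one loop makes it easy to read end to end, which is the property the 3,000-line generated version lacked.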
5. CVClaire Vo
  So you shut the session, you do it yourself.
6. AGAnkur Goyal
  Yeah. Uh, just for the eval. Just for the eval.
7. CVClaire Vo
  Great. This has been so great.
39:10 – 40:10
Closing thoughts and how to connect
1. CVClaire Vo
  Where can we, where can we find you, and how can we be helpful?
2. AGAnkur Goyal
  Uh, if you are interested in, um, evals or you're trying to solve AI observability problems inside your company, please check out Braintrust. We're at braintrust.dev, @braintrust on X, or I'm @ankrgyl. I'm very happy to chat. We're also hiring. If you like working on these problems and you like maybe pushing the boundaries of rigor and stuff, uh, and you found th- this kind of stuff interesting, uh, we'd love to work with you.
3. CVClaire Vo
  Well, thank you so much for joining. This was great.
4. AGAnkur Goyal
  It was a lot of fun. [upbeat music]
5. CVClaire Vo
  Thanks so much for watching. If you enjoyed the show, please like and subscribe here on YouTube, or even better, leave us a comment with your thoughts. You can also find this podcast on Apple Podcasts, Spotify, or your favorite podcast app. Please consider leaving us a rating and review, which will help others find the show. You can see all our episodes and learn more about the show at howiaipod.com. See you next time.

Episode duration: 40:11

Transcript of episode QE_1hRLsehM
