Lenny's Podcast

Hamel Husain & Shreya Shankar: How notes turn into AI evals

Manual error analysis on real traces, with one benevolent dictator labeling: open coding clusters notes into buckets, then narrow binary LLM judges check them.

Lenny Rachitsky (host) · Hamel Husain (guest) · Shreya Shankar (guest)
Sep 25, 2025 · 1h 46m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00–4:57

    Introduction to Hamel and Shreya

    1. LR

      (instrumental music) To build great AI products, you need to be really good at building evals.

    2. HH

      It's the highest ROI activity you can engage in. This process is a lot of fun. Everyone that does this immediately gets addicted to it. When you're building an AI application, you just learn a lot.

    3. LR

      What's cool about this is you don't need to do this many, many times. For most products, you do this process once and then you build on it.

    4. SS

      The goal is not to do evals perfectly. It's to actionably improve your product.

    5. LR

      I did not realize how much controversy and drama there is around evals. There's a lot of people with very strong opinions. (laughs)

    6. SS

      People have been burned by evals in the past. People have done evals badly, so then they didn't trust it anymore, and then they're like, "Oh, I'm anti-evals."

    7. LR

      What are a couple of the most common misconceptions people have with evals?

    8. HH

      The top one is we live in the age of AI. Can't the AI just eval it? But it doesn't work.

    9. LR

      A term that you used in your posts that I love is this idea of a benevolent dictator.

    10. HH

      When you're doing this open coding, a lot of teams get bogged down in having a committee do this. For a lot of situations, that's wholly unnecessary. You don't want to make this process so expensive that you can't do it. You can appoint one person whose taste that you trust. It should be the person with domain expertise. Oftentimes, it is the product manager.

    11. LR

      (instrumental music) Today, my guests are Hamel Husain and Shreya Shankar. One of the most trending topics on this podcast over the past year has been the rise of evals. Both the chief product officers of Anthropic and OpenAI shared that evals are becoming the most important new skill for product builders. And since then, this has been a recurring theme across many of the top AI builders I've had on. Two years ago, I had never heard the term evals. Now it's coming up constantly. When was the last time that a new skill emerged that product builders had to get good at to be successful? Hamel and Shreya have played a major role in shifting evals from being an obscure, mysterious subject to one of the most necessary skills for AI product builders. They teach the definitive online course on evals, which happens to be the number one course on Maven. They've now taught over 2,000 PMs and engineers across 500 companies, including large swaths of the OpenAI and Anthropic teams, along with every other major AI lab. In this conversation, we do a lot of show versus tell. We walk through the process of developing an effective eval, explain what the heck evals are and what they look like, address many of the major misconceptions with evals, give you the first few steps you can take to start building evals for your product, and also share just a ton of best practices that Hamel and Shreya have developed over the past few years. This episode is the deepest yet most understandable primer you will find on the world of evals, and honestly, it got me excited to write evals, even though I have nothing to write evals for. I think you'll feel the same way as you watch this. If this conversation gets you excited, definitely check out Hamel and Shreya's course on Maven. We'll link to it in the show notes. If you use the code LENNIESLIST when you purchase the course, you'll get 35% off the price of the course. With that, I bring you Hamel Husain and Shreya Shankar.
This episode is brought to you by Fin, the number one AI agent for customer service. If your customer support tickets are piling up, then you need Fin. Fin is the highest performing AI agent on the market with a 65% average resolution rate. Fin resolves even the most complex customer queries. No other AI agent performs better. In head-to-head bake-offs with competitors, Fin wins every time. Yes, switching to a new tool can be scary, but Fin works on any help desk with no migration needed, which means you don't have to overhaul your current system or deal with delays in service for your customers. And Fin is trusted by over 5,000 customer service leaders and top AI companies like Anthropic and Synthesia. Because Fin is powered by the Fin AI engine, which is a continuously improving system that allows you to analyze, train, test, and deploy with ease, Fin can continuously improve your results too. So if you're ready to transform your customer service and scale your support, give Fin a try for only 99 cents per resolution. Plus, Fin comes with a 90-day money back guarantee. Find out how Fin can work for your team at Fin.ai/Lenny. That's Fin.ai/Lenny.

This episode is brought to you by Dscout. Design teams today are expected to move fast, but also to get it right. That's where Dscout comes in. Dscout is the all-in-one research platform built for modern product and design teams. Whether you're running usability tests, interviews, surveys, or in-the-wild fieldwork, Dscout makes it easy to connect with real users and get real insights fast. You can even test your Figma prototypes directly inside the platform. No juggling tools, no chasing ghost participants. And with the industry's most trusted panel plus AI-powered analysis, your team gets clarity and confidence to build better without slowing down. So if you're ready to streamline your research, speed up decisions, and design with impact, head to Dscout.com to learn more. That's D-S-C-O-U-T.com.

  2. 4:57–9:56

    What are evals?

    1. LR

      (instrumental music) Hamel and Shreya, thank you so much for being here and welcome to the podcast.

    2. HH

      Thank you for having us.

    3. SS

      Yeah, super excited.

    4. LR

      I'm even more excited. Okay, so a couple years ago, I had never heard the term evals. Now it's one of the most trending topics on my podcast. Essentially, to build great AI products, you need to be really good at building evals. Uh, also turns out some of the fastest growing companies in the world are basically building and selling and creating evals for AI labs. I just had the CEO of Mercor on the podcast. So there's something really big happening here. Uh, I want to use this conversation to basically help people understand the space deeply, but let's start with the basics. Just what, what the heck are evals, for folks that have no idea what we're talking about? Give us just a quick understanding of what an eval is, and let's start with, with Hamel.

    5. HH

      Sure. Evals is a way to systematically measure and improve an AI application, and it really doesn't have to be scary or unapproachable at all. It really is, at its core, data analytics on your LLM application: a systematic way of looking at that data and, where necessary, creating metrics around things so you can measure what's happening, and then so you can iterate and do experiments and improve.

    6. LR

      So that's a, that's a really good broad way of thinking about it. If you go one level deeper, just to give people an even more concrete way of imagining and visualizing what we're talking about, even if you have an example to show, it would be even better. W- what's a, what's an even deeper way of understanding what an eval is?

    7. HH

      Let's say you have a real estate assistant, uh, you know, application, and it's, it's not working the way you want. It's not writing emails to customers the way you want or it's not, uh, you know, calling the right tools or any number of errors. And before evals, you would be left with guessing. You would maybe fix a prompt and hope that you're not breaking anything else with that prompt. And you might rely on vibe checks, which is totally fine. And vibe checks are good, and you should do vibe checks initially. But it can become very unmanageable very fast because as your application grows, it's really hard to rely on vibe checks. You just feel lost. And so evals help you create metrics that you can use to measure how your application is doing and kind of give you a way to improve your application with confidence, because you have a feedback signal to iterate against.

    8. LR

      So just to make it very real: so imagining this, uh, real estate agent, maybe they're helping you book a listing or go see an open house. The idea here is you have this agent talking to people. It's answering questions, pointing them to things. As a builder of that agent, how do you know if it's giving them good advice, good answers? Is it telling them things that are completely wrong? So the idea of evals essentially is to build a set of tests that tell you how often this agent is doing something wrong that you don't want it to do. And there's a bunch of ways you could define wrong. It could be, uh, just making up stuff. It could be, uh, just answering in a really strange way. Uh, the way I think about evals, and tell me if this is wrong, is just simply like unit tests for code. And they may- (laughs) You're smiling. You're like, "No, you idiot." Uh-

    9. SS

      No, that's not what I was thinking.

    10. LR

      Okay, okay. Tell me, tell me, 'cause how does that feel as a metaphor?

    11. SS

      So, okay. I like what you said first, which is we had a very broad definition. Evals is a big spectrum of ways to measure application quality. Now, unit tests are one way of doing this. Maybe there are some non-negotiable functionalities that you want your AI assistant to have and unit tests are going to be able to check that. Now, maybe you also, because these AI assistants are doing such open-ended tasks, you kind of also want to measure how good are they at very vague or ambiguous things like responding to new types of user requests or, you know, figuring out if there's new distributions of data, like new users are coming and using your real estate agent that you didn't even know would use your product and then all of a sudden you think like, "Oh, there's a different way you want to kind of accommodate this new group of people." So evals could also be, you know, a way of looking at your data regularly to find these new cohorts of people. Evals could also be like metrics that, you know, you just want to track over time, like you want to track people saying, "Yes, thumbs up. I liked your message." Um, you want to track very, very basic things that are not necessarily AI related but can go back into this flywheel of improving your product. So I would say, in the end, overall, unit tests are a very small part of that very big puzzle.
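To make the "unit tests are one point on the spectrum" idea concrete, here is a minimal sketch of a binary, unit-test-style check on an assistant reply. The specific rules (the length cap, the placeholder check) are invented for illustration; they are not NurtureBoss's actual checks.

```python
def passes_basic_checks(reply: str) -> bool:
    """Return True only if the reply meets every non-negotiable rule."""
    if not reply.strip():
        return False  # must actually say something
    if len(reply) > 1200:
        return False  # SMS-style channel: keep replies short (invented threshold)
    if "[insert" in reply.lower():
        return False  # no unfilled template placeholders
    return True

assert passes_basic_checks("We have two one-bedroom units available.")
assert not passes_basic_checks("Hi [INSERT NAME], thanks for reaching out!")
```

Checks like these catch hard failures cheaply, but, as Shreya says, they say nothing about the vaguer qualities that the rest of the eval process has to cover.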

  3. 9:56–16:51

    Demo: Examining real traces from a property management AI assistant

    2. LR

      Awesome. You guys actually brought an example of an eval just to show us exactly what the hell we're talking about. We're talking in these big ideas. So how about let's pull one up and show people here's, here's what an eval is.

    3. HH

      Yeah. Let me just set the stage for it a little bit. So to echo what Shreya said, it's really important that we don't think of evals as just tests. There's a common trap that a lot of people fall into because they jump straight to the tests, like, "Let me write some tests." And usually that's not what you want to do. You should start with some kind of data analysis to ground what you should even test, and that's a little bit different than software engineering where you have a lot more expectations of how the system is going to work. With LLMs, it's a lot more surface area and it's very stochastic. So we kind of have a different flavor here. And so the example I'm gonna show you today, it's actually a real estate example. It's a different kind of real estate example. It's, uh, from a company called NurtureBoss. I can share my screen to show you their website just to help you understand this, uh, use case a little bit. So let me share my screen. So this is a company that I work with called NurtureBoss, and it is an AI assistant for property managers who are managing apartments. And it helps with various tasks such as inbound leads, customer service, booking appointments, so on and so forth. Like all the different sort of operations you might be doing as a property manager, it helps you with that. And so, you know, you can see kind of what they do. It's a very good example because it has a lot of the complexities of a modern AI application. So there are lots of different channels that you can interact with the AI through, like chat, text, voice, but also there's tool calls, lots of tool calls for, like, booking appointments, getting, uh, information about availability, so on and so forth. There's also RAG retrieval, getting information about customers and properties and things like that. So it's pretty fully fledged in terms of an AI application. And so they have been really generous with me in, uh, allowing me to use their data as a teaching example.
And so we have anonymized it, but what I'm gonna walk through today is, okay, let's create... let's do the first part of how we would start to build evals for NurtureBoss. Like, why would we even want to do that? So let's go through the very beginning stage, what we call error analysis, which is, let's look at the data of their application and first start with what's going wrong. So I'm gonna jump to that next and I'm gonna open an observability tool, and you can use whatever you want here. I just happen to have this data loaded in a tool called Braintrust. But you can load it in anything. You know, it's not... we don't have a favorite tool or anything. In the blog post that we wrote with you, uh, we had the same example but in Arize Phoenix, um, and I think Aman, on your blog post, used Arize Phoenix as well. There's also LangSmith. So, these are kind of, like, different tools that you can use. So what you see here on the screen, this is logs from the application and let me just show you how it looks. So, what you see here is... And let me make it full screen. So this is one particular interaction that a customer had with the NurtureBoss application. And what it is, it's a detailed log of everything that happened. So it's, it's a... it's called a trace, and it's just an engineering term for logs of a sequence of events. The concept of a trace has been around for a really long time, but it's especially really important when it comes to AI applications. So we have all the different components and pieces and information that the AI needs to do its job, and we have logged all of it. And we're looking at a view of that. And so you see here, a system prompt. The system prompt says, "You are an AI assistant working as a leasing team member at Retreat at Acme Apartments." Remember, I said this is anonymized, so that's why the name is Acme Apartments.
"Your primary role is to respond to text messages from both residents and prospective, uh, both current residents and prospective residents. Your goal is to provide accurate, helpful information," yada, yada, yada, and then there's a lot of detail around guidelines of how we want this thing to behave.

    4. LR

      Is this their actual system prompt, by the way, for this company?

    5. HH

      It is. Yes it is.

    6. LR

      Amazing.

    7. HH

      It's a real system prompt.

    8. LR

      That's so cool. That's amazing 'cause that's really... it's rare you see an actual company product system prompt. That's like their crown jewels a lot of the time, so this is actually very cool on its own.

    9. HH

      Yeah. Yeah. It's really cool. And, you know, you see all these different sort of features that they want to, or different use cases, so things about tour scheduling, handling applications, guidance on how to talk to different personas, so on and so forth. And you can see a user just kind of jumps in here and asks, "Okay. Do you have a one bedroom with study available? I saw it on virtual tours." And you can see that the LLM calls some tools. It calls this Get Individuals' Information tool and it pulls back that person's information, and then it gets the community's availability. So it's, you know, it's querying a database with the availability for that apartment complex. And then, finally, the AI responds, "Hey. We, we have several one bedroom apartments available, um, but none specifically listed with a study. Here are a few options." Uh, and then the user says, "Can you let me know when one with a study is available?" And the AI says, "I currently don't have specific information on the availability of a one bedroom apartment." The user says, "Thank you," and the AI says, "You're welcome. If you have any more questions, feel free to reach out." Now, this is an example of a trace and this is... We're looking at one specific data point. And so one thing that's really important to do when you're doing data analysis of your LLM application is to look at

  4. 16:51–23:54

    Writing notes on errors

    1. HH

      data. Now, you might wonder, there's a lot of these logs. It's kind of messy. There's a lot of things going on here. How in the hell are you supposed to look at this data? Do you want to just drown in this data? How do you even analyze this data? So it turns out, there is a way to do it that is completely manageable, and it's not something that we invented. It's been around in machine learning and data science for a really long time and it's called error analysis. And what you do is... The first step in conquering data like this is just to write notes, okay? So, you got to put your product hat on, which is why we're talking to you because product people have to be in the room, um, and they have to be involved in sort of doing this. You k- you know, usually a developer is not suited to do this, especially if it's not a coding application.

    2. LR

      And, um, just to-

    3. HH

      And-

    4. LR

      Just to mirror back why I think you're saying that is because this is the user experience of your product. People talking to this agent is the entire product essentially, and so it makes sense for the product person to be involved, super involved in this.

    5. HH

      Yeah. So let's, let's reflect on this conversation. Okay. A user asked about availability. The AI said, "Oh, we don't really have that. Have a nice day." Now, for a product that is helping you with lead management, is that good? Like do you feel like this is the way we want it to, to go?

    6. LR

      Not ideal.

    7. HH

      Yes, not ideal. And I'm glad you said that. A lot of people would say, "Oh, it's great. The AI did the right thing. It looked, said we didn't have it available, and it's not available." But with your product hat on, you know that's not correct. And so what- what you would do is you would just write a quick note here. You know, you might pop in here, let me just... and you can write a note. So every observability application has the ability to write notes. And you wouldn't try to figure out everything that's wrong... you know, in this case, it's kind of not doing the right thing. Um, but you just write a quick note: "Should have handed off to a human."

    8. LR

      And as we watch this happening, like you mentioned, and you'll explain more, this feels very manual and unscalable. But, uh, as you said, this is just one step of the process. There's a system to this. And it's just the first part.

    9. HH

      Yeah. And you don't have to do it for all of your data. You can, you sample your data and just take a look, and it's surprising how much you learn when you do this. Everyone that does this immediately gets addicted to it and they say, "This is the greatest thing that you can do when you're building an AI application." You just learn a lot. You're like, hmm, this is not how I want it to, to work. Okay. And so, um, that's just an example. So you write this note. And then we can go onto the next trace. So this is the next trace. I just pushed a hot key on my keyboard. Let me go back to, uh, looking at it.

    10. LR

      And these tools make it easy to go through a bunch and add these notes quickly.

    11. HH

      Yes. And so this is another one. Similar system prompt, we don't need to go through all of it again. We'll just jump right into the user question. Okay, "I've been texting you all day," maybe that's funny, um, and, uh, the user says, "Please..." Okay, yeah, this one is just an error in the application where, you know, this is a text message application, and so the channel through which the customer is communicating is text message, and you're just getting really garbled input. And you can see here that it kind of doesn't make sense. You know, like the words are being cut off, like, "In the meantime," and then the assistant doesn't know how to respond, 'cause you know how people text message: they write short phrases, they, you know, split their sentence across four or five different turns. So in this case-

    12. LR

      Yeah, so what do you, what do you do with something like that?

    13. HH

      Yeah, so this is a, this is a different kind of error.

    14. LR

      Mm-hmm.

    15. HH

      This is more of, hey, we're not handling this interaction correctly, this is more of a technical problem. Um, rather than, hey, the AI is not doing exactly what we want.

    16. LR

      So we would write that down too.

    17. HH

      Which is still really cool.

    18. LR

      Like it's amazing you're catching that too here. Otherwise you'd have no idea this was happening.

    19. HH

      Yeah, you might not know this is happening, right? And so you would just say, okay, um, you would write a note, like, uh, "Conversation flow is janky because of text message."

    20. LR

      And I like, yeah, I like that. (laughs) I like that you're using the word janky. It shows you just how informal this can be at this stage.

    21. HH

      Yeah, it's supposed to be chill. Like, just don't overthink it. And there's some, there's a way to do this. So the question always comes up, how do you do this? Do you try to find all the different problems in this trace? What- what do you write a note about? And the answer is, just write down the first thing that you see that's wrong, the most upstream error. Don't worry about all the errors. Just capture the first thing that you see that's wrong and stop and move on. And you can get really good at this. The first two or three can be very painful, but, you know, you can do a bunch of them really fast. So here's another one. And, uh, let's skip the system prompt again. And the user asked, "Hey, I'm looking for a two to three bedroom with either one or two baths. Do you provide virtual tours?" And a bunch of tools are called and it says, "Hi, Sarah. Currently we have a three bedroom, two and a half bathroom apartment available for $2,175. Um, unfortunately we don't have any two bedroom options at the moment. We do offer virtual tours, you can schedule a tour," blah, blah. It just so happens that there is no virtual tour. (laughs) Right?

    22. LR

      Nice.

    23. HH

      So, um, you know, it is hallucinating something that doesn't exist, and you kind of have to bring your context as an engineer or even, you know, a product person and say, "Hey, this is kind of weird. Like, you know, we shouldn't be telling a person about a virtual tour when it's not offered." So you would say, okay, uh, you know, "Offered virtual tour." And you just, you know, you just write the note. So you can see there's a diversity of different kinds of errors that we're seeing. And we're actually learning a lot about your application, um, in a very short amount of

  5. 23:54–25:16

    Why LLMs can’t replace humans in the initial error analysis

    1. HH

      time.

    2. SS

      One common question that we get from people at this stage is, "Okay, I understand what's going on. Can I ask an LLM to do this process for me?"

    3. LR

      Hmm, great question.

    4. SS

      And I loved Hamel's most recent example because what we usually find when we try to ask an LLM to do this error analysis is it just says, "The trace looks good." Because it doesn't have the context needed to understand whether something might be, you know, bad product smell or not... for example, the hallucination about scheduling the tour, right? I can guarantee you, I would bet money on this, if I put that into ChatGPT and asked, "Is there an error?" it would say, "No, did a great job." But Hamel had the context of knowing, oh, we don't actually have this virtual tour functionality, right? So I think in these cases it's so important to make sure you are manually doing this yourself. Um, and we can talk a little bit more about when to use LLMs in the process later. But, like, number one pitfall right here is people are like, "Let me automate this with an LLM."

    5. LR

      Do you think they'll... We'll get to a place where- where an agent can do this without-

    6. SS

      Oh, no, no, no, no.

    7. LR

      ... the context?

    8. SS

      Sorry, there are parts of error analysis-

    9. LR

      Mm-hmm.

    10. SS

      ... that an LLM is suited for, which we can talk about-

    11. LR

      Mm-hmm.

    12. SS

      ... later in this podcast. But right now in this stage of freeform notetaking-

    13. LR

      Mm-hmm.

    14. SS

      ... it's not the place for an LLM.

    15. LR

      Got it. And this is something you call open coding, this step?

    16. SS

      Yes, absolutely.

    17. LR

      Cool.

  6. 25:16–28:07

    The concept of a “benevolent dictator” in the eval process

    1. LR

      Uh, another, uh, t- term that you used in your post that I love, and that fits into this step, is this idea of a bene- benevolent dictator. Maybe just talk about what that is. Maybe Shreya can cover that?

    2. SS

      Yeah. So Hamel actually came up with this term.

    3. LR

      Okay, maybe Hamel

    4. HH

      (laughs)

    5. LR

      ... will cover the (laughs) actually.

    6. HH

      No problem. And we'll actually show the LLM automation in this example-

    7. LR

      Oh, awesome.

    8. HH

      ... uh, because we're gonna take this example. We're gonna go all the way through.

    9. LR

      Amazing.

    10. HH

      And so- and so, um, benevolent dictator is just a catchy term for the fact that when you're doing this open coding, a lot of teams get bogged down in having a committee do this, and for a lot of situations, that's wholly unnecessary. Like, you know, people get really uncomfortable with, "Oh, okay, you know, we want everybody on board. We want everybody involved," so on and so forth. You need to cut through the noise. Um, in a lot of organizations, if you look really deeply, especially small or medium-sized companies, you can appoint one person whose taste you trust. Um, and you can- you can do this with a small number of people, and often one person, and that's s- really important to make this tractable. You don't want to make this process so expensive that you can't do it. You're gonna lose out. So, that's the idea behind benevolent dictator is, hey, you need to simplify this across as many dimensions as you can. Another thing that we'll talk about later is, when it comes to building an LLM as a judge, you need a binary score. You don't want to think about, is this like a one, two, three, four, five, and assign a score to it. That's gonna slow it down.

    11. LR

      Just to make sure this benevolent dictator point is- is really clear, basically this is the person that does this notetaking, and ideally they're the expert on the stuff. So if it's law stuff, maybe there's, like, a legal person that owns this. It could be a product manager. Give us advice on who this person should be.

    12. HH

      Yeah, it should be the person with domain expertise. So, in this case, you know, it would be the person who understands the business of leasing, apartment leasing, and has context to understand if this makes sense. It- it's always a domain expert, like you said. Okay, for legal it would be a law person, for mental health it would be the mental health expert, whether that's, like, a psychiatrist or, you know, someone else.

    13. LR

      Cool.

    14. HH

      Um, oftentimes it is the product manager.

    15. LR

      Cool. So the advice here, pick that person. May not feel so- super fair that they're the one in charge and they're the dictator, but they're benevolent. It's gonna go- be okay.

    16. HH

      Yeah. It's gonna be okay. You're just trying to... It's not perfection, you're just trying to make progress and- and get signal quickly so you have an idea of what to work on, because it can become infinitely expensive if you're not careful.
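Hamel's "binary score, not one-to-five" advice can be sketched as a narrow LLM judge that checks exactly one failure mode and answers PASS or FAIL. This is a hypothetical shape, not code from the course: `call_llm` is a stand-in for whatever model client you use, and the prompt wording is illustrative.

```python
# A narrow, binary LLM-as-judge: one failure mode, two possible answers.
JUDGE_PROMPT = """You are checking exactly ONE thing about a leasing \
assistant's reply: did it claim a feature the property does not offer?

Property features: {features}
Assistant reply: {reply}

Answer with exactly one word: PASS or FAIL."""

def judge_no_invented_feature(reply: str, features: list[str], call_llm) -> bool:
    """Return True when the judge answers PASS (no invented feature)."""
    prompt = JUDGE_PROMPT.format(features=", ".join(features), reply=reply)
    return call_llm(prompt).strip().upper() == "PASS"
```

Because the output is binary, the benevolent dictator's own labels can be compared against the judge's verdicts directly, which is much harder to do with a one-to-five scale.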

  7. 28:07–31:39

    Theoretical saturation: when to stop

    2. LR

      Yeah. Okay, cool. Let's go back to your examples.

    3. HH

      Yeah, no problem. So this is another example where we have someone saying, "Okay, do you have any specials?" And the assistant, or the AI responds, "Hey, we have a 5% military discount." User responds, "Can you..." and it switches the subject, "Can you tell me how many floors there are? Do you have any one bedrooms available... or one bedrooms on the first floor?" And the AI responds, "Yeah, okay, we have several one bedroom apartments available." And then the user wants to confirm, "Any of those on the first floor? And how much are the one bedrooms?" And then, also, it- it's a current resident, so it's al- they're also asking, "I need a maintenance request." This is actually pretty... Like, you can see the messiness of the real world-

    4. LR

      Mm-hmm.

    5. HH

      ... in here. And the assistant just calls a tool that says transfer call, but it doesn't say anything.

    6. LR

      Mm-hmm.

    7. HH

      It just abruptly does transfer call. So it's pretty jank, I would say. Like, it's just not, you know-

    8. LR

      Another jank.

    9. HH

      Another kind of jank, a different kind of jank. So you don't want to... When you write the open note, you don't want to say jank-

    10. LR

      Mm-hmm.

    11. HH

      ... because what we want to do is we want to understand what... And when we look at the notes later on, we want to understand, like, what happened. So you just want to say, um, you know, "Did not confirm call transfer with, uh, with user." It doesn't have to be perfect, you just have to have a general idea of what's going on.

    12. LR

      Cool.

    13. HH

      So, okay. So let's say we do... And we, Shreya and I, we recommend doing at least 100 of these. The question is always, like, how many of these do you do? And so there's not a magic number. We say 100 because we know that as soon as you start doing this, once you do 20 of these, you will automatically find it so useful that you will continue doing it. So we just say 100 to mentally unblock you so it's not intimidating. Like, don't worry, you're only gonna do 100. And there is a- a term for that... So- so the right answer is keep looking at traces until you feel like you're not learning anything new. Maybe Shreya should talk about-

    14. SS

Yeah. So there's actually a term in data analysis and qualitative analysis called theoretical saturation. What this means is, when you do all of these processes of looking at your data, when do you stop? It's when you are theoretically saturated: you're not uncovering any new types of notes, new types of concepts, or anything that will, like, materially change the next part of your process. Um, and this kind of takes a little bit of intuition to develop. So typically, people don't really know when they've reached theoretical saturation yet. That's totally fine. When you do two or three examples or rounds of this, like, you will develop the intuition. A lot of people realize, "Oh, okay, like, I only need to do 40. I only need to do 60. Actually, I only need to do, like, 15." It depends on the application and on how savvy you are with error analysis, for sure.
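The stopping rule Shreya describes can be made concrete as a toy heuristic: keep reviewing traces until a run of consecutive notes adds no new open code. The note labels and window size below are invented for illustration, not from the episode.

```python
# Toy "theoretical saturation" check: stop once `window` consecutive
# reviewed traces produced no open code you hadn't already seen.
def reached_saturation(open_codes, window=20):
    """Return how many traces were reviewed before saturating."""
    seen = set()
    streak = 0
    for i, code in enumerate(open_codes):
        if code in seen:
            streak += 1
            if streak >= window:
                return i + 1  # stop here; nothing new for `window` traces
        else:
            seen.add(code)
            streak = 0
    return len(open_codes)  # never saturated; keep reviewing

# Made-up notes: three distinct failure notes, then repeats only.
notes = (["missed_handoff", "wrong_discount", "abrupt_transfer"] * 2
         + ["missed_handoff"] * 20)
print(reached_saturation(notes, window=20))
```

In practice the "window" is your own intuition, as Shreya says; the code just makes the idea of "stop when you stop learning" explicit.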

    15. LR

      And your point about you probably wanna, uh, you're gonna wanna do a bunch. I imagine it's because you're just like, "Oh, I'm discovering all these problems. I gotta see what else is going on here."

    16. SS

      Exactly.

    17. LR

      Is that right? Okay.

    18. SS

      And I promise at some point, you're, like, not gonna discover new types of problems.

  8. 31:3944:39

    Using axial codes to help categorize and synthesize error notes

    1. SS

    2. LR

      Yeah. Awesome. So let's say you did a hundred of these, what's the next step?

    3. HH

      Yeah. Okay. So you did a hundred of these. Now you have all these notes. So this is where you can start using AI to help you. Um, you... So the part where you looked at this data is important, like we discussed. You don't want to automate this part too much.

    4. LR

      Humans will still have jobs. This is a takeaway here.

    5. SS

      (laughs)

    6. HH

      That's great. Yes.

    7. LR

      Just reviewing traces. At least there's one job left for now.

    8. HH

      Yeah.

    9. LR

      That's great.

    10. HH

      So, o-... Yeah. Exactly. Um, and so, okay, you have all these notes. Now, to turn this into something useful, you can do basic counting. So basic counting is the most powerful analytical technique in data science, uh, because it's so simple and it's kind of undervalued, um, in many cases. And so it's very approachable for people. And so the first thing you wanna, you wanna do is take these notes and you can categorize them with an LLM. And so there's a lot of different ways to do that. Right before this podcast, I took three different, uh, coding agents or, you know, uh, AI tools and had it categorize these notes. So one is, okay, I upload it into a Claude project. I uploaded a CSV of these notes, and I just exported them directly from this interface. Um, there's a lot of different ways to do this, but I'm, I'm showing you the simple, stupid way, the most basic way of doing things. And so I dumped a CSV in here, and I said, "Please analyze the following CSV file." There's... And I told it there's a metadata field that has a note in it. But what I said is I used the word open codes, and I said, "Hey, I have different open codes." And that's a term of art. That's, um... LLMs know what open codes are and they know what axial codes are because it is a t- it is a concept that's been around for a really long time. So those words help me shortcut, like, what I'm trying to do.

    11. LR

      That's awesome. And the end of the, end of the prompt is telling it to create axial codes.

    12. HH

      Yes. Creating axial codes. So what it does is-

    13. SS

      So maybe it's worth talking about-

    14. HH

      ... you know?

    15. SS

      ... what are-

    16. HH

      Yeah.

    17. SS

      ... axial codes or, like, what's the point here? Right? You have a mess-

    18. HH

      Mm-hmm.

    19. SS

      ... of open codes, right? And you don't have 100 distinct problems. Actually, many of them are repeats, you just phrased them differently, right? And you shouldn't have tried to create your taxonomy of failures as you're open coding. You just want to get down what's wrong and then organize: okay, what's the most common failure mode? An axial code basically is just a failure mode. It's like a label or category. And our goal is to get to these clusters of failure modes and figure out which is the most prevalent. So then you can go and attack that problem.

    20. LR

      That is really helpful. Basically just synthesizing all these categor-

    21. SS

      Absolutely.

    22. LR

      ... into categories and, and themes. Super cool. And we'll, uh, include this prompt in our show notes for folks, so they don't have to, like, sit there and screenshot it (laughs) and try to type it up themselves.

    23. HH

      Yeah. Great idea. Um, and so Claude, you know, went ahead and analyzed the CSV file, decided how to parse it, blah, blah, blah. We don't need to worry about all that stuff, but it came up with a bunch of axial codes. Basically, axial codes are categories, like Shreya said. So one is, okay, ca- capability limitations, misrepresentation, pro- process and protocol violations, human handoff issues, communication quality. It created these categories. Now, do I like all the categories? Not really. I like some of them. It's a good first, like, stab at it. I would probably rename it a little bit because some of them are a bit too generic. Like, what is capability limitations? That's still a little bit too broad. That's not actionable. I wanna get, like, a little bit more actionable with it, so that if I do decide it's a problem, I know what to do with it. But we'll discuss that in a little bit. Um, so you can do this, like, with anything. And this is the dumbest way to do it, but dumb sometimes is a good way to get started. So.
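The "dump a CSV into the LLM" step Hamel demos can be sketched in a few lines. The CSV layout, notes, and prompt wording below are illustrative stand-ins, not the actual prompt from the episode (which is in the show notes).

```python
import csv
import io

# Invented open-code notes, standing in for the exported CSV of traces.
raw = """trace_id,note
1,Did not confirm call transfer with user
2,Claimed a virtual tour exists when it does not
3,Transferred call abruptly with no handoff message
"""

rows = list(csv.DictReader(io.StringIO(raw)))
bullet_notes = "\n".join(f"- {r['note']}" for r in rows)

# Using the terms of art "open codes" / "axial codes" as a shortcut,
# exactly as Hamel describes.
prompt = (
    "I have open codes from error analysis of an AI assistant, one per line:\n"
    f"{bullet_notes}\n\n"
    "Please group these open codes into axial codes (failure-mode categories). "
    "Make each category specific enough to be actionable, and list which "
    "open codes fall under each one."
)
print(prompt)
```

You would paste or send this prompt to whatever model or coding agent you use; the point is only that the heavy lifting is a single categorization request over your notes.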

    24. LR

      And, and this is-

    25. HH

      Uh-

    26. LR

      ... what LLMs are really good at, taking a bunch of information and synthesizing it.

    27. SS

      Absolutely.

    28. HH

      Yeah.

    29. SS

      Synthesizing for us to make sense of, right?

    30. LR

      Yeah.

  9. 44:3946:06

    The results

    1. HH

      sheet. What comes next? Okay, so here's the big unveil. Hmm. This is the magic moment-

    2. SS

      (laughs)

    3. HH

      ... right now. So we have all these codes we, that, you know, we applied. The ones that we like on our traces. Now you can do the ta-da, you can count them. So here's a pivot table, and we just can do pivot table on those. And we can count how many times those different things occurred. So what did we find? Found on this, on these like traces that we categorized, we found 17 conversational flow issues. And I really like pivot tables because you can do cool things. You can like double click on these, you can say, "Oh, okay, let me, let me take a look at those." But that's going into an aside about pivot tables, how cool they are. But, um, um, you know, w- now we have just a nice rough cut of what are our problems. And now we have gone from chaos to some kind of thinking around, "Oh, you know what? These are my biggest problems. I need to fix conversational issues." You know, maybe these human handoff issues. It's not necessarily the count is the most important thing. You know, there might be something that's just really bad and you want to fix that, but okay, now you have some way of looking at your problem. And now you can think about whether you need evals, uh, for, for some of these.
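The pivot-table counting Hamel shows is just frequency counting, which you can also do in a couple of lines of code. The labels and tallies below are invented (the episode's real sheet had, for example, 17 conversational flow issues on top).

```python
from collections import Counter

# One axial code per reviewed trace, as if read out of the labeled sheet.
axial_codes = (["conversational_flow"] * 17
               + ["human_handoff"] * 9
               + ["formatting"] * 4)

# The "most powerful analytical technique in data science": counting.
counts = Counter(axial_codes)
for code, n in counts.most_common():
    print(f"{code}: {n}")
```

As in the episode, the count is a rough cut for prioritization, not the final word: a rare but severe failure mode can still outrank a frequent cosmetic one.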

  10. 46:0648:31

    Building an LLM-as-judge to evaluate specific failure modes

    1. HH

      So, you know, with the... You know, there might be some of these things that might be just dumb engineering errors that you don't need to write an eval for because it's very obvious on how to fix them. Um, maybe the formatting error with output, maybe you just forgot to tell the LLM how you want it to be formatted. And like, you didn't even say that in the prompt, so like just go ahead and fix the prompt, maybe. You know? And we can decide like, okay, do you want an- uh, do you want to write an eval for that? You might be, you might still want to write an eval for that because you might be able to test that with just code. You could just test the string. Does it have the right formatting, potentially, um, without running an LLM? So there's a cost benefit trade-off to evals. You don't want to get carried away with it. Um, but you want to start, you want to usually ground yourself in your actual errors. You don't want to skip this step. And so the reason I'm kind of spending so much time on this is, like, this is where people get lost. They go straight into evals like, "Let me, let me just write some tests." And that is where things go off the rails. Um, so let's, let's... Okay, so let's say we want to tackle one of these things. Uh, so for example, uh, let's say we want to tackle this human handoff issue. And we're like, "Hmm, I'm not really sure how to fix this." Like, that's a kind of subjective sort of judgment call on, you know, should we be handing off to a human? And I don't know immediately how to fix it. It's not super obvious per se. Yeah, I can like change my prompt, but I'm not like sure. I'm not 100% sure. Well, that might be sort of an interesting, um, thing for an LLM as a judge, for example. So there's different kinds of evals. One is code based, which you should try to do if you can because they're cheaper. You don't have to, you know... LLM as a judge is something, it's like a meta eval. 
You have to eval that eval to make sure the LLM that's judging is doing the right thing, which we'll talk about in a second. So okay, LLM as a judge, that's one thing. Okay, how do you build an LLM as a judge?

  11. 48:3152:10

    The difference between code-based evals and LLM-as-judge

    1. HH

    2. LR

      Before we get into that, actually-

    3. HH

      Yeah.

    4. LR

      ... just to make sure people know exactly what you're describing there, there's two types of evals. One is, you said, is code-based, and one is an LLM as judge. Maybe, Shreya, just help us understand what the, what, what code-based eval even is. Is it just like, it's like essentially a unit test? Is that a simple way to think about it?

    5. SS

      Yeah. Maybe eval is not the right term here, but think like-

    6. LR

      Mm-hmm.

    7. SS

      ... automated evaluator. So when we find these failure modes, one of the things we want is like, okay, can we now, like, go check the prevalence of that failure mode in an automated way without me manually labeling and doing all the coding and the grouping, and I wanna run it on thousands and thousands of traces, I wanna run it every week. That is, okay, you should probably build an, an automated evaluator to check for that failure mode. Now, when we're saying code-based versus LLM-based, we're saying, okay, so maybe I could write like a Python function or a piece of code to check whether that failure mode is present in a trace or not. And that's possible to do for certain things like, you know, checking the output is JSON, um, or, you know, checking that it's Markdown, or checking that it's short. Like, these are all things you can capture in code, or you could approximately capture in code. Uh, when we're talking about LLM judge here, we're saying that this is a complex failure mode, and we don't know how to evaluate in an automated way. So maybe we will try to use an LLM to evaluate this very, very narrow, specific failure mode of handoffs.
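The code-based evaluators Shreya lists (valid JSON, Markdown, length) each reduce to a small pass/fail function. These are illustrative sketches of that idea, not evaluators from the episode:

```python
import json

def is_valid_json(output: str) -> bool:
    """Pass/fail: does the output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def is_short_enough(output: str, max_words: int = 100) -> bool:
    """Pass/fail: is the output within a word budget?"""
    return len(output.split()) <= max_words

def looks_like_markdown_list(output: str) -> bool:
    """Approximate pass/fail: every non-empty line is a bullet item."""
    lines = [l for l in output.splitlines() if l.strip()]
    return bool(lines) and all(l.lstrip().startswith(("-", "*")) for l in lines)

print(is_valid_json('{"specials": "5% military discount"}'))    # True
print(is_short_enough("word " * 200))                            # False
print(looks_like_markdown_list("- one bedroom\n- two bedroom"))  # True
```

Because these run as plain code, they cost nothing per trace, which is why the episode recommends reaching for them before an LLM judge whenever the failure mode allows it.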

    8. LR

      So just to try to mirror back how, what you're describing, you wanna test what your, say, agent or AI product is doing. You ask it a question, it gets back with something. One way to test if it's giving you the right answer is if it's consistently doing the same thing, that you could write a code to te- to tell you this is true or false. For example, will it ever say there's a virtual tour?

    9. SS

      Yes.

    10. LR

      So you could ask it, "Is, do you provide virtual tours?" It says yes or no. And then you could write code to tell you if it's correct based on that specific answer. But if you're asking about something more complicated and it's not binary, you almost need, like in a, in a one world, you need a human to tell you this is correct. The solution to avoid humans having to review all this every time automatically is LLMs replacing human judgment, and you'd call it a LLM as judge, the LLM as being the judge if this is correct or not.

    11. SS

      Absolutely. You nailed it. Um...

    12. LR

      Great.

    13. SS

      So people always think like, oh, like, this is at least as hard as my problem of creating the original agent, and it's not. Because you're asking the judge to do one thing, evaluate one failure mode. So the scope of the problem is very small, and the output of this LLM judge is like pass or fail. So it is a very, very tightly scoped thing that LLM judges are very capable of doing very reliably.
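A narrow binary judge like the one Shreya describes needs two pieces: a prompt scoped to one failure mode, and strict parsing of the reply into pass/fail. This is a sketch under assumptions: the prompt wording is invented, and `call_llm` is a placeholder for your actual model client.

```python
# Hypothetical judge prompt for one failure mode: transferring to a human
# without confirming with the user. Binary output only.
JUDGE_PROMPT = """You are evaluating one specific failure mode: did the
assistant transfer the call to a human without confirming with the user?

Transcript:
{trace}

Answer with exactly one word: TRUE if the failure occurred, FALSE if not."""

def parse_verdict(raw: str) -> bool:
    """Force the judge's reply into a strict binary verdict, or fail loudly."""
    token = raw.strip().split()[0].upper().rstrip(".")
    if token not in ("TRUE", "FALSE"):
        raise ValueError(f"Judge gave a non-binary answer: {raw!r}")
    return token == "TRUE"

def judge_trace(trace: str, call_llm) -> bool:
    """Run one trace through the judge. `call_llm` is your model client."""
    return parse_verdict(call_llm(JUDGE_PROMPT.format(trace=trace)))

# Stubbed model call, just to exercise the plumbing:
print(judge_trace("assistant called transfer_call with no message", lambda p: "TRUE"))
```

Raising on a non-binary answer, rather than guessing, keeps the judge honest: as the episode stresses, a yes/no verdict is the whole point.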

    14. LR

      And the goal here is just to have a suite of tests that run before you ship to production that tell you things are going the way you want them to, the way your agent is interacting is correct.

    15. SS

      The beautiful thing about LLM judges, you can use them in unit tests or CI, sure, but you could also use it online for monitoring.

    16. LR

      Mm-hmm.

    17. SS

      Right?

    18. LR

      Mm-hmm.

    19. SS

      Like I can sample, like, 1,000 traces every day, run my LLM judge, real production traces, and see what the failure rate is there. This is not a unit test, right? But still now we get like an extremely specific measure of application quality.
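The online-monitoring loop Shreya describes (sample production traces daily, run the judge, report a failure rate) can be sketched as below. The trace data and the stand-in judge are fabricated; in practice the judge would be your aligned LLM judge.

```python
import random

def daily_failure_rate(traces, judge, sample_size=1000, seed=0):
    """Sample traces, run the binary judge on each, return the failure rate."""
    rng = random.Random(seed)
    sample = rng.sample(traces, min(sample_size, len(traces)))
    failures = sum(1 for t in sample if judge(t))
    return failures / len(sample)

# Fake production traces: roughly 1 in 10 contains the handoff failure.
traces = [{"id": i, "handoff_failure": i % 10 == 0} for i in range(5000)]
rate = daily_failure_rate(traces, judge=lambda t: t["handoff_failure"])
print(f"{rate:.1%}")
```

The same judge artifact thus serves both roles from the episode: a pre-ship test in CI, and a recurring measure of application quality on real traffic.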

    20. LR

      Cool. That's a really great point, because a lot of people dis evals for being this, like, not real life thing, it's a thing that you test before it's actually in the real world, and what's actually happening in the real world. You're saying you could actually, you should actually do exactly that-

    21. SS

      Yeah.

    22. LR

      ... test your real thing running in production.

    23. SS

      You, yeah.

    24. LR

      And, and it's like a daily, hourly sort of thing you could be running.

    25. SS

      Totally.

  12. 52:1054:45

    Example: LLM-as-judge

    1. SS

    2. LR

      Awesome. Okay, uh, Hamel's got a, a-

    3. SS

      (laughs) .

    4. LR

      ... example of an actual LLM as judge eval here. So let's take a look.

    5. HH

      I love how Shreya really teed it up, um, for me.

    6. SS

      (laughs) .

    7. HH

      So thank you so much. So what we have is an LLM-as-a-judge prompt for this one specific failure, like Shreya said. You would wanna do one specific failure, and you want to make it binary, because we wanna simplify things. We don't want, "Hey, like, score this on a rating of one to five. Like, how good is it?" In most cases, that's a weasel way of, like, not making a decision. Like, no, you need to make a decision. Is this good enough or not? Yes or no? It can be painful to think about what that is, but you should absolutely do it. Otherwise, this thing becomes very intractable. And then when you report these metrics, no one knows what 3.2 versus 3.7 means. So-

    8. SS

      This is, yeah, we see this all the time also, and even with, like, expert curated content on the internet, where it's like, "Oh, here's your LLM judge evaluator prompt. Here's a one to seven scale." And I always th-, I s- always text Hamel like, "Oh, no, like, now we have to fight the misinformation again," because we know somebody's going to try it out and then come back to us and say, "Oh, I have 4.2 average." And we're gonna be like, "Okay." (laughs)

    9. HH

      (laughs)

    10. LR

      It's wild how much drama there is in eval space.

    11. SS

      (laughs)

    12. HH

      (laughs)

    13. LR

      We're gonna get to that. Oh, man. This episode is brought to you by Mercury. I've been banking with Mercury for years, and honestly, I can't imagine banking any other way at this point. I switched from Chase and holy moly, what a difference. Sending wires, tracking spend, giving people on my team access to move money around, so frequent easy. Where most traditional banking websites and apps are clunky and hard to use, Mercury is meticulously designed to be an intuitive and simple experience. And Mercury brings all the ways that you use money into a single product, including credit cards, invoicing, bill pay, reimbursements for your teammates, and capital. Whether you're a funded tech startup looking for ways to pay contractors and earn yield on your idle cash, or an agency that needs to invoice customers and keep them current, or an e-commerce brand that needs to stay on top of cash flow and access capital, Mercury can be tailored to help your business perform at its highest level. See what over 200,000 entrepreneurs love about Mercury. Visit mercury.com to apply online in 10 minutes. Mercury is a fintech, not a bank. Banking services provided through Mercury's FDIC-insured partner banks. For more details, check out the show notes.

  13. 54:451:00:51

    Testing your LLM judge against human judgment

    1. HH

      Okay. So this is your judge prompt. There's no one way to do it. It's okay to use an LLM to help you create it. But again, put yourself in the loop. Don't just blindly accept what the LLM does. And in all of these cases, that's what we did. Like with the axial codes, we kind of iterated on this. You can use an LLM to, like, help you create this prompt, but make sure you read it, make sure you edit it, whatever. This is not to say it's the perfect prompt. This is just keeping it very simple to show you the idea. Okay, for this handoff failure, um, you know, I said, "Okay. I want you to output true or false, as a binary." It's a binary judge. That's the way we recommend. And then I just go through and say, "Okay. Like, when should you be doing a handoff?" And I just list them out: okay, explicit human request ignored or looped, uh, some policy-mandated transfer, sensitive resident issues, tool data unavailability, same-day walk-in or tour requests. You know, you need to talk to a human for that. So on and so forth, right? And so the idea is, now that I know that this is a failure from my data, I'm interested in iterating on it because I know this is actually happening all the time. And like Shreya said, it would be nice to have a way not only to evaluate this on, like, the data I have, but also on production data, just to get a sense of, well, what scale is this happening at? Let me find more traces. Let me have a way to iterate on this. And so we can take this prompt and I'm gonna use a spreadsheet again. So the first step is... Okay. When I'm doing this judge, I wrote the prompt. Now, a lot of people stop there and they say, "Okay. I have my judge prompt. We're done. Let's just ship it." And if the judge says it's wrong, it's wrong. They just, like, accept it as the gospel: "Okay. The LLM said it's wrong.
It must be wrong." Don't do that. Because that's the fastest way you can have evals that don't match what's going on. And when people lose trust in your evals, they lose trust in you. So it's really important that you don't do that. And so, before you release your LLM as a judge, you want to make sure it's aligned to the human. So how do you do that? You actually have those axial codes, and you want to, like, measure your judge against the axial code and say, "Hey, does it agree with me? Does my own judge agree with me?" Just measure it. And so what we have here is... Okay, I say assess this LLM trace. Again, I'm using just spreadsheets here. Assess this LLM trace according to these rules. And the rules are just the prompt that I just showed you. And I ask it, "Okay. Is there a handoff error, true or false?" So then this column... Let me just zoom in a bit. In column H I have, okay, did this error occur? And column G is whether I thought the error occurred or not. And you can see-

    2. LR

      Yeah. This is you going through it manually, you'd do that.

    3. HH

      Yeah. Yeah. And which we already did. We, we already went through it manually.

    4. LR

      Yeah.

    5. HH

      So we don't, it's not like we have to do it again 'cause we kind of have that cheat code from the axial coding. We already did it. Um, you might have to go through it again if you need more data. And there's a lot of details to this on, like, how to do this correctly. Um, you want to split your data and do all these things so that you're not cheating. But I just want to show you the concept. And basically, um, what you can do is measure the agreement. Now-

    6. LR

      Mm-hmm.

    7. HH

      ... one thing you should know as a product manager is a lot of people go straight to this, like, agreement. They say, "Okay. My judge agrees with the human at some percentage of the time." Now, that sounds appealing but it's a very dangerous metric to use because a lot of times errors have, um... You know, they only happen on the, on the long tail and they don't happen as frequently. So, like, if you only have the error 10% of the time then you can easily have 90% agreement by just having a judge say, uh, it passes all the time. Does that make sense? So like-

    8. LR

      Mm-hmm.

    9. HH

      ... 90% agreement might look good on paper but it might be misleading. And that's-

    10. LR

      And it's a rare, it's a rare area.

    11. HH

      Yeah. So, you know, as a product manager or someone, even if you're not doing this calculation yourself, if someone ever reports to you agreement, you should immediately ask, "Okay. Tell me more." Like, you need, you know, you now need to look into it. To give you more intuition, here is, like, a matrix-

    12. LR

      Mm-hmm.

    13. HH

      ... okay? Of this specific judge in the Google sheet. And this is, again, in Pivot Table. Just keeping it dumb and simple is... Okay. On, on the, uh, rows I have what did the human think? What did I think? Did it have an error, true or false? And then did my judge have an error, true or false?

    14. SS

      The intuition here is exactly what Hamel said, right? You need to look at each type of error. So when the human said false but the judge said true or vice versa. So those non-green diagonals here. And if they're too large then go iterate on your prompt. Make it more clear to the LLM judge so that you can reduce that misalignment. You want to get to a point where most... You're gonna have some misalignment. That's okay. We talk about in our course also how to go and correct that misalignment. But in this stage, if you're a product manager and the person who's building the LLM judge eval has not done this. They're saying like, "Oh, it agrees 75% of the time. We're good." They don't, like, have this matrix and they haven't iterated to make sure that these two types of errors have gone down to zero, then it's a bad smell. Go and ask them to go fix
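Hamel's warning about raw agreement, and the per-cell matrix Shreya recommends instead, can be reproduced with a few lines of arithmetic. The labels are invented to match his 10%-error example:

```python
from collections import Counter

# Human labels from axial coding: True = the error occurred.
human = [True] * 10 + [False] * 90
# A lazy judge that always says "no error".
lazy_judge = [False] * 100

# Raw agreement looks great...
agreement = sum(h == j for h, j in zip(human, lazy_judge)) / len(human)
print(f"agreement: {agreement:.0%}")  # 90%, yet the judge is useless

# ...but the matrix exposes it: break agreement out per cell.
matrix = Counter(zip(human, lazy_judge))  # (human, judge) -> count
tpr = matrix[(True, True)] / (matrix[(True, True)] + matrix[(True, False)])
tnr = matrix[(False, False)] / (matrix[(False, False)] + matrix[(False, True)])
print(f"catches real errors: {tpr:.0%}")    # 0% - never flags a failure
print(f"confirms clean traces: {tnr:.0%}")  # 100%
```

This is exactly the "bad smell" test from the episode: a reported agreement number with no breakdown of the two off-diagonal error types tells you nothing about whether the judge catches the failures you care about.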

  14. 1:00:511:05:09

    Why evals are the new PRDs for AI products

    1. SS

      that.

    2. LR

      Awesome. That's a really good tip, is, is what to look for when someone's doing this wrong.

    3. SS

      Yeah.

    4. LR

      Actually, can you take us back to the LLM as judge prompt? I just wanna highlight something really interesting here. I've had some guests on the podcast recently who've been saying, "Evals are the new PRDs." And if you look at this, this is exactly what this is. Like, product managers, product teams are like, "Here's what the product should be, here's all the requirements, here's, like, the how it should work." They build the thing and then they test it manually often. What's cool about this is, this is exactly that same thing and it's running constantly. It's telling you, "Here's how this agent should respond in very specific ways. If it's this, this, this is do that. If it's this, this, that, do that." And so it's exactly what I've been hearing again and again. You could see it right here. This is, like, the purest sense of what a product requirements document should be is this eval judge that's telling you exactly what it should be, and it's automatic and running constantly.

    5. SS

      Yeah, absolutely. And it's kind of derived from our own data. So, of course, it's a product manager's expectations.

    6. LR

      Mm-hmm.

    7. SS

      What I find that a lot of people miss is they just put in what their expectations are before looking at their data. But as we look at our data, we uncover more expectations that we couldn't have dreamed up in the first place, and that ends up going into this prompt.

    8. LR

      So that is interesting. So it's not... (clears throat) So your advice is not skip straight to evals and LLM as judge prompts before you build the product. Still write traditional one-pagers, PRDs we use to tell your team what we're doing and why we're doing it, what success looks like. But then at the end, you could probably pull from that and even improve that original PRD if you're evolving the product, uh, using this process.

    9. SS

      I would go even further to say you're going to improve... It's going to change. You're never gonna know what the failure modes are gonna be upfront, and you're always going to uncover new, you know, vibes that you think your product should have, where you don't really know what you want until you see it with these LLMs. So you've gotta be kind of flexible, have to look at your data. PRDs are a great abstraction for thinking about this, but they're not the end-all be-all. It's going to change.

    10. LR

      I love that. And Hamel's pulling up some cool research report. What's this about?

    11. HH

      (laughs) Oh, this is one of the coolest research reports you can possibly read if you wanna know about evals. So it was authored by someone named Shreya Shankar.

    12. SS

      Oh my God. (laughs)

    13. HH

      Um...

    14. LR

      Whoa.

    15. SS

      (laughs)

    16. HH

      And her collaborators. And so it's called Who validates the validators?

    17. LR

      That is the best name for a researcher I've ever heard of. (laughs)

    18. SS

      Thank you. Thank you.

    19. LR

      Sorry. (laughs)

    20. HH

      So I sh- I should let Shreya talk about this. I think the, one of the most important things to pay attention in this paper are the criteria drift.

    21. SS

      Yeah.

    22. HH

      And what she found.

    23. SS

      So we did this super fun study when we were doing user studies with people who were trying to write LLM judges or just validate their own LLM outputs. And we were... This was, I think, this was before evals was, like, extremely popular, I feel like, on the internet. This was... We did this project, like, late 2023 was when we started it. But then, uh, the thing that really was burning in my mind as a researcher was like, "Why is this problem so hard? We've been having machine learning and AI for so long. It's not new. But suddenly, this time around, everything is really difficult." So we just did this user study with a bunch of developers and we realized, "Okay, what's new here is that you can't figure out your rubrics upfront." People's opinions of good and bad change as they review more outputs. They think of failure modes only after seeing 10 outputs they would never have dreamed of in the first place. And these are experts, right? These are people who have built many LLM pipelines and now agents before. And just you can't ever dream up everything in the first place. Um, and I think that's so key in today's world of AI development.

    24. LR

      Okay. That is a really good point. That's very much reinforcing what we were just talking about.

    25. SS

      Yeah.

    26. LR

      And that's why Hamel pulled this up, is just, okay, you still gotta do-

    27. SS

      The research behind it.

    28. LR

      Yeah. Okay, great. You still gotta do product the same way, but now you have this really powerful tool that make- helps you make sure what you've built is correct. Uh, it's not gonna replace the PRD process. Cool.

  15. 1:05:091:07:41

    How many evals you actually need

    1. LR

      How many evals of these... How many, say, I don't know, LLM as judge prompts do you end up with usually? Say, I don't know. Like, I know it obviously depends complexity to the product, but what's like a number in your experience?

    2. SS

      For me, like, between four and seven. Um...

    3. LR

      Oh, that's it?

    4. SS

      It's not that many 'cause a lot of the failure modes, as Hamel said earlier, can be fixed by just fixing your prompt. You just didn't think to put it in your prompts and now you put it in your... You shouldn't do an eval like this for everything. Just the, the pesky ones that, um, uh, you've described your ideal behavior in your agent prompt, but it's still failing.

    5. LR

      Got it. So say you found a problem, you fixed it. In traditional software development, you'd write a unit test to make sure it doesn't happen again. Is your... Insight here is don't even bother writing an eval around that if it's just gone.

    6. SS

      I think you can if you want to, but the whole game here is about prioritizing. You have finite resources and finite time. You can't write an eval for everything, so prioritize the ones that are the more pesky errors. That's-

    7. LR

      And probably the ones that are most risky to your business-

    8. SS

      Yes.

    9. LR

      ...if they say something like, "Make a Hitler (laughs) as grok."

    10. SS

      Yikes.

    11. LR

      And it's... Cool. Okay, so that's, that's very, uh, relieving that this... 'Cause this is, this prompt is, like, a lot of work to really think through all these details.

    12. SS

      Yeah.

    13. LR

      So if that's-

    14. SS

      But it's a lot of one-time cost.

    15. LR

      Right, right.

    16. SS

      And now forever you can run this on your application.

    17. LR

      Right.

    18. HH

      And I wanna say, okay, data analysis is super powerful; it's going to drive lots of improvements very quickly to your application. We showed the most basic kind of data analysis, which is counting, which is accessible to everyone. You can get, you know, more sophisticated with the data analysis. There's lots of different ways to sample and look at data. We kind of made it look easy, in a sense. But there's a lot of skill here to do it well. Um, you know, building an intuition and a nose for how to sort through this data. For example, let's say I find these, like, conversational flow issues. Maybe if I was trying to chase down this problem further, I would think about ways to find other conversational flow issues that I didn't code. You know, I would maybe dig through the data in several ways. Um, and there's, you know, different ways to go about this. It's very similar, if not almost exactly the same, as the kind of traditional analytics techniques that you would do on any product.

  16. 1:07:411:09:57

    What comes after evals

    1. HH

    2. LR

      Give us just a quick sense of what comes next, and then let's talk about the debate around evals and a couple more things.

    3. SS

      So what comes next after you've built your LLM judge? Well, we find that people just try to use that everywhere they can. So they'll put the LLM judge in unit tests, as you s- and they will know, like, "Oh, here are some example traces where we saw that failure because we labeled it. Now we're going to make those part of unit tests and make sure that every time we push a change to our code, these tests are gonna pass." They also use it for online monitoring. People are making dashboards on this. And I think that's incredible. I think, like, the products that are doing this, right, they have a very sharp sense of how well their application is performing. Um, and people don't talk about it because this is their moat, right? So people are not going to go and share all of these things because makes sense, right? If you are an email writing assistant and you're doing this and you're doing it well, you don't want somebody else to go and build an email writing assistant and then kind of get you out of business. So I really want to stress the point that it's like try to use these artifacts that you're building wherever possible, online, repeatedly. Um, use them to drive improvements to your product. Oftentimes, Hamel and I will kind of... We'll tell people how to do this up to this very point, and it clicks for people, and then they, like, never come back again. So either they have, I don't know, quit their jobs, they're not doing AI development anymore, or they know what to do from here on out. Um, I think it's the latter. (laughs) But, um-

    4. LR

      Yeah.

    5. SS

      ... I think it's very powerful.

    6. LR

      Like, just watching you do this really opened my eyes to what this is and how systematic the process is.

    7. SS

      Yes.

    8. LR

      I always imagine you just sit at a computer, "Okay, what are the things I need to make sure work correctly?" And what you're showing us here is, here's... It's a very simple step-by-step based on real things that are happening in your product, how to catch them, identify them, prioritize them, and then-

    9. SS

      Absolutely.

    10. LR

      ... catch them if they happen again and fix them.

    11. SS

      Yeah. It's, it's not magic. Like, anyone can do this. You're going to have to practice the skill. Like any new skill, you have to practice. But you can do it. Um, and I think what's very empowering now is that product managers are doing this and can do this. They can really build very, very profitable products with this skill set.

  17. 1:09:57–1:15:15

    The great evals debate

    2. LR

      Okay. Great segue to a debate that we kind of got pulled into that was happening on, on X the other day. Uh, I did not realize how much controversy and drama there is around evals. There's a lot of people with very strong opinions. Uh, so how about, Shreya, give us just a sense of the two sides of the debate around the importance and value of evals, and then give us your perspective.

    3. SS

      Yeah. So, all right. I'll be a little bit placating and say, I think everyone is on the same side. I think the misconception is that people have very rigid definitions of what evals is. For example, they might think that evals is just unit tests, or they might think that evals is just the data analysis part and no online monitoring or any... No monitoring of product-specific metrics, like actually the number of chats engaged in or whatnot. Um, so I think everyone has a different mindset of evals going in. And the other thing I will say is that people have been burned by evals in the past. So I think people have done evals badly. One concrete example of this is they've tried to do an LLM judge, but it has not aligned with their expectations. They only uncovered this later on, and then they didn't trust it anymore, and then they're like, "Oh, I'm anti-evals." And I 100% empathize with that because you should be anti-Likert-scale LLM judge. I absolutely agree with you. We are anti-that as well. So a lot of the misconception stems from two things, right? Like, people having a narrow definition of evals, and then people not doing it well, getting burned, and then wanting to avoid other people making that mistake. And then unfortunately, X or Twitter is a medium where, you know, people are misinterpreting what everybody is saying all the time, and you just get all these strong opinions of, like, "Don't do evals. It's bad. We tried it. It doesn't work. We're Claude Code," or, you know, whatever other famous product, "and we don't do evals." And there's just so much nuance behind all of it, because a lot of these applications are standing on the shoulders of evals. Coding agents are a great example of that. Claude Code, right? They are standing on the shoulders of the Claude base mod... Not base, but the f- the fine-tuned Claude models that have been evaluated on many coding benchmarks. Can't, can't argue against that.
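[Editor's note] To make the "anti-Likert-scale" point concrete, here is a hypothetical prompt and parser illustrating the judge shape the speakers favor: one failure mode per judge, a binary PASS/FAIL verdict with a critique, rather than a 1-5 rating. The prompt text, failure mode, and function names are all invented for illustration.

```python
# Hypothetical prompt for a narrow binary judge. Key design choices:
# a single failure mode, a written critique, and a binary verdict
# instead of an ambiguous 1-5 Likert score.
BINARY_JUDGE_PROMPT = """\
You are reviewing one trace from a real-estate assistant.
Failure mode to check: the assistant hands off to a human
instead of answering a question it could answer itself.

Trace:
{trace}

First write a one-sentence critique, then on the last line
output exactly PASS or FAIL.
"""

def parse_verdict(judge_output: str) -> bool:
    """Map the judge's final line to a boolean; anything else is an error."""
    last = judge_output.strip().splitlines()[-1].strip().upper()
    if last not in ("PASS", "FAIL"):
        raise ValueError(f"unparseable verdict: {last!r}")
    return last == "PASS"
```

Forcing a binary answer keeps the judge checkable against human labels: you can measure its agreement rate directly, which is exactly where Likert-scale judges tend to quietly drift from expectations.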

    4. LR

      And just to double do- just to make clear exactly what you're talking about there, one of the heads, uh, I think maybe the head engineer of Claude Code went on a podcast and he's like, "Oh, we don't do evals. We just vibe. We just look at vibes." And vibes meaning they just use it and feel if it's right or wrong.

    5. SS

      And I think that kind of works. So there's two things to that, right? One is they're standing on the shoulders of the evals that their colleagues are doing for coding.

    6. LR

      Of the Claude foundational model.

    7. SS

      Absolutely, right? We know that they report those numbers because we see the benchmarks. We know who's doing well on those. The other thing is they are actually probably very systematic about the error analysis to some extent. I bet you... that they're monitoring who is using Claude, how many people are using Claude, how many chats are being created, how long these chats are. They're also probably monitoring in their internal team. They're dogfooding. Anytime something is off, they maybe have a queue or they send it to the person developing Claude Code, and this person is implicitly doing some form of the error analysis that Hamel talked about. All of this is evals, right? There's no world in which they are just being like, "I made Claude Code. I'm never looking at anything." Um, and unfortunately, right, when you don't think about that or talk about that... I think that most of the community is beginners, right, people who don't know about evals and want to learn about it. Um, and it sends the wrong message there. Now, I don't know what Claude Code is doing, obviously, um, but I would be willing to bet money that they're doing something (laughs) in the form of evals.

    8. HH

      We'll also say that coding agents are fundamentally very different than other AI products because the developer is the domain expert. So you can short-circuit a lot of things, be- and also, the developer is using it all day long. So there's a type of dogfooding and type of dome- domain expertise that is, you know, you can collapse the activities. You don't need as much data. You don't need as much feedback or exploration because you know. So your eval process, you know, should look different, uh-

    9. LR

      Be- because you're seeing the code.

    10. HH

      ... in that situation.

    11. LR

      Like, you see the code-

    12. SS

      Yes.

    13. LR

      ... it's generating. You can tell this is great, this is terrible.

    14. HH

      Yeah, yeah. And so, and so, I think a lot of people have generalized coding agents because coding agents are the first AI product released into the wild, and I think it's a mistake to try to generalize that at large.

    15. SS

      The other thing is, yeah, b- a- engineers have a dogfooding personality. Right, there are plenty of applications where people are trying to build AI in certain domains and, and they don't have dogfooding for... Like, doctors, for example, are not out there trying to get all the most incorrect advice from AI, (laughs) and be tolerant and receptive to that. So it's very important to keep, I think, these nuanced things in mind.

  18. 1:15:15–1:18:23

    Why dogfooding isn’t enough for most AI products

    2. LR

      So what I'm hearing from you, Shreya, is- interestingly is that if you... if humans on the team are doing very close, uh, data analysis, error analysis, dogfooding it like crazy, and essentially, they are the human evals, and you're describing that as that's within the umbrella of evals, so you could do it that way if you're very... if you have time and motivation to do that, or you could set these things up to be automatic.

    3. SS

      Absolutely. Uh, uh, it's also about the skills, right? People who work at Anthropic are very, very highly skilled. Um, they've been trained in data analysis-

    4. LR

      Hm.

    5. SS

      ... or software engineering-

    6. LR

      I can notice.

    7. SS

      ... or AI and whatnot, right?

    8. LR

      Yeah.

    9. SS

      And, you know, you can get there. Anyone can get there, of course, by, like, learning the concepts. But most people don't have that skill right now.

    10. HH

      Do- dogfooding is w- is a dangerous one only because a lot of people will say they're dogfooding. They're like, "Yeah, we dogfood it." But are they really? And a lot of people aren't really dogfooding it at that visceral level that you would need to, to have... to close that feedback loop. So that's the only caveat I would add.

    11. LR

      There's also this kind of, (clears throat) feels like straw man argument of evals versus AB tests.

    12. SS

      Hm.

    13. LR

      Talk about your thoughts there, because that feels like a big part of this debate people are having. Like, do you need evals if you have AB tests that are testing production level metrics?

    14. SS

      So AB tests are, again, another form of evals, I imagine, right? Like, when you're doing an AB test, right, you have two different experimental conditions, and then you have a metric that quantifies the success of something, and you're comparing the metric. And again, right, an eval, in our mind, is systematic measurement of quality, some metric. Um, you can't really do an AB test without an eval to compare. Um, (laughs) so maybe (laughs) , maybe we just have a different weird take on it.
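[Editor's note] The point that "an AB test is two conditions plus a metric comparison" can be sketched concretely. This is a stdlib-only illustration, not from the episode: it compares a binary eval metric (say, a judge's pass rate) across variants A and B with a standard two-proportion z-test, using the normal approximation via `math.erf`.

```python
import math

def pass_rate_z_test(pass_a: int, n_a: int, pass_b: int, n_b: int) -> tuple[float, float]:
    """Two-proportion z-test on a binary eval metric (e.g., judge pass rate).

    Returns (difference in pass rates, two-sided p-value).
    """
    p_a, p_b = pass_a / n_a, pass_b / n_b
    # Pooled proportion under the null hypothesis of equal pass rates.
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_b - p_a, p_value

# Hypothetical experiment: variant B's prompt change lifts the judge
# pass rate from 70/100 to 85/100.
diff, p = pass_rate_z_test(70, 100, 85, 100)
```

The eval supplies the metric being compared; the AB test is just the comparison. Without a trusted per-trace measurement, the two conditions have nothing meaningful to be compared on.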

    15. LR

      Yeah, okay. So what I'm hearing is, like, you consider AB tests as part of the suite of evals that you do. I think when people think AB test, it's like we're changing something in the product. We're gonna see if this improves some metric we care about. Is that... is that enough? Why do we need to test every little feature? Like, if it's impacting a metric we care about as a business, we have a bunch of AB tests that are just constantly running.

    16. SS

      This is now a great point. Um, so I think a lot of people prematurely do AB tests because, you know, they've never done any error analysis in the first place. They just have hypothetically come up with their product requirements, and they, like, believe that, you know, we should test these things. Um, but it turns out, right, when you get into the data, as Hamel showed, that, like, the errors that you're seeing are, like, not what you thought the errors might be. They were these, like, weird handoff issues or, like, I don't know, like, the text message thing was strange. Um, so I would say that, like, if you're going to do AB tests, and they are powered by actual error analysis, as we've shown today, then that's great, go do it. Um, but if you're just going to do them, which we find that people try to do, based on, like, what you hypothetically think is why this is important, then I would encourage people to go and, like, rethink that and kind of ground your hypotheses.

  19. 1:18:23–1:22:28

    OpenAI’s Statsig acquisition

    1. LR

      Do you have thoughts on what Statsig's gonna do at OpenAI? Is there anything there that's interesting? Just like, that was a big deal, a huge acquisition. AB test company, people are like, "Oh, AB tests of the future." Uh, thoughts?

    2. HH

      You know, just to add to the previous question a little bit: why is there this debate, AB testing versus evals? I think fundamentally, with evals, people are trying to wrap their head around how to improve their applications. And fundamentally, data science is useful in products: looking at data, doing data analytics. There's a whole suite of different tools, and, um, you don't need to invent anything new. Sure, you don't necessarily need, like, the whole breadth of data science, and it looks slightly different, just slightly, with LLMs. Um, you know, your tactics might be different. And so really what it is, is using analytic tools, uh, to understand your product. Now, people saying the word "evals" are trying to kind of carve out this new thing, saying, oh, evals, and then AB testing. But if you zoom out, it's the same data science as before. And I think that's what's causing the confusion: hey, we need data science thinking. And it's helpful to have that thinking in AI products like it is in any product, uh, is my take on that. So-

    3. LR

      Yeah. That's a really good take. Like, I think just the word evals triggers people now.

    4. SS

      Yeah.

    5. LR

      And if you just call it we're just doing error analysis using, doing data science to understand where our problem br- our product breaks and just setting up tests to make sure we know-

Episode duration: 1:46:32


Transcript of episode BsWxPI9UM4c
