
How to Build AI Evals in 2026 (Step-by-Step, No Hype)

Hamel Husain and Shreya Shankar are back with the definitive guide to AI evals: a step-by-step walkthrough using real production data from Nurture Boss, covering error analysis, LLM judges, and the mistakes 90% of teams make.

Full Writeup: https://www.news.aakashg.com/p/hamel-shreya-podcast-2
Transcript: https://www.aakashg.com/how-to-master-ai-evals-a-step-by-step-guide-with-hamel-husain-shreya-shankar/

----

Timestamps:
0:00 - Intro
2:09 - Why Every AI Product Needs Evals
3:11 - Real Example: Nurture Boss Case Study
5:26 - Starting with Observability
11:24 - Ad Start
13:05 - Ad End: Analyzing Traces
24:55 - Error Analysis Introduction
27:00 - Axial Coding Explained
30:53 - Ad Start
32:40 - Ad End: Counting Issues
42:26 - Building Your LLM Judge
48:02 - Measuring the Judge
56:38 - PM vs AI Engineer Roles
1:01:29 - Common Mistakes to Avoid
1:06:31 - Outro

----

🏆 Thanks to our sponsors:
1. The AI Evals Course for PMs & Engineers: You get $800 off with this link: https://maven.com/parlance-labs/evals?promoCode=ag-product-growth
2. Vanta: Automate compliance. Get $1,000 off with my link: https://www.vanta.com/lp/demo-1k?utm_campaign=1k_offer&utm_source=product-growth&utm_medium=podcast
3. Jira Product Discovery: Plan with purpose, ship with confidence - https://www.atlassian.com/software/jira/product-discovery
4. Land PM Job: A 12-week experience to master getting a PM job - https://www.landpmjob.com/
5. Pendo: The #1 Software Experience Management Platform - http://www.pendo.com/aakash

----

Key Takeaways:
1. AI evals are the #1 most important new skill for PMs in 2026 - Even the Claude Code team does evals upstream. For custom applications, systematic evaluation is non-negotiable; dogfooding alone isn't enough at scale.
2. Error analysis is the secret weapon most teams skip - Looking at 100 traces teaches you more than any generic metric. Hamel: "If you try to use helpfulness scores, the LLM won't catch the real product issues."
3. Use observability tools, but don't depend on them completely - Braintrust, LangSmith, and Arize all work. But Shreya and Hamel teach students to vibe code their own trace viewers. Sometimes CSV files are enough to start.
4. Never use agreement as your eval metric - It's a trap: a judge that always says "pass" can have 90% accuracy if failures are rare. Use TPR (true positive rate) and TNR (true negative rate) instead.
5. Open coding then axial coding reveals patterns - Write notes on 100 traces without root cause analysis, then categorize into 5-6 actionable themes. Use LLMs to help, but refine manually.
6. Product managers must do the error analysis themselves - Don't outsource it to developers; engineers lack domain context. Hamel: "It's almost a tragedy to separate the prompt from the product manager because it's English."
7. Real traces reveal what demos hide - ChatGPT said the assistant was correct but missed: wrong bathroom configuration, markdown in SMS, double-booked tours, ignored handoff requests.
8. Binary scores beat 1-5 scales for LLM judges - They're easier to validate for alignment, business decisions are binary anyway, and LLMs struggle with nuanced numerical scoring.
9. Code-based evals for formatting, LLM judges for subjective calls - Markdown in text messages? Write a simple assertion. Human handoff quality? You need an LLM judge with a proper rubric.
10. Start with traces even before launch - Dogfood your own app, recruit friends as beta testers, and generate synthetic inputs only as a last resort. Error analysis works best with real user behavior.
---- πŸ‘¨β€πŸ’» Where to find Hamel Husain: Website: https://hamel.dev Twitter/X: https://x.com/HamelHusain Course: https://evals.info πŸ‘¨β€πŸ’» Where to find Shreya Shankar: Website: https://www.shreya-shankar.com Twitter/X: https://x.com/sh_reya Course: https://evals.info πŸ‘¨β€πŸ’» Where to find Aakash: Twitter: https://www.x.com/aakashg0 LinkedIn: https://www.linkedin.com/in/aagupta/ Newsletter: https://www.news.aakashg.com #aievals #aipm #productmanagement ---- 🧠 About Product Growth: The world's largest podcast focused solely on product + growth, with over 200K+ listeners. πŸ”” Subscribe and turn on notifications to get more videos like this.

Aakash Gupta (host) · Hamel Husain (guest) · Shreya Shankar (guest)
Jan 15, 2026 · 1h 7m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00 – 2:09

    Intro

    1. AG

      What are we gonna do today?

    2. HH

      So today we're gonna walk you through how to do evals step by step with a real live example on real data.

    3. AG

      Do we even need eval? I've heard Claude Code doesn't use evals.

    4. SS

      Oh my gosh. This is a crazy controversy that's been going around.

    5. AG

      There's way too much hype about AI. To build a really good AI feature that's not just a demo, you need to build something that goes to production. I consider AI evals the number one most important new skill for product managers. Where do people at Anthropic and OpenAI go to learn AI evals? It's Hamel Husain and Shreya Shankar. This is what your [laughs] AI agents are actually doing out there in production, and that's why looking at the traces is so important.

    6. SS

      We show so many demos in class where we just dump this trace into ChatGPT, and we ask, "Was the assistant correct?" And then ChatGPT will say, "Yeah, absolutely," but it will miss all of this nuance.

    7. AG

      Do I need an AI observability tool? I already am paying for whatever, Datadog or whatever APM tool I already have. What is the difference?

    8. HH

      You don't necessarily need a tool. It's sometimes good to start with one, but if you want to, you can log to CSV file, JSON file, text file, whatever you feel comfortable with.

    9. AG

      What are other mistakes, like separating the prompt from the product manager, that people might be doing in this process that we walk through today that is unintentionally inhibiting them?

    10. HH

      The main thing that's inhibiting people is not doing the error analysis.

    11. AG

Before we get into today's episode, if you can do me a quick favor and check that you're following on Apple Podcasts and Spotify and subscribed on YouTube, these are free actions you can take that really help the show grow. And if you become an annual subscriber to my newsletter, did you know that you get access to over $28,000 of premium products? That's right: Mobbin, Arize, Relay.app, Dovetail, Linear, Magic Patterns, DeepSky, Reforge Build, and Descript. They are all free for an entire year if you become an annual subscriber to my newsletter. So go take advantage at bundle.aakashg.com. And now into today's episode.

  2. 2:09 – 3:11

    Why Every AI Product Needs Evals

    1. AG

      Hamel, Shreya, welcome to the podcast.

    2. HH

      Thank you for having us again.

    3. SS

      Yeah, super excited.

    4. AG

      What are we gonna do today?

    5. HH

      So today we're gonna take you step by step on how to do application-specific evals.

    6. AG

      Do we even need evals? I heard Claude Code doesn't use evals.

    7. SS

      Oh my gosh. This is a crazy controversy that's been going around. Absolutely everyone needs evals, and some people are less rigorous about it because perhaps there's somebody else who's done evals for them upstream. For example, in coding agents, you know that people who are training the models are testing on a bunch of code, so maybe you can easily build a coding application based on, you know, rig- religiously dogfooding your outputs. But for most applications that are not just naive applications of foundation models, such as what you're building, you're going to need some form of evals.

    8. AG

      Couldn't agree more. Evals, it's about actually improving your product. Maybe you're doing that through dogfooding, or maybe you're doing it through the systematic process that we're about to walk you through, but you have to do them. So let's get

  3. 3:11 – 5:26

    Real Example: Nurture Boss Case Study

    1. AG

      started.

    2. HH

So today I'm gonna walk you through how to go about evals using a real company that I worked with, Nurture Boss. And Nurture Boss has been very generous in allowing me to use some of their anonymized data as a teaching example. So what is Nurture Boss? Nurture Boss is an, is a tool that helps property managers who are managing apartment complexes deal with things like tenant interaction and marketing and sales. And so you can see their website here, nurtureboss.io, and you can kind of get a feel for some of their... You know, it's a mobile... You can have a mobile app. You can embed it on the website, but you can see here's an example interaction: "Do you have any two bedrooms available?" And the Nurture Boss application is interacting with the tenant for you, helping to show listings, schedule appointments, so on and so forth. Um, you know, kind of all the different activities that you might be engaged in as a property manager, the, their application is helping you manage that with the assistance of AI. So it's a really good example because it incorporates all the messiness of a real-world AI application. There are tool calls, there's RAG, multi-turn conversations. There's even multiple channels you can interact with the application through: voice, text message, or chatbot. And so it's kind of a lot of different messiness of the real world. This is not a simplified example. This is something that you will encounter, well, you know, in the real world, your application might have these complexities. So, like, how do you go about thinking about evals? So when I started work- working with Nurture Boss, they had something initially that worked, but they really wanted to know, okay, how do we, number one, figure out what's going wrong, and number two, like, how do we improve the application systematically beyond just doing vibe checks? 'Cause they already did vibe checks. You know, they were using their own application. They had some, uh, design partners and some initial customers, but they wanted to move beyond that and make it really good.

  4. 5:26 – 11:24

    Starting with Observability

    1. HH

Okay, so what did... So the first thing that you wanna start with is some kind of observability. So the Nurture Boss application, they instrumented their code, and they captured their traces. So let me just t- show you what a trace looks like. So this is an, uh, observability platform called Braintrust, and doesn't really matter what you use. There's a lot of popular ones out there. Ones that I see are things like... Okay, so Braintrust is one. LangSmith is another one that's popular. Arize is another one. It really doesn't matter which one you use, um, but the reason I'm showing you one is so that you get a feel for what traces might look like and also learn what a trace is. So this is an example of a trace. Let me just make it big.

    2. AG

      And before we even get there, some people might be-

    3. HH

      Yeah

    4. AG

... wondering, "Do I need an AI observability tool? I already am paying for whatever Datadog or whatever APM tool I already have." What is the difference between those tools?

    5. HH

Yeah, you don't necessarily need a tool. It's sometimes good to start with one, but if you want to, you can log to a CSV file, JSON file, text file, whatever you feel comfortable with. The reason I have it pulled up right here is so that we can read it together and kind of, you know, have something to look at. But if you're using Datadog, feel free to log things to Datadog to begin with. Um, the most important thing that you're gonna want to have is to take notes on your traces, and we'll show you why in a second. Um, one of the things that Shreya and I teach is actually to vibe code your own trace viewer, and we'll talk about that in a moment, and I can show you Nurture Boss did vibe code their own trace viewer eventually. They didn't end up using this. Um, but, you know, sometimes to get started, it is handy to use an observability platform, but sometimes it's not. Sometimes, you know, depending on what you want, maybe you already have one. Y- feel free to use that. The key is, like, do the simplest thing you can think of.
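In practice, "log to a file" can be as small as appending JSON lines that capture everything shown to the LLM. A minimal sketch, assuming an OpenAI-style message list; the function name, fields, and file path are illustrative, not Nurture Boss's actual code:

```python
# Minimal trace logging without an observability platform: append each
# conversation (system prompt, user/assistant turns, tool calls) to a JSONL
# file you can open in a spreadsheet or vibe-coded trace viewer later.
import json
import time
import uuid

def log_trace(messages, tool_calls=None, metadata=None, path="traces.jsonl"):
    """Append one trace as a single JSON line."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "messages": messages,          # the full list shown to the LLM
        "tool_calls": tool_calls or [],
        "metadata": metadata or {},    # channel, property id, etc.
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example: log a single text-message exchange.
log_trace(
    messages=[
        {"role": "system", "content": "You are a leasing assistant..."},
        {"role": "user", "content": "Do you have any two bedrooms available?"},
        {"role": "assistant", "content": "Yes, here are three options..."},
    ],
    metadata={"channel": "sms"},
)
```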

    6. AG

      Hmm. Get started.

    7. HH

Yeah. Um, okay, so here's a trace, and you can see the trace just logs all of the different turns and all of the information that is shown to the LLM. And so what you have here is a system prompt: "You are an AI assistant working as a leasing team member at Harmony," which is the name of a fictitious apartment complex. "Your primary role is to respond to text messages from both residents and prospectives," uh, both residents and prospective residents. So it's very interesting. Uh, people will be text messaging this application.

    8. AG

      Hmm.

    9. HH

      Um, and so, you know, you're engaged with the customer to answer questions, book tours, and drive applications, and there's a whole host of different rules, um, about how to respond to the customer. And what you'll see here is, um, you have some rules, okay, like how to interact with them, determine if the inquiry is from a resident or prospective resident. I won't read all of these, uh, for you. There's some property-specific information here, like URLs. This is anonymized obviously. The URL is not acme.com. Um, but you can get the idea. And, uh, you know, this is basically the system prompt. So that's the system prompt. You see the first user message is, um, y- you can see that there's, like, some kind of logging error. Um, it's like, "Don't know who this is or what apartment you're from." So you could see, like, it's, you know, this is real world messy logging. But, um, the first question is, "I need a one-bedroom with the bathroom not connected and floor plans." And then you can see there's a tool call of some kind, where we're getting the individual's information. We're getting, um, something about the availability, and the tool is returning a list of apartments here. You can see these are just kind of like a list of apartments. And then we have the assistant responding with, you know, these apartments here, and, um, you know, it's saying, "Okay, we have, like, these three apartments with a link to this floor plans page," so on and so forth. Um, and then the y- you know, it's a text message, so this user is saying something about being sick and, um, you know, not being able to book tours, um, and something about want a bathroom connected to the room. Um, "I'll check for one-bedroom apartments where the bathroom is not connected." "Thank you," and, "You're welcome." Okay, so, so there's a lot of things that kind of went wrong here. One is what is going on here with, um, "I'd want a," you know, "I want a bathroom connected to the room," but it just said, "I'll check on that," but then it didn't do anything.

    10. AG

      [laughs]

    11. HH

      Right? [laughs] Uh, so we would hope that it would actually, you know, do something or if it's not able to do it, hand off to a human.

    12. AG

      Yeah.

    13. HH

      So, uh, you know, that's, that's kind of funny. I do happen to know, um, that... So this response up here is also in markdown. So you can see here, um, yeah, this is a markdown response, and this is a text message. So it's gonna be rendered a bit weird in a text message 'cause text messages don't have markdown. And so, um, this is a bit problematic too with the bold and everything. It's gonna, it's gonna come across, uh, you know, potentially in a weird way.

    14. AG

      With, like, asterisks and stuff.

    15. HH

      Yeah, yeah. It's gonna have asterisks, and it's gonna have square brackets and stuff like

  5. 11:24 – 13:05

    Ad Start

    1. HH

      that.

    2. AG

If you've been enjoying this demo on how to do AI evals, you are going to love Hamel and Shreya's course. It is the top-grossing course on Maven. It is taken by people from OpenAI to Anthropic to Google to Meta. All of the top AI companies are taking this course to improve their evals. I have secured a massive 35% discount for you guys. So use code AG-EVALS so that you can get this course. You can learn even more in detail how to write great evals, how to build great AI products that are working in production. Check it out at maven.com and look for their course and type in code AG-EVALS. That means it will only cost you $2,275. The next version of their course starts on October 6th.

Today's episode is brought to you by Vanta. As a founder, you're moving fast toward product-market fit, your next round, or your first big enterprise deal. But with AI accelerating how quickly startups build and ship, security expectations are higher earlier than ever. Getting security and compliance right can unlock growth or stall it if you wait too long. With deep integrations and automated workflows built for fast-moving teams, Vanta gets you audit-ready fast and keeps you secure with continuous monitoring as your models, infra, and customers evolve. Fast-growing startups like LangChain, Writer, and Cursor trusted Vanta to build a scalable foundation from the start. So go to vanta.com/aakash. That's V-A-N-T-A.com/A-A-K-A-S-H to save one thousand dollars and join over ten thousand ambitious companies already scaling with Vanta.

  6. 13:05 – 24:55

    Ad End: Analyzing Traces

    1. SS

      One more thing is actually in the first message, the user said they wanted the bathroom and bedroom disconnected. Yeah, see, "I need one bedroom with a bathroom not connected." And then the assistant's first message was, "Here are some bedrooms with bathrooms connected."

    2. HH

      Oh, yes. See, there you go. That's a, that's a good observation. So it actually didn't really help the user. Um, the user just kind of gently reminded them again, like, "Hey, I do want a bathroom connected to the room." I mean, this is, like, very messy. You could see, like, there's misspellings. I could almost not understand what the person was asking.

    3. SS

      Yeah. The n- the now should be a not. I do not want a bathroom connected to the room.

    4. HH

      Yeah.

    5. AG

      So this is awesome. This is... Guys, if you are PMs listening, this is what your AI agents are actually doing out there in production a lot of the time. And so your demo is one thing. It goes well, but then when it goes out in production, there's all this hairiness, and that's why looking at the traces is so important.

    6. HH

Yes. And so this is exactly why you don't want to have generic metrics. If you try to put helpfulness score, conciseness score, whatever in here, or you try to have AI look through your traces, it's not gonna catch stuff like this very well at all. 'Cause there's a lot of context that you have as a PM, and a lot of these things come down to taste that you need to reflect on and say, "Hey, this is not a good experience from a product perspective." And the language model is not gonna do-- know that, because it hasn't been able to read your mind.

    7. AG

      Mm-hmm.

    8. HH

      Um, and so-

    9. SS

This happens all the time. We show so many demos in class where we just dump this trace into ChatGPT, I mean, you can probably even do it now, or Claude, and we ask, "Was the assistant correct?" And then ChatGPT will say, "Yeah, absolutely." You know, "It sounds correct." But it will miss all of this nuance that Hamel and I and Aakash have been mentioning because, you know, we actually put our product hat on and thought about the user experience a little bit.

    10. HH

      Yeah. Let's see what ChatGPT does. Oh, look at that.

    11. SS

      So it found that one.

    12. AG

      Okay. So it figured out that we didn't get the connected bathroom. Didn't use the-

    13. SS

      But this is hilarious. It says it doesn't filter by bathroom configuration, and the interesting thing is, who knows if that's a filter that the tool provides.

    14. HH

      Yeah.

    15. SS

      Assistant only cherry-picked three examples. I mean, maybe that's fine for us, right? Like, nobody wants to s- put all one bed... I don't wanna see a text message of every single apartment. I actually only wanna see a couple.

    16. HH

      Yep.

    17. AG

      So ChatGPT might help you a little bit, but you ultimately need to put your human touch on top of this and make sure it's correct.

    18. HH

      Yeah. It won't, it won't, uh, know that about the markdown, you know? So, like, I can-

    19. SS

      Yeah, didn't catch that.

    20. HH

      Yeah, you can, um, you know, I can change the format of this rendering, I believe, somehow to, uh, show you the raw. This is actually all rendered as markdown, but, you know, since it's a text message. But, like, you know, ChatGPT is not gonna know that. And there's other examples that we'll see that, okay, you need to put your... You need to kind of have a keen eye about what's going on in the product and bring your whole, whole product knowledge to bear. Um, and so yeah. If you try to... It, it can, it can miss a lot of things. Um, and so, oh yeah, this is, like, the raw. This is another raw way of looking at it in, like, YAML form. Um, but anyways, like, that's not, that's not important. Um, so what you could do from here is you need to write a note. So I'm gonna go into review mode. Oops. Let me go back to... Whoops. Let me go back to that trace. Sorry. The wrong, wrong hotkey. Okay, let me find that trace again. Uh, let me see. Is it this one? Nope. If I try going back, maybe that will save me. No, that didn't save me. Uh, let's see. Is it this one?

    21. SS

      This is also why we encourage people to build their own tools.

    22. HH

      Um, let me see. Is this the right one? Yeah, this is the right one I think, right?

    23. SS

      I think this is it, yeah.

    24. HH

      Yeah. Okay.

    25. SS

      Yeah.

    26. HH

So let me, uh, come here and... Oh, there we go. Okay, notes. So I would put a note here. Um, so some issues here would be, um, you know, told user that it would check on bathrooms but didn't do it. Um, could say, like, also did not, uh, follow user instructions and, uh, rendered markdown in a text message. And so, uh, what you really wanna do is, this can sound very tedious. Like what I just did, it sounds like it's very resource and time intensive, but it's really not. Like you just scan the trace and, you know, if you're familiar with the system prompt, you don't have to read it. Um, you're not gonna read every system prompt 'cause they're gonna be the same really, unless you need to. Um, but it's really... You can, you know, within about 30 seconds or so, or less, you can kinda scan this and say, "Hmm, okay." You can get a sense of like what is happening, and you can write some notes. And it's... Perfection is not key. The key is like see what's going wrong in the trace, note what you see, and move on. Um, you don't have to catch everything, just catch the most important things. And so we can keep going. So let's go down to the next trace. And let's see. Let me just... You're gonna have to edit this out. Uh, let me find one with an issue. Okay, here we go. Here's one with the issue, so let me edit that navigation out. Um, so we have, uh, let me just hide some stuff here. So the user in this case is asking, this is a new trace now, "Do y'all have one bedroom with study available? I saw it on the virtual tours." And there's a tool call to get information, um, and availability, and it says, "Yes. Hi Priya. We currently have several one-bedroom apartments available, but n- but none specifically listed with a study." Um, so okay. It matches. She did ask for a study, um, so it gave her one bed- uh, one-bedrooms instead. And then the user asked, "Can you let me know when one with a study is available?" And the assistant says, "I currently don't have specific information on the availability of a one-bedroom." So okay. Is this kind of... This is where you get frustrated, right? I asked a question, and you just responded with some robotic like, "I don't have that." Um, and so I would say, yeah, this is a, th- this is a product failure. Um, and what you wanna do is say like, okay, um, you wanna just note that real quick. So, um, should have handed off to a human or have better lead nurturing. I'm not-- There's no pun intended with the word nurturing. Just that's what came to mind.

    27. SS

      [chuckles]

    28. HH

      So, um, you know, anything else that you think is wrong with this particular trace?

    29. SS

      Yeah, I don't think you have to get bogged down. It's a good question to ask, but typically we tell people, "All right," like, "think of everything that comes to mind, write them down, move on." Right? You wanna like get into kind of a flow state here. Like you can d- debate every trace endlessly, and sometimes you see people get stuck in that. So try to just kind of avoid that. We got a problem. Next.

    30. HH

      Yeah. I agree with that. Move on. Um, this one I already did, but that's okay. We can, we, we can prot- we can do it again. Um, so let's find the first user message. Let me scroll up. Sorry. Um, okay. Uh, so I'm in California. So okay, we have to edit the part out where I scroll to this part, but we can start. So this is a new-- This is, uh, kind of the third trace we're looking at. Um, so the user asks, "I'm in California looking to relocate to Texas by March 15th, moving." "Thanks for sharing. Since you're planning to relocate," blah, blah, blah, "I can help you explore available apartments. If you'd like, we can also schedule a virtual tour." Um, and he's like, "Yeah, that's great." Uh, you know, "Thank you." And, "Okay. I'll arrange a virtual tour for you so you can explore community. What's your preferred date and time?" "Tomorrow is fine for me, 9:00 AM." "I can schedule a virtual tour for you." That's fine. It schedules a tour, and then it says, "Your virtual tool- tour is all set." Looks good, right? Actually, it didn't go so well. And so the reason it didn't go so well is 'cause there's no such thing as a virtual tour for this apartment.

  7. 24:55 – 27:00

    Error Analysis Introduction

    1. HH

So the, the idea is that you just keep doing this, and so this actually, you can do this quite fast. You do this for, let's say, 100 or so traces. Just write down what you see. Don't try to get into root cause analysis. Don't try to figure out, like, what went wrong exactly. Just journal, observe freely. Ob- just journal what you see going wrong, if anything. If nothing is wrong, you can just skip it. Uh, but when you do find something wrong, go ahead and write that down.

    2. SS

      Mm-hmm.

    3. HH

      Does that seem cl- Is that clear, Aakash-

    4. SS

      Yeah

    5. HH

      ... that process?

    6. SS

      Yep, exactly.

    7. HH

Okay, great. So, so now you have what we call a bunch of open codes, and this is not a... So, okay. This is the start of the most important part of evals, which is called error analysis, and something that's very approachable to everyone, and it's actually... It's very important for product managers to be involved in this, 'cause a lot of times engineers don't have the context, the full context to know if this is good or bad. Um, and so what you end up having is you have, let's say, a bunch of these, like, notes. So y- I have the spreadsheet open right now with a collection of all the notes that I took. And, you know, uh, let's say I did 100 of these. I actually found 40 or so different errors. Um, you might find a different number of errors if you're doing it. Um, but here's a collection of these, like, notes. Okay, great. So up until this step, you've already learned quite a lot. If you d- if you look at 100 traces, you're gonna learn and you're gonna understand your system better than anyone else, and you're gonna have a really, like, deep understanding of what is wrong, and you might also have a pretty good sense of what t- you need to work on next already without doing any analysis. But it's really good to do analysis of these notes that you took. So how do you do this analysis?

  8. 27:00 – 30:53

    Axial Coding Explained

    1. HH

So the next step is you categorize these notes. So the term for that is called axial coding, and I'm gonna show you how to do this in a spreadsheet, so let me just zoom out a little bit. Um, but first, one thing you can do is, okay, you can take these, and you can, yeah, put them into ChatGPT or Claude. So what I did is I took, um, sort of the logs, and I said, "Okay, please analyze." I exported it from here. So, uh, you know, there's... Let me go back. So there's an Export button here, and I said, "Okay, download as CSV," downloaded it, put it in Claude, and I said, "Hey, there's a metadata field which has a nested field called Znote that contains all the open codes." And I use the words open codes 'cause that's a term of art that LLMs understand. Uh, these terms, open coding and axial coding, which we mentioned, open coding is the writing of the notes. That's actually a term that's well understood in the field of machine learning, but it also goes... It's a- been around before ma- machine learning. It's been used in the social sciences. This kind of process of op- open coding, axial coding, is a thing that LLMs understand. And so, um, I just say there's a metadata field which has a nested field called Znote that contains open codes for analysis of LLM logs, which are, uh, that we are conducting. Please extract all the different open codes and then propose five to six categories that we can create axial codes from. Okay? And then, like, you know, it'll, it'll kind of go through, and you can, like, get these categories. So here are some categories, like, you know, capability limitations, measure rep- representation. Some of these I don't like, 'cause they're a little bit too broad. They're not actionable. I'm actually not 100% sure what that means, so I might look into it and rename it a bit. Um, you know, human handoff issues. There's certainly some of that. That's when, hey, you want to escalate to a human being or hand off to a human being, but it's not doing it properly. Um, temporal contextual awareness. It doesn't know what the current date and time is. Um, you know. So there's some categories here. Um, you can refine this. What I like to do is take it to a spreadsheet. So I, like, have some categories that I kind of have, that I maybe have from ChatGPT, and then I kind of look at them and edit them. So I kind of have these, like, categories that I sort of edited a bit, and I said, "Okay, let me just collect these into a list." So that's what this formula does, is I'm just collecting this list of categories into a list. That's all that is. Now-

    2. SS

      I have a note here.

    3. HH

      Yeah.

    4. SS

Sometimes you... So I think one of the things that's very interesting from looking at your Claude is a lot of those axial codes are very vague, right? Like quality or temporal issues, and you kind of want to make sure your actual axial codes are not so vague, because imagine you're giving them to somebody else to do some labeling with, right? Like, something like conversational flow issues might be a little bit better. Honestly, we could even make it a little bit more specific, but something like temporal issues, right? Like, if Hamel told me to go label with a temporal issues category, I wouldn't even know what he means. I would want to say, like, you know, f- date formatting error or, like, something like that, right? So I think that's a- another place where people get tripped up, which is, you know, just taking things out of the LLM as is and not really thinking about, okay, how do I refine that in a way that's gonna give me meaningful error categories?

    5. HH

      Mm.
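To make the export-then-categorize step concrete, here is a minimal sketch of asking an LLM to propose axial codes from exported open codes. It assumes an OpenAI-style client and a flattened CSV column for the notes; the model name, column name, and prompt wording are illustrative:

```python
# Sketch: hand the LLM your open-coded notes and ask for 5-6 candidate
# categories, then refine them by hand as Shreya describes above.
import csv
from openai import OpenAI

client = OpenAI()

def propose_axial_codes(csv_path: str, note_column: str = "metadata.znote") -> str:
    """Ask an LLM to propose 5-6 candidate axial codes from open-coded notes."""
    with open(csv_path, newline="") as f:
        notes = [row[note_column] for row in csv.DictReader(f) if row.get(note_column)]
    prompt = (
        "The following are open codes from an analysis of LLM logs we are conducting. "
        "Please extract all the different open codes and propose 5 to 6 categories "
        "we can create axial codes from. Make each category specific and actionable "
        "(e.g. 'rendered markdown in SMS'), not vague (e.g. 'quality issues').\n\n"
        + "\n".join(f"- {note}" for note in notes)
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative; use whatever capable model you have
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content  # a draft: review and edit by hand

print(propose_axial_codes("exported_traces.csv"))
```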

  9. 30:53 – 32:40

    Ad Start

    1. HH

      Today's episode is brought to you by Jira Product Discovery. If you're like most product managers, you're probably in Jira tracking tickets and managing the backlog. But what about everything that happens before delivery? Jira Product Discovery helps you move your discovery, prioritization, and even roadmapping work out of spreadsheets and into a purpose-built tool designed for product teams. Capture insights, prioritize what matters, and create roadmaps you can easily tailor for any audience. And because it's built to work with Jira, everything stays connected from idea to delivery. Used by product teams at Canva, Deliveroo, and even The Economist, check out why and try it for free today at atlassian.com/product-discovery. That's A-T-L-A-S-S-I-A-N.com/product-discovery. Jira Product Discovery, build the right thing. Today's episode is also brought to you by my cohort-based coaching program to help you land your dream PM job. I am taking 30 elite PMs to land their jobs at Google, OpenAI, and other $700,000 plus roles. If you want in, check out landpmjob.com. Once all 30 seats are sold out, that's it, and already seats are going almost every day, so grab yours at landpmjob.com. Today's podcast is brought to you by Pendo, the leading software experience management platform. McKinsey found that 78% of companies are using gen AI, but just as many have reported no bottom line improvements. So how do you know if your AI agents are actually working? Are they giving users the wrong answers, creating more work instead of less, improving retention, or hurting it? When your software data and AI data are disconnected, you can't answer these questions. But when you bring all your usage data together in one

  10. 32:40 – 42:26

    Ad End: Counting Issues

    1. HH

      place, you can see what users do before, during, and after they use AI, showing you when agents work, how they help you grow, and when to prioritize on your roadmap. Pendo Agent Analytics is the only solution built to do this for product teams. Start measuring your AI's performance with Agent Analytics at pendo.io/aakash. That's P-E-N-D-O.I-O/A-A-K-A-S-H. Exactly right. Uh, and that's a really important thing to pay attention to. And that's why if you reflect on the categories I have in this spreadsheet, they're definitely better than the ones in the Claude-

    2. SS

      Yeah, they're different for a reason

    3. HH

... that I showed you. Um, it's because I, I iterated on it a bit. And that's, that's an important principle. You never wanna completely hand off the wheel to AI. You wanna think about what it's saying. Maybe it helps you to different degrees, but you wanna see, okay, like what are the categories here? Um, and actually, you might wanna go back and forth. So what I did here is I went to those notes, which I have here. Every row is a different note. And, you know, you can use AI, so I used AI: "Categorize the following note into one of the following categories." Okay? And what I did is I, you know, had AI, and this is a sp- this is a formula in a spreadsheet, so you can see the prompt. Um, and basically, like, classify each of these notes into one of those categories, and I went back and forth like, "Hmm, this category actually is not the greatest for this particular note." And I like went and edited the note. I went back to this category field, maybe like added one, deleted one, and like kind of fiddled with it till I was like reasonably happy, like, "Okay, this is like a good set. It's good enough." Um, one thing I d- should have added here is like none of the above, um, which would've been better. But, you know, I'm showing you the simple, stupid version, which is like get started. I don't wanna overcomplicate it, but that, that's what you should do.

    4. SS

      And none of the above is, is mainly like a means to the end. The end is really having these categories, but sometimes you might have like missed a category. So if you put none of the above here, and then your AI does a classification and tells you none of the above, then you can go read those traces again and wonder, maybe there's another category that I should add, um, so I can classify those.
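The classification pass Hamel runs as a spreadsheet formula might look like this outside a spreadsheet: a minimal sketch that assigns each note exactly one category, with Shreya's "none of the above" escape hatch. The category names and model are illustrative:

```python
from openai import OpenAI

client = OpenAI()

# Illustrative category list -- swap in the axial codes you refined by hand.
CATEGORIES = [
    "human handoff failure",
    "conversational flow issue",
    "rendered markdown in SMS",
    "tour scheduling error",
    "ignored user constraint",
    "none of the above",  # surfaces notes that need a new category
]

def classify_note(note: str) -> str:
    """Assign one open-coded note to exactly one axial category."""
    prompt = (
        "Categorize the following note into exactly one of these categories:\n"
        + "\n".join(f"- {c}" for c in CATEGORIES)
        + f"\n\nNote: {note}\n\nReturn only the category name."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    label = resp.choices[0].message.content.strip()
    # Anything the model mangles also lands in the review pile.
    return label if label in CATEGORIES else "none of the above"
```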

    5. HH

Mm. So when you get-- So you have these classifications and... Okay, let me just zoom out so we can see it together. Sorry about that. Um, so you have all these classifications, and now comes like the powerful part where you will put on... Well, you will have like real superpowers if you do this as a PM, that you will go above and beyond and kind of be armed with information that a lot of people usually are not armed with. And it's counting. So now you can count these issues and, you know, you can just use a pivot table. That's what I did here, is say, "Okay, like how many times did I see this?" So now you have taken a world where it's kind of messy and like you don't really know like what is, you know, might not know what is going on. You know that there's some errors, and you have this kind of paral- paralysis of like, "What do I work on? What do I fix next? What's the most burning problems in my app?" And, you know, you have some data in front of you. You know, like, hey, you're having these conversational flow issues a lot, and this conversational flow issue, it's actually regarding situations where there's text messages. So I happen to know that I can click on this. Um, you know, it can say like, "Hey, yeah, there's like disparate messaging, sent in-person tour link." Um, okay, there's, there's different... Sometimes it's about text messages. Sometimes it's just like it's not being clear. And we can go back to the trace and look at that. Some of my fir- uh, favorite things about pivot tables: you can, like, double-click. Um-

    6. SS

You can also make it hierarchical. I like to do this too. So sometimes I like to say, I like to break down conversational flow into like three different categor- subcategories. Maybe some will be like repeated messages, and some things will be, um, you know, the AI just should have handled this one particular thing better. I don't know. Um, so you should... I think this is kind of where the subjectivity in your product experience shines, right? It's like you have to do this process in a way that enables you to make your product better, right? Based on the capabilities of your product or what your team can do. So, you know, if you can't have virtual tours, then you can't have virtual tours. That has to somehow be encoded in your system, right? So.
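Continuing the sketch above, the counting step is a few lines: tally the labels to get the same breakdown as the pivot table (a nested tally would give Shreya's subcategories). `notes` here is the hypothetical list of open codes from the earlier export:

```python
from collections import Counter

labels = [classify_note(note) for note in notes]
for category, count in Counter(labels).most_common():
    print(f"{count:3d}  {category}")
```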

    7. HH

Yeah, definitely. Um, and so yeah, this could be made better. I think that's, you know... I didn't try to, let's say, make it perfect. But as Shreya points out, you can have, you know, subcategories which can help you kind of refine what's going on more. Um, but you can take a look at this, so you can say, "Okay, like, what do I think is, like, most important?" Like, okay, maybe the human handoff issue is not happening as much as the conversational flow issue, but let's say you feel as a product manager that is a catastrophic error and that the, the magnitude of that problem, you know, the, the sort of the impact of that problem is so high that I'm gonna prioritize that as number one. But you, you have some data to back up that this is happening, and you have an idea of what's happening, and now you have a reason to potentially write evals. Now you're not writing evals in the dark. Now you can write evals in response to actual problems that you are seeing instead of, like, hallucination score or some AI-generated something or the other. Um, you know, you can motivate this in, in this thing that you want to fix. Now, you don't have to write an eval about every- for everything. There might be some of these things that might be easy to fix. Like, for example, there's this formatting error with output. Um, you know, an example might be using markdown in text messages. You might be able to just fix that. Um, maybe, maybe there's no instruction in the prompt at all. Um, and it depends, like, what kind of eval you need to write. So there's two kinds of evals. One is code-based, where you can test something without calling an LLM. So the formatting error without- with output, you might be able to u- use a code-based eval for that. Like, hey, is the format... Do I see markdown elements in this output where there shouldn't be markdown? Uh, in which case, maybe you should write the eval 'cause it's not gonna be expensive. Um, whereas with the LLM as a judge, something more subjective like, "Hey, you should be handing off to a human," you might need an LLM for that. You might not be able to write a, an assertion in code. That's a little bit more expensive of an eval, and you have to make a judgment call like, "Okay, is that something that's trivial to fix?" Like, I didn't have that. I didn't have, like, a, you know, something in my prompt that, you know, had this instruction. Maybe you've, you've found, like, some dumb mistake that you made. Go ahead and fix it. You don't have to get caught up in evals. What you wanna do is write an eval for something that you think you might want to iterate against. I don't know if there's a better way to say that. I feel like that could've been... Shreya, you think there's a better way to-

    8. SS

      No, I-

    9. HH

      Okay.

    10. SS

I think it makes a lot of sense, right? Like, so already as a PM, like, this is the secret sauce for your product. If you don't do this product, you can't per- you c- if you don't do this process, you can't kind of put your own taste or your company's taste into your product. Once you get to this point, that's kind of when paths diverge. I think that's what Hamel is trying to say. Maybe sometimes you figure out, okay, there's some errors that are higher priority than others. I'm gonna go and fix those. Maybe you want to run these checks at larger scale. Like, maybe you're Meta or you're Google, and you're like, "I can't make a decision based on 42 traces, and I need a lot of buy-in, so I'm just going to, you know, build a team, do a bunch of evals," or, "I'm gonna write automated evaluators to check this at scale." Yeah, go for it. We're not gonna cover that kind of today. That... Take our course if you're interested in those, um, techniques. But I think overall, I think it's super incredible. Hamel started with zero, right? Now we're at a place where we know what the biggest failure modes in a sample of traces are, right? And most people don't get to this point.
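For the code-based kind of eval Hamel just described, a formatting check needs no LLM at all. A minimal sketch of a markdown-in-SMS assertion; the regex list is a starting point rather than an exhaustive markdown detector, and the trace shape matches the logging sketch earlier:

```python
import re

# Patterns that suggest markdown leaked into a plain-text channel.
MARKDOWN_PATTERNS = [
    r"\*\*[^*\n]+\*\*",          # **bold**
    r"\[[^\]\n]+\]\([^)\n]+\)",  # [link](url)
    r"^#{1,6}\s",                # # headings
]

def has_markdown(text: str) -> bool:
    return any(re.search(p, text, re.MULTILINE) for p in MARKDOWN_PATTERNS)

def eval_no_markdown_in_sms(trace: dict) -> bool:
    """Code-based eval: pass iff no assistant turn in an SMS trace contains markdown."""
    return not any(
        has_markdown(msg["content"])
        for msg in trace["messages"]
        if msg["role"] == "assistant"
    )
```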

    11. HH

      Um, and so, okay, so going further from there, um, let's say you wanna write an eval for the human handoff issue. Like, hey, you should be escalating to a human being. You should be handing off. Um, it can be really useful to write an eval to help you see t- all the traces where that, flag the traces where that m- might be happening and help you iterate on that problem. So let's go into how you would go about building that. So here is the prompt for an LLM judge. Now, this is just a very basic prompt. It could be made better, but I just want to keep it simple. And so, um, you know, you are scoring a leasing assistant, um, to determine if there's a handoff failure. Return only true or false. So that's one thing that we teach, is you want the LLM as a judge

  11. 42:26 – 48:02

    Building Your LLM Judge

    1. HH

      to produce a binary score. Shreya, you wanna talk about why?

    2. SS

Yeah. So all right, I'll try to give a succinct answer for this. The short answer is that people run into a lot of misalignment when they try to use, like, a Likert or a range-based score, and that's because it's very tedious to check that every single possible LLM output aligns with your preference. Now, when you do a binary score, you only have to check that true aligns with your trues and false aligns with your falses, right? It's only two things that you have to check, which makes the process of checking for alignment easier. The other thing is when you're shipping products, right, you make binary decisions. Either this thing was bad or this thing was good. I should fix this, or I should not fix this, right? Even if you have a score of, like, this is 30% failing, that gets turned into a binary decision of how you're going... Are you gonna act on it or not, right? So that's kind of why we tell people, "Focus on a binary decision here." It's easier to align, and ultimately, your business decisions are yes or no decisions.

    3. AG

Yeah. There were some people who were trying to do like one-through-five scales and stuff, but it seems like LLMs are not very good at those types of numbered scales, so it's much better to stay binary.

    4. HH

      Definitely. We need to bring you into our consulting, Aakash, if you can tell people that. Um, so, okay, so we have this prompt, and, you know, we have like a list of things that we ha- you know, these seven things that we have where there is a failure. You know, if there's explicit human handoff, um, that's request. That's, that... Sorry. So we have these seven things. Um, an example of one is the user asks to be sent to a human, but that's ignored, or the, there's a policy that you should be transferred, but that's not handled properly. There's a sensitive issue, um, like billing disputes or legal issues that are not adhered to. Um, same-day walk-in or tour requests, you wanna hand that off to a human, so things like that. And then, um, you know, we have some notes about when there's not a handoff failure. It's important not to get bogged down in the prompt itself. So if you're thinking to yourself, "Can you send me this prompt? Can I copy and paste this prompt?" You're asking the wrong question because you need to, you know... So okay, how do you write this prompt? Um, you want to try to describe the rubric of, okay, what is a failure and what is not a failure. You can get LLM to help you bootstrap that, but you should try to edit it, and the key is iterating, okay? It's not necessarily a recipe, and you wanna try to have examples. I didn't put examples in here, so you wanna have a section of like maybe some examples. Um, it's not nec- necessary to begin with. In a simple case, you may not... You know, you don't need to have examples. I'm just, you know, trying to give you like the most dumbest LLM as a judge, so you can get the concept. The idea is like, you know, you're gonna write a prompt. Um, in our class we do have a recipe of like a, what you can follow, but, you know, zooming out from that, it's important to just iterate, honestly. Um, that's, that's what's gonna get you the furthest. And so, um, you know, this prompt would be structured differently if it wasn't in a spreadsheet also. Like I'm kind of begging it to return true or false. Um, I wouldn't have to beg it if I'm using API, for example. I could, I could do something else.

    5. AG

      Mm.

    6. HH

      So, um, this, this is the prompt, and then what you can do is, okay, I have my trace here. This is a different tab of the spreadsheet. I'm saying, okay, um, you know, this is the trace, and I have AI, um, here, and it says, "Assess this LLM trace according to these rules," and I give it the prompt of my LLM judge. That's all I'm doing.
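Outside of a spreadsheet, the same judge might be a direct API call. A minimal sketch, with the handoff rubric abridged from the one on screen; the model choice, temperature, and true/false parsing are illustrative:

```python
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are scoring a leasing assistant to determine if there is a
handoff failure. A handoff failure includes (rubric abridged): the user asks for
a human and is ignored; policy requires a transfer that doesn't happen; sensitive
issues (billing disputes, legal issues) are not escalated; same-day walk-in or
tour requests are not handed off.

Return only the word true (handoff failure) or false (no handoff failure)."""

def judge_handoff_failure(trace: dict) -> bool:
    """LLM judge with a binary verdict: True means a handoff failure occurred."""
    resp = client.chat.completions.create(
        model="gpt-4o",   # illustrative; pick a capable model for judging
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": json.dumps(trace["messages"])},
        ],
    )
    return resp.choices[0].message.content.strip().lower().startswith("true")
```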

    7. AG

      Nice.

    8. HH

      Now here's the key part.

    9. AG

The AI function built into Google Sheets, doesn't that just run Gemini in the back end?

    10. HH

      Yeah. Yeah, it's okay. I wouldn't say it's amazing. It's some kind of very fast model-ish. Um, you know, it's good to get started to get a mental model of what's going on, but I would m- you know, I would be a little bit careful using this model for everything in real life because I'm not too sure about it. So don't get lost in the sauce of what I'm doing. I'm trying to show the exact... I'm trying to give you a mental model, but you might not actually... You might want to do something, uh... You might wanna use like a more powerful model, uh, potentially for LLM as a judge.

    11. AG

      Mm.

    12. HH

      So, okay, we have two columns here, column G and column H. So the column H is the score outputted by the LLM judge, true or false, and most people just stop here. They're like, "Okay, here's my LLM judge. I gave it a prompt. Woo-hoo, we're done. LLM judge says like it's good."

    13. AG

      [chuckles]

    14. HH

      "So we're good, right?" And then what ends up happening is stakeholders, they start to feel or observe

  12. 48:02 – 56:38

    Measuring the Judge

    1. HH

that there is a dissonance between the out- like your evals and the actual, the, the product's performance, and they can lose trust in the evals. And they start to ask you questions like, "How do you know this metric? Like how, what is this metric?" And you tell them, "Okay, it's an LLM judge." They're like, "Well, how do you trust that?" A lot of people get stumped there. They're like, "Uh, well, that's all we got." You don't wanna do that. What you want to do is you want to measure the judge against your labels. So remember when we were doing the axial coding, you actually have your own human labels, so you know for these various, um, traces, like, okay, if this issue existed or not. And you can... So then you can compute metrics. You can compute how good your LLM as a judge is. Now, in this spreadsheet, I have three metrics: agreement, TPR, and TNR. Now, agreement is like the trap metric. The reason it's the trap metric is that's what you might gravitate towards, you know, in the naive case. You might say, "Okay, like we'll just measure the agreement between the judge and the human." You don't wanna do that. The reason is, is if you're... If this failure is only happening, let's say, 10% of the time, you can have the dumbest judge in the world have 90% accuracy by just always predicting pass. In fact, your LLM judge can just be like equals pass.

    2. SS

      [chuckles]

    3. HH

      You can introduce a bug that doesn't even call an LLM, it'll be 90% accurate. So you don't wanna do that. It can be very misleading. It can mislead you. So what you want to do is measure two things, how good is your judge at catching errors, uh, like, you know, catching errors that exist, and how good is your judge at... Sorry, let me rephrase that. That was get... Let Shreya explain this one, uh, so I can give myself a break.

    4. SS

      Sure, sure. Oh, man. I mean, I think you've, like, said basically most of it. The point is, okay, if you're a product manager and somebody tells you you have high agreement with your judge, or they got high agreement with the judge, be a little bit suspicious. Ask them, "Okay, what's the alignment in the positives or the passes, truths? What's the alignment in the falses?" And make sure both of them are pretty high individually. If not, then you have to rework your LLM judge prompt.
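Computing these metrics from your human labels is a few lines of arithmetic. A minimal sketch of agreement, TPR, and TNR over parallel lists of booleans (True = handoff failure), including the always-pass experiment Hamel suggests:

```python
def judge_metrics(human, judge):
    """Compare judge verdicts against human labels (parallel boolean lists)."""
    tp = sum(h and j for h, j in zip(human, judge))              # failures caught
    fn = sum(h and not j for h, j in zip(human, judge))          # failures missed
    tn = sum((not h) and (not j) for h, j in zip(human, judge))  # correct passes
    fp = sum((not h) and j for h, j in zip(human, judge))        # false alarms
    return {
        "agreement": (tp + tn) / len(human),  # the trap metric
        "TPR": tp / (tp + fn) if (tp + fn) else float("nan"),
        "TNR": tn / (tn + fp) if (tn + fp) else float("nan"),
    }

# The trap in action: with 10% failures, a judge that always says "no failure"
# gets 90% agreement but a TPR of 0 -- it never catches a single real error.
human = [True] * 10 + [False] * 90
always_pass = [False] * 100
print(judge_metrics(human, always_pass))  # agreement 0.9, TPR 0.0, TNR 1.0
```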

    5. HH

      And if you are confused about this, what, why, why isn't... Like, if you're not convinced intellectually somehow that like, "Why can't I just use agreement, Hamel? Why do I need to, like, measure positives and negatives separately?" You should use the spreadsheet. You should get a spreadsheet like this, and you should, like, do some experiments and say, "Oh, okay, like, what if I just, you know, hard-coded this to false all the time?" I think this confusion matrix may be not necessary. Um, it might confuse people, so-

    6. SS

      It's gonna confuse people

    7. HH

      ... no pun, no pun intended

    8. SS

      It always does.

    9. HH

      Okay.

    10. SS

You can't teach the whole course in, in a one-and-a-half-hour thing. I think we just cut our losses. [chuckles]

    11. HH

Yeah, yeah. Okay. So, um, you know, that's kind of... There's a lot of things that we didn't cover here. One key thing is how to split your data set so you're not overfitting your judge and you're not inadvertently cheating. That's a lot-- That's way too much to get into in a one-hour podcast. There's no way we could cover that. But just know that, okay, there's a lot of nuance here in how you do this correctly, how you build the judge, how you get confidence. Um, there's ways to calculate your metrics, um, use this TPR, TNR to, like, calculate what, you know, your real accuracy is. That's, you know, we haven't gone into that. Um, there's a lot of nuance on, like, okay, how do you analyze agents? Like, how do you... You know, if you have lots of steps and lots and lots of handoffs, how do you tame that? And how do you do analysis of that to catch those errors? Um, another thing we teach is how do you do analysis of retrieval? So retrieval is kind of an Achilles' heel of a lot of AI systems. And so a lot of times you have to kind of dive deep and diagnose what's going on with the retrieval step in your RAG. And so there's a host of metrics and a- analysis you might want to do there. So there's a lot of things that we didn't cover here, but the reason that we gave you a taste of error analysis is because error analysis is the step that most people skip in evals, and it's rarely talked about, and it's the thing that's gonna give you extreme leverage, uh, as a PM, and you can get there just with counting. I hope that I've convinced you by using this, the spreadsheet, that it's within your reach. And-

    12. AG

      We're-

    13. HH

      You know, I don't wanna discourage you from using spreadsheets either. Like, feel free to use whatever you're comfortable with. Sorry, go ahead.

    14. AG

      So where do you go from here? Once you have this initial set of metrics, how do you go about improving once you have created your initial evals?

    15. HH

Right. So let's say, like, this handoff error eval that we created. Uh, what you can do is you can... Now you have a judge, an LLM judge that you like. You feel good enough that it's accurate enough. You can use it to score a large sample of all your production traces, and now you can find, you can learn... First of all, you can learn more about what is going wrong in those situations. But secondly, you can iterate on this problem really fast. You can make some changes to your prompt, and you can calculate w- your error rate on these test cases that you curated and kind of iterate really fast and say, "Okay, like, this prompt is working. This prompt is not working." And you will have a suite of these evals, and you can test against all of them to see, okay, like, if you are iterating on this problem, are you inadvertently breaking something else? And you have some kind of system that you can use to sort of be confident in what you're doing rather than just guessing.
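A minimal sketch of that loop: once the judge is trusted, score a sample of production traces before and after a prompt change and compare failure rates. `load_traces` is a hypothetical helper, and `judge_handoff_failure` is from the judge sketch above:

```python
def failure_rate(traces) -> float:
    """Share of traces the trusted judge flags as handoff failures."""
    verdicts = [judge_handoff_failure(t) for t in traces]
    return sum(verdicts) / len(verdicts)

# before = failure_rate(load_traces(sample=500))  # load_traces is hypothetical
# ...edit the system prompt, redeploy...
# after = failure_rate(load_traces(sample=500))   # did the change actually help?
```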

    16. AG

      What does a holistic, like, end-state eval suite look like?

    17. HH

      Shreya, you wanna talk about that? Like, how many evals do you usually have in your-

    18. SS

Yeah, it's, it's different for every application, and it's different for how s- high stakes the application is. Typically, I'll see, like, you know, several code-based evals, especially in CI, maybe one or two LLM-based evals in CI, but not really. I do see some people, like myself included, run LLM-powered evaluations, like kind of like monitoring or online. Like every week or so, I'll sample some of my traces, run my LLM-powered evaluators on them, and then kind of just look at the score, see if anything's off or whatnot. Um, and often I'll see every few weeks that like, oh, there's this new distribution of data or this new cohort of people who are using the tool. Like I build AI-powered data processing tools. Um, and so I'll see like, oh, there's different document types that have come in or a different set of contracts. I do this for law a lot, so like there's a new type of contract or a new type of document that's come up. Um, and then now I need to like think about it, right? So LLM-powered evals, automated evals allow me to really quickly iterate on those. You don't need 100 of them, like just a few is fine.

    19. AG

      And what's the role of PM and AI engineer and AI researcher in all of this? How are you working together? Where are the handoffs happening? Like that quick iteration on the system prompt, who's doing that?

    20. HH

      It's a really good question. You know, it depends on the size of the team and the company and the roles. Sometimes these roles are being collapsed into one, um, in terms of... Okay, so Jacob Carter, the engine-- the CEO and, you know, also engineer of this product, is the

  13. 56:38 – 1:01:29

    PM vs AI Engineer Roles

    1. HH

      product manager and the AI engineer all in one. So he has a pretty good pulse on whether a given interaction is good or not. That's not feasible all the time. In other situations, you want the domain expert driving, especially the error analysis process, as much as possible. It might take some training in the beginning to surface the right tools for the PM, or to get the PM able to access the data, and it might need some engineering support, pairing on error analysis, just to feel comfortable. But you should try to have one person do the error analysis so it doesn't become onerous, and usually a product manager is good at that because they have the domain expertise to judge whether something is going wrong. So I would bias toward having the domain expert or the product manager do this error analysis. As far as writing the prompt is concerned, you want to make it accessible for the product manager to write the prompt. What I've seen in a lot of tools is an admin view where a non-technical person can edit the prompt. I actually have a screenshot of that here. One thing we talk about a lot in our course is that you want to create your own tools to look at your traces. Nurture Boss actually vibe coded their own tool to look at traces, to remove all the friction of looking at traces, because it's so important. It's pretty simple: you see all the different channels, voice, email, text, chatbot, and they hide the system prompt by default. It's a very quick-and-dirty interface for doing this open coding and axial coding, and they even have a step that helps them automate the axial coding. You see categories like transfer handoff issues, tour scheduling, and so on. That's worth noting: it shows how important error analysis is. To get back to your question about how you might surface the prompt to non-technical people: this is an example where you might have an admin view. This is a different real estate agent, showing you real estate listings, and you might have this admin mode where you allow someone to fiddle with the prompt. This prompt experimentation is really key, so having a way for people to interact with prompts is really helpful. Now, a lot of tools have prompt playgrounds. The limiting thing about most prompt playgrounds is that they don't have access to your code. You might have various tool calls, you might have RAG, and all your application code is not in those playgrounds. That's why a lot of teams I see build interfaces where you can edit the prompt directly in your tool, play with it, and rerun it. So-

    2. SS

      This is so nice. [chuckles]

    3. HH

      Yeah. So whenever possible, you want to expose the prompt to the domain expert, because it's English. It's made for the domain expert. It's almost a tragedy to separate the prompt from the product manager, because it's English.
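Hamel's advice to vibe code your own trace viewer is easy to start on. Below is a minimal, terminal-only sketch: it reads traces from a JSONL file, hides the system prompt by default, and captures an open-coding note per trace. The file schema is an assumption, not Nurture Boss's actual format.

```python
import json

def review_traces(path: str, show_system: bool = False) -> None:
    """Step through traces one by one, hiding the system prompt by default,
    and attach a free-form open-coding note to each trace reviewed."""
    with open(path) as f:
        traces = [json.loads(line) for line in f]
    for i, trace in enumerate(traces, 1):
        print(f"\n--- trace {i}/{len(traces)} [{trace.get('channel', '?')}] ---")
        for msg in trace["messages"]:
            if msg["role"] == "system" and not show_system:
                continue  # hide boilerplate so the reviewer sees the exchange
            print(f"{msg['role']}: {msg['content']}")
        note = input("open-coding note (enter to skip): ")
        if note:
            trace["note"] = note
    # write the annotated traces back out for later axial coding
    with open(path + ".coded", "w") as f:
        for trace in traces:
            f.write(json.dumps(trace) + "\n")

# review_traces("traces.jsonl")  # assumes one JSON object per line
```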

    4. AG

      What are other mistakes, like separating the prompt from the product manager, that people might be making in the process we walked through today that are unintentionally inhibiting them?

    5. HH

      The main thing inhibiting people is not doing the error analysis. People want to jump straight to, "Hey, let me take an off-the-shelf metric the vendor gives me and just create a score." People are very scared of this error analysis. They look at it and think, "Oh, I don't have time for this." But it doesn't take that long at all, and it's kind of this thing.

  14. 1:01:29 – 1:06:31

    Common Mistakes to Avoid

    1. HH

      It's like a secret club: once you do it just once, you'll keep doing it forever. But getting over the hump of doing it the first time is extremely scary for people.

    2. SS

      Another common mistake: people will see this video, or they'll otherwise realize, "Okay, it's worth doing error analysis," but then they think it's worth some other human doing the error analysis, not them, so they'll just outsource it, which again is a huge pitfall. Error analysis is where you build your product, right? That's where you build your moat. If you give it to someone else, you have no personal touch in your product.

    3. HH

      Yeah. Do not outsource this to developers. If you're working on a coding app, sure, the domain expert is the developer. But in most cases, the domain expert is not the developer, and a lot of companies think, "Oh, this AI stuff is for engineering. The whole thing is engineering. Let me just shove it over there; they need to figure out whether it's good or not." That's usually the wrong approach. And-

    4. SS

      It's not-

    5. HH

      Yeah

    6. SS

      ... an engineering skill set. I think that's another interesting thing about today's day and age for PMs, especially AI PMs: you can't expect engineers to be able to do all of these things. The people who have been successful at this process either are very, very technical PMs or are engineers who are actually PMs; they just think of themselves as engineers and don't realize they're doing product work. So I hope people are convinced that in this day and age, you have to bring your product knowledge, put your product hat on.

    7. HH

      This error analysis is so powerful that... There's a video, which we can put in the show notes, of Jacob Carter, who recorded a two-minute-long reaction about how thrilled he was with error analysis. He thought it was the best thing that had ever happened. He got so much value out of it that he stopped there to begin with; he had found so many things to fix, as you can see right here in this picture, that he didn't need to build evals right away. He did eventually build evals, but starting there gives you a really good grounding, lets you work through issues, and gets you to evals for things that make sense.

    8. AG

      Going back to our beginning: there's so much hype about what your AI feature supposedly does, but to actually deliver on that hype, you have to go through these errors [laughs] so that when people experience it in production, they get the experience you intended. This has been our master class in how to do that, step by step. If people want to learn more, where can they find you?

    9. HH

      The URL for the course is evals.info; you can find the course there.

    10. SS

      Yeah, or follow us on X. Our websites are on the internet; if you just look us up, we're there. But check out evals.info. We've really tried to put together as much information as we can, freely accessible and available to folks. So take a look. Dive in, and I'm sure you'll learn things along the way.

    11. AG

      Awesome. Thank you guys so much for being here.

    12. HH

      One thing I want to clarify that-

    13. AG

      Yeah

    14. HH

      ... sorry, is: we mentioned that you need to look at traces in production. So you might be wondering, what if your application is not in production? What if you don't have data or traces? Where do you get them? First, try to recruit some friends, and try to dogfood your own app; that's the best thing. If for whatever reason you can't recruit friends or dogfood your product, which would be kind of sad, but there can be valid reasons, you can generate synthetic inputs into your system. There's a way to do that correctly: essentially, you pretend to be a user, and you have an LLM simulate that at scale. That's one of the things we go into in our course as well. So there are ways to bootstrap yourself, but you do need to look at data.
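For the synthetic-input route, one common approach (not necessarily the exact method taught in the course) is to have an LLM role-play the user across a few controlled dimensions, so the generated inputs don't collapse into near-duplicates. The personas, intents, and the `call_llm` helper below are all illustrative.

```python
import itertools
import random

def call_llm(prompt: str) -> str:
    return f"[synthetic message for: {prompt[:48]}...]"  # placeholder LLM call

# Illustrative dimensions for a leasing-assistant app like Nurture Boss.
PERSONAS = ["first-time renter", "relocating family", "student on a budget"]
INTENTS = ["schedule a tour", "ask about the pet policy", "dispute a fee"]
TONES = ["polite", "frustrated", "terse"]

def synthetic_inputs(n: int = 9) -> list[str]:
    """Sample persona/intent/tone combinations and have the LLM write one
    opening message per combination, playing the user."""
    combos = list(itertools.product(PERSONAS, INTENTS, TONES))
    picks = random.sample(combos, min(n, len(combos)))
    return [
        call_llm(
            f"You are a {persona} texting a leasing assistant. Write one "
            f"realistic opening message to {intent}, in a {tone} tone."
        )
        for persona, intent, tone in picks
    ]

for msg in synthetic_inputs(3):
    print(msg)
```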

    15. AG

      Amazing. Thank you guys so much for being here.

    16. SS

      Cool. Thanks for having us.

    17. HH

      Thank you.

    18. AG

      I hope you enjoyed that episode. If you could take a moment to double-check that you have followed on Apple Podcasts and Spotify, subscribed on YouTube, left a rating or review on Apple or Spotify, and commented on YouTube, all of these things help the algorithm distribute the show to more people. As we distribute the show to more people, we can grow the show and improve the quality of the content and production to bring you better insights to stay ahead in your career. Finally, do check

  15. 1:06:31 – 1:06:51

    Outro

    1. AG

      out my bundle at bundle.aakashg.com to get access to nine AI products for an entire year for free. This includes Dovetail, Maven, Linear, Reforge Build, Descript, and many other amazing tools that will help you as an AI product manager or builder succeed. I'll see you in the next episode.

Episode duration: 1:07:00
