How I AI: Evals, error analysis, and better prompts. A systematic approach to improving your AI products
50 min read · 10,002 words
- 0:00 – 3:05
Introduction to Hamel Husain
- Claire Vo
What are the fundamental concepts folks need to know about getting to higher-quality products?
- Hamel Husain
The most important thing is looking at data. Looking at data has always been a thing, even before AI. There's just a little bit of a twist on it for AI, but really the same thing applies.
- Claire Vo
When you see a real user input like this, when you actually look at what users are prompting your AI with, you realize it's very vague.
- Hamel Husain
Absolutely. That's the whole interesting bit: once you see that people are talking like that, you might actually want to simulate stuff that looks like that, 'cause if that's the real distribution of the data, that's what the real world looks like.
- Claire Vo
I'm sure our listeners expect some, like, magical system that does this automatically, and you're like, "No, man, just spend three hours of your afternoon, go through, read some of these chats, look at some of them with your human eyes, put one-sentence notes on all of them, and then run a quick categorization exercise and get to work." And you see this have actual real impact on quality and reducing these errors?
- Hamel Husain
Yeah, it has an immense impact on quality. It's so powerful that some of my clients are so happy with just this process, that they're like, "That's great, Hamel. We're done." And I'm like: "No, wait, we can do more." [upbeat music]
- Claire Vo
Welcome back to How I AI. I'm Claire Vo, product leader and AI obsessive, here on a mission to help you build better with these new tools. Today, I have such an educational episode for people like me that are building AI products. We have Hamel Husain, who is gonna demystify debugging errors in your AI product, writing good evals, and show us how he runs his entire business using Claude and a GitHub repo. Let's get to it. This episode is brought to you by GoFundMe Giving Funds, the zero fee DAF. I wanna tell you about a new product GoFundMe has launched called Giving Funds, a smarter, easier way to give, especially during tax season, which is basically here. GoFundMe Giving Funds is the DAF, or donor-advised fund, from the world's number-one giving platform, trusted by 200 million people. It's basically your own mini foundation without the lawyers or admin costs. You contribute money or appreciated assets, get the tax deduction right away, potentially reduce capital gains, and then decide later where to donate from 1.4 million nonprofits. There are zero admin or asset fees, and while the money sits there, you can invest and grow it tax-free, so you have more to give later, all from one simple hub with one clean tax receipt. Lock in your deduction now and decide where to give later. Perfect for tax season. Join the GoFundMe community of 200 million and start saving money on your tax bill, all while helping the causes you care about the most. Start your giving fund today in just minutes at gofundme.com/howiai. We'll even cover the DAF pay fees if you transfer your existing DAF over. That's gofundme.com/howiai
- 3:05 – 6:58
The fundamentals: why data analysis is critical for AI products
- Claire Vo
to start your giving fund. Hamel, I'm really excited for this particular episode, because I have been building products for a very long time, and this has been one of a few times in my career where the how and what of products that I'm building are so different than what I've built in the past. They're technically different. They're different from a user experience perspective, and then they have, they have these non-deterministic models on the back end that I'm somehow, as a product leader, responsible [chuckles] for making output high quality, consistent, reliable, interesting user experiences, and it's such a challenging problem. And what I love about what you're gonna show us today is how to approach that systematically, that quality of product building in an AI world systematically, and how you use different techniques to get AI products, which are new to all of us, from good to great.
- Hamel Husain
Yeah, happy to be here. Excited to talk about it.
- Claire Vo
So, you know, this is such a new thing for product managers. I'm curious if you could start with the fundamentals. What are the fundamental concepts or things that you think folks building AI products really need to know about the process of getting to higher quality products? And then I know you're gonna show us a couple examples of how to do that.
- Hamel Husain
So the fundamentals really come down to... The most important thing is looking at data, and I believe, from working with many product managers in the past, that looking at data has always been a thing, like, even before AI. Uh, you know, I'm pretty sure that product managers that can, like, write a little bit of SQL, are okay with spreadsheets, looking at numbers, looking at metrics, you know, that feels like it's kind of table stakes for being a good product manager nowadays. And so there's just a little bit of a twist on it for AI, but really the same thing applies. Um, and it's just like, okay, how do you do that for AI? And that's what we teach, and that's what I'm gonna show you today.
- Claire Vo
Great, and I cannot agree more. I think one of the most transformational skills I learned as a young, um, baby chicken product manager was being able to write SQL and actually do my own data analysis and exploration. But I think the surface area is so broad now with AI, and the data is different. So why don't you show us what we should be looking at when we're building these AI products?
- Hamel Husain
Yeah. So, um, let me share my screen a bit. Let me give you some background first. So this is one of my clients. The company is called Nurture Boss, and as you can see, it's an AI assistant for apartment managers or, uh, property managers. And really, you can kind of get an idea from their website, which I'm showing right now. It's a virtual leasing assistant, so, you know, they help with the whole top of funnel: helping prospective residents, like, find their apartments, setting up appointments, questions about rent, so on and so forth. Kind of like trying to reduce the toil of property managers, still having humans in the loop. And so when they came to me, they had already prototyped something out, you know, kind of vibe checking it, just like everyone does, and put everything together, but they wanted to know, like, "Okay, how do we actually make it work well?" Because the AI fails in weird ways, and it doesn't always do the right thing. But it feels like, okay, every time we fix a prompt, we're not really sure, like, maybe we're breaking something else, or is it really improving things as a whole? We don't really know. We're just guessing. We're just kinda looking at it and getting, uh, vibes, and that is a very uncomfortable feeling when trying to scale a product.
- 6:58 – 13:35
Understanding traces and examining real user interactions
- Hamel Husain
Okay, so the first thing that I'll jump right into is this idea of traces. So traces are a concept from engineering, but it doesn't have to be scary. Basically, it's very, uh, topical for AI because with AI, you usually have many different events. Especially, like, for a chatbot, you have multi-turn conversations where you're going back and forth with an AI. There might be retrieval of information. It might be calling some tools, external tools, internal tools, so on and so forth, and so you wanna log these traces. And there's many different ways to go about it, but just to show you exactly what happened at Nurture Boss, let's go into what that looks like. So this is a platform called Braintrust. There's a lot of them. Uh, this is one called Phoenix, with, like, the same exact data in here. Um, it doesn't really matter. You can see, like, they're both the same, right? So what we have here... Let me just go into a single trace. So this is what I would call a trace. I can make this bigger, so you can see it in a full screen, and you can see what an AI interaction looks like in this product. So you have, okay, the system prompt, "You are an AI assistant working as a leasing team member at some apartment." These are all fictitious 'cause these have all been scrubbed for, uh, PII stuff. You know, "Your primary role is to respond to text messages," so this is receiving text messages. Okay, and you have a whole host of rules, like, you know, provide accurate information, answer any question for residents, do the following, provide this website. For example, if you get asked for a rental application, provide this, so on and so forth. All these rules, right? And this is a real, uh, user saying: "Hello, there's what's up to four-month rent." I don't even know what that means. This is real.
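To make the idea concrete, here is a minimal sketch of what one logged trace from a system like this might contain, assuming a simple append-only JSONL log. The field names and the logging format are illustrative assumptions, not Braintrust's or Phoenix's actual schema.

```python
import json
from datetime import datetime, timezone

# A trace is just the ordered list of events in one AI interaction:
# system prompt, user message, tool calls, tool results, and the reply.
trace = {
    "trace_id": "trace_0001",  # hypothetical ID scheme
    "channel": "sms",          # text message, email, web chat, ...
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "events": [
        {"role": "system", "content": "You are an AI assistant working as a leasing team member..."},
        {"role": "user", "content": "Hello there's what's up to four-month rent"},
        {"role": "tool_call", "name": "get_communities_information", "args": {"community_id": "123"}},
        {"role": "tool_result", "name": "get_communities_information",
         "content": "Here's the information you can use about the community..."},
        {"role": "assistant", "content": "Hello, we are currently offering up to eight weeks rent-free..."},
    ],
}

# Append each interaction to a JSONL file so traces can be sampled,
# read, and annotated later.
with open("traces.jsonl", "a") as f:
    f.write(json.dumps(trace) + "\n")
```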
- Claire Vo
I- I, I got you. I got you. Let me read it. "Hello. Hello there. What's up? Two four-month rent." [laughing] I thought I had it. I thought I had it.
- Hamel Husain
Yeah. It's unclear, but okay. I mean, like, it's fine. This is real. This is the real world. These are real traces. So, um, you know, and then there's a tool call here, get communities information. It's calling this internal tool, and, um, the tool call result comes back with this information, and this is all hidden from the user. The user is not seeing this tool call result. It's like: "Okay, here's the information you can use about the community," blah, blah, blah. It's not even clear, like, this is the right tool call. We'll get to that in a moment. And then the assistant goes: "Hello, we are currently offering up to..." So this is, like, back to the user. This is what the AI responds to the user with. "Hello, we are currently offering up to eight weeks rent-free as a special promotion. Please note, the applicable lease specials and concessions can vary," blah, blah, blah. Okay, so, like, is this right? And I have a cheat sheet for myself about what is actually right and wrong. Um, okay, so the comment here is, "The user is probably asking about lease terms and stuff like that, not about specials." So it's not really clear this is the right... This is not, like, what we want, and this is so realistic, right? Like, everyone has experienced AI like this. It's kinda being helpful, but it's not really doing what you want it to, and it's actually pretty challenging 'cause it's not really clear what the user wanted. So you could go in a lot of different directions with this.
- Claire Vo
You know, this is such an eye-opening example, 'cause when I'm testing my own AI, I ask it good questions, and I spell correctly, and I'm very clear. But when you see a real user input like this, when you actually look at what users are prompting your AI with, you realize it's very vague. They say stuff like, "What's up?" There's no clear question, and so I really do think looking at real user data kinda can get a developer or PM out of their own mind on how they think users are gonna interact with the system.
- Hamel Husain
Absolutely. It's very critical, um, that you do this. And so now, you might not have this data, and I just jumped right into a real example just to set things off, and we can go into all these different rabbit holes of, like, what if you don't have data and stuff? But I just wanna ground it, to set the stage: one foundation is, like, you have to have data.
- Claire Vo
Mm-hmm.
- Hamel Husain
There's different ways to get it. One is you can log it from your real system, and you have these things to look at. Another way is, like, okay, you can have synthetic data, where you sort of generate with an LLM, you can generate questions like this, you know, "Hello, what- what's..." You know, it might be hard to generate stuff that looks like that 'cause we don't even know what it means, um, and probably an LLM won't generate stuff like that, but that's the whole interesting bit, is like, once you see that people are talking like that, you might actually want to simulate stuff that looks like that, 'cause if that's the real distribution of the data, or that's what the real world looks like, um, you might want to challenge your LLM or your AI system appropriately. Okay, so let's step back here. So you have this system, and there's stuff like this happening. We can look at another trace if you want-
- Claire Vo
Mm-hmm.
- Hamel Husain
... uh, just to kind of get an idea. And this is, you know, this is not pre-scripted. I didn't memorize what's going on in these, uh, traces. We're just looking at them naturally. So this is something... This is another, uh, apartment complex, Meadowbrook Apartments. Same idea, so we won't read the whole system prompt again.
- Claire Vo
Mm.
- Hamel Husain
Okay, so we'll scroll down here. Let's get to what the user is asking.
- Claire Vo
[chuckles]
- Hamel Husain
"Walk in TOR" So this must be another text message situation, and the assistant says, "Our team tries their best to accommodate walk-ins. Me get you..." Now, that's hilarious. Like, I don't-
- Claire Vo
[chuckles]
- Hamel Husain
... why, why is the LLM... That's surprising. Like, why is it saying, "Me get you to someone who can help?" Maybe it's trying to mimic the, uh, [chuckles] the, uh-
- Claire Vo
The tone.
- Hamel Husain
... the user somehow. Um, and then it does, like, uh, "Yes," and then, okay, great. So it seems like this one maybe is okay. Um, let's see what we ended up annotating. Uh, yeah, we said this one is okay. There's, there's some metadata down here about our labels, which we'll talk about next. But yeah, you can... So you can see, like, this is a real system. There's many different things that can happen here.
- 13:35 – 17:40
Error analysis: a systematic approach to finding AI failures
- Hamel Husain
So the question becomes like, okay, so we talked about just, like, writing SQL and data, but, like, how do you take that same mindset to this? Like, what do you even do with this, right? You have these, like, crazy interactions. Like, how do you analyze this without getting stuck? 'Cause this seems, um, intractable, right, at first pass.
- Claire Vo
No, I was just thinking, I was like, "What is the SQL query I write to get, like, the first prompt?" And, uh, like, how do you query for, "Give me all the first prompts that include typos?" Like, "Give me all the first prompts that are ambiguous questions." It just feels almost insurmountable. And then, you know, you showed us two examples, and it's two of probably thousands and thousands and thousands, so going through it manually is probably not super scalable. So I'm curious, what is the systematic kind of solution here?
- Hamel Husain
Okay. So the systematic solution is something called error analysis. So error analysis just means... It's kind of a counterintuitive process that's extremely effective, and it's dumb, but it's accessible to everybody, and it works. And it's not something that I made up. Um, it's been around in machine learning for a really long time. 'Cause actually, machine learning has the same problem. Like, before generative AI, we had these stochastic systems that can do, like, a whole number of things, and, like, how do you actually analyze that and figure out what's going wrong and improve it? So, error analysis has two steps. The first step is writing notes, and it's called open coding, and it's basically like journaling what is wrong. So if we go back to that other, uh, trace that we saw-
- Claire Vo
Mm-hmm.
- Hamel Husain
... so let me just, uh, go back to it, like the first one. We would step into this trace, and, you know, every observability tool has its own, let's say, uh, different ways to take notes. You know, we already have a note in here.
- Claire Vo
Mm.
- Hamel Husain
"Assistant should have asked follow-up questions about," you know, about the question-
- Claire Vo
Mm.
- Hamel Husain
... "What's up with four-month rent?" Because it's unclear user intent.
- Claire Vo
Yeah.
- Hamel Husain
It is just writing notes about what is going on.
- Claire Vo
Yep.
- Hamel Husain
Okay? And you do that for, like, 100 traces. Randomly sample 100 traces-
- Claire Vo
Yeah.
- Hamel Husain
... and you do that. And you stop at the most upstream error you find. So you read this, and you see what's going on, and you're like, "Hmm, okay, the user intent, it seems like we didn't do a good job of clarifying what the hell they need."
- Claire Vo
Yeah.
- Hamel Husain
And so I think that's the most upstream problem in this sequence of events, so I'm gonna go ahead and just write that as a note.
- Claire Vo
Yeah, and you say, "Focus on the most upstream problem," because you presume that if you can get early intent, early kind of clarity, correctness right, the rest of the system is more likely to be correct downstream.
- Hamel Husain
Yeah, because it's causal in nature. So as we have the sequence of events, whether it's, like, user prompts, tool calls, retrieval for RAG, whatever it may be, any error at any point along the chain, you know, will cause downstream problems.
- Claire Vo
Yep. Yeah.
- Hamel Husain
And so to simplify our lives for the purposes of error analysis, it's a heuristic. You know, eventually, you do wanna care about the different errors and different downstream-
- Claire Vo
Yeah.
- Hamel Husain
... but when you're starting out, just focus on the upstream error because we're trying to make it tractable-
- Claire Vo
Yeah.
- Hamel Husain
... and this is, like, the way that you're gonna get results fast. So basically, um, what you do is you go through and you collect a bunch of notes, and then what you do is you can take these notes, and you can, like, download them or whatever, and you can categorize those notes. And you can even put these notes into, like, ChatGPT. It's like, "Hey, here's all my notes. Like, can you bucket these into categories?" And you kind of have to go back and forth with it a little bit. Like, "Hey, these are my notes. These are the categories. Um, oh, I think, like, you're missing a category," whatever.
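A minimal sketch of that two-step loop, assuming traces are stored in a JSONL file like the one above. `ask_llm` is a placeholder for whatever chat-completion call you use, and the bucketing prompt is illustrative, not Hamel's actual prompt.

```python
import json
import random

def ask_llm(prompt: str) -> str:
    """Placeholder for your chat-completion call (OpenAI, Anthropic, etc.)."""
    raise NotImplementedError

# Step 1, open coding: randomly sample ~100 traces and write a free-form
# note on the most upstream error you see (or "ok" if there is none).
traces = [json.loads(line) for line in open("traces.jsonl")]
sample = random.sample(traces, k=min(100, len(traces)))

notes = []
for trace in sample:
    print(json.dumps(trace, indent=2))
    note = input("Most upstream error (or 'ok'): ")
    notes.append({"trace_id": trace["trace_id"], "note": note})

# Step 2: have an LLM bucket the notes into failure categories.
# Expect to go back and forth with it to refine the category list.
bucketing_prompt = (
    "Here are notes from reviewing AI assistant conversations:\n"
    + "\n".join(f"- {n['note']}" for n in notes)
    + "\nBucket these notes into a short list of failure categories, "
      "and tag each note with exactly one category."
)
print(ask_llm(bucketing_prompt))
```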
- 17:40 – 22:23
Creating custom annotation systems for faster review
- Hamel Husain
Now, with Nurture Boss, what we ended up doing is, we actually made... One of the things that we highly recommend a lot of people think about is to make your own custom annotation tool. Like, you see this here in Braintrust, and it's also here in Arize Phoenix. They're very similar. You can see this is a very similar-looking UI, and they even called it error analysis here. And you can, like, add your notes, you know, whatever, and you can save those notes, and same thing. If you're gonna be looking at a lot of data, you don't wanna slow yourself down, and you wanna be able to have, like, very human-readable sort of, you know, output. And sometimes, like, this markdown stuff is not that readable, and you wanna make sure that, okay, it makes sense to you, and you can fly through it as fast as possible. So, um, you know, it's really easy to vibe code this stuff, because ultimately, what you're doing is, like, showing data. So in the Nurture Boss situation, as you might have gathered, they have multiple channels that customers can contact them on. They have, like, text message, which we saw. They have email. They have a chatbot on the website, so on and so forth. So they just wanted something they could navigate faster. So we just, like, vibe coded it, essentially... I mean, we were developers, but, you know, we're using AI in our process to do this very fast. It's, okay, what channel is the trace from? And then, like, some other filters, like, "Hey, did we already annotate this or not?" And then just kind of have some statistics at the top. You know, this is what the annotation looks like. It's very similar but just, like, dialed into what we wanted, and it was like... You know, we just took notes. And then for Nurture Boss, what we did is, okay, we had an automated process that would summarize, categorize those notes into, like, what are the biggest issues, and then we would just... Something very simple like counting. Counting is always powerful. As you know, as a product manager, with your experience, like, writing SQL queries, you know how powerful counting is. Counting remains powerful. And so you can count these issues, right? So, like, okay, for Nurture Boss... I don't know if you can see my screen or if it's too small. I can try and zoom in more.
- Claire Vo
Yeah, yeah, that's great.
- Hamel Husain
Okay, what are the biggest issues after doing that error analysis exercise, which only took, you know, a few hours?
- Claire Vo
Yeah.
- Hamel Husain
It's like, okay, um, we're having a lot of transfer and handoff issues. We're trying to transfer the customer to a human. We're having a lot of tour scheduling issues. So, like, they're trying to schedule a tour, or reschedule tours... In this case, we found that someone's asking to reschedule, but there is no reschedule capability. But the AI doesn't know that. It just keeps scheduling more tours, which is bad. Um, you know, follow-up, so, you know, the AI not following up when the user has a question, and, you know, sometimes incorrect information provided. Okay, so you see, these are kind of the counts, and now we have... Now we're not lost. Now we know what we should be working on. We know, okay, you know what? We should fix this transfer and handoff issue and this tour scheduling issue. We have confidence. Like, you know what? We're not paralyzed anymore. We know, okay, this is what we need to fix in our AI.
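Counting really is the whole trick here. A sketch of that last step, with made-up categories standing in for Nurture Boss's real buckets:

```python
from collections import Counter

# Hypothetical output of the categorization step: one category per note.
categorized = [
    {"trace_id": "trace_0001", "category": "transfer_handoff"},
    {"trace_id": "trace_0002", "category": "tour_scheduling"},
    {"trace_id": "trace_0003", "category": "transfer_handoff"},
    {"trace_id": "trace_0004", "category": "no_follow_up"},
]

# Count failures per category; the biggest buckets are what you fix first.
counts = Counter(note["category"] for note in categorized)
for category, count in counts.most_common():
    print(f"{category}: {count}")
```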
- Claire Vo
This episode is brought to you by Persona, the B2B identity platform, helping product, fraud, and trust and safety teams protect what they're building in an AI-first world. In 2024, bot traffic officially surpassed human activity online, and with AI agents projected to drive nearly ninety percent of all traffic by the end of the decade, it's clear that most of the internet won't be human for much longer. That's why trust and safety matters more than ever. Whether you're building a next-gen AI product or launching a new digital platform, Persona helps ensure it's real humans, not bots or bad actors, accessing your tools. With Persona's building blocks, you can verify users, fight fraud, and meet compliance requirements, all through identity flows tailored to your product and risk needs. You may have already seen Persona in action if you've verified your LinkedIn profile or signed up for an Etsy account. It powers identity for the internet's most trusted platforms, and now it can power yours, too. Visit withpersona.com/howIAI
- 22:23 – 25:15
The impact of this process
- Claire Vo
to learn more. I love this. Just to recap, so you're taking these traces of these real conversations, and, you know, you don't even have to read all of it. You read till you hit a snag, right? Till you hit an obvious, sort of, incorrect or high-friction part of the experience. You have vibe coded an app that makes it really easy for the team generally to go in, annotate these, rate them, sort of like good quality, bad quality, automatically categorize them, count them, and then you have a prioritized list, and you're like, "Here are the problems that I need to go solve." And what I love about this is, you know, I'm sure our listeners expect some, like, magical system that does this automatically, and you're like, "No, man, just spend three hours of your afternoon, go through, read some of these chats, look at some of them with your human eyes, put one-sentence notes on all of them, and then run a quick categorization exercise and get to work." And you see this have actual real impact on quality and reducing these errors?
- Hamel Husain
Yeah, it has an immense impact on quality. It's so powerful that some of my clients are so happy with just this process, that they're like, "That's great, Hamel. We're done." And I'm like: No, wait.
- Claire Vo
[chuckles]
- Hamel Husain
Like, we can do more. Um, you know, you've paid for more, like, you know, whatever. They're like, "No, this is so great. Like, I just feel like I know what to do." [chuckles] And so they find so much value in this process. And it is, like, very important. This is something that no one talks about. Like, when people talk about evals, it's like, "Well, how do you write an eval? What eval do you do? What tools should you use?" Before you get into all that stuff, you need to have some grounding in what eval you should even write, because there's infinite evals. So, like, in this case, we wrote an eval about tour scheduling issues, and we wrote an eval about transfer and handoff issues, and we felt really good about that because we knew that, like, that is a real problem, and we knew how to write the eval because, like, we saw the error... and we knew how to find data to test that eval, because again, we already tagged it, and we saw that error, which is exactly the way you wanna do it.
- Claire Vo
Yeah, and what I also like about this is it does take the burden off your users. I mean, so many people try to collect this data by, like, putting a little thumbs up and thumbs down or little comments. Like, I even have that on parts of my product, and yes, it is useful, but it only gives you a sliver of the kind of self-identified errors in the app. And users are highly tolerant of systems, and so sometimes those errors just don't get escalated by a user. They'll either abandon, or they'll just work through too many steps to get to the outcome that they want. They'll have a low-quality experience. And so I think just taking the burden on yourself and saying, "You're responsible for looking at the data," you can create simple ways to categorize it,
- 25:15 – 29:30
Different types of evaluations
- Claire Vo
um, and then you have a prioritized list. Now, if your client is willing to go the next step and do something about this, um, and write evals and fix prompts, what are your kinda next steps here? What's another example of where-
- Hamel Husain
Yeah.
- Claire Vo
... we go from here?
- Hamel Husain
Um, I just wanna talk for a minute about how this particular technique is so powerful, and not that many people know about it. Um, you know, so I actually recently did a training with OpenAI, showing the people at OpenAI, like, you know, how this works for domain-specific evals. Um, if you wanna learn more about this, we had Jacob, the founder of Nurture Boss, walk through this whole process in, like, two minutes.
- Claire Vo
Yeah.
- Hamel Husain
So you can find it on this page if you would like. Um, okay, so to get to your question, like, what do you do now? Okay, so you've done your error analysis. You have prioritized these things, so, like, now what do you do? So now you get into writing the evals. So now you have to decide, like, what kind of evals do you want? There's different kinds of evals. So there's reference-based evals, which is like, you know what the right answer is, and maybe you can write some code. You don't need, like, an LLM to do the eval for you. Or if it's more subjective in nature, like maybe this transfer handoff issue is more subjective in nature, um, then you need an LLM judge. And so, um, what you can do is you can start to write those evals. And so, um, I have this blog post here about evals in general. So there's this diagram... It's really hard to put this whole thing into a diagram, honestly, 'cause, you know, it's kind of a nonlinear process. Um, but really, what you wanna do is, okay, we already covered, like, logging traces, and there's different kinds of evaluators or evaluations. There's the kind of, like, unit tests, which I would say are, like, code-based evals, and then there's, like, models, so, like, LLMs. You know, code-based evals, so, like, for example, what kinds of things would be good for a code-based eval? Like, okay, if you have, like, user IDs showing up in the response or something like that, okay, you can test for that in code. Um, for-
- Claire Vo
I, I have to say, you're saving my life here because I was thinking, "What is one of these unit tests I need to write?" And that is exactly one of them, which is, "My tool calls need UUIDs, and users definitely do not." [chuckles] So, uh, that's a, that's a great example of one for anybody that's writing a chatbot that does a lot of kind of tool calling.
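A sketch of that kind of code-based eval: a plain unit test over logged responses, using a standard UUID regex. The function name is illustrative.

```python
import re

# Matches standard UUIDs like 123e4567-e89b-12d3-a456-426614174000.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)

def assert_no_uuid_leak(response: str) -> None:
    """Code-based eval: internal IDs must never appear in user-facing text."""
    leaked = UUID_RE.findall(response)
    assert not leaked, f"UUIDs leaked into response: {leaked}"

# Run it over every logged assistant response, exactly like a unit test.
assert_no_uuid_leak("Your tour is confirmed for 2 pm tomorrow.")  # passes
```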
- Hamel Husain
Yeah, 'cause they can show up by accident. Like, you might have the UUID in the system prompt, and it inadvertently shows up in the output for some reason or other, and you don't want that. Okay, so you wanna write these tests. No matter what kinds of tests you write, you wanna create test cases, and sometimes you can gather those from your traces. Sometimes you might wanna generate synthetic data. And so, um, you know, this is like a prompt for a different real estate agent assistant called Rechat, which is, uh, for residential real estate. Um, and this is kind of like a simplified version of the prompt: write 50 different instructions that a real estate agent can give to their assistant. It creates contacts on their CRM. Contact details can include name, phone, email, whatever, and basically, you know, it can generate synthetic inputs to a system that you can then log traces from. I'm gonna jump around a little bit, so we'll kinda come back to that. Um, okay, we already covered logging traces. You know, this is another, like, custom log annotation thing yet again, 'cause we, you know, really emphasize this, that it's really important to remove all friction when doing this, so I won't linger on this too much. And basically, um, you know, one kind of thing you wanna do, if you're using LLM as a judge or anything else, is...
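A sketch of that synthetic-input idea. The prompt below paraphrases the simplified Rechat example from the episode; `ask_llm` again stands in for your chat-completion call, and `run_pipeline_and_log` is a hypothetical helper for your own system.

```python
def ask_llm(prompt: str) -> str:
    """Placeholder for your chat-completion call."""
    raise NotImplementedError

# Paraphrase of the simplified Rechat-style generation prompt.
synthetic_prompt = (
    "Write 50 different instructions that a real estate agent could give "
    "their AI assistant to create contacts in their CRM. Contact details "
    "can include name, phone, and email. Vary tone, length, and typos so "
    "they look like real user messages. One instruction per line."
)

synthetic_inputs = ask_llm(synthetic_prompt).splitlines()

for user_message in synthetic_inputs:
    # Run each one through your real system so it gets logged as a trace,
    # e.g. run_pipeline_and_log(user_message)  # hypothetical helper
    print(user_message)
```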
- 29:30 – 33:58
LLM-as-a-Judge
- Hamel Husain
So one thing that's usually skipped when we talk about LLM as a judge is, like, people are just using LLM as a judge off the shelf. Like, they're writing a prompt. They're saying, "Okay, judge it," and then reporting that. Um, let me actually go to a different blog post that is a little bit better for LLM judges, which is this one. Okay, so LLM as a Judge. So you often see in LLM eval land, like, a dashboard that looks like this: helpfulness, truthfulness, conciseness score, tone, whatever. What the hell does that mean? Does anyone know what that means? Nobody knows. No one understands concretely, like, if the helpfulness score is four point two, and it goes to four point seven, like, do you really know what's wrong, what changed? No. And so there's a lot of guidance on how to create an LLM-as-a-Judge. Um, it's probably too much for this podcast to, like, tell you all of the things, and this blog post is quite long, um, like, enumerating how to do it correctly. But the main things that you need to keep in mind are, like, one, you need to have binary outputs. Like, is it good or bad for a specific problem? So for, like, you know, the handoff problem for Nurture Boss, like, okay, was there a problem or not? And you want specific evaluators for specific problems. Um, number two is, like, you need to hand-label some data, which you already kind of do in error analysis, and you wanna compare the judge to the hand-labeled data so that you can trust the judge. The last thing you wanna do is, like, throw up a judge on the dashboard like this, and then, like, people don't know if they can trust it. And the worst thing you do as a product manager is, like, start showing people evals, and then at some point, people's perception of the product or their experience with the product doesn't match the eval. So, like, hey, it's broken, but the evals are showing that it's good. That's the moment, like, people lose trust in you, and then it's gonna be really hard to regain that trust. And so the way that you make sure you can trust these automated LLM evals is to, you know, measure sort of agreement with these hand labels.
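A sketch of a binary, problem-specific judge for something like the Nurture Boss handoff issue. The prompt wording and the `ask_llm` placeholder are illustrative assumptions, not Hamel's actual judge.

```python
def ask_llm(prompt: str) -> str:
    """Placeholder for your chat-completion call."""
    raise NotImplementedError

JUDGE_PROMPT = """You are evaluating ONE specific failure mode: human handoff.
The assistant should transfer the customer to a human when it cannot help.

Conversation:
{conversation}

Did the assistant handle the handoff correctly?
Answer with exactly one word: PASS or FAIL."""

def judge_handoff(conversation: str) -> bool:
    """Binary judge for one specific problem; no 1-to-5 'helpfulness' scores."""
    verdict = ask_llm(JUDGE_PROMPT.format(conversation=conversation))
    return verdict.strip().upper() == "PASS"
```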
- Claire Vo
Yep. So what I'm hearing from you in terms of LLM-as-a-Judge is: these general buckets with arbitrary ratings against them are not useful and will often work against you. You wanna write specific binary-outcome evals for specific tasks. So you want a set of evals that are like, "Does this get scheduled correctly? Yes or no?" And so you're making a list of evals, um, that the LLM-as-a-Judge is evaluating, that gives you a pass/fail, or yes/no, true/false, binary outcome. Very simple. And then you're doing the additional layer of work of validating that the eval itself is valid by actually looking at that outcome and saying, "Do I actually agree with this LLM-as-a-Judge evaluation of the quality of this output?" And those steps together are gonna give you a much more comprehensive view of how your product's performing. And then that second layer of human evaluation is gonna give you more confidence that either your LLM-as-a-Judge is good and is evaluating your outputs correctly, or you actually need to tune, um, that judge itself to get to higher quality evaluations. Is that kind of the summary of what you think as well?
- Hamel Husain
Yes. And the thing that's really important is, like, it's really difficult to write any LLM judge prompt if you don't do this. Because there's some research, uh, from my co-instructor for the course that I'm teaching... There's a paper called "Who Validates the Validators?" And the research shows that people are really bad at writing specifications or requirements until they need to react to what an LLM is doing, to clarify and help them externalize what they want. And it's only by going through this process of, okay, writing detailed notes and critiquing things that you can then start refining the LLM judge.
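And a sketch of that trust-but-verify step: compare the judge's verdicts against your hand labels before putting it on any dashboard. The labels below are made up for illustration.

```python
# Hand labels from error analysis (True = pass) and the judge's verdicts
# on the same traces. In practice you'd collect a few dozen of each.
hand_labels    = [True, False, True, True, False, False, True, False]
judge_verdicts = [True, False, True, False, False, True, True, False]

pairs = list(zip(hand_labels, judge_verdicts))
agreement = sum(h == j for h, j in pairs) / len(pairs)
print(f"Judge agrees with hand labels on {agreement:.0%} of traces")

# Check agreement on known failures separately: a judge that passes
# everything looks fine on overall agreement if most traces pass.
fails = [(h, j) for h, j in pairs if not h]
fail_agreement = sum(h == j for h, j in fails) / len(fails)
print(f"Agreement on known failures: {fail_agreement:.0%}")
```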
- 33:58 – 38:15
Improving prompts and system instructions
- Claire Vo
Great. And so we've covered sort of, um, traces and errors, annotation. You have kind of how to build unit tests, which are automated tests. Of course, you're looking at it manually. You're doing LLM-as-Judge the correct way. Now, tell me, I've identified all these problems. I have these evals that give me data. How do I write a good prompt? Like, are there [chuckles] some techniques or, you know, what do I do? Are there things that you've found consistently effective in the next step of improving your system instructions, improving your tools, um, where you actually have to go solve these problems?
- Hamel Husain
Yeah. So, um, when you get to, like, the errors that you have... So, like, you know, you're gonna use these evals, and you're gonna deploy them at scale, okay? You're not looking at all your data. You're looking at a sample of data, and you're going to score your LLM-as-a-Judge against, like, a sample of labeled data, and you're gonna deploy that at scale, and you're gonna look at where there are errors. And you have to make a judgment call on how you improve your system based on the errors you're finding. Like, is it a retrieval problem? Is it a prompting issue? Should you be putting more examples in the prompt? And, you know, there's not really a silver bullet there, I would say. Um, you know, retrieval is its own sort of beast. Retrieval tends to be the Achilles heel of a lot of AI products, um, you know, where things tend to go wrong. But sometimes, yeah, especially in the beginning, you're gonna find a lot of low-hanging fruit. Like, for example, in Nurture Boss, the system prompt didn't contain today's date.
- Claire Vo
Yep.
- Hamel Husain
So when the person said, "Hey, can you do a schedule for tomorrow?" the AI had no idea what, like-
- Claire Vo
What the day was.
- Hamel Husain
... We don't know what tomorrow is, but it didn't tell the user that, right?
- Claire Vo
Yep.
- Hamel Husain
We just guessed. So, like, you know, that's really obvious, so there'll be, like, obvious things you can fix, and then there's, like, less obvious things you can fix. You could try, like, prompt engineering. So there's a spectrum from, okay, prompt engineering all the way to, like, fine-tuning. Most people shouldn't get into fine-tuning. I will say that if you do all this eval stuff, fine-tuning is basically free, because, uh, you have all this infrastructure set up to do all these measurements and curate data, like, high-signal data that is difficult... and that difficult data, those difficult examples where your AI is not getting it right, that's exactly the stuff you want to fine-tune on. That's, like, the very high-value stuff for fine-tuning. So, um, and yeah, fine-tuning is not so hard. In the Rechat case, we had to do fine-tuning to go the extra mile. Um, but in most cases, it's prompt engineering. There's no magic prompt engineering tricks. It's really, I would say, a lot of experimentation, uh, that you should engage in.
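That missing-date fix really is a one-liner. A sketch, with the prompt text standing in for (not quoting) Nurture Boss's real system prompt:

```python
from datetime import date

# Without the date line, "tomorrow" or "next Friday" is pure guesswork.
SYSTEM_PROMPT_TEMPLATE = (
    "You are an AI assistant working as a leasing team member.\n"
    "Today's date is {today}.\n"
    "Your primary role is to respond to text messages from prospective residents."
)

system_prompt = SYSTEM_PROMPT_TEMPLATE.format(today=date.today().isoformat())
print(system_prompt)
```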
- Claire Vo
Well, one of the things that I found so interesting as an AI builder that comes from a software engineering background is, now I have a natural language surface for bugs, in terms of my system instructions and prompts. And I had this experience recently on ChatPRD, where we were really having a hard time with tool calling. Like, one of our tools just was intermittently not being called, no matter what the user would say, and it was really hard to pin down. And we have this, you know, monster system prompt, and I went through, and there were, like, two words in the prompt that were just incorrect. It was about UUIDs, but it was, like, incorrect. And as soon as I deleted those two words, which had just been, you know, typed in by somebody and pushed in the repo, blah, blah, uh, our quality of that tool calling shot right up. And so, you know, we have to, as product people, as engineers, start thinking of the full surface area of our product, and it's not just the construction of the agent or the chatbot itself. It really goes down into what words are going in and out of your system, and it's a complicated surface area to debug and keep track of 'cause it's unstructured, but it's super high impact, in my experience.
- 38:15 – 40:38
Analyzing agent workflows
- Hamel Husain
Yeah, definitely. You know, when it comes to tool calls... Actually, let me show you one thing that always comes up: people wonder, like, how do you evaluate agents? Because, like, you know, there's so many different handoffs. Like, how do you actually do it in real life? So let me see if I can share that. Okay, so I'm sharing, like, um, the book that we give students in our class. Um, but let me go to the table of contents. So there's all these different areas. We'll kind of skim towards the agent part of it. So, um, there's, like, analytical tools you can use for everything. You know, for agents, you can build these transition matrices. So going from one step to the other, where are the errors located? In, like, what agent handoffs? What steps are being handed off to what other steps? So, like, in this case, okay, we have this generate-SQL to execute-SQL step. That's where a lot of these errors are happening, and then you can narrow it down. So as you get more advanced into evals, it's a very deep subject, and there's a lot of analytical tools you can use to kind of go about things. It is very interesting. Like, as a product manager, you can get really far with AI-assisted notebooks.
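A sketch of such a transition matrix: count how often each step hands off to each other step, and how often that handoff ends in an error. The step names echo the generate-SQL/execute-SQL example; the trace format is an assumption.

```python
from collections import Counter

# Each agent trace as an ordered list of (step_name, succeeded) pairs.
agent_traces = [
    [("plan", True), ("generate_sql", True), ("execute_sql", False)],
    [("plan", True), ("generate_sql", True), ("execute_sql", True)],
    [("plan", True), ("generate_sql", False)],
]

transitions, errors = Counter(), Counter()
for trace in agent_traces:
    for (src, _), (dst, dst_ok) in zip(trace, trace[1:]):
        transitions[(src, dst)] += 1
        if not dst_ok:
            errors[(src, dst)] += 1  # the handoff into dst ended in an error

# The handoffs with the highest error rates are where to dig in.
for (src, dst), total in transitions.most_common():
    rate = errors[(src, dst)] / total
    print(f"{src} -> {dst}: {total} handoffs, {rate:.0%} error rate")
```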
- Claire Vo
Yeah, what I was gonna say about this from a product manager perspective is, this is really put from the frame of errors and evals, but even just analytics for agentic systems, figuring out what your users are trying to do... I haven't thought of this idea of actually mapping out the different conversation-to-tool or tool-to-tool handoffs. And even if all of this was working effectively, a product manager's ability to see the data of its agent's behavior from a tool-to-tool handoff perspective and really identify, like, where are users trying to get value out of the system, also can do things like drive roadmap ideas, right? If you're seeing, okay, people are just writing SQL, executing SQL, like, we need to dig into what other things around that could we build for users that are interesting. So I like it from the error perspective. I also like it just from the product discovery perspective.
- Hamel Husain
Yeah, definitely. That's, that's very true. Um, yeah, I like that perspective.
- 40:38 – 48:02
Hamel’s personal AI tools and workflows
- Claire Vo
Okay, so the other thing that I like that you've shown us is that there's no other way to do this than to just do it. Like, people want these tricks. They want some hack. They want some off-the-shelf solution, and you're saying, like, "Honestly, look at the data, build yourself a solution if you have to, validate it yourself, do the hard work, and if you do the hard work, you can actually create these leaps in product quality and experience." But right now, you just gotta look at the data and make some decisions and make things better. So I think this has been super illuminating in terms of helping people like me, that are building AI products, make them higher quality. Let's spend just a couple minutes on a totally different topic, which is: you are running this business. You're running a course. You are clearly an expert in AI. What tools are in your stack for kinda running your day-to-day life, or at least your business life?
- Hamel Husain
Yeah, so I do a lot of writing, and I do a lot of communication with clients, and, you know, I also want to reduce my own toil. And so, um, let me share my screen again.
- Claire Vo
Yeah.
- Hamel Husain
It's probably easiest to show you Claude Projects. So I have all these Claude projects. Um, so okay, I have, like, one for copywriting. I have a legal assistant. I have consulting proposals. Consulting proposals is pretty interesting. So it's basically, like, um, an example of consulting proposals. So it's kind of funny. I have skill level, Partner at Palantir, Expert in Generative AI, blah, blah, blah, and, you know, I give it some instructions based on the other, let's say, proposals I have. Um, and, you know, I have, like, this prompt, you know, whatever: get to the point, write in short sentences, whatever. And basically, I have a lot of examples, and anytime I have an intake call with a client who wants a proposal, I give this the transcript, and then it's basically almost ready. It takes me about a minute to kind of edit it and get it going. So that's proposals. You know, I have one for the course, which has, like, you know, a lot of context about my course, which is, like, the entire book. Uh, I have an FAQ that's very, like, extensive that I've published. Um, there's all the transcripts, all the Discord messages, office hours, you know, and again, my prompt is like, "Hey, your job is to help course instructors to create standalone, interesting FAQs." This is like a writing prompt that I have everywhere.
- Claire Vo
[chuckles] Do not add-
- Hamel Husain
Um.
- Claire Vo
... filler words.
- Hamel Husain
Yeah.
- Claire Vo
Don't repeat yourself. Get to the point.
- Hamel Husain
Yeah. Yeah, yeah. It's very... You have to really, you know... Um, and so, okay, like, yeah, and it's just, you know, this stuff here. Um, you know, so there's, like, one for the course. There's, um, you know, there's one to help me create these things called lightning lessons, which is basically, like, you know, this lead magnet. Um, so there's all kinds of stuff like this. Um, I have one for-
- Claire Vo
I see you and I share a general counsel here.
- Hamel Husain
Oh, okay.
- Claire Vo
With Claude AI. [laughing]
- Hamel Husain
Oh, yeah. Right. Exactly. Yeah, there you go. Um, so there's that, and I also have, like, you know, my own software that I have.
- Claire Vo
Yeah.
- Hamel Husain
Um, so I have, uh... Let me see if I can find it. I mean, I'm not really advertising it-
- Claire Vo
Yep.
- Hamel Husain
... but I have, like, YouTube chapter creation, and then I, I basically have this thing that, um, will create blog posts, like, out of YouTube videos.
- Claire Vo
Mm-hmm.
- Hamel Husain
So, like, um, let me show you an example. So, uh, like this one, basically what I do is I take a YouTube video-
- Claire Vo
Mm-hmm.
- Hamel Husain
... and it becomes an annotated presentation, so you don't have to watch the video.
- Claire Vo
Yep.
- Hamel Husain
Like, you can just... Especially if the video has slides, what it'll do is it'll screenshot all the slides and then have a summary under each slide about what was said. So you can consume, like, a one-hour presentation in, what, five minutes. Um, and that's really good 'cause, like, you know, I teach a lot, and I have a lot of content, and so I distribute notes, all of that. So, like, a lot of that educational stuff is part of my workflow. Um, and this uses Gemini. Essentially, what it does is it pulls the transcript, it pulls the video. I can put in the slides all at once. I have a lot of examples, and I give it to it, and it produces this. Um-
- Claire Vo
Yeah, I've heard this in a couple of podcasts that we've done recently, that folks really like Gemini for video information. It just seems to be the fan favorite for taking basically YouTube videos or other video content and turning it into text or other applications that you can extract from that. So try the Gemini models for that, folks, if you're-
- Hamel Husain
Yeah, it's absolutely brilliant. It's amazing.
- Claire Vo
Cool! Okay, so you have Claude projects for every little part of your business. I love the proposal workflow. It's something that we folks that do enterprise sales could probably make some use out of. I'm about to start doing blog posts on all the How I AI podcasts, so maybe I will download your repo and give that a little spin. And then you're using, um, Gemini models to extract out content and share it as templates. And then you have... Oh, look at these prompts. We've got a GitHub with prompts.
- Hamel Husain
Yeah, so, a GitHub with prompts. This one is private, but just to give you an idea, conceptually: it's basically a monorepo of everything. Um, the reason for that is because I like to have Claude Code, OpenHands, you name it, work on it, because all these things are all interrelated, right? Like, a lot of these projects. So, like, you know, uh, my blog is in here. This is my blog, for example. This is that, like, YouTube thing I just showed you, this Hamel project. This is, like, something else that fetches Discord. This is about copywriting, proposals, whatever. And I just point AI at this repo, and, you know, there's, like, Claude rules in here that say: "Okay, what is this repo about? And, like, where do you find stuff?" Like, okay, you know, for writing, you should look here, um, you know, so on and so forth.
- Claire Vo
So, my friend, you have buried the lede here, 'cause we could have done an entire episode on just this repo. What this makes me think of is, you know, five years ago, there was this big, like, note-taking, second-brain thing: where do you put all your information so you can have access to it forever? And I see this, and my little engineering brain goes, "Obviously, it should go in a repo, and it should be a combination of data sources, notes, articles, things that I've written, things that I like, and prompts and tools to actually do something with that." So you have given me a personal project that I'm gonna go work on in the next couple days. 'Cause I think this is how I, as somebody who lives with Cursor or Claude Code as sort of copilots for everything I do, this is how I would wanna organize my data and my prompts to be able to do something with it.
- 48:02 – 54:48
Lightning round and final thoughts
- Claire Vo
This has been so great. Um, I have two lightning-round questions for you, and then I will get you out of here. I know you're a busy guy. My first question is, you know, a lot of what you showed us requires someone, a person, to go through with their human eyes, read things, and evaluate, and I'm curious, whose role do you think this is? Is this the product manager's role? Is it the engineer's role? Is it the subject matter expert's role? Who, who does this?
- Hamel Husain
I think the subject matter expert is very central. A lot of times, the product manager is the subject matter expert, the SME, in a lot of organizations. Like, they're kind of the person that everyone looks to for, like, the taste-
- Claire Vo
Yep.
- Hamel Husain
... of, like, "Hey, this is what should be happening with the user." So I would say a lot of times it is the product manager... that should be doing that annotation. Now, when it gets into the analysis, it's really interesting. It would be good if a product manager... like, the more you can do, the better, just like the SQL and the stuff-
- Claire Vo
Mm-hmm.
- Hamel Husain
... that you know about. At some point, you probably do need a data scientist when it gets advanced. Um, but, you know, the more you learn, the better, and vice versa. The more data scientists learn product skills, you know, it's gonna be better. It's hard to predict. You know, there's always this tension or this kind of, "Okay, can we collapse roles? Can we collapse the product role and this, like, data scientist-type AI role?" I'm not sure. Um, it's yet to be seen. I don't think so. Um, there's a lot of surface area, actually. There's, like, something called AI engineer, there's AI product manager, and there's also, like, still this data scientist aspect. So those three roles are still operating on this problem. Um, and there's definitely a lot of surface area for all of them, especially as you scale.
- Claire Vo
The one other thing that I would call out, or my hope, is, in addition to sort of, like, the technical building teams, who are sort of proxies, in my mind, for the subject matter experts... So a lot of times the product manager is a proxy for, like, the leasing agent in this example. They understand that user. They understand what high quality is. But, you know, I would really love to see folks that are in operational or more functional roles come in and actually contribute to the quality of the products, because you know what makes a good user experience. You know what makes a good leasing agent. You know how they should speak and what they should do, and I think there is an opportunity for folks to lean in and bring that expertise to bear in a way that scales across a company. Um, if you're willing and brave to do it, I think product teams would welcome, um, kinda, non-technical colleagues into this process to add some more kind of user empathy and subject matter expertise.
- Hamel Husain
Yeah, definitely. Yeah, the more you can bring, like, the actual required taste-
- Claire Vo
Yeah.
- Hamel Husain
... in the product sense into the process, the better. 'Cause that's essentially what you're doing when you're annotating-
- Claire Vo
Yep.
- Hamel Husain
... and doing this error analysis, and the error analysis is the foundation for everything.
- Claire Vo
Yep. Okay, and then my final question, which I ask everybody: I know you're very structured, and you'll tell me you'll look at the data and then figure out exactly what to say, but you have to admit, sometimes AI is very frustrating and doesn't do [chuckles] what you want it to do. Do you have any back-pocket, uh, prompting techniques you use? Do you yell? Are you all caps? What's your strategy?
- Hamel Husain
Where AI has frustrated me the most is writing.
- Claire Vo
Mm-hmm.
- Hamel Husain
'Cause, like, writing, I don't want the writing to sound like AI.
- Claire Vo
Yeah.
- Hamel Husain
And it's hard... You know, that's the last thing you want in certain situations, for your writing to sound like AI. And not that AI is, like, wrong; it's just that, yeah, you wanna make sure your, like, flavor is coming across. And so, um, one thing is, like, okay, I showed you my writing prompt, a little bit of it. I can share it with you separately-
- Claire Vo
Mm-hmm.
- Hamel Husain
... also, is, like: provide lots of examples, but then also take it step by step. So for writing, what I do is have it write an outline, and then I have it write the first one or two sections and edit it very carefully. Now, one tip is to use something like AI Studio that allows you to edit the output of what the LLM is giving you. That's really important, 'cause what that ends up doing is it creates examples for the LLM kind of right there.
- Claire Vo
Yeah, inline.
- Hamel Husain
You know?
- Claire Vo
Yep.
- Hamel Husain
Yeah. And so, um, yeah, you wanna edit the output in, you know, something like a notebook or AI Studio.
- Claire Vo
Mm-hmm.
- Hamel Husain
There's not too many things that let you edit the output. Um, but once you do that, once you do that hard work of creating those examples, especially, like, for the thing you're trying to write now, then it starts to work really well.
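A sketch of what that workflow produces: a running message list where each assistant turn is your hand-edited version, so the model sees corrected in-context examples of your voice. `ask_llm_messages` is a placeholder for a chat-completion call that accepts a message list; the prompts are illustrative.

```python
def ask_llm_messages(messages: list[dict]) -> str:
    """Placeholder for a chat-completion call that takes a message list."""
    raise NotImplementedError

messages = [
    {"role": "system", "content": "Write in my voice. Short sentences. No filler."},
    {"role": "user", "content": "Write an outline for a post on error analysis."},
]

draft = ask_llm_messages(messages)
edited = input(f"Edit this draft, then press enter:\n{draft}\n> ")

# Feed the *edited* text back as the assistant turn: it becomes an
# in-context example of exactly the style you want for the next section.
messages.append({"role": "assistant", "content": edited})
messages.append({"role": "user", "content": "Now write the first section."})
next_section = ask_llm_messages(messages)
print(next_section)
```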
- Claire Vo
Yeah, one of the most important things that I built into my AI product was: every asset that gets generated has a real-time editor for the user to update, and then those updates go back into the model. Because I just think, if the central value proposition of your product is writing, which mine is, it's one of the hardest stylistic challenges I've seen AI struggle with. It all sounds like slop. Like, I can identify AI writing from a mile away. And so, yeah, I found this, like, um, [tsks] incremental optimization process, first outline, then draft, then edit, then refine, takes a while. Um, there's some latency in the experience, but it ends up netting higher quality. And then just, like, use it as a draft, edit it, get the system to be better. So that's really, really great feedback.
- Hamel Husain
Is this for ChatPRD?
- Claire Vo
This is for ChatPRD.
- Hamel Husain
That's what you're doing?
Episode duration: 54:48