Aakash Gupta

The Most Important New Skill for Product Managers in 2026: AI Evals Masterclass

AI features don't fail because of the model. They fail because nobody evaluated them. Ankit Shukla has taught thousands of PMs how to build evals, and today he's open-sourcing the knowledge he normally charges thousands for — for free. In this episode, he demos the complete workflow for building offline and online evaluations, using LLM judges, code-based checks, and expert review that catch failures before they reach your users.

Complete write-up: https://www.news.aakashg.com/p/ai-evals-explained-simply

----

Timestamps:
0:00 – AI features fail without evals
2:14 – What makes this episode different
3:46 – 5 components of a Gen AI product
5:54 – Case study: AI job website
10:30 – Why prototypes fail (5 reasons)
11:31 – Ads
12:06 – Back to evals
13:43 – How evals fix failures
22:04 – How to build evals end-to-end
35:43 – Offline evals = your AI PRD
37:10 – Online evals & observability
39:26 – Case study: INDMoney Mind
57:52 – Real-life eval examples
1:01:24 – Key takeaway

----

🧠 Key Takeaways:
1. Evals are your AI PRD — The best AI companies have PMs define evals first. Engineers then use pass rates (36%, 56%, 80%) to know when the feature is ready to ship. No evals = no PRD.
2. Non-determinism is why evals exist — LLMs give different outputs for similar inputs. Like tea made in three different places — same ingredients, different result. Evaluations are how you tame that behavior.
3. Build your dataset from 4 sources — Past logs, desk research, synthetic LLM-generated data, and domain experts. Without real edge cases in your dataset, your evals will miss the failures that actually matter.
4. Match the eval type to the metric — Word count and format checks? Use code. Tone, relevance, and hallucination? Use an LLM judge. Compliance and legal risk? Use humans. Don't use a sword when a needle will do.
5. Offline evals before you ship, online evals after — Offline = pre-launch quality gate. Online = production monitoring on sampled traffic (1 in 10, 1 in 100). Both are required. Neither is optional.
6. Cost optimization requires evals — There's a 25x price difference between GPT-5 and GPT Nano. You'll never confidently switch to a cheaper model unless your evals prove the quality holds.
7. Involve domain experts — A PM can't always tell a good financial answer from a bad one. Bring in investment advisors, compliance leads, or customer support reps. Show them outputs. They'll tell you what's broken.
8. Use hybrid evaluation — LLM flags issues at scale, humans make the final call on edge cases. This is how you get thoroughness without burning budget on full human review.

----

🏆 Sponsors:
1. Reforge Build: AI prototyping built for product teams — try it free at reforge.com/Aakash, use code BUILD for 1 month free premium - https://build.reforge.com/

----

👨‍💻 Where to find Ankit:
LinkedIn: https://www.linkedin.com/in/ankythshukla/
Website: https://hellopm.co/
YouTube: https://www.youtube.com/@UC_gl5BtaGFDtB-imTWBWTpw

👨‍💻 Where to find Aakash:
Twitter: https://www.x.com/aakashg0
LinkedIn: https://www.linkedin.com/in/aagupta/
Newsletter: https://www.news.aakashg.com
Premium Bundle: https://bundle.aakashg.com

#aievals #aiproductmanagement

----

🧠 About Product Growth:
The world's largest podcast focused solely on product + growth, with over 200K listeners.

🔔 Subscribe and turn on notifications to get more videos like this.

Ankit Shukla (guest), Aakash Gupta (host)
Feb 19, 2026 · 1h 3m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00–2:14

    AI features fail without evals

    1. AC

      Your AI feature fails not because of the model, but because you didn't evaluate it. If you are shipping AI features without evaluations, your product is lying to you and you have no idea.

    2. AG

      Ankit Shukla has taught thousands of PMs AI evals, and today he's open-sourcing the knowledge that he normally charges thousands of dollars for, for free.

    3. AC

      So I'm going to call today's class The Masterclass on Creating Effective Evaluations.

    4. AG

      How do we actually build this?

    5. AC

      So first is the success criteria and the expected behavior.

    6. AG

      The way the best AI companies work is that the AI PM defines these evals, and that is basically the PRD for the AI engineers.

    7. AC

      If you are not doing offline evals correctly, then you have not even created a product that can be actually launched to the real audience.

    8. AG

      How can I become an AI PM in 2026?

    9. AC

      Follow these steps. Number one is make sure that your product sense skills are exceptional. The second part is make yourself aware of certain technical concepts and some Gen AI concepts. Most of the PMs have no idea about how to write evaluations. They have no idea about how to go ahead and build a production-level app. In these 60 minutes, I'm going to assure you that you are not only going to understand the fundamentals of evals, but I'll give you a real case study where we are going to help you understand how to plan your evals like a pro AI PM.

    10. AG

      Let's get right into it. Before we go any further, do me a favor and check that you are subscribed on YouTube and following on Apple and Spotify podcasts. And if you want to get access to amazing AI tools, check out my bundle, where if you become an annual subscriber to my newsletter, you get a full year free of the paid plans of Mobbin, Arise, Relay app, Dovetail, Linear, Magic Patterns, DeepSky, Reforge Build, Descript, and Speechify. So be sure to check that out at bundle.aakashg.com. And now into today's episode. Ankit, welcome to the podcast.

    11. AC

      Thanks a lot for having me again, Aakash. I think we did that podcast about three months ago, and till this day, I almost get a couple of messages every day on my LinkedIn appreciating the content of that podcast. So thanks a lot for creating that content with me.

    12. AG

      It's so evergreen. Like some months it's my top podcast episode, [chuckles] even though I've released other new podcasts. So continues to grow. True testament to the value we put there.

  2. 2:14–3:46

    What makes this episode different

    1. AG

      What are we gonna do in today's episode that's different from all the other AI evals content out there that people might have seen?

    2. AC

      Yeah. So understand, most of AI product management is actually product management only. But there is this one skill that is new and most of the PMs are not aware of, and that skill is eventually to go ahead and write good evaluations for your system. Now, what is happening in the market is that I could see a lot of content being posted around evals, but most of the content is still introductory-level or intermediate-level content, and we lack, let's say, real examples; most of the examples are hypothetical. So I thought, why don't we create this episode, so that anyone who has an aspiration to pursue the AI product management career seriously should be able to understand with crystal clarity: this is how I should approach evals. Understand, evals is not, let's say, one thing that is going to fit everyone, but I'm going to tell you the nuances that you need to take care of, and we'll be walking today with a framework so that you can write strong evals for almost any kind of product if you follow these kinds of practices. So that is my promise for today.

    3. AG

      Well, let's get straight into it, and maybe we can start with fundamentally justifying why is it important for PMs to learn AI evals?

    4. AC

      Yeah. So I'm going to call today's class The Masterclass on Creating Effective Evaluations. And understand, I'm going to give the agenda beforehand, so that you are able to take time and make sure that whenever you are watching this episode, you are sitting with a notebook, so that you can go ahead and revise and recollect things. So I'll divide this whole section into three parts.

  3. 3:46–5:54

    5 components of a Gen AI product

    1. AC

      The first part is we look at an AI product, because if you don't understand what the nuances of an AI product are, you'll not understand what evaluations are. Then I'll give you a quick five-minute introduction to what evaluations are, so that you understand what we are talking about. After that, I'm going to talk about the nature of large language models and what the metrics for evaluations are, and then I'll give you an end-to-end flow of how to create evaluations for almost any kind of product, whether you are talking about agents, simple chatbots, or some enterprise-grade products. And eventually, I'm going to give you some tips for writing effective evaluations. And then I'm going to give you an end-to-end case study of how a company might do evaluations. So that is the agenda that we are going to follow today. Now, before I go ahead and talk about evals, let's talk about the fundamental building blocks of a Gen AI product and why it is different. So far, most of us have been creating products which were very deterministic in nature. But if I talk about the nature of a Gen AI product, these are some of the critical components. One of the components is the language model. I'm not calling it a large language model because there are also some useful small language models, so we'll only use language models. Then we have the context engineering part, which is the data that you give from RAG or something, or the prompt that you put. Then we have tools, then we have orchestration, how things are going to connect with each other. And then we also take care of the user experience: how the user is going to interact, how you include humans in the loop, how you take care of latency and all, right? So these are, I would say, five critical components.
But now the issue here is that this particular part, the language model, is not deterministic. For similar kinds of inputs, it can give you different kinds of outputs. So it is almost like a lion in a circus, where although you know about the nature of the lion, which is that it's a beast, as the ringmaster of the circus, you need to make sure that you are able to tame that behavior and show a good circus, or a good product. So that is why we need evaluations. And there are other things also that are going to matter for the evaluations, which we are going to talk about. So this is the reason why we need evaluation: because there is an indeterministic, or stochastic, nature to the large language models. Now, before I go deep into the evals part, to make everyone understand what evaluations are, I'll take a very simple

  4. 5:54–10:30

    Case study: AI job website

    1. AC

      case study. It'll only take, let's say, four or five minutes. So let's say we are creating an AI-first job website, and the use case is very simple. Let's say I want to apply for a job. Every one of us who is applying for a job needs some information. The product is that I will crawl the major job portals of the world, maybe the LinkedIn, the Hired, or the Angel.co of the world. After that, I'm going to put that job description through a large language model. It could be any of these large language models or something better. And then I am going to enhance that job description, make it, I would say, much better for the candidates, because I'm going to enhance it into a summary of the job description, possible interview questions for that job, what are the skills that you need, and what is the learning guide if you are seriously preparing for it. And if you think you are prepared, we are also going to give you a quiz for assessment. So this we are creating from the small piece of information that we have got from the job description. And I'm sure that many of the people who are looking for jobs would definitely find this interesting. So this is like a simple product. You can also call it a wrapper on top of an LLM, right? Now, even if I do this, I will understand that I cannot trust large language models to do this job really, really well. So what I'll do is, in order to understand whether I'm giving the right output to the users and all this information is done correctly, I will try to evaluate. And by evaluation, I mean I will use some or the other method to make sure that whatever content is being generated, it is factual, accurate, correct, helpful, and maybe some other kind of parameters are being satisfied. For example, I want to make sure that the summary of the job description should be less than 300 words, otherwise it's not a summary.
I want to make sure that the skills that are needed are actually real product management skills. And I also want to make sure that the learning guide that I'm creating is actually an actionable guide and the model is not hallucinating. So what I'll do is I'll generate all of this, then I'll write an evaluation, and then I will understand whether my prompt and my model are working correctly or not. Because it might happen that initially I've given a prompt, but it is not able to work correctly. Or maybe I need to choose between whether I should use GPT-5 or I should be okay with maybe GPT-3 or GPT-4, which is a cheaper API, right? So that is the purpose of evals. If I need to exactly show you what this evaluation prompt is going to look like, let me show you. So this is what the evaluation prompt is going to look like. It's a very simple use case. I'm going to call another LLM, where I'm going to tell it that you are an AI quality evaluator for product management job listings, and whatever input and output that I've got from this particular model, like my initial prompt, I'm going to give it over there. That this is the job description, this is the AI-generated summary, these are the interview questions, these are the skills, these are the concepts, this is the quiz. Now you go ahead and run all of these checks, which is: is the summary accurate? Are the interview questions relevant? Are the listed skills aligned? And all of these things. So whatever output you have got from the model initially, from the base system prompt, we are going to take it and run this particular evaluation prompt on it, right? And then I'll understand whether the summary was good or not, whether the questions were good quality or not, whether the skills were aligned or not. And then based on that, I'll be able to understand whether my prompts and my language models are working correctly or not.
In case I find that accuracy is not good, I'm going to tweak the prompt to make it tighter. If I think that maybe I'm able to get the same kind of output with a cheaper model, maybe GPT-4 or something, I'm going to use that and not use the costliest model. So this is what evaluation is. I am going to get output from, let's say, some of the LLMs, and then I want to make sure that I am able to evaluate that output on certain kinds of metrics. So one of the things that I am going to monitor is the length. Length can be done by a simple evaluation of counting the characters; that can be done with the help of code. While other things, which are the accuracy, whether these are good questions or not, can be evaluated by maybe a human, or an LLM, or another large language model as a judge, which is generally more intelligent than the other models that you use in your product. So that was like a-

    2. AG

      So the eval is here

    3. AC

      ... volume-

    4. AG

      Right. The eval, it doesn't always have to be an LLM judge. We've shown a prompt, but for the 300 words, we can use simple code. So an eval can be anything, right? If we were explaining this really simply, it could be any sort of test. It could be a unit test in the old language. It could just be a counting of words here. Or in the most advanced form, as we've shown, it can be an LLM judge, which is kind of replicating some of that human intuition that we encoded into that prompt we saw. Yes.
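To make that spectrum concrete, here is a minimal sketch of both ends in Python. This is illustrative only: `call_llm` is a hypothetical placeholder for a real model API call, and the rubric text condenses the judge prompt shown in the episode.

```python
def summary_length_ok(summary: str, max_words: int = 300) -> bool:
    """Code-based eval: a 'summary' over the word limit fails, no LLM needed."""
    return len(summary.split()) <= max_words


# LLM-judge eval: encode the human intuition into a rubric prompt.
JUDGE_PROMPT = """You are an AI quality evaluator for product management job listings.

Job description: {job_description}
AI-generated summary: {summary}

Answer PASS or FAIL for each:
1. Is the summary accurate and grounded in the job description?
2. Are the listed skills aligned with the role?"""


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call."""
    raise NotImplementedError


def judge_listing(job_description: str, summary: str) -> str:
    """Run the LLM-judge eval on one generated listing."""
    return call_llm(JUDGE_PROMPT.format(
        job_description=job_description, summary=summary))
```

The code check runs on every output essentially for free; the judge costs a model invocation, which is why it is reserved for qualities like accuracy and relevance that code can't measure.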

    5. AC

      Now, we are going to address two challenges today, which are coming in the AI world, AI product world right now.

  5. 10:30–11:31

    Why prototypes fail (5 reasons)

    1. AC

      The one major challenge is that prototypes fail to scale. And this has been verified in a lot of research. One very popular piece of research that is being quoted everywhere is by the NANDA project from MIT: 95% of all AI initiatives fail, and the major reason for them is the learning gap. So when I got into the paper, I was able to understand what they mean by learning. And the learning gap basically means two things. Either the people who are building AI initiatives are not solving the right kind of problems for the companies, which is they are not aware about the workflows and are just trying to implement some or the other kind of Gen AI solutions. And the second part of the learning gap is that these systems do not evolve, which is: if I'm giving feedback to the AI that please don't do this next time, still they go ahead and repeat the mistakes.

    2. AG

      AI evals are not the only important new skill for product managers. The second most important skill is AI prototyping. And the problem with most of the AI prototyping tools out there on the market is that they're trying to be low-code, no-code tool alternatives.

  6. 11:31–12:06

    Ads

    1. AG

      They're trying to allow you to create an entire startup. Today's episode is brought to you by Reforge Build, which takes a completely different approach to AI prototyping. They are putting the product manager at the center, and they have architected the product in that way. So whether it's quickly picking up the design system or generating divergent alternatives, you can do it all with Reforge Build, and you can get a full year free of the paid version if you use Aakash's bundle. So whether you sign up for Aakash's bundle or not, go check the link in the description, try out Reforge Build. And now back in today's episode with Ankit.

  7. 12:06–13:43

    Back to evals

    1. AC

      And to talk about the first part, why prototypes fail, we have been able to do research, and we found that there are generally five reasons why prototypes fail to scale. And this is very important, because this is the difference between you playing with prototypes and you going ahead and becoming a real AI PM building, let's say, some impactful products. One very common reason is data drift. Data drift means that you have trained the model in different conditions, you have created the product in different conditions, but now the customers have evolved, the data has evolved, the context has evolved, and the knowledge has evolved, and your product is not able to keep up. The second part is cost considerations. Now, Gen AI is not like other products where the costs are fixed. In SaaS, most of the time, the operating cost, or I would say the marginal cost, for every additional user is very, very small. But in AI, with each and every call, you are going ahead and paying money. So sometimes what happens is that costs do not scale well. After that, we have engineering limitations, which is we have not done stress testing, there are issues of scalability, and asynchronous behavior we are not taking care of. And then this is an important part: in a prototype, because you are doing it with less data, you do not think of the guardrails, which is how you are going to create a feedback loop, how you are going to create the fallback logic, and what legal and compliance constraints are in play there. And then the last part, which is not the discussion for today, but still it's a major problem, which is collaboration failure. Now, this part of the AI PM role is the same as a product manager's job, which is you need to make sure that you are a good collaborator. You need to be a good collaborator, not only within your teams, but also between users and your own team.
So that makes sure that, yes, if you go ahead and do these kinds of things, there is a high probability that your prototype will not fail to scale, right? So yeah, you want to say something?

    2. AG

      So do evals help address all five of these, or where do evals really help you?

  8. 13:43–22:04

    How evals fix failures

    1. AC

      Yeah. So evals can help you in data drift. So when you are putting evals, you are continuously able to monitor that, yes, this is what is happening with my product, and if something is not going in the right direction, you'll be able to take action on it as soon as possible. So evaluations are going to act on top of your observability, like your metrics. Then for cost considerations also. What happens is, because in prototypes we are using the best kind of models, because we want to show the prototypes in the best light, we are not thinking about the cost. Cost consideration is a very important part because most of the PMs, what they do is they think that if in the prototype I have used maybe the most advanced model, I should use the same model in production as well. And then, because of cost considerations, the management has to go ahead and pull the plug. But now there is a good possibility that another model, which is a cheaper model, can go ahead and produce a similar kind of output. So I'll just show you the pricing of different kinds of models, okay? I'll just search for OpenAI pricing API, and then we can see the pricing, right? So you can see that the best model, GPT 5.1, which is mostly going to be used in your prototype because it's very intelligent, the output is $10 per million tokens. But for a model such as GPT Nano, the output is maybe zero point four, right, which is only 40 cents. Now, you do not need to use only this.
You can also use this, and you will only get the confidence of using this when you have created the right kind of evaluations, right? So that is why evaluations are important over here. They can let you understand whether a cheaper model can also go ahead and perform at a similar kind of level or not. And we have seen at multiple places that people just need to, let's say, do intent matching or intent separation, and still, for these simple tasks, they are using GPT 5.1 because they are not sure. And they're not sure why? Because they have not created the right kind of evaluations. And-

    2. AG

      Yeah, and there we saw a 25x price difference.

    3. AC

      Yes.

    4. AG

      That's really not crazy because there are also open source models out there that are priced similar to that 40 cents per million tokens. So you might say, "Oh, GPT Nano, it's so much worse than GPT 5.1." But actually, there are some competitors to GPT 5.1, whether that's DeepSeek's latest model or Kimi K2-

    5. AC

      Yeah

    6. AG

      ... that are pretty good and also pretty cheap.

    7. AC

      Yeah. And also there is a concept called transfer learning, where you take a small language model, like a Gemma from Google, and then, because your case is very specific, such as customer support for a particular kind of company with a limited kind of product, you don't need to use GPT 5.1. You can use a small language model: you can create synthetic data, do transfer learning, do fine-tuning, and then you'll have a similar kind of model which is dedicated to all the use cases that you have.

    8. AG

      So as long as you have a really specialized use case, you can take these smaller models, you can do transfer learning, and you can make the cost much cheaper.

    9. AC

      Yes.

    10. AG

      But how do you know that you're doing it at the same quality? Evals.

    11. AC

      Evals. Yes, correct. And then for engineering limitations, I think, uh, uh, this is going to be wider in scope. Yes, some of this can be solved with evals, but not all of it. And then guardrails. This is a major part where evals actually shine, which is you are going to define what should the model do, what should the product do, what should, should it not do, where-- like when should it code, when should it ask for help, and what are the different kind of output that it should produce. And this is where evals are actually going to go ahead and shine, which we are going to see in just a moment. And then this collaboration failure. Now, although this is not directly related to evals, but when you are writing good evals, you are always going to involve the subject matter experts, and that will give you a different kind of empathy with the user. So I think this also is indirectly contributed because you have not cared enough, enough to go ahead and create the right kind of evals.

    12. AG

      Hundred percent. Makes sense.
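The cost argument above can be made mechanical: run both models over the same eval dataset and compare pass rates against an acceptable quality drop. A minimal sketch, with the caveat that the per-case results and the 10% tolerance are made-up numbers, and the prices are the ones quoted from the OpenAI pricing page in the episode:

```python
# Prices quoted in the episode ($ per million output tokens): a 25x gap.
PRICE = {"gpt-5.1": 10.00, "gpt-nano": 0.40}


def pass_rate(results: list[bool]) -> float:
    """Fraction of eval cases the model passed."""
    return sum(results) / len(results)


def can_switch(cheap: list[bool], expensive: list[bool],
               max_drop: float = 0.10) -> bool:
    """Switch to the cheaper model only if its pass rate stays
    within max_drop of the expensive model's pass rate."""
    return pass_rate(expensive) - pass_rate(cheap) <= max_drop


# Hypothetical results from running the same 20-case eval set through both models.
expensive_results = [True] * 18 + [False] * 2   # 90% pass
cheap_results = [True] * 17 + [False] * 3       # 85% pass

# Quality holds within tolerance, so the 25x savings is defensible.
decision = can_switch(cheap_results, expensive_results)
```

Without the eval harness producing those pass/fail lists, the comparison never gets past gut feeling, which is exactly why teams default to the most expensive model.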

    13. AC

      Yes. Cool. Uh-

    14. AG

      Actually, one more question since we saw that MIT slide. Do you-- Is that BS? It seems, like, overstated. Ninety-five percent of AI initiatives fail. Like, it feels like maybe that's true in, like, a bank or somebody where they're, like, really tech backwards. But in tech startups, it feels like it's probably the opposite. Like 95% of AI initiatives succeed.

    15. AC

      Cool. So understand, data is data, right? You can create a story with the selective data that you have. Now, what has happened in this study is, if you look at all the, let's say, 30, 35 pages of that particular research paper, you'll be able to understand that, first, we are right now in the very early stages of AI, comparable to 1989 to 1991, when the internet came. So it's too early for us to go ahead and assess the, let's say, profitability or scalability of AI, or the usefulness of AI. The second part is that the companies that were considered numbered only in the double or triple digits. Okay? So they have not considered a lot of companies which are going ahead and trying a lot of things in the AI world. However, I am generally very much aligned with the findings, because it is very logical. So if you go back to, like, traditional product management, products basically fail because they are not catering to any user needs, and they are not even evolving with the user needs. And that is what they are saying in the end. And if you now look at this, then even in the real world, almost 90% of products fail. You would have created so many products, so many prototypes, but most of them have failed. Like, at least they were not a commercial success. So I think, yes, the output of the study could be good, but yes, there was some kind of selective data. But then in the future, I believe that if more people are taking AI initiatives, then the failure rate could be mostly around the same, because it's not easy to create, like, a very scalable and very profitable product.

    16. AG

      Hmm. Okay.

    17. AC

      Yeah. So now we are talking about evals because of one major reason, which is the large language models are non-deterministic, right? And this is a major reason why we have to go ahead and do all of these things. Otherwise, this could be, uh, let's say, handed over to maybe a testing team or a QA team, right?

    18. AG

      Yeah.

    19. AC

      I'll give you an example. So although I've given you the job site example, now I'll go deeper. I'll give you, like, a more commonsensical example, which is maybe example of a chai or a tea, right? So if someone asks you, "How would you create a tea?" Then maybe your answer would be different than someone else. And then every person in the world would love a different kind of taste of the tea. But then their definition of tea is the same, that you go ahead and have some water, you have some tea leaves, and then some people would have the sugar, somebody would have the milk, and then they are going to create the tea. So but now when I go to different places, okay, for example, when I go to maybe a hill such as Manali or something, the quality of tea is going to be different, the taste is going to be different. When I go ahead and take a tea at home, it is going to be different. And then when I'm going to maybe go on a trip on railway, you'll understand that the tea is different, and this is mostly the worst kind of tea that anyone has gone ahead and-

    20. AG

      [laughs]

    21. AC

      So what is happening here is that although the output is the same, the output is the tea, but based on the customer needs, based on the context, based on who is preparing the tea, the feeling that you get while drinking that tea is different, right? So your product is also the same. So if I give a large language model the prompt, how do I make a cup of tea? Because they are non-deterministic, they are going to give me the right answer. They will tell me how to go ahead and make tea. And understand, the hallucination rate is not as much as it was maybe a couple of years ago. Right now, hallucination has reduced a lot. Most of the answers are generally factual, but still this tea is different than this tea, and this tea is different than this tea, right? This is the moral of the story: the large language model, yes, it can be correct. We are not questioning correctness at this point of time. They hallucinate, but the hallucination rate has reduced. However, even if the models become perfect in terms of not hallucinating in the future, which they are going to be, still they cannot understand what your customers want, because every customer is different and their needs are different. So you, as a product manager, need to make sure that you are able to ensure a similar kind of experience, or the right kind of experience, for all the customers, so that if people want to take a tea, they are going to come to your shop rather than going to some other product out there. Yes. Do you like tea, Aakash?

    22. AG

      Yes, of course. I drink chai every day. [chuckles]

    23. AC

      Yes.

    24. AG

      All right, so I think we have the intuition down. Now, how do we actually build

  9. 22:04–35:43

    How to build evals end-to-end

    1. AG

      this?

    2. AC

      Yeah. So now I'm going to give you, like, a detailed flow, and I've made it in the form of a diagram so that everyone is able to understand very visually. So I'll just go ahead and share the Figma. Yeah, so this is Figma. Right now, it might look overwhelming, but I'll tell you how it works. Okay? So the very first thing is, whenever you go ahead and start any kind of product, you are going to define the success criteria and expected behavior. Okay, let me give you a very small example. Although we are going to talk about this example in detail later, I'll show you a gist. So let's say most of you might have used Robinhood Cortex, which is the new feature that they have launched, and a similar feature from a company in India, INDMoney, where I used to work, let's say, three, four years ago. They have also built that kind of feature. I'm not in any way affiliated to the company, and all of this is actually reverse engineering. So the product is very simple. INDMoney and Robinhood are stock trading apps where you can buy and sell your stocks. Now, what they have done is they have understood that many people, before buying a stock, want to do some research. And for that research, they are generally going to go to Google and type something, they are going to go to ChatGPT, or they are going to ask a financial advisor. These people thought, "Why don't we go ahead? With AI, we should be able to help people understand more about this particular stock." So they have created a feature called INDMoney Mind. And what it does is, whenever you click on this below any stock, it is going to give you some auto-populated, commonly asked questions, and then you can also write your own question.
And then when you click on that, it gives you an answer powered by AI: it fetches the right kind of documents and gives you a very contextual answer. That is the use case, right? Now, remember this use case as I walk you through the whole flow. First is the success criteria and the expected behavior. I want to make sure there are some guardrails I need to follow. Whatever the input, the output I'm getting out of this chatbot should be limited to maybe 150 words or maybe 300 characters, to make sure it is summarizable and people are able to read it. The second behavior is that the model should never recommend selling or buying a stock. Why? Because legally you are not allowed to give any kind of recommendation. There is a regulation in India which says that if you are not a registered investment advisor, you cannot do this, and AI cannot do this either, right? So let's say these are the behaviors: the answers should be factual, they should be grounded in the data, and the model should not suggest that someone buy or sell a stock. It should not make a direct recommendation. It should just give the information. That is the expected behavior. The success criterion should be that in the end, when people are getting this information, they give a thumbs up or something, so that we understand whether the output was actually good or not. Now what we'll do is transform that expected behavior and success criteria into some kind of metrics, right?
So the metrics could be, let's say, the quality of information, maybe some UX metrics such as latency. Then the output has to be safe. It has to be performance-oriented, which is again latency. And then we are going to talk about behavior, which is that it should not suggest that you buy or sell the stock, right? Now, for something like UX, if I want to make sure the output stays within 150 words or 300 characters, I can choose to do a code-based evaluation. So, to make sure these evaluations are being done correctly, I have multiple options. One option is to do it through code, I can have a human review it, I can do it through an LLM, or I can choose a combination of these three, right? So that is what I'm doing on this side. But on the other side, as soon as I understand the success criteria and the expected behavior, the first step I need to take is to create a base product.
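The 150-word / 300-character guardrail described above is exactly the kind of definitive check that belongs in code. A minimal Python sketch, where the function name and default limits are our own illustration:

```python
def passes_length_check(answer: str, max_words: int = 150, max_chars: int = 300) -> bool:
    """Code-based eval: pass only if the answer fits both limits."""
    return len(answer.split()) <= max_words and len(answer) <= max_chars
```

Deterministic checks like this are cheap enough to run on every output, which is why they come before LLM judges or human review in the cost ladder.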

    3. AG

      Yeah.

    4. AC

      So I have an expectation that, yes, I'm going to create version one of the product, and it should not do these things. That is the base knowledge that I have. From that base knowledge, I'm going to create a system prompt. The system prompt is very simple: given this stock and this question, answer the question, make sure you are not recommending anything, make sure you are always getting your input from this kind of data, and maybe some other instructions, right? And then I'm going to choose a certain system prompt, certain models, and certain tools. Here I'm going to use a tool called web search, right? And then I'm going to give it a certain context, which is the user's information or background, and then I'm going to use a certain kind of orchestration. Understand that all of these are variables. A major mistake that product managers make is thinking these things are fixed, and they tend to fall in love with their solutions. All of these things are variables. These are knobs on a dashboard that you need to turn here and there in order to make a better product, right? So you are going to start from something basic that comes out of the information you initially have. You are going to put in an input, and you are going to get an output. This is the base product. For example-

    5. AG

      What might be an orchestration layer there?

    6. AC

      Yeah. So the orchestration layer is how you make sure that all of these four things connect with each other. A good example is n8n. On n8n, I'm going to have different nodes of LLMs, tools, and memory that connect with each other. Now, this orchestration layer could also be a region of failure. If it has a lot of latency, if it runs across geographies, or if there are orchestration issues, it can also give you certain challenges, right? So you as a product manager don't have to fix anything right now. You should understand these as variables. You don't have to love your product. Your aim is to give the best experience to the users and be helpful to them. So now we are going to start from here. After this, you need to understand how your product is performing. To see that performance, you will create a very good dataset. This is where I have marked a star, because this is where most of your effort is going to go. You have to create this dataset, which is nothing but the different kinds of inputs that users can give your product, right? So you are going to collect past data. For example, INDMoney already has some advisors, humans sitting at the back end, because they also offer a service where you can talk to an advisor. So from those logs they can understand the different kinds of questions that people generally ask. That becomes one source of data. The second source of data is research. You go to Google, ChatGPT, and so on to research what people ask when it comes to understanding a stock. Similarly, you are going to use LLMs. With LLMs, you can also generate something called synthetic data.
You can say: this is the product I'm offering, can you give me a sample dataset? And it is going to give you one. And then, eventually, there are experts. You are going to talk with real investment advisors, ask them what different kinds of questions people ask, and get them to fill in certain sheets. These four sources are very important because they make sure you are actually dealing with real cases, right? So once you have that, you are going to run it through your base product, whatever you have created, and you'll get a certain kind of output. And I can assure you, you'll be surprised when you see that this output was not as good as you thought your base product would be.
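The four sources (past logs, desk research, synthetic data, experts) can be tagged in the dataset itself, so coverage per source can be audited later. A hypothetical sketch; the questions and field names are illustrative, not INDMoney's actual data:

```python
# Each eval case records the input, where it came from, and the expected behavior.
eval_dataset = [
    {"input": "Is TCS a good dividend stock?", "source": "past_logs",
     "expect": "factual summary, no buy/sell recommendation"},
    {"input": "Should I buy this penny stock today?", "source": "expert",
     "expect": "decline to recommend; explain the regulatory constraint"},
    {"input": "What is the P/E ratio of Infosys?", "source": "synthetic",
     "expect": "grounded number with a cited source"},
    {"input": "How do I read a balance sheet?", "source": "desk_research",
     "expect": "educational answer within the length limit"},
]

sources = {case["source"] for case in eval_dataset}
```

Keeping the `source` tag makes it easy to notice, for example, that no expert-curated edge cases made it into the suite.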

    7. AG

      [chuckles]

    8. AC

      Right? And most of the time you might not be a good judge of it yourself, so you can also include experts. Let's say I'm a product manager. I do not know what is good advice and what is bad advice in terms of finances, so I'm going to involve a financial expert in this particular case. And then I'm going to ask them to tell me whether these outputs are good or bad, which is whether they have failed or passed the criteria. And then they also need to tell me what that criteria is, right? Otherwise, what happens is that most product managers, if they are not subject matter experts or domain experts, will not be able to come up with the right kind of evaluations. Once you show people data, they'll be able to tell you that this is a mistake. It's easier to point out mistakes than to prepare for them in advance, right? So now we have this output, and we have these remarks. These remarks are again going to go through this process. From the expert analysis, from user empathy, from the success criteria, from the expected user behavior, we'll have a set of evaluation metrics, so that now we can make sure these things do not happen. So-

    9. AG

      Yeah

    10. AC

      ... one of the investment advisors will look at what we are building and point out problems with the outputs. For example, an expert can tell you that your product is actually generating recommendations, or that the information it has given is very outdated, or that it is hallucinating information. You are going to take all of these remarks, by actually giving the experts the inputs and the outputs, and then you are going to decide these metrics. Okay. And understand, it will take you some time to understand and decide these metrics. That is why I have also created a cheat sheet, with the help of some of my knowledge plus Claude and GPT. If you are building any kind of product, you can look up in it what kind of evaluation metrics you should consider. I'll make it available in the description; Aakash will make it available. It is a very exhaustive cheat sheet. After that, what happens? Now I have certain criteria, and I will decide what I should use for the evaluations. For things which are very definitive, I'm going to use code, such as checking whether all the required words are mentioned, or whether I am following a certain criterion like summary length. Code is cheapest. In some cases I'm going to use humans. In some evaluations I can use LLMs, but most of the time I'm going to use a hybrid. Hybrid means that LLMs flag situations that are not working, and then a human gives the final call, right? And then you are going to write the evaluations. Now, in the machine learning or NLP world, we already have some base-level evaluations that can be done by code.
For example, length checks, bilingual evaluation, ROUGE, and word error rate. Here we make sure we can tell whether the output follows the criteria or not. And then in some parts where code cannot work, because the thing is very subjective, we are going to use other evaluations. Evaluations with LLMs can measure things such as guardrails, UX, tone, helpfulness, and relevance. This can be done with the help of prompts that we give to a large language model, making the LLM a judge. Now, once we have done this-

    11. AG

      What does ROUGE mean?

    12. AC

      Yes. So there are two things, BLEU and ROUGE, right? Traditionally in machine learning, what happens is this: let's say I have some output given to me by the model, and then I have a golden dataset, right? B-L-E-U and ROUGE are two methods which help you understand recall and accuracy for your models. For example, let's say I'm getting this output from the large language model: "The cat is on the bed." And then there is the golden dataset. Golden dataset means the real, accurate output, which is: "The cat is on the mat." Now, these two sentences are entirely different in meaning, right? One is a different scenario from the other. But what BLEU and ROUGE do is compare the words. If you compute the BLEU or ROUGE metric here, I have six words in one sentence and six words in the other, and BLEU and ROUGE will tell me that five of the six words match. That means: yes, your output data and the golden dataset are matching with each other, right? But if you go ahead and use another LLM to judge, you will understand that, boss, this is not true. "The cat is on the mat" and "the cat is on the bed" are actually different statements, different scenarios. So that is where they are used, right? Now, in traditional machine learning they are used a lot. They can be used to check whether your information is grounded or not; you can just do some matching.
But ultimately, if you are judging answers on the basis of BLEU and ROUGE only, you'll not be able to do it. That is why these are slowly getting outdated for real GenAI cases.
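The bed/mat example can be reproduced with a toy unigram-overlap score, which captures the word-matching intuition behind BLEU and ROUGE (this is a simplification for illustration; the real metrics use n-grams, clipping, and brevity penalties):

```python
def unigram_overlap(candidate: str, reference: str) -> float:
    """Toy recall score in the spirit of ROUGE-1: the fraction of
    reference words that also appear in the candidate."""
    cand_words = candidate.lower().split()
    ref_words = reference.lower().split()
    matched = sum(1 for word in ref_words if word in cand_words)
    return matched / len(ref_words)

# 5 of the 6 reference words match, so the score is high even though
# "bed" and "mat" describe different situations.
score = unigram_overlap("The cat is on the bed", "The cat is on the mat")
```

This is precisely the failure mode described above: high word overlap, different meaning, which is why an LLM judge is needed for semantic checks.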

    13. AG

      Yeah. So these are the generic ones, and one of these is ROUGE. That's R-O-U-G-E, which is Recall-Oriented Understudy for Gisting Evaluation. And I think you mentioned these come from traditional ML and NLP, that is, machine learning and natural language processing, if people don't know. So these are the initial ones, but it's these evaluation prompts that you have below the functions. That's where a lot of the meat and the success is gonna be driven from, right?

    14. AC

      Yes. Correct. But it does not mean that you will not use these, because when you can use a needle, why would you use a sword? We should not make the mistake of only using the costly methods because we don't want to engineer things in the best possible way. So don't try to save time here, because these costs are going to compound in the future.

    15. AG

      Yeah.

    16. AC

      Right? Now, once we have set the guardrails and created the evals, we will run these evaluations. There are two methods of running them. One is offline evals, which we run on the product before we launch it or before we make a major release.

    17. AG

      Yeah.

    18. AC

      We make sure that whatever changes we have made are actually correct, right? This is like the alpha/beta testing that we do, right? So you are going to refine the system prompts, model selection, and all the other parameters that I have mentioned here.

    19. AG

      And I think that this is a really important point to double-click on for people.

  10. 35:43–37:10

    Offline evals = your AI PRD

    1. AG

      The way the best AI companies work is that the AI PM defines these evals, and that is basically the PRD for the AI engineers. Then the engineers say, "Okay, here's how we're performing on the evals. We're at 36%, 42%, 56%, 80%, and 10%." What they're gonna try to do is all those low ones, 10%, 34%, 62%, they're gonna try to get those to, like, 80% or 90%, and then you'll actually go ahead and ship something. So there's this hill-climbing mechanism, and that's why evals are so important. And when you keep hearing this term offline evals, don't think offline equals unimportant. Offline is actually the critical pre-development process evals that are your PRD. Is that right?

    2. AC

      Yes. Correct. Correct. So what happens is, if you are not doing offline evals correctly, then I would say you have not even created a product that can actually be launched to a real audience, right? I'm not saying that offline evals will always be perfect, but at least, whatever you know, you should try to implement before you launch the product.

    3. AG

      Yeah.

    4. AC

      Right?

    5. AG

      And this is how you're-- Traditionally, we defined the edge cases, the corner cases, in the PRD. We're defining those in the form of evals now.

    6. AC

      Yes, correct. And then what also happens is that, yes, offline evals are good, and many people do them. But there is also something which is equally, if not more, important, which is online evals. For this you have to use a platform. You can use any of the observability platforms; all of them now have these. Two major popular ones are Arize and TruLens.

  11. 37:10–39:26

    Online evals & observability

    1. AC

      So with these platforms, you will keep observing the product. I talked about data drift in the beginning: user expectations have changed, so your current prompts, whatever you tested in the evals, or your current models, are now not working. The world has changed. So you keep observing, and you run the online evals, which are the same evals, on your production-level data. Maybe you'll not run them on every input and output; maybe you'll choose one in 10, one in 100, one in 1,000, whatever ratio you need, because they are costly as well. And you make sure you are observing them, so that whenever any change happens you are able to see it and make changes again in your base product. And then this whole cycle repeats itself. Okay, so just to give everyone a summary again: we start with the success criteria and the expected behavior. On the basis of that, we get a first level of metrics and our expectations of what the prompt should do. Then we create a base product where we put the very first system prompt, whatever the best thing is that we can do, plus the models and everything. After that, we collect a lot of data from multiple places: edge cases, experts, LLMs, and so on. Then we run the whole input dataset through our base product, and we create evaluations based on the mistakes we have found.
So the evaluations become a set which runs whenever you are doing a major release. You can choose to run it every week or every month, so that you know whether the data is drifting. And then you have something called online evals, which you run on the production-level data, and you get informed whenever an evaluation fails here and there. For example, we have an evaluation called accuracy. If accuracy ever goes below 98% in these evaluations, we should get flagged, and we should go and improve the prompt or any part of the orchestration. So this is the whole end-to-end flow of creating evaluations.
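The 1-in-N sampling and the "flag when accuracy drops below 98%" rule can be sketched together in a few lines. All names here, including the placeholder judge, are our own illustration, not any real platform's API:

```python
SAMPLE_RATE = 100       # evaluate 1 in 100 production requests
ACCURACY_FLOOR = 0.98   # alert when the rolling score dips below this
alerts = []

def run_accuracy_eval(output: str) -> float:
    """Placeholder judge: a real system would call an LLM judge or a code check."""
    return 0.0 if "recommend" in output.lower() else 1.0

def maybe_evaluate(request_id: int, output: str, scores: list) -> None:
    """Run the costly online eval only on a sampled slice of traffic."""
    if request_id % SAMPLE_RATE != 0:
        return                      # skip un-sampled requests
    scores.append(run_accuracy_eval(output))
    rolling = sum(scores) / len(scores)
    if rolling < ACCURACY_FLOOR:
        alerts.append(rolling)      # in production: page someone / open a ticket
```

In practice the sampling, scoring, and alerting all live inside the observability platform; the point of the sketch is that the same eval definitions run offline pre-launch and online on sampled traffic.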

    2. AG

      Awesome. And it sounds like a lot of the art here, or actually I know a lot of the art here is in writing those LLM judge system prompts and creating those metrics. So how do we see this in action?

    3. AC

      Yeah. So what we do is, let's say, I have created,

  12. 39:26–57:52

    Case study: INDMoney Mind

    1. AC

      a PRD. Let me just go ahead. Mm. Yes. So this is what a product manager would actually do, right? I have reverse engineered this. I'm again saying that I'm not affiliated with the company, so if there are any similarities, they are just coincidental. So I've reverse engineered this product, and this is what they would have done. Okay. I have broken the document down into these 10 sections. The first part: before you talk about being an AI PM, understand that you are a PM first, so make sure you are setting the context correctly. This is the product, this is what it does, so that everyone is able to understand it. And you always have to start by writing down the value for the user: reduce search friction, decentralize financial advice. What is the value for the user? That the user does not have to spend a lot of time searching for something, and they'll get the advisory within the product itself. And then we have written down the value for the business, because these will come up again when you define the metrics for your evals and your product. And now we are going to define different layers. I'm talking about these layers because they are also going to decide your metrics: the user interface layer, orchestration layer, data retrieval layer, LLM layer, logging and analytics. All of these play an important role in your eval systems. And then eventually this is level one. I've not written a complete prompt, but I have written what will go into that prompt: prompts and context, system prompt core, and the AI assistant is configured with the following key principles. Now look at this: role as an analyst, not an advisor. So I know that I don't have to ask it to be an advisor and suggest something to me.
I want it to just act as an analyst and give me the understanding. Although it is a very small line, it is going to define the behavior of the system, right? So I'm going to use all of these things, and understand that I'm not writing one big collective prompt, because I have to make it easy to iterate on. This way, whenever I run an eval and have to change the system prompt, I can choose which line to pick. Otherwise, in a bigger prompt, it's very difficult to find what you want to edit.

    2. AG

      Mm.

    3. AC

      Right? So we have to take care of productivity, because it is small frictions like these that lead product managers to not do the right thing. And then-

    4. AG

      Yeah, I think you're bringing up an important point, which is that you need to be very sensitive to your organization, and you need to go to your development lead, your AI engineers, your head of product, and clearly define where the role of PM ends and the role of engineer begins. I think this is a nice line you've shown here, where there's guidance for the system prompt, but you're not necessarily saying this is exactly what the LLM judge system prompt should be. That's gonna often be the case in a larger company, where your AI engineers are still gonna be the ones writing the system prompt for your evals, but you've defined at a high level how it should look. At a smaller company, you might actually be writing the system prompt. You might actually be in an Arize or something like that. So it's good to have the skill, but it's really important to understand where PMs should sit in this. And I think your example here is more for a slightly larger company. Is that fair?

    5. AC

      Yeah, I think, yeah, I would say it depends on the autonomy that a PM gets. I have seen that in larger companies also, some people are able to be more agentic, to have more agency, and they are able to do all of these things. But in small companies, generally, you don't have to do a lot of these things as a product manager, because you are figuring things out as the picture becomes clearer.

    6. AG

      Makes sense.

    7. AC

      Yeah. Right. So now these are the response guidelines, and these are some context variables that we are going to inject into some queries. Understand, this document has to be created so that everyone is on the same page, right? Don't just think that once you have written an evaluation prompt your task is done. You have to give all of this context so that people can read it and understand. Because you will not be the person eventually doing the coding, and the engineers have to be aware of all of these things. Right. Now, yes. I have also mentioned that for fundamental analysis and technical analysis, these are other things I need to make sure I'm giving as context to the AI, so that it does not hallucinate.

    8. AG

      What do some of these acronyms mean here?

    9. AC

      Now, for example, for fundamental analysis I need to collect the information. Whenever I have to understand a stock fundamentally, I need to look at a variable called the P/E ratio, the price-to-earnings ratio, right? And I need to make sure I'm giving the LLM the context of what these terms are, so that it does not make the mistake of hallucinating.

    10. AG

      Mm.

    11. AC

      Right? Now understand, I could have skipped this and just used GPT-5, but that would be a very lazy decision, right? Because in the long run, the costs are going to compound. What can I do to give more context to the AI so that I'm okay with using a less capable model? That is what you need to understand as a PM.

    12. AG

      So almost putting on the hat that you're not working with the best model.

    13. AC

      Yes, yes, yes. Correct. So aim for the best, but design for the worst.

    14. AG

      Hmm. Makes sense. Yes.

    15. AC

      Yes. And then, yes. So then we are also going to put... Understand: you understand what evaluations are, but engineers might not have them top of mind, and designers or leadership might not have them top of mind either, because they are only exposed to the prototypes. They are not familiar with this nature of the system, so you have to clearly explain why this is happening, right? This is a high-stakes domain, and in India this space of fintech is heavily, heavily regulated. You cannot ship this without doing evaluations. And also understand that sometimes things will still go wrong. At least at that point you'll be able to show the regulator that you have taken all the fail-safe measures to prevent it, right? And then we are going to have all of these. I'll share those documents so that everyone can read them in advance, and then we'll have these deciding metrics and expected behavior. So: what are the evaluation dimensions? Factual accuracy, compliance, groundedness, relevance. This is what each means, and this is why it matters. It will take some time to read this document, and you don't have to write it by yourself. My general recommendation, if you are really an AI-enabled PM, is to talk through Whisper Flow to your GPT or your Claude or your Google Doc, right? Talk as much as you can, because that is more productive. After that, ask Claude or GPT to put it into this structure, and also ask them to fill the gaps that you have missed in this particular document. And then we have some more documents. Yes, we have also defined some thresholds. Numerical accuracy means that if the AI is quoting any kind of numbers, percentages of returns and so on, my target is that more than ninety-eight percent of the time they should be correct.
And if it drops below ninety-five percent, I should get flagged, and this product should not go into production unless I improve something, right? I have also mentioned this, which is super helpful for the online evals: if the compliance pass rate goes below this, I should take immediate action. And then this is the expected behavior by query type. This will not come out of the blue. You would have gone through this process of creating outputs from the expected inputs. So what we'll do is... Yes. So here we are able to observe that maybe these are the things that should never be missed, right? And then we have some edge-case behaviors: a stock with missing data, a penny stock, these kinds of things. And this will only come once you are collecting all the data in your dataset. Otherwise, it will not be top of mind. Right. Now I can show you the dataset. Understand this is all a synthetically created dataset, but it will serve the purpose of learning. Okay, so we have divided it into multiple parts, and these are the sources. For fundamental analysis, I have collected the data from multiple sources: synthetic data, talking to the experts, looking at my own data, doing my own research, and I was able to understand these things. Let's say someone asks, "How good is ITC as a dividend stock?" Then this is the context I need to give, and this is what I expect. And then these are going to be the red flags, right? So eventually I will not write the evaluation prompt by myself. I will put all of this information into an LLM and ask it to write a better prompt.
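Thresholds like "target 98%, flag and block below 95%" can be encoded as a simple release gate over offline-eval pass rates. A sketch with hypothetical metric names; the zero-tolerance compliance rule is our reading of the example, not a quoted spec:

```python
# metric -> (target, block_below): below target flags, below block_below blocks.
THRESHOLDS = {
    "numerical_accuracy": (0.98, 0.95),
    "compliance_pass_rate": (1.00, 1.00),  # any compliance failure blocks release
}

def release_decision(pass_rates: dict) -> str:
    """Return 'ship', 'flag', or 'block' from offline-eval pass rates."""
    decision = "ship"
    for metric, (target, block_below) in THRESHOLDS.items():
        score = pass_rates[metric]
        if score < block_below:
            return "block"
        if score < target:
            decision = "flag"
    return decision
```

Wiring a gate like this into the release checklist is what turns the eval thresholds from a document into an enforced quality bar.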

    16. AG

      Mm-hmm.

    17. AC

      Right? Because as a human, you can miss out on a lot of things. And then you will run all of these things again with the dataset to make sure it is not making a mistake, right? And I've also set some kind of priorities.

    18. AG

      So the role of the human is figuring out what are the expected elements, what is the overall guidance for the prompt, then use AI to create the final prompt. And there are studies that show that AI is better at writing prompts than humans. So put AI where AI is better, put humans where humans are better.

    19. AC

      Yes, correct. And I think collaboration is the best way out there. Rather than competing, you should collaborate with an LLM, and then both of you are going to unlock different kinds of powers. Yes.

    20. AG

      For sure.

    21. AC

      Yeah. And then this dataset is there. And understand that we have used multiple methods of collection. We have taken production queries: if you are creating a support chatbot, you would have been giving support before this chatbot, right? So you can collect that data. Then there is expert curation; in this case we can take ideas from the financial experts. And then we do synthetic generation. And this is also important: we have to maintain this data. Although at INDMoney the data does not need to be refreshed very frequently, if your product deals with a use case where the data changes a lot, you should make sure you update this dataset once in a while. So this is a set of test cases which should always run before you do a major release, right? And then come the evaluations. We are going to do three kinds of evaluations: automated programmatic evals, LLM as a judge, and human evaluations. For automated evals, I can do a factual accuracy checker, compliance checker, groundedness checker, structural checker. I can just match the words: whatever numbers appear, am I able to see them in the sources as well, right? And then with LLM as a judge, I can check relevance, balance, and tone. And again, I will not create the prompt by myself. I'll just give this particular context to an LLM, and it is going to generate a good prompt for me, right?

    22. AG

      Mm.

    23. AC

      And then as a human evaluation protocol, I'm going to use humans sparingly, because they are costly and they take time. So whenever a new feature launches, or something important happens, I'm going to make sure they run through the same cases, right? And if the automated metrics, let's say the LLM-as-a-judge checks, are failing, then I'll make sure I involve a human to check what is happening out there, right? And then, yes, now this is a slightly different thing: for a human, I have two methods. I can ask them to just give me pass or fail, or I can ask for a rating of one to five. Different people have different opinions, but it's good to start with a pass-or-fail criterion so that people stay objective. Eventually, even if people are rating one to five, ask them to give you a remark explaining why they think that is the case, because you can use that data to further improve your context or your models, right? And then we have the blocking criteria: if at any point of time we see any of these issues, we are going to block the deployment, right? And then offline eval, I think this we have already understood, but I'll share this document. Yes. So here what we are doing is a smoke test, a full regression, and everything. We are going to run all the evals, and on these criteria we decide block or no block; if it's a block, we will not release it. And then for the online evals, we will have latency. P50, P95, P99 exist because taking an average is not going to give you the right kind of results.
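[Editor's note] The offline block-or-no-block gate described here can be sketched in a few lines. The function name, the pass/fail evaluator interface, and the 80% threshold below are illustrative assumptions, not the actual release pipeline:

```python
from typing import Callable, Iterable

def run_offline_gate(cases: Iterable, evaluate: Callable[[object], bool],
                     pass_threshold: float = 0.80) -> tuple[float, bool]:
    """Run every eval case; block the release if the pass rate is too low.

    `evaluate` returns True (pass) or False (fail) for a single test case.
    Returns (pass_rate, blocked).
    """
    results = [evaluate(c) for c in cases]
    pass_rate = sum(results) / len(results)
    blocked = pass_rate < pass_threshold
    return pass_rate, blocked

# Toy example: 8 of 10 cases pass -> 0.8 pass rate, release not blocked.
cases = list(range(10))
rate, blocked = run_offline_gate(cases, evaluate=lambda c: c < 8)
print(rate, blocked)  # 0.8 False
```

In practice the gate would run per-eval (groundedness, compliance, tone) with different thresholds, and any hard-blocker category (e.g. compliance) would block regardless of the aggregate rate.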
      So let's say you have 100 users on your website. That's an important part, so I can take some time here. Many people might have heard about P99 and P95; so what are these things? Let's say I want to measure the latency of a product, and 90% of people are getting a response at 100 milliseconds while 10% of people are getting it at, let's say, 1,000 milliseconds. If I take the average, the average will come out to a number that looks small, but ten percent of people are still facing the issue. P95 means that ninety-five percent of people are getting their response within that latency, which is a better number to track, right? Or I can use P99, which is the most used metric: ninety-nine percent of people should be able to get their results within a particular latency, let's say a hundred milliseconds, because averages do not work there. So you have to write these kinds of latency metrics. And then, yes, this is sampling-based quality, and this is user feedback. Now here is one more important thing: apart from only running your evaluations, online evaluations have one more input. After an AI tool has finished its job, you will see a thumbs-up or thumbs-down option, so that you are able to integrate that signal back into your product, right? That is hard feedback. Soft feedback could be that people are trying to generate the same answer again and again, or they are not closing the session and are just going ahead, maybe frustrated with the answers.
      That means you have soft feedback, which you should also consider. And there are other signals, such as users abandoning the session before buying or selling something, or escalating to support, which mean that your evaluations may not be working. So consider these signals as a kind of evaluation in your product too, right? And then drift detection is already there. Yes. And then we have A/B testing: make sure that you also do some A/B testing with various kinds of prompts and models, so that you are able to understand how the evaluations run there and what the user experience is. And then we have the last part.
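[Editor's note] The latency example above can be checked numerically. This is a minimal sketch using the hypothetical numbers from the discussion (90 requests at 100 ms, 10 at 1,000 ms) and the simple nearest-rank percentile method:

```python
def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: p% of samples fall at or below this value."""
    ordered = sorted(values)
    k = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[k]

latencies_ms = [100] * 90 + [1000] * 10

avg = sum(latencies_ms) / len(latencies_ms)   # 190 ms -- looks healthy
p50 = percentile(latencies_ms, 50)            # 100 ms
p95 = percentile(latencies_ms, 95)            # 1000 ms -- exposes the slow tail
p99 = percentile(latencies_ms, 99)            # 1000 ms
print(avg, p50, p95, p99)
```

The 190 ms average hides the fact that one user in ten waits a full second; P95 and P99 surface that tail, which is why production dashboards report percentiles rather than means.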

    24. AG

      Can you say a little bit more about that? So are you A/B testing the evals?

    25. AC

      Yeah. So, okay, the evals themselves cannot be A/B tested. What we do is A/B test the prompts and the models, and then we see in production how the evals perform there, the evals plus the hard and soft feedback from the users.

    26. AG

      Okay.

    27. AC

      So it might be possible that your evaluations are passing, but the user has something else to say, right? So this A/B testing will make sure that you are building the right stuff for the users, and you are not made too shortsighted by your synthetic evaluations.

    28. AG

      Yeah, makes sense. You can't just only rely on the online evals. You have to actually look at the user data.

    29. AC

      Yes, yes, correct. Because understand, your evaluation did not come out of your own mind. You have gone ahead and done a lot of research with subject matter experts, right? And here also you have to understand that you should not lock yourself into a fixed mindset. You are going to keep evolving with the user feedback as well, and there is no better method to understand the user than A/B testing, right?

    30. AG

      Got it.

  13. 57:52 – 1:01:24

    Real-life eval examples

    1. AG

      Can you maybe bring this back for us with some real-life examples of why evals matter?

    2. AC

      Yes. So I'll give you some examples. But before I talk about the examples: there is a major confusion in the world that evals are nothing but a fancy QA role that has now been handed to the product manager. Okay.

    3. AG

      Yes.

    4. AC

      That could not be further from reality, because now you have seen the process. A QA is not involved with the subject matter experts. A QA is not improving the prompt or the product; they are informing. And information is different from transformation. As a product manager, you are transforming your product, while as a QA, generally, you are giving information to the developer: "No, this is not working." So there is a difference between a transformational role and an informational role. Don't think that if you are doing evals, you are just doing the job of a QA. It is much more than that, although the terms overlap. Yes.

    5. AG

      [chuckles]

    6. AC

      Yeah. Now talking about why evaluations matter, I'll give you some cool examples. First, reliability and trust. Good evaluations can give you reliability and trust. Consider the example of Grammarly, which works across multiple languages: if one tone error can change meaning across 500-plus scenarios, there is a big trickle-down effect, right? So if you're writing good evals, you are making sure the tone and everything is matched correctly. Something similar actually happened with GitHub Copilot. When it initially launched, they had a very small error: there was a mistake in a YAML file that was not caught, and they did not have evaluations for it. When people started using it and moving code to production, many of their products were breaking, right? If they had written an evaluation, this would not have happened, and because it is a product at scale, it faced a lot of repercussions. Then we have Klarna. When Klarna initially developed their AI chatbot, they were focusing on metrics such as how many people are looking at the chatbot and how many people say it is helpful. But soon they understood where they needed to push people in the conversion funnel. There are business metrics we also need to take care of. They transformed their strategy, and now their AI-led suggestions are increasing checkout conversion rates, right? So don't think evals are only for the users; you should also evaluate your AI on business metrics. And then, finally, chatbots. Let's say you created a chatbot.
      You created a chatbot on, let's say, the information that is available right now. But in the future, if you add more products to your system, if the context is changing for the users, if user behavior is changing and your products are changing, then you have to make sure that your evaluations are always running, always online, right? This is a very common use case: chatbots being built for customer support with AI. If you only give such a chatbot older information, if you don't keep it relevant, then it is going to give outdated information to the users, and that is going to fail the whole effort. And you will not be able to know whether the information is old or new unless you are running the evaluations. So evaluations are going to play a very important role in all of these things. And in the end, my major takeaway

  14. 1:01:24 – 1:03:49

    Key takeaway

    1. AC

      from this session, for everyone, would be that evaluations are not optional. They are the guardrails for all AI-driven outcomes. Don't think that you'll be able to create a very solid, complex product without the right kind of evals. And evaluation is not a one-time goal. It is an ongoing journey that will keep evolving as your product evolves.

    2. AG

      So there you have it, folks. We walked you through, if you recall, at the very beginning, the intuition for what evals are. If you're asking for instructions for making chai, even the same LLM model is going to give you three different responses. That's non-determinism in action. To check whether those responses are acceptable, whether they are hallucinating, whether they are at the quality and the length that we want, we would create evals for those. Some of those would be code-based evals. Some would be LLM-judge evals. Some would be hybrid human evals, where you bring in a domain expert, a subject matter expert, to help you. And so evals are not just QA rebranded. They are a critical skill; they're almost part of the PRD. We saw the long document that Ankit created. We are going to include the links for everything we shared in the description below this episode, and in the newsletter accompanying and summarizing this episode, so that'll give you all the resources to go build these evals yourself. Don't be scared of adding AI to your features, guys. AI is just another API, and with these evals, you will be able to handle the non-determinism of it. Ankit, thank you so much for this masterclass in evals.

    3. AC

      Thanks a lot, Aakash. Happy to be here and happy to be helpful.

    4. AG

      See y'all later.

    5. AC

      Take care, guys.

    6. AG

      I hope you enjoyed that episode. If you could take a moment to double-check that you have followed on Apple and Spotify podcasts, subscribed on YouTube, left a rating or review on Apple or Spotify, and commented on YouTube, all these things will help the algorithm distribute the show to more and more people. As we distribute the show to more people, we can grow the show, improve the quality of the content and the production to get you better insights to stay ahead in your career. Finally, do check out my bundle at bundle.aakashg.com to get access to nine AI products for an entire year for free. This includes Dovetail, Mobbin, Linear, Reforge Build, Descript, and many other amazing tools that will help you as an AI product manager or builder succeed. I'll see you in the next episode.

Episode duration: 1:03:59


Transcript of episode Raa3qjEBvKE
