The ONE AI Skill Every Product Manager NEEDS in 2026
Aakash Gupta · 80 min read · 16,094 words

- Aakash Gupta
Why do PMs need to be good at AI Evals?
- Hamel Husain
Okay, so there are three things that are really important. One, evals give you a way as a PM to inject your taste and your judgment directly into the critical path of the AI product being developed. The second thing is that evals are really important in helping you iterate. The most effective way to do that is using evals, specifically looking at data in a very structured way. And then the third thing is scale. By mastering evals, you can make sure that you scale your taste, judgment, and user requirements across all the AI workloads that are running.
- Aakash Gupta
When it comes to AI evals, Hamel Husain and Shreya Shankar are known as the worldwide leading experts. Companies like OpenAI and Arize go to them, and today we're gonna learn everything you need to know about evals from them. What is the most critical skill to develop for PMs who want to build AI features?
- Shreya Shankar
Hands down, error analysis: the ability to look at your outputs and systematically figure out what makes for a bad output, quantify how many of each of those failure modes you see in a big batch of traces from your system, and then figure out how to turn that measurement into a continuous flywheel of improving your product.
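The error-analysis loop described here, annotating a batch of traces and quantifying how often each failure mode shows up, can be sketched in a few lines. This is a minimal illustration, not code from the course; the trace structure and failure-mode labels are assumptions made up for the example.

```python
from collections import Counter

# Hypothetical annotated traces: each one pairs a model output with the
# failure modes a human reviewer noted (empty list = no problems found).
annotated_traces = [
    {"output": "trace 1", "failure_modes": ["hallucinated_citation"]},
    {"output": "trace 2", "failure_modes": []},
    {"output": "trace 3", "failure_modes": ["ignored_instructions", "too_verbose"]},
    {"output": "trace 4", "failure_modes": ["hallucinated_citation"]},
]

def failure_mode_counts(traces):
    """Tally how often each failure mode appears across a batch of traces,
    so the most common problems can be prioritized for fixing."""
    counts = Counter()
    for trace in traces:
        counts.update(trace["failure_modes"])
    return counts

counts = failure_mode_counts(annotated_traces)
for mode, n in counts.most_common():
    print(f"{mode}: {n}/{len(annotated_traces)} traces")
```

Ranking failure modes by frequency is what turns one-off annotation into the "continuous flywheel": each iteration targets the most common failure, and the counts show whether the next batch of traces improved.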
- Aakash Gupta
If you guys had to build a roadmap for people who wanted to get really deep on AI evals, what topics should they learn? Really quickly: a crazy stat is that more than 50% of you listening are not subscribed. If you can subscribe on YouTube, or follow on Apple or Spotify podcasts, my commitment to you is that we'll continue to make this content better and better. And now on to today's episode. Hamel Husain and Shreya Shankar are the people the experts go to for evals. OpenAI, Arize AI: those people are going to them for evals, and we have them on the podcast today. Welcome, Shreya and Hamel.
- Shreya Shankar
Hey.
- Hamel Husain
Thank you. Nice to be here.
- Aakash Gupta
Why do PMs need to be good at AI Evals?
- Hamel Husain
Okay, so there are three things that are really important. One, evals give you a way as a PM to inject your taste and your judgment directly into the critical path of the AI product being developed. As we all know, PMs spend a lot of time gaining context from customers and user feedback. They're writing PRDs, trying to give context to engineers, and hoping the engineers faithfully carry out their vision. What evals give you is a way to make sure that your taste and all of that context, if done correctly, is on the critical path when your engineering team is developing those AI products. The second thing is that evals are really important in helping you iterate. Nothing is set in stone. You're constantly changing your requirements as you learn more about your customers. The most effective way to do that is using evals, specifically looking at data in a very structured way, which is one of the things that Shreya and I teach. That allows you to refine and have really fast feedback loops. And then the third thing is scale. By mastering evals, you can scale your taste, judgment, and user requirements across all the AI workloads that are running, in a way you just couldn't before, because you can bake a lot of this into evals that use AI themselves. You just have to make sure you do it correctly, which means aligning the AI with yourself as a PM through a process that we teach.
And as long as you do that correctly, in such a way that you develop trust in the AI that is doing the eval, and there's a way to do that alignment, then you can really scale yourself. A lot of times PMs, or not just PMs but people at large, view evals as a monotonous task that they just want someone else to do. It's, "Oh, I have to look at data. I have to annotate data. Who's gonna do this?" You don't wanna give up that leverage, because when you build that foundation of evals, you have immense leverage. It's a really quick way to exert lots of influence over the process, in a good way. And so this is why I would encourage PMs to really pay attention to this.
- Aakash Gupta
Can you guys precisely define evals?
- Shreya Shankar
Yeah, I can take this one. An eval is some systematic measurement of some aspect of quality. So what varies in an eval is, first, what that criterion is. For example, maybe it's conciseness of a response. And second, how you want to measure it. Maybe I define that by word length or sentence length, or maybe it's some very complex, bespoke human judgment, something more subjective. Those two things make up an eval. And oftentimes, products actually have a suite of evals. I've never seen just one eval doing the job. I see three to five, sometimes even up to 10 evals that are really important for a product.
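The two parts of this definition, a criterion and a way to measure it, can be made concrete with a tiny sketch. The function below is illustrative only: the 100-word threshold and the suite names are made up for the example, not numbers or names from the episode.

```python
def conciseness_eval(response: str, max_words: int = 100) -> bool:
    """One eval: the criterion is conciseness, and the chosen measurement
    is word count. Returns a binary pass/fail judgment. The 100-word
    threshold is an illustrative choice, not a recommendation."""
    return len(response.split()) <= max_words

# Products usually run a small suite of such checks, not just one.
eval_suite = {
    "conciseness": conciseness_eval,
    # "groundedness": ...,  # e.g. a more subjective, human- or LLM-judged criterion
}

print(conciseness_eval("Short and to the point."))  # True
print(conciseness_eval("word " * 150))              # False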
- Aakash Gupta
People say that if you get evals right, you've gotten the hardest part of the AI product solved. Is that accurate?
- Shreya Shankar
I think it's accurate now. Hamel, what do you think?
- Hamel Husain
I think it's totally accurate.
- Shreya Shankar
Yeah.
- Hamel Husain
Just like anything else, it's the process of creating the evals that provides all the value, not necessarily the eval itself. It's the journey that creates all the value. And so once you've done all of that work, once you've looked at all of your data, iterated on your system, and thought very carefully, and oftentimes scientifically, about how to improve your system, you've already gotten 99% of the way there.
- Shreya Shankar
The way that I like to think about it is if you ever want your product to make it past one iteration, you need evals. I've never seen somebody make it through multiple iterations of their product without any evals. But once you have good evals in place, then evals are not necessarily the bottleneck for you. But that's a good thing. That's how it should be, right? You should be able to focus on building out other aspects of the product, making things faster, making things feel better, more intuitive, um, you know, everything beyond that.
- Aakash Gupta
Why can't you just rely on human evals? Like, the PM looks at the feature, the engineers look at the feature, and they feel like those outputs are good enough.
- Shreya Shankar
Oh, I love this question on vibe checks. Hamel and I teach our course and pitch it in a way that we are helping you codify, operationalize, and scale up your vibe checks. Your vibe checks are very important, but they don't scale, right, 'cause they involve you, the human. It's very hard to onboard other people to do the vibe checks in the same way as you. I would have to observe you do this thousands of times, look at outputs, try to build my own rubric or mental model of what you're doing, and then I'd have no good way of teaching other people how to do this. So doing evals just means taking your vibe checks and translating them into something concrete. In our course, we define that as a rubric of binary criteria. Every criterion could be complex, that's fine, could be subjective, that's fine, but you'd better have a very precise definition of pass and fail, some examples of pass, some examples of fail. And we also teach people ways to measure alignment on those results. That's really what this whole process is about.
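A "rubric of binary criteria" with pass/fail definitions, examples of each, and a way to measure alignment can be sketched as a small data structure. The field names and the agreement metric below are an illustrative structure under my own assumptions, not the course's actual schema or methodology.

```python
from dataclasses import dataclass, field

@dataclass
class BinaryCriterion:
    """One rubric entry: a precise pass/fail definition plus examples of
    each. Field names are illustrative, not the course's exact schema."""
    name: str
    pass_definition: str
    fail_definition: str
    pass_examples: list = field(default_factory=list)
    fail_examples: list = field(default_factory=list)

def agreement_rate(human_labels, judge_labels):
    """A simple alignment measure: the fraction of outputs where an
    automated judge matches the human's pass/fail label."""
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

criterion = BinaryCriterion(
    name="no_unsupported_claims",
    pass_definition="Every factual claim is supported by the provided context.",
    fail_definition="At least one claim is not supported by the context.",
    pass_examples=["(a fully grounded answer)"],
    fail_examples=["(an answer with an invented fact)"],
)

# Human labels vs. a judge's labels on four outputs: 3 of 4 agree.
print(agreement_rate([True, True, False, True], [True, False, False, True]))  # 0.75
```

Writing the rubric down in a form like this is exactly what makes the vibe check transferable: another annotator, or an automated judge, can be checked against your labels with a simple agreement rate instead of thousands of hours of observation.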
- Aakash Gupta
I think the critical phrase there is binary criteria. Why binary?
- Hamel Husain
Yeah, so binary really is a heuristic in a sense, a simplification that works for most people. The thing is, a lot of people try to assign scores, let's say on a rating scale of one to five. That's usually a really bad idea, because no one knows what that means. If you have an average score of 3.2 versus an average score of 3.7, what does that really mean? That can be very hard to calibrate, and you have to work incredibly hard to make sense of it. So binary judgments force you to make a pass-fail decision, and that tends to correlate with the fact that you're going to have to ship this product. Do you wanna ship it or not? It really distills that decision-making down into the annotation. And for the vast majority of people, that's the right choice.
- Shreya Shankar
Yeah, and to provide a little extra context on the background of LLM-as-judge, and why people have a lot of variance in whether they want it to be binary or rating-scale based: LLM-as-judge has been around since before these foundation models, with regular language models fine-tuned to serve as judges. In those cases, people, A, had a lot of preference data about what is good and bad, maybe even on a fine-grained scale, and B, could fine-tune models to be aligned with that preference data. Today's world of LLM judges is very different. We don't see people fine-tuning judge models as much. We see people trying to use off-the-shelf models, still wanting to align them with their complex, subjective criteria, and now the alignment problem is much harder, right? You can't steer the LLM in the way that you could before. And for that reason, we say limit yourself to binary, because that is what the LLM can do very well. All you have to do is provide examples of pass, provide examples of fail, and have very simple, good rubrics. People find that much easier to do than, say, rating on a scale of one to five. Okay, now you need to provide examples for one, for two, for three, for four, for five. You need to have descriptions of what makes a one different from a two. All of these things, the pairwise interactions between all these ratings, just explode in complexity, and we never see people successfully operationalize that at the rate at which they can do binary evals.
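The recipe described here, an off-the-shelf model given a rubric plus few-shot pass and fail examples and asked for a binary verdict, can be sketched as prompt assembly. Which model the prompt is sent to, and how, is deliberately left out; the prompt wording and function name are my own illustrative assumptions, not the course's exact template.

```python
def build_binary_judge_prompt(criterion, pass_examples, fail_examples, output):
    """Assemble a binary LLM-as-judge prompt: a criterion, few-shot pass
    and fail examples, and the output to judge, asking for PASS or FAIL
    only (no 1-to-5 scale to calibrate)."""
    shots = [f"Output: {ex}\nVerdict: PASS" for ex in pass_examples]
    shots += [f"Output: {ex}\nVerdict: FAIL" for ex in fail_examples]
    return (
        f"You are judging outputs against this criterion:\n{criterion}\n\n"
        + "\n\n".join(shots)
        + f"\n\nOutput: {output}\nVerdict (PASS or FAIL only):"
    )

prompt = build_binary_judge_prompt(
    criterion="The response directly answers the user's question.",
    pass_examples=["Yes, the flight departs at 9:15 AM from gate B4."],
    fail_examples=["There are many factors to consider when flying."],
    output="Your refund was issued on March 3rd.",
)
print(prompt)
```

Note how the binary framing keeps the few-shot section small: one bucket of pass examples and one of fail, instead of separate examples and boundary descriptions for each of five ratings.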
- Hamel Husain
And a lot of times, non-binary evals, like ratings of one to five, are a smell of intellectual laziness.
- Shreya Shankar
[laughs]
- Hamel Husain
Like, the work hasn't been done-
- Shreya Shankar
[crosstalk]
- Hamel Husain
... to actually make a call on what is good enough and what's not good enough. And it's kind of, "Oh, we don't really know. Let's just capture these rough things in a score, 'cause otherwise we're gonna lose something." The binary scale really forces you to be very clear about what you want.
- Aakash Gupta
AI evals are one of the most important skills for PMs, and I know you know they matter. The question is: are you doing them right? Most teams are winging it with basic metrics and hoping for the best. Meanwhile, the teams that actually ship reliable AI have cracked the code on systematic evaluation. Today's episode is brought to you by the AI Evals for Engineers and PMs course by Hamel Husain and Shreya Shankar. This live Maven course will teach you the battle-tested frameworks from Hamel and Shreya, the engineers behind GitHub Copilot's evaluation system and 25-plus production AI implementations. Four weeks, live instruction, next cohort starts July 21st. Start shipping AI that actually works. Enroll at maven.com with my code AG-PRODUCT-GROWTH for over $800 off. That's AG-PRODUCT-GROWTH.

Today's episode is also brought to you by Jira Product Discovery. If you're like most product managers, you're probably in Jira, tracking tickets and managing the backlog. But what about everything that happens before delivery? Jira Product Discovery helps you move your discovery, prioritization, and even roadmapping work out of spreadsheets and into a purpose-built tool designed for product teams. Capture insights, prioritize what matters, and create roadmaps you can easily tailor for any audience. And because it's built to work with Jira, everything stays connected from idea to delivery. Used by product teams at Canva, Deliveroo, and even The Economist. Check out why and try it for free today at atlassian.com/product-discovery. That's A-T-L-A-S-S-I-A-N.com/product-discovery. Jira Product Discovery: build the right thing.

I've heard that LLMs are also not very good at one-to-five ratings. Is that true?
- Shreya Shankar
They're good at what they're trained on. [laughs] Somewhere out there in the world, I am sure there is a task with very clear, simple one-to-five ratings, and the LLM is good at that. But to make such a blanket statement for all products and all use cases is very hard to do. And that's the message we want to hammer home to every single product manager who takes the course. Look, you think the LLM might be able to do something. You saw an instance of an LLM being able to do the task for some other domain. That doesn't mean it's gonna translate to your domain or your use case. You still have to put in this work, and just don't trust any... Hamel has a great way of saying this. Maybe he should talk about it, but he always tells people, "Never trust it. Always put on your detective hat." Hamel, you wanna talk about that?
- Hamel Husain
Yeah. What underlies the entire process of evals is the scientific method, something we all learned in high school, but really applied in this context. What you have to do is be very skeptical of everything, do lots of experiments, and prove to yourself that the thing you're trying to achieve, or some new complexity you wanna add, whatever it is, is actually working, and try to do it in the simplest way. And build intuition by doing lots of experiments. But the point is to measure those, record those, and go through it in a structured way, rather than relying on those vibe checks you asked about earlier. The analogy that I like to use, and I can give this to you if you ask me later, is a little video of my friend Greg Ceccarelli playing Whack-a-mole. It's my favorite meme to use when telling people about the need for evals. Without evals, you're playing Whack-a-mole: you see a problem, okay, hammer it with some tool or a prompt change. Then another problem comes up, you hammer that, and you keep going. You don't really make any progress. It's only with evals that you can systematically solve the problem without going in circles.
Episode duration: 1:34:32
Transcript of episode g_3LJ2QBOQE