Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar

Lenny's Podcast · Sep 25, 2025 · 1h 46m

Lenny Rachitsky (host), Hamel Husain (guest), Narrator, Shreya Shankar (guest)

Definition and purpose of AI evals for LLM applications
Error analysis and open coding of traces (manual, human-led review)
Using LLMs to cluster errors into actionable failure modes (axial codes)
Designing automated evaluators: code-based tests vs. LLM-as-judge
Aligning LLM judges with human judgments and avoiding bad metrics
Operationalizing evals: unit tests, monitoring, dashboards, and flywheels
Debates and misconceptions around evals, vibes, and A/B testing

AI evals: The new must-have superpower for serious product builders

The episode argues that systematic AI evals—structured ways to measure and improve LLM applications—are becoming a core skill for PMs and engineers, comparable to knowing how to write PRDs or run A/B tests.

Hamel Husain and Shreya Shankar walk through a concrete, end‑to‑end eval workflow using a real real-estate assistant: manual error analysis on traces, open coding, clustering failures with LLMs, and then building focused automated evaluators (code-based and LLM-as-judge).

They emphasize that good evals start with looking at real product data, not with abstract benchmarks or generic tools, and that you only need a small number of well-chosen evals to unlock large product gains.

The conversation also unpacks common misconceptions and Twitter drama around evals, arguing that ‘vibes’ and A/B tests are not alternatives but sit inside a broader, data-science‑driven eval practice that top AI teams quietly rely on.

Key Takeaways

Start with manual error analysis, not with writing tests.

Before building any evals, inspect real traces from your AI product, write quick notes about what went wrong (open coding), and look for upstream errors. ...

Use a ‘benevolent dictator’ to own qualitative judgments.

Avoid design-by-committee for labeling and error notes; appoint a single domain expert—often the PM—to make final calls on what counts as ‘good’ or ‘bad’. ...

Cluster your notes into a small set of failure modes.

Feed your open-coded notes into an LLM to synthesize axial codes (categories like ‘human handoff issues’ or ‘conversational flow problems’), then refine them into specific, actionable buckets and count their frequency to prioritize what to fix.
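
As a rough illustration of the counting step, the sketch below tallies how often each failure mode appears once notes have been mapped to axial codes. The trace IDs, the third category name, and the `rank_failure_modes` helper are hypothetical and not taken from the episode; the LLM clustering step itself is assumed to have already happened.

```python
from collections import Counter

# Hypothetical open-coded notes, each already assigned an axial code
# (failure mode). IDs and labels are illustrative only.
coded_notes = [
    {"trace_id": "t1", "failure_mode": "human handoff issues"},
    {"trace_id": "t2", "failure_mode": "conversational flow problems"},
    {"trace_id": "t3", "failure_mode": "human handoff issues"},
    {"trace_id": "t4", "failure_mode": "formatting errors"},
    {"trace_id": "t5", "failure_mode": "human handoff issues"},
    {"trace_id": "t6", "failure_mode": "conversational flow problems"},
]


def rank_failure_modes(notes):
    """Return (failure_mode, count, share) tuples, most frequent first."""
    counts = Counter(note["failure_mode"] for note in notes)
    total = sum(counts.values())
    return [(mode, n, n / total) for mode, n in counts.most_common()]


for mode, count, share in rank_failure_modes(coded_notes):
    print(f"{mode}: {count} ({share:.0%})")
```

The most frequent bucket is a reasonable first candidate to fix, though frequency alone does not capture severity.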

Reserve LLM-as-judge evals for complex, subjective failures.

Simple issues (JSON format, length, presence of a field) should be checked with code; use LLM judges only for nuanced behaviors (e. ...
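
The simple checks named above (JSON format, length, presence of a field) need no model at all. Here is a minimal sketch of such a code-based evaluator; the `check_response` name, the `answer` field, and the 500-character limit are assumptions chosen for illustration.

```python
import json


def check_response(raw, required_fields=("answer",), max_chars=500):
    """Return a list of failure descriptions; an empty list means pass."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return ["invalid JSON"]
    if not isinstance(payload, dict):
        return ["top-level JSON value is not an object"]

    failures = []
    # Presence check: every required field must appear in the object.
    for field in required_fields:
        if field not in payload:
            failures.append(f"missing field: {field}")
    # Length check on the raw response text.
    if len(raw) > max_chars:
        failures.append(f"response longer than {max_chars} characters")
    return failures


print(check_response('{"answer": "The showing is booked."}'))
print(check_response("Sorry, something went wrong"))
```

Returning a list of reasons instead of a bare boolean makes it easier to see which check a trace failed.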

Always validate your LLM judges against human labels.

Don’t trust a judge just because the prompt looks good: compare its outputs to your human-coded labels via a confusion matrix, and iterate until misalignments (false positives/negatives) are acceptably low instead of relying on a single ‘agreement’ percentage.
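
To make the confusion-matrix comparison concrete, the sketch below treats "fail" as the positive class and compares judge labels to human labels. The label values, function names, and sample data are assumptions for illustration. In this made-up sample the judge agrees with the human on 9 of 10 traces, yet it catches only one of the two real failures.

```python
# Map (human label, judge label) pairs to confusion-matrix cells,
# treating "fail" as the positive class.
CELL = {
    ("fail", "fail"): "tp",  # judge correctly flags a failure
    ("pass", "fail"): "fp",  # judge flags a trace the human passed
    ("fail", "pass"): "fn",  # judge misses a real failure
    ("pass", "pass"): "tn",  # both agree the trace is fine
}


def confusion_matrix(human_labels, judge_labels):
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    cells = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for pair in zip(human_labels, judge_labels):
        cells[CELL[pair]] += 1  # KeyError on any label other than pass/fail
    return cells


def summarize(cells):
    total = sum(cells.values())
    real_failures = cells["tp"] + cells["fn"]
    real_passes = cells["tn"] + cells["fp"]
    return {
        "agreement": (cells["tp"] + cells["tn"]) / total,
        "failures_caught": cells["tp"] / real_failures if real_failures else None,
        "passes_confirmed": cells["tn"] / real_passes if real_passes else None,
    }


human = ["pass"] * 8 + ["fail", "fail"]
judge = ["pass"] * 8 + ["pass", "fail"]
cells = confusion_matrix(human, judge)
print(cells)
print(summarize(cells))
```

Looking at the two per-class rates separately is what exposes a judge that looks good on overall agreement only because most traces pass.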

Operationalize evals across unit tests and live monitoring.

Once you have good evaluators, plug them into CI/unit tests for known problematic traces and into periodic monitoring on production data, so you track specific failure rates over time instead of relying on vibes or anecdotal complaints.
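
One way to picture both uses is the sketch below: the same evaluator runs as a regression test over known problematic traces and as a per-period failure rate over sampled production traces. The evaluator, trace shape, and week labels are hypothetical; in a real setup the regression traces would hold freshly generated outputs from the current version of the system.

```python
def response_not_empty(trace):
    """Hypothetical evaluator: passes when the assistant said something."""
    return bool(trace["response"].strip())


def failure_rate(traces, evaluator):
    """Fraction of traces the evaluator fails (0.0 for an empty sample)."""
    if not traces:
        return 0.0
    failures = sum(1 for trace in traces if not evaluator(trace))
    return failures / len(traces)


# Regression set: inputs that failed before a fix, re-checked on every build.
regression_traces = [
    {"id": "r1", "response": "Connecting you to an agent now."},
    {"id": "r2", "response": "The showing is booked for Tuesday."},
]


def test_regression_traces_pass():
    for trace in regression_traces:
        assert response_not_empty(trace), f"regression on {trace['id']}"


# Monitoring: the same evaluator applied to periodic production samples.
weekly_samples = {
    "2025-W01": [{"id": "a", "response": "Sure."}, {"id": "b", "response": "  "}],
    "2025-W02": [{"id": "c", "response": "Done."}, {"id": "d", "response": "Here you go."}],
}
rates = {
    week: failure_rate(sample, response_not_empty)
    for week, sample in weekly_samples.items()
}

test_regression_traces_pass()
print(rates)
```

A test runner such as pytest would pick up `test_regression_traces_pass` by name; the direct call here just keeps the sketch self-contained.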

You only need a handful of high-quality evals for big impact.

Most products end up with four to seven core LLM-as-judge evaluators; combined with some simple code-based checks and ongoing trace review, this is enough to sharply improve product quality and maintain a durable moat.

Notable Quotes

“To build great AI products, you need to be really good at building evals. It's the highest ROI activity you can engage in.”
— Hamel Husain

“The goal is not to do evals perfectly. It's to actionably improve your product.”
— Shreya Shankar

“You can appoint one person whose taste you trust. It should be the person with domain expertise. Oftentimes, it is the product manager.”
— Hamel Husain

“People have been burned by evals in the past... They did evals badly, then they didn't trust it anymore, and then they're like, 'Oh, I'm anti-evals.'”
— Shreya Shankar

“There’s no world in which they are just being like, 'I made Claude Code. I'm never looking at anything.' All of this is evals.”
— Shreya Shankar

Questions Answered in This Episode

How do I practically choose which 4–7 failure modes deserve their own LLM-as-judge eval in my specific product?

What’s a good rule of thumb for when a failure should be addressed with a prompt change versus building a dedicated evaluator?

How can smaller teams without dedicated data scientists build lightweight tools and workflows for error analysis and trace review?

In products where ‘quality’ is highly subjective (e.g., creativity, tone), how do you define a binary pass/fail rubric without overconstraining the model?

How should eval strategies evolve as a product matures from MVP with low traffic to a scaled product serving millions of users?

Transcript Preview

Lenny Rachitsky

(instrumental music) To build great AI products, you need to be really good at building evals.

Hamel Husain

It's the highest ROI activity you can engage in. This process is a lot of fun. Everyone that does this immediately gets addicted to it. When you're building an AI application, you just learn a lot.

Lenny Rachitsky

What's cool about this is you don't need to do this many, many times. For most products, you do this process once and then you build on it.

Shreya Shankar

The goal is not to do evals perfectly. It's to actionably improve your product.

Lenny Rachitsky

I did not realize how much controversy and drama there is around evals. There's a lot of people with very strong opinions. (laughs)

Shreya Shankar

People have been burned by evals in the past. People have done evals badly, so then they didn't trust it anymore, and then they're like, "Oh, I'm anti-evals."

Lenny Rachitsky

What are a couple of the most common misconceptions people have with evals?

Hamel Husain

The top one is we live in the age of AI. Can't the AI just eval it? But it doesn't work.

Lenny Rachitsky

A term that you used in your posts that I love is this idea of a benevolent dictator.

Hamel Husain

When you're doing this open coding, a lot of teams get bogged down in having a committee do this. For a lot of situations, that's wholly unnecessary. You don't want to make this process so expensive that you can't do it. You can appoint one person whose taste that you trust. It should be the person with domain expertise. Oftentimes, it is the product manager.

Lenny Rachitsky

(instrumental music) Today, my guests are Hamel Husain and Shreya Shankar. One of the most trending topics on this podcast over the past year has been the rise of evals. Both the chief product officers of Anthropic and OpenAI shared that evals are becoming the most important new skill for product builders. And since then, this has been a recurring theme across many of the top AI builders I've had on. Two years ago, I had never heard the term evals. Now it's coming up constantly. When was the last time that a new skill emerged that product builders had to get good at to be successful? Hamel and Shreya have played a major role in shifting evals from being an obscure, mysterious subject to one of the most necessary skills for AI product builders. They teach the definitive online course on evals, which happens to be the number one course on Maven. They've now taught over 2,000 PMs and engineers across 500 companies, including large swaths of the OpenAI and Anthropic teams, along with every other major AI lab. In this conversation, we do a lot of show versus tell. We walk through the process of developing an effective eval, explain what the heck evals are and what they look like, address many of the major misconceptions with evals, give you the first few steps you can take to start building evals for your product, and also share just a ton of best practices that Hamel and Shreya have developed over the past few years. This episode is the deepest yet most understandable primer you will find on the world of evals, and honestly, it got me excited to write evals, even though I have nothing to write evals for. I think you'll feel the same way as you watch this. If this conversation gets you excited, definitely check out Hamel and Shreya's course on Maven. We'll link to it in the show notes. If you use the code LENNIESLIST when you purchase the course, you'll get 35% off the price of the course. With that, I bring you Hamel Husain and Shreya Shankar.
(instrumental music) Hamel and Shreya, thank you so much for being here and welcome to the podcast.
