Aakash Gupta
The ONE AI Skill Every Product Manager NEEDS in 2026
At a glance
WHAT IT’S REALLY ABOUT
AI evals: the core PM skill to ship reliable products
- AI evals let PMs encode their taste and product judgment into the development critical path instead of relying on unscalable “vibe checks.”
- The single most important eval competency is systematic error analysis: identifying failure modes, quantifying them across traces, and turning findings into an improvement flywheel.
- Effective evals are domain-specific and are built through an iterative, scientific-method process—generic vendor metrics (e.g., “hallucination scores”) often fail in real products.
- Binary, rubric-driven pass/fail criteria typically outperform 1–5 rating scales because they force clarity, reduce calibration overhead, and are easier to align LLM-judges to.
- Evals become a product moat: a portable evaluation pipeline enables fast model swaps, safer iteration (avoiding overfitting), and makes fine-tuning a “last step” rather than the starting point (see the sketch after this list).
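For illustration only, here is a minimal Python sketch of what a portable, model-agnostic eval pipeline can look like; `EvalCase`, `run_evals`, and the commented-out model wrappers are hypothetical names, not tooling shown in the episode.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical minimal eval harness: the pipeline owns the test cases and the
# binary checks, so swapping models only means swapping the `model` callable.

@dataclass
class EvalCase:
    prompt: str
    passes: Callable[[str], bool]  # binary pass/fail check for this case

def run_evals(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case against `model` and return the pass rate."""
    results = [case.passes(model(case.prompt)) for case in cases]
    return sum(results) / len(results)

# Illustrative case; real suites come from error analysis on production traces.
cases = [
    EvalCase(
        prompt="Summarize: order #123 shipped late; customer wants a refund.",
        passes=lambda out: "refund" in out.lower(),
    ),
]
# baseline = run_evals(call_current_model, cases)     # hypothetical model wrapper
# candidate = run_evals(call_candidate_model, cases)  # swap models, keep the evals
```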
IDEAS WORTH REMEMBERING
5 ideas
Evals are how PMs “ship their judgment,” not just requirements docs.
Hamel argues evals put the PM’s taste directly on the product’s critical path, reducing reliance on engineers interpreting intent from PRDs and meetings. The eval rubric becomes an executable version of product quality.
Treat evals as an iterative scientific workflow, not a dashboard checkbox.
Hamel and Shreya emphasize skepticism, experimentation, and structured measurement; otherwise teams end up in “whack-a-mole” prompt tweaking without durable progress. The process of creating evals is where most of the value is generated.
Start with error analysis; it tells you what to evaluate and what to fix.
Shreya calls error analysis the hands-down most critical skill: review traces, identify failure modes, and quantify prevalence. This prevents infinite metric sprawl (toxicity/conciseness/hallucination, etc.) by grounding priorities in observed failures.
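To make “quantify prevalence” concrete, here is a small Python sketch (not from the episode; the traces and failure-mode labels are made up) that tallies how often each failure mode shows up in a batch of reviewed traces:

```python
from collections import Counter

# Hypothetical error-analysis tally: each reviewed trace gets zero or more
# failure-mode labels during review; prevalence decides what to fix first.
traces = [
    {"id": "t1", "failure_modes": ["wrong_price"]},
    {"id": "t2", "failure_modes": []},
    {"id": "t3", "failure_modes": ["wrong_price", "missing_cta"]},
    {"id": "t4", "failure_modes": ["tone_mismatch"]},
]

counts = Counter(mode for t in traces for mode in t["failure_modes"])
total = len(traces)

# The most common failure modes are the evals worth building first.
for mode, n in counts.most_common():
    print(f"{mode}: {n}/{total} traces ({n / total:.0%})")
```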
Codify vibe checks into binary rubrics to scale and reduce ambiguity.
Binary pass/fail criteria force a shipping-level decision and avoid the interpretability problem of averages like 3.2 vs 3.7. They also make LLM-as-judge alignment more tractable because you can provide clear pass/fail examples.
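A minimal sketch of how a binary rubric can drive an LLM-as-judge; the rubric items, `JUDGE_PROMPT`, and the `call_llm` wrapper are assumptions for illustration, not the speakers’ actual prompt:

```python
# Hypothetical binary LLM-as-judge setup: the rubric forces a shipping-level
# PASS/FAIL verdict instead of a 1-5 rating. `call_llm` is an assumed wrapper
# around whatever model API the team already uses.

JUDGE_PROMPT = """You are reviewing an AI-written sales email.

Rubric (answer FAIL if ANY check fails, otherwise PASS):
1. Every factual claim about the product appears in the provided product notes.
2. The email addresses the recipient's stated pain point.
3. The email contains exactly one call to action.

<examples of PASS and FAIL emails, with reasons, go here>

Email:
{email}

Answer with a single word: PASS or FAIL."""

def judge(email: str, call_llm) -> bool:
    """Return True only if the judge answers PASS."""
    verdict = call_llm(JUDGE_PROMPT.format(email=email)).strip().upper()
    return verdict.startswith("PASS")
```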
Domain-specific hallucination evals beat generic “hallucination scores.”
For cases like AI-written sales emails, the team must define what hallucination means in that domain, label examples, and iteratively prompt/validate an LLM judge against human labels. Generic vendor metrics often don’t match your product’s real failure modes, leading to dashboards nobody trusts.
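One plausible way to validate such a judge against human labels is a simple agreement check on a small labeled set, sketched below with placeholder data:

```python
# Hypothetical judge-validation step: before trusting the judge as a metric,
# compare its PASS/FAIL verdicts to human labels on the same examples.
# The label lists below are illustrative placeholders, not real data.
human_labels = [True, False, True, True, False, False, True, False]  # True = pass
judge_labels = [True, False, True, False, False, True, True, False]

pairs = list(zip(human_labels, judge_labels))
agreement = sum(h == j for h, j in pairs) / len(pairs)

# Split disagreements: an over-lenient judge and an over-harsh judge
# usually need different prompt fixes.
false_passes = sum(j and not h for h, j in pairs)  # judge passed, human failed it
false_fails = sum(h and not j for h, j in pairs)   # judge failed, human passed it

print(f"agreement: {agreement:.0%}  false passes: {false_passes}  false fails: {false_fails}")
```

Iterating on the judge prompt until agreement with human labels is acceptably high is what makes the resulting dashboard worth trusting.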
WORDS WORTH SAVING
5 quotes
Hands down, error analysis, the ability to look at your outputs and systematically figure out what makes for a bad output, quantify how many of these failure modes you see in a big batch of traces for your system, and then figure out how to turn that measurement into a continuous flywheel of improving your product.
— Shreya Shankar
Evals give you a way as a PM to inject your taste and your judgment directly into the critical path of the AI product being developed.
— Hamel Husain
Your vibe checks are very important, but they don't scale, right, 'cause they involve you, the human.
— Shreya Shankar
A lot of times, non-binary evals, like ratings of one to five, that is a smell of intellectual laziness.
— Hamel Husain
I think evals are the moat for AI products, and, like, truly nothing else.
— Shreya Shankar