
The ONE AI Skill Every Product Manager NEEDS in 2026

Today, we’ve got some of our most requested guests yet: Hamel Husain and Shreya Shankar, creators of the world’s best AI evals cohort. You’ll learn everything you need to know about AI evals: how to build them, common mistakes to avoid, and much more. If I were you, I’d stop everything, binge-watch it right now, and make an action plan to execute tomorrow. We’ve also done a newsletter deep dive with them; check it out: AI Evals: Everything You Need to Know to Start - https://www.news.aakashg.com/p/ai-evals

🎥 Timestamps:
Preview - 00:00
Three reasons PMs NEED evals - 02:06
Why PMs shouldn't view evals as monotonous - 04:40
Are evals the hardest part of AI products solved? - 06:23
Why can't you just rely on human "vibe checks"? - 07:37
Ads - 12:11
Are LLMs good at 1-5 ratings? - 14:06
The "whack-a-mole" analogy without evals - 15:45
Hallucination problem in emails (Apollo story) - 16:26
How Airbnb used machine learning models - 21:22
Evaluating RAG systems - 23:56
Ads - 29:52
Hill climbing - 31:42
Red flag: suspiciously high eval metrics - 35:51
Design principles for effective evals - 39:02
How OpenAI approaches evals - 42:42
Foundation models are trained on "average taste" - 44:39
Cons of fine-tuning - 49:36
Prompt engineering vs. RAG vs. fine-tuning - 51:27
Introduction of "The Three Gulfs" framework - 53:00
Roadmap for learning AI evals - 56:04
Why error analysis is critical for LLMs - 01:01:41
Using LLM as a judge - 01:08:29
Frameworks for systematic problem-solving in labels - 01:10:15
Importance of niche and qualifying clients (pro tips) - 01:17:42
$800K for first course cohort! - 01:18:43
Why end a successful cohort? - 01:20:15
GOLD advice for creating a successful course - 01:25:49
Outro - 01:33:39

Podcast transcript: https://www.news.aakashg.com/p/hamel-shreya-podcast

💼 Check out our sponsors:
1. The AI Evals Course for PMs & Engineers: Get $800 off with this link - https://maven.com/parlance-labs/evals?promoCode=ag-product-growth
2. Jira Product Discovery: Plan with purpose, ship with confidence - https://www.atlassian.com/software/jira/product-discovery
3. Vanta: Automate compliance, security, and trust with AI (Get $1,000 with our link) - https://www.vanta.com/lp/demo-1k?utm_campaign=1k_offer&utm_source=product-growth&utm_medium=podcast
4. Product Faculty: Get $500 off the AI PM certification with code AAKASH25 - https://maven.com/product-faculty/ai-product-management-certification?promoCode=AAKASH25

👀 Where to find Hamel & Shreya:
Hamel’s LinkedIn: https://www.linkedin.com/in/hamelhusain/
Shreya’s LinkedIn: https://www.linkedin.com/in/shrshnk/

👨‍💻 Where to find Aakash:
Twitter: https://www.twitter.com/aakashg0
LinkedIn: https://www.linkedin.com/in/aagupta/
Instagram: https://www.instagram.com/aakashg0/

🔑 Key Takeaways:
1. Stop Guessing. Eval Your AI. Your AI isn’t an MVP without robust evaluations. Build in judgment, or you’re just shipping hope. Without evaluation, AI performance is a happy accident.
2. Error Analysis = Your Superpower. General metrics won’t save you. You need to understand why your AI messed up. Only then can you fix it, not just wish it worked better.
3. 99% Accuracy is a LIE. Suspiciously high metrics usually mean your evaluation setup is broken. Real-world AI is never perfect. If your evals say otherwise, they’re flawed.
4. Fine-Tuning is a Trap (Mostly). Fine-tuning is expensive, brittle, and often unnecessary. Start with smarter prompts and RAG. Only fine-tune if you must.
5. Your Data’s Wild. Understand It. You can’t eyeball everything. Without structured evaluation, you’ll drown in noise and never find the patterns or fixes that matter.
6. Models Fail to Generalize. Always. Your AI will break on new data. Don’t blame it. Adapt it. Use RAG, upgrade inputs, and stop expecting out-of-the-box magic.
7. Your Prompts Are S**T. If your AI is bad, it’s probably your fault. The cheapest, most powerful fix? Sharpen your prompts. Clearer instructions = smarter AI.
8. Let AI Teach You. Seriously. LLM judges aren’t just scoring you; they can teach you. Reviewing how your AI fails is the best way to learn what great outputs should look like.

#ai #aievals #aiproducts #aiprompt

🧠 About Product Growth: The world's largest podcast focused solely on product + growth, with over 175K listeners. Hosted by Aakash Gupta, who spent 16 years in PM, rising to VP of Product, this 2x/week show covers product and growth topics in depth.

🔔 Subscribe and like the video to support our content, and turn on the bell for notifications!

Host: Aakash Gupta · Guests: Hamel Husain, Shreya Shankar
Jul 11, 2025 · 1h 34m · Watch on YouTube ↗

CHAPTERS

  1. Why AI evals are a must-have PM skill: taste, iteration speed, and scale

    Hamel explains why PMs need to be strong at AI evaluations: they let PMs encode product taste into the build loop, create faster iteration cycles, and scale judgment across many AI workloads. Evals are framed as leverage—not a monotonous checkbox task.

  2. What an “eval” is (and why products need a suite of them)

    Shreya defines an eval as the systematic measurement of some aspect of quality, made up of a criterion and a method of measuring it. In practice, real products require multiple evals because quality is multi-dimensional. (A minimal criterion-plus-check sketch follows the chapter list.)

  3. Why “getting evals right” solves the hardest part of AI products

    Both guests argue the hardest part is the process of creating evals—because it forces teams to examine data, define success, and iterate scientifically. Once that groundwork is done, teams can move faster and focus on other product improvements.

  4. From “vibe checks” to scalable quality: binary rubrics and why they work

    Shreya explains that vibe checks are important but don’t scale or transfer across people. They recommend translating vibe checks into precise, binary pass/fail rubrics with examples—especially for aligning LLM-as-judge systems. (A judge-prompt sketch follows the chapter list.)

  5. Why generic dashboards fail: hallucinations and domain-specific eval design

    Using Aakash’s email-writing hallucination story, Hamel warns against generic vendor metrics (e.g., off-the-shelf hallucination scores). Instead, teams must characterize their specific failure modes, label examples, and iteratively build a judge they can trust.

  6. ML roots of evals: lessons from Airbnb (ranking, fraud, LTV)

    Hamel connects LLM evaluation to classic machine learning evaluation for stochastic, non-deterministic systems. He shares Airbnb ML use cases and notes that LLM product teams can borrow proven ML evaluation discipline without needing a full ML curriculum.

  7. Evaluating RAG like search: what changes when the consumer is an LLM

    Shreya and Hamel explain that RAG evaluation splits into retrieval (classic search metrics) and generation. The key nuance: because LLMs can handle large context windows, tolerances shift for metrics like recall@K, and rank position matters less than it does for human-facing search. (A recall@K sketch follows the chapter list.)

  8. Code generation and verifiable domains: why Copilot-style evals work

    Hamel describes why developer tools were early AI successes: developers are the domain experts and code is more verifiable. Copilot-like systems can run tests at scale, creating powerful harnesses that unlock rapid iteration and measurable improvements.

  9. PM-defined evals enable “hill climbing”—and the trap of overfitting

    They explain why engineers excel at optimizing well-defined metrics, but PMs must define those metrics for LLM products. They also caution that hill-climbing can lead to overfitting, and discuss how to detect and prevent it with ML-style safeguards. (A held-out-split sketch follows the chapter list.)

  10. What “good evals” look like in the wild: interfaces, scoped judges, and strong proxies

    Shreya notes no company has solved evals perfectly, but highlights converging best practices: custom labeling interfaces, well-scoped judges, and metrics that correlate with product success (like next-token prediction in coding). The discussion also covers why autocomplete works better in code than in email.

  11. Foundation benchmarks vs business evals: what OpenAI can’t do for you

    They distinguish general-purpose model benchmarks (MMLU, HumanEval) from domain-specific product evals. Foundation model labs focus on the former, while each company must define “good” for their product taste—making domain evals defensible differentiation.

  12. Evals as the moat (and where fine-tuning fits: last, not first)

    Shreya and Hamel argue eval systems and pipelines are the true moat, more than the model choice. Fine-tuning should come after evals and simpler levers (model upgrades, decomposition, RAG), because it adds ongoing operational and maintenance complexity.

  13. Prompting vs RAG vs fine-tuning: the “Three Gulfs” decision framework

    They introduce the Three Gulfs framework to decide which lever to use. Prompting addresses the gulf of specification (clear requirements), while RAG and fine-tuning address the gulf of generalization when the model lacks context or capability.

  14. A roadmap to mastery: error analysis, grounded theory, LLM-judge iteration, and productionization

    They lay out a learning roadmap based on their course reader, emphasizing error analysis as the biggest bottleneck and highest-leverage activity. They describe open/axial coding from social science (grounded theory), building and validating LLM judges, handling RAG, agents, and multi-turn conversations, and production workflows like CI/CD. (A judge-validation sketch follows the chapter list.)

  15. Course-building business tangent: consulting origins, pricing, and why the live cohort ends

    Hamel explains his transition from industry roles to consulting and teaching, plus the economics and positioning of a high-priced course. They share why they’re limiting live cohorts (time intensity, protecting quality) and plan to reinvest into a book and other formats.
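
CODE SKETCHES

Several chapters above describe techniques concrete enough to sketch in code. Everything below is illustrative Python written for these notes, not material from the episode; every function name, prompt, and data value is a hypothetical stand-in.

Chapter 2's definition (an eval = a criterion plus a method of measuring it) might look like this in its simplest form, assuming a small batch of model outputs:

```python
# An eval = a quality criterion + a way to measure it.
# Real products need a suite of these, one per quality dimension.

def non_empty(output: str) -> bool:
    """Criterion: the model must return a non-empty answer."""
    return bool(output.strip())

def concise(output: str, limit: int = 150) -> bool:
    """Criterion: the answer stays within the product's length budget."""
    return len(output.split()) <= limit

evals = {"non_empty": non_empty, "concise": concise}

outputs = ["Here is a short answer.", ""]  # hypothetical model outputs

for name, check in evals.items():
    pass_rate = sum(check(o) for o in outputs) / len(outputs)
    print(f"{name}: {pass_rate:.0%} pass")
```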
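
Chapter 4 recommends translating vibe checks into binary pass/fail rubrics with examples for aligning LLM-as-judge systems. A minimal sketch of such a judge, assuming a placeholder call_llm client and an invented rubric about CRM-grounded sales emails (echoing the Apollo hallucination story):

```python
# A binary LLM-as-judge: one pass/fail criterion, anchored by examples.

def call_llm(prompt: str) -> str:
    """Placeholder: wire this to your LLM provider of choice."""
    raise NotImplementedError

JUDGE_PROMPT = """You are grading an AI-written sales email.

Criterion (answer PASS or FAIL only): PASS if every factual claim about
the recipient's company is supported by the provided CRM notes; FAIL if
any claim is unsupported.

Example 1:
CRM notes: "Acme raised a Series B in 2024."
Email: "Congrats on the Series B!"
Verdict: PASS

Example 2:
CRM notes: "Acme raised a Series B in 2024."
Email: "Congrats on your IPO!"
Verdict: FAIL

CRM notes: {notes}
Email: {email}
Verdict:"""

def judge(notes: str, email: str) -> bool:
    verdict = call_llm(JUDGE_PROMPT.format(notes=notes, email=email))
    return verdict.strip().upper().startswith("PASS")
```

Binary verdicts are easier to align across human reviewers and judge models than 1-5 ratings, which the episode flags as unreliable.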
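
Chapter 7 evaluates RAG retrieval with classic search metrics such as recall@K. Under the standard definition (the fraction of relevant documents that appear in the retriever's top K results), the metric is a few lines; the document IDs are invented:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant documents found in the retriever's top K."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)

# Hypothetical query: 2 of its 3 relevant docs appear in the top 5.
retrieved = ["d7", "d2", "d9", "d1", "d4", "d8"]
relevant = {"d1", "d2", "d3"}
print(recall_at_k(retrieved, relevant, k=5))  # 0.666...
```

Because an LLM consumes the entire top-K context at once, a relevant document at rank 4 can be nearly as useful as one at rank 1, which is the tolerance shift the chapter describes.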
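
Chapter 9 cautions that hill-climbing a metric can overfit to it. One classic ML-style safeguard, assumed here as an illustration, is a held-out set of labeled examples that you score only occasionally:

```python
# Split labeled traces into a dev set you hill-climb on and a holdout
# set you touch rarely. If dev scores keep climbing while holdout scores
# stall or drop, you are overfitting to the dev set.
import random

def split_examples(examples: list, holdout_frac: float = 0.2, seed: int = 42):
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_frac))
    return shuffled[:cut], shuffled[cut:]

examples = list(range(100))  # stand-ins for labeled traces
dev, holdout = split_examples(examples)
print(len(dev), len(holdout))  # 80 20
```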
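
Chapter 14's judge-building step pairs naturally with validation: comparing the judge's verdicts against human pass/fail labels, broken out per class so that missed failures stay visible. The framing and numbers below are mine, not the course's exact recipe:

```python
# Agreement between an LLM judge and human pass/fail labels.
# TPR: agreement when humans say PASS; TNR: agreement when humans say FAIL.

def judge_agreement(human: list[bool], judge: list[bool]) -> dict:
    passes = [j for h, j in zip(human, judge) if h]
    fails = [j for h, j in zip(human, judge) if not h]
    return {
        "overall": sum(h == j for h, j in zip(human, judge)) / len(human),
        "tpr": sum(passes) / len(passes) if passes else float("nan"),
        "tnr": sum(not j for j in fails) / len(fails) if fails else float("nan"),
    }

# Hypothetical labels for six traces:
human = [True, True, False, False, True, False]
judge = [True, True, True, False, True, False]
print(judge_agreement(human, judge))
# {'overall': 0.833..., 'tpr': 1.0, 'tnr': 0.666...}
```

A judge with high overall agreement but low TNR is rubber-stamping failures, the usual culprit behind suspiciously high eval metrics.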
