
The Most Important New Skill for Product Managers in 2026: AI Evals Masterclass

AI features don't fail because of the model. They fail because nobody evaluated them. Ankit Shukla has taught thousands of PMs how to build evals, and today he's open-sourcing the knowledge he normally charges thousands for — for free. In this episode, he demos the complete workflow for building offline and online evaluations, using LLM judges, code-based checks, and expert review that catch failures before they reach your users.

Complete write-up: https://www.news.aakashg.com/p/ai-evals-explained-simply

----

Timestamps:
0:00 – AI features fail without evals
2:14 – What makes this episode different
3:46 – 5 components of a Gen AI product
5:54 – Case study: AI job website
10:30 – Why prototypes fail (5 reasons)
11:31 – Ads
12:06 – Back to evals
13:43 – How evals fix failures
22:04 – How to build evals end-to-end
35:43 – Offline evals = your AI PRD
37:10 – Online evals & observability
39:26 – Case study: INDMoney Mind
57:52 – Real-life eval examples
1:01:24 – Key takeaways

----

🧠 Key Takeaways:

1. Evals are your AI PRD — The best AI companies have PMs define evals first. Engineers then use pass rates (36%, 56%, 80%) to know when the feature is ready to ship. No evals = no PRD.
2. Non-determinism is why evals exist — LLMs give different outputs for similar inputs. Like tea made in three different places — same ingredients, different result. Evaluations are how you tame that behavior.
3. Build your dataset from 4 sources — Past logs, desk research, synthetic LLM-generated data, and domain experts. Without real edge cases in your dataset, your evals will miss the failures that actually matter.
4. Match the eval type to the metric — Word count and format checks? Use code. Tone, relevance, and hallucination? Use an LLM judge. Compliance and legal risk? Use humans. Don't use a sword when a needle will do.
5. Offline evals before you ship, online evals after — Offline = pre-launch quality gate. Online = production monitoring on sampled traffic (1 in 10, 1 in 100). Both are required. Neither is optional.
6. Cost optimization requires evals — There's a 25x price difference between GPT-5 and GPT-5 Nano. You'll never confidently switch to a cheaper model unless your evals prove the quality holds.
7. Involve domain experts — A PM can't always tell a good financial answer from a bad one. Bring in investment advisors, compliance leads, or customer support reps. Show them outputs. They'll tell you what's broken.
8. Use hybrid evaluation — LLM flags issues at scale, humans make the final call on edge cases. This is how you get thoroughness without burning budget on full human review (a minimal routing sketch follows below).

----

🏆 Sponsors:

1. Reforge Build: AI prototyping built for product teams — try free at reforge.com/Aakash, use code BUILD for 1 month free premium - https://build.reforge.com/

----

👨‍💻 Where to find Ankit:
LinkedIn: https://www.linkedin.com/in/ankythshukla/
Website: https://hellopm.co/
YouTube: https://www.youtube.com/@UC_gl5BtaGFDtB-imTWBWTpw

👨‍💻 Where to find Aakash:
Twitter: https://www.x.com/aakashg0
LinkedIn: https://www.linkedin.com/in/aagupta/
Newsletter: https://www.news.aakashg.com
Premium Bundle: https://bundle.aakashg.com

#aievals #aiproductmanagement

----

🧠 About Product Growth: The world's largest podcast focused solely on product + growth, with 200K+ listeners.

🔔 Subscribe and turn on notifications to get more videos like this.
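Takeaway 8's hybrid pattern can be shown in a few lines of Python. This is a minimal sketch, not the workflow demoed in the episode: `llm_judge`, `JudgeResult`, and both thresholds are hypothetical stand-ins for whatever judge prompt and review tooling your team actually uses.

```python
# Minimal sketch of hybrid evaluation: an LLM judge scores every sampled
# output, and only low-scoring or low-confidence cases go to human review.
# `llm_judge` is a hypothetical helper, not a real API.

from dataclasses import dataclass

@dataclass
class JudgeResult:
    score: int         # 1 (bad) .. 5 (good)
    confidence: float   # judge's self-reported confidence, 0..1
    reason: str

def llm_judge(question: str, answer: str) -> JudgeResult:
    """Placeholder: call your LLM judge prompt and parse its output."""
    raise NotImplementedError

def route_for_review(question: str, answer: str,
                     score_threshold: int = 4,
                     confidence_threshold: float = 0.7) -> str:
    """Return 'pass', or 'human_review' when the judge is unhappy or unsure."""
    result = llm_judge(question, answer)
    if result.score < score_threshold or result.confidence < confidence_threshold:
        return "human_review"   # expensive experts only see flagged edge cases
    return "pass"               # the LLM judge handles the bulk at scale
```

The point of the split is that human reviewers only see the cases the cheap judge is unsure about, which is how the thoroughness-versus-budget tradeoff in takeaway 8 plays out in practice.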

Ankit Shukla (guest) · Aakash Gupta (host)
Feb 19, 2026 · 1h 3m · Watch on YouTube ↗

CHAPTERS

  1. 0:00 – 2:14

    Why AI features fail without evaluations (and why it’s a PM skill)

    Ankit frames the core claim: AI features don’t fail primarily because the model is “bad,” but because teams ship without a reliable way to measure correctness, usefulness, and safety. The episode positions “writing evals” as a defining capability for product managers heading into 2026.

  2. 2:14 – 3:46

    What’s different about this masterclass: real examples, not hypotheticals

    Aakash and Ankit contrast most online eval content (introductory, theoretical) with what they’ll do here: a framework plus concrete examples and an end-to-end case study. The promise is that viewers walk away able to approach evals for many GenAI product types.

  3. 3:46 – 5:54

    The 5 components of a GenAI product (and where nondeterminism enters)

    Ankit breaks a GenAI product into five building blocks and explains why they require a different quality approach than deterministic software. The key issue is stochastic model behavior: the same input can yield different outputs, so teams must “tame the lion” with evals.

  4. 5:54 – 10:30

    Case study: AI-first job website & what an eval looks like

    A simple AI job site example illustrates how evals operate: generate summaries, interview questions, skills, learning guides, and quizzes from job descriptions, then assess quality and constraints. The chapter also clarifies that evals can be code-based, human, or LLM-judge prompts.

  5. 10:30 – 11:31

    Why prototypes fail to scale: the 5 failure modes

    Ankit explains why impressive demos break in production, citing research and practical patterns. He outlines five common reasons prototypes fail, many of which require systematic measurement and iteration to overcome.

  6. 11:31 – 12:06

    Nondeterminism intuition: the ‘chai’ metaphor & why correctness isn’t enough

    Using tea/chai variation across contexts, Ankit explains why even “correct” LLM answers may still fail user expectations. The takeaway: even as hallucinations decrease, products must be tuned to customer preferences and context, which evals operationalize.

  7. 12:06 – 13:43

    Building evals end-to-end: the full workflow diagram

    Ankit walks through an end-to-end eval lifecycle: define success criteria, build a baseline product, create a representative dataset, identify failures with SME help, convert them into metrics and evals, then iterate via offline and online loops.

  8. 13:43 – 22:04

    How evals address drift, cost, and guardrails (and where they don’t)

    Evals are mapped directly to the prototype failure modes. They’re positioned as continuous measurement for drift, as a way to compare models/cost tradeoffs, and as the backbone for guardrails—while acknowledging engineering constraints need additional approaches.

  9. 22:04 – 35:43

    Evaluation methods & metrics: code checks, LLM judges, and legacy NLP metrics

    This chapter drills into the “how” of measuring outputs: deterministic programmatic tests, subjective LLM-judge scoring, and when to use (or avoid) older NLP metrics like BLEU/ROUGE. The guiding principle is to use the cheapest reliable method for each metric. Both approaches are sketched in code after this chapter list.

  10. 35:43 – 37:10

    Offline evals: the AI PRD and ‘hill-climbing’ to ship quality

    Offline evals are positioned as the pre-launch gating system and effectively the PRD for AI engineers. The team iterates on prompts/models/tools until eval performance meets thresholds, then ships with confidence rather than hope. A pass-rate gate of this kind is sketched after the chapter list.

  11. 37:10 – 39:26

    Online evals & observability: sampling in production + drift detection

    After launch, the same eval concepts extend into production via observability platforms and sampling-based checks. Online evals catch drift, regressions, and changing user expectations, creating a continuous improvement loop. The sampling idea is sketched after the chapter list.

  12. 39:26 – 57:52

    Case study deep dive: INDMoney Mind / Robinhood-style stock Q&A assistant

    Ankit reverse-engineers a finance assistant feature and shows how a PM would define constraints, compliance guardrails, and evaluation dimensions in a regulated domain. The example highlights how expected behavior becomes explicit metrics and test criteria.

  13. 57:52 – 1:01:24

    Real-world evaluation artifacts: datasets, thresholds, latency percentiles, and user feedback loops

    The case study expands into concrete eval operations: dataset sourcing/maintenance, eval types (automated, LLM-judge, human), blocking criteria, online latency monitoring, and using hard/soft user feedback as additional signals. A latency-percentile sketch appears after the chapter list.

  14. 1:01:24 – 1:03:59

    Evals aren’t QA rebranded: business impact examples & final takeaways

    Ankit distinguishes eval work from traditional QA by emphasizing transformation, SME alignment, and continuous system tuning. They close with examples (Grammarly, GitHub Copilot, Klarna, support chatbots) and the overarching lesson that evals are ongoing guardrails, not a one-time task.
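Chapter 9 contrasts deterministic code checks with subjective LLM-judge scoring. A minimal sketch of both, using the job-site example from chapter 4: the format checks are generic illustrations, and the judge prompt is only indicative, since the episode does not prescribe specific metrics or a model API.

```python
# Cheap, deterministic checks run as plain code; subjective dimensions
# (tone, relevance, hallucination) go to an LLM judge instead.

def check_summary_format(summary: str, max_words: int = 120) -> dict:
    """Deterministic checks: word count, no markdown headers, ends cleanly."""
    words = summary.split()
    return {
        "within_word_limit": len(words) <= max_words,
        "no_headers": not any(line.lstrip().startswith("#") for line in summary.splitlines()),
        "ends_with_punctuation": summary.rstrip().endswith((".", "!", "?")),
    }

# Illustrative LLM-judge prompt; how you send it (and to which model) is up
# to your own stack.
JUDGE_PROMPT = """You are evaluating a job-description summary.
Job description: {job_description}
Summary: {summary}

Score 1-5 for each dimension and answer in JSON:
{{"relevance": _, "tone": _, "hallucination_free": _, "reason": "_"}}"""
```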
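Chapter 10's offline “hill-climbing” loop reduces to a pass rate over a fixed dataset compared against a ship threshold. A sketch under stated assumptions: `run_product` and `evaluate` stand in for your own pipeline, and the 80% bar is just one of the pass rates mentioned in the episode.

```python
# Offline gating: run every eval over a fixed dataset, compute the pass
# rate, and only ship once it clears the agreed threshold.

from typing import Callable, Iterable

def offline_pass_rate(dataset: Iterable[dict],
                      run_product: Callable[[dict], str],
                      evaluate: Callable[[dict, str], bool]) -> float:
    """Fraction of dataset examples whose output passes all evals."""
    results = [evaluate(example, run_product(example)) for example in dataset]
    return sum(results) / len(results) if results else 0.0

SHIP_THRESHOLD = 0.80  # e.g. the 80% bar cited in the episode

def ready_to_ship(pass_rate: float) -> bool:
    return pass_rate >= SHIP_THRESHOLD
```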
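Chapter 11's online evals score a sample of production traffic rather than every request. A rough sketch of the 1-in-10 / 1-in-100 sampling idea; `enqueue_for_eval` is a placeholder for whatever observability or eval platform the team already runs.

```python
# Online sampling: randomly route a small fraction of live responses into
# an asynchronous eval pipeline instead of scoring all of them.

import random

SAMPLE_RATE = 0.01  # ~1 in 100 requests; use 0.1 for 1 in 10

def maybe_enqueue_for_eval(request_id: str, prompt: str, response: str) -> None:
    """Randomly sample live traffic into the async eval pipeline."""
    if random.random() < SAMPLE_RATE:
        enqueue_for_eval({"request_id": request_id,
                          "prompt": prompt,
                          "response": response})

def enqueue_for_eval(record: dict) -> None:
    """Placeholder: push to your observability / eval platform of choice."""
    raise NotImplementedError
```

Sampling keeps eval cost roughly proportional to the sample rate while still surfacing drift and regressions over time.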
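Chapter 13 lists latency percentiles among the online signals worth monitoring. A small sketch of computing p50/p95 over a window of recent requests; the 3-second budget is an illustrative number, not one taken from the episode.

```python
# Latency monitoring: compute p50/p95 over recent request latencies and
# compare the p95 against a budget.

import statistics

def latency_percentiles(latencies_ms: list[float]) -> dict:
    """p50 and p95 from a window of recent latencies, in milliseconds."""
    if len(latencies_ms) < 2:
        return {"p50": None, "p95": None}
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}

def latency_ok(latencies_ms: list[float], p95_budget_ms: float = 3000.0) -> bool:
    p95 = latency_percentiles(latencies_ms)["p95"]
    return p95 is not None and p95 <= p95_budget_ms
```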
