The Most Important New Skill for Product Managers in 2026: AI Evals Masterclass
Episode Details
EPISODE INFO
- Released: February 19, 2026
- Duration: 1h 3m
- Channel: Aakash Gupta
EPISODE DESCRIPTION
AI features don't fail because of the model. They fail because nobody evaluated them. Ankit Shukla has taught thousands of PMs how to build evals, and today he's open-sourcing the knowledge he normally charges thousands for — for free. In this episode, he demos the complete workflow for building offline and online evaluations, using LLM judges, code-based checks, and expert review to catch failures before they reach your users.

Complete write-up: https://www.news.aakashg.com/p/ai-evals-explained-simply

----
Timestamps:
0:00 – AI features fail without evals
2:14 – What makes this episode different
3:46 – 5 components of a Gen AI product
5:54 – Case study: AI job website
10:30 – Why prototypes fail (5 reasons)
11:31 – Ads
12:06 – Back to evals
13:43 – How evals fix failures
22:04 – How to build evals end-to-end
35:43 – Offline evals = your AI PRD
37:10 – Online evals & observability
39:26 – Case study: IDMoney Mind
57:52 – Real-life eval examples
1:01:24 – Key takeaway

----
🧠 Key Takeaways:
1. Evals are your AI PRD — The best AI companies have PMs define evals first. Engineers then use pass rates (36%, 56%, 80%) to know when the feature is ready to ship. No evals = no PRD.
2. Non-determinism is why evals exist — LLMs give different outputs for similar inputs, like tea made in three different places: same ingredients, different result. Evaluations are how you tame that behavior.
3. Build your dataset from 4 sources — Past logs, desk research, synthetic LLM-generated data, and domain experts. Without real edge cases in your dataset, your evals will miss the failures that actually matter.
4. Match the eval type to the metric — Word count and format checks? Use code. Tone, relevance, and hallucination? Use an LLM judge. Compliance and legal risk? Use humans. Don't use a sword when a needle will do. (A minimal sketch of this routing follows the list.)
5. Offline evals before you ship, online evals after — Offline = pre-launch quality gate. Online = production monitoring on sampled traffic (1 in 10, 1 in 100). Both are required. Neither is optional. (Sampling is sketched in the second example below.)
6. Cost optimization requires evals — There's a 25x price difference between GPT-5 and GPT Nano. You'll never confidently switch to a cheaper model unless your evals prove the quality holds. (See the pass-rate sketch below.)
7. Involve domain experts — A PM can't always tell a good financial answer from a bad one. Bring in investment advisors, compliance leads, or customer support reps. Show them outputs. They'll tell you what's broken.
8. Use hybrid evaluation — The LLM flags issues at scale; humans make the final call on edge cases. This is how you get thoroughness without burning budget on full human review. (Sketched together with takeaway 5 below.)
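To make takeaway 4 concrete, here is a minimal Python sketch of routing each metric to the right eval type: cheap deterministic code checks for word count and format, and an LLM judge for fuzzy qualities like relevance and hallucination. This is an illustration, not the exact workflow demoed in the episode; `call_llm` is a hypothetical stand-in for whatever model client you use, and the prompt and thresholds are placeholders.

```python
import re

def check_word_count(output: str, max_words: int = 150) -> bool:
    """Code-based check: deterministic, cheap, no model call needed."""
    return len(output.split()) <= max_words

def check_format(output: str) -> bool:
    """Code-based check: e.g. the answer must contain a link."""
    return re.search(r"https?://\S+", output) is not None

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Is the answer relevant and free of unsupported claims? Reply PASS or FAIL."""

def judge_with_llm(question: str, answer: str, call_llm) -> bool:
    """LLM-judge check: tone, relevance, and hallucination need a model's
    judgment. `call_llm` is a placeholder that takes a prompt string and
    returns the model's reply as a string."""
    verdict = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return verdict.strip().upper().startswith("PASS")

def evaluate(question: str, answer: str, call_llm) -> dict:
    """Run the cheap code checks first; only pay for the LLM judge if they pass."""
    results = {
        "word_count": check_word_count(answer),
        "format": check_format(answer),
    }
    if all(results.values()):
        results["llm_judge"] = judge_with_llm(question, answer, call_llm)
    return results
```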
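Takeaways 5 and 8 combine naturally in production: grade only a sample of live traffic, let the LLM judge flag problems at scale, and send just the flagged outputs to humans for the final call. A sketch under the same assumptions (`judge_with_llm` from the previous example; a plain list standing in for real review tooling):

```python
import random

SAMPLE_RATE = 0.1        # grade 1 in 10 production responses (0.01 for 1 in 100)
human_review_queue = []  # stand-in for whatever review tooling you actually use

def on_production_response(question: str, answer: str, call_llm) -> None:
    """Online eval: sample live traffic; the LLM judge flags, humans decide."""
    if random.random() >= SAMPLE_RATE:
        return  # unsampled traffic is not graded at all
    if not judge_with_llm(question, answer, call_llm):
        # Hybrid evaluation: the model flags at scale,
        # a human adjudicates the edge cases.
        human_review_queue.append({"question": question, "answer": answer})
```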
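And for takeaway 6, the eval set is what lets you compare a cheap model against an expensive one with confidence. A hypothetical pass-rate comparison, reusing `evaluate` from the first sketch; `dataset` and the `generate` callables are placeholders for your own data and model clients:

```python
def pass_rate(dataset, generate, call_llm) -> float:
    """Offline eval: fraction of examples where every check passes.
    `dataset` is a list of {"question": ...} dicts; `generate` produces
    the candidate model's answer for a question."""
    passed = 0
    for example in dataset:
        answer = generate(example["question"])
        results = evaluate(example["question"], answer, call_llm)
        passed += all(results.values())
    return passed / len(dataset)

# Example decision rule (thresholds are illustrative, not from the episode):
# if pass_rate(data, cheap_generate, call_llm) >= pass_rate(data, big_generate, call_llm) - 0.02:
#     the cheaper model holds quality, so the cost savings is safe to take
```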
----
🏆 Sponsors:
1. Reforge Build: AI prototyping built for product teams. Try it free at reforge.com/Aakash and use code BUILD for 1 month of free premium: https://build.reforge.com/

----
👨💻 Where to find Ankit:
LinkedIn: https://www.linkedin.com/in/ankythshukla/
Website: https://hellopm.co/
YouTube: https://www.youtube.com/@UC_gl5BtaGFDtB-imTWBWTpw

👨💻 Where to find Aakash:
Twitter: https://www.x.com/aakashg0
LinkedIn: https://www.linkedin.com/in/aagupta/
Newsletter: https://www.news.aakashg.com
Premium Bundle: https://bundle.aakashg.com

#aievals #aiproductmanagement

----
🧠 About Product Growth:
The world's largest podcast focused solely on product + growth, with over 200K listeners.

🔔 Subscribe and turn on notifications to get more videos like this.
SPEAKERS
Ankit Shukla (guest)
AI product leader and instructor focused on designing and running AI/LLM evaluations for product teams.
Aakash Gupta (host)
Host of the Product Growth podcast and the Aakash Gupta YouTube channel, covering product management and AI with interviews and masterclasses.
EPISODE SUMMARY
In this episode, The Most Important New Skill for Product Managers in 2026: AI Evals Masterclass, Ankit Shukla joins Aakash Gupta to explore why AI evals are becoming the core PM skill for 2026. The speakers argue that most AI features fail in production not because of model quality, but because teams ship without robust evaluation systems that reveal errors, drift, and misbehavior.