
The Most Important New Skill for Product Managers in 2026: AI Evals Masterclass

AI features don't fail because of the model. They fail because nobody evaluated them. Ankit Shukla has taught thousands of PMs how to build evals, and today he's open-sourcing the knowledge he normally charges thousands for — for free. In this episode, he demos the complete workflow for building offline and online evaluations, using LLM judges, code-based checks, and expert review that catch failures before they reach your users.

Complete write-up: https://www.news.aakashg.com/p/ai-evals-explained-simply

----

Timestamps:
0:00 – AI features fail without evals
2:14 – What makes this episode different
3:46 – 5 components of a Gen AI product
5:54 – Case study: AI job website
10:30 – Why prototypes fail (5 reasons)
11:31 – Ads
12:06 – Back to evals
13:43 – How evals fix failures
22:04 – How to build evals end-to-end
35:43 – Offline evals = your AI PRD
37:10 – Online evals & observability
39:26 – Case study: INDmoney Mind
57:52 – Real-life eval examples
1:01:24 – Key takeaway

----

🧠 Key Takeaways:
1. Evals are your AI PRD — The best AI companies have PMs define evals first. Engineers then use pass rates (36%, 56%, 80%) to know when the feature is ready to ship. No evals = no PRD.
2. Non-determinism is why evals exist — LLMs give different outputs for similar inputs. Like tea made in three different places — same ingredients, different result. Evaluations are how you tame that behavior.
3. Build your dataset from 4 sources — Past logs, desk research, synthetic LLM-generated data, and domain experts. Without real edge cases in your dataset, your evals will miss the failures that actually matter.
4. Match the eval type to the metric — Word count and format checks? Use code. Tone, relevance, and hallucination? Use an LLM judge. Compliance and legal risk? Use humans. Don't use a sword when a needle will do.
5. Offline evals before you ship, online evals after — Offline = pre-launch quality gate. Online = production monitoring on sampled traffic (1 in 10, 1 in 100). Both are required. Neither is optional.
6. Cost optimization requires evals — There's a 25x price difference between GPT-5 and GPT-5 Nano. You'll never confidently switch to a cheaper model unless your evals prove the quality holds.
7. Involve domain experts — A PM can't always tell a good financial answer from a bad one. Bring in investment advisors, compliance leads, or customer support reps. Show them outputs. They'll tell you what's broken.
8. Use hybrid evaluation — LLM flags issues at scale, humans make the final call on edge cases. This is how you get thoroughness without burning budget on full human review.

----

🏆 Sponsors:
1. Reforge Build: AI prototyping built for product teams — try free at reforge.com/Aakash, use code BUILD for 1 month of free premium - https://build.reforge.com/

----

👨‍💻 Where to find Ankit:
LinkedIn: https://www.linkedin.com/in/ankythshukla/
Website: https://hellopm.co/
YouTube: https://www.youtube.com/@UC_gl5BtaGFDtB-imTWBWTpw

👨‍💻 Where to find Aakash:
Twitter: https://www.x.com/aakashg0
LinkedIn: https://www.linkedin.com/in/aagupta/
Newsletter: https://www.news.aakashg.com
Premium Bundle: https://bundle.aakashg.com

#aievals #aiproductmanagement

----

🧠 About Product Growth:
The world's largest podcast focused solely on product + growth, with over 200K listeners.
🔔 Subscribe and turn on notifications to get more videos like this.

Ankit Shukla (guest) · Aakash Gupta (host)
Feb 18, 2026 · 1h 3m · Watch on YouTube ↗

At a glance

WHAT IT’S REALLY ABOUT

Why AI evals are becoming the core PM skill for 2026

  1. The speakers argue that most AI features fail in production not due to model quality, but because teams ship without robust evaluation systems that reveal errors, drift, and misbehavior.
  2. They present a practical framework: define expected behavior and success criteria, convert them into metrics, build a representative dataset, and implement code/LLM/human evals to iteratively improve prompts, models, tools, and orchestration.
  3. Offline evals are positioned as the AI PRD—engineers “hill-climb” eval scores until quality thresholds are met before shipping major releases.
  4. Online evals and observability extend evaluation into production via sampling, drift detection, and user feedback signals (thumbs up/down plus behavioral “soft feedback”); a minimal sampling sketch follows this list.
  5. A detailed fintech case study (INDmoney Mind, a Robinhood-like stock Q&A product) shows how to translate regulatory constraints into eval dimensions, thresholds, gating rules, and an ongoing monitoring cadence.
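To make the sampling idea in point 4 concrete, here is a minimal sketch of routing a slice of live traffic into an online-eval queue. The function name, `SAMPLE_RATE`, and the record shape are all hypothetical; the episode only specifies evaluating roughly 1 in 10 or 1 in 100 production responses.

```python
import random

# Hypothetical online-eval sampler. SAMPLE_RATE and the record shape are
# illustrative; the episode only specifies sampling ~1 in 10 or ~1 in 100
# production responses for ongoing evaluation.
SAMPLE_RATE = 0.10  # ~1 in 10 requests

def maybe_sample_for_eval(request_id: str, prompt: str, response: str,
                          eval_queue: list) -> bool:
    """Randomly route a slice of live traffic into the eval pipeline."""
    if random.random() < SAMPLE_RATE:
        eval_queue.append({
            "request_id": request_id,
            "prompt": prompt,
            "response": response,
        })
        return True
    return False
```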

IDEAS WORTH REMEMBERING

5 ideas

Evals are the missing “truth” layer for AI products.

Because LLM outputs are stochastic, a product can appear fine in demos while failing in real usage; evals create repeatable checks for accuracy, safety, relevance, and UX constraints so you can trust what you ship.
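One way to picture “repeatable checks” is a small harness that runs every case through named check functions and reports a pass rate per dimension. Everything here (the `run_evals` name, the check registry) is a sketch of the general pattern, not the exact workflow demoed in the episode.

```python
from typing import Callable

# A check takes (prompt, output) and returns pass/fail.
Check = Callable[[str, str], bool]

def run_evals(cases: list, checks: dict) -> dict:
    """Return the pass rate per eval dimension across the dataset."""
    results = {}
    for name, check in checks.items():
        passed = sum(1 for prompt, output in cases if check(prompt, output))
        results[name] = passed / len(cases) if cases else 0.0
    return results
```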

Start eval design with explicit expected behavior and success criteria.

Write guardrails like “be an analyst, not an advisor,” length limits, and prohibited actions (e.g., no buy/sell recommendations), then translate them into measurable metrics and pass/fail thresholds.
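As a sketch, guardrails like those translate into simple code-based checks such as the ones below. The 200-word limit and the phrase list are invented for illustration; the episode's actual thresholds and prohibited phrases are not in this summary.

```python
import re

MAX_WORDS = 200  # assumed length limit; the episode's actual number may differ

# Phrases that would turn an "analyst" answer into an "advisor" answer
# (illustrative patterns, not an exhaustive compliance list).
ADVICE_PATTERNS = [
    r"\byou should (buy|sell)\b",
    r"\bwe recommend (buying|selling)\b",
    r"\bguaranteed returns?\b",
]

def check_length(output: str) -> bool:
    """Pass if the answer respects the word limit."""
    return len(output.split()) <= MAX_WORDS

def check_no_advice(output: str) -> bool:
    """Pass only if the output contains no buy/sell recommendation language."""
    return not any(re.search(p, output, re.IGNORECASE) for p in ADVICE_PATTERNS)
```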

Your dataset is the highest-leverage part of the entire eval system.

Collect representative and adversarial inputs from production logs, user research, subject matter experts, and synthetic generation; weak datasets produce misleading evals and fragile products.
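A simple way to keep those four sources balanced is to tag every case with its origin. The record schema below is an assumption for illustration, not a format from the episode.

```python
from dataclasses import dataclass

# Hypothetical dataset record: one row per test case, tagged by origin so
# coverage across all four sources can be verified.
@dataclass
class EvalCase:
    prompt: str
    expected_behavior: str     # e.g. "explain P/E ratio, no advice"
    source: str                # "logs" | "research" | "expert" | "synthetic"
    adversarial: bool = False  # edge cases and attack inputs

def coverage_by_source(cases: list) -> dict:
    """Count cases per source to spot gaps in the dataset."""
    counts = {}
    for case in cases:
        counts[case.source] = counts.get(case.source, 0) + 1
    return counts
```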

Use the cheapest evaluator that can reliably measure each metric.

Structural constraints (length, formatting, presence of terms) should be code-based, subjective qualities (helpfulness, tone, balance) can use LLM-as-judge, and high-stakes cases should escalate to human review.
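That routing rule reads naturally as a dispatch table. The metric names below, and the judge-prompt wording, are placeholders; a real team would substitute its own metrics, judge model, and review queue.

```python
# Hypothetical dispatch: map each metric to the cheapest evaluator that can
# measure it reliably. "code" runs locally; "llm_judge" and "human" stand in
# for a judge-model call and a human review queue, respectively.
EVALUATOR_FOR_METRIC = {
    "length": "code",
    "formatting": "code",
    "required_terms": "code",
    "helpfulness": "llm_judge",
    "tone": "llm_judge",
    "hallucination": "llm_judge",
    "compliance_risk": "human",
    "legal_risk": "human",
}

# A judge-prompt sketch for one subjective metric; {prompt} and {output}
# are .format() placeholders, and the wording is illustrative only.
JUDGE_PROMPT = (
    "You are grading an AI answer about stocks.\n"
    "Question: {prompt}\nAnswer: {output}\n"
    "Does the answer stay analytical and avoid giving buy/sell advice? "
    "Reply PASS or FAIL with one sentence of reasoning."
)
```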

Offline evals function as the AI PRD and enable “hill-climbing” to ship readiness.

PM-defined evals give engineers a clear target (raise low-scoring dimensions to thresholds) and act as regression tests before launches or major changes to prompts/models/tools.
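In code, that release gate might look like the check below, fed by pass rates from a harness like the earlier sketch. The thresholds are invented placeholders, though they echo the 36% → 56% → 80% hill-climbing progression mentioned in the show notes.

```python
# Hypothetical release gate: thresholds per dimension are PM-defined; the
# numbers here are placeholders, not the episode's actual bar.
THRESHOLDS = {
    "no_advice": 1.00,   # zero tolerance for buy/sell recommendations
    "length": 0.95,
    "helpfulness": 0.80,
}

def ready_to_ship(pass_rates: dict) -> tuple:
    """Compare eval pass rates against thresholds; list failing dimensions."""
    failing = [m for m, bar in THRESHOLDS.items()
               if pass_rates.get(m, 0.0) < bar]
    return (not failing, failing)

# Example:
# ready_to_ship({"no_advice": 1.0, "length": 0.97, "helpfulness": 0.56})
# -> (False, ["helpfulness"])   # keep hill-climbing on helpfulness
```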

WORDS WORTH SAVING

5 quotes

Your AI feature fails not because of the model, but because you didn't evaluate it.

Ankit Shukla

If you are shipping AI features without evaluations, your product is lying to you and you have no idea.

Ankit Shukla

The way the best AI companies work is that the AI PM defines these evals, and that is basically the PRD for the AI engineers.

Aakash Gupta

If you are not doing offline evals correctly, then you have not even created a product that can be actually launched to the real audience.

Ankit Shukla

Evaluations are not optional. They are the guardrails for all the AI-driven outcomes.

Ankit Shukla

TOPICS COVERED

Why AI prototypes fail to scale (data drift, cost, engineering, guardrails, collaboration)
Five components of a gen-AI product (model, context engineering, tools, orchestration, UX)
Evals as guardrails for non-deterministic LLM behavior
Dataset creation (logs, research, experts, synthetic data)
Offline evals as PRD and release gates
Online evals, observability platforms, and drift monitoring
Cost-quality optimization via model comparisons and smaller models/fine-tuning
