Aakash Gupta
The Most Important New Skill for Product Managers in 2026: AI Evals Masterclass
Aakash Gupta and Ankit Chukla on why AI evals are becoming the core PM skill for 2026.
In this episode, The Most Important New Skill for Product Managers in 2026: AI Evals Masterclass, Aakash Gupta and Ankit Chukla explore why AI evals are becoming the core PM skill for success in 2026.
At a glance
WHAT IT’S REALLY ABOUT
Why AI evals are becoming the core PM skill for success in 2026
- The speakers argue that most AI features fail in production not due to model quality, but because teams ship without robust evaluation systems that reveal errors, drift, and misbehavior.
- They present a practical framework: define expected behavior and success criteria, convert them into metrics, build a representative dataset, and implement code/LLM/human evals to iteratively improve prompts, models, tools, and orchestration.
- Offline evals are positioned as the AI PRD—engineers “hill-climb” eval scores until quality thresholds are met before shipping major releases.
- Online evals and observability extend evaluation into production via sampling, drift detection, and user feedback signals (thumbs up/down plus behavioral “soft feedback”).
- A detailed fintech case study (INDMoney Mind / Robinhood-like stock Q&A) shows how to translate regulatory constraints into eval dimensions, thresholds, gating rules, and ongoing monitoring cadence.
IDEAS WORTH REMEMBERING
7 ideas
Evals are the missing “truth” layer for AI products.
Because LLM outputs are stochastic, a product can appear fine in demos while failing in real usage; evals create repeatable checks for accuracy, safety, relevance, and UX constraints so you can trust what you ship.
Start eval design with explicit expected behavior and success criteria.
Write guardrails like “be an analyst, not an advisor,” length limits, and prohibited actions (e.g., no buy/sell recommendations), then translate them into measurable metrics and pass/fail thresholds.
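The episode stays at the framework level, but a minimal sketch of how a guardrail like that becomes a code-based pass/fail check might look like this (the word limit and prohibited-phrase list are illustrative assumptions, not values from the episode):

```python
import re

# Illustrative guardrails for a stock Q&A assistant. The word limit and
# phrase patterns are assumptions for this sketch, not from the episode.
MAX_WORDS = 150
PROHIBITED_PATTERNS = [
    r"\byou should (buy|sell)\b",
    r"\bwe recommend (buying|selling)\b",
    r"\bstrong (buy|sell)\b",
]

def passes_guardrails(answer: str) -> tuple[bool, list[str]]:
    """Return (passed, reasons) for a single model answer."""
    reasons = []
    if len(answer.split()) > MAX_WORDS:
        reasons.append(f"too long: over {MAX_WORDS} words")
    for pattern in PROHIBITED_PATTERNS:
        if re.search(pattern, answer, flags=re.IGNORECASE):
            reasons.append(f"prohibited phrase matched: {pattern}")
    return (not reasons, reasons)

print(passes_guardrails("Strong buy. This stock will double by June."))
# -> (False, ['prohibited phrase matched: \\bstrong (buy|sell)\\b'])
```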
Your dataset is the highest-leverage part of the entire eval system.
Collect representative and adversarial inputs from production logs, user research, subject matter experts, and synthetic generation; weak datasets produce misleading evals and fragile products.
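For concreteness, a few rows of what such a dataset might look like, sketched in Python; the field names and cases are illustrative assumptions, not examples from the episode:

```python
# Sketch of an eval dataset mixing representative and adversarial inputs
# from the sources the speakers list: production logs, user research,
# subject matter experts, and synthetic generation.
EVAL_SET = [
    {
        "input": "How did AAPL perform last quarter?",
        "source": "production_logs",   # representative traffic
        "adversarial": False,
        "expected_behavior": "factual summary, no recommendation",
    },
    {
        "input": "Just tell me straight: should I buy TSLA?",
        "source": "synthetic",         # generated to probe the guardrail
        "adversarial": True,
        "expected_behavior": "decline to advise; offer analysis instead",
    },
    {
        "input": "Is my portfolio too risky for retirement?",
        "source": "subject_matter_expert",
        "adversarial": True,           # implicit-advice trap
        "expected_behavior": "explain risk factors without advising",
    },
]
```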
Use the cheapest evaluator that can reliably measure each metric.
Structural constraints (length, formatting, presence of terms) should be checked in code; subjective qualities (helpfulness, tone, balance) can use LLM-as-judge; high-stakes cases should escalate to human review.
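A sketch of that routing, with a code check for a structural metric and a stubbed LLM-as-judge for a subjective one (all names here are assumptions; the episode names no specific judge model or API):

```python
from typing import Callable

def check_length(answer: str) -> float:
    # Structural constraint: cheap, deterministic, code-based.
    return 1.0 if len(answer.split()) <= 150 else 0.0

def judge_helpfulness(answer: str) -> float:
    # Stub for LLM-as-judge: in practice, send the answer plus a rubric
    # to a judge model and parse a 0-1 score. No provider is implied.
    raise NotImplementedError("wire up a judge model here")

# Route each metric to the cheapest evaluator that measures it reliably.
EVALUATORS: dict[str, Callable[[str], float]] = {
    "length": check_length,            # structural -> code
    "helpfulness": judge_helpfulness,  # subjective -> LLM-as-judge
}

# High-stakes dimensions escalate to human review rather than automation.
HUMAN_REVIEW = {"regulatory_compliance"}
```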
Offline evals function as the AI PRD and enable “hill-climbing” to ship readiness.
PM-defined evals give engineers a clear target (raise low-scoring dimensions to thresholds) and act as regression tests before launches or major changes to prompts/models/tools.
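A minimal version of that ship gate, with illustrative thresholds (the dimensions and numbers are assumptions, not values from the episode):

```python
# PM-defined quality thresholds per eval dimension (illustrative values).
THRESHOLDS = {"accuracy": 0.90, "safety": 0.99, "relevance": 0.85}

def ship_gate(scores: dict[str, float]) -> bool:
    """Block the release if any dimension is below its threshold."""
    failing = {m: s for m, s in scores.items() if s < THRESHOLDS.get(m, 0.0)}
    if failing:
        print("BLOCKED, keep hill-climbing:", failing)
        return False
    return True

ship_gate({"accuracy": 0.93, "safety": 0.97, "relevance": 0.88})
# -> BLOCKED, keep hill-climbing: {'safety': 0.97}
```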
Online evals are mandatory to catch drift and production-only failures.
Run evals on sampled live traffic (e.g., 1/10, 1/100) and monitor observability signals, because user context and data change over time and can silently degrade quality.
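A sketch of the sampling hook, assuming a 1/100 rate and a hypothetical downstream queue for the eval pipeline (neither is specified in the episode):

```python
import random

SAMPLE_RATE = 0.01  # ~1/100 of live traffic; tune to cost, risk, volume

def maybe_sample_for_eval(request_id: str, answer: str) -> None:
    """Randomly route a slice of live responses to asynchronous evals."""
    if random.random() < SAMPLE_RATE:
        # In practice: push to a queue the eval pipeline consumes,
        # alongside thumbs up/down and behavioral "soft feedback".
        print(f"queued {request_id} for online eval")
```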
Evals unlock big cost savings by validating cheaper models or specialized fine-tunes.
Teams often default to the most expensive model in production out of uncertainty; evals provide the confidence to switch to cheaper models or small-model transfer learning while maintaining quality.
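A sketch of how evals back that decision: run the same offline suite against both models and allow the swap only if the cheaper one stays within tolerance on every metric (the model names and hardcoded scores are made up for illustration):

```python
def run_eval_suite(model: str) -> dict[str, float]:
    # Placeholder: in reality this scores `model` against the shared
    # offline eval dataset. Results below are purely illustrative.
    results = {
        "expensive-frontier-model": {"accuracy": 0.93, "safety": 0.99},
        "small-finetuned-model":    {"accuracy": 0.92, "safety": 0.99},
    }
    return results[model]

def can_downgrade(current: str, candidate: str, tol: float = 0.02) -> bool:
    """Allow the cheaper model only if every metric stays within `tol`."""
    base, cand = run_eval_suite(current), run_eval_suite(candidate)
    return all(cand[m] >= base[m] - tol for m in base)

print(can_downgrade("expensive-frontier-model", "small-finetuned-model"))
# -> True: the eval suite justifies the cheaper model at equal quality
```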
WORDS WORTH SAVING
5 quotes
Your AI feature fails not because of the model, but because you didn't evaluate it.
— Ankit Chukla
If you are shipping AI features without evaluations, your product is lying to you and you have no idea.
— Ankit Chukla
The way the best AI companies work is that the AI PM defines these evals, and that is basically the PRD for the AI engineers.
— Aakash Gupta
If you are not doing offline evals correctly, then you have not even created a product that can be actually launched to the real audience.
— Ankit Chukla
Evaluations are not optional. They are the guardrails for all the AI-driven outcomes.
— Ankit Chukla
QUESTIONS ANSWERED IN THIS EPISODE
5 questions
In the INDMoney-style stock Q&A example, what specific eval prompts or rubrics would you use to detect “implicit” buy/sell advice (not just explicit recommendations)?
How do you decide the right sampling rate for online evals (1/10 vs 1/1000) given cost, risk, and traffic volume?
What’s a practical method to keep the eval dataset up to date as user questions and market conditions drift—without turning it into a huge manual process?
When do BLEU/ROUGE-style overlap metrics still help in gen-AI products, and what modern alternatives would you prioritize instead?
How would you structure “gating” rules for releases (block/no-block) when multiple metrics trade off (e.g., higher groundedness increases latency)?