Aakash Gupta
The Most Important New Skill for Product Managers in 2026: AI Evals Masterclass
CHAPTERS
- 0:00 – 2:14
Why AI features fail without evaluations (and why it’s a PM skill)
Ankit frames the core claim: AI features don’t fail primarily because the model is “bad,” but because teams ship without a reliable way to measure correctness, usefulness, and safety. The episode positions “writing evals” as a defining capability for product managers heading into 2026.
- •Without rigorous evaluation loops, AI products can “lie” to the teams that build them
- •Evals are a new differentiating skill for AI PMs (beyond classic product sense)
- •Goal of the session: a practical masterclass with real-world nuance, not just intro concepts
- •Evals help PMs translate product intent into measurable system behavior
- 2:14 – 3:46
What’s different about this masterclass: real examples, not hypotheticals
Aakash and Ankit contrast most online eval content (introductory, theoretical) with what they’ll do here: a framework plus concrete examples and an end-to-end case study. The promise is that viewers walk away able to approach evals for many GenAI product types.
- •Most public eval content lacks grounded, real examples
- •Evals aren’t one-size-fits-all; nuance depends on product and domain
- •The episode will cover the nature of LLMs, metrics, an end-to-end eval workflow, and tips
- •Case study will demonstrate how a company plans evals in practice
- 3:46 – 5:54
The 5 components of a GenAI product (and where nondeterminism enters)
Ankit breaks a GenAI product into five building blocks and explains why they require a different quality approach than deterministic software. The key issue is stochastic model behavior: the same input can yield different outputs, so teams must “tame the lion” with evals.
- •Five components: language model, context engineering (RAG/prompts), tools, orchestration, UX/humans-in-loop
- •LLMs are nondeterministic; variability persists even as hallucinations decrease
- •PMs must ensure the product delivers a consistent, intended user experience
- •Evals are the mechanism to control and validate system behavior
- 5:54 – 10:30
Case study: AI-first job website & what an eval looks like
A simple AI job site example illustrates how evals operate: the product generates summaries, interview questions, skills, learning guides, and quizzes from job descriptions, and evals then assess output quality and constraint compliance. The chapter also clarifies that evals can be code-based checks, human reviews, or LLM-judge prompts.
- •Product flow: crawl job portals → LLM transforms job descriptions into structured candidate help
- •Quality goals: factuality, relevance, helpfulness, anti-hallucination, and format constraints (e.g., summary length)
- •Example of “LLM-as-judge” prompt evaluating multiple outputs against criteria
- •Evals also include simple deterministic checks (e.g., word/character count)
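The two eval styles in this chapter can be sketched in a few lines of Python. This is a minimal illustration, not the check or prompt shown in the episode: the 60-word limit, the function names, and the judge prompt wording are all assumptions made for the example.

```python
# Hypothetical limit: the chapter mentions a summary-length constraint
# but does not state a number.
MAX_SUMMARY_WORDS = 60


def check_summary_length(summary, max_words=MAX_SUMMARY_WORDS):
    """Code-based eval: pass if the summary is non-empty and within the word limit."""
    word_count = len(summary.split())
    return {
        "metric": "summary_length",
        "word_count": word_count,
        "passed": 0 < word_count <= max_words,
    }


# Assumed wording for an LLM-as-judge prompt covering three of the quality goals.
JUDGE_PROMPT_TEMPLATE = """You are grading an AI-generated job summary.

Job description:
{job_description}

Summary to grade:
{summary}

Answer PASS or FAIL for each criterion, with a one-line reason:
1. Factuality: every claim is supported by the job description.
2. Relevance: the summary focuses on what a candidate needs to know.
3. Helpfulness: a candidate could act on the summary.
"""


def build_judge_prompt(job_description, summary):
    """Fill the judge template; sending it to a model is left to the caller."""
    return JUDGE_PROMPT_TEMPLATE.format(
        job_description=job_description, summary=summary
    )


result = check_summary_length("Backend engineer role focused on Python APIs.")
print(result["passed"], result["word_count"])
```

The length check costs nothing to run on every output, while the judge prompt is reserved for criteria that code cannot decide.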
- 10:30 – 11:31
Why prototypes fail to scale: the 5 failure modes
Ankit explains why impressive demos break in production, citing research and practical patterns. He outlines five common reasons prototypes fail, many of which require systematic measurement and iteration to overcome.
- •Data drift: user context and knowledge change; the system stops matching reality
- •Cost scaling: each call costs money; prototype model choices may be unsustainable
- •Engineering limitations: scalability, stress testing, async behavior, latency issues
- •Missing guardrails: feedback loops, fallbacks, legal/compliance protections
- •Collaboration failure: misalignment across teams and with users
- 11:31 – 12:06
Nondeterminism intuition: the ‘chai’ metaphor & why correctness isn’t enough
Using tea/chai variation across contexts, Ankit explains why even “correct” LLM answers may still fail user expectations. The takeaway: even as hallucinations decrease, products must be tuned to customer preferences and context, which evals operationalize.
- •Same outcome category (“tea”) can differ dramatically in user satisfaction by context
- •LLMs may be factually correct yet misaligned with user needs or desired style
- •PMs must define what “good” means for their specific users, not just factuality
- •Evals enforce the product’s intended experience across varied inputs
- 12:06 – 13:43
Building evals end-to-end: the full workflow diagram
Ankit walks through an end-to-end eval lifecycle: define success criteria, build a baseline product, create a representative dataset, identify failures with SME help, convert them into metrics and evals, then iterate via offline and online loops.
- •Start with success criteria and expected behavior (guardrails, UX, compliance)
- •Treat model/prompt/tools/context/orchestration as adjustable ‘knobs’
- •The dataset is the most effort-intensive asset: real queries + research + synthetic + experts
- •Run dataset through baseline, review failures, and derive evaluation metrics
- •Choose eval methods: code, human review, LLM-judge, or hybrid
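The middle of this workflow (run the dataset through the baseline, collect failures, and turn them into metrics) could look roughly like the sketch below. The `baseline` function is a stand-in for a real model call, and the dataset fields and eval names are hypothetical.

```python
from collections import defaultdict


def run_offline_evals(dataset, generate, evals):
    """Run every example through `generate`, apply each eval, and
    return (pass rate per metric, list of failures for review)."""
    if not dataset:
        raise ValueError("dataset must contain at least one example")
    passes = defaultdict(int)
    failures = []
    for example in dataset:
        output = generate(example["input"])
        for name, check in evals.items():
            if check(example, output):
                passes[name] += 1
            else:
                failures.append(
                    {"metric": name, "input": example["input"], "output": output}
                )
    pass_rates = {name: passes[name] / len(dataset) for name in evals}
    return pass_rates, failures


def baseline(query):
    """Stand-in for the baseline product; a real one would call a model."""
    return query.upper()


# Tiny illustrative dataset; real ones mix production queries, research,
# synthetic examples, and expert-written cases.
dataset = [
    {"input": "hello", "must_contain": "HELLO"},
    {"input": "a very long query indeed", "must_contain": "SHORT"},
]

evals = {
    "contains_expected": lambda ex, out: ex["must_contain"] in out,
    "under_20_chars": lambda ex, out: len(out) <= 20,
}

pass_rates, failures = run_offline_evals(dataset, baseline, evals)
print(pass_rates)
print(len(failures), "failures to review with SMEs")
```

The failures list is the part SMEs would review; recurring failure patterns become new entries in `evals`.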
- 13:43 – 22:04
How evals address drift, cost, and guardrails (and where they don’t)
Evals are mapped directly to the prototype failure modes: they provide continuous measurement for drift, a way to compare model and cost tradeoffs, and the backbone for guardrails. Engineering constraints, however, need additional approaches.
- •Evals + observability help detect and respond to data drift early
- •Evals enable model benchmarking to choose cheaper models without quality loss
- •Guardrails become testable requirements (what the system must / must not do)
- •Engineering limitations aren’t fully solved by evals, but evals help surface issues
- •Better eval practice tends to involve SMEs, improving user empathy and collaboration
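Once each candidate model has an eval pass rate, choosing a cheaper model without quality loss becomes a simple selection. The sketch below assumes those pass rates were already measured on the same dataset; the model names, pass rates, and costs are made-up illustrative numbers, not benchmarks.

```python
def pick_cheapest_passing(candidates, min_pass_rate):
    """Return the lowest-cost candidate whose eval pass rate meets the bar,
    or None if no candidate qualifies."""
    eligible = [c for c in candidates if c["pass_rate"] >= min_pass_rate]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c["cost_per_1k_calls"])


# Illustrative values only.
candidates = [
    {"name": "large", "pass_rate": 0.96, "cost_per_1k_calls": 30.0},
    {"name": "medium", "pass_rate": 0.94, "cost_per_1k_calls": 6.0},
    {"name": "small", "pass_rate": 0.81, "cost_per_1k_calls": 1.0},
]

choice = pick_cheapest_passing(candidates, min_pass_rate=0.92)
print(choice["name"] if choice else "no model meets the bar")
```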
- 22:04 – 35:43
Evaluation methods & metrics: code checks, LLM judges, and legacy NLP metrics
This chapter drills into the “how” of measuring outputs: deterministic programmatic tests, subjective LLM-judge scoring, and when to use (or avoid) older NLP metrics like BLEU/ROUGE. The guiding principle is to use the cheapest reliable method for each metric.
- •Code-based evals: structure/length/format and other deterministic constraints
- •LLM-as-judge: tone, helpfulness, relevance, guardrail compliance, qualitative criteria
- •Hybrid approach: LLM flags issues, humans arbitrate critical/ambiguous cases
- •BLEU/ROUGE measure word overlap against golden outputs but can miss semantic differences
- •Use simple tools when sufficient; avoid unnecessary costly eval pipelines
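The weakness of overlap metrics is easy to show with a simplified unigram F1 score. This is a ROUGE-1-style sketch written for illustration, not the official ROUGE implementation, and the example sentences are invented.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Simplified ROUGE-1-style score: F1 over overlapping lowercase words."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the role requires python experience"
negated = "the role requires no python experience"       # opposite meaning
paraphrase = "candidates need a background in python"    # same meaning

print(round(unigram_f1(negated, reference), 3))
print(round(unigram_f1(paraphrase, reference), 3))
```

The negated sentence shares almost every word with the reference and scores far higher than the faithful paraphrase, which is the kind of semantic miss that pushes qualitative criteria toward LLM judges or human review.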
- 35:43 – 37:10
Offline evals: the AI PRD and ‘hill-climbing’ to ship quality
Offline evals are positioned as the pre-launch gating system and effectively the PRD for AI engineers. The team iterates on prompts/models/tools until eval performance meets thresholds, then ships with confidence rather than hope.
- •Offline evals run before launch or major releases to validate changes
- •AI PM-defined evals become the spec engineers optimize against
- •Iteration pattern: measure → improve low-scoring areas → re-measure (‘hill-climbing’)
- •Offline evals need not be perfect, but must cover known risks and edge cases
- •Edge cases/corner cases are encoded as evals, not just prose in a PRD
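A pre-launch gate of this kind can be expressed as a comparison of measured pass rates against per-metric thresholds. The sketch below is one possible shape for it; the metric names and threshold values are assumptions chosen for the example.

```python
def ship_gate(pass_rates, thresholds):
    """Compare pass rates to thresholds. Returns whether to ship and the
    metrics below threshold, largest gap first (where to hill-climb next)."""
    gaps = {}
    for metric, required in thresholds.items():
        actual = pass_rates.get(metric, 0.0)  # a missing metric counts as 0
        if actual < required:
            gaps[metric] = required - actual
    ordered = sorted(gaps, key=gaps.get, reverse=True)
    return {"ship": not gaps, "below_threshold": ordered}


# Hypothetical thresholds and one round of measured results.
thresholds = {"factuality": 0.95, "compliance": 1.0, "tone": 0.80}
round_1 = {"factuality": 0.90, "compliance": 0.99, "tone": 0.85}

decision = ship_gate(round_1, thresholds)
print(decision)
```

Each iteration on prompts, models, or tools produces a new set of pass rates; the gate is rerun until `below_threshold` is empty.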
- 37:10 – 39:26
Online evals & observability: sampling in production + drift detection
After launch, the same eval concepts extend into production via observability platforms and sampling-based checks. Online evals catch drift, regressions, and changing user expectations, creating a continuous improvement loop.
- •Online evals monitor real traffic; typically sampled due to cost (e.g., 1/10, 1/100)
- •Observability tools (e.g., Arize, TruLens) support production monitoring and alerts
- •Set thresholds (e.g., accuracy/compliance pass rates) that trigger intervention
- •Cycle repeats: production learnings feed back into dataset, prompts, and evals
- •Online monitoring complements offline testing for real-world variability
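Sampling and threshold alerts can be sketched without any particular observability platform. The code below is an assumption-laden illustration: it samples by hashing a hypothetical request ID so the same request is always treated the same way, and it flags a batch when the pass rate drops under a chosen threshold. It does not use the Arize or TruLens APIs.

```python
import hashlib


def should_sample(request_id, rate=100):
    """Select roughly 1 in `rate` requests, deterministically per request ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % rate == 0


def needs_intervention(eval_results, min_pass_rate):
    """eval_results is a list of booleans from sampled online evals.
    Returns True when the observed pass rate falls below the threshold."""
    if not eval_results:
        return False  # nothing sampled yet, so nothing to alert on
    return sum(eval_results) / len(eval_results) < min_pass_rate


sampled = [rid for rid in (f"req-{i}" for i in range(10_000)) if should_sample(rid, 10)]
print(len(sampled), "of 10000 requests selected for online evals")

print(needs_intervention([True] * 90 + [False] * 10, min_pass_rate=0.95))
```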
- 39:26 – 57:52
Case study deep dive: INDMoney Mind / Robinhood-style stock Q&A assistant
Ankit reverse-engineers a finance assistant feature and shows how a PM would define constraints, compliance guardrails, and evaluation dimensions in a regulated domain. The example highlights how expected behavior becomes explicit metrics and test criteria.
- •Feature: contextual stock Q&A inside a trading app via retrieval + LLM answers
- •Guardrails: concise outputs; factual grounding; no direct buy/sell recommendations (regulatory)
- •Metrics categories: information quality, safety/compliance, behavior constraints, performance/latency
- •PM artifacts: layered architecture (UI, orchestration, retrieval, LLM, analytics) informing eval scope
- •Prompts are structured as modular principles, which makes iteration easier than editing one monolithic prompt
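Two ideas from this chapter lend themselves to a short sketch: assembling a prompt from modular principles, and turning the "no buy/sell recommendations" guardrail into a testable check. The module texts, the 100-word limit, and the regular expression are all assumptions made for illustration, not the assistant's actual prompt or compliance rules. The pattern is a crude first-pass filter that would miss many phrasings.

```python
import re

# Hypothetical prompt modules; each principle can be edited or disabled alone.
PROMPT_MODULES = {
    "role": "Answer questions about a stock using only the retrieved context.",
    "concise": "Keep answers under 100 words.",
    "grounding": "If the context does not contain the answer, say you do not know.",
    "compliance": "Never tell the user to buy, sell, or hold a security.",
}


def build_system_prompt(modules, enabled):
    """Join the enabled modules, in order, into one system prompt."""
    return "\n".join(f"- {modules[name]}" for name in enabled)


# Crude pattern for direct advice such as "you should buy" or "I recommend selling".
ADVICE_PATTERN = re.compile(
    r"\b(you should|i recommend|we recommend)\s+(buy|sell|hold)", re.IGNORECASE
)


def violates_no_advice(answer):
    """Code-based guardrail eval: True if the answer contains direct advice."""
    return bool(ADVICE_PATTERN.search(answer))


prompt = build_system_prompt(PROMPT_MODULES, ["role", "grounding", "compliance"])
print(prompt)
print(violates_no_advice("You should buy this stock before earnings."))
print(violates_no_advice("The stock rose 4% after the earnings call."))
```

In a regulated domain a pattern like this would only be the cheap first layer, with an LLM judge or human reviewer handling phrasing it cannot catch.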
- 57:52 – 1:01:24
Real-world evaluation artifacts: datasets, thresholds, latency percentiles, and user feedback loops
The case study expands into concrete eval operations: dataset sourcing/maintenance, eval types (automated, LLM-judge, human), blocking criteria, online latency monitoring, and using hard/soft user feedback as additional signals.
- •Dataset sources: production logs, expert curation, research, and synthetic generation
- •Evals: factual/compliance/groundedness/structure checks + LLM judging for tone/balance/relevance
- •Human protocols: pass/fail vs 1–5 ratings; require written rationale for learning
- •Online performance: monitor P50/P95/P99 latency (averages hide tail pain)
- •Feedback signals: thumbs up/down (hard) + retries, abandonment, escalation (soft)
- •A/B test prompts/models; validate with real user behavior beyond synthetic evals
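The point about averages hiding tail latency can be checked with a small nearest-rank percentile function. The latency values below are synthetic, chosen only to make the effect visible.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies in milliseconds: most requests are fast, a few are very slow.
latencies_ms = [200] * 90 + [1500] * 8 + [6000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
print("mean:", mean_ms)
for p in (50, 95, 99):
    print(f"P{p}:", percentile(latencies_ms, p))
```

Here the mean is 420 ms and P50 is 200 ms, yet P95 is 1500 ms and P99 is 6000 ms, so one request in fifty waits six seconds while the average looks healthy.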
- 1:01:24 – 1:03:59
Evals aren’t QA rebranded: business impact examples & final takeaways
Ankit distinguishes eval work from traditional QA: QA reports on quality, while eval work reshapes the product through SME alignment and continuous system tuning. He and Aakash close with examples (Grammarly, GitHub Copilot, Klarna, support chatbots) and the overarching lesson that evals are ongoing guardrails, not a one-time task.
- •QA informs; eval-driven PM work transforms the product system through iteration
- •Examples: tone errors cascading (Grammarly), production breakage from missed cases (Copilot), funnel optimization (Klarna)
- •Support chatbots degrade when policies change unless continuously evaluated
- •Evals must evolve with products and user expectations; avoid ‘set and forget’
- •Core takeaway: evaluations are not optional for reliable AI outcomes