Aakash Gupta
The ONE AI Skill Every Product Manager NEEDS in 2026
At a glance
WHAT IT’S REALLY ABOUT
AI evals: the core PM skill to ship reliable products
- AI evals let PMs encode their taste and product judgment into the development critical path instead of relying on unscalable “vibe checks.”
- The single most important eval competency is systematic error analysis: identifying failure modes, quantifying them across traces, and turning findings into an improvement flywheel.
- Effective evals are domain-specific and are built through an iterative, scientific-method process—generic vendor metrics (e.g., “hallucination scores”) often fail in real products.
- Binary, rubric-driven pass/fail criteria typically outperform 1–5 rating scales because they force clarity, reduce calibration overhead, and are easier to align LLM-judges to.
- Evals become a product moat: a portable evaluation pipeline enables fast model swaps, safer iteration (avoiding overfitting), and makes fine-tuning a “last step” rather than the starting point (see the sketch after this list).
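For illustration only, here is a minimal Python sketch of what a portable, model-agnostic eval pipeline can look like; `EvalCase`, `run_evals`, and the commented-out model wrappers are hypothetical names, not tooling shown in the episode.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical minimal eval harness: the pipeline owns the test cases and the
# binary checks, so swapping models only means swapping the `model` callable.

@dataclass
class EvalCase:
    prompt: str
    passes: Callable[[str], bool]  # binary pass/fail check for this case

def run_evals(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case against `model` and return the pass rate."""
    results = [case.passes(model(case.prompt)) for case in cases]
    return sum(results) / len(results)

# Illustrative case; real suites come from error analysis on production traces.
cases = [
    EvalCase(
        prompt="Summarize: order #123 shipped late; customer wants a refund.",
        passes=lambda out: "refund" in out.lower(),
    ),
]
# baseline = run_evals(call_current_model, cases)     # hypothetical model wrapper
# candidate = run_evals(call_candidate_model, cases)  # swap models, keep the evals
```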
IDEAS WORTH REMEMBERING
5 ideas
Evals are how PMs “ship their judgment,” not just requirements docs.
Hamel argues evals put the PM’s taste directly on the product’s critical path, reducing reliance on engineers interpreting intent from PRDs and meetings. The eval rubric becomes an executable version of product quality.
Treat evals as an iterative scientific workflow, not a dashboard checkbox.
Hamel and Shreya emphasize skepticism, experimentation, and structured measurement; otherwise teams end up in “whack-a-mole” prompt tweaking without durable progress. The process of creating evals is where most of the value is generated.
Start with error analysis; it tells you what to evaluate and what to fix.
Shreya calls error analysis the hands-down most critical skill: review traces, identify failure modes, and quantify prevalence. This prevents infinite metric sprawl (toxicity/conciseness/hallucination, etc.) by grounding priorities in observed failures.
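To make “quantify prevalence” concrete, here is a small Python sketch (not from the episode; the traces and failure-mode labels are made up) that tallies how often each failure mode shows up in a batch of reviewed traces:

```python
from collections import Counter

# Hypothetical error-analysis tally: each reviewed trace gets zero or more
# failure-mode labels during review; prevalence decides what to fix first.
traces = [
    {"id": "t1", "failure_modes": ["wrong_price"]},
    {"id": "t2", "failure_modes": []},
    {"id": "t3", "failure_modes": ["wrong_price", "missing_cta"]},
    {"id": "t4", "failure_modes": ["tone_mismatch"]},
]

counts = Counter(mode for t in traces for mode in t["failure_modes"])
total = len(traces)

# The most common failure modes are the evals worth building first.
for mode, n in counts.most_common():
    print(f"{mode}: {n}/{total} traces ({n / total:.0%})")
```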
Codify vibe checks into binary rubrics to scale and reduce ambiguity.
Binary pass/fail criteria force a shipping-level decision and avoid the interpretability problem of averages like 3.2 vs 3.7. They also make LLM-as-judge alignment more tractable because you can provide clear pass/fail examples.
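A minimal sketch of how a binary rubric can drive an LLM-as-judge; the rubric items, `JUDGE_PROMPT`, and the `call_llm` wrapper are assumptions for illustration, not the speakers’ actual prompt:

```python
# Hypothetical binary LLM-as-judge setup: the rubric forces a shipping-level
# PASS/FAIL verdict instead of a 1-5 rating. `call_llm` is an assumed wrapper
# around whatever model API the team already uses.

JUDGE_PROMPT = """You are reviewing an AI-written sales email.

Rubric (answer FAIL if ANY check fails, otherwise PASS):
1. Every factual claim about the product appears in the provided product notes.
2. The email addresses the recipient's stated pain point.
3. The email contains exactly one call to action.

<examples of PASS and FAIL emails, with reasons, go here>

Email:
{email}

Answer with a single word: PASS or FAIL."""

def judge(email: str, call_llm) -> bool:
    """Return True only if the judge answers PASS."""
    verdict = call_llm(JUDGE_PROMPT.format(email=email)).strip().upper()
    return verdict.startswith("PASS")
```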
Domain-specific hallucination evals beat generic “hallucination scores.”
For cases like AI-written sales emails, the team must define what hallucination means in that domain, label examples, and iteratively prompt/validate an LLM judge against human labels. Generic vendor metrics often don’t match your product’s real failure modes, leading to dashboards nobody trusts.
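One plausible way to validate such a judge against human labels is a simple agreement check on a small labeled set, sketched below with placeholder data:

```python
# Hypothetical judge-validation step: before trusting the judge as a metric,
# compare its PASS/FAIL verdicts to human labels on the same examples.
# The label lists below are illustrative placeholders, not real data.
human_labels = [True, False, True, True, False, False, True, False]  # True = pass
judge_labels = [True, False, True, False, False, True, True, False]

pairs = list(zip(human_labels, judge_labels))
agreement = sum(h == j for h, j in pairs) / len(pairs)

# Split disagreements: an over-lenient judge and an over-harsh judge
# usually need different prompt fixes.
false_passes = sum(j and not h for h, j in pairs)  # judge passed, human failed it
false_fails = sum(h and not j for h, j in pairs)   # judge failed, human passed it

print(f"agreement: {agreement:.0%}  false passes: {false_passes}  false fails: {false_fails}")
```

Iterating on the judge prompt until agreement with human labels is acceptably high is what makes the resulting dashboard worth trusting.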
WORDS WORTH SAVING
5 quotes
Hands down, error analysis, the ability to look at your outputs and systematically figure out what makes for a bad output, quantify how many of these failure modes you see in a big batch of traces for your system, and then figure out how to turn that measurement into a continuous flywheel of improving your product.
— Shreya Shankar
Evals give you a way as a PM to inject your taste and your judgment directly into the critical path of the AI product being developed.
— Hamel Husain
Your vibe checks are very important, but they don't scale, right, 'cause they involve you, the human.
— Shreya Shankar
A lot of times, non-binary evals, like ratings of one to five, that is a smell of intellectual laziness.
— Hamel Husain
I think evals are the moat for AI products, and, like, truly nothing else.
— Shreya Shankar