Aakash Gupta
The ONE AI Skill Every Product Manager NEEDS in 2026
CHAPTERS
Why AI evals are a must-have PM skill: taste, iteration speed, and scale
Hamel explains why PMs need to be strong at AI evaluations: they let PMs encode product taste into the build loop, create faster iteration cycles, and scale judgment across many AI workloads. Evals are framed as leverage—not a monotonous checkbox task.
What an “eval” is (and why products need a suite of them)
Shreya defines an eval as systematic measurement of some aspect of quality, made up of a criterion and a method of measuring it. In practice, real products require multiple evals because quality is multi-dimensional.
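Shreya's definition (a criterion plus a method of measuring it) can be sketched as a tiny data structure. The specific criteria and checker functions below are illustrative placeholders, not examples from the episode:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Eval:
    """One eval: a quality criterion plus a method that measures it."""
    criterion: str
    measure: Callable[[str], bool]  # binary pass/fail on a model output

# A real product needs a *suite* of evals, because quality is multi-dimensional.
suite = [
    Eval("stays under length limit", lambda out: len(out) <= 500),
    Eval("mentions the refund policy", lambda out: "refund" in out.lower()),
]

def run_suite(output: str) -> dict[str, bool]:
    """Run every eval in the suite against a single model output."""
    return {e.criterion: e.measure(output) for e in suite}

results = run_suite("Our refund policy allows returns within 30 days.")
```

The measurement method here is a trivial function; in practice it might be a regex, a retrieval check, or an LLM judge, but the criterion/method pairing stays the same.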
Why “getting evals right” solves the hardest part of AI products
Both guests argue the hardest part is the process of creating evals—because it forces teams to examine data, define success, and iterate scientifically. Once that groundwork is done, teams can move faster and focus on other product improvements.
From “vibe checks” to scalable quality: binary rubrics and why they work
Shreya explains that vibe checks are important but don’t scale or transfer across people. She recommends translating vibe checks into precise, binary pass/fail rubrics with examples, especially for aligning LLM-as-judge systems.
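A vibe check translated into a binary rubric might look like the sketch below. The rubric text is an invented example, and the keyword-based judge is a stand-in: in a real system the rubric plus labeled examples would go into an LLM-as-judge prompt whose PASS/FAIL verdict gets parsed.

```python
# A vibe check ("the tone feels pushy") rewritten as a binary rubric.
RUBRIC = """\
Criterion: the reply is not pushy.
PASS if the reply offers help without pressuring the user to buy.
FAIL if it urges immediate purchase or uses urgency language.
Example PASS: "Happy to share pricing whenever you're ready."
Example FAIL: "Buy now -- this offer expires in one hour!"
"""

def judge(reply: str) -> bool:
    """Stand-in judge that flags urgency language. A real system would
    send RUBRIC plus the reply to an LLM and parse its PASS/FAIL verdict."""
    urgency = ("buy now", "act fast", "expires", "last chance")
    return not any(phrase in reply.lower() for phrase in urgency)
```

The point of the binary framing is that pass/fail with worked examples transfers between labelers (human or LLM) far better than a vague 1-5 "tone" score.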
Why generic dashboards fail: hallucinations and domain-specific eval design
Using Aakash’s email-writing hallucination story, Hamel warns against generic vendor metrics (e.g., off-the-shelf hallucination scores). Instead, teams must characterize their specific failure modes, label examples, and iteratively build a judge they can trust.
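Building "a judge they can trust" typically means checking judge verdicts against human labels on the same outputs. A minimal agreement check, with labels invented purely for illustration, might look like:

```python
# Human pass/fail labels vs. judge verdicts on the same six outputs.
human_labels = [True, True, False, False, True, False]
judge_verdicts = [True, True, False, True, True, False]

# Fraction of outputs where the judge matches the human label.
agreement = sum(h == j for h, j in zip(human_labels, judge_verdicts)) / len(human_labels)
print(f"judge/human agreement: {agreement:.0%}")
```

Only once agreement is high enough on labeled examples does it make sense to let the judge score unlabeled traffic at scale; a generic vendor "hallucination score" skips this calibration step entirely.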
ML roots of evals: lessons from Airbnb (ranking, fraud, LTV)
Hamel connects LLM evaluation to classic machine learning evaluation for stochastic, non-deterministic systems. He shares Airbnb ML use cases and notes that LLM product teams can borrow proven ML evaluation discipline without needing a full ML curriculum.
Evaluating RAG like search: what changes when the consumer is an LLM
Shreya and Hamel explain that RAG evaluation splits into retrieval (classic search metrics) and generation. The key nuance: LLMs can handle large context windows, changing tolerances like recall@K and the importance of rank position versus human search.
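The retrieval half reduces to classic search metrics. Recall@K, mentioned above, can be computed as follows (the document IDs are made up):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k results."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

retrieved = ["d3", "d7", "d1", "d9", "d4"]  # ranked retrieval output
relevant = {"d1", "d4", "d8"}               # ground-truth relevant docs

# With a large LLM context window a bigger k is tolerable, so recall@5
# can matter more than exact rank position, unlike human-facing search.
score = recall_at_k(retrieved, relevant, k=5)
```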
Code generation and verifiable domains: why Copilot-style evals work
Hamel describes why developer tools were early AI successes: developers are the domain experts and code is more verifiable. Copilot-like systems can run tests at scale, creating powerful harnesses that unlock rapid iteration and measurable improvements.
PM-defined evals enable “hill climbing”—and the trap of overfitting
They explain why engineers excel at optimizing well-defined metrics, but PMs must define those metrics for LLM products. They also caution that hill-climbing can lead to overfitting, and discuss how to detect and prevent it with ML-style safeguards.
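One ML-style safeguard against overfitting an eval set is the classic held-out split: hill-climb on one slice and verify on another. The split below is a sketch with integer stand-ins for labeled examples:

```python
import random

examples = list(range(100))  # stand-ins for labeled eval examples
random.seed(0)               # fixed seed so the split is reproducible
random.shuffle(examples)

dev, holdout = examples[:80], examples[80:]
# Iterate prompts/models against `dev` only. If scores keep climbing on
# `dev` but stall or drop on `holdout`, the team is overfitting the eval.
```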
What “good evals” look like in the wild: interfaces, scoped judges, and strong proxies
Shreya notes no company has solved evals perfectly, but highlights converging best practices: custom labeling interfaces, well-scoped judges, and metrics that correlate with product success (like next-token prediction in coding). The discussion also covers why autocomplete is better in code than email.
Foundation benchmarks vs business evals: what OpenAI can’t do for you
They distinguish general-purpose model benchmarks (MMLU, HumanEval) from domain-specific product evals. Foundation model labs focus on the former, while each company must define what “good” means for its own product taste, making domain evals a defensible differentiator.
Evals as the moat (and where fine-tuning fits: last, not first)
Shreya and Hamel argue eval systems and pipelines are the true moat, more than the model choice. Fine-tuning should come after evals and simpler levers (model upgrades, decomposition, RAG), because it adds ongoing operational and maintenance complexity.
Prompting vs RAG vs fine-tuning: the “Three Gulfs” decision framework
They introduce the Three Gulfs framework to decide which lever to use. Prompting addresses the gulf of specification (clear requirements), while RAG and fine-tuning address the gulf of generalization when the model lacks context or capability.
A roadmap to mastery: error analysis, grounded theory, LLM-judge iteration, and productionization
They lay out a learning roadmap based on their course reader, emphasizing error analysis as the biggest bottleneck and highest leverage. They describe open/axial coding from social science (grounded theory), building/validating LLM judges, handling RAG/agents/multi-turn, and production workflows like CI/CD.
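Open coding in this context means reading model traces and tagging each failure with free-form labels, then tallying the tags to surface the dominant failure modes; axial coding later groups them into broader categories. The tags below are invented examples:

```python
from collections import Counter

# Free-form tags assigned while reading model traces (open coding).
tags = [
    "hallucinated date", "wrong tone", "hallucinated date",
    "missed instruction", "hallucinated date", "wrong tone",
]

# Tallying the tags shows where the highest-leverage fix is.
top_failures = Counter(tags).most_common(2)
```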
Course-building business tangent: consulting origins, pricing, and why the live cohort ends
Hamel explains his transition from industry roles to consulting and teaching, plus the economics and positioning of a high-priced course. They share why they’re limiting live cohorts (time intensity, protecting quality) and plan to reinvest into a book and other formats.