At a glance
WHAT IT’S REALLY ABOUT
Build a PM agent, trace it, eval it, improve iteratively
- The episode live-builds a “product taste” PM agent that pulls GitHub issues/discussions/releases, scores items by priority, and outputs a structured markdown PM report of top pain points and roadmap recommendations.
- A core message is that teams should start with real trace data and observability first, then use that evidence to design evals that meaningfully measure and improve agent behavior.
- Aparna shows how one-command/skill-based instrumentation can stream traces into Arize in real time, enabling step-by-step debugging via traces and spans.
- Claude/Arize can suggest initial evals (e.g., groundedness, priority alignment, actionability) and a more granular “priority accuracy” eval to judge per-issue prioritization, which then drives targeted iteration.
- The conversation argues that on AI-native teams, PMs are increasingly indistinguishable from engineers: code is cheap, so product taste and rapid iteration are the differentiators. Enterprises, meanwhile, must focus on context access (context graphs), data silos, and safe human-in-the-loop change control.
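To make the "score items by priority, then output a markdown report" step concrete, here is a minimal sketch. It is not the agent built in the episode: the `Item` fields, the `LABEL_WEIGHTS` values, and the engagement cap are all assumptions chosen for illustration, and the real agent presumably has an LLM do part of this work.

```python
from dataclasses import dataclass

# Hypothetical weights, not taken from the episode.
LABEL_WEIGHTS = {"bug": 3.0, "feature": 1.5, "question": 0.5}


@dataclass
class Item:
    title: str
    label: str      # e.g. "bug", "feature", "question"
    reactions: int  # engagement signal such as thumbs-up count
    comments: int


def priority_score(item: Item) -> float:
    """Label weight plus an engagement bonus capped at 5 points."""
    base = LABEL_WEIGHTS.get(item.label, 1.0)
    engagement = min(item.reactions + item.comments, 50) / 10
    return base + engagement


def render_report(items: list[Item], top_n: int = 3) -> str:
    """Rank items by score and emit a small markdown report."""
    ranked = sorted(items, key=priority_score, reverse=True)[:top_n]
    lines = ["# PM Report", "", "## Top pain points", ""]
    for rank, item in enumerate(ranked, start=1):
        score = priority_score(item)
        lines.append(f"{rank}. **{item.title}** ({item.label}, score {score:.1f})")
    return "\n".join(lines)
```

Keeping the scoring in one small function makes it easy to change the weighting later, which is the kind of adjustment the evals described below are meant to drive.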
IDEAS WORTH REMEMBERING
5 ideas
Start evals after you have trace data, not before.
They recommend building a usable first version, instrumenting it, and using traces to reveal failure modes; eval ideas should emerge from observed behavior rather than speculation.
Product taste becomes the PM’s primary edge when code is cheap.
As coding agents lower implementation cost, the differentiator shifts to having strong opinions about UX/outcomes and converting customer feedback into crisp priorities and solutions.
Tracing is the debugging “playback” that makes agent iteration practical.
A trace shows the full step-by-step path (tool calls, LLM calls); a span is a single step within that trace, letting you pinpoint where scoring, retrieval, or summarization went wrong.
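The trace/span relationship can be sketched as a simple data structure. This is an illustrative model only, not Arize's schema or SDK: the class names, fields, and helper methods are assumptions, and real tracing systems usually nest spans under parent spans rather than keeping a flat list.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str                    # one step, e.g. "fetch_issues" or "score_items"
    kind: str                    # "tool" or "llm"
    duration_ms: float
    error: Optional[str] = None  # set when the step failed


@dataclass
class Trace:
    trace_id: str
    spans: list[Span] = field(default_factory=list)  # steps in execution order

    def first_failure(self) -> Optional[Span]:
        """Return the earliest span that recorded an error, if any."""
        return next((s for s in self.spans if s.error), None)

    def slowest(self) -> Span:
        """Return the span that took the longest."""
        return max(self.spans, key=lambda s: s.duration_ms)
```

With a structure like this, "where did it go wrong?" becomes a lookup over spans instead of a guess about the final output.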
Use AI-suggested evals as a fast starting point, but expect to refine them.
Claude can propose evals like groundedness or actionability, but Aparna emphasizes ruthless human critique and periodic re-alignment as data and user expectations drift.
Granular evals often beat end-report evals early on.
Instead of only judging the final PM report, they highlight per-issue “priority accuracy” checks to see whether each item’s score matches your intended weighting (e.g., bugs should rank higher).
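A per-issue check of this kind might look like the sketch below. It assumes a human reviewer has labeled the expected priority for each issue, and it uses a plain equality comparison where the episode's version may rely on an LLM judge; the `ScoredIssue` fields and the "high"/"medium"/"low" buckets are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScoredIssue:
    title: str
    agent_priority: str     # assigned by the agent: "high", "medium", or "low"
    expected_priority: str  # assumed human label for what it should be


def priority_accuracy(issues: list[ScoredIssue]) -> tuple[float, list[ScoredIssue]]:
    """Return the share of issues the agent prioritized as expected,
    along with the mismatches to inspect."""
    if not issues:
        return 0.0, []
    mismatches = [i for i in issues if i.agent_priority != i.expected_priority]
    accuracy = 1 - len(mismatches) / len(issues)
    return accuracy, mismatches
```

Returning the mismatches alongside the number matters more than the number itself early on: each mismatch points at a specific issue whose trace can be opened and read.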
WORDS WORTH SAVING
5 quotes
Code is so cheap to go create, which means that product taste is really the alpha today.
— Aparna Dhinakaran
Any product person that has used observability and is looking at their traces and looking at your evals, you're probably already in the top 1% of PMs in the world right now.
— Aparna Dhinakaran
I get excited when I see that evals are wrong, because then it gives me a chance to know that there's improvement that could be made.
— Aparna Dhinakaran
At the AI native teams, I am seeing that, that the gap between, um, a PM and an engineer is, is indistinguishable.
— Aparna Dhinakaran
The data that we all collect, the evals and observability, is the foundation for self-improving agents.
— Aparna Dhinakaran
High quality AI-generated summary created from speaker-labeled transcript.