Aakash Gupta

Aparna Full Pod Final AR

Aparna Dhinakaran, CPO of Arize AI ($131M raised), shows exactly how to build a PM agent in Claude Code, instrument it with observability, run evals against it, and close the self-improvement loop, all in one live session. If you want to understand what serious AI eval practice looks like in 2025, this is the episode.

Full Writeup: [VERIFY - newsletter URL]
Transcript: [VERIFY - transcript URL]

---

Timestamps:
00:01 - What PMs are getting wrong when building agents
04:00 - Screen share begins: building the PM agent live
07:05 - What a product taste agent actually does
09:10 - When to start running evals
10:15 - Building the agent in Claude Code from scratch
16:13 - Preview of a pre-built version with tracing active
21:34 - Instrumenting the agent for observability (one command)
27:26 - Traces streaming into Arize in real time
30:38 - Asking Claude to suggest evals
34:36 - Running the priority accuracy eval
46:10 - Vibe evals vs. axial coding: when to use each
52:46 - Looping the improvement automatically
01:04:01 - What AI PMs need to do differently
01:09:05 - What enterprise PMs can realistically take on now
01:22:10 - The two things to do this weekend

---

Thanks to our sponsors:
1. Superhuman - Sign up and get 1-month free of Superhuman Mail with my link: superhuman.com/akash (given by brand - Kartik)
2. Land PM Job - Land your next PM role faster - https://landpmjob.com
3. Vanta - Automate your compliance - http://vanta.com/aakash
4. Product Faculty - Get $550 off their AI PM Certification with code AAKASH550C7 - https://maven.com/product-faculty/ai-product-management-certification?promoCode=AAKASH550C7
5. Bolt - Ship AI-powered products 10x faster - https://bolt.new

---

Key Takeaways:
1. Trace before you eval - A trace is the full step-by-step playback of what your agent did. Without it, you have no evidence base for evals. Every LLM call, every tool call, every intermediate output needs to be visible before you write a single eval.
2. A span is your unit of evaluation - A span is one discrete step inside a trace. Evals run at the span level, not the trace level. "Did this specific scoring step get the priority right?" is a more useful question than "was the whole run good?"
3. Instrumentation is now a one-command job - Claude Code's instrumentation skills can set up observability for your agent automatically. Arize Phoenix's skill looks at your codebase, identifies the LLM calls and tool calls, and wires them to the tracing layer. No engineering support required.
4. The vibe eval is a draft, not a verdict - An LLM can suggest what your evals should test by looking at your traces. That suggestion will not know your bug-first policy, your comp logic, or your definition of "critical." Treat it as v0 and refine against your actual judgment.
5. When evals fire, two things could be wrong - The agent produced a bad output. Or the eval is miscalibrated. Reading the flagged span yourself is the only way to know which one needs fixing. Both are normal. Both are good news.
6. Evals drift and need regular realignment - Your priorities change. Your bug policy changes. Your product changes. An eval calibrated to last quarter will start misfiring this quarter. Regular alignment to human feedback is maintenance, not a failure.
7. The self-improvement loop is already running at the best teams - Fetch all spans where evals fired. Group by failure category. Propose a specific prompt fix. Review and approve. Ship the new version. This loop runs on a schedule and requires a human at the approval step.
8. Enterprise PMs: start with one internal agent - Not a customer-facing product. An internal tool that takes four hours off your week. Once you have it, you will naturally want to trace it. That is when observability starts to matter to you personally.
9. The context graph is the enterprise unlock - Agents are only as useful as the context they have. Enterprise data lives in silos. The teams breaking through are building unified context layers that give one agent access to CRM, Gong, analytics, GitHub, and Slack.
10. Product taste is still the alpha - Code is cheap now. Shipping speed is table stakes. The PMs who pull ahead are the ones with the sharpest judgment about what to build, and the loops that make their agents better every day.

---

Where to find Aparna Dhinakaran:
LinkedIn: [VERIFY - Aparna LinkedIn URL]
Arize AI: https://arize.com

Where to find Aakash:
Twitter: https://www.x.com/aakashg0
LinkedIn: https://www.linkedin.com/in/aakashgupta/
Newsletter: https://www.news.aakashg.com

#AIagents #ProductManagement

---

About Product Growth: The world's largest podcast focused solely on product + growth, with over 200K+ listeners. Subscribe and turn on notifications.
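The self-improvement loop in takeaway 7 (fetch flagged spans, group by failure category, propose a fix, get human approval) can be sketched in a few lines. This is a minimal illustration, not Arize's implementation: the record fields (`span_id`, `label`, `category`), the function names, and the `approve` callback are all hypothetical, and `propose_fix` is a placeholder for whatever drafts the prompt change in practice.

```python
from collections import defaultdict

def group_failures(eval_results):
    """Group the span ids of failed evals by failure category."""
    groups = defaultdict(list)
    for result in eval_results:
        if result["label"] == "fail":
            groups[result["category"]].append(result["span_id"])
    return dict(groups)

def propose_fix(category, span_ids):
    """Placeholder proposal; a real loop might ask an LLM to draft this."""
    return f"Revise the prompt to address '{category}' ({len(span_ids)} flagged spans)."

def run_loop(eval_results, approve):
    """Return only the proposals a human reviewer approved."""
    shipped = []
    for category, span_ids in group_failures(eval_results).items():
        proposal = propose_fix(category, span_ids)
        if approve(proposal):  # human-in-the-loop gate
            shipped.append(proposal)
    return shipped

# Hypothetical eval output for one scheduled run.
eval_results = [
    {"span_id": "s1", "label": "fail", "category": "bug ranked too low"},
    {"span_id": "s2", "label": "pass", "category": None},
    {"span_id": "s3", "label": "fail", "category": "bug ranked too low"},
    {"span_id": "s4", "label": "fail", "category": "ungrounded claim"},
]

groups = group_failures(eval_results)
# Stand-in for a reviewer who only approves the bug-ranking fix.
shipped = run_loop(eval_results, approve=lambda proposal: "bug" in proposal)
```

The approval callback is the important design choice here: nothing ships unless a reviewer says yes, which matches the requirement of a human at the approval step.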

Aparna Dhinakaran (guest), Aakash Gupta (host)
May 21, 2026, 1h 19m

At a glance

WHAT IT’S REALLY ABOUT

Build a PM agent, trace it, eval it, improve iteratively

  1. The episode live-builds a “product taste” PM agent that pulls GitHub issues/discussions/releases, scores items by priority, and outputs a structured markdown PM report of top pain points and roadmap recommendations.
  2. A core message is that teams should start with real trace data and observability first, then use that evidence to design evals that meaningfully measure and improve agent behavior.
  3. Aparna shows how one-command/skill-based instrumentation can stream traces into Arize in real time, enabling step-by-step debugging via traces and spans.
  4. Claude/Arize can suggest initial evals (e.g., groundedness, priority alignment, actionability) and a more granular “priority accuracy” eval to judge per-issue prioritization, which then drives targeted iteration.
  5. The conversation argues AI-native PMs are increasingly indistinguishable from engineers: code is cheap, product taste and rapid iteration are the differentiators, while enterprises must focus on context access (context graphs), data silos, and safe human-in-the-loop change control.

IDEAS WORTH REMEMBERING

5 ideas

Start evals after you have trace data, not before.

The speakers recommend building a usable first version, instrumenting it, and using traces to reveal failure modes; eval ideas should emerge from observed behavior rather than speculation.

Product taste becomes the PM’s primary edge when code is cheap.

As coding agents lower implementation cost, the differentiator shifts to having strong opinions about UX/outcomes and converting customer feedback into crisp priorities and solutions.

Tracing is the debugging “playback” that makes agent iteration practical.

A trace shows the full step-by-step path (tool calls, LLM calls); a span is a single step within that trace, letting you pinpoint where scoring, retrieval, or summarization went wrong.
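To make the trace and span distinction concrete, here is a minimal sketch of the two as plain data structures. The field names, step names, and sample values are assumptions for illustration only; they do not reproduce the schema of Arize, Phoenix, or any other tracing tool.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One discrete step inside a trace (hypothetical schema)."""
    name: str    # e.g. "fetch_issues" or "score_issue"
    kind: str    # "llm" or "tool"
    input: str
    output: str

@dataclass
class Trace:
    """The full step-by-step playback of one agent run."""
    trace_id: str
    spans: list[Span] = field(default_factory=list)

    def find(self, name: str) -> list[Span]:
        """Return every span with the given step name, in run order."""
        return [span for span in self.spans if span.name == name]

# A made-up run of the PM agent: fetch, score each item, write the report.
trace = Trace("run-001", [
    Span("fetch_issues", "tool", "repo=example/repo", "2 issues"),
    Span("score_issue", "llm", "Crash on login", "priority=high"),
    Span("score_issue", "llm", "Add dark mode", "priority=high"),
    Span("write_report", "llm", "scored issues", "# PM report"),
])

# Pinpoint the scoring steps instead of judging the whole run.
scoring_steps = trace.find("score_issue")
for step in scoring_steps:
    print(step.input, "->", step.output)
```

Pulling out only the `score_issue` spans is what lets you ask whether a specific scoring step went wrong, rather than whether the whole run was good.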

Use AI-suggested evals as a fast starting point—but expect to refine them.

Claude can propose evals like groundedness or actionability, but Aparna emphasizes ruthless human critique and periodic re-alignment as data and user expectations drift.
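One simple way to notice that an eval needs re-alignment is to compare its labels with human labels on the same spans. The sketch below assumes a plain agreement rate and a 0.9 threshold; both are illustrative choices, not something stated in the episode.

```python
def agreement_rate(eval_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of spans where the eval's label matches the human's label."""
    if len(eval_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not eval_labels:
        raise ValueError("need at least one labeled span")
    matches = sum(e == h for e, h in zip(eval_labels, human_labels))
    return matches / len(eval_labels)

# Hypothetical labels for four spans reviewed by both the eval and a PM.
eval_labels = ["pass", "fail", "pass", "pass"]
human_labels = ["pass", "fail", "fail", "pass"]

rate = agreement_rate(eval_labels, human_labels)

# Assumed threshold: below 90% agreement, revisit the eval's prompt or rubric.
needs_realignment = rate < 0.9
```

Re-running this check on a fresh human-labeled sample each cycle is one way to catch drift before the eval starts misfiring.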

Granular evals often beat end-report evals early on.

Instead of only judging the final PM report, they highlight per-issue “priority accuracy” checks to see whether each item’s score matches your intended weighting (e.g., bugs should rank higher).
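A per-issue check like this can be sketched with a deterministic rule standing in for the eval. In the episode the eval is run through Arize and Claude; the version below swaps in a plain lookup table so the idea is visible. The policy mapping, field names, and sample issues are hypothetical.

```python
# Assumed policy: bugs outrank feature requests, which outrank questions.
EXPECTED_PRIORITY = {"bug": "high", "feature": "medium", "question": "low"}

def priority_accuracy(scored_issues):
    """Check each scored issue against the policy and return per-issue results."""
    results = []
    for issue in scored_issues:
        expected = EXPECTED_PRIORITY[issue["label"]]
        results.append({
            "title": issue["title"],
            "expected": expected,
            "actual": issue["priority"],
            "correct": issue["priority"] == expected,
        })
    return results

# Hypothetical output of the agent's scoring step.
scored_issues = [
    {"title": "Crash on login", "label": "bug", "priority": "high"},
    {"title": "Add dark mode", "label": "feature", "priority": "high"},
    {"title": "How do I export?", "label": "question", "priority": "low"},
]

results = priority_accuracy(scored_issues)
accuracy = sum(r["correct"] for r in results) / len(results)
misranked = [r["title"] for r in results if not r["correct"]]
```

The per-issue results matter more than the aggregate number: `misranked` names exactly which scoring steps to read before deciding whether the agent or the eval needs fixing.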

WORDS WORTH SAVING

5 quotes

Code is so cheap to go create, which means that product taste is really the alpha today.

Aparna Dhinakaran

Any product person that has used observability and is looking at their traces and looking at your evals, you're probably already in the top 1% of PMs in the world right now.

Aparna Dhinakaran

I get excited when I see that evals are wrong, because then it gives me a chance to know that there's improvement that could be made.

Aparna Dhinakaran

At the AI native teams, I am seeing that the gap between a PM and an engineer is indistinguishable.

Aparna Dhinakaran

The data that we all collect, the evals and observability, is the foundation for self-improving agents.

Aparna Dhinakaran

What PMs get wrong building agents and evals
Product taste agent for prioritizing user feedback
Claude Code workflow (build, cron/loop automation)
Tracing, spans, and observability instrumentation
Evals: vibe evals vs axial coding and alignment
Self-improving agent loops with human review gates
Phoenix open source vs Arize AX paid scaling

High quality AI-generated summary created from speaker-labeled transcript.
