Aakash Gupta

Aparna Full Pod Final AR

Aparna Dhinakaran, CPO of Arize AI ($131M raised), shows exactly how to build a PM agent in Claude Code, instrument it with observability, run evals against it, and close the self-improvement loop, all in one live session. If you want to understand what serious AI eval practice looks like in 2025, this is the episode.

Full Writeup: [VERIFY - newsletter URL]
Transcript: [VERIFY - transcript URL]

---

Timestamps:
00:01 - What PMs are getting wrong when building agents
04:00 - Screen share begins: building the PM agent live
07:05 - What a product taste agent actually does
09:10 - When to start running evals
10:15 - Building the agent in Claude Code from scratch
16:13 - Preview of a pre-built version with tracing active
21:34 - Instrumenting the agent for observability (one command)
27:26 - Traces streaming into Arize in real time
30:38 - Asking Claude to suggest evals
34:36 - Running the priority accuracy eval
46:10 - Vibe evals vs. axial coding: when to use each
52:46 - Looping the improvement automatically
01:04:01 - What AI PMs need to do differently
01:09:05 - What enterprise PMs can realistically take on now
01:22:10 - The two things to do this weekend

---

Thanks to our sponsors:
1. Superhuman - Sign up and get 1-month free of Superhuman Mail with my link: superhuman.com/akash (given by brand - Kartik)
2. Land PM Job - Land your next PM role faster - https://landpmjob.com
3. Vanta - Automate your compliance - http://vanta.com/aakash
4. Product Faculty - Get $550 off their AI PM Certification with code AAKASH550C7 - https://maven.com/product-faculty/ai-product-management-certification?promoCode=AAKASH550C7
5. Bolt - Ship AI-powered products 10x faster - https://bolt.new

---

Key Takeaways:
1. Trace before you eval - A trace is the full step-by-step playback of what your agent did. Without it, you have no evidence base for evals. Every LLM call, every tool call, every intermediate output needs to be visible before you write a single eval.
2. A span is your unit of evaluation - A span is one discrete step inside a trace. Evals run at the span level, not the trace level. "Did this specific scoring step get the priority right?" is a more useful question than "was the whole run good?"
3. Instrumentation is now a one-command job - Claude Code's instrumentation skills can set up observability for your agent automatically. Arize Phoenix's skill looks at your codebase, identifies the LLM calls and tool calls, and wires them to the tracing layer. No engineering support required.
4. The vibe eval is a draft, not a verdict - An LLM can suggest what your evals should test by looking at your traces. That suggestion will not know your bug-first policy, your comp logic, or your definition of "critical." Treat it as v0 and refine against your actual judgment.
5. When evals fire, two things could be wrong - The agent produced a bad output. Or the eval is miscalibrated. Reading the flagged span yourself is the only way to know which one needs fixing. Both are normal. Both are good news.
6. Evals drift and need regular realignment - Your priorities change. Your bug policy changes. Your product changes. An eval calibrated to last quarter will start misfiring this quarter. Regular alignment to human feedback is maintenance, not a failure.
7. The self-improvement loop is already running at the best teams - Fetch all spans where evals fired. Group by failure category. Propose a specific prompt fix. Review and approve. Ship the new version. This loop runs on a schedule and requires a human at the approval step.
8. Enterprise PMs: start with one internal agent - Not a customer-facing product. An internal tool that takes four hours off your week. Once you have it, you will naturally want to trace it. That is when observability starts to matter to you personally.
9. The context graph is the enterprise unlock - Agents are only as useful as the context they have. Enterprise data lives in silos. The teams breaking through are building unified context layers that give one agent access to CRM, Gong, analytics, GitHub, and Slack.
10. Product taste is still the alpha - Code is cheap now. Shipping speed is table stakes. The PMs who pull ahead are the ones with the sharpest judgment about what to build, and the loops that make their agents better every day.

---

Where to find Aparna Dhinakaran:
LinkedIn: [VERIFY - Aparna LinkedIn URL]
Arize AI: https://arize.com

Where to find Aakash:
Twitter: https://www.x.com/aakashg0
LinkedIn: https://www.linkedin.com/in/aakashgupta/
Newsletter: https://www.news.aakashg.com

#AIagents #ProductManagement

---

About Product Growth: The world's largest podcast focused solely on product + growth, with over 200K listeners. Subscribe and turn on notifications.

Guest: Aparna Dhinakaran
Host: Aakash Gupta
May 21, 2026 · 1h 19m

CHAPTERS

  1. 0:01 – 4:00

    What PMs get wrong about agents: build first, but plan for data

    Aakash and Aparna start by addressing a common misconception: teams worry about evals too early or treat them as optional. Aparna argues you should begin by building a real agent, then quickly shift to collecting trace data so you know what to evaluate and improve.

  2. 4:00 – 7:05

    Why “product taste” is the new PM advantage in an AI world

    Aparna reframes the PM job: code is cheap, so differentiation comes from judgment—what to build and why. She positions user feedback aggregation as the mechanism to develop “taste,” and proposes an agent that consumes feedback continuously.

  3. 7:05 – 9:10

    Designing the “product taste” PM agent: inputs, scoring, and outputs

    Aparna outlines the agent they’ll build: it gathers feedback (starting with GitHub) and produces a prioritized PM-style report. The key is turning messy qualitative inputs into a consistent priority score and actionable themes.

  4. 9:10 – 10:15

    Live build begins in Claude Code: repo setup and starter prompt

    They switch to hands-on building in the terminal using Claude Code. Aparna describes the minimal setup (directory/repo + API keys) and how to prompt the agent to fetch GitHub data and generate the PM report.

  5. 10:15 – 16:13

    Previewing a pre-built agent: what tracing reveals about agent behavior

    Before finishing the live build, Aparna shows a working version with tracing already enabled. She explains traces as a step-by-step replay of the agent’s actions—critical for debugging and later evaluation design.
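
To make the "step-by-step replay" idea concrete, here is a minimal sketch of a trace as an ordered list of spans. The class names, the `kind` labels, and the sample data are all assumptions made up for illustration; they are not the schema Arize or any tracing library uses.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One discrete step in a run, e.g. an LLM call or a tool call."""
    name: str
    kind: str    # "llm" or "tool" (hypothetical labels)
    input: str
    output: str


@dataclass
class Trace:
    """The ordered list of spans produced by one agent run."""
    trace_id: str
    spans: list[Span] = field(default_factory=list)

    def replay(self) -> list[str]:
        """Return a human-readable, step-by-step playback of the run."""
        return [
            f"{i}. [{s.kind}] {s.name}: {s.input!r} -> {s.output!r}"
            for i, s in enumerate(self.spans, start=1)
        ]


# Made-up run: one tool call followed by one LLM scoring step.
trace = Trace(trace_id="run-001")
trace.spans.append(Span("fetch_issues", "tool", "repo=example/repo", "2 issues"))
trace.spans.append(Span("score_issue", "llm", "Issue: login crash", "priority=P0"))

for line in trace.replay():
    print(line)
```

Reading the playback top to bottom is the manual version of what a trace viewer shows: which step ran, what it received, and what it returned.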

  6. 16:13 – 21:34

    When to start running evals: after you have traces and real usage data

    Aparna answers the “when evals” question directly: don’t start from hypotheticals. Instrument first, collect traces, then use observed failures and patterns to define evaluations that matter.

  7. 21:34 – 27:26

    One-command observability: instrumenting the agent with Arize skills

    Aparna demonstrates how Claude Code can instrument the codebase using Arize’s “skills,” turning tracing from an engineering-heavy project into a fast workflow. The instrumentation inspects the repo, detects the stack, and wires up trace export.
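
Conceptually, instrumentation wraps each LLM call and tool call so that every invocation emits a span. The sketch below is a hand-rolled stand-in, not the code that Arize's skill generates: the `traced` decorator, the in-memory `SPANS` list, and both agent functions are hypothetical and return canned data.

```python
import functools
import time

SPANS = []  # hypothetical in-memory sink; a real setup exports to a tracing backend


def traced(kind):
    """Decorator that records one span per call to the wrapped function."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            SPANS.append({
                "name": fn.__name__,
                "kind": kind,
                "input": {"args": args, "kwargs": kwargs},
                "output": result,
                "duration_s": time.perf_counter() - start,
            })
            return result
        return wrapper
    return decorator


@traced("tool")
def fetch_issues(repo):
    # Stand-in for a GitHub API call; returns canned data.
    return [{"title": "Login crash", "body": "App crashes on login"}]


@traced("llm")
def score_issue(issue):
    # Stand-in for an LLM call that assigns a priority.
    return "P0" if "crash" in issue["title"].lower() else "P2"


for issue in fetch_issues("example/repo"):
    score_issue(issue)

print([s["name"] for s in SPANS])
```

The agent functions themselves are unchanged apart from the decorator line, which is why this kind of wiring is a good candidate for automation.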

  8. 27:26 – 30:38

    Watching traces stream live into Arize: debugging via the execution graph

    With instrumentation active, they run the agent and confirm traces appear in real time. Aparna walks through the trace components and shows the final report structure generated by the agent.

  9. 30:38 – 34:36

    Asking Claude to suggest evals from traces: good defaults, imperfect first pass

    Aparna uses Claude/Arize to suggest candidate evals based on the observed traces. The suggestions skew toward end-report quality checks, but she pushes toward a more granular, issue-by-issue priority correctness evaluation.

  10. 34:36 – 46:10

    Running the priority accuracy eval: separating agent mistakes from eval mistakes

    They run a “priority accuracy” evaluator across spans to see where scoring appears wrong. Aparna highlights the core iterative tension in eval work: sometimes the agent is wrong, sometimes the eval is wrong, and you must calibrate both.
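
A span-level priority eval can be sketched as a comparison between the priority the agent assigned and the priority an evaluator thinks is right. Everything below is an assumption for illustration: the field names, the P0 to P3 scale, and the sample spans are made up, and in practice the evaluator column would come from an LLM judge rather than a hard-coded value.

```python
# Each dict is one scoring span: the agent's priority next to the evaluator's.
# All data here is made up for illustration.
spans = [
    {"span_id": "s1", "issue": "Login crash on iOS", "agent": "P0", "evaluator": "P0"},
    {"span_id": "s2", "issue": "Typo in docs footer", "agent": "P1", "evaluator": "P3"},
    {"span_id": "s3", "issue": "Export times out", "agent": "P2", "evaluator": "P1"},
]


def run_priority_eval(spans):
    """Return (agreement rate, flagged spans) for span-level priority checks."""
    flagged = [s for s in spans if s["agent"] != s["evaluator"]]
    accuracy = 1 - len(flagged) / len(spans)
    return accuracy, flagged


accuracy, flagged = run_priority_eval(spans)
print(f"agreement: {accuracy:.0%}")
for s in flagged:
    # A flag means "disagreement", not "agent is wrong": a person still has to
    # read the span and decide whether to fix the agent or the evaluator.
    print(f"review {s['span_id']}: agent={s['agent']} evaluator={s['evaluator']}")
```

The output of the eval is a review queue, not a verdict, which matches the point that both the agent and the eval need calibrating.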

  11. 46:10 – 52:46

    Vibe evals vs. axial coding: why grounding and alignment matter

    Aakash asks when “vibe evals” are acceptable versus more rigorous axial coding and human labeling. Aparna argues vibe-only approaches quickly hit limits; you need human-grounded alignment and continuous recalibration as data evolves.
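
One way to check whether an eval is grounded in human judgment is to have a person label a sample of spans and measure how often the eval agrees. The sketch below uses Cohen's kappa, a standard chance-corrected agreement statistic; the episode does not name a specific metric, so this choice and the sample verdicts are assumptions.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Agreement between two label lists, corrected for chance agreement."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label
    return (observed - expected) / (1 - expected)


# Made-up verdicts on the same eight spans.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]

raw = sum(h == j for h, j in zip(human, judge)) / len(human)
print(f"raw agreement: {raw:.2f}, kappa: {cohen_kappa(human, judge):.2f}")
```

Re-running this check on fresh human labels at a regular cadence is one way to notice when an eval has drifted away from current priorities.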

  12. 52:46 – 1:04:01

    Automating the improvement loop: self-improving agents with human review gates

    Aparna describes how teams can automate the full loop: detect failures via evals, propose fixes, generate PRs, and repeat—while keeping safety through code review and controlled “radius” of changes. The long-term vision is continuous improvement powered by observability + evals.
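
The loop described here (collect flagged spans, group by failure category, propose a fix, gate on human approval) can be sketched in a few functions. All names, categories, and data below are hypothetical, and `propose_fix` is a stand-in for what would be an LLM call that drafts a prompt change or a PR.

```python
from collections import defaultdict

# Made-up flagged spans, each tagged with a failure category by an eval.
flagged_spans = [
    {"span_id": "s2", "category": "over_prioritized_cosmetic"},
    {"span_id": "s3", "category": "under_prioritized_bug"},
    {"span_id": "s7", "category": "under_prioritized_bug"},
]


def group_by_category(spans):
    """Map each failure category to the span ids where it occurred."""
    groups = defaultdict(list)
    for span in spans:
        groups[span["category"]].append(span["span_id"])
    return dict(groups)


def propose_fix(category, span_ids):
    # Stand-in for an LLM call that drafts a prompt change from examples.
    return f"Add a scoring rule addressing '{category}' (seen in {len(span_ids)} spans)."


def improvement_cycle(spans, approve):
    """Run one cycle; only proposals the reviewer approves are returned to ship."""
    proposals = [
        {"category": category, "fix": propose_fix(category, ids)}
        for category, ids in group_by_category(spans).items()
    ]
    return [p for p in proposals if approve(p)]


# The approval callback is where a human review (e.g. a PR review) would sit.
shipped = improvement_cycle(
    flagged_spans,
    approve=lambda p: p["category"] == "under_prioritized_bug",
)
print(shipped)
```

Keeping approval as an explicit argument makes the human gate impossible to skip by accident: nothing ships unless the reviewer function says yes.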

  13. 1:04:01 – 1:09:05

    What AI PMs must do differently: PM–engineer gap collapses at AI-native teams

    They zoom out to the PM career implications: AI-native PMs operate deeply in tools like Claude Code and are close to implementation. The differentiator is speed from insight → build, enabled by high feedback throughput and strong taste.

  14. 1:09:05 – 1:22:10

    Enterprise reality: context graphs, data silos, and what’s feasible now

    Aakash asks what enterprises can realistically adopt. Aparna emphasizes that enterprises are innovating, but their biggest unlock is organizing context—breaking silos via “context graphs” so agents can use the right information safely and effectively.

  15. 1:22:10

    Two hours this weekend: build a small agent, then add traces + evals

    Aparna closes with an actionable challenge: pick a repetitive workflow and build a simple agent with Claude Code. Then instrument it, inspect traces, and use evals to push beyond a rough prototype—positioning observability + eval literacy as a career moat.
