Aparna Full Pod Final AR

Aparna Dhinakaran, CPO of Arize AI ($131M raised), shows exactly how to build a PM agent in Claude Code, instrument it with observability, run evals against it, and close the self-improvement loop, all in one live session. If you want to understand what serious AI eval practice looks like in 2025, this is the episode.

Full Writeup: [VERIFY - newsletter URL]
Transcript: [VERIFY - transcript URL]

---

Timestamps:
00:01 - What PMs are getting wrong when building agents
04:00 - Screen share begins: building the PM agent live
07:05 - What a product taste agent actually does
09:10 - When to start running evals
10:15 - Building the agent in Claude Code from scratch
16:13 - Preview of a pre-built version with tracing active
21:34 - Instrumenting the agent for observability (one command)
27:26 - Traces streaming into Arize in real time
30:38 - Asking Claude to suggest evals
34:36 - Running the priority accuracy eval
46:10 - Vibe evals vs. axial coding: when to use each
52:46 - Looping the improvement automatically
01:04:01 - What AI PMs need to do differently
01:09:05 - What enterprise PMs can realistically take on now
01:22:10 - The two things to do this weekend

---

Thanks to our sponsors:
1. Superhuman - Sign up and get 1-month free of Superhuman Mail with my link: superhuman.com/akash
2. Land PM Job - Land your next PM role faster - https://landpmjob.com
3. Vanta - Automate your compliance - http://vanta.com/aakash
4. Product Faculty - Get $550 off their AI PM Certification with code AAKASH550C7 - https://maven.com/product-faculty/ai-product-management-certification?promoCode=AAKASH550C7
5. Bolt - Ship AI-powered products 10x faster - https://bolt.new

---

Key Takeaways:
1. Trace before you eval - A trace is the full step-by-step playback of what your agent did. Without it, you have no evidence base for evals. Every LLM call, every tool call, every intermediate output needs to be visible before you write a single eval.
2. A span is your unit of evaluation - A span is one discrete step inside a trace. Evals run at the span level, not the trace level. "Did this specific scoring step get the priority right?" is a more useful question than "was the whole run good?"
3. Instrumentation is now a one-command job - Claude Code's instrumentation skills can set up observability for your agent automatically. Arize Phoenix's skill looks at your codebase, identifies the LLM calls and tool calls, and wires them to the tracing layer. No engineering support required.
4. The vibe eval is a draft, not a verdict - An LLM can suggest what your evals should test by looking at your traces. That suggestion will not know your bug-first policy, your comp logic, or your definition of "critical." Treat it as v0 and refine against your actual judgment.
5. When evals fire, two things could be wrong - The agent produced a bad output. Or the eval is miscalibrated. Reading the flagged span yourself is the only way to know which one needs fixing. Both are normal. Both are good news.
6. Evals drift and need regular realignment - Your priorities change. Your bug policy changes. Your product changes. An eval calibrated to last quarter will start misfiring this quarter. Regular alignment to human feedback is maintenance, not a failure.
7. The self-improvement loop is already running at the best teams - Fetch all spans where evals fired. Group by failure category. Propose a specific prompt fix. Review and approve. Ship the new version. This loop runs on a schedule and requires a human at the approval step.
8. Enterprise PMs: start with one internal agent - Not a customer-facing product. An internal tool that takes four hours off your week. Once you have it, you will naturally want to trace it. That is when observability starts to matter to you personally.
9. The context graph is the enterprise unlock - Agents are only as useful as the context they have. Enterprise data lives in silos. The teams breaking through are building unified context layers that give one agent access to CRM, Gong, analytics, GitHub, and Slack.
10. Product taste is still the alpha - Code is cheap now. Shipping speed is table stakes. The PMs who pull ahead are the ones with the sharpest judgment about what to build, and the loops that make their agents better every day.

---

Where to find Aparna Dhinakaran:
LinkedIn: [VERIFY - Aparna LinkedIn URL]
Arize AI: https://arize.com

Where to find Aakash:
Twitter: https://www.x.com/aakashg0
LinkedIn: https://www.linkedin.com/in/aakashgupta/
Newsletter: https://www.news.aakashg.com

#AIagents #ProductManagement

---

About Product Growth: The world's largest podcast focused solely on product + growth, with 200K+ listeners. Subscribe and turn on notifications.

Aparna Dhinakaran (guest) · Aakash Gupta (host)

May 21, 2026 · 1h 19m

EVERY SPOKEN WORD

70 min read · 13,560 words

0:00 – 0:01
Intro
1. ADAparna Dhinakaran
  Any product person
0:01 – 4:00
What PMs are getting wrong when building agents
1. ADAparna Dhinakaran
  that has used observability and is looking at their traces and looking at their evals, you're probably already in the top 1% of PMs
2. AGAakash Gupta
  What is the role then of the PM? Like, do PMs need to become engineers at this point?
3. ADAparna Dhinakaran
  At the AI-native teams, I am seeing that the gap between a PM and an engineer is indistinguishable
4. AGAakash Gupta
  Aparna Dhinakaran is the CPO and co-founder of Arize AI. $131M raised, and most of the smartest AI teams I know building their evals on top of it. I feel like a good eval is like you're getting some healthy percentage right, but also healthy wrong so that you can make progress, right?
5. ADAparna Dhinakaran
  100%. Like, I get excited when I see that evals are wrong, because then it gives me a chance to know that there's improvement that could be made
6. AGAakash Gupta
  What are the things, if somebody has just two hours this weekend, that they should concretely go do and take away besides just they've watched this episode, but now they're gonna actually make impact in their career?
7. ADAparna Dhinakaran
  If you have any two hours this weekend, I would say literally what we just did right now, which is [fire crackling]
8. AGAakash Gupta
  Before we get into today's episode, I wanted to share that you can get a free year of my favorite AI tools, including Bolt.new, Mobbin, Arize, Relayapp, Dovetail, Linear, Magic Patterns, Reforge Build, Descript, and Speechify if you join my bundle at bundle.aakashg.com. On top of that, I wanted to quickly ask you to please double-check that you are subscribed on YouTube, Apple, and Spotify podcasts. It's a free thing you can do that really helps support the show. And now into today's episode. [fire crackling] So I've been doing a ton of episodes on Claude Code, a ton of episodes on AI agents, and separately episodes on evals. What this episode we're doing today is we're bringing it all together for you in one iterative loop. It's kind of like the product development cycle for AI products in a single shot. So you're gonna get to see front to back how we do it. I think we have a tremendous opportunity to learn from Aparna, so I'm gonna try to ask her the tough questions for you guys where maybe what she's doing, she's skipping some steps so that you guys can see it step by step. And she's volunteered to be our guinea pig on this. So Aparna, thank you so, so much for showing us the ropes of how to do Claude Code evals.
9. ADAparna Dhinakaran
  I'm super, super excited to be here. Thanks so much for having me, Aakash.
10. AGAakash Gupta
  So what are people getting wrong when you look at them building Claude Code agents and trying to do evals?
11. ADAparna Dhinakaran
  Yeah, I mean, I think the first question I get asked a lot is, "When should I even start doing evals? Like, why is that important? Um, do you, do I need to think about it before I even build my agent?" And I mean, if I'm honest with you, most teams are starting, uh, uh, you know, they're starting with just building. Like, you gotta start by having a, a real product before you wanna, y- you know, you, you run evals on it. And so, um, today what I'm gonna actually walk you through is the full end-to-end loop of getting started with building a product. When does it make sense to actually, because of the data that you've collected, start to actually run evals and automate that?
12. AGAakash Gupta
  Awesome. Let's see it in action. Where should we start?
13. ADAparna Dhinakaran
  So it's a little bit of a vision for anyone who's an AI PM today. Code is so cheap to go create, which means that product taste is really the alpha today. People, especially product managers, there's all this hype around, you know, are, is it gonna be the death of PMs? You know, I'll tell you this, we're hiring more PMs than ever. We're hiring more engineers than ever. The ones that stand out are those that actually have an opinion and a taste around what to go build. And so today, you know, a little cheeky, but, um, can we try to create taste? Can we try to have the PMs that are watching this
4:00 – 7:05
Screen share begins — building the PM agent live
1. ADAparna Dhinakaran
  have a upper hand to actually create that product taste? Well, where do, where does product taste actually come from? You, you look at kind of some of the best products out there, and what they're doing is taking in a ton of feedback. I mean, the best PMs do this. The best PMs, I mean, YC says this, uh, to their eight-- to every single cohort, which is talk to users and go build. And I think what we see is that in order to actually create taste, you need to be getting feedback from a ton of different sources, from-- It could be everything from where your team stores those issues. It could be from GitHub discussions or, like, you know, in real life discussions, from Slack and Discord, um, from your actual community talking to you. But also, we see teams building out really a context graph with all of this feedback, everything from Gong transcripts every time you talk to your customers, your product analytic tools from PostHog and Amplitude and Pendo and FullStory, um, even down to Twitter. If you have a product that your users are tweeting about and sharing feedback on, these are all ways for you to actually create and cultivate that feedback source. And instead of having just a human consume it, you can actually have your agent consume that feedback. And so what we're gonna do today is we're gonna build a bit of a product taste agent. This agent, you know, you're a PM, uh, your, your job is to come in and kind of figure out what to go build, what are users asking for. Every day, this product taste agent's gonna tell you what your biggest pains are, what your biggest priorities should be, and suggest where your product roadmap needs to go. Um, the product I'm gonna work off of today, and you can pick your own product that makes sense for you, but the product I'm gonna pick is actually our own open source product, Arize Phoenix. Arize Phoenix is the leading open source observability and evals platform. 
You can actually get started and host everything entirely open source with Phoenix. Um-Um, but with Phoenix, and you're gonna see what I do here, is that we have a ton of backlog of issues. We also have a really vibrant GitHub discussions. We have our own Slack community. We have feedback from people who are tweeting at us. And so what I'm gonna try to do is actually aggregate a lot of that, uh, uh, I'm gonna actually try to aggregate that feedback and use that to surface up where should we go and what should we build next. So the steps we're gonna do here is actually first create this PM agent. We're gonna do this using Claude Code. The magic behind everything that we're gonna use to improve is really tracing. We're gonna trace everything. We're gonna get literally every step of what our agent does is gonna be visible to us, and then we're actually gonna run the evals, Aakash. And I think this is kind of the big, you know, when people ask, "When do I do evals?" You know, I always e- you know, kind of point towards get the data, trace everything, get the observability.
7:05 – 9:10
What a product taste agent actually does
1. ADAparna Dhinakaran
  The evals can kind of help you then take you to the next level for your agent. Um, so we're gonna trace it, we're gonna eval it, and then we're gonna do this loop where we improve our agent and bring it right back. Um, so pick your favorite product that you wanna actually use. Pick a product that you have all the context of. Um, you could start super simple. What I'm gonna start with today is literally just the GitHub, you know, issues, the GitHub discussions, and use that to actually inform what my product taste or PM agent is gonna look like. Let's do this. We're gonna go ahead and build a PM product taste agent just using Claude Code. So go ahead, kick up Claude Code in your terminal. For product folks, uh, you know, this might feel intimidating in the beginning, but I can guarantee you the level of control and iteration you're gonna get by just doing this in your terminal and getting comfortable is going to feel... Y- just the unlock you're gonna get is gonna be worth a little bit of that learning kind of pain in the beginning.
2. AGAakash Gupta
  I'll be honest, I've not always been the best with my email inbox, and just thinking about it made me feel anxiety. But my anxiety has really never been lower since I started using Superhuman Mail, today's podcast sponsor. Their Ask AI feature is one thing that really stands out for me because I have so many contract details or deliverables buried eight replies deep, and I can just ask the AI. I also love the auto drafts feature so that I have a draft to react and respond to. And of course, their follow-ups are a lifesaver. Now is the time to give it a try. Check it out at superhuman.com/akash. Today's episode is brought to you by Vanta. As a founder, you're moving fast toward product market fit, your next round, or your first big enterprise deal. But with AI accelerating how quickly startups build and ship, security expectations are higher earlier than ever. Getting security and compliance right can unlock growth or stall it if you wait too long. With deep integrations and automated workflows built for fast-moving teams, Vanta gets you audit ready fast and keeps you secure with continuous monitoring as your models, infra, and customers evolve. Fast-growing startups like LangChain,
9:10 – 10:15
When to start running evals
1. AGAakash Gupta
  Writer, and Cursor trusted Vanta to build a scalable foundation from the start. So go to vanta.com/aakash. That's V-A-N-T-A.com/A-A-K-A-S-H to save $1,000 and join over 10,000 ambitious companies already scaling with Vanta.
2. ADAparna Dhinakaran
  So let's do this. Uh, go ahead and create a repo or create just a directory, and you can go ahead and initialize Claude inside of that directory. And let's just go ahead and first give it a starter prompt to actually build this agent. I'm gonna ask it to build me a PM agent for the Arize AI Phoenix product. Um, and I can go ahead and actually just link the URL to that entire repo directly in here so that it has exactly context of what I'm asking it to build. Um, and then I'm just gonna go ahead and ask what context do I want it to have? So pull recent GitHub discussions,
10:15 – 16:13
Building the agent in Claude Code from scratch
1. ADAparna Dhinakaran
  pull all the recent releases, um, and look at the GitHub issues. I'm gonna start kind of piecemeal here first, first just starting with context from one location, which is GitHub.
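Editor's sketch: the GitHub-only starting point described in this turn can be pictured as a couple of REST calls. This is a minimal illustration, not the code Claude Code generated in the session. The `Arize-ai/phoenix` repo path, the function names, and the page sizes are assumptions, and discussions are left out to keep the sketch short.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_ROOT = "https://api.github.com"


def issues_url(owner: str, repo: str, per_page: int = 50) -> str:
    """Build the REST URL for a repo's most recently updated open issues."""
    query = urlencode(
        {"state": "open", "sort": "updated", "direction": "desc", "per_page": per_page}
    )
    return f"{API_ROOT}/repos/{owner}/{repo}/issues?{query}"


def releases_url(owner: str, repo: str, per_page: int = 10) -> str:
    """Build the REST URL for a repo's most recent releases."""
    return f"{API_ROOT}/repos/{owner}/{repo}/releases?{urlencode({'per_page': per_page})}"


def drop_pull_requests(items: list) -> list:
    """The issues endpoint also returns pull requests; those carry a 'pull_request' key."""
    return [item for item in items if "pull_request" not in item]


def fetch_json(url: str, token: str):
    """GET a GitHub API URL with a personal access token and decode the JSON body."""
    request = Request(
        url,
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response)


# Example use (needs network access and a token, so it is left commented out):
# issues = drop_pull_requests(fetch_json(issues_url("Arize-ai", "phoenix"), token))
# releases = fetch_json(releases_url("Arize-ai", "phoenix"), token)
```

Keeping URL construction and filtering separate from the network call makes it easy to add other context sources later without touching the fetch logic.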
2. AGAakash Gupta
  Mm.
3. ADAparna Dhinakaran
  As we scale this, you can add in context from, like I was saying, your Gong transcripts, your product analytics. You can add context from, um, literally your Slack convos, your Discord channels. Anything can be brought into here. And what I first wanted to do is first just figure out, score the issues and the discussions, um, based off of priority. Like first just figure out how important is the stuff that we want it to actually look at and build. So things to look at is like bugs versus features, uh, reactions that people gave it, comments. Um, you know, I do want it to look at recency. So these are all things that I'm actually asking this product taste agent to take a look at and consider. Uh, then-
4. AGAakash Gupta
  Mm
5. ADAparna Dhinakaran
  ... call Claude or, you know, I can be specific here. I can say call Claude Opus, whatever model I want. So call Claude, um, with, uh... You know, I am, I could even ask it to go ahead and do some kind of like prompt caching so that it doesn't keep pulling down the issues every time that I run this loop. But just to keep it simple in the beginning, what I'm gonna do is just call Claude and, uh, write down a, just a markdown PM report that has, um, you know, that has as the output the top pain points, feature asks, and order this by P0 to P3 priority. So this is basically going to be like initial starter prompt for me to actually build this product taste. I can get super... You know, typically what I like to do is, uh, be really thoughtful about the plan that I'm giving my agent so that it, you know, it's not just going off of nothing. But, you know, there's also times where you'll just have it go off, build something, and then you're iteratively giving it feedback, and that's totally also okay. So, um, and then I'll just say here, use my GitHub token and my Anthropic API key. So let's see what it can come back just with that. Super simple. Um, while this is going and kind of doing its thing in the background, what I'm actually gonna show you, as you can see, it's gonna interrupt and ask a ton of questions as we go through this. But what I'm actually gonna show you all is just a, a simple one I built right before this and see if we can get the one we're building right now to just match up, um, and, and see how, how close we can get in just an hour here. Okay, so this is basically a PM agent that is already built out and already kind of, um, you know, we've had tracing set up and is sending to Arize already. And I'm just gonna open one of these so I can show you all kind of what it looks like here. But this PM agent is, these are the traces of our actual PM agent. And for those of you who are like, "What's a trace?" 
Like that's, that's, you know, new concept to, to understand. Um, you can think about a trace really just as, um, it is the step-by-step playback of what this agent actually did. In this scenario, this agent is first going ahead and pulling back GitHub discussions. It's pulling back the GitHub issues. It's figuring out what are all the releases that were recently released. And then it's going through, and it's actually looking at every single issue that is inside of that project, and it's actually consuming all of these and coming up with a score of how important each of these issues that it's raised are. As a product person, this is kind of the first thing you need to understand is like, how important are all of these asks that are coming from your users? What is the pain that it's solving? Um, and so the first thing I'm just asking you to do is figure out, well, can you score basically how important is each one of these asks that are coming back from, for this project? And what I'll actually do, you know, as, as it scores, I wanna actually have an eval that will evaluate how good was the score that my PM agent actually came up with, and is it accurate or inaccurate based off of, you know, the context that I have around how I wanna prioritize bugs, how I've historically prioritized feature requests. And so I actually wanna write an eval that will help teams kind of evaluate the quality of this initial PM agent that we've built.
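Editor's sketch: the scoring step in the starter prompt names four signals (bug versus feature, reactions, comments, recency) and asks for a markdown report ordered from P0 to P3. In the session that scoring is delegated to Claude. The code below is only a deterministic stand-in to make the idea concrete; every weight, threshold, class name, and sample title is a hypothetical choice, not something taken from the episode.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class FeedbackItem:
    title: str
    kind: str            # "bug" or "feature"
    reactions: int
    comments: int
    updated_at: datetime


def priority_score(item: FeedbackItem, now: datetime) -> int:
    """Combine the four signals into one integer score (1 to 11)."""
    age_days = (now - item.updated_at).days
    score = 3 if item.kind == "bug" else 1      # bugs outrank feature asks
    score += min(item.reactions, 20) // 5       # 0 to 4 points for reactions
    score += min(item.comments, 20) // 10       # 0 to 2 points for comments
    if age_days <= 7:                           # recency bonus
        score += 2
    elif age_days <= 30:
        score += 1
    return score


def to_bucket(score: int) -> str:
    """Map a numeric score onto the P0 to P3 scale (thresholds are illustrative)."""
    if score >= 8:
        return "P0"
    if score >= 6:
        return "P1"
    if score >= 4:
        return "P2"
    return "P3"


def render_report(items: list, now: datetime) -> str:
    """Render a markdown report with the highest-scoring items first."""
    ranked = sorted(items, key=lambda item: priority_score(item, now), reverse=True)
    lines = ["# PM report", ""]
    for item in ranked:
        score = priority_score(item, now)
        lines.append(f"- [{to_bucket(score)}] {item.title} (score {score})")
    return "\n".join(lines)


now = datetime(2026, 5, 21, tzinfo=timezone.utc)
sample_items = [
    FeedbackItem("Dark mode for dashboards", "feature", 0, 1, now - timedelta(days=90)),
    FeedbackItem("Tracing crashes on startup", "bug", 15, 10, now - timedelta(days=2)),
]
print(render_report(sample_items, now))
```

A fixed formula like this will not match a PM's judgment on every item, which is the gap the evals later in the session are meant to measure.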
6. AGAakash Gupta
  Hmm.
7. ADAparna Dhinakaran
  Go back and check on our agent here and see how far we've gotten. Um, so still kind of thinking-
8. AGAakash Gupta
  So when somebody's setting up this repo correctly, like basically you created a new GitHub repo, you gave it your Anthropic API key, and you just-- And I guess to create the repo, you have to log into GitHub. Those are the main steps people have to do before this?
9. ADAparna Dhinakaran
  Correct. Correct.
10. AGAakash Gupta
  Okay.
11. ADAparna Dhinakaran
  And I'm happy to go ahead and, you know, send you guys the, you know, a sample repo if you wanna get started doing this yourself so that you can follow along with a project of your choice. But in this case, you can see, great. Okay. So it's gone ahead. It's actually built this agent. I'm gonna go ahead and, um,
16:13 – 21:34
Preview of a pre-built version with tracing active
1. ADAparna Dhinakaran
  you know, it just looks like it's updating what the, um, what the... Okay, great. So it's actually just updated. It's using my GitHub token. It's using my Anthropic API key, and now it's actually gonna go ahead. It's pulled 40 discussions, 60 issues, eight releases, and now it's gonna go ahead, score each item, and then based off of the score that it gives every single one of these issues, it's gonna go ahead and give me a report about what the most important things to actually, you know, top pain points, feature requests, themes, what shipped, and give me a game plan that I can then use as a starting point when I come in. A really useful feature that, um, you know, you, you'll do this once today, but ideally, you want this kind of running all the time, kind of consistently every time someone adds a new bug report, adds a new issue, it's kind of always doing this. So what you can do is actually just say, "Can you run this in a loop? Can you run this in a loop?" And you can specifically say using the Claude loop kind of skill. Um, this is really awesome because what Claude does is that it spins up essentially a cron job. Um, well, what's a cron job? It's basically you asking Claude to be able to run some type of workflow that you do every day in a loop. Um, and so in my case, every day, every hour, every-- You could set this to every five minutes if you wanted to. It'll go ahead and, um, it'll go ahead and actually, uh, run this loop every, you know, however cadence you set so that it actually does your job. Every hour, you have the latest report of what you should be prioritizing for your agent. So let's go ahead. Oh, it looks like I need to go ahead and set my GitHub token, so give me one second and let me do that. Um, and then we can actually go ahead and run this agent, and you can watch it live. So this is actually going ahead and running my Phoenix PM agent. 
Um, I'm gonna show you guys how to do this so that, uh, you can also do it, but I've also kind of already set up traces. So what does that actually mean? Tracing is the way for teams to actually get visibility into everything these agents are doing. This is kind of a really hard, uh, y- you know, thing to debug because Claude is spinning off a bunch of different things and, and running this in a loop, and you might not always know, you know, if it comes back with slop or it comes back with something great, you know, how do I go and improve it? Or how do I go and figure out how it did that? And so tracing is a really awesome way to understand what your agent's doing. Today, what I'm gonna actually show you is that, you know, you know, I'd say tracing used to be really hard. You had to kind of go call your engineering partner to have to go and set up tracing. I think with AI, it's probably never gotten easier to do this. So what we have is essentially skills. Uh, we've released a kind of, uh, a series of, let's call it skills, that you can actually just give to your coding agent. This is kind of a set of Arize skills. Um, you just go in, install NPX skills add. I'll show you. We'll go ahead and do this. But once you actually add this, you can just ask Claude Code to go ahead and instrument the entire agent that we asked it to go build right now. You're looking here at a whole bunch of different skills. One of them is the Arize instrumentation skill. For those of you who are curious, it's literally just in English telling what Claude Code should do to actually send trace data over to Arize. Um, it makes it super easy. I'm gonna show you. It's gonna feel super magical, and you're not gonna need to wait for your... You know, you're not gonna need to wait for your engineering partner to have to go and do all of this lift to go get data, uh, from your agent to your observability platform. So let's go do this. 
Um, what we're gonna do actually is from here, I'm gonna say, "Can you help me instrument this agent?" Um, so I'm gonna go ahead and actually, uh, ask it to instrument this agent. So what this is actually gonna do is call the Arize kind of instrumentation agent. Um, so you can see here... Sorry, the instrumentation skill that we just talked about. So it's going ahead. It's calling the skill. This instrumentation skill will actually first look at the codebase and understand how is this agent built, what's actually calling the LLM calls, uh, what's
21:34 – 27:26
Instrumenting the agent for observability (one command)
1. ADAparna Dhinakaran
  actually calling the tool calls, and it'll go ahead, and it'll figure out kind of, you know, this case, the language that it was written in was Python, the LLM provider was Anthropic. Here's the library to go use. Here's what it's actually gonna go do to set up the different calls. And it says, "Cool. Everything is already wired up, sending to Arize," and, "Is there anything else specific you'd like to go change?" So now let's go ahead and just see, run my agent, um, see if it sends recent traces, and I should be able to go pop over to the platform, my observability platform, and go look at traces. We'll see if there's ones that are gonna show up right now from my recent run. But it should go ahead and actually start streaming in traces from, uh, the last... There we go. This is everything from the last 15 minutes that's just showing up here. And so-
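Editor's sketch: a trace, as described here, is the step-by-step playback of a run, and each LLM call or tool call inside it is one span. The toy recorder below shows that shape using only the standard library. It is not the Arize instrumentation skill or any Arize API; the class names, span names, and the issue number are hypothetical.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    kind: str                                   # "tool" or "llm" in this sketch
    attributes: dict = field(default_factory=dict)
    duration_ms: float = 0.0


@dataclass
class Trace:
    name: str
    spans: list = field(default_factory=list)

    @contextmanager
    def span(self, name: str, kind: str, **attributes):
        """Record one step of the run; the span is stored when the step finishes."""
        record = Span(name=name, kind=kind, attributes=dict(attributes))
        start = time.perf_counter()
        try:
            yield record
        finally:
            record.duration_ms = (time.perf_counter() - start) * 1000
            self.spans.append(record)


# One simulated agent run: a tool call followed by an LLM scoring call.
trace = Trace("pm_agent_run")
with trace.span("fetch_issues", "tool", source="github") as step:
    step.attributes["count"] = 60
with trace.span("score_issue", "llm", issue="#123") as step:
    step.attributes["priority"] = "P2"

for recorded in trace.spans:
    print(recorded.name, recorded.kind, recorded.attributes)
```

Because every step lands in `trace.spans` with its inputs and outputs attached, a later eval can target one scoring span rather than judging the whole run at once.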
2. SPSpeaker
  Mm.
3. ADAparna Dhinakaran
  Um, you kind of basically get a way to do all of this, you know, and it figures out everything from here's the individual LLM calls, here's the actual, you know, tool calls that were made. Here's the... You know, it had to go and fetch stuff from GitHub. It had to go score every single individual LLM call, and then it finally had to come back with that report that I asked for, which was, what are my top pain points? What are my top feature requests? Uh, what was already kind of shipped? And so you can see here, it's giving me an executive summary, my top pain points, and kind of the, the things that it scored really, really highly for me to go and prioritize for my product. And so literally, I didn't open any IDE. I didn't open anything. Um, I literally just asked Claude Code to build me an agent, gave it a really good prompt, and, and then I asked it, you know, kind of what I was hoping for, and then I asked it, "Go instrument my agent with Arize using the skill," and boom, now I have visibility into my agent. Um, everything's probably not gonna be perfect, and I can probably already guarantee you that, that it's not gonna be perfect. But what we can do is actually start using this as a way to understand, well, um, how would I improve this agent? What I'm gonna actually show you right now is actually an in-product agent that we've built called Alex. Alex is an agent that sits inside of, uh, our, our kind of product, and you can ask all sorts of questions like, um, you know, help me figure out the common types of issues that, uh, are coming up. And so, this will actually go through, it'll look across all of the data, like the inputs and the outputs, and it'll start to surface up common types of issues that users are asking from my traces. Um, I can use this to actually first figure out what types of evals should I actually be running on top of my agent. 
Um, and the reason why that's interesting is that you're, you're starting your evals from a place of actually looking at your traces, looking at your errors, and trying to understand, well, did it actually score some things correctly? Did it not score some things the way that I would've prioritized things? You know, how many times have you had someone on your team kinda say something was super important, super priority, but you wouldn't have given it that high of a ranking for yourself? And so, the, the next thing that I really wanna show is really for teams is how can you use Claude Code to actually help you figure out a baseline kind of eval for these agents that you're building? You can have it start just build a baseline eval and use that to actually iteratively improve your eval so that you're not starting from complete scratch. So what we can do here, you can do this in our product. You can also do this, uh, you know, kind of... You can also do this using Claude Code again. So kind of in the theme of today, I'll actually do this using Claude Code, um, and show you how you can set up evals directly from your terminal. But you're gonna see here, once I have the traces sent to Arize, I can actually ask, "Can you suggest, uh, a good eval for my agent? I want it to..." Uh, and you know, I can just start with that. "Can you suggest a good eval for my agent?" Um, let's see what it comes back with. What this will actually do, it'll call, um, the skill, the evaluator skill, um, that actually looks at, uh, looks across the traces and suggests kind of, um... There we go. Okay. So looks across the skill, and it suggests, okay, well, these are kinda three evals that you might wanna do. There's report groundedness, checks whether the quotes, the issues in the final PM report are grounded in the actual data fed in. Um, it runs kind of across everything, so it's almost, you know, I think about this almost like an eval on the final report that was created. 
You could do an eval on priority alignment, checks whether the P0, P1 kind of in the report matches the top scored issues from kind of what you're expecting, um, or something around report actionability. Okay. Well, I could do these, but these are all things that are kind of looking across
27:26 – 30:38
Traces streaming into Arize in real time
1. ADAparna Dhinakaran
  almost like the end product. What I actually want is, um, is something different as a PM. Um, what I want is actually to look at every single... I wanna get a little bit more granular in the beginning and start to understand for every single issue that this kind of, for every single one of these issues here, did it actually give it a right score? Like in this case, it said that, you know, it gave it a priority of a three. In this case, I don't know, let's, let's pick another set of them. This case, it gave it a zero. It said, you know, this integration is not that important. Um, it gave this privacy question a three. And so, there's kind of all of these, it's kind of making up these priorities. And I actually wanted to first just evaluate, is the score that it's attaching to kind of determine how important these issues are, is that actually something that I would have set by myself? So I actually wanted to run something like, uh, you know, a priority score, a priority kind of eval on, um, you know, is the score that it's actually saying how important these GitHub issues are, are they actually accurate based off of, um, how I wanna weight them? So let's go back to Claude Code. I can actually just ask it to help me come up with an e- with a way to eval this. Um, and this is very normal, where you're kind of doing this back and forth with Claude, and you're actually asking it to, to go back and repeat yourself and, um, you know, get really specific about what you want. So in this case, I can ask, "Can you help me build an eval, um, to evaluate, uh, if each issue is actually, uh, scored correctly?" Um, or it's each issue's priority, maybe is a good way to say this. Each issue's, each issue's priority is actually scored correctly.
2. AGAakash Gupta
  I think that's option two, right? Priority alignment?
3. ADAparna Dhinakaran
  Yeah. Yeah.
4. AGAakash Gupta
  Yeah.
5. ADAparna Dhinakaran
  This is... Oh, um, well, this is, this is slightly more about at the e- uh, 'cause it looks like it's checking at the end in the very report if the top scored issues are kind of what I would have picked. Um, but what I'm looking for is something slightly more nuanced, which is not just the top issues, but every single kind of individual issue is actually, um, given its appropriate kind of weight. So-
6. AGAakash Gupta
  Mm-hmm
7. ADAparna Dhinakaran
  ... it's kind of giving me this, like, priority accuracy evaluator. Um, so it'll go ahead. It'll create a way to run this evaluator on top of the actual traces. Uh, in this case, it's already picking one that I've actually already created, um, to do this, just to show you guys kind of how this works. Um, but it'll kind of suggest, hey, there's this eval you've already created, which is kind of doing this, like, row-level, issue-level kind of priority.
30:38 – 34:36
Asking Claude to suggest evals
1. ADAparna Dhinakaran
  Um, and then it's actually gonna use this to go and run it on top of those kind of traces. So in this case, it's saying, "Hey, it's running it on older data. Do you wanna go ahead and run it from today's issues, like the new issues that you just grabbed from today?" So it'll go ahead and start running it on the newer spans. Um, and you can see here every single kind of GitHub issue that has come in, it's gonna go ahead and give it a score of how important it actually is. Um, and then it'll evaluate whether that score that it was given was actually an appropriate eval or not.
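A minimal sketch of what a row-level priority accuracy check could look like. In the session the judgment comes from an eval that Claude generated; here a hypothetical human reference score stands in for that judgment, and the names `ScoredIssue`, `expected_score`, and `tolerance` are assumptions for illustration, not part of any Arize or Claude Code API.

```python
from dataclasses import dataclass


@dataclass
class ScoredIssue:
    issue_id: str
    agent_score: int     # priority the PM agent assigned to this issue
    expected_score: int  # hypothetical reference priority from a human reviewer


def priority_accuracy(issue: ScoredIssue, tolerance: int = 0) -> str:
    """Label a single issue, not the whole report, as accurate or inaccurate."""
    gap = abs(issue.agent_score - issue.expected_score)
    return "accurate" if gap <= tolerance else "inaccurate"


# Illustrative data only; the issue names and scores are made up.
issues = [
    ScoredIssue("integration-request", agent_score=0, expected_score=0),
    ScoredIssue("privacy-question", agent_score=3, expected_score=3),
    ScoredIssue("crash-on-login", agent_score=1, expected_score=4),
]

labels = {issue.issue_id: priority_accuracy(issue) for issue in issues}
```

The point of scoring per issue is that a report-level check such as priority alignment can pass while individual issues are still mis-scored.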
2. AGAakash Gupta
  I'm keen to see what evals it creates. I guess the traditional sort of evals teaching literature is all about, like, you finding production traces that you feel like there was an error. So I guess-
3. ADAparna Dhinakaran
  Right
4. AGAakash Gupta
  ... that l- line of thinking would say you'd go to the trace dashboard in Arize. You'd look at those priorities. You'd say, "Oh, this is a zero, but this really should've been a four."
5. ADAparna Dhinakaran
  Right.
6. AGAakash Gupta
  And then you'd pick up like 50 of those errors. Then you'd group them and say like, "Okay, these are the 10 errors that it does."
7. ADAparna Dhinakaran
  Yeah.
8. AGAakash Gupta
  So is the-- Are we trying to replicate that process
34:36 – 46:10
Running the priority accuracy eval
1. AGAakash Gupta
  but have Claude Code basically do it itself? Is that what we're doing here?
2. ADAparna Dhinakaran
  Exactly. Exactly. So basically, what Claude Code is doing is it has access to all of the traces in Arize because the skills, basically, it can go and call an API, um, and I can kind of share what it's doing under the hood, um, so we can talk about it. 'Cause it does feel a bit, a little bit magical [chuckles] when, when we kind of just talk, talk through it. Um, so give me one second. Let me kind of share the secret sauce of kind of what's happening here. Under the hood, all of these skills are actually calling, uh, APIs, and specifically the APIs that skills, um, tend to call is that what we've realized is that these coding agents are really good with command line or CLI interfaces. So what it's doing is basically under the hood, calling and fetching all of the traces and, you know, you've seen kind of Hamel and Shreya tell you, "Hey, go through line by line. Look at where the individual traces failed." Um, that is totally s- you know, a great way to do this. You can, of course, go in and get started and start doing annotations and start doing, you know, like did it actually answer the question? Is this, you know, you can write free-form text and just write free-form text about like, you know, what was good, what was wrong about this. Um, it's absolutely a great way to do that. Um, I'm also someone who I love to see if Claude Code can help me cut some of that time and surface up some insights for me. And so what I'm actually doing here is trying to understand just with Claude Code, and if I can give it access to my spans and my traces, like what are some insights from this that I should have to go and learn, you know, help me go and tell me what's wrong with my agent. And sometimes it's... You know, y- just being super honest, like sometimes it might not come back with something amazing as your first eval. But what I typically like about it is that it gives me a place to actually start thinking about problems and start thinking about areas of, of improvement. 
So in this case, I've gone ahead, uh, and created this, like, priority accuracy, like priority accuracy eval, and it's running, it's now running, it's run across all of my new spans, and I can go in here and just say, "Show me everything where the label's actually inaccurate," where Claude Code thinks that the priority, you know, you can see the scores here, the priority that it's come up with is actually wrong, and why is it wrong? And, you know, this is probably something that y- you're gonna hear all the time from folks who do evals is, "Was my eval wrong or was my agent wrong?" And you will definitely have scenarios, and there's a whole process that Hamel and Shreya actually talk a lot about, which is aligning your evals so that your evals, uh, are grounded in that kind of human feedback. What I'm sharing is kind of a way right now of can you start... It's almost like y- you know, can you start with the vibe eval and then modify it and improve it so that it becomes something that you, you can trust and, and go from.
3. AGAakash Gupta
  Yeah.
4. ADAparna Dhinakaran
  And, uh, you can do either approach. You can go through the, you know, axial coding approach, surface up all the issues, have the human in the loop, uh, and, you know, identify categories of pain. Um, but as a product person, you might already know what types of things you definitely wanna catch. For me, what I wanna catch is, is every single issue that this agent is prioritizing, is it right or is it wrong? Is it accurate? Is it giving it an accurate score, or is it not giving it an accurate score? And I can start off by saying, "Well, let me see if I can just have it go and create an eval to suggest kind of what that priority accuracy looks like." Um, you can do it through a skill. You can also, if you do have human annotations that are built through here, it will... The skill will look at those human annotations and use it to actually build you an eval as well. So, um, I, in this scenario, didn't have any, but if I had one, it would go through and do the whole process that Hamel and Shreya kind of walked through of, like, aligning the evals. So it's gone ahead. It's run kind of the priority accuracy eval. It's comparing the accuracy of, y- you know, it's something that's looking at the score that was assigned to each of the issues, and it's surfacing up kind of, you know, is this an accurate score or is this not an accurate score? Again, this is just based off of, you know, a simple first pass of this eval. I am going to refine this eval now because this eval is completely, you know, based off of just Claude looking at my traces and trying to identify problems. Um, and the whole point of this is, like, how do we get this loop kicked off? This loop is meant to kind of give you a starting spot. It is not meant to be your end-all, be-all kind of s- y- you know, state for your evals or your agent. Your evals will adapt, uh, w- will kind of get better, and your agent will get better. 
And that's kind of what we're showing in this workflow today is kind of how do you get started, how do you get unblocked, and then how do you do that improvement loop so we can make this better? So in this case, I kind of have, uh, a very simple, small eval here, which is, okay, looking at the accuracy of the score, these are ones that Claude thinks are not accurate. I can actually just directly ask here, like, "When my priority accuracy is inaccurate, you know, what are common issues or reasons for that?" So this will actually kick off and now look at, well, what types of, uh, what types of, uh, y- you know, things is my PM agent not, uh, prioritizing correctly? So I have my agent kind of kicking off, looking at the data, and what we're trying to do here is really go from you built an agent, you have traces set up automatically through Cod- Claude. You have Claude kind of suggesting what an eval could actually look like. And now these are already scenarios that Claude thinks are, are not right, accurately scored. This is a great starting ground for me to say, "Okay, well, what can I go to understand how to go improve this agent?" And you barely had to write... You didn't have to write anything. You kind of had to, you know, ask Claude a couple, couple things. Um, so let's go ahead, and this is Alex kind of giving me suggestions of, of kind of what to go do here. So in this case, there's, uh, whole categories of issues that it's looking at. So there's some where there is a feature request scoring, there's a legacy scoring system, there's bugs priority scoring, there's low priority scoring, there's data fetch. Okay, so there's a lot of different categories of where it's actually suggesting that my scoring might be off, and it's giving me a whole bunch of spans to go look at, to go debug and understand kind of what, what are some actual problems that this PM or taste agent might have in prioritizing issues that are coming. Um-
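A minimal sketch of the question asked above ("when my priority accuracy is inaccurate, what are common reasons for that?"), reduced to a plain group-and-count. The record layout (`span_id`, `category`, `label`) and the sample rows are assumptions for illustration; in the session the grouping was produced by the assistant reading the eval output, not by code like this.

```python
from collections import Counter

# Hypothetical eval output: one record per evaluated span.
eval_results = [
    {"span_id": "s1", "category": "bug", "label": "inaccurate"},
    {"span_id": "s2", "category": "feature request", "label": "inaccurate"},
    {"span_id": "s3", "category": "bug", "label": "inaccurate"},
    {"span_id": "s4", "category": "bug", "label": "accurate"},
    {"span_id": "s5", "category": "low priority", "label": "accurate"},
]


def failure_categories(results):
    """Count inaccurate labels per category so the largest failure group surfaces first."""
    return Counter(r["category"] for r in results if r["label"] == "inaccurate")


def spans_to_review(results, category):
    """Return the span ids to open and debug for one failure category."""
    return [
        r["span_id"]
        for r in results
        if r["label"] == "inaccurate" and r["category"] == category
    ]


counts = failure_categories(eval_results)
bug_spans = spans_to_review(eval_results, "bug")
```

Sorting by count gives a debugging order: start with the category that accounts for the most inaccurate scores, then read the listed spans by hand.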
5. AGAakash Gupta
  And what's a span exactly? It's a group of traces?
6. ADAparna Dhinakaran
  A span is really an individual step in a trace. So in this case-
7. AGAakash Gupta
  Mm
8. ADAparna Dhinakaran
  ... what you're looking at here is this is, this entire interaction where it did this whole report is what you'd call a trace. A span is a single individual step or a single individual issue that I had to go look at.
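The trace and span relationship described here can be sketched as a small data structure. The class and field names below are illustrative assumptions and do not mirror any particular tracing library's schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Span:
    """One individual step inside a trace, e.g. scoring a single GitHub issue."""
    span_id: str
    name: str
    attributes: Dict[str, object] = field(default_factory=dict)


@dataclass
class Trace:
    """The entire interaction, e.g. one full run that produced the report."""
    trace_id: str
    spans: List[Span] = field(default_factory=list)


# Illustrative run: one report made of three steps.
report_run = Trace(
    trace_id="report-run-1",
    spans=[
        Span("sp-1", "fetch_issues"),
        Span("sp-2", "score_issue", {"issue": "crash-on-login", "priority": 1}),
        Span("sp-3", "score_issue", {"issue": "privacy-question", "priority": 3}),
    ],
)

scored = [s for s in report_run.spans if s.name == "score_issue"]
```

An issue-level eval like the priority accuracy one runs once per `score_issue` span, which is why its results are reported per span and not per trace.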
9. AGAakash Gupta
  Got it.
10. ADAparna Dhinakaran
  Yeah.
11. AGAakash Gupta
  And it's weird. Isn't it a bit weird that Claude rated like everything it did inaccurate?
12. ADAparna Dhinakaran
  I, uh, some of them are accurate and some of them aren't accurate.
13. AGAakash Gupta
  Oh, okay. Some of them are accurate. Okay. [chuckles]
14. ADAparna Dhinakaran
  Yeah. If it did, then that would probably be a good spot for you to understand, "Okay, well maybe I shouldn't trust that eval from, from Claude." Um, and so-
15. AGAakash Gupta
  I feel like a good eval is like you're getting some healthy percentage right, but also healthy wrong so that you can make progress, right?
16. ADAparna Dhinakaran
  100%. And so you want that feedback of... Like, I get excited when I see that evals are wrong, because then it gives me a chance to know that there's improvement that could be made. But when everything's wrong, then, you know, it's obviously, that's definitely a scenario where you need to start looking at your eval to understand what, what to go improve.
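A minimal sketch of that sanity check: look at the share of inaccurate labels before trusting the eval. The function name, the returned strings, and the 0.95 cutoff are assumptions for illustration; the conversation only says that "everything wrong" is a reason to inspect the eval itself.

```python
from collections import Counter


def eval_health(labels, all_wrong_threshold=0.95):
    """Classify an eval run by how many spans it marked inaccurate."""
    if not labels:
        raise ValueError("no eval labels to inspect")
    counts = Counter(labels)
    inaccurate_rate = counts["inaccurate"] / len(labels)
    if inaccurate_rate >= all_wrong_threshold:
        # Nearly everything failed: suspect the eval before the agent.
        return "inspect the eval"
    if inaccurate_rate == 0:
        # Nothing failed: the eval surfaced nothing to improve.
        return "no failures surfaced"
    # Some right, some wrong: there is something concrete to work on.
    return "mixed signal"


mixed = eval_health(["accurate", "inaccurate", "accurate", "inaccurate"])
all_wrong = eval_health(["inaccurate"] * 10)
```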
17. AGAakash Gupta
  And when, when can we do the vibe evals? When do we have to do the axial coding, or can you always start from vibe evals and then layer in axial coding, talking to the agent later?
18. ADAparna Dhinakaran
  So my take is that vibe evals are gonna fall short very, very quickly. [chuckles] Um, it's-- And the reason for it is that it just doesn't have any-- It's not grounded on any actual human that is involved in curating that taste again of your agent. And so what you really want is something that helps you. You know, I think that it would be hard to say, "Hey, you have to go and, uh, immediately start by, y- you know, having a bunch of, uh, vibe evals and using that to evaluate your agent." Like that, it just, the signal-to-noise ratio there is gonna be really, really low. Um, and so having something where you have maybe a simple thing that gets kicked off, but then now what I'm gonna actually go do here is that process where I have a simple eval and I'm now gonna make sure, "Okay, well, is this eval that I've created actually something that I can trust?" Um, and it's not gonna be. It was a one-shot eval that's out of the box. I'm gonna actually go through and figure out, "Well, where do I disagree with it? Where do I not disagree with it? How do I actually..." And you would do this process even if you did axial coding. Even if you did axial coding and you did, um, individually, you know, human annotated every single span and every single issue, and you were able to put together this amazing ground truth dataset, um, your eval will get misaligned over time as you see more and more data. And so it is super important that, uh, you regularly align those evals to the data that you're actually seeing on the ground, um, with, with your users. [lip smack] So-
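One simple way to check whether a one-shot eval can be trusted is to compare its labels against a handful of human annotations on the same spans. This is a sketch under assumptions: the dictionaries map hypothetical span ids to labels, and plain agreement rate stands in for whatever alignment process a team actually uses.

```python
def agreement_rate(eval_labels, human_labels):
    """Fraction of commonly labeled spans where the eval and the human agree."""
    shared = eval_labels.keys() & human_labels.keys()
    if not shared:
        raise ValueError("no spans were labeled by both the eval and a human")
    matches = sum(eval_labels[k] == human_labels[k] for k in shared)
    return matches / len(shared)


def disagreements(eval_labels, human_labels):
    """Span ids to review by hand: the eval said one thing, the human another."""
    shared = eval_labels.keys() & human_labels.keys()
    return sorted(k for k in shared if eval_labels[k] != human_labels[k])


# Illustrative labels only.
eval_labels = {"s1": "inaccurate", "s2": "accurate", "s3": "inaccurate", "s4": "accurate"}
human_labels = {"s1": "inaccurate", "s2": "accurate", "s3": "accurate", "s4": "accurate"}

rate = agreement_rate(eval_labels, human_labels)
to_review = disagreements(eval_labels, human_labels)
```

Re-running this as new data arrives is one way to notice the drift described above, where an eval that once matched human judgment stops matching it.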
19. AGAakash Gupta
  Mm.
20. ADAparna Dhinakaran
  ... um, what I'm going to do right now is actually walk through a process where I've created a very simple eval out of the box. Claude just one-shotted it for me, and now I'm gonna start asking, "Okay, well, is this an issue with an eval? Is this an issue with my agent?" Um, and you have examples. You know, in this scenario, it looks like, [lip smack] uh, bugs are bug category items using the new scoring system with category
46:10 – 52:46
Vibe evals vs. axial coding — when to use each
1. ADAparna Dhinakaran
  four are also commonly inaccurate. And so it feels like there's scenarios where bugs maybe are not getting categorized or given the accurate score that I want it to. In my world, I want bugs to always be super high, because if it's a bug and a customer hits a bug, that's just a really bad experience with the product. So I would prioritize bugs over, uh, you know, even new feature work. Um, and so this gives me a way to say, "Okay, well, let me go look at some examples of where the bugs are, you know, being prioritized really low," and just gives me a category of problems to start looking at and start debugging and understanding kind of how good this, this agent is. Um, and what you can do is, you know, for some teams, as these evals get really good and get better, uh, you can immediately ask Claude to, you know, going back to kind of using Claude Code with evals, say, "Hey, go grab everything where this eval failed and suggest an improvement and go improve, improve that eval for me." I think it's, uh, unfair to say people aren't using Claude to create evals, and I think that's maybe one of the pain points that I see with always saying start with axial coding, is that in reality, um, you will always do it, but I think it's okay to start with Claude suggesting what a good suggestion of an eval could be. And these models have gotten so good, like having it go through and look at your answers and suggest, "Hey, that probably is something you should flag and look at," I would trust it. [chuckles] I would trust it as a first pass, like, "Go tell me what my evals should be."
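The "bugs should always be super high" preference can be written down as an explicit check. This is a sketch only: the record fields and the `min_bug_score` floor of 3 are assumptions, since the session does not state the scoring scale or an exact cutoff.

```python
def flag_underscored_bugs(issues, min_bug_score=3):
    """Return ids of bug issues the agent scored below the assumed floor."""
    return [
        issue["id"]
        for issue in issues
        if issue["category"] == "bug" and issue["score"] < min_bug_score
    ]


# Illustrative issues only.
scored_issues = [
    {"id": "crash-on-login", "category": "bug", "score": 1},
    {"id": "export-timeout", "category": "bug", "score": 4},
    {"id": "dark-mode", "category": "feature request", "score": 1},
]

flagged = flag_underscored_bugs(scored_issues)
```

A rule like this encodes product taste directly, so it is a natural thing for the PM to own and revise as priorities change.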
2. AGAakash Gupta
  Yeah. That's my favorite workflow. Always start Claude generating it, but then you just give it like ruthless criticism, and I just turn on dictation mode and I'm like, "Well, you misjudged this for this reason. You misjudged this for this reason," and that's where the taste alpha that you bring can actually come back in.
3. ADAparna Dhinakaran
  Totally. And I think what for me is like how do I quickly get into that loop is get data in, get an eval set up, give it criticism, and let it go run on a loop. Um, [lip smack] so I showed earlier there's kind of the Claude Loop, um, uh, kind of skill that Claude has. And so what you can actually do here is now that you have this eval, you can create a whole 'nother skill that's just like every day go through, fetch everything that was inaccurate, and go, uh, that was inaccurately prioritized, and go fix and improve my agent. And you can go and create a skill that actually will then go suggest improvements to your agent from the evals that you just ran on top of this.
4. AGAakash Gupta
  Mm. So you actually loop the improvement too, not just the agent.
5. ADAparna Dhinakaran
  'Cause then you get to a world of self-improvement. And that's where, to be honest, I think we're all headed, is that the data that we all collect, the evals and observability, is the foundation for self-improving agents. And so you get your observability in, you build an initial eval. It's a first pass. You're gonna make it better. You're gonna have to give it ruthless criticism to kind of make the agent better or make the eval better. And I think what teams are doing right now is they're kind of doing that iteratively. You can just create a loop that essentially starts to look at the evals, identify... You know, I, I just asked right there, "Give me the common reasons why the priority accuracy is inaccurate." Oh, it's because the way I prioritize bugs is, is, uh, doesn't look right. And so what I can do, go back to my PM agent and just say, "Hey, go fix this issue." [chuckles] And then go fix the issue, ship a new agent, now go collect traces from the next rev of that agent. And so that improvement loop can actually run inside a Claude Code as a loop skill.
6. AGAakash Gupta
  So that is all fine and dandy for your internal agents that are assisting you in your work. How does this all change for the AI agents in your product? You just showed us Alex, so maybe you can go under the covers of how that worked when it's actually a product. You're not gonna be shipping self-improvement to Alex every day because you don't know. It could just go off in some weird direction. So how does-- where do the human and the review loop parts come in there?
7. ADAparna Dhinakaran
  Totally. Uh, I mean, there is, there's still code review. There's still a human that, uh, y- you know, looks at every PR that is actually being put up by the self-improvement loop. But, you know, maybe what I can ask you back is, but isn't that the vision? Isn't that the future that we all wanna go to, is that I should be able to see someone file a bug and on, you know, Alex didn't give me... Somebody gave a response that Alex gave a thumbs down. Alex is able to immediately, and this is kind of what we're doing internally already, is you're gonna hear a lot more about us talking about it in the next couple of weeks. Um, but Alex has already taken that feedback, spinning up a whole debug kind of workflow, and using the eval, using the trace to debug what went wrong. And then in some scenarios, like we talked about, it's the eval that's wrong, and in that case, you know, it's a refinement on the eval. Um, and in some cases... But that's great, right? That's basically, you know, a little bit of what y- you hear all about the axial coding of figure out what are the reasons why that eval wasn't good, and then use it to go improve that eval. And in some cases, the eval was right, and it really was the agent that needed to be, you know, handle a specific scenario better. And so in that case, what we can do is just very simply, um, go in and do an improvement. Say, "Hey, go fix this," uh, and actually go in and, and improve the, the agent. And so what I can do-
8. AGAakash Gupta
  That is what we want, right? [chuckles] Ideally, like it's happening like in real time across millions of users automatically. So I guess, how do you do that safely? So code review is one step. What else do you need to like-- Where do you need to put the human in the loop?
9. ADAparna Dhinakaran
  So I think there's a, um, there's a couple maybe places where that needs to happen. One is as the eval changes, that's also a really important step to actually having the, um, human kind of curate that taste of what is good and what is not
52:46 – 1:04:01
Looping the improvement automatically
1. ADAparna Dhinakaran
  good. Um, so the humans kind of typically involved in eval changes. They're involved in the agent changes. Um, the, there's a lot that's happening right now around making sure that the, the skill that's actually being d- used to do the improvement workflows, that is one that is, uh, typically designed by a human. So what does that improvement skill need to look like? What is, what is all of the context that it needs to have access to in order to be able to know what the improvement is? In this scenario, it might not have all the context because all I gave it was just GitHub issues. But if I could then layer in my product analytic metrics, I could layer in my, um, the, my, my traces, my actual entire traces, um, it could actually end up using that information to build its own context of what went wrong, how do I need to go fix it, um, and u- leverage that information as basically context for the improvement loop.
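The improvement loop with a human gate can be sketched as one cycle of plain function calls. Every callable here (`fetch_spans`, `run_eval`, `propose_fix`, `human_approves`) is a hypothetical placeholder supplied by the caller; none of them is a real Claude Code or Arize function, and the stubs below exist only to make the sketch runnable.

```python
def improvement_cycle(fetch_spans, run_eval, propose_fix, human_approves):
    """Run one pass: gather spans, evaluate, propose a change, ask a human.

    Returns the approved proposal, or None if nothing failed or the
    human rejected the change.
    """
    spans = fetch_spans()
    failures = [span for span in spans if run_eval(span) == "inaccurate"]
    if not failures:
        return None
    proposal = propose_fix(failures)
    # Nothing ships without review: agent changes and eval changes both pass through a human.
    return proposal if human_approves(proposal) else None


# Stub implementations for illustration only.
sample_spans = [
    {"id": "s1", "label": "accurate"},
    {"id": "s2", "label": "inaccurate"},
]

approved = improvement_cycle(
    fetch_spans=lambda: sample_spans,
    run_eval=lambda span: span["label"],
    propose_fix=lambda failures: f"raise bug priority ({len(failures)} failing span)",
    human_approves=lambda proposal: True,
)

rejected = improvement_cycle(
    fetch_spans=lambda: sample_spans,
    run_eval=lambda span: span["label"],
    propose_fix=lambda failures: "rewrite the whole prompt",
    human_approves=lambda proposal: False,
)
```

Scheduling this function daily, whether through a loop inside a coding agent or an ordinary cron job, is what turns a single pass into the recurring loop discussed here.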
2. AGAakash Gupta
  Got it. So there's human in the loop at any agent change, any eval change. But outside of that, you can actually use loop commands within Claude Code or whatever, if you're in more production database, a real cron job, and every day or whatever cadence. And so what are-- You get to work with like all of the best companies, Uber-
3. ADAparna Dhinakaran
  Yeah
4. AGAakash Gupta
  ... DoorDash, you name it. What are the, what is the state of the art looking like for this self-improvement? How fast are people moving, and how fast sh- do they need to be moving to be competitive?
5. ADAparna Dhinakaran
  I mean, I think it's gonna come very, very quickly. If I'm honest with you, I think the best teams are already doing this in, uh, i-in, in their, y- you know, call it like a radius that they're comfortable with today, but that radius is gonna get bigger and bigger. Um, is there, you know, maybe the initial improvement is around improvements to the agent that are kind of more simpler, more around the prompts, the tools. Um, does that radius then become about giving entire workflows that the agent didn't have access to do? Does it-- So the radius of those changes, I think, is going to become, uh, is going to become increasingly bigger, which we're excited about. Um, but it's just that self-improvement loop is not gonna happen without having really good, um, data, really good data and really good evals. Um, if you think about, and just to try to maybe take an analogy for something that's so different, but if you think about, like, some of the best sports players, what do they do? Like, I'm talking about, like, the Nadals, the Federers, if you're a tennis fan. Like, your Novaks. Like, what they're doing is actually looking at their plays. They're looking at their previous games. They're looking and studying their behavior of what they did and using that as a way to understand what went well and what didn't go well in their games to go make improvements. This is kind of studying your plays is kind of what agents, uh, y- you know, self-improving agents or self-improving harnesses have to do, is they kinda have to study their own plays, um, to understand what did the human say was a good response or what did the human not say was a good response, um, and use that to actually figure out how to improve their own gameplay in some way. Um, and that's what we're actually-- That's why the evals and the observability are kind of the, the foundational layer in order for teams to actually build that self-improving loop.
6. AGAakash Gupta
  So I personally have encountered PMs that I feel like are in one of three buckets, and I think you have customers in all three of those buckets. So there's the AI natives, like customers you have, like Handshake and the AI companies. Then there's, like, the digital-first companies, customers you have, like Uber and Reddit and Roblox. And then there's, like, the normal companies who have tech arms, Pepsi, Conde Nast, normal type of companies. So you get to-- you work with all three of those groups, and so what I wanna understand is, usually the AI native groups, they're gonna be doing the quote-unquote "best way" or, like, the right way of how to do things. So what are the AI native groups doing? And specifically not just with, like, how they're building their evals, but the role of the PM. What is the role of a PM in an AI native company versus a company who hasn't gotten there yet, and how does that company bring their PMs there?
7. ADAparna Dhinakaran
  Yeah. I, um, I think the role of a PM is, like, completely changed in the last year. The role of the PM is almost like the, you're the tastemaker for this product, and in order to become a really good tastemaker, you really have to understand the outcomes of the agents. That the-- especially the AI PMs, where the product is the agent. The product is the agent that's being built. Um, you have to spend a lot more time. You know, the AI native PMs, they are almost indistinguishable from engineers in, in, in some ways because they're comfortable living in Claude Code. Like, this entire workflow that I just showed, where they're able to build even just a simple internal agent to help them do their daily tasks, where they can... You know, you're not doing-- If you're, you know... We, we kind of say this internally, and I think it's true. It's like, if you're doing things the same way you were doing things last year, then, you know, you haven't, um, you haven't caught up yet. And I think that I deeply do think that, you know, if you're kind of looking at your old board of, like, "Here's my priorities," and you're kind of manually scanning them and manually kind of understanding every single, um, you know, ki- kind of doing what you used to do, it's just different. Because now with the advent of kinda Claude Code, I can actually have it... You're not limited by how many individual meetings and Gong calls that you can personally kinda hear. You can have Claude Code go through and have-- it has access to all of these customer calls that you might have n- you know, never been able to consume all by yourself, but can it help surface up the one or two that are, like, super critical, you need to put your eyes on that because that's gonna help you unlock your next 10, 15 customers. 
And so I think in, in these AI native companies, what we're seeing is that the PMs who are able to leverage Claude Code to do everything from understand user data and user feedback better, surfacing that back into what does a really good product experience look like, uh, get really close from idea to solution. So it's not like, "Hey, I'm handing it over to an engineer." It's like they're able to effectively almost put together a plan for what that build needs to look like. Those are, those are the PMs that I think are really, um, gonna be 10X or whatever multiplier PMs in any, in any team.
8. AGAakash Gupta
  So we're talking about working with those AI native companies. You are yourself one of the AI native companies, and you refer to this, that you yourself are hiring more AI PMs than ever. So what does the new profile look like? If I wanna land an AI PM role at an AI native company that has raised $131 million-
9. ADAparna Dhinakaran
  Yeah
10. AGAakash Gupta
  ... what are the skills I should be developing? What is the depth of technical knowledge and topics I need to cover?
11. ADAparna Dhinakaran
  One, I... and I have always believed this, is like just the curiosity is the number-- That's like, for me, the number one most important signal. Like, this person is, uh, trying all the new tools. They're kind of exploring the boundaries of what they can and can't do, um, because that's something that y- you know, there's, there's kind of the old way of doing things is that there used to be trainings, and you'd go to these trainings, and someone would walk you through how to use a tool. Um, but what if the tool is Claude Code, and it's had, you know, shipped, you know, 90 features in like 30 days. [chuckles] Like, there is no old way of doing things where you can have like a daily training for a product that's moving that fast. And so the, it's kind of the onus of keeping up has become on the individual now to actually keep up with the tools, keep up with what's changing, and if something-- Not everything is gonna be useful to what you do, but if something can give you an ability to, "Hey, that used to take me an hour, and now it can take me 10 minutes," like that is an advantage, and being able to identify those and use them to your advantage is, uh, is deeply, deeply-- I think it's built off of curiosity at this stage. Um, I think too, the other big one is it's still really important to, you know, care and understand, like the, the user and customer empathy is something that I don't think AE, like the best PMs and the best product tastemakers, you know, understand. You could ask them, "How is that customer using the product? What's their biggest pain points? What do they..." You know, and they would be able to rattle them off to you. And I think what's now changed is that you can actually get even deeper. You can, you know, customer asks for something, it could have taken a week to go build that, two weeks to build that in the past. That could be delivered that day, um, if you're able to ship at that velocity. 
Um, and so being able to get even closer and deliver to customers even faster is no longer just like a-- It's no longer a pipe dream. It's actually how the best products at AI natives are, are, are shipping right now.
12. AGAakash Gupta
  So 99% of people aren't in an AI native company.
13. ADAparna Dhinakaran
  No. No.
14. AGAakash Gupta
  So they don't believe us, so I need to just confirm this is true.
15. ADAparna Dhinakaran
  Yeah.
16. AGAakash Gupta
  What you're saying is that sometimes an issue will come in, and your PMs will identify that it's important enough. Either they or an engineer will prototype a feature and make it ready for production, and you guys will ship it the same day.
17. ADAparna Dhinakaran
  Yes.
18. AGAakash Gupta
  That is actually what's happening, guys. She said it herself. So what is the role of the PM then? Do PMs need to become engineers at this point?
19. ADAparna Dhinakaran
  I think that at the AI native teams, I am seeing that a PM and an engineer are becoming indistinguishable, because when code has become so much easier to actually produce, then
1:04:01 – 1:09:05
What AI PMs need to do differently
1. ADAparna Dhinakaran
  actually, and this goes back to where we started today's podcast, the alpha today is product taste. So the people that understand product taste, understand what customers want, and understand how to deliver a really amazing experience are just gonna have insane velocity. PMs who can go from, "Here's the pain point. Here's what I think is a really amazing experience," to, "I could probably go build that today," and talk to Claude Code to figure out what to go build, are a triple threat in this environment right now.
2. AGAakash Gupta
  What are you seeing at the enterprise level? 'Cause they're not even close to there. So-
3. ADAparna Dhinakaran
  Yeah
4. AGAakash Gupta
  ... if you're at a big enterprise, if you're at a Pepsi or something like that-
5. ADAparna Dhinakaran
  Yeah
6. AGAakash Gupta
  ... you're still trying to take on the best practices. Realistically, what can they take on, and how do they take it on?
7. ADAparna Dhinakaran
  Yeah. I don't wanna say that there's no innovation happening at enterprises at all. Right now, all these teams are using the coding agents and, I think, feeling the unlock of those tools in their own day-to-day workflows. What I'm seeing coming out of the teams right now, even there, is, one, amazing products that use AI to make the experience of that product useful. Two, especially at larger companies, you have silos of data: some people might have access to information that other teams don't. There's actually a really great piece on context graphs that Jaya Gupta, somebody you should follow on Twitter, shared a couple weeks ago, and it's gone super viral. The idea of a context graph is essentially: can you give your agent access to context? Agents are only as good as how much context they actually have, and then, of course, the harness that's built on top with access to that context. So instead of all that information and data sitting in completely different silos, with people operating in those silos, one unlock for agents is: can you give them access to the context from different environments? What that does is bridge the gaps across different teams in ways that probably weren't possible before. So figuring out how agents consume the context within an organization is probably going to be one of the biggest challenges, and one of the biggest unlocks, that we're gonna see this year.
8. AGAakash Gupta
  So if you're a product leader at one of the enterprise companies, you're seeing what you just demoed for us. You're saying, "Okay, how can I bring my company-
9. ADAparna Dhinakaran
  Yeah
10. AGAakash Gupta
  ... towards that?" What's sort of the step-by-step roadmap I should be implementing over the next, say, 12 to 24 months?
11. ADAparna Dhinakaran
  Well, first, as an individual IC, I think: build. You'll read a lot of stuff on AI Twitter, with everyone sharing every latest new model and every latest new tool out there. What I would highly recommend for any AI PM is to start by building. Start by building something very simple, like the example that we just did today. It doesn't even need to be an external-facing agent that you need to publish. Can it just be an internal tool that you use to make one big unlock today? That's huge. Think about it: this tool that we just vibe coded in an hour, now I'm gonna go use it to figure out, "Okay, what are my top pains?" You can imagine the next step after that is, can I get an agent to go and put up a draft PR for one of these? Can I get an agent to then review that PR and do the code review on it? The process to go from identifying a pain point to releasing a fix could have taken months in the past. That entire thing can now be shortened to a span of, like we were saying, a day. And if you just started with, like, if that
1:09:05 – 1:22:10
What enterprise PMs can realistically take on now
1. ADAparna Dhinakaran
  could be your day, and what does everything need to look like in order to deliver on that? I think it changes the game for individual ICs. So first I'd say start by building. It's the biggest unlock. Two, as you're building, it's important to figure out what systems you need in place. It's easy to build something and then say, "Oh, it doesn't work." You know how many times that's happened to you, where you're like, "It's not working. I'll just scratch that idea and let it sit." For the most curious of the PMs, this is where having a data layer like Arize and the observability platforms is really helpful, because you might not know why your agent gave you a bad response, why the outcome wasn't that great, or what it was doing. So get observability to understand that. It's like the simple example we were talking about with the tennis players: how do they look at their plays, figure out what went wrong, and get 1% better every single day? If you could n plus one your output every single day, I think the story is no longer about observability or looking at your data. The story's about self-improvement: improvement of yourself as a PM, but also improvement of the products that you are building.
2. AGAakash Gupta
  So we used Phoenix, Arize's open source platform, and then we used Arize, the paid platform, to do this. Those are two options. How does somebody make a decision? What does the overall ecosystem look like, and why would they choose Arize?
3. ADAparna Dhinakaran
  Yeah, great question. So Arize Phoenix, the open source one that we pulled all the GitHub issues from today, is an amazing option if you cannot send your data to an external platform. For most enterprises, and most teams building agents that handle any PII data, the reality is that they want to self-host some initial observability so that they can get a feel, get started, and get an unlock. I think even Hamel's tweeted this before: his favorite open source tool for observability is Arize Phoenix. It's got a super permissive license. It's got almost everything that you just saw in the demo today out of the box. And all the skills that I shared using Claude Code exist for Phoenix too. So you can just go build an agent and say, "Hey, help me instrument it. Help me figure out insights for my traces. Help me go write evals." Phoenix will actually go and do all of that for you today. Typically, where teams start to feel that the paid platform, the enterprise platform, makes sense is when data volume starts to scale. I think it's a good thing: these agents are starting to find product market fit in this environment right now. The models are getting better, products are starting to find product market fit, and so we're starting to see almost like terabytes [chuckles] of data. So volume and scale are a big reason why, for teams that need it as their agents start to get mature, it makes a ton of sense to have a more scaled-out platform for observability. This is where Arize AX is uniquely fit to solve that problem.
We do this really well because we've actually had to invest in our own data store that we've been building for a while now, ADB. It's a data store that's designed for AI workloads from day one.
4. AGAakash Gupta
  So let's say I've figured out I need to pay for it. I have the huge amount of data. How do I decide who to work with?
5. ADAparna Dhinakaran
  The reason to pick Arize is really that we're the most open and independent platform out there. We are independent of framework. We don't actually care what framework you use. We have teams using everything from LangChain and the Claude Agents SDK to teams that are building without a framework. So we're agnostic of whatever framework you use. The second thing is we deeply believe in the independence of your data. All of the trace data that we collect lives in open formats. Using our ADB data fabric, that data can be sent directly back to your data warehouse. The reason that's really powerful is that you don't want your agent trace data, which is so valuable, to be locked inside of a proprietary platform. We make it accessible so that you can actually use the agent trace data as part of your context graph. We're also independent of instrumentation. If you don't know, we're actually the inventors of OpenInference. Our competitors, every single one of them that you mentioned, all use our instrumentation, and they've actually linked to it in their docs. We built probably the richest telemetry, and it shows in the fact that our instrumentation is widely adopted in the ecosystem. And then the last one is that I think we've been consistently one of the most innovative in the market. We were actually the first to ship LLM as a judge. If you go back to 2023 and look at Phoenix in the repo, you'll see LLM as a judge. We were the first to release OpenInference instrumentation. With Alyx, which you saw in the product, we were the first to actually have an agent built into our product. All of the skills that we showed you how to use, we were actually the first to have and release them.
Hamel actually did a talk on this with Mikio, our open source lead. And as I was mentioning, we have the first and only way in market right now to take all of these agent traces and have them in standard formats as part of your context graph. I think it just shows that we're probably the fastest innovator in the space right now.
6. AGAakash Gupta
  What are the things, if somebody has just two hours this weekend, that they should concretely go do, beyond just having watched this episode, so they actually make an impact in their career?
7. ADAparna Dhinakaran
  If you have two hours this weekend, I would say do literally what we just did right now, which is build an agent for yourself. Pick whatever would take away a couple hours of your week every week, just something repetitive that you do every single week. And by the way, this isn't just for PMs. If you're someone who's in product marketing and you're writing release notes every week, what is a workflow that you do every single week that takes a couple hours of your week? Try to build an agent to go do that. I think what you'll learn out of that is, one, how insanely easy it is with Claude Code. On the other hand, you'll also realize how much work it takes to actually make it really good. To make it better past that initial vibe code, the evals and the observability are so important. I said this in the beginning, but any product person who has used observability and is looking at their traces and their evals is probably already in the top 1% of PMs in the world right now.
8. AGAakash Gupta
  What are the biggest mistakes PMs are making when they do evals?
9. ADAparna Dhinakaran
  I think the biggest one is not starting with actual trace data. If you're just starting with what you think are problems, that's really hard. Even the skills that we used today, that Claude was using to build the evals: what's powerful about them is that we're trying to instill best practices. They're actually looking at all of the trace data to help suggest what the right evals could be. So I think PMs need to look at... The evals don't just come out of magic. They come out of traces.
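[Editor's sketch] To make "evals come out of traces" concrete, here is a minimal Python sketch, not the Arize or Phoenix API and not the skill used in the episode. It assumes a hypothetical list of trace records, each with a hand-written `failure_note` from someone reviewing that run, and simply counts which failure modes recur often enough to deserve their own eval. The field names, the sample data, and the `suggest_evals` helper are all illustrative assumptions.

```python
from collections import Counter

# Hypothetical trace records: one per agent run, with a reviewer's note on
# what went wrong (None means the run looked fine). These fields are
# assumptions for illustration, not a real trace schema.
TRACES = [
    {"id": "t1", "failure_note": "wrong priority"},
    {"id": "t2", "failure_note": None},
    {"id": "t3", "failure_note": "missed duplicate issue"},
    {"id": "t4", "failure_note": "wrong priority"},
    {"id": "t5", "failure_note": "hallucinated label"},
    {"id": "t6", "failure_note": "wrong priority"},
    {"id": "t7", "failure_note": "missed duplicate issue"},
    {"id": "t8", "failure_note": None},
]


def suggest_evals(traces, min_count=2):
    """Return (failure_note, count) pairs seen at least min_count times,
    most frequent first. Each pair is a candidate for a dedicated eval."""
    counts = Counter(t["failure_note"] for t in traces if t.get("failure_note"))
    return [(note, n) for note, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    for note, n in suggest_evals(TRACES):
        print(f"candidate eval: {note!r} (seen in {n} traces)")
```

With this sample data, "wrong priority" surfaces first, which is the kind of evidence that would motivate a priority accuracy eval instead of guessing at problems up front.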
10. AGAakash Gupta
  All right, everybody, I'm gonna put up Arize's pricing page for you. This is how much Arize costs. Now, here's the cool thing. If you wanna get AX Pro for 12 months for free for your team, because you're convinced you wanna create self-improving agents, you can do that with Aakash's bundle, or you can just use the free options that she's talked about right now, Phoenix and AX Free, to get started. It's that simple. I highly recommend every AI PM master the AI eval skill. Arize is one of the easiest ways to do it. Aparna, thank you so much for lending your expertise.
11. ADAparna Dhinakaran
  Awesome. Thank you so much, Aakash, and, uh, it was awesome to be here.
12. AGAakash Gupta
  I hope you enjoyed that episode. Couple things you can do to support the show. One, comment. Two, review. Those ratings and reviews really help other people understand the value and the production that we are putting into this, right? This wasn't an easy episode to produce. We put in a ton of pre-work. We edited it for you. We brought in the best guests. If you don't mind sharing a rating and review, sharing the episode with others, making sure you are subscribed, that really helps the show do bigger and better productions. I'll see you in the next episode. Here's one of those that YouTube thinks would be a great fit for you.

Episode duration: 1:19:31

Transcript of episode DL-pUGcfrf4
