Aakash Gupta

AI Agents for PMs in 69 Minutes — Masterclass with IBM VP

Aakash Gupta and Armand Ruiz on AI agents, RAG, open source, and the PM shift.

Aakash Gupta (host) · Armand Ruiz (guest)
Sep 5, 2025 · 1h 9m · Watch on YouTube ↗
Definition of AI agents vs chatbots
Four-step agent loop: think, plan, act, reflect
Agent development: coding frameworks vs low/no-code tools
RAG vs fine-tuning and enterprise context injection
Vision RAG for PDFs, charts, tables, multimodal extraction
RAG accuracy, data engineering, and evals throughout workflows
Multi-agent orchestration and human-in-the-loop accountability
How AI reshapes PM ratios, workflows, and prototyping culture
Open source AI in enterprise: deploy-anywhere control
IBM strategy: hybrid deployment, Granite models, governance
AI regulation and enterprise compliance inventory
AI talent wars as capital allocation
Career acceleration: prototypes, customer obsession, networking
LinkedIn growth systems and diminishing AI-content advantage

In this episode, Aakash Gupta sits down with IBM VP Armand Ruiz for a masterclass on AI agents, RAG, open source, and the shift AI is forcing in product management. AI agents are positioned as the next leap beyond chatbots because they combine reasoning with planning, tool-based action, and reflection to automate end-to-end work.

At a glance

WHAT IT’S REALLY ABOUT

AI agents, RAG, open source, and the PM shift explained

  1. AI agents are positioned as the next leap beyond chatbots because they combine reasoning with planning, tool-based action, and reflection to automate end-to-end work.
  2. Building agents increasingly splits into two tracks: code-first frameworks (e.g., LangGraph, CrewAI, LlamaIndex, AutoGen) for control and low/no-code builders (e.g., Lindy, n8n, LangFlow, Flowise) for accessibility.
  3. RAG remains a dominant enterprise technique for injecting fresh, company-specific context into LLM workflows, but reliability requires serious data engineering and evaluation at multiple steps—not just checking the final answer.
  4. Managing “10–20 agents per employee” introduces a new orchestration skill: humans become accountable reviewers of agent outputs, with governance, cost controls, and safe experimentation as key enterprise constraints.
  5. AI changes product management by compressing the PM lifecycle (research → prioritization → PRD → prototype → monitoring) and enabling broader PM coverage, while still requiring customer-first problem investigation to avoid feature-factory behavior.

IDEAS WORTH REMEMBERING

13 ideas

Agents matter because they close the loop from “answering” to “doing.”

Ruiz frames agents as the “wall of automation”: not just generating text, but decomposing tasks, executing actions in real systems (email/CRM/Workday), and improving through reflection over time.

The simplest useful mental model for agents is Think → Plan → Act → Reflect.

Thinking leverages LLM reasoning; planning breaks work into subtasks; action is enabled by tool access/protocols (he cites MCP); reflection uses feedback/history to iteratively improve future runs.
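
A minimal sketch of that loop in Python, assuming a generic text-in/text-out `llm` callable and a `tools` registry (both are stand-ins, not anything named in the episode):

```python
import json

def run_agent(llm, tools, goal, max_iters=5):
    """Think -> Plan -> Act -> Reflect. `llm` is any text-in/text-out chat
    client; `tools` maps tool names to callables. Both are hypothetical."""
    history = []  # reflection memory carried across iterations
    for _ in range(max_iters):
        # Think + Plan: the model decomposes the goal and picks the next action.
        step = json.loads(llm(
            f"Goal: {goal}\nHistory so far: {history}\n"
            'Reply as JSON: {"done": bool, "tool": str, "args": object}'
        ))
        if step["done"]:
            break
        # Act: execute the chosen tool (email, CRM, Workday, ...).
        result = tools[step["tool"]](**step["args"])
        # Reflect: record the outcome so later iterations can self-correct.
        history.append({"plan": step, "result": result})
    return history
```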

Pick your agent tooling based on required control, not hype.

Low/no-code builders accelerate experimentation for non-technical users, while code frameworks (LangGraph/CrewAI/etc.) remain necessary for complex, production-grade agentic systems needing deeper flexibility.

RAG is primarily for fresh context, not “making the model smarter.”

He distinguishes RAG (connecting to databases/knowledge bases for up-to-date info) from fine-tuning (better for behavior/style/specialization, not continuously changing enterprise knowledge).
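
The distinction is visible in code: RAG changes the prompt at query time, while fine-tuning changes model weights offline. A minimal sketch of the RAG side, assuming hypothetical `embed` and `vector_store` stand-ins:

```python
def answer_with_rag(llm, embed, vector_store, question, top_k=4):
    """Inject fresh enterprise context at query time; the model is unchanged."""
    # Retrieve the chunks most similar to the question from the knowledge base.
    chunks = vector_store.search(embed(question), top_k=top_k)
    context = "\n\n".join(chunk.text for chunk in chunks)
    # Ground the answer in retrieved context instead of parametric memory.
    return llm(
        "Answer using ONLY the context below; say 'unknown' if it isn't there.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```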

Most RAG failures are evaluation and data-engineering failures, not “LLM failures.”

Enterprises can’t tolerate “70% accuracy,” so vanilla templates break; teams need systematic eval practices and better pipelines (embeddings, chunking, retrieval, filtering, ranking) to reach business-acceptable reliability.
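
Those levers map to separate pipeline stages that can be tuned and measured independently. A hedged sketch; the chunk sizes, score threshold, and `reranker` are illustrative, not recommendations from the episode:

```python
def chunk_text(text, size=512, overlap=64):
    # Fixed-size sliding window; production systems often chunk on
    # semantic boundaries (headings, paragraphs) instead.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def build_index(docs, embed, vector_store):
    for doc in docs:
        for chunk in chunk_text(doc.text):                       # chunking
            vector_store.add(embed(chunk), chunk, doc.meta)      # embeddings

def retrieve(question, embed, vector_store, reranker, k=20, final_k=4):
    candidates = vector_store.search(embed(question), top_k=k)   # retrieval
    candidates = [c for c in candidates if c.score > 0.2]        # filtering
    return reranker(question, candidates)[:final_k]              # ranking
```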

Evaluate agentic workflows at multiple steps, not only at the end.

Ruiz argues evals should be embedded across the pipeline—akin to testing in software engineering—because multi-step systems can silently degrade long before the final output looks wrong.
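
Treated like unit tests, that means asserting on intermediate artifacts, not just the final answer. A sketch with illustrative metrics and thresholds; the `grounding_check` and `answer_check` judges are stand-ins (often LLM-as-judge in practice):

```python
def eval_example(example, retrieve, generate, grounding_check, answer_check):
    chunks = retrieve(example.question)
    # Step 1: retrieval quality -- did we fetch the human-labeled relevant chunks?
    hits = {c.id for c in chunks} & example.relevant_ids
    recall = len(hits) / len(example.relevant_ids)
    assert recall >= 0.8, f"retrieval degraded: recall={recall:.2f}"
    answer = generate(example.question, chunks)
    # Step 2: grounding -- is every claim attributable to a retrieved chunk?
    assert grounding_check(answer, chunks), "answer not supported by context"
    # Step 3: end-to-end correctness against the labeled reference answer.
    assert answer_check(answer, example.reference), "final answer incorrect"
```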

Orchestration becomes a core human skill as people manage 10–20 agents.

He predicts humans will increasingly supervise a portfolio of specialized agents (e.g., marketing copy, creative generation, A/B iteration), with judgment, approvals, and accountability staying with the human.
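
A sketch of what that supervision pattern could look like: agents draft autonomously, nothing ships without human approval, and rejections feed back as reflection. All names here are illustrative:

```python
def supervise(agents, task_queue, human_review):
    """One human orchestrating a portfolio of specialized agents."""
    for task in task_queue:
        agent = agents[task.kind]            # e.g. "copy", "creative", "ab_test"
        draft = agent.run(task)              # the agent works autonomously
        # Accountability stays with the human: nothing executes unapproved.
        verdict = human_review(task, draft)
        if verdict.approved:
            task.publish(draft)
        else:
            agent.reflect(verdict.feedback)  # rejection becomes learning signal
```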

AI can expand PM scope, but only if PMs remain customer-first.

He sees PM ratios shifting (potentially 1 PM per 20–30 developers) via agents for competitive research, feedback synthesis, PRD drafting, prototyping, and monitoring—yet warns against skipping deep customer problem investigation.

Prototype-first communication will win over perfect PRDs.

Ruiz recounts leading a major platform initiative by showing a (non-production) prototype when others showed slides; with modern “vibe coding,” PMs can produce high-fidelity prototypes in hours, reducing translation loss.

In enterprise, open source “wins” because deployment control and governance matter.

He argues enterprises prioritize deploy-anywhere flexibility, data confidentiality, and integration with internal tools—advantages open models and open infrastructure (Kubernetes, vLLM, PyTorch) can offer over closed APIs.

IBM’s bet is hybrid flexibility plus small, transparent models plus governance.

He highlights IBM’s strategy: run AI near data across on-prem/hyperscaler/private cloud; offer Granite small models optimized for cost-per-token and customization; and build governance/compliance tooling to meet evolving regulation.

AI talent pay is rationalized as “capital allocation leverage.”

He claims top researchers are scarce and can make architecture decisions that determine how effectively billions in GPU/AI-cluster spend is utilized, which explains extreme compensation packages in the current market.

Using AI to generate content can help early growth, but becomes a liability later.

Ruiz says he used agents to research viral topics early on, but now uses AI less because it homogenizes voice; differentiation comes from original thinking, targeted audience focus, and disciplined daily writing systems.

WORDS WORTH SAVING

7 quotes

Agents… deliver the wall of automation that is gonna unlock everyone… to generate way more output.

Armand Ruiz

Four simple steps. The first one is thinking… planning… action… reflection.

Armand Ruiz

70% accuracy is not acceptable.

Armand Ruiz

Evals in agentic workflows should be almost… at every single step if you're really serious about developing something… a critical system.

Armand Ruiz

If you didn’t write the most beautiful detailed PRD, still a lot of information is lost in translation… nothing speaks better than just a working prototype.

Armand Ruiz

In the enterprise context… open source, I would say always wins.

Armand Ruiz

Our customers… don’t wanna spend like $20,000 in compute to get a benefit of like 100 bucks.

Armand Ruiz

QUESTIONS ANSWERED IN THIS EPISODE

5 questions

In your four-step loop, what are concrete implementation patterns for “reflection” (memory, feedback, reward signals) that actually improve agent quality over time without drifting?

AI agents are positioned as the next leap beyond chatbots because they combine reasoning with planning, tool-based action, and reflection to automate end-to-end work.

You mentioned MCP as opening up the “act” phase—what does MCP practically change for tool access, permissions, and standardization compared to prior tool-calling approaches?

Building agents increasingly splits into two tracks: code-first frameworks (e.g., LangGraph, CrewAI, LlamaIndex, AutoGen) for control and low/no-code builders (e.g., Lindy, n8n, LangFlow, Flowise) for accessibility.
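
For concreteness: MCP standardizes tool access as JSON-RPC 2.0 messages between a client and tool servers, with `tools/list` for discovery and `tools/call` for invocation. A sketch of a call; the tool name and arguments are invented, and the transport (stdio or HTTP) is elided:

```python
import json

# A hypothetical MCP tools/call request; the tool and arguments are made up.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "crm_update_contact",
        "arguments": {"contact_id": "42", "status": "qualified"},
    },
}
print(json.dumps(request, indent=2))  # would be sent to an MCP server
```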

When enterprises complain about RAG accuracy, which 3–5 pipeline levers usually move the needle most (chunking strategy, embeddings choice, hybrid search, rerankers, query rewriting, etc.)?

RAG remains a dominant enterprise technique for injecting fresh, company-specific context into LLM workflows, but reliability requires serious data engineering and evaluation at multiple steps—not just checking the final answer.

What does “acceptable business accuracy” look like in practice for different use cases (customer support vs internal search vs machine-to-machine automation), and how do you set those thresholds?

Managing “10–20 agents per employee” introduces a new orchestration skill: humans become accountable reviewers of agent outputs, with governance, cost controls, and safe experimentation as key enterprise constraints.

You argued evals should happen at multiple steps—what are the specific step-level evals you’d add for (1) retrieval quality, (2) grounding/attribution, and (3) tool-action correctness?

AI changes product management by compressing the PM lifecycle (research → prioritization → PRD → prototype → monitoring) and enabling broader PM coverage, while still requiring customer-first problem investigation to avoid feature-factory behavior.
