Aakash Gupta
AI for Product Managers: 10X Growth with Smart Experimentation
CHAPTERS
- 0:00 – 1:54
Why AI is the biggest shift in experimentation (and why most teams still under-test)
Aakash introduces Frederic De Todaro and frames the core problem: experimentation is powerful, but most organizations don’t A/B test most releases because the process is too slow and developer-dependent. Frederic argues AI is the biggest change he’s seen in his career and can make experimentation far more accessible.
- •AI is the biggest driver of change in experimentation over the last decade
- •Most product teams don’t test the majority of shipped work
- •The build phase is the long-standing bottleneck due to developer constraints
- •AI creates an opportunity to increase experimentation velocity and coverage
- 1:54 – 14:12
The 4-step experimentation loop and where AI helps most
Frederic outlines a simple experimentation framework: idea/assumption → build variations → configure targeting & KPIs → analyze results. The discussion maps how AI can accelerate each stage, especially build and analysis, while keeping humans accountable for context and decision quality.
- •Four steps: ideate/hypothesize, build, configure (targeting + KPIs), analyze
- •Experimentation is an iterative learning loop
- •AI can speed up all steps, but build has historically been hardest
- •Good hypotheses and measurable metrics remain prerequisites for success
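As a rough illustration of the last step of the loop (analyze), here is a minimal sketch of a frequentist two-proportion z-test on conversion counts. It is not taken from the talk: the `Experiment` fields and the function name are hypothetical, and real platforms typically add sequential-testing corrections and guardrail metrics.

```python
import math
from dataclasses import dataclass


@dataclass
class Experiment:
    """Hypothetical container for the first three steps of the loop."""
    hypothesis: str   # step 1: idea / assumption
    variation: str    # step 2: what was built
    targeting: str    # step 3: who sees it
    primary_kpi: str  # step 3: how success is measured


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Step 4: compare conversion rates of control (a) and variation (b).

    Returns (absolute lift, z statistic, two-sided p-value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


exp = Experiment(
    hypothesis="Sorting by price low-to-high increases add-to-cart rate",
    variation="default sort = price ascending",
    targeting="all catalog visitors",
    primary_kpi="add_to_cart_rate",
)
lift, z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(exp.primary_kpi, round(lift, 3), round(z, 2), p < 0.05)
```

The counts in the usage lines are made-up numbers chosen only to exercise the function.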
- 14:12 – 16:00
AI-assisted ideation: better ideas through better context (and ‘UX memory’)
They discuss how generative AI improves the ideation phase when it has strong context—product history, constraints, prior experiments, and even screenshots. Frederic highlights the value of connecting AI to an experimentation knowledge base so it can surface previously run tests and results, creating organizational “UX memory.”
- •GenAI can generate experiment ideas if given rich business/product context
- •Using projects/prompts with historical context improves output quality
- •AI can warn: “this was tested before” and share past results
- •‘UX memory’ helps avoid duplicated work and strengthens ideation
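The "this was tested before" warning can be pictured as a similarity lookup against a log of past experiments. The sketch below is only an assumption about how such a lookup might work: it uses plain token overlap (Jaccard similarity) instead of the retrieval a real assistant would use, and the records in `PAST_EXPERIMENTS` are invented for illustration.

```python
import re

# Invented example records; a real knowledge base would hold many more fields.
PAST_EXPERIMENTS = [
    {"name": "Sort catalog by price ascending", "result": "inconclusive"},
    {"name": "Newsletter popup on exit intent", "result": "won"},
]


def tokens(text):
    """Lowercase alphanumeric tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similar_past_tests(idea, memory, threshold=0.3):
    """Return (score, record) pairs whose Jaccard similarity meets the threshold."""
    idea_tokens = tokens(idea)
    matches = []
    for record in memory:
        rec_tokens = tokens(record["name"])
        union = idea_tokens | rec_tokens
        score = len(idea_tokens & rec_tokens) / len(union) if union else 0.0
        if score >= threshold:
            matches.append((score, record))
    return sorted(matches, key=lambda m: m[0], reverse=True)


for score, record in similar_past_tests("Sort the catalog by price", PAST_EXPERIMENTS):
    print(f"Tested before: {record['name']} ({record['result']}), similarity {score:.2f}")
```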
- 16:00 – 21:08
Human-in-the-loop roles: PM vs data scientist vs AI analyst
Frederic clarifies the division of labor: PMs provide business context and constraints; data scientists ensure methodological rigor, bias checks, and model choices; AI automates pattern finding and summarization. Both speakers emphasize that AI boosts speed, but humans are still needed for accountability and correctness.
- •PM role: business context, constraints, hypothesis, definition of success
- •Data scientist role: validate plausibility, detect bias, select methods/models (RAG vs fine-tune)
- •AI role: summarize results, detect patterns, suggest next steps for inconclusive tests
- •AI makes experimentation faster and more accessible, but humans ensure trust
- 21:08 – 26:56
Two waves of AI in experimentation: ML (2016) → GenAI (2022+)
Frederic gives a historical model: the machine learning wave improved targeting, traffic allocation, and analysis; the generative AI wave enabled content generation, assistants, and rapid creation of variations via prompting. The key leap is moving from tool-assisted testing to prompt-based experimentation that can ship in minutes.
- •ML wave (around 2016): AI targeting, multi-armed/contextual bandits, result analysis aids
- •GenAI wave (from 2022): content generation and RAG-based assistants
- •Late 2023+: models and ‘vibe coding’ accelerate idea-to-experiment dramatically
- •Prompt-based experimentation aims to remove sprint-level build delays
- 26:56 – 30:05
AI targeting: predicting intent to personalize who sees what
They explain AI targeting as real-time intent scoring (likelihood to convert, churn risk, etc.) based on behavior signals. This score replaces manual segmentation and enables smarter targeting—like showing discounts only to users who need them rather than giving coupons to everyone.
- •AI targeting generates an intent/conversion score from on-site behavior
- •Replaces manual segment-building during experiment configuration
- •Common use case: targeted discounts/coupons to prevent unnecessary margin loss
- •Enables personalized experiences based on predicted propensity
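A toy version of intent scoring helps show why a score can replace hand-built segments. The sketch below assumes a logistic model over three behavior signals; the signal names, weights, and thresholds are all hypothetical, and a production system would learn them from data.

```python
import math

# Hypothetical, hand-picked weights; a real model would be trained on behavior data.
WEIGHTS = {"pages_viewed": 0.15, "cart_adds": 0.9, "minutes_on_site": 0.05}
BIAS = -2.5


def conversion_score(signals):
    """Map on-site behavior signals to a 0..1 likelihood-to-convert score."""
    z = BIAS + sum(weight * signals.get(name, 0) for name, weight in WEIGHTS.items())
    return 1 / (1 + math.exp(-z))


def should_offer_discount(signals, low=0.2, high=0.6):
    """Offer the coupon only to visitors who are hesitant.

    Very low scores are unlikely to convert even with a coupon, and very high
    scores would likely convert without one.
    """
    return low <= conversion_score(signals) < high


hesitant = {"pages_viewed": 6, "cart_adds": 1, "minutes_on_site": 4}
committed = {"pages_viewed": 10, "cart_adds": 3, "minutes_on_site": 10}
print(should_offer_discount(hesitant), should_offer_discount(committed))
```

The middle band is the design choice that protects margin: the discount goes only where it is most likely to change the outcome.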
- 30:05 – 31:55
Multi-armed bandit vs contextual bandit: performance vs personalization
Frederic distinguishes standard A/B testing from bandit approaches. Multi-armed bandits quickly shift traffic toward the best-performing variant (trading off statistical accuracy for speed), while contextual bandits learn which variant works best per user/context to power hyper-personalization—both typically requiring substantial traffic.
- •A/B testing: equal split to reach high-confidence conclusion
- •Multi-armed bandit: reallocates traffic to best performer faster; less accuracy early on
- •Contextual bandit: chooses best variant per user/context; enables personalization
- •Both approaches work best with high traffic; contextual bandits need even more data to learn per-user preferences
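To show how a multi-armed bandit shifts traffic toward the better variant, here is a minimal simulation using Thompson sampling with Beta posteriors. Thompson sampling is one common bandit strategy and is an assumption here, not necessarily the algorithm discussed in the episode; the conversion rates are invented.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Simulate a Bernoulli bandit; return how many visitors each variant received."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    wins = [0] * n_arms
    losses = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(rounds):
        # Draw a plausible conversion rate for each variant from its posterior.
        samples = [rng.betavariate(wins[i] + 1, losses[i] + 1) for i in range(n_arms)]
        arm = samples.index(max(samples))
        pulls[arm] += 1
        # Simulated visitor converts with the variant's (unknown) true rate.
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return pulls


pulls = thompson_sampling(true_rates=[0.05, 0.20], rounds=5000)
print(pulls)  # most traffic should end up on the second variant
```

Unlike the fixed equal split of an A/B test, the allocation changes as evidence accumulates. A simple contextual variant could keep a separate set of `wins`/`losses` per context (for example per device type), which is a simplification of real contextual bandits and illustrates why they need more data.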
- 31:55 – 35:13
Opportunity detection: learning from the 80% of ‘failed’ experiments
Because most experiments don’t produce an overall lift, teams spend time drilling into segments to find where a variant might work. AI opportunity detection automates those breakdowns (device, segment, etc.) to propose actionable follow-ups like targeting the winning variant only to certain users.
- •Fewer than about 20% of experiments show an overall positive lift
- •Analysts often manually slice data to find segment-level wins
- •AI can automatically detect where/why results differ across segments
- •Turns inconclusive tests into next-step hypotheses (e.g., mobile-only targeting)
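The manual slicing that opportunity detection automates can be sketched as a per-segment lift table. The row format below (`variant`, `converted`, plus a segment field such as `device`) is an assumption for illustration, and the data is synthetic.

```python
from collections import defaultdict


def segment_lifts(rows, segment_key):
    """Absolute conversion lift (treatment minus control) per segment value."""
    # counts[segment][variant] = [conversions, visitors]
    counts = defaultdict(lambda: {"control": [0, 0], "treatment": [0, 0]})
    for row in rows:
        cell = counts[row[segment_key]][row["variant"]]
        cell[0] += int(row["converted"])
        cell[1] += 1
    lifts = {}
    for segment, cells in counts.items():
        (c_conv, c_n), (t_conv, t_n) = cells["control"], cells["treatment"]
        if c_n == 0 or t_n == 0:
            continue  # cannot compare without both variants
        lifts[segment] = t_conv / t_n - c_conv / c_n
    return lifts


def make_rows(device, variant, conversions, visitors):
    """Build synthetic rows with the given number of conversions."""
    return [
        {"device": device, "variant": variant, "converted": i < conversions}
        for i in range(visitors)
    ]


rows = (
    make_rows("mobile", "control", 10, 100) + make_rows("mobile", "treatment", 20, 100)
    + make_rows("desktop", "control", 20, 100) + make_rows("desktop", "treatment", 10, 100)
)
print(segment_lifts(rows, "device"))
```

In this synthetic data the overall result is flat while mobile and desktop move in opposite directions. Slicing many segments raises the odds of a chance finding, so a segment-level win is best treated as a hypothesis for a follow-up test.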
- 35:13 – 41:35
GenAI for experimentation: content generation and RAG assistants inside tools
Frederic shares practical uses of GenAI: generating alternative copy for banners/popups and providing an in-product assistant to answer experimentation questions and generate SDK/feature-flag code. They briefly cover statistical approaches (frequentist vs Bayesian) and CUPED as examples of what an assistant can recommend.
- •Content generation: create copy variations quickly for messaging, banners, popups
- •RAG assistant: answers “which stats engine should I use?” and can guide configuration
- •Assistant can generate code snippets for feature flags/SDK integrations
- •Stats concepts referenced: frequentist vs Bayesian; CUPED, which uses pre-experiment data to reduce metric variance so experiments conclude faster
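For readers unfamiliar with CUPED, the core adjustment is small enough to sketch. Each user's in-experiment metric is shifted by a multiple of how far their pre-experiment metric sits from the average, with the multiplier `theta = cov(X, Y) / var(X)`. The function name and sample values below are illustrative assumptions, not from the episode.

```python
from statistics import mean, pvariance


def cuped_adjust(metric, pre_metric):
    """Return CUPED-adjusted values of `metric` using the covariate `pre_metric`.

    metric:     per-user values measured during the experiment (Y)
    pre_metric: the same users' values from before the experiment (X)
    """
    x_bar = mean(pre_metric)
    y_bar = mean(metric)
    cov = mean([(x - x_bar) * (y - y_bar) for x, y in zip(pre_metric, metric)])
    theta = cov / pvariance(pre_metric)
    # Subtract the part of Y that is explained by pre-experiment behavior.
    return [y - theta * (x - x_bar) for x, y in zip(pre_metric, metric)]


pre = [1.0, 2.0, 3.0, 4.0, 5.0]      # made-up pre-period spend per user
during = [2.1, 3.9, 6.2, 7.8, 10.0]  # made-up in-experiment spend per user
adjusted = cuped_adjust(during, pre)
print(round(pvariance(during), 3), round(pvariance(adjusted), 3))
```

The adjusted values keep the same mean but have lower variance whenever the pre-period metric correlates with the in-experiment metric, which is what lets a test reach a conclusion with less traffic. In practice `theta` is usually estimated on data pooled across variants.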
- 41:35 – 43:36
From vibe coding to ‘vibe experimenting’: prompt-based experimentation concept
They contrast rapid prototyping (“can you build it?”) with experimentation (“should you build it?”). Prompt-based experimentation sits between the two: generate real, testable variations directly on production surfaces so teams can learn from actual behavior at scale, not just prototype feedback.
- •Vibe coding accelerates prototyping, but doesn’t validate real-world impact
- •Users often say they like prototypes but behavior differs in production
- •Prompt-based experimentation creates variations directly on the real site/app
- •Goal is faster learning and better product decisions, not just faster shipping
- 43:36 – 51:49
Live demo: idea → running experiment in ~2 minutes (prompt-to-variation)
Frederic demonstrates creating an experiment on an e-commerce catalog by prompting a change (default sorting to price low-to-high). The system extracts page context, identifies elements, generates JS/CSS, runs checks (including accessibility and mobile considerations), and produces a shippable variation rapidly.
- •Workflow: create experiment → open target page → prompt desired change
- •AI localizes the relevant UI element and generates implementation code
- •Outputs include target element, behavior changes, JS/CSS, and checks
- •Result: a working, previewable, shippable variation created in minutes
- 51:49 – 54:17
Governance at scale: keeping design/engineering/data checkpoints—just faster
They discuss how larger orgs should use prompt-based experimentation without bypassing collaboration. The proposed approach: keep the same stage gates (design review, engineering code review, data/metrics review) but make them lighter through simulation links, code visibility, and in-context previews of what will ship.
- •Prompt-based speed doesn’t eliminate the need for cross-functional review
- •Simulation/preview links help stakeholders review the real experience
- •Design reviews ensure UX/brand consistency; engineers review generated code
- •Data teams validate goals and instrumentation before launch
- 54:17 – 1:07:18
Beyond text prompts: generating variations from mockups and sketches
Frederic shows how experiments can be generated from uploaded mockups or quick sketches, including creating net-new UI elements like a newsletter popup. The AI asks clarifying questions (e.g., image needs), generates assets, and produces the required front-end code to implement the experience.
- •Mockup-to-variation: upload a design and prompt “build this version”
- •Sketch-to-variation: rough drawings can define layout/components
- •AI can generate images/assets and ask clarifying questions when needed
- •Supports more advanced UI changes (popups, layout changes, onboarding flows)
- 1:07:18 – 1:10:10
Measuring AI features: adoption, outcomes, experience—and the right North Star
The conversation shifts to how to evaluate AI features themselves. Frederic recommends tracking adoption, task outcomes, and user experience feedback (thumbs up/down), plus value-focused operational metrics like time-to-live and reduced developer involvement; he shares Kameleoon’s North Star: daily experiments running.
- •Core AI feature KPIs: adoption/usage, outcomes (task success), experience (feedback)
- •Operational metrics for prompt-based experimentation (PBX): prompts per experiment, time from prompt to live, % of experiments needing developer intervention
- •Kameleoon North Star: number of experiments running daily (correlated with churn/growth)
- •Choose actionable metrics that align teams and correlate with business outcomes
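The three operational metrics for prompt-based experimentation are simple aggregates over an experiment log. This sketch assumes a hypothetical record shape (`prompts`, `minutes_to_live`, `dev_intervened`) and uses invented values.

```python
from statistics import mean, median


def operational_metrics(experiments):
    """Summarize prompt-based experimentation efficiency from experiment records."""
    return {
        "prompts_per_experiment": mean([e["prompts"] for e in experiments]),
        # Median is less sensitive than the mean to a few slow launches.
        "median_minutes_prompt_to_live": median([e["minutes_to_live"] for e in experiments]),
        "pct_dev_intervention": 100 * mean([1 if e["dev_intervened"] else 0 for e in experiments]),
    }


log = [  # invented records for illustration
    {"prompts": 2, "minutes_to_live": 5, "dev_intervened": False},
    {"prompts": 4, "minutes_to_live": 15, "dev_intervened": True},
    {"prompts": 3, "minutes_to_live": 10, "dev_intervened": False},
]
print(operational_metrics(log))
```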
- 1:10:10 – 1:13:52
Measuring RAG systems: accuracy, relevance, context quality (+ LLM-as-judge)
Frederic provides a practical measurement model for RAG assistants: accuracy (faithfulness), relevance (does it answer the question), and context quality (are retrieved documents current and useful). He describes using an LLM-as-judge approach to score and validate these qualities and warns against stopping at usage alone.
- •Three RAG metrics: accuracy/faithfulness, response relevance, context quality
- •Failures include confidently wrong answers, off-topic answers, or outdated sources
- •LLM-as-judge can generate similar questions and score relevance/quality
- •Big mistake: measuring only usage instead of outcomes and reliability
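The three-metric model maps naturally onto a small evaluation harness. In the sketch below, `judge` is a hypothetical callable that takes a grading prompt and returns a score between 0 and 1; in practice it would wrap a call to an LLM, but here a constant stand-in keeps the example runnable. The prompt wording and the 0.7 threshold are assumptions.

```python
from typing import Callable, Dict, List

Judge = Callable[[str], float]  # hypothetical: grading prompt in, 0..1 score out


def score_rag_response(question: str, answer: str, contexts: List[str], judge: Judge) -> Dict[str, float]:
    """Score one assistant response on the three RAG qualities."""
    context_text = "\n".join(contexts)
    return {
        # Accuracy / faithfulness: is the answer supported by the retrieved documents?
        "faithfulness": judge(
            f"Score 0-1: is the ANSWER fully supported by the CONTEXT?\n"
            f"CONTEXT:\n{context_text}\nANSWER:\n{answer}"
        ),
        # Relevance: does the answer address the question that was asked?
        "relevance": judge(
            f"Score 0-1: does the ANSWER address the QUESTION?\n"
            f"QUESTION:\n{question}\nANSWER:\n{answer}"
        ),
        # Context quality: are the retrieved documents current and useful?
        "context_quality": judge(
            f"Score 0-1: is the CONTEXT current and useful for the QUESTION?\n"
            f"QUESTION:\n{question}\nCONTEXT:\n{context_text}"
        ),
    }


def passes(scores: Dict[str, float], threshold: float = 0.7) -> bool:
    """A response passes only if every quality clears the threshold."""
    return all(value >= threshold for value in scores.values())


def constant_judge(prompt: str) -> float:
    """Stand-in for a real LLM judge; always returns the same score."""
    return 0.9


scores = score_rag_response(
    question="Which stats engine should I use?",
    answer="Use the Bayesian engine for this test.",
    contexts=["Docs page describing the available stats engines."],
    judge=constant_judge,
)
print(scores, passes(scores))
```

Requiring all three scores to clear the threshold reflects the failure modes listed above: a fluent answer can still be unsupported, off-topic, or based on outdated sources.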
- 1:13:52 – 1:15:11
Experimentation culture: Booking.com example + common PM misconceptions
Frederic points to Booking.com as a model experimentation organization where nearly everything ships through tests, but advises adopting it step-by-step. He addresses common objections—experimentation slows delivery, lack of traffic, and “discovery is enough”—arguing that experimentation accelerates learning, traffic isn’t the only limiter, and discovery must be paired with real behavioral validation.
- •Booking.com highlighted as best-in-class experimentation culture
- •Adopt progressively: feature flags → rollouts/targeting → A/B testing at scale
- •Misconceptions: ‘it slows delivery,’ ‘we lack traffic,’ ‘discovery is enough’
- •Experimentation reveals what users do (not just what they say) and correlates with growth
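The progressive path from feature flags to rollouts to A/B tests can rest on one mechanism: deterministic bucketing of users. The sketch below is a generic illustration using a hash of the user ID, not Kameleoon's or Booking.com's implementation; the function names and the flag key are hypothetical.

```python
import hashlib


def bucket(user_id: str, key: str) -> int:
    """Map a user to a stable bucket in 0..99 for a given flag or experiment key."""
    digest = hashlib.sha256(f"{key}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def is_enabled(user_id: str, flag_key: str, rollout_percent: int) -> bool:
    """Feature flag / progressive rollout: on for the first N percent of buckets."""
    return bucket(user_id, flag_key) < rollout_percent


def assign_variant(user_id: str, experiment_key: str) -> str:
    """A/B test: the same bucketing, split 50/50 between two variants."""
    return "treatment" if bucket(user_id, experiment_key) < 50 else "control"


user = "user-123"
print(is_enabled(user, "new-sort-order", 10), assign_variant(user, "new-sort-order"))
```

Because the bucket depends only on the key and the user ID, raising `rollout_percent` from 10 to 50 keeps every already-enabled user enabled, and a user sees the same variant on every visit.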