Aakash Gupta
How AI PMs Ship Features Users Love (Descript CEO Explains)
In this episode, Aakash Gupta talks with Laura Burkhauser, CEO of Descript, about building AI editing tools and PM leadership.
At a glance
WHAT IT’S REALLY ABOUT
Descript CEO on building AI editing tools and PM leadership
- Descript’s early AI feature strategy focused on packaging reliable, job-based “buttons” (e.g., remove retakes, edit for clarity) rooted in well-understood user workflows rather than novelty prompts.
- The team shipped AI tools using pragmatic, human-driven evaluation against real production data, iterating via public beta, adoption/retention metrics, and whether users exported the AI-modified output.
- As user needs became too parameter-heavy for fixed tools (e.g., Create Clips requesting endless knobs), Descript shifted toward Underlord, an objective-driven, open-ended co-editor agent.
- Underlord’s rollout emphasized tool coverage, representative regression tests, real-customer private alpha feedback, and improved activation—especially helping novices “get over the hump” of video editing.
- Burkhauser frames the PM’s unique value in AI as defining success/failure criteria for evals, while her career advice stresses deep product/customer command, shipping excellence, and humility in founder-led environments.
IDEAS WORTH REMEMBERING
5 ideas
Great AI features start with a concrete workflow pain, not the model.
Descript mapped creator workflows (scripted vs. improvised) and attached AI to specific pains like retakes, eye contact, and clarity—then hid prompts behind dependable, job-based buttons.
Ship “reliable buttons” first; use agents when customization explodes.
Fixed tools work well when inputs are bounded, but Create Clips requests kept adding parameters; Underlord emerged as the right abstraction once users needed highly customized, conversational control.
Human evals against real data are a valid starting point—if you’re disciplined.
Before formal eval stacks were common, the team tested on production-like content and shipped when results were genuinely usable; later they layered regression tests, A/B tweaks, and more automation.
PMs uniquely own the definition of “quality” for AI outputs.
Burkhauser argues only the PM can codify what “good,” “acceptable,” and “harmful” look like because it requires deep job/context understanding (e.g., judging jump-cut density, not just grammar).
Representative eval data matters more than sophisticated scoring.
Studio Sound quality regressed when evaluators used unrealistically terrible audio; the best model for ‘disaster audio’ differed from the best for the common ‘laptop mic’ use case, so the dataset must match the target workflow.
WORDS WORTH SAVING
5 quotes
The best products out there, they don't just do a job for you. They transform how you feel about yourself.
— Laura Burkhauser
Build them in these prepackaged, parameterized, job-based buttons that can give you a reliable result over and over again.
— Laura Burkhauser
You and only you are qualified to write the eval criteria for what… a good job looks like.
— Laura Burkhauser
What it didn't take into account is… how many jump cuts per 10 seconds are you putting into my video?
— Laura Burkhauser
If you're allowing for emergence, you're also allowing for a lot of, like, whack stuff to happen in your product.
— Laura Burkhauser
QUESTIONS ANSWERED IN THIS EPISODE
5 questions
For each early AI “button” (retakes, filler words, clarity), what were the exact launch criteria that flipped it from ‘don’t ship’ to ‘ship’?
How did you decide which editing tasks should be chunked for context-window limits, and which tasks were blocked until models improved?
When you say you measured retention for an AI tool, what was the retention definition (e.g., weekly reuse, per-project reuse, cohort reuse)?
What were the most common reasons users *didn’t* export after applying an AI edit, and how did that feed back into quality improvements?
What did the ‘jump cuts per 10 seconds’ quality bar translate to operationally—did you add constraints, post-processing, or different prompting?
Chapter Breakdown
Why Descript feels transformative (and why that matters for AI PMs)
Laura opens with a product philosophy: the best products don’t just complete a task—they change how users feel about themselves. She and Aakash frame the episode around how shipping beloved AI features can compound into bigger scope and ultimately leadership opportunities.
Descript’s doc-based editor: the foundation that made AI features obvious
Laura demos the core Descript experience—transcript on the left, video on the right, optional timeline below. This workflow already had strong product-market fit for script-based editing, setting the stage for AI to remove tedious steps.
The Great AI Boom and picking the first LLM-powered editing “buttons”
Laura explains how Descript was AI-native early, but the LLM wave created new opportunities—and pressure—to integrate more AI. The team focused on LLM strengths (language) and turned prompts into reliable, job-based actions rather than generic chat.
From idea to build: timelines, context limits, and choosing feasible use cases
Aakash probes how the team decided what to build first and how they handled early technical limits like small context windows. Laura shares how chunking enabled certain tasks (retakes) while others (full rewrites) were constrained by needing broader context.
Customer segmentation drives AI feature mapping: scripted vs unscripted creators
Laura outlines a practical model of creator workflows and the pain points that differ by type. This segmentation guided which AI features mattered most, like Eye Contact for scripted delivery and Edit for Clarity for unscripted “rambling then polishing.”
Shipping approach: public beta + human-driven evals before “evals” were trendy
Laura explains how they launched the AI toolbar as a public beta and used heavy internal usage plus real production data. Quality gating was simple but strict: if a human editor would use the result, ship; if not, don’t.
Measuring success for AI tools: adoption, retention, and “export with it”
Success metrics centered on whether users repeatedly used the tools and shipped final content with the AI edits applied. Remove filler words served as a baseline benchmark, with thumbs up/down feedback as an additional quality signal.
The PM’s unique role in AI features: defining what “good” means via eval criteria
Laura argues PMs remain essential because they’re best positioned to define evaluation criteria grounded in real customer outcomes. She shares a concrete example: Edit for Clarity initially missed an editor-critical metric—too many jump cuts per 10 seconds.
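One way to make a "jump cuts per 10 seconds" criterion concrete is a sliding-window count over the cut timestamps of an edited video. The sketch below is illustrative only and is not Descript's implementation: the function names, the input format (a list of cut times in seconds), and the threshold of 3 cuts are all assumptions made for the example.

```python
def max_cuts_per_window(cut_times, window_s=10.0):
    """Return the largest number of cuts falling inside any span shorter than window_s seconds."""
    times = sorted(cut_times)
    best = 0
    left = 0
    for right, t in enumerate(times):
        # Advance the left edge until the span from times[left] to t is under window_s.
        while t - times[left] >= window_s:
            left += 1
        best = max(best, right - left + 1)
    return best


def passes_jump_cut_bar(cut_times, max_cuts=3, window_s=10.0):
    """Hypothetical pass/fail check; max_cuts=3 is an assumed threshold, not a known value."""
    return max_cuts_per_window(cut_times, window_s) <= max_cuts


# Hypothetical cut timestamps (seconds) produced by an automated edit.
dense_edit = [1.0, 2.5, 4.0, 9.0, 25.0]
sparse_edit = [3.0, 18.0, 40.0]

print(max_cuts_per_window(dense_edit))   # cuts in the densest 10-second span
print(passes_jump_cut_bar(dense_edit))
print(passes_jump_cut_bar(sparse_edit))
```

A check like this captures something a grammar-only evaluation misses: an edit can read cleanly as text and still be unwatchable if the cuts are packed too tightly.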
Failure lesson from Studio Sound: optimize for the real use case, not edge-case data
Laura describes how Studio Sound quality degraded when evaluation criteria drifted toward extreme “terrible audio” scenarios. The best model for awful audio isn’t the same as the best model for typical laptop mic audio—so the eval set must mirror the target user reality.
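The Studio Sound lesson can be illustrated with a small sketch of how the composition of an eval set changes which model looks best. Everything here is hypothetical: the model names, the category labels, the per-category scores, and the mix weights are invented for illustration and do not come from the episode.

```python
def weighted_score(scores_by_category, mix):
    """Average a model's per-category scores, weighted by the eval set's category mix."""
    return sum(weight * scores_by_category[category] for category, weight in mix.items())


def pick_model(models, mix):
    """Return the name of the model with the highest weighted score under the given mix."""
    return max(models, key=lambda name: weighted_score(models[name], mix))


# Invented quality scores (0 to 1) per audio category; not real measurements.
models = {
    "model_a": {"laptop_mic": 0.9, "disaster_audio": 0.5},
    "model_b": {"laptop_mic": 0.7, "disaster_audio": 0.8},
}

# An eval set that mirrors typical usage versus one skewed toward extreme cases.
representative_mix = {"laptop_mic": 0.9, "disaster_audio": 0.1}
skewed_mix = {"laptop_mic": 0.1, "disaster_audio": 0.9}

print(pick_model(models, representative_mix))
print(pick_model(models, skewed_mix))
```

With these made-up numbers the winner flips between the two mixes, which is the failure mode described above: an eval set dominated by extreme audio selects a model that is worse for the common case.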
Why Underlord (agent) instead of more buttons: escaping the “30 parameters” trap
As feature requests piled onto “Create Clips,” the team hit a limit of parameterized UI. Underlord emerged as an objective-driven co-editor that supports customized workflows and emergent use cases without endless knobs and dials.
Building and rolling out an open-world breadth agent: scope, tools, and report cards
Underlord was intentionally built as a breadth agent spanning the whole editor, which is harder than a narrow agent. Laura explains the core requirements: giving sufficient context, tool coverage across Descript, and an eval/report-card system to know where it fails and improve over time.
Quantifying and iterating Underlord: regression tests → alpha with real users → activation lift
Laura lays out a staged approach: start with capability regression tests, then run a private alpha to collect real prompting behavior, then convert that data into better regression sets and bug bashes. They ultimately measured impact by improved new-user activation versus the prior onboarding experience.
Career path to CEO: earning the founder’s trust through command, humility, and shipping
The conversation shifts to Laura’s career—from consulting to startups to Twitter—then how she joined Descript by cold outreach driven by genuine product love. She explains the IC-to-CEO arc in founder-led companies: earn trust by mastering product/customers/business and delivering repeatedly, not by forcing “strategy” prematurely.