At a glance
WHAT IT’S REALLY ABOUT
Building a convincing AI video avatar with Flow: fast, but imperfect
- Claire uses Google Flow’s avatar feature and Gemini Omni to create an AI video double of herself and produce a complete podcast hype video in roughly 15 minutes.
- Flow helps not only generate video clips but also brainstorm a seven-scene storyboard, positioning itself as an end-to-end creative suite rather than a single video model.
- The process includes real-world friction—like mistakenly generating images instead of videos—yet still results in usable scenes and a stitched final cut using a browser-based timeline editor.
- The avatar output is “terrifyingly good” in moments but inconsistent across shots, with uncanny facial expressions, shifting hair/background details, and stereotypical “futuristic AI” visual tropes.
- Claire concludes the tool is already valuable for fast solo content creation, and with tighter prompting and more reference inputs it could become convincing to most viewers.
IDEAS WORTH REMEMBERING
5 ideas
Flow’s real differentiator is combining ideation, generation, and editing in one place.
Claire leans on Flow to brainstorm a storyboard, generate multiple takes per scene, and assemble clips on a timeline in the browser—reducing tool-hopping and specialized skills.
Avatar capture can import unintended “truth” from your environment.
Background posters/books from the scan appear in generations, suggesting the model anchors heavily on whatever context is visible during capture and may leak personal/environmental cues into outputs.
Small UI mistakes can derail output type, but recovery is fast.
Claire accidentally generates images instead of videos due to a mode toggle; the workflow still makes it easy to re-run prompts and continue without major rework.
Character consistency is the current bottleneck for believable avatar video.
Across scenes, her hair length changes, the room color and props shift, and her face matches only “about 50% of the time,” indicating that reference inputs plus prompting are not yet enough for stable identity across cuts.
Facial emotion and performance remain the fastest route to uncanny valley.
Neutral or serious shots look more convincing, while laughing/smiling clips appear strange and “medicated,” implying expression synthesis is less reliable than static likeness.
WORDS WORTH SAVING
5 quotes
Today, I am doing a very strange episode where I'm gonna create a video avatar of myself, and in about 15 minutes, get to a full minute-long video starring none other than your favorite podcast host, Claire Vo.
— Claire Vo
I have no idea what we're gonna get into, and hopefully it won't be terrifying.
— Claire Vo
We were told AI would replace us. That is quite spooky.
— Claire Vo
Sorry. Sorry. For you all that are listening and not watching, I just got, um, jump scared by the AI version of myself wearing glasses, um, turning around in a spinning chair.
— Claire Vo
This took zero time and effort, and it is ... I wouldn't say it's, like, 80% there, but is it 50% there? 100% yes.
— Claire Vo
High quality AI-generated summary created from speaker-labeled transcript.
