How I AI: How Emmy Award–winning filmmakers use AI to automate the tedious parts of documentaries
At a glance
WHAT IT’S REALLY ABOUT
AI automates documentary logging, search, and field archival research workflows
- Tim McAleer (Florentine Films/Ken Burns) describes documentary post-production as a media-management problem: hundreds of hours of footage and tens of thousands of photos that historically required tedious manual logging.
- He demos how early one-off scripts evolved into a production REST API that extracts file specs and embedded metadata, scrapes the web for source truth, generates accurate descriptions, and processes video via frame sampling plus Whisper transcription.
- He then layers in vector embeddings (CLIP for images + text embeddings for descriptions) to enable semantic discovery and reverse-image search within a project’s archive.
- Finally, he shows two hyper-specific “vibe-coded” tools—an iOS app for field capture of photo fronts/backs with embedded EXIF metadata, and a macOS OCR cropping app for historical documents—highlighting AI’s biggest near-term value as workflow tooling, not content generation.
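The field-capture app pairs each photograph's front with its back so archival notes stay attached to the image. The talk doesn't specify how the app links the two sides; a minimal sketch, assuming a hypothetical filename convention (`<item>_front.jpg` / `<item>_back.jpg`), could look like this:

```python
from pathlib import Path

def pair_fronts_and_backs(filenames):
    """Group captured photo files into front/back pairs.

    Assumes a hypothetical naming convention like 'box3_017_front.jpg'
    and 'box3_017_back.jpg' -- the app's actual scheme (it embeds
    linkage in EXIF metadata) is not shown in the talk.
    """
    pairs = {}
    for name in filenames:
        stem = Path(name).stem               # e.g. 'box3_017_front'
        base, _, side = stem.rpartition("_")  # split off the side suffix
        if side in ("front", "back"):
            pairs.setdefault(base, {})[side] = name
    return pairs

photos = ["box3_017_front.jpg", "box3_017_back.jpg", "box3_018_front.jpg"]
print(pair_fronts_and_backs(photos))
```

In the real app the pairing travels inside EXIF fields rather than filenames, which survives renaming downstream.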
IDEAS WORTH REMEMBERING
5 ideas

Documentary workflows are dominated by asset management toil.
Nonfiction productions can involve tens of thousands of stills and hundreds of hours of footage; organizing, describing, and fact-checking assets becomes a core bottleneck that AI can relieve.
Accuracy improves when AI is constrained by trusted metadata.
Instead of relying on generic vision descriptions (which can hallucinate), McAleer appends embedded metadata (e.g., Library of Congress fields) and eventually web-scraped source info so the model anchors outputs to verifiable facts.
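The anchoring idea above is just prompt construction: paste the trusted fields into the request and forbid the model from guessing beyond them. A minimal sketch (field names are illustrative, not the actual Library of Congress schema):

```python
def build_description_prompt(trusted_metadata: dict) -> str:
    """Build a vision-model prompt pinned to verified metadata.

    The instruction wording and field names are illustrative; the point
    is that names, dates, and places come from trusted sources, not
    from the model's guesses.
    """
    facts = "\n".join(f"- {k}: {v}" for k, v in trusted_metadata.items())
    return (
        "Describe this archival photograph for a research database.\n"
        "Use ONLY the verified facts below for names, dates, and places; "
        "do not invent details that are not visible or listed.\n\n"
        f"Verified metadata:\n{facts}"
    )

prompt = build_description_prompt({
    "Title": "Crowd at Union Station",
    "Date": "1943",
    "Source": "Library of Congress",
})
print(prompt)
```

The same prompt skeleton works whether the facts come from embedded EXIF/IPTC fields or from a scraped source page.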
The real leverage is turning ad-hoc scripts into shared infrastructure.
He starts with a single Python script, then scales into a REST API that teammates can call from any database tool—standardizing a multi-step pipeline (specs → copy/rename → parse metadata → scrape URL → generate description).
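The multi-step pipeline behind that API can be sketched as a chain of small functions, one per stage. Every stage below is a stub standing in for the real implementation; the function names, fields, and placeholder values are assumptions, not the production API's:

```python
# Each stage takes an asset record, enriches it, and passes it on.
# All bodies are stubs; the real versions read files, rename on disk,
# parse embedded metadata, and fetch the source web page.

def extract_specs(asset):
    asset["specs"] = {"width": 4000, "height": 3000}  # placeholder values
    return asset

def copy_and_rename(asset):
    asset["archive_name"] = f"PROJ_{asset['id']:05d}.tif"  # hypothetical scheme
    return asset

def parse_embedded_metadata(asset):
    asset["metadata"] = {"source": asset.get("source_url", "unknown")}
    return asset

def scrape_source_page(asset):
    asset["source_facts"] = {"title": "(scraped title)"}  # stub for an HTTP fetch
    return asset

def generate_description(asset):
    asset["description"] = f"{asset['source_facts']['title']} ({asset['archive_name']})"
    return asset

PIPELINE = [extract_specs, copy_and_rename, parse_embedded_metadata,
            scrape_source_page, generate_description]

def process(asset):
    for step in PIPELINE:
        asset = step(asset)
    return asset

result = process({"id": 17, "source_url": "https://example.org/item/17"})
```

Wrapping `process` behind a REST endpoint is what lets teammates trigger the whole chain from any database tool instead of running a personal script.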
Video logging becomes tractable with sampling + a two-model approach.
To control cost, he samples frames at ~5-second intervals, captions those with a cheaper model, transcribes audio with Whisper, then sends the combined “video events” to a reasoning model to infer what’s happening.
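The sampling-plus-merge step above can be sketched in a few lines: pick frame timestamps at a fixed interval, then interleave frame captions with transcript segments into one time-ordered event stream for the reasoning model. The input shapes are assumptions, not the speaker's exact format:

```python
def sample_timestamps(duration_s: float, interval_s: float = 5.0):
    """Timestamps (in seconds) at which to grab frames for captioning."""
    t, out = 0.0, []
    while t < duration_s:
        out.append(t)
        t += interval_s
    return out

def build_video_events(frame_captions, transcript_segments):
    """Merge per-frame captions and Whisper-style transcript segments
    into one time-ordered event list for a reasoning model.

    frame_captions: list of (timestamp, caption) pairs
    transcript_segments: dicts with 'start' and 'text' keys (assumed shape)
    """
    events = [{"t": t, "type": "frame", "text": c} for t, c in frame_captions]
    events += [{"t": seg["start"], "type": "speech", "text": seg["text"]}
               for seg in transcript_segments]
    return sorted(events, key=lambda e: e["t"])

events = build_video_events(
    [(0.0, "wide shot of a river"), (5.0, "close-up of a steamboat")],
    [{"start": 2.0, "text": "The river defined the town."}],
)
```

Captioning only one frame every five seconds with a cheap model, and reserving the expensive reasoning model for the merged event list, is what keeps per-hour logging costs manageable.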
Embeddings unlock discovery that keyword search can’t match.
By fusing image embeddings (CLIP from thumbnails) with text embeddings (from descriptions), the archive supports semantic search and “Find Similar” reverse-image lookup—useful for editors seeking a consistent visual ‘vibe.’
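One simple way to fuse the two modalities is to concatenate L2-normalized image and text vectors and rank by cosine similarity; the talk doesn't specify the exact fusion scheme, so treat this as an illustrative sketch:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def fuse(image_emb, text_emb):
    """Concatenate L2-normalized image and text embeddings so both
    modalities contribute comparably to similarity (one simple fusion
    choice, assumed for illustration)."""
    return normalize(image_emb) + normalize(text_emb)

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (
        math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def find_similar(query, archive, top_k=3):
    """'Find Similar': rank archive assets by cosine similarity of fused vectors."""
    ranked = sorted(archive.items(), key=lambda kv: cosine(query, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]
```

A text query only needs a text embedding on one side to search the same index, which is how semantic search and reverse-image "Find Similar" can share one archive.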
WORDS WORTH SAVING
5 quotes

Post-production is like a technical mess of media management.
— Tim McAleer
My goal was to automate this. For years, this has been manual data entry.
— Tim McAleer
We want everything going into our database to be true and verifiable information.
— Tim McAleer
You’ve now freed them up to just look more, right?
— Tim McAleer
No one was gonna make me this app.
— Tim McAleer