At a glance
WHAT IT’S REALLY ABOUT
David AI builds high-quality conversational speech datasets for next-gen voice models
- David AI focuses narrowly on speech—especially multilingual, multi-accent conversational audio—because high-performing voice models depend on specialized data that is scarce online.
- The founders argue audio is uniquely difficult: there is no “Common Crawl” equivalent, and most internet audio is mono, while modern end-to-end speech architectures need clean, separated channels.
- The company originated from customer discovery with YC startups, where a humanoid robotics team’s urgent need for voice data revealed a broader market opportunity.
- David AI evolved from a weekend-built phone-calling prototype into a global platform collecting scripted and unscripted conversations, enabling rapid growth from $1K pilots to six- and seven-figure contracts.
- Their differentiated approach is a research-driven data product model—develop internally validated datasets first, then scale and offer them broadly—rather than bespoke professional-services labeling work.
IDEAS WORTH REMEMBERING
Audio data has a structural supply problem compared to text.
The founders claim there is no audio equivalent of Common Crawl, and what exists online is often unusable for modern training needs, making first-party collection a core advantage.
Separated-at-source audio is a key technical moat for conversational datasets.
Off-the-shelf source separation wasn’t good enough because end-to-end speech models tolerate very little channel bleed, so David AI captures each speaker on a separate channel at recording time rather than splitting mixed audio afterward.
A narrow vertical focus can beat broad data-platform assumptions.
Despite expectations that incumbents dominate “data for AI,” they argue going deep on one modality (voice) lets them solve the hardest edge cases and create repeatable products.
Customer discovery can reveal unexpectedly large markets.
A robotics customer needing voice data became the “aha” that voice is foundational across robots, wearables, games, and avatars—not just call centers.
Research-led data productization is an alternative to bespoke labeling services.
Instead of fulfilling one-off customer specs where the buyer owns the dataset, they run internal R&D to identify valuable dataset shapes, validate them, then scale and sell broadly.
WORDS WORTH SAVING
There’s no real, like, common crawl for audio.
— Ben Wiley
These models have very, very low tolerance for any sort of bleed between channels.
— Ben Wiley
The only way to get high quality data was to collect it separated at the source.
— Ben Wiley
We believe that the best way to build this kind of company is to pick a vertical and go really, really deep.
— Tomer Cohen
Voice AI apps are only as good as the models underneath them, and the models are only as good as the data underneath them.
— Tomer Cohen
High quality AI-generated summary created from speaker-labeled transcript.