Nikhil Kamath
The $11B Bet That Voice Will Replace Everything | Mati Staniszewski x Nikhil Kamath | WTF Online
At a glance
WHAT IT’S REALLY ABOUT
Why voice is the next interface—and how to build on it
- Voice will likely become a dominant interface once speech feels human-level—interruptible, emotional, low-latency, and intelligent—so devices can “fade into the background.”
- ElevenLabs positions itself as both a research company (foundational audio models) and a product company (creative tools and enterprise voice agents), reducing dependence on LLM platform providers.
- High-quality localization/dubbing is shifting from “robotic translations” toward preserving emotion and intent, with the next step being automated lip re-animation to match translated speech.
- Near-term profitable opportunities are in domain-specific voice agents for traditional industries (healthcare, automotive, e-commerce, financial services), where integration and workflow deployment are the real moat.
- The conversation broadens into geopolitics, trust, and social media. Kamath argues global platforms may fragment, while Staniszewski emphasizes network effects and trust; both explore how voice/AI could enable a healthier, authenticity-first social product.
IDEAS WORTH REMEMBERING
5 ideas
Voice wins when it matches human conversational dynamics.
Staniszewski argues adoption hinges on speech that’s fast, interruptible, emotionally accurate, and paired with strong “assistant intelligence”; otherwise, people won’t tolerate voice agents everywhere.
Great voice UX requires “knowledge plumbing,” not just a good TTS model.
Staniszewski highlights integrations (CRM, enterprise systems), memory, and deployment channels (WhatsApp, phone numbers, devices) as essential to making voice agents truly useful in real workflows.
Form factor is still the unsolved bottleneck for ubiquitous voice.
They discuss headphones (especially behind-the-ear), pendants, wristbands, and glasses; the technology may exist, but adoption depends on comfortable, always-available hardware with battery, sensors, and social acceptability.
Localization value comes from preserving emotion—and soon, fixing lips.
ElevenLabs’ dubbing approach aims to retain intonation and contextual emotion across languages, but mismatched sentence structure breaks visual timing; lip animation/re-animation is framed as the next major layer.
Entrepreneurs can build big businesses by pairing voice with “boring” domain expertise.
Staniszewski recommends targeting lagging industries (healthcare, automotive) and starting with deployable wedges like customer support, then expanding into richer in-product experiences once trust and distribution are established.
WORDS WORTH SAVING
5 quotes
To make this possible, there's at least three things that need to happen: ... voice quality and interaction ... knowledge access ... and the form factor.
— Mati Staniszewski
Dubbing is... a small market... the interactive use case is the biggest market... that will shift everything.
— Nikhil Kamath / Mati Staniszewski
An hour podcast will be few dollars... but... we did employ a wider set of people that would actually check every translation.
— Mati Staniszewski
I believe social media is broken today... an app which is not governed by an algorithm... triggering hate... I don't think is conducive in the long term.
— Nikhil Kamath
If you could have a platform that inspires curiosity and learning... and doesn't lead to just two sides screaming at each other—[that would be] incredible.
— Mati Staniszewski
High quality AI-generated summary created from speaker-labeled transcript.