Nikhil Kamath

The $11B Bet That Voice Will Replace Everything | Mati Staniszewski x Nikhil Kamath | WTF Online

Nikhil Kamath and Mati Staniszewski on why voice is the next interface—and how to build on it.

Nikhil Kamath (host) · Mati Staniszewski (guest)
Mar 11, 2026 · 59m · Watch on YouTube ↗
Topics: voice as the next tech interface · three prerequisites: quality, knowledge, form factor · headphones vs glasses vs pendants; ambient computing · ElevenLabs: creative tools vs enterprise agents · emotion-preserving dubbing and localization workflows · OpenAI/platform risk and defensibility beyond models · geopolitics, data residency, trust, and new social media design

In this episode of WTF Online, Nikhil Kamath and Mati Staniszewski explore why voice is the next interface and how to build on it. The core claim: voice will likely become a dominant interface once speech feels human-level (interruptible, emotional, low-latency, and intelligent), so that devices can “fade into the background.”

At a glance

WHAT IT’S REALLY ABOUT

Why voice is the next interface—and how to build on it

  1. Voice will likely become a dominant interface once speech feels human-level—interruptible, emotional, low-latency, and intelligent—so devices can “fade into the background.”
  2. ElevenLabs positions itself as both a research company (foundational audio models) and a product company (creative tools and enterprise voice agents), reducing dependence on LLM platform providers.
  3. High-quality localization/dubbing is shifting from “robotic translations” toward preserving emotion and intent, with the next step being automated lip re-animation to match translated speech.
  4. Near-term profitable opportunities are in domain-specific voice agents for traditional industries (healthcare, automotive, e-commerce, financial services), where integration and workflow deployment are the real moat.
  5. The conversation broadens into geopolitics, trust, and social media: Kamath argues global platforms may fragment, while Staniszewski emphasizes network effects and trust, and both explore how voice/AI could enable a healthier, authenticity-first social product.

IDEAS WORTH REMEMBERING

7 ideas

Voice wins when it matches human conversational dynamics.

Staniszewski argues adoption hinges on speech that’s fast, interruptible, emotionally accurate, and paired with strong “assistant intelligence,” otherwise people won’t tolerate voice agents everywhere.

Great voice UX requires “knowledge plumbing,” not just a good TTS model.

He highlights integrations (CRM, enterprise systems), memory, and deployment channels (WhatsApp, phone numbers, devices) as essential to make voice agents truly useful in real workflows.

Form factor is still the unsolved bottleneck for ubiquitous voice.

They discuss headphones (especially behind-the-ear), pendants, wristbands, and glasses; the technology may exist, but adoption depends on comfortable, always-available hardware with battery, sensors, and social acceptability.

Localization value comes from preserving emotion—and soon, fixing lips.

ElevenLabs’ dubbing approach aims to retain intonation and contextual emotion across languages, but mismatched sentence structure breaks visual timing; lip animation/re-animation is framed as the next major layer.

Entrepreneurs can build big businesses by pairing voice with “boring” domain expertise.

Staniszewski recommends targeting lagging industries (healthcare, automotive) and starting with deployable wedges like customer support, then expanding into richer in-product experiences once trust and distribution are established.

Defensibility in AI apps often lives in deployment, trust, and evaluation loops.

In response to fears that OpenAI or big labs will copy applications, he emphasizes moats like telephony integration, monitoring/evals, domain data, and ongoing operational performance—not just the underlying model.

A healthier social platform may require verification plus different emotional incentives.

Kamath wants a non-rage-driven, less algorithmically manipulative social product; Staniszewski suggests verification/real-human trust and voice-first interaction patterns (summaries, spoken comments, auto-translation) while acknowledging incentives are the hardest design problem.

WORDS WORTH SAVING

5 quotes

To make this possible, there's at least three things that need to happen: ... voice quality and interaction ... knowledge access ... and the form factor.

Mati Staniszewski

Dubbing is... a small market... the interactive use case is the biggest market... that will shift everything.

Nikhil Kamath / Mati Staniszewski

An hour podcast will be few dollars... but... we did employ a wider set of people that would actually check every translation.

Mati Staniszewski

I believe social media is broken today... an app which is not governed by an algorithm... triggering hate... I don't think is conducive in the long term.

Nikhil Kamath

If you could have a platform that inspires curiosity and learning... and doesn't lead to just two sides screaming at each other—[that would be] incredible.

Mati Staniszewski

QUESTIONS ANSWERED IN THIS EPISODE

5 questions

On the “three prerequisites,” what are the current biggest bottlenecks: latency, emotional prosody, or interruption handling—and what milestones mark “human-level” voice?

If behind-the-ear headphones are your preferred form factor, what specific sensors/on-device capabilities are required for the always-on agent use case without privacy backlash?

In dubbing, how do you technically represent and transfer “emotion” across languages—prosody tokens, reference audio conditioning, or conversation-level context windows?

Lip re-animation sounds adjacent to video generation—will ElevenLabs build it, partner, or acquire, and what makes that workflow hard to commoditize?

For entrepreneurs building vertical voice agents, what’s the minimum viable deployment wedge you’ve seen work fastest in healthcare or financial services?
