
The $11B Bet That Voice Will Replace Everything | Mati Staniszewski x Nikhil Kamath | WTF Online
Nikhil Kamath (host), Mati Staniszewski (guest)
In this episode of WTF Online, Nikhil Kamath speaks with ElevenLabs co-founder Mati Staniszewski about the $11B bet that voice will replace everything: why voice is the next interface, and how to build on it.
Why voice is the next interface—and how to build on it
Voice will likely become a dominant interface once speech feels human-level—interruptible, emotional, low-latency, and intelligent—so devices can “fade into the background.”
ElevenLabs positions itself as both a research company (foundational audio models) and a product company (creative tools and enterprise voice agents), reducing dependence on LLM platform providers.
High-quality localization/dubbing is shifting from “robotic translations” toward preserving emotion and intent, with the next step being automated lip re-animation to match translated speech.
Near-term profitable opportunities are in domain-specific voice agents for traditional industries (healthcare, automotive, e-commerce, financial services), where integration and workflow deployment are the real moat.
The conversation broadens into geopolitics, trust, and social media: Kamath argues global platforms may fragment, while Staniszewski emphasizes network effects and trust, and both explore how voice/AI could enable a healthier, authenticity-first social product.
Key Takeaways
Voice wins when it matches human conversational dynamics.
Staniszewski argues adoption hinges on speech that’s fast, interruptible, emotionally accurate, and paired with strong “assistant intelligence,” otherwise people won’t tolerate voice agents everywhere.
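The interruption requirement above can be sketched as a tiny state machine: the agent must cut its own playback the instant the user starts speaking, rather than finishing its sentence. This is an illustrative sketch, not ElevenLabs' implementation; the class and event names are invented for the example.

```python
from enum import Enum, auto

class AgentState(Enum):
    LISTENING = auto()
    SPEAKING = auto()

class VoiceAgent:
    """Toy model of "barge-in" handling: a voice agent that yields
    the floor immediately when voice-activity detection fires."""

    def __init__(self):
        self.state = AgentState.LISTENING
        self.events = []  # log of what the agent did

    def start_reply(self, text: str):
        self.state = AgentState.SPEAKING
        self.events.append(("speak", text))

    def on_user_audio(self):
        # VAD fired while we were mid-utterance: stop playback
        # right away and return to listening.
        if self.state is AgentState.SPEAKING:
            self.events.append(("interrupted", None))
        self.state = AgentState.LISTENING

agent = VoiceAgent()
agent.start_reply("Your appointment is confirmed for...")
agent.on_user_audio()  # user barges in mid-sentence
```

A real system would layer latency budgets and emotional prosody on top; the point here is only that "interruptible" is a control-flow property, not a model property.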
Great voice UX requires “knowledge plumbing,” not just a good TTS model.
He highlights integrations (CRM, enterprise systems), memory, and deployment channels (WhatsApp, phone numbers, devices) as essential to make voice agents truly useful in real workflows.
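The "knowledge plumbing" point can be made concrete with a minimal intent-to-tool dispatcher: the speech model only renders audio, while usefulness comes from routing recognized intents to backend systems such as a CRM or per-user memory. Every function and record name below is a hypothetical stand-in, not a real integration.

```python
# Stand-ins for real enterprise integrations (CRM, order DB, memory).

def lookup_order(user_id: str) -> str:
    fake_db = {"u42": "shipped, arriving Thursday"}
    return fake_db.get(user_id, "no order found")

def recall_memory(user_id: str) -> str:
    # Per-user conversational memory would live in a real store.
    return "prefers evening delivery"

# The agent's capabilities are just a routing table over tools.
TOOLS = {"order_status": lookup_order, "user_memory": recall_memory}

def handle_intent(intent: str, user_id: str) -> str:
    tool = TOOLS.get(intent)
    return tool(user_id) if tool else "sorry, I can't help with that"
```

The moat Staniszewski describes lives in the breadth and reliability of this table and its deployment channels (phone numbers, WhatsApp, devices), not in the dispatch logic itself.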
Form factor is still the unsolved bottleneck for ubiquitous voice.
They discuss headphones (especially behind-the-ear), pendants, wristbands, and glasses; the technology may exist, but adoption depends on comfortable, always-available hardware with battery, sensors, and social acceptability.
Localization value comes from preserving emotion—and soon, fixing lips.
ElevenLabs’ dubbing approach aims to retain intonation and contextual emotion across languages, but mismatched sentence structure breaks visual timing; lip animation/re-animation is framed as the next major layer.
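The timing problem above has a simple arithmetic core: the same sentence takes different durations across languages, so each dubbed segment must be checked against the original clip and either time-stretched or, eventually, matched with re-animated lips. The durations and the ±10% tolerance below are illustrative, not ElevenLabs' actual thresholds.

```python
def stretch_ratio(src_seconds: float, dub_seconds: float) -> float:
    """Factor by which to time-stretch the dubbed audio so it
    fits the original clip's duration."""
    return src_seconds / dub_seconds

segments = [
    {"src": 2.0, "dub": 2.6},  # translation runs long -> must compress
    {"src": 3.1, "dub": 3.0},  # near match -> leave alone
]

for seg in segments:
    r = stretch_ratio(seg["src"], seg["dub"])
    # Within ±10%, an audio-only fix is plausible; beyond that,
    # visual timing breaks and lip re-animation becomes necessary.
    seg["fits"] = 0.9 <= r <= 1.1
```

Sentence-structure differences between languages push many segments outside any audio-only tolerance, which is why lip re-animation is framed as the next layer.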
Entrepreneurs can build big businesses by pairing voice with “boring” domain expertise.
Staniszewski recommends targeting lagging industries (healthcare, automotive) and starting with deployable wedges like customer support, then expanding into richer in-product experiences once trust and distribution are established.
Defensibility in AI apps often lives in deployment, trust, and evaluation loops.
In response to fears that OpenAI or big labs will copy applications, he emphasizes moats like telephony integration, monitoring/evals, domain data, and ongoing operational performance—not just the underlying model.
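The "monitoring/evals" moat can be sketched as a scoring loop over production call transcripts, so regressions surface before customers complain. The criteria here (resolution, forbidden claims, verbosity) are illustrative placeholders, not a real evaluation suite.

```python
def evaluate_call(transcript: str, resolved: bool) -> dict:
    # Each check maps to a domain-specific quality criterion.
    return {
        "resolved": resolved,
        "no_forbidden_claims": "guarantee" not in transcript.lower(),
        "short_enough": len(transcript.split()) < 200,
    }

def pass_rate(calls) -> float:
    """Fraction of calls passing every check."""
    scores = [all(evaluate_call(t, r).values()) for t, r in calls]
    return sum(scores) / len(scores)

calls = [
    ("Your refund has been processed.", True),
    ("I guarantee the stock will rise.", True),  # fails compliance check
]
```

Accumulating domain-specific checks like these, wired into live telephony, is hard to copy from outside even if the underlying model is commoditized.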
A healthier social platform may require verification plus different emotional incentives.
Kamath wants a non-rage-driven, less algorithmically manipulative social product; Staniszewski suggests verification/real-human trust and voice-first interaction patterns (summaries, spoken comments, auto-translation) while acknowledging incentives are the hardest design problem.
Notable Quotes
“To make this possible, there's at least three things that need to happen: ... voice quality and interaction ... knowledge access ... and the form factor.”
— Mati Staniszewski
“Dubbing is... a small market... the interactive use case is the biggest market... that will shift everything.”
— Nikhil Kamath / Mati Staniszewski
“An hour podcast will be few dollars... but... we did employ a wider set of people that would actually check every translation.”
— Mati Staniszewski
“I believe social media is broken today... an app which is not governed by an algorithm... triggering hate... I don't think is conducive in the long term.”
— Nikhil Kamath
“If you could have a platform that inspires curiosity and learning... and doesn't lead to just two sides screaming at each other—[that would be] incredible.”
— Mati Staniszewski
Questions Answered in This Episode
On the “three prerequisites,” what are the current biggest bottlenecks: latency, emotional prosody, or interruption handling—and what milestones mark “human-level” voice?
If behind-the-ear headphones are your preferred form factor, what specific sensors/on-device capabilities are required for the always-on agent use case without privacy backlash?
In dubbing, how do you technically represent and transfer “emotion” across languages—prosody tokens, reference audio conditioning, or conversation-level context windows?
Lip re-animation sounds adjacent to video generation—will ElevenLabs build it, partner, or acquire, and what makes that workflow hard to commoditize?
For entrepreneurs building vertical voice agents, what’s the minimum viable deployment wedge you’ve seen work fastest in healthcare or financial services?
Transcript Preview
[upbeat music]
How many times have you-- this is your f- uh, how many times-
Fifth. Fifth time.
Fifth time.
Yeah. But never for podcast. Always for my fintech company.
Of course, yeah.
Are you in India?
We are.
Yeah?
We have, uh, I think it's between ten and fifteen. I think it's fourteen people now.
Nice.
If, uh, if we count the new people that are joining or joined.
And, and which city?
Mostly Bengaluru and a few, uh, close to Mumbai.
I'm from Bengaluru.
Oh, are you, do you-- are you based there or you are part-
Partly based there.
Where is the other part?
Uh, Mumbai and Goa.
Okay. [laughs]
[laughs] What do you think the opportunity is? What can one build in voice that can be a big profitable business tomorrow? You're based out of which city now, Mati?
From London.
Right, from London. Do you know Carl, the Nothing guy?
Yeah. Yeah, yeah, yeah. Of course.
Yeah, yeah, yeah. We're investors of his.
Oh, no way.
Yeah.
I'm a tiny investor too.
Really?
Yeah. What-
What do you think?
Well, I l- you know, it's so hard to innovate in hardware.
Yeah.
And Carl is one of the very few that is actually doing both innovating-
Mm-hmm
... and got to a good scale-
Yeah
... um, which is so hard.
Yeah.
Um, and I think he, he's, like, thinking around design and how there will be a combination of AI native devices and maybe they all take slightly different forms.
Yeah.
I think there's a, a great chance for, for Nothing. What do you think?
Uh, so we engaged recently actually. We got on their cap table. I like Carl. I like Carl a lot. I feel like it's a new guy who's trying something new. Uh, the sales look great. Uh, I think they went from five hundred mil last year to about nine hundred mil this year. Uh, but it's such a tough space, right? Hardware.
So tough. But it's, you know, I, I'm also excited.
Mm.
You know, maybe that's a little bit biased.
Mm.
But he, he can lead with voice in many ways.
Yeah.
You know, the, the Nothing headphones I think are, are great.
Yeah, yeah, yeah.
And they are actually one of the first ones-
Yeah
... to start integrating, like, AI-assisted part of the-
Yeah
... of the experience, so you could speak with the headphones, you could personalize music-
That's not out yet, right?
They're testing a few things in alpha.
Mm. Yeah.
But if they, if they, um, you know, if they move quickly, I think this could be, like, the first truly AI native device that's kind of with you, around you.
Yeah.
Um, and I mean, that opens so many opportunities. I'm frequently speaking with Carl of, uh-
Yeah
... um, how, like, in the ideal future, you could potentially speak any language-
Mm-hmm
... and it automatically gets real-time translated to any other language around you, so we can communicate and, and travel, travel around.