Nikhil Kamath
The $11B Bet That Voice Will Replace Everything | Mati Staniszewski x Nikhil Kamath | WTF Online
CHAPTERS
ElevenLabs in India, and why hardware (Nothing) might be the voice wedge
Nikhil and Mati open with ElevenLabs’ team presence in India and segue into hardware as the missing link for mainstream voice. They discuss Nothing’s positioning and how AI-native audio devices could make real-time translation and always-on assistance feel natural.
- ElevenLabs’ growing team footprint in India (Bengaluru/Mumbai)
- Why hardware is hard, and why Nothing is an interesting bet
- Earbuds/headphones as a potential “first AI-native” everyday device
- Real-time translation as a killer use case, gated by form factor
The “meeting companion” vision: voice agents, pendants, and opt-in transcription
Mati describes internal experiments using voice agents to capture event/meeting feedback and action items. The conversation explores what an ideal wearable note-taker could look like—assuming explicit consent and strong privacy norms.
- Voice agent for capturing feedback and shortening post-meeting workflows
- Opt-in transcription and the need for clear “safe environment” norms
- Pendant/wearable constraints: size, battery, capture quality
- Why form factor determines adoption as much as model quality
Voice as the next interface: what must improve for a Jarvis-like experience
Mati outlines why voice is a natural interaction layer and what it will take to move computing “into the background.” He breaks the transition into three requirements: human-level interaction quality, knowledge/memory access, and the right device form factor.
- Voice will become a major interface across support, education, devices, and robots
- Three prerequisites: natural interaction, knowledge+memory, and deployable hardware
- Need for low-latency, interruptible, emotionally expressive voice
- Vision: phone back in pocket; ambient voice assistance
Which devices win: headphones, glasses, pendants—and even silent speech
They debate the most likely winners among AI hardware formats. Mati leans toward behind-the-ear audio devices, and mentions emerging “silent speech” approaches that infer speech from mouth movements to interact privately.
- Likely multi-form-factor future (glasses/headphones/pendants/wristbands)
- Headphones as the most natural near-term interface; phone remains a companion
- On-device constraints and why additional sensors may require more wearables
- Silent-speech interaction via mouth-movement detection
Speculating on the Jony Ive + Sam Altman device, and why OpenAI voice matters
Nikhil and Mati speculate about what OpenAI and Jony Ive might build, focusing on behavior change vs. adoption friction. They also note OpenAI’s increased emphasis on voice, raising the competitive bar for specialists like ElevenLabs.
- Adoption likely starts with a familiar handheld form, then expands to companions
- Strategic importance of pre-installed AI-first hardware for distribution
- OpenAI’s push into voice as a competitive pressure test
- Why hardware/software bundling could reshape the AI landscape
Competing with big labs: ElevenLabs’ moat (research + product) and platform split
Nikhil raises the fear of platform providers moving up the stack; Mati responds with ElevenLabs’ strategy: owning core audio research and shipping products that create durable customer value. He explains their two major lines: creative tools and enterprise/agent platforms.
- Risk: large labs replicate successful apps built on their platforms
- ElevenLabs’ approach: foundational audio research plus product execution
- International voice quality as a differentiator (incl. Indian & European languages)
- Two platforms: Creative (voiceovers/localization) and Agents (CX/training/education)
Creator workflows and dubbing at scale: fixing audio, then localizing across languages
Mati explains practical creator use cases: patching missing lines with voice reconstruction and dubbing full podcasts into multiple languages. They discuss the Lex Fridman–PM Modi dubbing workflow and how cost differs between fully automated dubbing and human-reviewed, high-stakes translations.
- Post-production voice fixes: add/replace lines seamlessly in the same voice
- End-to-end dubbing workflow with human-in-the-loop review when needed
- Lex Fridman–PM Modi episode as a stress test of quality and accuracy
- Cost spectrum: automated dubbing is cheap; precision review adds labor costs
Preserving emotion in dubbed content—and the next step: lip reanimation
Nikhil challenges the core problem: dubbing often loses emotion and sounds robotic. Mati explains how ElevenLabs attempts to preserve intonation and context, and why sentence structure differences create lip-sync mismatches—leading to the need for lip animation/reanimation.
- Emotion/intonation preservation via contextual generation, not just translation
- Structural differences between languages shift timing (e.g., German word order)
- Current limitations: facial/lip movements may not match translated phrasing
- Near-term future: automatic lip reanimation integrated into localization workflows
Voice agents that make money today: domain-first businesses in lagging industries
Asked what entrepreneurs can build now, Mati recommends pairing voice tech with deep domain expertise—especially in “non-innovating” industries like automotive and healthcare. The key is deployment, integrations, compliance, and replicability beyond a single client.
- Best entry: combine voice agents with strong domain knowledge
- Targets: automotive, healthcare, financial services, e-commerce
- Start with customer support, then expand into richer in-product experiences
- Build once, then abstract into a repeatable product for multiple customers
Real-world deployments: Meesho scale and the ‘AI concierge’ future of commerce
Mati shares ElevenLabs’ work helping Meesho automate high-volume multilingual customer calls. They then zoom out to a broader vision: voice-driven shopping where an assistant helps discovery, comparison, and ordering—turning e-commerce into a guided conversation.
- Meesho: large-scale voice support (tens of thousands of calls) in Hindi/English
- Use cases: shipping/refund/order queries handled via voice agent
- Next step: on-site AI concierge for product discovery and guided purchase
- Voice shifts e-commerce from browsing to conversational intent fulfillment
Education, loneliness, and how voice changes what humans should learn
Nikhil and Mati explore voice agents as companions (including loneliness/intimacy) and as personalized tutors. They debate how always-available knowledge changes education—shifting value from memorization toward learning-to-learn and social development.
- Loneliness as a major future consumer use case for voice companionship
- Personal tutors modeled after great teachers (e.g., Feynman-style instruction)
- If knowledge is always in-ear, schooling shifts away from memorization
- Future education split: AI learning efficiency + human-only social time
AI valuations and global innovation: bubble talk, Europe/India opportunity, and risks
Nikhil questions whether AI valuations outpace revenue fundamentals. Mati argues this cycle is different due to clearer value and revenue, while acknowledging frothy pockets (e.g., GPU-layer providers) and emphasizing that global teams outside the Valley can build category leaders.
- Why some AI valuations may still be rational: high upside and real demand
- Skepticism around certain infrastructure middle-layers (GPU resellers)
- Global talent and company creation: Europe/India as major sources of innovation
- Research breakthroughs can justify premium multiples in early stages
Geopolitics, data residency, and trust: will platforms fragment by country?
Nikhil predicts a more multipolar tech world where countries demand local data, local platforms, and reduced dependence on global incumbents like WhatsApp. Mati partially disagrees, arguing foundational models and some infrastructures will remain concentrated, while fine-tuning and product layers localize—making trust and network effects central.
- Thesis: rising geopolitical tension may break trust in global tech platforms
- Data residency likely becomes standard; local versions of products proliferate
- Counterpoint: foundation-model training remains concentrated due to resources
- Trust + network effects as the main defenses for global platforms
Designing a new social network: authenticity, verification, voice-first interaction, and incentives
Nikhil shares plans for an India-first social platform that reduces algorithmic outrage and foreign influence over youth culture. Mati proposes voice-enabled “companion” consumption (summaries, deeper questions), multilingual access, strong human verification, and carefully designed incentives that don’t reward knee-jerk negativity.
- Problem framing: current social media is brand/influencer-heavy and outrage-optimized
- Proposed approach: tighter entry/verification to reduce bots and trolls
- Voice layer: assistant-like summaries, voice comments, and interactive engagement
- Hardest challenge: incentives that reward authenticity/curiosity (or neutrality) over outrage