This Startup Built the Infrastructure Powering Voice AI

In this episode of Founder Firesides, YC Managing Partner Jared Friedman talks to Dylan Fox, the founder of AssemblyAI (S17), which has raised $160M to date. AssemblyAI is the voice AI infrastructure platform powering 10,000 companies, including Granola, Zoom and Delta Airlines.

https://www.assemblyai.com/
Apply to Y Combinator: https://www.ycombinator.com/apply
Work at a startup: https://www.ycombinator.com/jobs

Chapters:
02:08 - What AssemblyAI actually does
05:23 - Dylan learns to code and discovers ML
07:11 - The Amazon Echo moment
09:32 - Why Dylan built voice AI infrastructure
13:02 - Building AI before anyone cared
16:50 - The 2021 inflection point
24:13 - Real-time voice agents are here
28:26 - Inside AssemblyAI’s new voice models
45:33 - Lessons from hypergrowth
52:00 - The future of voice AI

Jared Friedman (host), Dylan Fox (guest)

Mar 5, 2026 · 53m

CHAPTERS

AssemblyAI today: the voice AI infrastructure layer at massive scale
Dylan explains AssemblyAI’s core role: providing the APIs and primitives companies use to build voice AI products. He shares scale metrics (developers, customers, voice hours) and illustrates how AssemblyAI powers everything from note-takers to enterprise workflows.
- Infrastructure platform for voice AI features (transcription, analysis, real-time agents)
- Scale: ~1M developers, ~10K customers, hundreds of millions of voice hours annually
- Used across products (note-taking) and internal enterprise operations (trust & safety, contact centers)
- Emphasis on being a developer platform rather than an end-user app company
Customer examples: note-takers, contact centers, and enterprise deployments
Concrete customer stories make the platform tangible: popular note-taking tools and large contact-center stacks depend on AssemblyAI under the hood. The discussion highlights the “embedded” nature of infrastructure—end users often don’t realize voice is being processed by AssemblyAI.
- Note-taking apps like Granola and Fireflies built on AssemblyAI
- Hiring/workflow products (e.g., MetaView, Ashby’s note-taker) as voice-driven use cases
- Contact center deployments (e.g., vendors serving brands like airlines)
- Enterprise usage including large platforms like Zoom for multiple capabilities
From self-taught coder to early machine learning practitioner
Dylan recounts learning to code via books while building small college startups, then moving into machine learning professionally. He places his journey in the pre-LLM era, when classical ML gave way to early neural-network momentum around 2014–2015.
- Self-taught programming through building small SaaS experiments in college
- Early ML interest pre-deep-learning hype (SVM era)
- Joined ML work at Cisco in San Francisco and began focusing on neural networks
- Context: early TensorFlow meetups and the dawn of the modern deep learning wave
The Amazon Echo moment: voice finally felt reliable
Buying an Amazon Echo in 2015 was a turning point: voice recognition worked across the room and in noisy environments, creating new user habits. That experience triggered Dylan’s curiosity about building with voice—only to discover the developer tooling ecosystem was missing.
- Echo’s far-field reliability contrasted sharply with Siri-era frustration
- Reliability created behavior change (timers, weather, music)
- Sparked a desire to build voice-powered products personally
- Realization: developers lacked accessible, modern voice building blocks
Why build voice AI infrastructure: “Twilio/Stripe for voice” vs. CD-ROM incumbents
Dylan contrasts modern developer platforms (Twilio/Stripe) with the painful incumbent voice stack (Nuance’s expensive, outdated SDK workflow). AssemblyAI’s founding idea emerges: apply deep learning advances to voice and deliver it through a world-class developer experience.
- Nuance as incumbent: expensive upfront costs and outdated distribution model
- Developer-first inspiration from Twilio/Stripe-style APIs
- AssemblyAI positioning: infrastructure primitives, not a consumer application
- Mission: make advanced voice capabilities effortless for any developer/team
Building before the market existed: early YC years and slow iteration cycles
In YC 2017, AssemblyAI faced a brutal reality: models were hard to improve quickly, and the market for voice applications was tiny. Dylan explains the broader missing ecosystem (LLMs, real-time tech, mobile networks) that had to mature before voice apps could flourish.
- Solo-founder stress: competing with fast-iterating web startups while ML cycles took weeks
- Early users required model improvements, not UI tweaks, which meant slower feedback loops
- Voice AI required an ecosystem: LLMs, vector DBs, WebRTC, 5G, supporting tooling
- Conviction driven by obsession with the problem, not near-term TAM or business plans
The 2021 inflection point: COVID data, transformers, and NLP stacking
AssemblyAI’s breakout begins around 2021 as more voice data moved online during remote work and model quality/cost improved. Transformers, more data, and adjacent NLP capabilities made it easier to build useful voice applications beyond raw transcription.
- COVID accelerated internet voice data generation (meetings, remote work, podcasts)
- Model improvements: transformers, more training data, better cost/performance
- Stacking capabilities: transcription enabled summarization/sentiment workflows (helped by BERT-era NLP)
- First “real” customer and Series A momentum leading into rapid acceleration
From batch audio to real-time: the threshold moment for voice agents
Early deployments were mostly non-real-time (post-call analytics, meeting transcription). Dylan explains why real-time is harder and why it only recently crossed a practical threshold—unlocking voice agents and other interactive experiences now “good enough” to deploy broadly.
- Initial use cases: pre-recorded audio analytics and processing
- Real-time challenges: accuracy, latency, and robustness must all hold simultaneously
- Breakthrough over the last 18 months: real-time crossed a usability threshold
- Ongoing gaps (e.g., real-time speaker ID) leave room for major improvement
Where voice AI is heading: agents, robotics/hardware, and ambient intelligence
Dylan outlines the fastest-growing application categories: voice agents that customers can’t reliably distinguish from humans, voice interfaces for robots and consumer devices, and ambient capture in healthcare and sales. The focus shifts from novelty to ROI-driven deployment.
- Real-time voice agents: high enough success rates and strong ROI for support/reception workflows
- Robotics and consumer hardware: voice as a natural UI for devices beyond touchscreens
- Ambient healthcare scribes: noisy, far-field clinical audio now transcribable at high accuracy
- Ambient sales coaching: real-time guidance improving rep outcomes and compensation
Inside the new generation of voice models: Universal-3 Pro and ‘promptable STT’
AssemblyAI introduces a model positioned between traditional speech-to-text and multimodal LLMs: reliable transcription plus instruction-following for controllability. The emphasis is on staying ‘on the rails’ for speech tasks while adding configurable behavior developers can shape via prompts.
- Universal-3 Pro: “more intelligent” voice model that can follow instructions
- Designed to be more reliable than multimodal LLMs for real-world speech workflows
- Goal: capture context (noise, stress, multiple speakers, multilingual settings) in real time
- Supports self-hosting for lower latency and production control
Live demo: verbatim capture, alphanumerics, whisper robustness, and translation prompts
Dylan demonstrates low-latency verbatim transcription, strong performance on emails and alphanumeric strings, and better-than-typical robustness under difficult audio (including whispering). He also shows prompt-driven behavior changes such as translating speech into Spanish during transcription.
- Verbatim capture of disfluencies (stutters, filler sounds) with low latency
- High accuracy on emails, IDs, and long alphanumeric sequences (call center critical)
- Robustness experiments: quieter/whispered speech still works reasonably well
- Prompt-based translation: transcribe + translate to another language on the fly
Controllability for real apps: cross-talk handling and ‘what to ignore’ vs ‘what to capture’
A key theme is that different products need different behaviors—sometimes you want background speakers, sometimes you don’t. Dylan shows prompting the model to mark cross-talk segments without transcribing them, highlighting granular control that developers can tailor per use case.
- Different application requirements: primary speaker-only vs background capture
- Prompting can mark cross-talk segments without full transcription
- Option to transcribe background speech when needed (configurable)
- Positioning: STT-like reliability with instruction-following, not a general assistant
Hypergrowth lessons: hiring, capital pressure, and staying lean for speed
Dylan reflects on scaling from a tiny team to rapid growth after major fundraising. He shares hiring and organizational lessons: avoid hiring just to ‘explore,’ define role non-negotiables, prioritize mission/market passion, and minimize process overhead to preserve speed.
- Capital can increase pressure and encourage premature org build-out
- Mistake pattern: hiring ahead of conviction (exploration vs investment)
- Hiring discipline: role-specific non-negotiables and strong culture/mission fit
- Operating model: small (~80 people), transparent metrics, minimal bureaucracy to move fast
Living in the voice-first future: company-wide knowledge from transcripts and feedback
AssemblyAI uses its own technology to capture and organize internal knowledge, making customer truth accessible to everyone. Dylan argues AI-augmented organizations will outcompete those that rely on manual layers of interpretation between customers and builders.
- AI note-takers in meetings create a searchable internal knowledge base
- Unified view of customer truth: support, sales calls, social feedback, transcripts
- Reduces layers between engineers and customer reality; accelerates iteration
- Belief: AI-augmented companies will systematically outperform non-augmented ones

