YC Root Access: This Startup Built the Infrastructure Powering Voice AI
At a glance
WHAT IT’S REALLY ABOUT
AssemblyAI built voice infrastructure enabling real-time agents and audio understanding
- AssemblyAI provides voice AI infrastructure—APIs and models—used by products like note takers, contact centers, and emerging real-time voice agents at massive scale.
- Fox’s motivation came from experiencing the Amazon Echo’s reliability and noticing that existing speech vendors (e.g., Nuance) offered a poor, inaccessible developer experience, nothing like what Stripe and Twilio gave developers.
- The company started extremely early (YC 2017) and endured years of slow progress until a 2021–2022 inflection driven by better models, more voice data from COVID-era remote work, and a maturing ecosystem (LLMs, WebRTC, etc.).
- Voice is shifting from batch transcription to real-time interaction, with recent breakthroughs crossing thresholds of accuracy, latency, and cost that unlock widespread deployment.
- AssemblyAI’s roadmap centers on “smarter” promptable speech models (e.g., Universal-3 Pro) that remain reliable for transcription while adding instruction-following and richer audio understanding.
IDEAS WORTH REMEMBERING
Voice AI becomes valuable when it’s developer-accessible infrastructure, not just a model.
AssemblyAI’s wedge is a Stripe/Twilio-like experience for voice primitives so teams can ship note takers, call analytics, and agents without building speech stacks in-house.
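To make that Stripe/Twilio-like experience concrete, here is a minimal sketch against AssemblyAI’s public v2 transcription endpoint: submit a job, poll for completion, read the text. The API key and audio URL are placeholders, not values from the episode.

```python
# Minimal sketch: submit an audio file for transcription and poll for the result.
# Endpoint shapes follow AssemblyAI's public REST docs; key and URL are placeholders.
import time
import requests

API_KEY = "YOUR_ASSEMBLYAI_API_KEY"  # placeholder
BASE = "https://api.assemblyai.com/v2"
headers = {"authorization": API_KEY}

# Kick off an async transcription job for a hosted audio file.
job = requests.post(
    f"{BASE}/transcript",
    headers=headers,
    json={"audio_url": "https://example.com/meeting.mp3"},  # placeholder URL
).json()

# Poll until the job finishes (a webhook would avoid polling in production).
while True:
    result = requests.get(f"{BASE}/transcript/{job['id']}", headers=headers).json()
    if result["status"] in ("completed", "error"):
        break
    time.sleep(3)

print(result.get("text") or result.get("error"))
```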
Product breakthroughs hinge on crossing practical thresholds—not just incremental accuracy gains.
Fox describes adoption surging once real-time models became “good enough” on accuracy, latency, and cost; before that, even compelling demos didn’t create sustained usage.
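For a picture of what “real-time” means at the API layer, here is a sketch against AssemblyAI’s v2 real-time WebSocket endpoint: stream PCM audio up, print partial and final transcripts as they arrive. The endpoint and message fields follow its public docs as of this writing and may have changed since; the API key is a placeholder.

```python
# Sketch: stream 16 kHz, 16-bit mono PCM frames over a WebSocket and print
# transcripts as they arrive. Fields follow AssemblyAI's v2 real-time docs.
import asyncio, base64, json
import websockets  # pip install websockets

API_KEY = "YOUR_ASSEMBLYAI_API_KEY"  # placeholder
URL = "wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000"

async def stream(pcm_chunks):
    # websockets>=14 renames extra_headers to additional_headers.
    async with websockets.connect(URL, extra_headers={"Authorization": API_KEY}) as ws:
        await ws.recv()  # session-begins handshake message

        async def send():
            for chunk in pcm_chunks:  # raw PCM frames as bytes
                await ws.send(json.dumps({"audio_data": base64.b64encode(chunk).decode()}))
                await asyncio.sleep(0.1)  # pace roughly in real time
            await ws.send(json.dumps({"terminate_session": True}))

        async def receive():
            async for msg in ws:
                data = json.loads(msg)
                if data.get("message_type") in ("PartialTranscript", "FinalTranscript"):
                    print(data["text"])

        await asyncio.gather(send(), receive())

# Example: asyncio.run(stream(frames)), where frames yields raw PCM byte chunks.
```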
Ecosystem readiness can matter more than a single company’s technology.
Early on, voice apps lacked enabling layers (LLMs, vector databases, WebRTC, bandwidth/5G), so most use cases stayed non-real-time and narrowly transcription-focused.
COVID accelerated voice AI by increasing recorded voice data and normalizing audio workflows.
Remote work, meetings, and podcasts expanded the amount of internet voice data and made transcription + downstream analysis (summaries, sentiment) more economically compelling.
Promptable speech models are a middle path between STT and multimodal LLMs.
Universal-3 Pro is positioned as reliable for transcription-like tasks while also following instructions (e.g., translate to Spanish, mark crosstalk), without the “off-the-rails” behavior of general multimodal LLMs.
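To make “promptable” concrete: a request might pair audio with an instruction. The speech_model value and prompt field below are illustrative assumptions built from the episode’s examples, not documented API parameters.

```python
# Hypothetical illustration only: the "speech_model" and "prompt" fields are
# assumptions for illustration, not documented AssemblyAI parameters.
import requests

headers = {"authorization": "YOUR_ASSEMBLYAI_API_KEY"}  # placeholder
job = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    headers=headers,
    json={
        "audio_url": "https://example.com/support-call.mp3",       # placeholder
        "speech_model": "universal-3-pro",                          # assumed identifier
        "prompt": "Translate the speech to Spanish and mark crosstalk.",  # assumed field
    },
)
print(job.json())
```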
WORDS WORTH SAVING
We really are focused on helping other companies have the voice AI infrastructure and primitives they need to just innovate around voice.
— Dylan Fox
They would mail you a CD-ROM with a developer SDK on it, and I didn’t even have a CD-ROM drive on my laptop.
— Dylan Fox
If you said AI… it was like a scam. Deep learning was the thing to say.
— Dylan Fox
Real-time voice agents… work really well now. It’s kind of wild.
— Dylan Fox
My goal is having as little operational overhead as possible… teetering… on the line of complete chaos.
— Dylan Fox