No Priors Ep. 70 | With Cartesia Co-Founders Karan Goel & Albert Gu
At a glance
WHAT IT’S REALLY ABOUT
Cartesia Bets on State Space Models for Real-Time Voice AI
- Cartesia co-founders Karan Goel and Albert Gu discuss their work on state space models (SSMs) like S4 and Mamba as efficient, elegant alternatives and complements to Transformers. They explain why SSMs are particularly well-suited for perceptual and multimodal data such as audio, and how this underpins their flagship product Sonic, a low-latency text-to-speech engine. The conversation covers technical trade-offs between SSMs and Transformers, hybrid architectures, and the potential to run powerful multimodal models on consumer devices instead of only in data centers. They also outline Cartesia’s roadmap toward multimodal conversational agents, on-device inference, and building a broader “rebellion” against Transformer-only thinking.
IDEAS WORTH REMEMBERING
5 ideas
State space models offer linear-time sequence processing, making them ideal for long, streaming data.
Unlike Transformers’ quadratic scaling with sequence length, SSMs update a compressed state in constant time per token, which is crucial for audio, video, and other sensor data that require fast, continuous processing.
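The constant-per-token update described above can be sketched as a simple linear recurrence. This is an illustrative toy, not S4's or Mamba's actual parameterization (those use structured, learned matrices and discretization tricks); the point is only that the hidden state has fixed size, so cost and memory per token do not grow with sequence length:

```python
import numpy as np

# Minimal linear state space recurrence (illustrative only): the hidden
# state h is a fixed-size "fuzzy compression" of everything seen so far,
# updated in O(1) time per token regardless of sequence length.
rng = np.random.default_rng(0)
d_state, d_in = 16, 4
A = rng.standard_normal((d_state, d_state)) * 0.1  # state transition
B = rng.standard_normal((d_state, d_in))           # input projection
C = rng.standard_normal((d_in, d_state))           # output readout

def ssm_step(h, x):
    """One constant-time update: fold token x into state h, emit output y."""
    h = A @ h + B @ x
    y = C @ h
    return h, y

# Stream a sequence token by token; state size (and thus memory) is fixed,
# unlike an attention cache that grows with every token.
h = np.zeros(d_state)
outputs = []
for x in rng.standard_normal((100, d_in)):
    h, y = ssm_step(h, x)
    outputs.append(y)
```

Contrast this with self-attention, where each new token attends over all previous ones, giving quadratic total cost and a cache that grows with the stream.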
Transformers and SSMs are complementary, and hybrid architectures often outperform either alone.
SSMs act as ‘fuzzy compressors’ for bulk processing, while attention layers serve as exact retrieval or cache; many groups have found that mostly-SSM models with a small fraction of attention (around 10:1) work best.
Text-to-speech is far from “solved” when judged by human-level engagement and nuance.
Cartesia sees gaps in emotion, role-specific speaking styles, and long-form listenability; a practical test is whether you’d enjoy talking to the voice for more than 30 seconds.
High-quality speech systems increasingly require real language understanding, not just signal generation.
To pronounce words correctly and respond naturally, TTS and ASR need deeper semantic and contextual modeling, pushing systems toward integrated multimodal language models rather than isolated components.
On-device and edge AI will be a major next wave after large cloud models.
Cartesia is focused on making powerful models efficient enough to run in real time on laptops and eventually smaller hardware, reducing latency and cloud costs and enabling new applications that assume local intelligence.
WORDS WORTH SAVING
5 quotes
People think you can throw a Transformer at anything and it just works. Actually, it doesn’t really.
— Albert Gu
I think of these state space models as fuzzy compressors, keeping a state in memory that’s always updating as you see new information.
— Albert Gu
The way I think about it is: would I want to talk to this thing for more than 30 seconds? If the answer is no, then it’s not solved.
— Karan Goel
The future will be more intelligence everywhere, and how do you enable that piece is what we’re excited about.
— Karan Goel
In the end, all the systems go away and it’s just one model.
— Albert Gu
High quality AI-generated summary created from speaker-labeled transcript.