No Priors Ep. 70 | With Cartesia Co-Founders Karan Goel & Albert Gu

No Priors · Jun 27, 2024 · 34m

Sarah Guo (host), Elad Gil (host), Karan Goel (guest), Albert Gu (guest)

Origins and research journey behind S4, Mamba, and state space models
Differences and trade-offs between SSMs and Transformer architectures
Applications of SSMs to audio, text, DNA, and multimodal data
Cartesia’s Sonic product: low-latency, high-quality text-to-speech
On-device and edge inference as a second wave of AI innovation
Future plans for multimodal, real-time conversational models
Company building: team, hiring, and the ‘rebellion’ against Transformers

Cartesia Bets on State Space Models for Real-Time Voice AI

Cartesia co-founders Karan Goel and Albert Gu discuss their work on state space models (SSMs) like S4 and Mamba as efficient, elegant alternatives and complements to Transformers. They explain why SSMs are particularly well-suited for perceptual and multimodal data such as audio, and how this underpins their flagship product Sonic, a low-latency text-to-speech engine. The conversation covers technical trade-offs between SSMs and Transformers, hybrid architectures, and the potential to run powerful multimodal models on consumer devices instead of only in data centers. They also outline Cartesia’s roadmap toward multimodal conversational agents, on-device inference, and building a broader “rebellion” against Transformer-only thinking.

Key Takeaways

State space models offer linear-time sequence processing, making them ideal for long, streaming data.

Unlike Transformers’ quadratic scaling with sequence length, SSMs update a compressed state in constant time per token, which is crucial for audio, video, and other sensor data that require fast, continuous processing.
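The constant-time update can be sketched as a toy discrete state space recurrence (this is an illustrative simplification, not Cartesia's actual S4 or Mamba code; the matrices `A`, `B`, `C` here are random placeholders):

```python
import numpy as np

def ssm_step(h, x, A, B, C):
    """One recurrent update: cost is constant per token,
    independent of how many tokens came before."""
    h = A @ h + B * x      # fold the new input into the fixed-size state
    y = C @ h              # read out a prediction from the state
    return h, y

# Toy scalar-input example with a 4-dimensional hidden state.
rng = np.random.default_rng(0)
d = 4
A = 0.9 * np.eye(d)            # stable state transition (toy choice)
B = rng.standard_normal(d)
C = rng.standard_normal((1, d))

h = np.zeros(d)
for x in [0.5, -1.0, 2.0]:     # stream tokens one at a time
    h, y = ssm_step(h, x, A, B, C)
```

Because each step touches only the fixed-size state `h`, processing a stream of length L costs O(L) total, whereas full self-attention over the same stream costs O(L²).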

Transformers and SSMs are complementary, and hybrid architectures often outperform either alone.

SSMs act as ‘fuzzy compressors’ for bulk processing, while attention layers serve as an exact retrieval mechanism or cache; many groups have found that mostly-SSM models with a small fraction of attention layers (roughly one attention layer per ten SSM layers) work best.
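The interleaving described above can be sketched as a simple layer plan (a hypothetical helper for illustration; the exact ratio and placement vary across published hybrid models):

```python
def hybrid_layer_plan(n_layers, attn_every=10):
    """Return a layer-type list with one attention layer per
    `attn_every` layers; all remaining layers are SSM blocks."""
    return ["attention" if (i + 1) % attn_every == 0 else "ssm"
            for i in range(n_layers)]

plan = hybrid_layer_plan(20)
# plan contains 18 "ssm" layers and 2 "attention" layers
```

The sparse attention layers give the model a place to do exact token-level lookups, while the SSM layers handle the bulk of the sequence compression cheaply.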

Text-to-speech is far from “solved” when judged by human-level engagement and nuance.

Cartesia sees gaps in emotion, role-specific speaking styles, and long-form listenability; a practical test is whether you’d enjoy talking to the voice for more than 30 seconds.

High-quality speech systems increasingly require real language understanding, not just signal generation.

To pronounce words correctly and respond naturally, TTS and ASR need deeper semantic and contextual modeling, pushing systems toward integrated multimodal language models rather than isolated components.

On-device and edge AI will be a major next wave after large cloud models.

Cartesia is focused on making powerful models efficient enough to run in real time on laptops and eventually smaller hardware, reducing latency and cloud costs and enabling new applications that assume local intelligence.

Audio and sensor-heavy applications are prime targets for SSM-based systems.

Because audio and video are information-sparse and highly compressible, SSMs’ streaming, compressive nature fits well for gaming NPCs, voice agents, security cameras, and other real-time interactive experiences.

Aesthetic and conceptual elegance strongly shapes Cartesia’s research direction.

Gu emphasizes choosing architectures that feel fundamentally ‘right’ and unified, aiming to replace complex system orchestration (multiple chained models) with simpler, more powerful single-model solutions where possible.

Notable Quotes

People think you can throw a Transformer at anything and it just works. Actually, it doesn’t really.

Albert Gu

I think of these state space models as fuzzy compressors, keeping a state in memory that’s always updating as you see new information.

Albert Gu

The way I think about it is: would I want to talk to this thing for more than 30 seconds? If the answer is no, then it’s not solved.

Karan Goel

The future will be more intelligence everywhere, and how do you enable that piece is what we’re excited about.

Karan Goel

In the end, all the systems go away and it’s just one model.

Albert Gu

Questions Answered in This Episode

How might hybrid SSM–Transformer models change the way we design future large language and multimodal models?

What entirely new applications become feasible once high-quality speech and reasoning models can run in real time on consumer devices?

How should we evaluate text-to-speech systems beyond basic intelligibility—what dimensions best capture emotional nuance and long-term engagement?

What are the biggest technical bottlenecks today in training a large multimodal SSM-based model that rivals leading Transformer models?

How might moving more intelligence to the edge affect data privacy, business models, and the economics of AI infrastructure?

Transcript Preview

Sarah Guo

Welcome back to No Priors. We're excited to talk to Karan Goel and Albert Gu, the co-founders of Cartesia, and authors behind such revolutionary models as S4 and Mamba. They're leading a rebellion against the dominant architecture of Transformers, so we're excited to talk to them about that and their company today. Welcome, Karan, Albert.

Albert Gu

Thank you.

Karan Goel

Nice to be here.

Elad Gil

And can you tell us a little bit more about Cartesia, the product, what people can do with it today, some of the use cases?

Karan Goel

Yeah, definitely. We launched Sonic. Sonic is a, uh, really fast text-to-speech engine, so, um, some of the places I think that we've, we've seen people be really excited about, uh, you know, using Sonic is where, like, they wanna do, uh, interactive low-latency, uh, voice generation. So I think the two places we've really kind of, um, had a lot of excitement is, one, in gaming, where, you know, folks are really just, um, interested in, uh, powering, um, you know, characters and roles and NPCs. The dream is to have a game where you have millions of players and they're, like, able to just interact, uh, with these, uh, with these models and- and- and get back responses on the fly, and I think that's sort of, uh, where we've seen a lot of excitement and- and uptake. And then the other end is voice agents, um, and- and being able to power them, and again, low latency there matters. Um, and you know, even with, uh, what we've done with Sonic, we're already kind of shaving off, like, 150 milliseconds off of, uh, you know, what they typically use. And so, you know, the roadmap is let's- let's, uh, get to the next 600 milliseconds and- and try to shave those off in the- over the course of the year. That's been the place where it's been pretty exciting.

Sarah Guo

Love to talk a little bit just about backgrounds and how you ended up starting Cartesia, and maybe you can start with the research journey, and like what kind of problems you were both working on.

Albert Gu

Karan and I both came from the same PhD group, uh, at Stanford. I did a pretty long PhD and I worked on a bunch of problems, uh, but I ended up sort of working on a bunch of problems around, uh, sequence modeling. They came out of kind of these, uh, problems that I started working on actually at DeepMind during an internship, and then I started working on sequence modeling, uh, around the same time, actually, that Transformers got popular. Um, I actually, instead of working on them, I got really interested in these alternate kind of recurrent models, which I thought were really elegant for other reasons, and it kind of felt like fundamental in a sense. And so I- I was just really interested in them and I worked on them for a few years. A couple years ago, me and Karan worked together on this model called S4, which, uh, kind of got popular for showing that some form of recurrent model, called a state-space model, um, was really effective in some applications. And I've been continuing to push on that direction. Recently, uh, pr- proposed a model called Mamba, um, which was, uh, kind of brought these to language modeling, um, and showed really good results there, and so people have been really interested. Um, we've been using them for, um, applications and, uh, other sorts of domains and so on. So yeah, it's really exciting. Um, personally, I also, I just started as a professor at CMU this year. My research lab there is kind of working on the academic side of these questions, while, uh, at Cartesia we're kind of putting them into production.
