At a glance
WHAT IT’S REALLY ABOUT
Cartesia’s bet: beyond transformers via compression-driven multimodal architectures for voice
- Cartesia was founded by Stanford PhD researchers to commercialize “architecture research,” not just scale existing transformer recipes.
- Goel argues transformers are strong at retrieval over raw context but hit ceilings on long-context efficiency and abstraction, motivating compression-oriented alternatives like state-space models (SSMs).
- SSMs trade some fidelity for compressed representations that enable abstraction; the emerging direction is hybrid architectures that combine attention's retrieval strengths with SSMs' compression strengths.
- Cartesia chose audio+text because it is inherently multimodal (signal-to-symbol) and provides a grounded, solvable path to general multimodal learning recipes.
- Running a research-driven startup requires a single, ruthlessly executed vision, with product constraints acting as a “truth serum” that enforces empirical honesty over novelty-for-novelty’s-sake.
IDEAS WORTH REMEMBERING
5 ideas
Architecture research is about changing the core learning recipe, not just scaling it.
Goel frames the last decade as converging on strong transformer recipes, but says key remaining challenges (efficiency, long context, multimodal interaction) require new architectural primitives.
Transformers skew toward retrieval over raw context, which limits abstraction under long horizons.
He characterizes attention as making past information available “in raw form,” which is powerful for recalling specifics but weak at building compressed, reusable world models.
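To make the "raw form" point concrete, here is a minimal single-head attention decoding step (a generic NumPy sketch, not Cartesia's or any specific model's code): every past key and value is kept verbatim in a cache, so memory grows with context length and each step re-scans the whole history.

```python
import numpy as np

def attention_step(q, K_cache, V_cache):
    """One single-head attention decoding step.

    The KV cache keeps every past token's key/value in raw form, so
    memory grows linearly with context and recall of specifics is exact.
    """
    scores = K_cache @ q / np.sqrt(q.shape[-1])   # compare query against ALL past keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache                      # weighted lookup over raw stored values

# Toy usage: the cache grows by one row per generated token.
d = 8
K_cache, V_cache = np.zeros((0, d)), np.zeros((0, d))
for _ in range(16):
    k, v, q = np.random.randn(d), np.random.randn(d), np.random.randn(d)
    K_cache, V_cache = np.vstack([K_cache, k]), np.vstack([V_cache, v])
    out = attention_step(q, K_cache, V_cache)     # cost and memory scale with history
```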
Compression is positioned as a fundamental primitive for intelligence.
The discussion ties human-like intelligence to consolidating many forms of experience (text, audio, physical meaning) into abstractions that remain useful across time and tasks.
SSMs emphasize compressed internal state, trading fidelity for abstraction.
SSMs are described as living on the opposite extreme from transformers: they “lose fidelity” by compressing history, but gain generalized representations that can support longer-range reasoning.
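For contrast, a minimal diagonal linear state-space recurrence (a generic sketch of the SSM idea, not Cartesia's S4/Mamba implementation): all history is folded into a fixed-size state, so memory stays constant however long the sequence gets, at the cost of exact recall.

```python
import numpy as np

def ssm_scan(u, A, B, C):
    """Discrete diagonal linear SSM: x_t = A * x_{t-1} + B * u_t,  y_t = C . x_t.

    The state x has a fixed size regardless of sequence length, so the
    entire history is compressed into it: constant memory, O(1) per step.
    """
    x = np.zeros_like(A)
    ys = []
    for u_t in u:                  # u is a 1-D sequence of scalar inputs
        x = A * x + B * u_t        # fold the new input into the compressed state
        ys.append(C @ x)           # read out from the state, never from raw history
    return np.array(ys)

# Toy usage: the state stays size 4 whether the sequence has 10 steps or 10 million.
n = 4
A = np.full(n, 0.9)                # diagonal (elementwise) state transition
B, C = np.ones(n), np.ones(n) / n
y = ssm_scan(np.random.randn(1000), A, B, C)
```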
The near-term frontier is hybrids that blend retrieval and compression.
Goel points to a trend of architectures combining strengths of both paradigms, suggesting neither pure transformer nor pure SSM is the final answer for long-timescale multimodal agents.
WORDS WORTH SAVING
5 quotes
I think machine learning and AI is just putting data into architectures to make really cool models on lots of compute, right?
— Karan Goel
So I think that transformers are fundamentally limited by their inability to model and compress representations in this way, and they're sort of context window machines, right? They're very retrieval-oriented machines, right?
— Karan Goel
SSMs have a fuzzier representation of the world, so they try to compress all this information, which means you lose fidelity, but at the same time you gain something, which is by compression you build abstraction.
— Karan Goel
So a simple way to say it is we wanna get rid of tokens... and have the model learn this representation internally.
— Karan Goel
I think in a company, it's almost the opposite, where there's only really room for one vision.
— Karan Goel