CHAPTERS
Meet Cartesia: architecture-first AI, known today for voice
Karan Goel introduces Cartesia as a two-year-old company founded by former Stanford PhD researchers focused on “architecture research.” While many people recognize Cartesia for developer-focused voice AI models, he frames the company’s core identity as building new model approaches and commercializing them through products.
- •Cartesia was founded by Stanford PhD researchers
- •Company roots are in “architecture research,” not just applications
- •Public perception: a voice AI company building models for developers
- •Goal: commercialize new research directions through real products
What “architecture research” means (and why it matters now)
Karan contrasts recent AI progress—dominated by scaling a proven architecture—with the earlier era where new architectures drove step-changes (e.g., transformers). He explains their grad-school motivation: identify what breaks when transformers are scaled “to their logical conclusion,” especially for efficient, human-like intelligence.
- •Architecture research = designing the model structures that learn from data/compute
- •Industry trend: scaling/engineering on one dominant architecture
- •Their question (2019–2020): what limits appear at full scale?
- •Human intelligence inspires goals: efficiency, long context, multimodality, interaction
Why transformers may hit a ceiling for human-like intelligence
Karan argues the transformer paradigm has architectural limitations for long-timescale, multimodal, interactive intelligence. He frames transformers as great for today’s recipes, but not necessarily the endpoint for models that need to consolidate knowledge over long horizons the way humans do.
- •Transformers are powerful but may be the wrong endpoint for human-like systems
- •Target capabilities: very long context, multimodal grounding, interactive use
- •Efficiency matters (“intelligence per watt” comparison to humans)
- •Motivation to pursue fundamentally different architectures
State Space Models (SSMs) as an alternative direction
He introduces state-space models as a recurrent-family approach and credits his co-founder Albert Gu as a pioneer of the modern SSM wave. The chapter positions SSMs as a serious architectural exploration motivated by different tradeoffs than attention-based models.
- •SSMs framed as recurrent models used for deep learning
- •Albert (co-founder) helped pioneer modern SSMs
- •SSMs represent a distinct architectural family vs transformers
- •Belief: AI still has big unsolved problems beyond reusing the same recipes
Intelligence as compression: a core mental model
Karan proposes compression as a primitive underlying intelligence: to reason over huge amounts of information, systems must build abstractions that consolidate meaning across modalities. He uses the example of “cup” as it appears in text, in the physical world, and in spoken audio to illustrate the need for unified representations.
- •Compression enables abstraction and consolidation
- •Models should unify concepts across text/audio/vision/action
- •Humans operate over long timescales by building compressed internal models
- •The goal is not only recall, but interactive, durable understanding
Retrieval vs abstraction: transformers and SSMs as opposite extremes
He characterizes transformers as retrieval-oriented: they keep history in raw form and look it up via queries, keys, and values, enabling high-fidelity access. SSMs instead compress history into a “fuzzier” state, losing some fidelity but gaining abstraction. This highlights a central tradeoff between recall and compression.
- •Transformers: raw-context access, strong retrieval behavior
- •Attention mechanism encourages selective retrieval from stored context
- •SSMs: compress history into internal state, trading fidelity for abstraction
- •Core tension: perfect recall vs abstract, compact world models
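The tradeoff above can be illustrated with a toy sketch. This is not Cartesia's architecture or any production model: the scalar inputs, the function names (`attention_readout`, `ssm_step`, `run`), and the `decay`/`gain` values are all assumptions chosen for illustration. It only shows that an attention-style store grows with the sequence and can recall an exact past value, while a recurrent state stays a fixed size and holds a blended summary.

```python
import math

def attention_readout(history, query):
    """Softmax attention over every stored input (keys and values are the raw inputs)."""
    scores = [query * key for key in history]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    return sum(w * v for w, v in zip(weights, history)) / sum(weights)

def ssm_step(state, x, decay=0.9, gain=0.1):
    """One step of a scalar linear recurrence: h_t = decay * h_{t-1} + gain * x_t."""
    return decay * state + gain * x

def run(inputs):
    history = []  # attention-style memory: one entry per input, grows with length
    state = 0.0   # recurrent memory: a single number regardless of length
    for x in inputs:
        history.append(x)
        state = ssm_step(state, x)
    return history, state

inputs = [1.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0]
history, state = run(inputs)
recalled = attention_readout(history, query=10.0)
print(len(history), recalled, state)
```

With these toy values, the sharp query pulls back the stored 5.0 almost exactly, while the recurrent state ends at roughly 0.38: a decayed mixture of everything seen so far, from which the original 5.0 cannot be read off directly.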
Hybrid architectures and the search for the “ultimate” multimodal model
Karan notes the emergence of hybrid models that combine strengths of transformers and SSM-like components. He argues the real question is the ultimate architecture for long-timescale, multimodal learning—beyond stitching pieces together—especially for models that can learn and act over long horizons.
- •Hybrid models are combining retrieval and compression strengths
- •Many variants exist; implementation/inference details matter
- •Focus: architectures that support long-timescale learning and use
- •End goal: best architecture for multimodal, interactive intelligence
Why Cartesia focused on voice: a grounded wedge into multimodality
He reframes “multimodal” beyond flashy video: audio-to-text is inherently multimodal because it maps continuous signals to discrete symbols. Cartesia chose audio+text as a focused slice of the broader signal-to-symbol problem, allowing concrete productization without “biting off the entire pie.”
- •Multimodality = signal + discrete symbols (not only image/video)
- •Speech recognition/transcription is already a multimodal mapping
- •Audio+text offers a tractable, high-impact research/product wedge
- •Grounded problem selection enables shipping while advancing research
Audio as a transferable recipe for other modalities (video, robotics, more)
Karan argues many domains share a common bottleneck: how to represent continuous signals as tokens/representations for learning and prediction. He claims that solving representation learning well for audio generalizes to video, images, and robotics signals (trajectories, joint angles), because they share the same core structure: learning over signals.
- •Shared multimodal problem: representing signals for model training
- •Audio tokenization parallels video/image tokenization and robotics state encoding
- •Key question: what is the best learned representation of a signal?
- •A strong “audio recipe” can generalize to broader multimodal/embodied settings
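To make the "signal to tokens" question concrete, here is a minimal sketch of one classic fixed scheme, mu-law companding, which turns continuous audio samples into 256 discrete symbols. It is an example of a hand-designed discretization, not a learned representation and not a description of how Cartesia tokenizes audio. The function names, the 440 Hz test tone, and the 8 kHz sample rate are assumptions for illustration.

```python
import math

MU = 255  # companding parameter; gives 256 token values (0..255)

def mu_law_encode(x, mu=MU):
    """Map a sample in [-1, 1] to an integer token in [0, mu]."""
    x = max(-1.0, min(1.0, x))  # clip out-of-range samples
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((compressed + 1) / 2 * mu))

def mu_law_decode(token, mu=MU):
    """Map a token in [0, mu] back to an approximate sample in [-1, 1]."""
    y = 2 * token / mu - 1
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

# A short 440 Hz tone sampled at 8 kHz (assumed values).
samples = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(16)]
tokens = [mu_law_encode(s) for s in samples]
reconstructed = [mu_law_decode(t) for t in tokens]
max_error = max(abs(a - b) for a, b in zip(samples, reconstructed))
print(tokens)
print(max_error)
```

The same pattern (continuous values in, discrete symbols out, some reconstruction error accepted) applies in principle to other signals such as pixel intensities or joint angles, which is the sense in which the representation question is shared across modalities.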
Rethinking tokens: toward end-to-end learned representations
He describes Cartesia’s intersection of architecture + tokenization: moving away from hand-engineered pipelines and toward models that learn hierarchical representations internally. The aim is to reduce dependence on fixed tokenization schemes and let the model discover the right abstractions end-to-end.
- •Tokenization is central to scaling beyond text
- •Critique of hand-engineered signal processing as the interface to models
- •Goal: models learn hierarchies/abstractions internally, end-to-end
- •“Get rid of tokens” as a direction: reduce rigid discretization assumptions
Building for the “average human”: voice agents as long-horizon, interactive work
Karan frames their product ambition as building systems that can do what ordinary people do—high-context, interpersonal, action-oriented tasks over long periods. He uses the example of a call-center agent that must onboard quickly, interact naturally, and improve over years, emphasizing long-term adaptation and reliability.
- •Target intelligence: practical, interactive competence (not Olympiad-style IQ)
- •Voice agents/call-center tasks stress long context and real-world interaction
- •Desired behavior: onboard day one, operate and improve for years
- •Humans excel at systems + people + context; AI architectures should match that
Research vs product reality: one vision, executed with discipline
Karan contrasts academia’s many parallel visions with startups’ constraint: there’s only room for one. He explains the need to preserve exploration inside a focused company direction, and argues that product requirements force clarity about what matters and what doesn’t.
- •Academia: many visions and curiosity-driven exploration
- •Startup: single coherent vision; limited tolerance for random exploration
- •Need to create space for exploration without losing focus
- •Execution focus: “prosecute that point of view to the end of the earth”
Product as a truth serum—and “startup gravity” applies to research companies too
He argues customers impose intellectual honesty: you don’t ship an architecture “just because,” and must prove it improves outcomes. He closes by warning that research startups are still subject to core startup dynamics—distribution, speed, iteration, and discipline—so founders should adopt operational wisdom (like YC’s) even in research-heavy settings.
- •Customers don’t care about novelty; they care about performance
- •Product constraints force rigorous evaluation of research claims
- •Avoid delusion about model quality/impact while keeping ambitious vision
- •Research startups still follow “startup gravity”; YC-style discipline applies
