EVERY SPOKEN WORD
20 min read · 3,883 words
- 0:00 – 0:11
Intro
- Ankit Gupta:
[upbeat music]
- 0:11 – 0:26
Introducing Cartesia
- Ankit Gupta:
I'm Ankit from YC. I'm here at NeurIPS, at our YC and Arc Prize after-party, with Karan Goel, the CEO of Cartesia, and we're really excited to have you here.
- Karan Goel:
Great to be here. Thanks so much for having me.
- Ankit Gupta:
Why don't you start by telling us a little bit about what Cartesia is?
- 0:26 – 1:20
From Architecture Research to Startup
- Karan Goel:
Yeah. Cartesia is a two-year-old company. We're formerly researchers from Stanford, where we did our PhDs and worked on architecture research. We decided it was a very important research direction and we wanted to commercialize it and build products around it, so we started Cartesia. A lot of people know us as a voice AI company that builds models for developers who are building voice AI applications. I can describe it a few different ways, but I think that's the common way we're known.
- Ankit Gupta:
Rewinding a little bit: when you say you're architecture researchers, can you tell us what that means? In a sense, a lot of machine learning for the last decade has been architecture research, but in more recent years you see people working on scaling and more engineering-style problems on a single architecture.
- 1:20 – 2:18
What “Architecture Research” Really Means
- Ankit Gupta:
What do you mean when you say "architecture research"?
- Karan Goel:
Yeah. I think machine learning and AI is just putting data into architectures, on lots of compute, to make really cool models. That's a very short way to put a very complicated endeavor.
- Ankit Gupta:
[laughs]
- Karan Goel:
But I think it's clear in the last ten years, with transformers and self-attention and so on, that we figured out really good recipes for building really powerful models, and that became the entire industry around LLMs. When we were grad students, back in 2019, 2020, we were pretty interested in the biggest challenges that would remain once these models were scaled to their logical conclusion. Part of what inspired us is human intelligence: humans are very efficient, in intelligence per watt, and able to do multimodal things, like interact with people-
- 2:18 – 3:33
Why Transformers Hit a Ceiling
- Ankit Gupta:
Right.
- Karan Goel:
... take lots of actions, think, and be very productive. Even the quote-unquote average human is very, very productive, and that's a feat of intelligence, basically. So we were interested in what kind of architectures would make that type of intelligence possible: very long context, very multimodal, interactive. We felt the transformer paradigm, because of limitations in the actual architecture itself, would not be the right one for building something closer to human intelligence. So we started working on it back in grad school. My co-founder Albert is a pioneer in the field; he invented this area of state-space models, which are recurrent models you can build deep learning models with. That's been an interesting set of research directions we undertook, and it was really a belief in the research. In AI there's often a feeling of "is there more to do?", but we were always very optimistic: AI is a very big field with lots of unsolved problems. We should go out and do new things, not necessarily the same thing in a different setting, which is often what has happened in the last few years.
- 3:33 – 4:21
State Space Models Explained
- Ankit Gupta:
When you think about state-space models, or architectures like that, I think about the history of how models like the transformer came about. If I rewind the clock ten years, people were working primarily on RNNs for language modeling, using LSTMs with attention, and that's eventually what led to the transformer. How should someone think about state-space models with that history in mind? Is there an analogy to what an RNN does versus what a transformer does, where this answers a limitation of each of those two methods?
- Karan Goel:
Yeah. Architecture research is fascinating. Firstly, I think it's super interesting for folks working on machine learning; more people should work on this stuff. There are different ways to think about intelligence, and one thing we think a lot about is this idea of compression-
- 4:21 – 5:47
Intelligence as Compression
- Ankit Gupta:
Mm. Very interesting.
- Karan Goel:
... as being a very fundamental primitive for intelligence. If you imagine trying to build a model that is going to reason over huge amounts of information, you obviously need to abstract and reason over that information in some more compressed form, because you need to consolidate your understanding of the world. What does a cup mean in text? What does a cup mean physically, in the world? What does a cup mean when you say the word out loud? All these different representations, audio, video, text, et cetera, need to be consolidated and put together in some way that's reasonable, and that can also be used interactively. What I mean by that is that humans are able to take all of these representations and then use them to act in the world over a hundred years. I think transformers are fundamentally limited by their inability to model and compress representations in this way; they're context-window machines, very retrieval-oriented machines. I think of the difference between a raw text file and a zipped version of it: you want to have these more abstract representations, and compression is the pressure to build them.
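A minimal sketch of the compression idea he's gesturing at, in Python with NumPy. The linear recurrence, matrices, and sizes are illustrative assumptions, not Cartesia's architecture; the point is only that the state h stays a fixed size no matter how long the input grows, unlike a transformer's ever-growing context.

```python
# Illustrative only: a linear state-space recurrence, the "zipped file"
# side of the analogy. All history is folded into a fixed-size state h.
import numpy as np

rng = np.random.default_rng(0)
d_state, d_in = 16, 4                    # hypothetical dimensions

A = 0.95 * np.eye(d_state)              # state transition (toy choice)
B = rng.normal(size=(d_state, d_in))    # input projection
C = rng.normal(size=(d_in, d_state))    # output projection

def ssm_step(h, x):
    """One step: h_t = A @ h_{t-1} + B @ x_t, then y_t = C @ h_t."""
    h = A @ h + B @ x
    return h, C @ h

h = np.zeros(d_state)
for t in range(10_000):                 # arbitrarily long input stream...
    x = rng.normal(size=d_in)
    h, y = ssm_step(h, x)
# ...but memory is still just d_state numbers: a lossy, compressed summary.
# A transformer would instead keep all 10,000 inputs addressable in its cache.
```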
- Ankit Gupta:
And when you say a transformer is a retrieval machine, are you saying that the specific primitive in a transformer, with keys, values, and queries, acts to enforce a prior that's effective at retrieval, versus what alternative architectures would do?
- 5:47 – 6:41
Retrieval vs. Abstraction
- Karan Goel:
Yeah, exactly. Transformers sit at one extreme. That extreme is: all my historical data, my prompt or context, however you want to frame it, is available to me in raw form, and I can reason over it and answer very specific questions about it as needed. An example of that is that I can recall facts exactly. SSMs have a fuzzier representation of the world: they try to compress all this information, which means you lose fidelity, but at the same time you gain something, because through compression you build abstraction. So that's the tension; they live at different extremes. And in fact, one of the interesting things that's emerged is hybrid models that bring together a lot of the strengths of both of these architectures, and you're seeing a lot of modern open-source models, like Qwen, being built on different forms of hybrids.
- 6:41 – 7:13
Hybrid Architectures and the Future
- Karan Goel:
These are just classes of architectures; there are many variants, and there are subtleties in how they're implemented, inferenced, and so on. But to me that's the conceptual difference, in terms of the extremes they occupy. And the question really is: what is the ultimate architecture? Not the one that takes bits and pieces and puts them together, but the one that will ultimately be best for multimodal data, where you can really learn and then use these models over very long timescales. That's what we're interested in, at least.
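A toy sketch of the two extremes and the hybrid composition pattern. This is not any particular published hybrid; the weights, sizes, and stacking order are illustrative assumptions. The attention layer keeps the raw context addressable, the SSM layer folds it into a fixed state, and a hybrid block interleaves the two.

```python
# Toy contrast of the two extremes, plus the hybrid pattern. Sizes,
# weights, and the stacking order are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 32                                   # toy width, sequence length

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
A, B, C = 0.95 * np.eye(d), rng.normal(size=(d, d)), rng.normal(size=(d, d))

def attention_layer(X):
    """Retrieval extreme: every past position stays addressable in raw form."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = (Q @ K.T) / np.sqrt(d)
    scores = np.where(np.tril(np.ones((T, T))) == 1, scores, -np.inf)  # causal
    return softmax(scores) @ V

def ssm_layer(X):
    """Compression extreme: the past is folded into one fixed-size state."""
    h, out = np.zeros(d), []
    for x in X:
        h = A @ h + B @ x                      # h_t = A h_{t-1} + B x_t
        out.append(C @ h)
    return np.stack(out)

def hybrid_block(X):
    """Interleave the two: compressed abstraction plus exact recall."""
    return attention_layer(ssm_layer(X))

Y = rng.normal(size=(T, d))
for _ in range(4):                             # a small stack of hybrid blocks
    Y = hybrid_block(Y)
```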
- 7:13 – 8:25
Why Cartesia Chose Voice AI
- Ankit Gupta:
And it's interesting you frame it in terms of multimodality, because to your point earlier, many people consider you a voice AI company. An outsider's view might be, "Okay, this is a single modality": it's audio, presumably, or text, maybe with some kind of text-to-speech thing going on.
- Karan Goel:
Yeah.
- Ankit Gupta:
How do you think about what multimodality means for the specific company you're building, and why is that at least an incomplete picture of how to think about it?
- Karan Goel:
Yeah. When the word "multimodal" is said, a lot of people conjure up an image or a video in their heads-
- Ankit Gupta:
[laughs]
- Karan Goel:
... I think. Video models are very sexy, and world models are all the rage. But what we're very interested in is the right approaches to actually build and scale multimodal models. To me, multimodal just means a signal plus some sort of discrete symbol, like text; that's how we think about it. So even a transcription model, which is perhaps considered one of the most boring tasks, is actually multimodal, because you're trying to solve a very interesting problem: taking a signal and mapping it to a discrete set of symbols.
- Ankit Gupta:
Right. As opposed to it being a single-
- Karan Goel:
Exactly.
- Ankit Gupta:
... in a standard LLM pre-training setup-
- Karan Goel:
Right.
- Ankit Gupta:
... where it's from text to text.
- Karan Goel:
Exactly.
- Ankit Gupta:
Here you're going from an audio signal directly to text.
- 8:25 – 9:20
What Multimodality Actually Means
- Karan Goel:
Right. And when you have two modalities, you get many different types of tasks, and different types of representations you need to build. For example, one is the prediction problem: can I generate one modality conditioned on the other, or maybe on a combination of the two? There are also problems in learning alignments between them: what part of the audio corresponds to what text?
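The task space he's describing, sketched as hypothetical Python signatures. None of these are real Cartesia APIs; they just name the kinds of correspondences a two-modality setup creates.

```python
# Hypothetical signatures only, not a real API. Each function names one
# of the audio-text tasks mentioned: generation conditioned on the other
# modality, transcription, alignment, and understanding.
from typing import NamedTuple

Audio = list[float]          # raw waveform samples, for illustration

class Span(NamedTuple):
    start_s: float           # where in the audio...
    end_s: float
    text: str                # ...and which text it corresponds to

def generate_speech(text: str) -> Audio: ...           # predict audio from text
def transcribe(audio: Audio) -> str: ...               # predict text from audio
def align(audio: Audio, text: str) -> list[Span]: ...  # learn correspondences
def describe(audio: Audio) -> str: ...                 # "a person is yelling here"
```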
- Ankit Gupta:
Mm.
- Karan Goel:
That could be transcription. It could be trying to understand the audio in some way: a person is yelling in this fragment of the audio, or this piece of music reminds me of Mozart.
- Ankit Gupta:
Right.
- Karan Goel:
These are all different ways of thinking about correspondences between them. So it's a very rich space, and the question is: how do you do better unsupervised learning on multimodal data? Now, for us as a company, part of the reason we chose to work on this is, A, we wanted to find a grounded set of problems where you don't have to bite off the entire pie. This is a very big space, and there are many difficult problems here. My focus-
- 9:20 – 10:09
Audio as a Recipe for Other Modalities
- Ankit Gupta:
As in, this is why you picked audio, right?
- Karan Goel:
Yes.
- Ankit Gupta:
Okay.
- Karan Goel:
This is why we picked audio, and audio-text specifically: it's a signal-meets-text problem, and there are many other signal-meets-text problems, like video and text, and so on.
- Ankit Gupta:
Mm.
- Karan Goel:
But we believe that if you can solve one of them in the right way, you can solve all of them. That's one of the core beliefs behind how we do our research: if you have the right recipe for building great audio-text models, you will have the right recipe for building great models for robotics, for video, et cetera.
- Ankit Gupta:
Could you elaborate on that a little more?
- Karan Goel:
Yeah.
- Ankit Gupta:
I mean, when you say it's a recipe, it makes sense, but when I think about, say, a robotics model, it's not intuitive to me what recipe comes out of audio that would be directly relevant there. So help me understand: what would that-
- Karan Goel:
Yeah.
- Ankit Gupta:
I mean, maybe some of this is future thinking-
- Karan Goel:
Yeah.
- Ankit Gupta:
... but at least hypothetically, what exactly do you mean by "recipe" here?
- Karan Goel:
Yeah.
- 10:09 – 11:37
Tokens, Representations, and Learning Signals
- Karan Goel:
A lot of multimodal domains have common problems, and here's a crisp, common one: you have a signal; how do you represent that signal in tokens in order to train models over it? This is a very, very standard problem. In audio, for example, you take the audio signal, the wave file, and turn it into a bunch of audio tokens-
- Ankit Gupta:
Right.
- Karan Goel:
... and then train models on those audio tokens. In video you do the same thing; images, same thing. In robotics as well: from joint angles or kinematic trajectories, you construct some sort of discrete representation and then train models on it, especially if you're trying to predict these things. So it's very much the same set of problems, and the core question is how you actually build the best representation of a signal. That goes to the heart of architectures, but also of this idea of tokens and tokenization; that's the intersection we work on. We're trying to solve it from the lens of: you need new architectures, and you also need to rethink tokenization completely. You need to think about audio not as somebody hand-engineering 16-kilohertz signals into 50 hertz, but as learning over raw signals, with hierarchies of abstraction built inside the model directly, end-to-end. A simple way to say it is that we want to get rid of tokens-
- Ankit Gupta:
Right.
- Karan Goel:
... and have the model learn this representation internally.
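To make the "16 kilohertz into 50 hertz" remark concrete, here is a sketch of the kind of hand-engineered pipeline being criticized: frame the waveform at a fixed hop and snap each frame to its nearest codebook entry. The rates are typical of real codecs, but the random codebook is a stand-in for a trained vector quantizer, and none of this is Cartesia's pipeline.

```python
# A sketch of conventional audio tokenization (the thing to get rid of):
# 16,000 samples/sec in, 50 discrete tokens/sec out.
import numpy as np

rng = np.random.default_rng(0)
SAMPLE_RATE = 16_000                       # 16 kHz waveform
TOKEN_RATE = 50                            # 50 tokens per second
HOP = SAMPLE_RATE // TOKEN_RATE            # 320 samples per token

codebook = rng.normal(size=(256, HOP))     # hypothetical 256-entry codebook

def tokenize(wave):
    """Map raw samples to discrete token ids; lossy by construction."""
    n = len(wave) // HOP
    frames = wave[: n * HOP].reshape(n, HOP)
    # nearest-neighbor lookup against the codebook (vector quantization)
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

one_second = rng.normal(size=SAMPLE_RATE)  # stand-in for real audio
tokens = tokenize(one_second)              # -> array of 50 token ids
# A sequence model would then be trained on `tokens`; the end-to-end view
# argues the model should learn this hierarchy internally instead.
```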
- Ankit Gupta:
And that part feels transferable.
- 11:37 – 12:29
Learning Representations End-to-End
- Karan Goel:
And that part is entirely transferable to any other signal in the world, especially if it's done the right way. It would be like saying transformers work on different types of tokens: these are things that will then work out of the box. So that's why we wanted a focused point of view on one modality. The other piece was that we felt the impact of all this architecture research, as a startup, ultimately needs to land on some real problems and products. As I said, we were very interested in, quote-unquote, "building the average human", building models-
- Ankit Gupta:
Okay.
- Karan Goel:
... for problems that are high-interaction, large-context, and involve lots of action-taking. So that's why we said, "Okay, let's take the call center agent and think about that," because building an AI call center agent with a model is actually fairly complicated. You have to onboard that person on day one, and then you want-
- 12:29 – 13:54
Building for the “Average Human”
- Karan Goel:
... them to do this job for the next ten years, say, and improve, and be able to interact with different customers and help them with their queries and so on. So one way we think about the company is that we're first trying to solve it for this specific type of person, and then, if we can do it the right way, we can take that and use it to do a lot of the things that quote-unquote "average humans" do. Which is not average at all, to be very clear: what people are able to do is actually extraordinary. But that's where we think intelligence can head. And this is very different from the notion of high-IQ intelligence, which is very focused on math and physics. I don't have a gold medal in any Olympiad.
- Ankit Gupta:
You only have a PhD from Stanford.
- Karan Goel:
I only have a PhD, and that's a much lower feat, I would say, than getting a gold medal in one of those. Albert, my co-founder, actually has an IOI gold medal.
- Ankit Gupta:
Okay. [laughs]
- Karan Goel:
So I always think for him. But the rest of us try to do our best and do some fairly productive work, I think. And it's not because we can solve math problems every day; it's because we're good at dealing with people and systems, and at taking lots of context and using it to solve real problems. I think that's what AI is not, and that's because of the architectures, the way we train these models, and the way they're built, with multimodality being one of the key things that isn't working yet, and not for any other reason. That's our bet, basically.
- Ankit Gupta:
Interesting.
- 13:54 – 15:18
Research vs. Product Reality
- Ankit Gupta:
And maybe as a final piece-
- Karan Goel:
Yeah.
- Ankit Gupta:
... to talk about here: you were a researcher at a university, and now you run a research-driven company. You started to allude to why you have a product focus, as a way of making this become real. But I'm curious: for folks who might want to start research-focused companies, or just contrasting your time in research with running a research-based company, what lessons have you learned about the differences between them, and about how the product and research sides can coexist?
- Karan Goel:
Yeah. It's been very different from doing a PhD and doing research, for sure. Part of it is that in research, even in a PhD lab, I was in Chris Ré's lab at Stanford, which is an amazing place. During the time I was there, we did FlashAttention, SSMs were happening; there was just amazing work happening. But in grad school and in academic labs, you have many different people with many different visions for what they think they should accomplish with research, and that's why it works: there's a lot of curiosity about doing new things, and those things are all different. In a company, it's almost the opposite, where there's really only room for one vision.
- Ankit Gupta:
Right.
- Karan Goel:
There's not really room-
- Ankit Gupta:
Explore versus exploit.
- Karan Goel:
Yeah. And within that, you have to find the room for exploration.
- 15:18 – 16:28
One Vision, Ruthlessly Executed
- Karan Goel:
And you have to build a culture where people don't feel they're being forced to do the same old thing, where it's not just "you do this work and that's all you do." But at the same time, there isn't room for random exploration either. That's the tension that's different between a research team in a startup and a research lab in academia. And I think it's right that that is the difference, because we believe we should be extremely focused on only one point of view and prosecute it to the ends of the earth. We happen to have very high conviction in that point of view, we've worked on it for six years, and we would work on it for another twenty if we can. But to me, product is something that drives discipline and truth into the work you do. Take an example: I think we would never ship a product or a model with an SSM in it just because.
- Ankit Gupta:
Right.
- Karan Goel:
Because we have customers, and the customers that we have expect the best version of the product.
- Ankit Gupta:
Right. They don't care about the architecture.
- Karan Goel:
They don't care about the architecture.
- 16:28 – 17:25
Product as a Truth Serum for Research
- Karan Goel:
And that actually drives a lot of honesty into the research, because it means you're going to run experiments to prove that your product or your approach is better for the end user, not just because you want to publish something or put out something new or interesting. That's a level of intellectual honesty that product brings which I don't think is often seen in research, for better or for worse.
- Ankit Gupta:
Yeah, fair.
- Karan Goel:
I don't think that's because researchers aren't honest about their work. It's that everybody wants to do things that are new, at the end of the day, and you want the right incentive to say, "Actually, you don't need something new," when it's not necessary. That's very important, actually. So we want to be delusional about how we can change the world, but we don't want to be delusional about how well our models actually do, or about how much impact our architecture will actually have. That is not a place where delusion is good; I-
- 17:25 – 18:13
Startup Gravity Applies to Research Too
- Karan Goel:
... think precision is good there. So that's how we think about it, and that's why product is very important; I have other reasons too. I also think a lot of people undervalue the lessons that YC teaches founders. A lot of people think research companies aren't governed by the laws of startup gravity. I happen not to believe that: all companies should be governed by the laws of startup gravity, so the YC wisdom should actually be used by all research founders. I try to do that as much as possible, and break the rules where necessary. Yeah.
- Ankit Gupta:
[laughs] I appreciate you saying that. Well, thanks so much for joining us. This was a lot of fun.
- Karan Goel:
Yeah. Thanks so much. Thanks for having me.
[outro music]
Episode duration: 18:14