No Priors Ep. 70 | With Cartesia Co-Founders Karan Goel & Albert Gu

This week on No Priors, Sarah Guo and Elad Gil sit down with Karan Goel and Albert Gu from Cartesia. Karan and Albert first met as Stanford AI Lab PhDs, where their lab invented State Space Models, or SSMs, a fundamental new primitive for training large-scale foundation models. In 2023, they founded Cartesia to build real-time intelligence for every device. One year later, Cartesia released Sonic, which generates high-quality, lifelike speech with a model latency of 135 ms, the fastest for a model of this class. Sign up for new podcasts every week. Email feedback to show@no-priors.com. Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @krandiash | @_albertgu

Show Notes:
0:00 Introduction
0:28 Use Cases for Cartesia and Sonic
1:32 Karan Goel & Albert Gu’s professional backgrounds
5:06 State Space Models (SSMs) versus Transformer Based Architectures
11:51 Domain Applications for Hybrid Approaches
13:10 Text to Speech and Voice
17:29 Data, Size of Models and Efficiency
20:34 Recent Launch of Text to Speech Product
25:01 Multi-modality & Building Blocks
25:54 What’s Next at Cartesia?
28:28 Latency in Text to Speech
29:30 Choosing Research Problems Based on Aesthetic
31:23 Product Demo
32:48 Cartesia Team & Hiring

Sarah Guo (host) · Albert Gu (guest) · Karan Goel (guest) · Elad Gil (host)
Jun 27, 2024 · 34m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00–0:28

    Introduction

    1. SG

      Welcome back to No Priors. We're excited to talk to Karan Goel and Albert Gu, the co-founders of Cartesia, and authors behind such revolutionary models as S4 and Mamba. They're leading a rebellion against the dominant architecture of Transformers, so we're excited to talk to them about that and their company today. Welcome, Karan, Albert.

    2. AG

      Thank you.

    3. KG

      Nice to be here.

    4. EG

      And can you tell us a little

  2. 0:28–1:32

    Use Cases for Cartesia and Sonic

    1. EG

      bit more about Cartesia, the product, what people can do with it today, some of the use cases?

    2. KG

      Yeah, definitely. We launched Sonic. Sonic is a really fast text-to-speech engine, so the places we've seen people be really excited about using Sonic are where they wanna do interactive, low-latency voice generation. The two places where we've really had a lot of excitement are, one, gaming, where folks are really interested in powering characters and roles and NPCs. The dream is to have a game where you have millions of players and they're able to just interact with these models and get back responses on the fly, and that's where we've seen a lot of excitement and uptake. And the other end is voice agents and being able to power them, and again, low latency matters there. Even with what we've done with Sonic, we're already shaving about 150 milliseconds off of what they typically use. And so the roadmap is: let's get to the next 600 milliseconds and try to shave those off over the course of the year. That's been the place where it's been pretty exciting.

  3. 1:32–5:06

    Karan Goel & Albert Gu’s professional backgrounds

    1. KG

    2. SG

      Love to talk a little bit just about backgrounds and how you ended up starting Cartesia, and maybe you can start with the research journey, and what kind of problems you were both working on.

    3. AG

      Karan and I both came from the same PhD group at Stanford. I did a pretty long PhD and worked on a bunch of problems, but I ended up working on problems around sequence modeling. They came out of problems I actually started working on at DeepMind during an internship, and I started working on sequence modeling around the same time, actually, that Transformers got popular. Instead of working on them, I got really interested in these alternate recurrent models, which I thought were really elegant for other reasons, and they felt fundamental in a sense. So I was just really interested in them and worked on them for a few years. A couple years ago, Karan and I worked together on this model called S4, which got popular for showing that some form of recurrent model, called a state space model, was really effective in some applications. And I've continued pushing in that direction. Recently I proposed a model called Mamba, which brought these to language modeling and showed really good results there, and so people have been really interested. We've been using them for applications in other sorts of domains and so on. So yeah, it's really exciting. Personally, I also just started as a professor at CMU this year. My research lab there is working on the academic side of these questions, while at Cartesia we're putting them into production.

    4. KG

      Yeah, I guess my story was that I grew up in India and came from an engineering family; you know, all my ancestors were engineers. I actually was trying to be a doctor in high school, but my aptitude for biology was very low, so (laughs) I abandoned it and instead became an engineer. So I took a fairly typical path: went and did an IIT, came to grad school, and ended up at Stanford. I actually started out working on reinforcement learning back in 2017, '18, and then once I got into Stanford, I started working with Chris, who was somewhat skeptical about reinforcement learning as a field. And so, uh...

    5. EG

      This is Chris Ré, yeah.

    6. KG

      Yes, Chris Ré, who was our PhD advisor. So I had a very interesting sort of transition period when I started the PhD, 'cause I had no idea what I was working on, so I was just exploring. And then, actually, we ended up doing our first project together too.

    7. AG

      Oh, yeah. Those were good times.

    8. KG

      And actually we knew each other before that, and then started working together on that first project. We would hang out socially and then started working together. The only memory I have of that project is that (laughs) I kept filling up this disk on GCloud and expanding it by one terabyte every time, and then it would keep filling up, and I would insist on only adding a terabyte to it, which he was very mad about for a while, so...

    9. AG

      Well, by the end of the project, it was, like, running a bunch of experiments and the logs would get filled up faster than...

    10. KG

      (laughs) Yeah, than they would...

    11. AG

      Basically, I would be there tracking the experiments and Karan would be there deleting logs in real time so that our runs didn't crash. It was a really interesting way to start working together.

    12. KG

      Yeah, so we started working together then, and eventually I started working with Albert on the S4 push when he was pushing for NeurIPS. I think he was working on it alone and then needed help. I got recruited in to help out 'cause I was just not doing anything for that NeurIPS deadline. So I ended up spending about two or three weeks on that, and we really pushed hard. And that's how I got interested in it, because he had been working on this stuff for a while, and nobody in the lab really knew what he was doing, to be honest. He was just over in the corner, scribbling away, talking to himself, and we didn't really know what was going on. (laughs)

    13. EG

      Could

  4. 5:06–11:51

    State Space Models (SSMs) versus Transformer Based Architectures

    1. EG

      you actually tell us more about SSMs: how are they different from Transformer-based architectures, and what are some of the main areas where people are applying them right now? 'Cause I think it's really interesting as sort of another approach.

    2. AG

      They really got started from work on RNNs, or recurrent neural networks, that I was working on as an intern in 2019. It felt like the right thing to do for sequence modeling, because the basic premise is that if you want to model a sequence of data, you want to process the sequence one element at a time. If you think about the way that you process information, you're taking it in sequentially and encoding it into your representation of the information that you know, right? Then you get new information and you update your belief, or your state, or whatever, with the new information you have. You can say that almost any model is actually doing this. And then there were connections to dynamical systems and other things that I found really interesting mathematically, and it just felt like a fundamental way to do this; it felt right in some ways. There's even some loose inspiration from the brain, where you think of the model as encoding all the information it's seen into a compressed state. It can be kind of fuzzy compression, but that's actually powerful in some ways, 'cause it's a way of stripping out unnecessary information and focusing on the things that matter, encoding those and processing those, and then working with that. We can go more into the technical details, but at a high level, it's just representing this idea of fuzzy compression and fast updating: you're keeping a state in memory that's always updating as you see new information.
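
      To make the recurrence Albert describes concrete, here is a minimal sketch of a discrete linear state space layer in NumPy. The dimensions and random parameters are purely illustrative, not the S4 or Mamba parameterization; the point is the fixed-size state h that is updated once per input step.

        import numpy as np

        # Illustrative dimensions -- not taken from S4 or Mamba.
        d_state, d_input = 16, 1

        rng = np.random.default_rng(0)
        A = rng.normal(scale=0.1, size=(d_state, d_state))  # state transition
        B = rng.normal(size=(d_state, d_input))             # input projection
        C = rng.normal(size=(d_input, d_state))             # output projection

        def ssm_scan(x):
            """Run h_t = A h_{t-1} + B x_t, y_t = C h_t over a sequence.

            The state h is a fixed-size "fuzzy compression" of everything
            seen so far, so the cost per step is constant in sequence length.
            """
            h = np.zeros((d_state, 1))
            ys = []
            for x_t in x:  # one state update per new input
                h = A @ h + B @ x_t.reshape(d_input, 1)
                ys.append((C @ h).ravel())
            return np.stack(ys)

        y = ssm_scan(rng.normal(size=(1000, d_input)))  # 1000-step sequence
        print(y.shape)  # (1000, 1)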

    3. SG

      Is it a better, um, architecture for certain types of data or did you have, uh, you know, applications in mind-

    4. AG

      Yes.

    5. SG

      ... besides the sort of general, um, architectural concept?

    6. AG

      Yeah, so it really can be applied to pretty much everything. Just like Transformers these days are applied to everything, so can these sorts of models. Over the course of research over a few years, we realized that there are different advantages for different types of data, and lots of different variants of these models are better at some types of data than others. The first type of model we worked on was really good at modeling perceptual signals. You can think of text data as a representation that's already been really compressed and tokenized, right?

    7. EG

      Pre-cooked.

    8. AG

      Yeah, f- sure.

    9. SG

      Mm-hmm.

    10. AG

      And i- it's kind of like very dense.

    11. SG

      Mm-hmm.

    12. AG

      Um, like every token in text already has a meaning; it's just dense information. Now if you look at a video or an audio signal, it's highly compressible. For example, if you sample at a really high rate, it's very continuous, and that means it's compressible. And it turns out that different types of models just have different inductive biases, or strengths, at modeling these things. The first types of models we were looking at were really good at modeling these raw waveforms, raw pixels, things like that, but not as good at modeling text, where Transformers are way better. Newer versions of these models, like Mamba, the most recent one, which has been out for a few months, are a lot better at modeling the same types of data as Transformers. Even there, there are subtler trade-offs. One thing we learned is that in general there's no free lunch. People think that you can throw a Transformer at anything and it just works. Actually, it doesn't really. If you try to throw it at the raw pixel level, or the raw sample level in audio waveforms, I think it doesn't work nearly as well. So you have to be a little more deliberate about this. These models really evolve hand in hand with the whole ecosystem of the training pipeline: in the places people use Transformers, the data has already been processed in a way that helps the model. For example, people have been talking a lot about tokenization and how it's both extremely important and also very counterintuitive, unnatural, with its own issues. That's an example of something that developed hand in hand with the Transformer architecture. And when you break away from these assumptions, some of your modeling assumptions no longer hold, and then some of these other models actually work better.

    13. SG

      Do you think of the advantage as, like, a natural fit that translates to quality for certain data types? At least if we think about, say, perceptual data, or richer raw, not pre-cooked, data. And how do you think about efficiency, or the other dimensions of comparing the architectures?

    14. AG

      Yeah, so far we've talked about the inductive bias, or the fit for the data. The other reason we really cared about these is efficiency; maybe we should've led with that, even. People have yelled for a long time about the quadratic scaling of Transformers, and one of the big advantages of these alternatives is linear scaling. It just means that the time it takes to process any new token is basically constant for a recurrent model, but for a Transformer, it scales with the history you've seen. This is obviously a huge advantage when you're really scaling to lots of data, but it's actually a little bit of a no-free-lunch thing itself: the fact that the Transformer takes longer to process things also means there are things it's better at modeling. This is what I was talking about; there are some subtleties in the trade-offs there. One way we've been thinking about it more: as I mentioned in the beginning, we think of these state space models as fuzzy compressors, and maybe the bulk of the processing should be done there. But at the same time, it benefits from having some sort of exact retrieval, or some cache, and that's exactly what a Transformer is. One way to think about the Transformer is that as you're processing all this data, it's just memorizing every single token it's seen, basically. I mean, some kind of representation of it, but it's literally remembering every single thing you've seen, and you're allowed to look back over all of it. That's why it's a lot slower, but that can be useful; it just probably shouldn't be what the bulk of your model is doing. It's kind of the same way that, again, using very rough and probably not accurate analogies, most of the intelligence in a human brain is probably in this statefulness, this real-time processing unit, but it's helpful to augment it with some sort of scratch pad, or lookup and retrieval ability, right? So these ideas are actually quite synergistic, and what people have recently been doing is finding that combining them into hybrid models tends to work really, really well; it seems better than either of them individually. Interestingly, maybe in line with that intuition, people have found that the optimal ratio tends to be mostly SSM layers with a little bit of attention, maybe a ratio of like 10:1. I know of at least probably five groups that have independently verified that this is roughly the optimal ratio. So yeah, I think it makes intuitive sense.
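
      A rough sketch of the asymmetry Albert describes, with made-up dimensions: a recurrent, SSM-style layer touches only its fixed-size state for each new token, while causal attention re-reads a cache that grows with the sequence. Everything here is illustrative, not a production implementation.

        import numpy as np

        rng = np.random.default_rng(0)
        d = 64  # illustrative model width

        # Recurrent/SSM-style step: O(1) work per token (fixed-size state).
        A = rng.normal(scale=0.05, size=(d, d))
        def recurrent_step(state, x_t):
            return np.tanh(A @ state + x_t)  # cost independent of position t

        # Attention-style step: O(t) work per token (reads the whole cache).
        def attention_step(cache, q_t):
            keys = np.stack(cache)               # history grows every step
            scores = keys @ q_t / np.sqrt(d)
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()
            return weights @ keys                # exact lookup over all tokens

        state, cache = np.zeros(d), []
        for t in range(1000):
            x_t = rng.normal(size=d)
            state = recurrent_step(state, x_t)   # memory stays one vector
            cache.append(x_t)
            _ = attention_step(cache, x_t)       # time and memory grow with t

        # A hybrid stack in the ~10:1 spirit mentioned above would interleave
        # mostly recurrent layers with occasional attention layers, e.g.:
        layer_types = ["ssm"] * 10 + ["attention"]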

    15. EG

      Are there

  5. 11:51–13:10

    Domain Applications for Hybrid Approaches

    1. EG

      specific domains that you're seeing initial applications of these hybrid approaches?

    2. AG

      People are mostly using this on text because that's what everyone cares about. I think they've been investigated a bit on some other things. Actually, I just heard from some collaborators today that they applied a Mamba-based model to DNA modeling. They're basically bringing this idea of foundation models to DNA, which is kind of a new idea. Um...

    3. EG

      Sure. I'm just wondering, like, because... DNA is just- do you mean translation of DNA into proteins?

    4. AG

      Um-

    5. EG

      Is it a protein folding model, or is it something-

    6. AG

      No. So what you do is y- you can, like, pre-train a model on long DNA sequences, and then, uh, fine-tune it or use it downstream on things such as even just, like-

    7. EG

      Sure. But DNA itself just encodes proteins and RNA that fold into certain molecular shapes, so that's why I was wondering what the problem set is.

    8. AG

      Yeah. There's a bunch of them, and I'm honestly not familiar with all of the exact details.

    9. EG

      I was curious, yeah. Yeah.

    10. AG

      Um-

    11. EG

      I used to be a biologist. That's why I was wondering (laughs) .

    12. AG

      Okay. Yeah, yeah. That's right.

    13. EG

      Yeah, that's my background.

    14. AG

      Uh, well, just like Karan, um, my- my brain can't handle biology.

    15. KG

      (laughs)

    16. EG

      Okay. Yeah, yeah, yeah. I think, uh, I get it. I- I was just curious, like, where- what the specific application area was. So-

    17. AG

      Oh. There's- this is not protein folding, um, per se-

    18. EG

      Okay.

    19. AG

      ... but probably more like classification tasks. One thing that people are interested in is detecting what downstream effects point mutations in DNA can have, and stuff.

    20. KG

      Mm-hmm. Mm-hmm.

    21. AG

      Um, but I'm not sure the exact-

    22. EG

      Okay. No, that's really cool.

    23. AG

      ... classification setting. Yeah.

    24. EG

      And

  6. 13:10–17:29

    Text to Speech and Voice

    1. EG

      then one of the areas that you folks really started focusing on, uh, from a company perspective is text-to-speech and voice. How did the research lead into that domain?

    2. AG

      We were interested in showing the versatility and the actual use cases of these models. Previously, it was done mostly in an academic context, and at CMU my students are still carrying that fundamental research forward. But we were pretty sure this would just work in a lot of interesting places. We think it will really work on all sorts of data, but audio seemed like a pretty natural fit at first for some of the benefits we talked about, like much faster inference and so on. So streaming settings and so on are a natural fit. We just thought this would be a cool first application. Maybe Karan can say more about...

    3. KG

      Yeah. I mean, there are so many applications that are interesting for these models, 'cause they're so generically useful. Part of the challenge is picking the ones that are most interesting and impactful long term. Obviously DNA is an interesting one, but it doesn't personally motivate us as much, or we would have worked on DNA. To me, the things that are interesting about multimodal data are really the places where SSMs have the most advantage, right? Which is that you have data that's very, very information sparse. Compression is actually an advantage, 'cause you can stream data through the system really fast and process it very quickly, so you update the sort of memory. And being able to handle very large context is something you want by design. The other thing I think is really interesting about audio is that, commercially, there are a lot of very interesting applications where audio is starting to be important. Both on the voice agent side and in being able to interact with your system in a more natural way, like you would with a human, which a lot of people want to be able to do, because there are a lot of places where you don't want to type into your computer; you actually want to talk to it like you would to a human. Even in things like gaming, I think it's really interesting to think about how in the future you will essentially be replacing graphics and rendering with models that are outputting streams of data in real time. So the real-time aspect is really core to signals and sensor data, of which audio and video are both very important. And audio in particular felt like a very natural place for us to start because, well, I did do some work on audio in my PhD, which was helpful, and there are just so many applications in audio emerging right now that require these types of capabilities to exist, so that's particularly exciting. The other piece that's really interesting about SSMs, and that we're very excited about and trying to do, is the fact that the model is so efficient that you can hope to put it on smaller hardware and push inference closer to on-device and the edge. So far the theme in a lot of the models people use has been data center: very big model, lots of compute, lots of GPUs being burned. In the long run, what we would hope is that you push this closer and closer to the edge, using much less compute to do inference, and you're able to basically reproduce a capability that maybe costs a million dollars in the data center today for $10 on a commodity GPU or accelerator at the edge. That will be a very powerful shift, because essentially it means that instead of running batch-oriented workloads in the cloud, you're pushing the processing and the intelligence closer to where the data is being acquired and where the sensors are, and that's kinda what you want.
      Because if you think about security cameras, or really any kind of sensor that's deployed, you really do want to be able to sift through the information very quickly, discard what's not useful, 'cause most of it isn't, and then remember all the stuff that is, and use that to do prediction and generation and understanding problems. So that's the theme, in general, for what we're trying to build: the infrastructure to be able to train these models, make them run fast, and then bring them closer and closer to being edge-oriented rather than cloud-oriented.

    4. EG

      That's super

  7. 17:29–20:34

    Data, Size of Models and Efficiency

    1. EG

      interesting. I guess, um, as part of the recent Apple announcement, they mentioned that a lot of the models that they're running on-device are three billion parameters-

    2. KG

      Yep.

    3. EG

      ... or so in size, and so you really have to focus on smaller models

    4. KG

      Yeah.

    5. EG

      ... or even maybe-

    6. KG

      So in my head, it's always, like, two waves, right? The first wave of companies that came was really about: how do we figure out if we can do something interesting? Nobody knew that scaling to this amount of data and compute would be interesting, so somebody took a bet there and did that, and that was great, 'cause now we have all these great models. The second wave is always about efficiency, and that's been the case in computing as well, where now we have phones that can do so much powerful work. Similarly for models, what you would want is the smartest model ever, but run real cheap, so you can run it repeatedly at scale; you can run it a hundred times where you might run it once today. A lot of that needs to happen, so I think that's what's interesting. So the 3B models are interesting, I think, but they're still small and not very capable. The question is: how do you make the capabilities very, very good, but then have that low footprint on-device, all of that? That's where the technology we've been playing with for the last few years, and building with now, has really huge potential to be the default to run these workloads. Because part of the challenge with Transformers is that if you take a 7B LLM, try to run it on your Mac, and open your profiler, you will notice that the tokens per second go down and the memory goes up. So that's obviously not great, and power and all these things aren't something that people have, like... Even in the data center, people are only talking about it now, in the last year or so. But you'll start to see more of that conversation shift, and that's kind of where we want to be: the future will be more intelligence everywhere, and how you enable that piece is what we're excited about.
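
      The slowdown Karan describes falls out of simple arithmetic on a Transformer's KV cache. Here is a back-of-the-envelope sketch with assumed 7B-class dimensions (32 layers, 32 heads, head size 128, fp16); the numbers are illustrative, not any specific model's published configuration.

        # Back-of-the-envelope KV-cache memory for a decoder-only Transformer.
        # Assumed 7B-class dimensions -- illustrative, not a specific model.
        layers, heads, head_dim, fp16_bytes = 32, 32, 128, 2

        def kv_cache_bytes(seq_len: int) -> int:
            # Two tensors (K and V) per layer, each seq_len x heads x head_dim.
            return 2 * layers * seq_len * heads * head_dim * fp16_bytes

        for seq_len in (1_024, 8_192, 32_768):
            gib = kv_cache_bytes(seq_len) / 2**30
            print(f"{seq_len:>6} tokens -> {gib:5.1f} GiB of KV cache")

        # Memory grows linearly with context, and every new token must read
        # all of it, which is why tokens/sec drops as context gets longer. A
        # recurrent model's state stays the same size regardless of context.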

    7. SG

      Yeah, I think we get really different applications if people start making that assumption, right? As you said, we see it in the data center first, where even as an investor you're betting on applications or, you know, full-stack companies that say, "It costs a great deal to do a thousand calls per query right now, but we're just gonna assume we can make it cheaper-"

    8. KG

      Yep.

    9. SG

      "... and we can focus on quality first." I think when you assume that you can run the model or if you make it possible to run the model on hardware everybody has already, then you just get very different applications, um, continuously and-

    10. KG

      Yeah, I think the-

    11. SG

      ... without quality, without that cost and ongoing compute being a problem.

    12. KG

      Yeah, just the set of things you would want to be able to do will change, because for the same dollars you'll be able to do way more intelligent computation, and I think that's cool. Think about the way you run games on your computer, and the games are very, very rich and interesting: how do you bring models to that place? If you think about on-device, I should be able to have a music model on device, and that should be my personal musician that I can talk to and get to play whatever I want, and I don't need to go to the cloud to do it. These are all things that should be possible, and they just require this type of infrastructure and work.

    13. EG

      Yeah. And Cartesia

  8. 20:34–25:01

    Recent Launch of Text to Speech Product

    1. EG

      recently launched its initial text-to-speech product, and it's really impressive in terms of performance and how fast you've gotten to ship something that's really top-performing. Can you tell us a little bit more about that launch and that product?

    2. KG

      Yeah, I think it was a natural transition for us to start thinking about how to put the technology to work, because there was a lot of pre-work that happened, and Albert continues to do the pre-work for the next set of things. But it's sort of: how do you build an efficient system that will allow you to do, in this case, voice and audio generation? The way we're thinking about it is we're building these fairly general models inside the company that allow us to do fairly generic tasks very efficiently; in this case, it's audio generation, and then being able to condition on things like text transcripts. The philosophy is: audio generation is a problem, it needs to be very efficient, it needs to be very real time, so we need to do the groundwork to build the model stack, and then we need to have great training stacks so that we can actually train a model that's high quality, that people want to use, and that has a really great experience. So when we were putting together the Sonic demo, we wanted to show that the tech we're using really can give you something that's really interesting. And text-to-speech is very interesting to me because people have been building text-to-speech systems for the last, probably, 30, 40 years. There are constantly improvements happening, and yet we're not at ceiling, right? There's still so much more you can do in this area.
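
      As an illustration of what "real time" means operationally for a system like this, here is a hypothetical streaming text-to-speech loop. The generator interface, names, and frame sizes are invented for this sketch; this is not Cartesia's API.

        import time
        from typing import Iterator

        def synthesize_stream(transcript: str) -> Iterator[bytes]:
            """Hypothetical model call: yield audio frames as they are
            generated, conditioned on the transcript, rather than waiting
            for the whole utterance to finish."""
            for _word in transcript.split():
                time.sleep(0.01)      # stand-in for per-chunk model compute
                yield b"\x00" * 640   # stand-in 20 ms frame (16 kHz, 16-bit)

        start = time.perf_counter()
        for i, frame in enumerate(synthesize_stream("It's great to be here.")):
            if i == 0:
                # Time-to-first-audio is the latency that matters for
                # interactive use, not total synthesis time.
                ms = (time.perf_counter() - start) * 1e3
                print(f"first audio after {ms:.0f} ms")
            # ship each frame to the client as soon as it exists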

    3. SG

      Can you actually talk about that? Because I think a lot of people would say like that feels a lot more solved in the last year, just text to audio generation.

    4. KG

      I feel like, um-

    5. SG

      Like what's left between here and the ceiling in terms of thinking about the-

    6. KG

      I think-

    7. SG

      ... application experience?

    8. KG

      Yeah, I think like the way I think about it is like would I want to talk to this thing for more than 30 seconds? And if the answer is no, then it's not solved and if the answer is yes, then it is solved.

    9. SG

      Okay.

    10. KG

      And I think most text to speech systems-

    11. SG

      Karan's audio Turing test, yeah.

    12. KG

      (laughs) Are not that interesting yet.

    13. SG

      Yeah.

    14. KG

      You don't feel as engaged as you do when you're talking to a human. I know there are obviously other reasons you talk to humans, which is, you know... Sorry, I don't want to come across as crazy here, but yeah, there's a society that we live in, so we want to talk to people for that reason too. But I do think the engagement that you have with these systems is not that high. When you're trying to build these things, you really get so into the weeds on, like: oh, it can't say this thing this way, and it's so boring when it says it that way, and how do I control this part of it to say it like this? You know, the intonation-

    15. EG

      Are there specific dimensions that you look at from an eval perspective that you think are most important in terms of how you think about...

    16. KG

      Yeah, evals for generation are generally challenging because they're qualitative, based on the general perception of someone who looks at something and says this is more interesting than that. So there's some dimension to that, but I think for speech, emotion is something that matters a lot, because you want to be able to control the way in which things are said. And the other piece that's really interesting is how speech is used to embody the roles people play in society. Different people speak in different ways because they have different jobs, or work in different areas, or live in different parts of the world, and that's the nuance that I don't think any models really capture well. If you're a nurse, you need to talk in a different way than if you're a lawyer, or a judge, or a venture capitalist; very different forms of speech.

    17. SG

      The highest form of voice, yeah.

    18. KG

      (laughs) So those are all very challenging I would say. So it's not solved-

    19. SG

      Okay.

    20. KG

      ... is my claim.

    21. AG

      There's also an interesting point, which is that even just for your basic evaluations, like can your ASR system recognize these words, or can your TTS system say this word? Even that is actually not quite a local problem, and for a lot of hard things you actually need to have the language understanding in order to process and figure out the right way of pronouncing something, and so on. So to really get even just TTS, or speech-to-speech, perfect, you actually need a model that has more understanding, at least of the language. It's not really an isolated component anymore, so you have to start getting into these multimodal models just to do even one modality well. That's somewhere we were eyeing from the beginning as well, and we're using this as an entry point into building out the stack toward all of that. Hopefully that's all going to help the audio as well, but also start getting other modalities into

  9. 25:01–25:54

    Multi-modality & Building Blocks

    1. AG

      that.

    2. EG

      That's really cool. I mean, you've done so much pioneering key work on the SSM side. How has multi-modality, or speech, really impacted how you thought about the broader problem? Or has it, and it's more just that the generic solutions are the ones that make sense?

    3. AG

      Um, I don't think multi-modality by itself has been a driving motivation for this work, because I think of these state space models I've been working on as basic, generic building blocks that can be used anywhere. So they can certainly be used in multimodal systems to good effect, I think. Different modalities have presented different challenges, which has influenced the design of these. But I always look for the most general-purpose, fundamental kind of building block that can be used everywhere. Multi-modality is more of a different set of challenges in terms of how you apply the building blocks. But you still use the same techniques, and they mostly work.

    4. SG

      Given

  10. 25:54–28:28

    What’s Next at Cartesia?

    1. SG

      that versatility of the model architecture, its generality as a building block, what do you do next at Cartesia? Do you focus on the headroom for Sonic and audio? Do you work on other modalities?

    2. KG

      You know, I'll take that one. (laughs) We're obviously really excited about the Sonic work because it shows the first example of something we're excited about: it's a real-time model, you can run it really, really fast at low latencies, and it captures this idea that you want to generate a signal of some kind. So we're going to continue to improve that piece. There are also things folks want out of speech systems that need to get built that are orthogonal to the technology piece, like being able to support lots of languages and generally providing more controls. That's another axis that's really important for generative models: how do you add more controllability to the system in general, so you can get the desired output you want? So one focus for us is how to put that piece in. A few things we're doing in the short term are really interesting. One is bringing Sonic more on-device. You can run the model real time in the cloud; wouldn't it be cool if you could run it on your MacBook and it ran real time and was just as good? I actually have a demo I can show there that I think is super cool. Over time, what we wanna do is what Albert said: audio benefits from text reasoning, and the ability to converse with these models and actually have them understand what you're saying, beyond just superficial understanding, is very important. So what we wanna enable next is that you should be able to have a conversation with this thing and have it respond to you intelligently, and reason over data and context in order to do that. Sonic I think of as the output piece of that, in some sense: what does the response from that model look like? And then there's the input piece, which is ingesting audio natively into these models and doing that kind of thing.

    3. EG

      Is the intention then to train a large scale, multi-modal, uh, language model on your side as well?

    4. KG

      Yes, but, you know, we have our own set of techniques that we're developing in order to do that effectively. I'll maybe leave that for another (laughs) podcast. But yeah, I think that is the intention at the end of the day: build a great multimodal model, but then make it really, really easy to run on device and really cheap to run, and really focus on the audio piece and making that as good as possible, because that's where the fidelity and the quality you get from SSMs is just very different from what you've been able to see.

    5. EG

      Yeah. That's pretty

  11. 28:28–29:30

    Latency in Text to Speech

    1. EG

      amazing, because it seems like a lot of the limitations right now in terms of different application areas or use cases for text-to-speech is basically the extra latency or round trip associated with pinging a language model in the middle.

    2. KG

      Yeah. Yeah.

    3. EG

      As we go from speech to text, to text and then out. And so if you do have multi-modality, then obviously that shrinks the time on the inference side dramatically, and that has a huge impact in terms of, uh...

    4. KG

      Yeah, I think latency is gonna be a big theme there, because it's obviously quite painful to orchestrate multiple models to do this piece. And then the orchestration itself adds so much overhead. It turns what is, in my mind, something the model should do into an engineering problem that requires so much orchestration and engineering work to, to-

    5. EG

      It feels almost inelegant.

    6. KG

      Yeah. I mean, that, that was a...

    7. EG

      Uh, from a computer science perspective. Yeah.

    8. KG

      Yeah. Maybe that's, you know, some of the-

    9. SG

      (laughs)

    10. KG

      ... thematically, the general bias here, which is, you know, the inelegant things we're trying to chip away at.

    11. SG

      In the end, all the systems go away and it's just one model.

    12. KG

      (laughs) And then me also goes away-

    13. SG

      It appears. Yeah. (laughs)

    14. KG

      ... apparently. (laughs)
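
      A back-of-the-envelope illustration of the overhead discussed in this exchange: a cascaded voice agent pays for every stage plus orchestration, while a single multimodal model collapses the hops. Every number below is an assumption for illustration, not a measured benchmark.

        # Hypothetical latency budget for a cascaded voice pipeline
        # (ASR -> LLM -> TTS) versus one end-to-end model.
        cascade_ms = {
            "ASR, speech to text": 150,
            "LLM, time to first token": 300,
            "TTS, time to first audio": 135,   # Sonic-class model latency
            "orchestration and network hops": 100,
        }
        print("cascade total:", sum(cascade_ms.values()), "ms")

        # One model that ingests and emits audio natively pays roughly its
        # own time-to-first-audio plus a single network round trip.
        end_to_end_ms = 135 + 50
        print("end-to-end sketch:", end_to_end_ms, "ms")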

  12. 29:30–31:23

    Choosing Research Problems Based on Aesthetic

    1. KG

    2. AG

      People ask me how I choose my research problems, and my answer is just: aesthetic. I can't explain it. There's something that I find elegant, or aesthetically pleasing, about things, and to me that's almost the most important thing. That's driven a lot of these things too. Like I said, how did SSMs come about in the first place? I just felt like there was something really nice about it, elegant about it, and I just wanted to keep working on it. And I'm continuing to try to do that: find the simple, nice solutions to hard problems. It's not always possible, so at Cartesia we of course need to solve the actual engineering challenges, and there are always gonna be hairy things. But as much as I can, I'm always striving to make everything as simple and unified as possible.

    3. EG

      That's great. Yeah. Uh, I remember... I can't remember, is it Erdős or somebody who used to talk about certain theorems coming out of, like, God's book-

    4. AG

      Oh, yeah.

    5. EG

      ... or something like that were so elegant and...

    6. AG

      Yeah, I very much adhere to that idea. It's called "Proofs from the Book"-

    7. EG

      Ah, that's right.

    8. AG

      ... is what he would say.

    9. EG

      Yeah, yeah.

    10. AG

      And that's actually the kind of thing that guides a lot of the way I pick and choose problems. What you're referring to is, of course, in pure math, where sometimes you see proofs or ideas that just feel like this is obviously the right way of doing things. It's so elegant, it's so correct. In the machine learning world, things are often not nearly that clean.

    11. EG

      Yeah. (laughs)

    12. AG

      Um, but you can still have the same kind of concept, just, you know, maybe at a different level of abstraction. But sometimes certain approaches or-

    13. AG

      ... something just seems like the right way of doing things. Unfortunately, this can also be subjective. Sometimes I tell people, "This is just the right way of doing it and I can't explain why." But, you know, maybe one of our pillars should be about the book, so I can start-

    14. (laughs)

    15. ... saying this.

  13. 31:23–32:48

    Product Demo

    1. KG

      The, the-

    2. EG

      Let's see the demo.

    3. KG

      Yeah, I'd love to show you. (instrumental music) Cool. Yeah, I have our model running on a standard-issue Mac here. Basically, this is our text-to-speech model, Sonic; on our playground it runs in the cloud, and part of what I talked about earlier was how do you bring this closer to on-device and the edge. I think the first place to start is your laptop, and then hopefully shrink it down and bring it closer and closer to a smaller footprint. So let me try running this.

    4. It's great to be on the No Priors podcast today.

    5. You know, and we have the same feature set that's in the cloud but running on this, and I think that-

    6. EG

      Prove it's real time and not cooked. Say, um, "You don't have to believe in God, but you have to believe in the book." I think that's the Erdős quote.

    7. KG

      Was, was that the quote?

    8. EG

      Yeah.

    9. KG

      So let me grab an interesting voice for this one.

    10. EG

      Erdős is... Where is Erdős from?

    11. KG

      Uh, Hungary.

    12. Hungary.

    13. EG

      Hungary.

    14. KG

      Yeah. I mean, that's a default guess for any mathematician -

    15. EG

      Oh yeah, sure.

    16. KG

      ... from that era.

    17. EG

      You're, he's just assuming.

    18. KG

      (laughs)

    19. All right, I'm gonna press enter.

    20. You don't have to believe in God. You have to believe in the book.

    21. EG

      That's pretty good. (laughs)

    22. KG

      Lengthy. It's pretty, pretty good. So, yeah.

    23. Yeah, it, it works really fast and I think that's, that's part of what I think gets me really excited, which is like, you know, uh, it streams out audio instantly. So yeah.

    24. EG

      I would talk to Erdős. On my laptop. (laughs)

    25. KG

      (laughs)

    26. (laughs) Yeah, let me turn it. It'll be amazing. That, that'd be a great way to get inspired every morning.

    27. Yeah, I know.

    28. Yeah.

    29. EG

      Yeah.

    30. KG

      Turn it on and, uh, yeah.

  14. 32:48–34:08

    Cartesia Team & Hiring

    1. KG

      great.

    2. Your team is now how many people?

    3. We are 15 people now.

    4. EG

      And eight interns.

    5. KG

      (laughs) Sarah always gives me shit for this, but, uh-

    6. EG

      It's a big intern class, yeah.

    7. KG

      Yeah, that's amazing.

    8. Uh, yeah, it's ... Yeah, we have a lot of interns. I really like interns 'cause they're-

    9. They're great.

      Uh, you know, they're excited, they wanna do cool things, so, yeah. And are there specific roles that you're currently hiring for at Cartesia?

      Um, yeah, we are hiring for model roles specifically. We're hiring across the engineering stack, but we really wanna build out our modeling team deeper, so we're always looking for great folks to come to Team SSM and help us build the future.

    12. EG

      The rebellion.

    13. KG

      Yeah, the rebellion. Yeah, we used to actually call it... What did we call it? Overthrowing the empire? (laughs) Yeah, yeah. (laughs) That was the theme during our PhDs. And I would love to continue to have folks inbound us and chat with us if they're excited about this technology and the use cases. There's a lot of exciting work to do, both research and bringing it to people. Yep.

    14. (instrumental music)

    15. EG

      Find us on Twitter @nopriorspod. Subscribe to our YouTube channel if you wanna see our faces. Follow the show on Apple Podcasts, Spotify, or wherever you listen. That way, you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.

Episode duration: 34:08
