No Priors

No Priors Ep. 67 | With Voyage AI Co-Founder and CEO

After Tengyu Ma spent years at Stanford researching AI optimization, embedding models, and transformers, he took a break from academia to start Voyage AI, which gives enterprise customers the most accurate retrieval possible through the most useful foundational data. Tengyu joins Sarah on this week's episode of No Priors to discuss why RAG systems are winning as the dominant architecture in enterprise and the evolution of foundational data that has allowed RAG to flourish. And while fine-tuning is still in the conversation, Tengyu argues that RAG will continue to evolve as the cheapest, quickest, and most accurate system for data retrieval. They also discuss methods for growing context windows and managing latency budgets, how Tengyu's research has informed his work at Voyage, and the role academia should play as AI grows as an industry.

Show Notes:
0:00 Introduction
1:59 Key points of Tengyu's research
4:28 Academia compared to industry
6:46 Voyage AI overview
9:44 Enterprise RAG use cases
15:23 LLM long-term memory and token limitations
18:03 Agent chaining and data management
22:01 Improving enterprise RAG
25:44 Latency budgets
27:48 Advice for building RAG systems
31:06 Learnings as an AI founder
32:55 The role of academia in AI

Sarah Guo (host) · Tengyu Ma (guest)
Jun 6, 2024 · 36m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00–1:59

    Introduction

    1. SG

      Welcome to No Priors. Today, we're talking to Tengyu Ma, assistant professor of computer science at Stanford, and the co-founder and CEO of Voyage. Voyage trains state-of-the-art components for next-generation retrieval systems, including embedding models and re-rankers. We're really excited to talk about his research and the RAG debate today. Welcome, Tengyu.

    2. TM

      Yeah. Thanks so much. Thanks for having me here. Looking forward to the debate.

    3. SG

      Yeah. Why don't we start with just a little bit of an overview of your research agenda to date? Because I think, uniquely, it covers a broad range of fields within and around deep learning, from theory to RL to embeddings and optimizers. So can you talk a little bit about how you picked the directions you have?

    4. TM

      Yeah. I think most of the papers I wrote have some theoretical thinking in them; I guess that's the commonality. Besides that, I've worked on quite a few topics, as you mentioned, ranging from theoretical understanding and mathematical proofs for deep learning systems all the way to practical large language models and deep reinforcement learning. These days, what we're working on centers more on the efficiency of training large language models and on improving reasoning for large language models. My vision is that in the future, efficiency is very important, because we are running out of data and compute, so we have to use the data and the compute much better. And reasoning seems to be a pretty important direction, and also, in some sense, a risky direction, in the sense that we don't know exactly how fast we can solve those challenging reasoning questions yet.

    5. SG

      Mm-hmm. Can you mention a few of the key papers or work

  2. 1:59–4:28

    Key points of Tengyu’s research

    1. SG

      that you or students in your lab have done, just so our listeners can look them up?

    2. TM

      In the very early days, I worked on matrix completion, optimization for matrix completion. That's like 10 years ago. Then I moved on to embedding models, like sentence embeddings, vector embeddings. One of the papers we wrote is actually a simple paper where we average the word embeddings to get sentence embeddings, and then we did some transformations using PCA to make the performance much better. That was even before transformers came out. Then I moved on to transformers, large language models, and contrastive learning, which is the new way of training embedding models. That direction started with some of the papers on using contrastive learning for images, and we worked on improving those and understanding why contrastive learning can work. And recently, we've worked on optimizers for large language models. For example, one of the papers we wrote last year was Sophia, where we found a new optimizer that can improve training efficiency by 2X for pre-training.
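      For listeners who want to look it up, the averaging-plus-PCA recipe described above matches the "simple but tough-to-beat baseline" for sentence embeddings. Here is a minimal sketch of that idea; the smooth-inverse-frequency weighting follows the paper's spirit, but the variable names and the weighting constant are illustrative:

      ```python
      import numpy as np

      def sentence_embeddings(sentences, word_vecs, word_freq, a=1e-3):
          """Weighted average of word vectors, then remove the first principal
          component: the PCA-style correction that makes the simple average
          perform much better."""
          embs = []
          for tokens in sentences:                         # each sentence: list of words
              w = np.array([a / (a + word_freq.get(t, 0.0)) for t in tokens])
              v = np.array([word_vecs[t] for t in tokens])
              embs.append(w @ v / w.sum())                 # weighted average of word vectors
          X = np.vstack(embs)
          u = np.linalg.svd(X, full_matrices=False)[2][0]  # first principal direction
          return X - np.outer(X @ u, u)                    # project it out
      ```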

    3. SG

      This is great. Adam is very old at this point.

    4. TM

      Yeah, it's 10 years old now. I think that's the interesting part about it. So optimizers, you know, people have tried so many times in the last 10 years. There were so many papers published which had improvements over Adam in various cases, but so far, Adam is still the default algorithm for training large language models, and that's why we thought it was time to really go after it. We spent a lot of time on this. I think I started probably around 2018, 2019, and I asked a few students to work on this (laughs), and finally we had one paper out after a few years, after a few failed projects and failed ideas. And recently, one of our friends at Facebook actually used this in their large-scale multimodal training. I don't know exactly how many parameters there are, but I assume it's more than 100 billion parameters. They found that on that scale, there is a 1.6X improvement in the efficiency of the training. So that's like $10 million versus $16 million.
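      As a rough illustration of what Sophia does differently from Adam, here is a sketch of its clipped, diagonally preconditioned update. This is a simplification for intuition only: hyperparameter values are illustrative, and the diagonal Hessian estimate h is assumed to be maintained elsewhere (the paper re-estimates it every few steps):

      ```python
      import numpy as np

      def sophia_step(theta, m, h, grad,
                      lr=1e-4, beta1=0.96, gamma=0.01, rho=1.0, eps=1e-12):
          """One Sophia-style update: momentum on the gradient, precondition by
          a diagonal Hessian estimate h, and clip elementwise so coordinates
          with tiny curvature cannot take arbitrarily large steps."""
          m = beta1 * m + (1 - beta1) * grad                       # EMA of gradients
          update = np.clip(m / np.maximum(gamma * h, eps), -rho, rho)
          return theta - lr * update, m
      ```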

    5. SG

      That's super exciting. Yeah. I think Sophia has an opportunity to be really impactful.

  3. 4:28–6:46

    Academia compared to industry

    1. SG

      You started a company last year, taking leave from Stanford. Given your work has been theoretical, but with practical applications, what drove you to do that?

    2. TM

      I think I came to Stanford partly because there's a very strong industry connection here compared to some other universities, and entrepreneurship is probably just part of my (laughs) career plan, anyway. In terms of the timing, I felt this was right in the sense that the technologies are more and more mature, so it seems commercialization is right for this moment. For example, one story I have: I looked up some of my slide decks for my lectures at Stanford CS 229 seven years ago, when I started to teach at Stanford. At that point, we had a lecture with Chris Ré on applied machine learning, how you apply machine learning in industry, and there were seven steps. The first step is you define your problem. The second step is you collect your data, and then you choose the loss function, you train it, you iterate, so on and so forth. So it was pretty complicated at that point. Now foundation models have risen to power, and in the new foundation model era, the only thing you have to do is this: someone tunes a foundation model for you, and then you tune a prompt and you add retrieval-augmented generation on top of it. That's pretty much it. So applying machine learning and AI in an industry environment is much, much easier than seven years ago, and that's why I felt this is probably the right time to commercialize many of these technologies.

    3. SG

      Yeah. This is actually a core premise even for the investing fund that I started at Conviction: somebody's doing the bulk of the work for you in a more general way, so the application of AI in industry is just much, much cheaper, because you only do the last few steps. Or a different set, but the last few steps, in essence. So maybe you can talk about, you

  4. 6:46–9:44

    Voyage AI overview

    1. SG

      know, just given your wide range of research, the problem you focused on with Voyage that you saw with customers.

    2. TM

      Yeah. With Voyage, we are mostly building these two components, re-rankers and embeddings, for improving the quality of the retrieval or search system. The reason we focus on this is that we talked to so many customers and found that right now, for implementing RAG, the bottleneck is not that it's hard to implement; you can just connect the components and have your RAG system ready very quickly. The bottleneck seems to be the quality of the response. And the quality of the response is heavily affected, almost bottlenecked, by the quality of the retrieval part. If the large language model sees very relevant documents, then it can synthesize very good answers. Even LLaMA-70B can do that very well.

    3. SG

      Can you just give a general intuition for what a RAG system is and some of its applications?

    4. TM

      Yeah. Just a little bit of background. In retrieval-augmented generation, the idea is that there's a retrieval step and there's a generation step. The main point is that if you just use a large language model as a black box, as is, then the large language model wouldn't know anything about the proprietary information inside a company, and it doesn't know enough context about the use cases. The retrieval-augmented generation stack is about first retrieving some knowledge, for example from inside a company, and then giving that knowledge to the large language model so that it can generate or synthesize a good answer without hallucination. This has been found to be very useful for reducing the hallucination rate. So there are two steps. The first step is to retrieve some relevant information given a query, and then this relevant information is given to the large language model. The retrieval step is important because once the large language model sees the relevant information, it can reduce the hallucination rate dramatically; it uses the relevant information as an anchor to refine the answers, in some sense. What we are doing is improving the quality of the retrieval, the relevancy or accuracy of the retrieved documents and information. And the way this works is that there are two steps. The first step is that you vectorize all of your documents, all of your knowledge base. You turn the documents into vectors, you turn the videos into vectors.

    5. SG

      You turn your code into vectors.

    6. TM

      Code into vectors, everything into vectors. The vectors are the representations of each piece of knowledge or document, and they serve as the indices. Then you put these vectors into a vector database, and you search for the relevant information using the vectors as indices.
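      To make the two stages concrete, here is a toy end-to-end sketch. The embedding model is faked with a hashed bag-of-words so the example runs anywhere; in a real system that function would call an actual embedding model API, and all names here are illustrative:

      ```python
      import numpy as np

      def embed(texts, dim=256):
          """Toy stand-in for an embedding model (hashed bag-of-words,
          L2-normalized). In practice, this would call a real embedding API."""
          out = np.zeros((len(texts), dim))
          for i, t in enumerate(texts):
              for tok in t.lower().split():
                  out[i, hash(tok) % dim] += 1.0
          return out / np.maximum(np.linalg.norm(out, axis=1, keepdims=True), 1e-9)

      class TinyVectorStore:
          def __init__(self, docs):
              self.docs = docs
              self.vecs = embed(docs)                  # step 1: vectorize the knowledge base
          def search(self, query, k=2):
              scores = self.vecs @ embed([query])[0]   # step 2: compare query vector to doc vectors
              return [self.docs[i] for i in np.argsort(-scores)[:k]]

      docs = ["Refunds are issued within 14 days of purchase.",
              "The API rate limit is 100 requests per second."]
      context = TinyVectorStore(docs).search("what is the refund policy?")
      # `context` is what gets placed into the LLM prompt for the generation step.
      ```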

    7. SG

      Where are you seeing RAG applications today? What are customers building, or what are the most common systems?

  5. 9:44–15:23

    Enterprise RAG use cases

    1. TM

      Yeah. We have a lot of users, and they are all over the place. We even have a customer that is a chemistry company building a RAG system to understand their chemistry documents and product descriptions. It's almost everywhere: finance, legal, code retrieval, code generation, so on and so forth. I think it can be applied to almost any case. And even for individual users, where you have a lot of personal information and you want a RAG system on your phone so you can access your past information in a much easier way. For example, we've all seen that when you search your documents on your laptop, it's actually pretty hard; you have to use the exact file name. It would be much easier if the search could be semantic.

    2. SG

      Mm-hmm. RAG is a relatively new architecture. I think your average enterprise technology leader had not heard the term before the last year or so, and it became popularized in research over the last few years. But there is already a debate, in terms of opinions from people at different large labs and in academia, about whether or not you need a RAG architecture to work on proprietary data. Just to describe some of the alternative views, I think there are two alternative points of view given. One is a sort of agent-chaining architecture, where you input your data and knowledge (chemistry, code, law, finance, whatever documents) into a series of LLMs that just operate on it with instructions, for example to summarize or categorize it. Or you simply feed everything into LLMs with infinite context or actively managed context, versus explicitly vectorizing anything. So I would love to get your reaction to that as an alternative to RAG.

    3. TM

      Actually, there was also a debate last year about RAG versus fine-tuning, and I think that debate is kind of reaching a consensus now (laughs). It sounds like RAG is much easier than fine-tuning, and fine-tuning in many cases doesn't work, because you need a lot of data to see results, and there are still hallucinations even after fine-tuning. Now, as you said, the debate becomes RAG versus agent chaining or long context. Maybe let's talk about long context first. I think there are probably two answers to this, from different angles, because long context right now is not practical yet, right? So we have to anticipate what long-context transformers can do and then have the debate (laughs) at a future time, in some sense. In the near term, the long-context transformer approach, where you just put all the proprietary data, one billion tokens, into the context of the transformer, will be very, very expensive. If you use the prices right now, it's going to be just impossible (laughs) to do it. It's probably five or ten orders of magnitude of difference, depending on how many documents you have in the context. Of course, you can bring the cost down. For example, one approach is to cache the activations of all the internal operations over the documents you put in the context. That will bring the cost down a lot, but I think if you do the calculation, it's theoretically still much more expensive than RAG. So that's the more practical answer: in terms of cost, it's going to be much more expensive than RAG, because you have to save all of these activations, these intermediate computations, most likely in GPU memory, or maybe in CPU memory, for the whole one-billion-token context.

    4. SG

      Mm-hmm.

    5. TM

      You know, you may argue that over time everything will become cheaper and cheaper. But RAG will be cheaper as well, right? Many of the technologies under RAG are neural-network-based, and the GPUs will become cheaper and the neural networks will become smaller. So my prediction is that RAG will be much cheaper than long context going forward. Another way to think about this, maybe from first principles: my analogy for long context is that the context is the short-term memory, in some sense, and RAG is more like long-term memory.

    6. SG

      Mm-hmm.

    7. TM

      So the question is, for example, when you answer any question, why would you go through the entire library every time (laughs), right? Like, put the entire library into your short-term memory to answer a single question? It sounds like the right approach should be that for every single question, you retrieve a subset of the information and use that to answer the question. That seems to be the most efficient way to do it. There should be some kind of hierarchy, in some sense, in how we solve the problem, so that we get the best efficiency. Even in computer architecture, in hardware, you have different levels of caching, right? You have disk, you have CPU cache, and so forth. So in that sense, I feel like the more hierarchical, two-level kind of system like RAG is more cost-efficient.

    8. SG

      Yeah.

  6. 15:23–18:03

    LLM long-term memory and token limitations

    1. SG

      I mean, the analogy certainly makes sense. I think there is another thread of discussion about what long-term memory for LLMs looks like, where it is something managed by the LLM itself. But I do not think that is a well-answered question, and RAG may just be a part of that answer.

    2. TM

      The embedding model and the re-ranker are, in some sense, the large language models that are managing the long-term memory. Of course, there might be variants and other ways to manage the long-term memory, but I think they will be somewhat similar. The technology always evolves gradually, right? So maybe two years later, Voyage or other companies will have a new version of long-term memory which is based on embedding models but also extends them in some way. That's entirely possible.

    3. SG

      Yeah. I do think it's useful to contextualize, for people who are not working with data sources for LLMs at scale every day, what the token limitations are. We've gone from a few thousand tokens to something like Gemini 1.5 Pro with a context window of a million tokens. If you think of that in word count, that's maybe five books, or 25,000 to 30,000 lines of code, and obviously a limited amount of video and audio. I think the ability to make reasoning decisions on more than that amount of data is obviously going to be needed. And hence, the questions to me are really: does efficiency matter, both from a cost perspective and a latency perspective? How much can you push the context window? And does hallucination management matter? So I think there are lots of arguments for RAG being very persistent here.

    4. TM

      Yeah, yeah, exactly. And just to add a little bit on that: one million tokens is five books, right? But many companies have 100 million tokens. That's a 100X difference, and 100X in terms of cost is a big difference. That could be $100K versus $10 million. $10 million is unacceptable, but $100K sounds okay. I think that's probably what's going to happen. At least for many companies, if they have 100 million tokens right now, I don't think they can use long-context transformers at all, because it's way too expensive.
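      A back-of-envelope version of that comparison, with an assumed, purely illustrative price per million input tokens:

      ```python
      price_per_mtok = 5.00           # assumed $ per 1M input tokens; illustrative only
      corpus_tokens = 100_000_000     # "many companies have 100 million tokens"
      queries_per_month = 10_000
      retrieved_tokens = 2_000        # roughly what a RAG system feeds the LLM per query

      long_context_cost = queries_per_month * corpus_tokens / 1e6 * price_per_mtok
      rag_cost = queries_per_month * retrieved_tokens / 1e6 * price_per_mtok
      print(f"long context: ${long_context_cost:,.0f}/mo   RAG: ${rag_cost:,.0f}/mo")
      # long context: $5,000,000/mo   RAG: $100/mo -- the orders-of-magnitude gap he describes.
      ```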

    5. SG

      Right. And the simplest example for me is a system that can look at the entire codebase, or some representation of the entire codebase, versus the portion of it that could fit into context today.

    6. TM

      Yeah.

    7. SG

      What about the other

  7. 18:03–22:01

    Agent chaining and data management

    1. SG

      piece, the idea of agent chaining and using LLMs to manage the data in that form?

    2. TM

      So agent chaining is a growing area, and many people are doing research on it. I think it's a little bit less well-defined, in some sense. The first thing I would say is that it's kind of orthogonal to embedding models and re-rankers to some degree, because even when you have agent chaining, you still probably use embedding models as part of the chain. You probably do iterative retrieval as part of the chain. And of course, you use large language models as part of the chain as well. So in some sense, it's an orthogonal direction. I would probably rephrase agent chaining as more of an iterative, multi-step, retrieval-augmented, large-language-model-augmented (laughs) system. Some part of the retrieval is probably done by a large language model, some part of the system is done by a small large language model, and some part is done by an embedding model, so on and so forth. In that sense, I feel like it's somewhat orthogonal.

    3. SG

      Yeah, and I feel like some of the motivation for agent chaining to begin with is the same efficiency motivation as RAG.

    4. TM

      Yeah, exactly. But if you use a very, very large language model to manage the knowledge system, you again lose the efficiency, right? So it has to be a somewhat smaller model managing the knowledge, and at that point, an embedding model might be the right thing to use in that agent-chaining framework. Maybe another angle to look at this is whether we should do iterative retrieval versus retrieving all at once. I think iterative retrieval is definitely useful, especially because there is still a lot of headroom in embedding models' performance; sometimes you have to retrieve multiple times because the models are not clever enough. However, in the long run, my suspicion is that iterative retrieval will be useful but a bit less so as embedding models become more and more clever. Once the embedding models are more clever, maybe one round or two rounds is going to be enough.

    5. SG

      Mm-hmm. If we go ahead and just assume that RAG is at least a dominant architecture for enterprise use cases where you care about large amounts of proprietary data with reliability, how do you go about improving a RAG system? You can improve the LLM itself, but what are the other components that you guys are working on? Or what are the challenges, from the builder's perspective, in improving retrieval quality?

    6. TM

      Yeah. I guess there are a few ways. One way is that you improve the prompting of the large language models. For example, you could tell the large language model to abstain if there's no relevant information (laughs) in the retrieved documents. But because the large language models are so good these days, I think you don't need a lot of prompting anymore; they respond to those instructions very well. The next thing is to improve the retrieval part, which is the bottleneck, in my opinion, because most of our users found that if they improve the retrieval quality, that directly affects the response quality. And for improving the retrieval part, there are two ways. One is that you improve the embedding model. The other is that you improve the things on top of it: how you chunk the data, whether you do iterative retrieval, whether you put some meta-information in the data, so on and so forth. So basically, there are two ways of improving. Either you improve the neural networks (laughs), the embedding models or the re-rankers, or you improve the ways you use the networks with software engineering, by chunking, iterations, or other heuristics and tricks on top of that. What we specialize in is improving the neural networks, because that requires a lot of heavy lifting.
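      One concrete form of the "abstain" instruction mentioned above; the exact wording and names here are just an illustration:

      ```python
      PROMPT_TEMPLATE = """Answer the question using ONLY the documents below.
      If the documents do not contain the answer, reply exactly: "I don't know."

      Documents:
      {documents}

      Question: {question}"""

      def build_rag_prompt(documents, question):
          # Join retrieved documents with separators so the model can tell them apart.
          return PROMPT_TEMPLATE.format(documents="\n---\n".join(documents),
                                        question=question)
      ```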

  8. 22:01–25:44

    Improving enterprise RAG

    1. TM

      It's a very data-driven approach. We train our neural networks on trillions of tokens, at least, and we fine-tune them for special use cases. This is something a company like ours should do, rather than every end user optimizing it themselves. And my long-term vision here is that some of these software engineering layers on top of the neural networks will be less and less needed as the neural networks become more and more clever. For example, right now we already see that chunking becomes less needed because the context window becomes longer and longer, and relatively long-context embedding models, where long context means, like, 10K or maybe 16K tokens, so that you can put a 50-page PDF into it, have become much better. So there's less of a need to chunk the documents into pieces of, say, 512 tokens. And I think this will happen in other dimensions as well. Maybe in the future, you won't have to turn your images into descriptions of images and then give those to the text embedding model. That's what people are doing right now: everything is turned into text, and they use a text embedding model. But when the embedding models are more clever and multimodal, you won't have to do that anymore.
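      For reference, the kind of fixed-size chunking that longer-context embedding models make less necessary looks roughly like this; sizes are illustrative, with 512 tokens being the classic chunk size mentioned above:

      ```python
      def chunk_tokens(tokens, size=512, overlap=64):
          """Split a token sequence into overlapping fixed-size windows."""
          step = size - overlap
          return [tokens[i:i + size]
                  for i in range(0, max(len(tokens) - overlap, 1), step)]

      chunks = chunk_tokens(list(range(1300)))   # a 1,300-token doc -> 3 overlapping chunks
      ```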

    2. SG

      Mm-hmm. Can you talk a little bit about the intuition for how fine-tuning or domain-specific embeddings improve performance?

    3. TM

      Yeah. Fine-tuning and domain-specific embedding models are what we are very good at at Voyage. Just for some context: what we do is start with a general-purpose base embedding model, which we also trained from scratch. From there, we first fine-tune, or continued pre-train, whatever you call it, on some domain-specific data. For example, we fine-tuned on two trillion tokens of code, and that's how we get the code embedding model. And we did the fine-tuning on one trillion legal tokens, and that's how we got the legal embedding model. These domain-specific embedding models didn't use any proprietary data, so everyone can use them, but they really excel in one particular domain, and the performance in other domains doesn't change much. The reason we do this is that the number of parameters in an embedding model is limited. Because you only have a latency budget, something like maybe one second, sometimes 200 milliseconds, some people even want 50 milliseconds, it's basically impossible to use more than 10 billion parameters for embedding models. With limited parameters, customization is very important, because customization means you spend the limited number of parameters on the right tasks, the right domain, so that you excel in that domain. There's no way you can use those 10 billion parameters to excel at everything; that's why you have to specialize in one domain. And we have seen 5 to 20% improvements from this domain-specific fine-tuning, depending on the particular domain. For code, we have seen 15 to 20% improvement, partly because we have a lot of data there, and the headroom is also bigger, because code retrieval requires deep understanding of the algorithmic part of the code. For the legal domain, the baseline is a little bit better, so the headroom is slightly smaller; that's why we see 5 to 15% improvement, depending on the dataset. For some very complex legal datasets, we have seen bigger improvements.
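      The conversation doesn't spell out the training objective, but the standard contrastive recipe for embedding models referenced earlier looks roughly like this: an in-batch-negatives InfoNCE sketch, with an illustrative temperature value:

      ```python
      import numpy as np

      def info_nce_loss(query_vecs, doc_vecs, temperature=0.05):
          """Contrastive loss with in-batch negatives: the i-th query should
          score highest against the i-th (positive) document, while all other
          documents in the batch act as negatives. Inputs are L2-normalized
          [batch, dim] arrays."""
          logits = query_vecs @ doc_vecs.T / temperature          # pairwise similarities
          log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
          return -np.mean(np.diag(log_softmax))                   # maximize matched pairs
      ```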

    4. SG

      Just to make sure that our listeners can picture exactly where

  9. 25:44–27:48

    Latency budgets

    1. SG

      the latency cost is coming from here: in a search system, your data has been vectorized by an embeddings model, but then every query also needs to be translated into an embedding and then compared to the embeddings of your knowledge, in order to feed the LLM for the generation that you want, right? So there's inference-time latency here as well. I just think that's not obvious if somebody hasn't built a RAG system.

    2. TM

      Yeah, exactly. So basically, at inference time, you have to first turn the query into a vector and then do the search with the vector database. And related to this, the dimension of the vectors you produce also affects the latency of the vector-based search. If the dimension of the embedding is only 100, then it's going to be much, much faster than when the dimension is 1,000. This is actually something we are very good at as well: we produce embeddings with 3X or 4X smaller dimension than some of our competitors.
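      A quick way to see why embedding dimension drives search latency: brute-force similarity search is one dot product per document, so compute and memory scale linearly with dimension. The corpus size and dimensions below are illustrative:

      ```python
      import numpy as np, time

      n_docs = 100_000
      for dim in (256, 1024):
          docs = np.random.randn(n_docs, dim).astype(np.float32)   # the vectorized corpus
          query = np.random.randn(dim).astype(np.float32)
          start = time.perf_counter()
          scores = docs @ query                                    # one dot product per doc
          print(f"dim={dim}: {time.perf_counter() - start:.4f}s")
      # The 4X larger dimension does ~4X the work (and uses ~4X the memory) per query.
      ```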

    3. SG

      Yeah. That makes sense. Intuitively, you are creating embedding models that use a limited number of parameters and dimensions, given the latency budget that any application has, to create the best possible representation of proprietary or domain-specific data.

    4. TM

      Yeah, exactly. And going back to domain specificity and fine-tuning: the second level of customization is that we can customize to a particular company. We fine-tune on the proprietary data of a particular company, and we can see a 10 to 20% improvement on top of the domain-specific fine-tuning as well. Of course, there's a total budget on how much additive improvement you can get. If you start at 50% accuracy, you have 50% headroom, but if you start at 90%, you only have 10% headroom. So the absolute improvement varies a little bit across domains.

    5. SG

      So,

  10. 27:48–31:06

    Advice for building RAG systems

    1. SG

      maybe just advice to people who are building RAG systems: at what point do they begin to invest in some of these retrieval components?

    2. TM

      Yeah. I think they can do it even from day one, as long as they have a prototype available. My default suggestion for our users is that when they have the RAG system, first of all, connect the components and at least see some responses, and then do some basic profiling of the latency and the quality. You can check the retrieval quality, meaning how often you retrieve relevant documents; there are some standard ways to evaluate retrieval quality. And then you also do the end-to-end evaluation of the responses, and you can see which part is the bottleneck. In many cases, people find that the retrieval quality is not good, so the final response is not good. Then you can swap some of the components. You can say, "I'm going to try this embedding model," or you can try re-rankers, which we haven't discussed much. You can try various different embeddings, and possibly various different large language models as well.
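      One of the standard retrieval-quality checks alluded to above is recall@k; a minimal sketch, with illustrative data structures:

      ```python
      def recall_at_k(ranked_results, relevant, k=5):
          """Fraction of queries whose top-k retrieved docs include a relevant one.

          ranked_results: {query: [doc_id, ...] ranked best-first}
          relevant:       {query: set of gold doc_ids}
          """
          hits = sum(any(d in relevant[q] for d in docs[:k])
                     for q, docs in ranked_results.items())
          return hits / len(ranked_results)
      ```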

    3. SG

      Maybe just zooming out: you started by saying that in order to have the debate about RAG versus alternative architectures for working on proprietary data, you need to predict forward. Any predictions for how these systems change as LLMs improve dramatically, if we look at the next generations of GPT and Claude and the Mistral models and LLaMA and such?

    4. TM

      Yeah. My prediction is that the system will become simpler and simpler. Maybe this is my biased view (laughs), or at least it's something we are working toward. The idea would be a very, very simple system: you just have three components, a large language model, a vector database, and an embedding model, and maybe a fourth component, a re-ranker, which refines the retrieved results. You connect all of these, and the neural networks do everything else. You don't have to worry about chunking, multimodality, or changing data formats, because the neural networks can do most of that. Seven years ago, if you talked to any of the so-called language models of that time, you had to put your data into a very, very clean format. Now you talk to GPT-4, you can have typos, you can have all kinds of weird formats, you can even dump JSON files into it. The same thing will happen for embedding models as well. So my vision is that in the future, AI will just be a very simple software engineering layer on top of a few very strong neural network components.

    5. SG

      Mm-hmm. Yes. I think my bias toward "it is actually all going to be AI" versus complex, discretized software systems is clear, but I believe it's directionally right. Maybe zooming out to get a little bit of your perspective as a founder: what are one or two top learnings you have about starting the company as an academic, even despite your work with Google and other companies

  11. 31:06–32:55

    Learnings as an AI founder

    1. SG

      before?

    2. TM

      Yeah. I think it's very, very different. Founding a company is very different from doing research (laughs) at big tech, and even from academia. Actually, it's a little bit closer to academia, because to run a university lab, I'm the CEO, CTO, CFO, and HR for the lab, right? You touch a little bit of everything, but at a slightly different scale. One of the biggest things I learned, actually from one of our angel investors, is that I should read some of the books (laughs). For experienced entrepreneurs, many of the books are probably very basic, but for me they were very, very useful, even the basic ones, including Elad's book, by the way (laughs). Though his book is a little bit advanced, in the sense that it's about how to scale from 10 people to 1,000 people, and I only read a few chapters of it because we are about 10 people right now (laughs). And also talking to a lot of angel investors, talking to Sarah and my other lead investors. All of these helped me a lot in reducing the unforced errors (laughs) in this process. To me, it's really about how to reduce the number of errors you make so that you can maximize efficiency, at least that's what has happened for me, and also how to correct mistakes as fast as possible. If you correct mistakes one week after you make them versus one month after, that's a 4X efficiency improvement.

    3. SG

      Mm-hmm. Very theoretically consistent with your vein of research. Last question: you have been personally productive and run a productive research lab, but you've started

  12. 32:55–36:19

    The role of academia in AI

    1. SG

      a company. What do you think the role of academia in AI is in this age of scaling? Because most of your former students essentially all work at OpenAI or Anthropic, with a few professors and Citadel folks in the mix, besides the ones working with you, right?

    2. TM

      (laughs) Yes, yes. In academia, this is a little bit of a controversial topic; different people have different views. My view is that academia should probably work on different questions from what industry is good at. If we are only working on how to scale up the system, then obviously the incentive is not right; we don't have enough capital. Even OpenAI, I guess, Sam Altman argues that you need a lot of capital to do this. At the very beginning, the point was that it cannot be a non-profit, because if it's a non-profit, then you don't have enough capital and you cannot scale up enough. I kind of agree with that, and that's why in academia it's very hard to scale up and have enough resources to do large-scale research. However, in academia there are many, many other things we can do on a smaller scale, and we probably should focus on more long-term innovations. What I tell my students at the lab is that we should think about what will be the breakthrough in three to five years, as opposed to how to help OpenAI (laughs) improve their large language models for GPT-5. That's why we work on optimizers: Adam is a 10-year-old optimizer, and we said, okay, that sounds like a long-term project. Maybe in five years we can improve the optimization efficiency by 5 to 10X; that's going to be a game changer for the whole landscape. If we improve the efficiency by 10X, that's like $100 million versus $10 million for training GPT-5. That would change the landscape a lot in the industry. So efficiency is one of the things I spend a lot of time on. Another is reasoning tasks. The reason I identified that as one of my lab's directions is that it's challenging and requires a lot of very innovative research. It's very unclear whether the scaling law is really enough to get you to prove the Riemann hypothesis (laughs) or any of the math conjectures. You have to reach superhuman performance, in some sense. If you train on just the Common Crawl data from the web, can you be a good mathematician? It's very hard to believe that. So we need more innovation there. That's pretty much what we are doing at the university lab: we try to work on the three-to-five-year agenda, on a smaller scale.

    3. SG

      Mm-hmm. (instrumental music plays) I think that's an inspiring note to end on, and a very open-minded one about what is still to be figured out. Thanks so much for doing this, Tengyu.

    4. TM

      Thanks so much.

    5. SG

      (instrumental music plays) Find us on Twitter @nopriorspod. Subscribe to our YouTube channel if you wanna see our faces. Follow the show on Apple Podcasts, Spotify, or wherever you listen. That way, you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.

Episode duration: 36:20
