
No Priors Ep. 67 | With Voyage AI Co-Founder and CEO
Sarah Guo (host), Tengyu Ma (guest)
Voyage CEO Explains Why RAG Beats Long Context For Enterprise AI
Stanford professor and Voyage AI co-founder Tengyu Ma discusses his research journey from matrix completion and early sentence embeddings to modern contrastive learning, LLM optimizers, and domain-specific retrieval systems.
He explains Retrieval-Augmented Generation (RAG), why retrieval quality is now the main bottleneck for enterprise AI, and argues that RAG will remain more cost-efficient and practical than ultra-long-context models or pure agent-chaining approaches.
Ma outlines Voyage’s focus on high-quality embeddings and re-rankers, domain- and company-specific fine-tuning, and reducing latency through compact, efficient models and optimizers like Sophia.
He closes with reflections on founding a startup as an academic and what academia’s role should be in longer-horizon AI research on efficiency and reasoning, rather than competing directly on scale with industry labs.
Key Takeaways
RAG will likely remain more cost-efficient than long-context LLMs for enterprise data.
Ma argues that stuffing entire corporate knowledge bases (often 100M+ tokens) into context is orders of magnitude more expensive than selective retrieval, and even with cheaper compute, RAG’s neural components will also get cheaper, preserving the cost advantage.
Retrieval quality, not wiring up RAG, is now the main bottleneck.
Connecting an LLM, vector database, and basic RAG pipeline is easy; the hard part is ensuring the retrieved documents are highly relevant so the LLM can answer accurately and reduce hallucinations.
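To see why the wiring itself is easy, here is a minimal sketch of the whole pipeline. The `embed` and `answer_with_llm` functions are toy stand-ins (a deterministic bag-of-words hash and a placeholder string), not real model calls; the retrieval-quality problem Ma describes lives entirely inside how good `embed` is:

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy deterministic embedding: bucket words into a fixed-size vector.
    A real system would call an embedding model here."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[sum(ord(c) for c in word) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Cosine-similarity top-k over a tiny in-memory 'vector DB'."""
    doc_vecs = np.stack([embed(d) for d in docs])
    scores = doc_vecs @ embed(query)          # unit vectors, so dot = cosine
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

def answer_with_llm(query: str, context: list[str]) -> str:
    """Placeholder for the generation step; a real system prompts an LLM
    with the retrieved context."""
    return f"Answer to {query!r} grounded in {len(context)} retrieved docs."

docs = ["Voyage trains embedding models.",
        "RAG retrieves relevant chunks.",
        "Long context is expensive."]
print(answer_with_llm("What does RAG do?", retrieve("What does RAG do?", docs)))
```

The scaffolding is a few dozen lines; everything hard is hidden in whether `retrieve` actually surfaces the documents the answer needs.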
High-quality, domain-specific embeddings materially improve retrieval performance.
Voyage pre-trains general embeddings, then fine-tunes on massive domain corpora (e.g., …)
Efficiency constraints force embedding models to be specialized and compact.
Because production systems have tight latency budgets (often 50–200 ms), embedding models can’t be arbitrarily large; Ma emphasizes using limited parameters and lower-dimensional embeddings to maximize domain performance and keep vector search fast.
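Illustrative numbers (assumed for this sketch, not quoted in the episode) show why dimension matters: both index memory and brute-force search work scale linearly with embedding dimension, so a 4x smaller embedding is a 4x cheaper index and query:

```python
# Index size and per-query work as a function of embedding dimension.
# n_docs and dims are assumed illustrative values.
n_docs = 10_000_000          # vectors in the index
bytes_per_float = 4          # float32

for dim in (1024, 256):
    index_bytes = n_docs * dim * bytes_per_float
    print(f"dim={dim:4d}: ~{index_bytes / 1e9:.0f} GB index, "
          f"{n_docs * dim:,} multiply-adds per exhaustive query")
```

Approximate-nearest-neighbor indexes avoid the exhaustive scan, but their memory and latency still grow with dimension, which is why Ma pushes compact, specialized models rather than ever-larger general ones.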
Agent-chaining is complementary to RAG, not a replacement.
Ma frames agent systems as multi-step pipelines where both LLMs and embeddings participate; even sophisticated agents will still rely on embedding-based retrieval for efficiency rather than being managed solely by large LLMs.
RAG stacks will simplify as neural components become more capable.
Ma predicts future systems will mostly comprise a strong LLM, an embedding model, a reranker, and a vector DB, with far less need for manual chunking, format normalization, or multimodal hacks as models natively handle long context and various data types.
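The simplified stack Ma describes is a two-stage retrieve-then-rerank pattern. The sketch below uses toy scorers as stand-ins: a real system would use an embedding model for the fast first stage and a cross-encoder reranker for the slower, more accurate second stage:

```python
# Two-stage search: cheap candidate retrieval, then reranking.
# Both scorers are toy stand-ins for real neural components.

def cheap_score(query: str, doc: str) -> float:
    """Stage 1 stand-in: fast lexical overlap (Jaccard), a proxy for
    embedding similarity over the whole corpus."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q | d) or 1)

def rerank_score(query: str, doc: str) -> float:
    """Stage 2 stand-in: a slower, 'more accurate' scorer applied only
    to the shortlist (here: overlap adjusted by length mismatch)."""
    gap = abs(len(doc.split()) - len(query.split()))
    return cheap_score(query, doc) * (1 + 1 / (1 + gap))

def search(query: str, docs: list[str], k1: int = 50, k2: int = 5) -> list[str]:
    # Stage 1: cheap scoring over everything, keep top-k1 candidates.
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:k1]
    # Stage 2: expensive scoring only over the shortlist, keep top-k2.
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)[:k2]

docs = ["RAG retrieves relevant chunks.",
        "Long context is expensive.",
        "Voyage trains embedding models."]
print(search("Which chunks does RAG retrieve?", docs, k1=3, k2=2))
```

The division of labor is the point: the expensive scorer only ever sees k1 candidates, so the stack stays fast even as the reranker gets heavier.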
Academia’s comparative advantage is long-horizon work on efficiency and reasoning.
Given capital constraints, Ma believes universities should pursue 3–5 year breakthroughs (e.g., …)
Notable Quotes
“The bottleneck seems to be the quality of the response, and the quality of the response is almost bottlenecked by the quality of the retrieval part.”
— Tengyu Ma
“Why would you go through the entire library every time to answer a single question?”
— Tengyu Ma
“My prediction is that RAG will be much cheaper than long context going forward.”
— Tengyu Ma
“You only have a limited number of parameters… there’s no way that you can use these to excel in everything, so that’s why you have to specialize in one domain.”
— Tengyu Ma
“My vision is that in the future, AI will just be a very simple software engineering layer on top of a few very strong neural network components.”
— Tengyu Ma
Questions Answered in This Episode
Under what conditions, if any, could long-context LLMs become economically competitive with RAG for large proprietary corpora?
How should an enterprise rigorously measure whether its current bottleneck is retrieval quality, LLM quality, or prompt design?
What are the trade-offs between domain-specific embeddings vs. company-specific fine-tuning when data sensitivity and privacy are concerns?
How might advances in multi-modal embeddings change how we store and retrieve code, images, audio, and video in RAG systems?
What kinds of reasoning benchmarks would meaningfully demonstrate that improvements in optimizers or embeddings are enabling qualitatively new capabilities rather than just incremental gains?
Transcript Preview
Welcome to No Priors. Today, we're talking to Tengyu Ma, assistant professor of computer science at Stanford, and the co-founder and CEO of Voyage. Voyage trains state-of-the-art components for next generation retrieval systems, including embedding models and re-rankers. We're really excited to talk about his research and the RAG debate today. Uh, welcome, Tengyu.
Yeah. Thanks so much. Thanks for having me here. Looking forward to the debate.
Yeah. W- why don't we start with, um, just a- a little bit of an overview of y- like, your research agenda to date, because I- I think uniquely, it covers, um, a broad range of fields, uh, within and around deep learning from, like, theory to RL to embeddings and optimizers. So can you talk a little bit about, um, sort of how you picked the directions you have?
Yeah. So, um, I think most of the papers I wrote, uh, have some theoretical thinking in it. I guess maybe that's the commonality. And, um, um, be- besides that, I think I worked on, uh, quite a few topics, as you mentioned, ranging from the theoretical understanding, mathematical proofs of deep learning systems, um, to, um, all the way to practical large language models, reinforcement learning, um, deep reinforcement learning. And these days, recently, I think what we are working on, um, more centralized to, um, efficiency of training the large language models and reasoning, improving the reasoning tasks for large language models. So I- my vision is that in the future, uh, the efficiency is very important because, um, we are running out of data and compute, so we have to either use the data much better and use the compute much better. Um, and also reasoning tasks seems to be a pretty, uh, important direction. And, um, um, uh, and also in some sense, kind of like a- a risky direction in the sense that we don't know exactly how, um, how fast we can, uh, solve those challenging reasoning questions yet.
Mm-hmm. Um, can you, uh, mention a few of the, like, key papers or work that, uh, you know, you or students in your lab have done, just so our listeners can look them up?
In the very, uh, early days, I think I worked on some of this matrix completion, optimization for matrix completion. That's like 10 years ago. Um, and then I move on to, uh, embedding models, like sentence embeddings, vector embeddings. One of the papers we wrote is a ve- actually simple paper where we average the word embeddings to get sentence embeddings, and then we did some, uh, of these transformations using PCA to make the performance much better. Um, that was even before transformer came out. Um, and then I think I move on to, uh, transformers, large language models, um, and contrastive learning, which is the new way of training the embedding models. Um, uh, especially the- the, um- um, the direction started with some of the papers on, uh, using contrastive learning for images, and we work on improving those and understanding why contrastive learning can work. Uh, and recently, we work on, uh, optimizers for large language models. For example, one of the papers, uh, we wrote last year was Sophia, uh, which we found, where we found that we have a new tran- optimizer which can improve the training efficiency by 2X for pre-training.