No Priors

No Priors Ep. 67 | With Voyage AI Co-Founder and CEO

Sarah Guo and Tengyu Ma: Voyage CEO Explains Why RAG Beats Long Context for Enterprise AI.

Sarah Guo (host) · Tengyu Ma (guest)
Jun 6, 2024 · 36m
Tengyu Ma’s research trajectory: theory, embeddings, contrastive learning, LLM optimizers
Founding Voyage AI and the commercialization timing for foundation models
Definition, architecture, and real-world applications of Retrieval-Augmented Generation (RAG)
RAG vs. long-context LLMs vs. agent-chaining as architectures for proprietary data
Improving retrieval quality: embeddings, re-rankers, chunking, and software heuristics
Domain-specific and company-specific embedding fine-tuning and latency constraints
The role of academia in AI: efficiency, reasoning, and long-term breakthroughs

In this episode of No Priors, Sarah Guo talks with Stanford professor and Voyage AI co-founder Tengyu Ma about his research journey from matrix completion and early sentence embeddings to modern contrastive learning, LLM optimizers, and domain-specific retrieval systems.

At a glance

WHAT IT’S REALLY ABOUT

Voyage CEO Explains Why RAG Beats Long Context For Enterprise AI

  1. Stanford professor and Voyage AI co-founder Tengyu Ma discusses his research journey from matrix completion and early sentence embeddings to modern contrastive learning, LLM optimizers, and domain-specific retrieval systems.
  2. He explains Retrieval-Augmented Generation (RAG), why retrieval quality is now the main bottleneck for enterprise AI, and argues that RAG will remain more cost-efficient and practical than ultra-long-context models or pure agent-chaining approaches.
  3. Ma outlines Voyage’s focus on high-quality embeddings and re-rankers, domain- and company-specific fine-tuning, and reducing latency through compact, efficient models trained with optimizers like Sophia.
  4. He closes with reflections on founding a startup as an academic and what academia’s role should be in longer-horizon AI research on efficiency and reasoning, rather than competing directly on scale with industry labs.

IDEAS WORTH REMEMBERING

7 ideas

RAG will likely remain more cost-efficient than long-context LLMs for enterprise data.

Ma argues that stuffing an entire corporate knowledge base (often 100M+ tokens) into context is orders of magnitude more expensive than selective retrieval, and that as compute gets cheaper the savings apply equally to RAG’s neural components, so the cost advantage persists.
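As a back-of-envelope illustration of that gap (the 100M-token corpus matches the episode; the per-token price and retrieved-chunk budget are hypothetical):

```python
# Cost per query: read the whole corpus vs. read only retrieved chunks.
# Price and chunk budget below are illustrative assumptions, not quoted rates.
CORPUS_TOKENS = 100_000_000      # entire knowledge base, re-read every query
RETRIEVED_TOKENS = 10_000        # a handful of top-ranked chunks instead
PRICE_PER_M_INPUT = 1.00         # assumed $ per 1M input tokens

long_context_cost = CORPUS_TOKENS / 1e6 * PRICE_PER_M_INPUT
rag_cost = RETRIEVED_TOKENS / 1e6 * PRICE_PER_M_INPUT

print(f"long context: ${long_context_cost:.2f}/query")       # $100.00
print(f"RAG:          ${rag_cost:.4f}/query")                 # $0.0100
print(f"advantage:    {long_context_cost / rag_cost:,.0f}x")  # 10,000x
```

Cheaper compute shrinks both numbers proportionally, which is why the ratio, not the absolute price, is the durable argument.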

Retrieval quality, not wiring up RAG, is now the main bottleneck.

Connecting an LLM, a vector database, and the glue code of a basic RAG pipeline is easy; the hard part is ensuring the retrieved documents are highly relevant so the LLM can answer accurately and hallucinate less.
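A minimal sketch of that "easy" wiring, with toy stand-ins for the embedding model and the LLM (neither represents any real vendor's API):

```python
import numpy as np

# Toy stand-ins so the sketch runs end to end; in production, embed() would
# call an embedding model and generate() an LLM API.
def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def generate(prompt: str) -> str:
    return f"[LLM answer given]\n{prompt}"

# The "vector database": document embeddings stacked into one matrix.
docs = ["Refunds are accepted within 30 days.",
        "Standard shipping takes 3-5 business days.",
        "Contact support via email for escalations."]
index = np.stack([embed(d) for d in docs])

def rag_answer(question: str, k: int = 2) -> str:
    q = embed(question)
    scores = index @ q                    # cosine similarity (unit-norm vectors)
    top = np.argsort(scores)[::-1][:k]    # retrieve the k nearest chunks
    context = "\n".join(docs[i] for i in top)
    return generate(f"Context:\n{context}\n\nQuestion: {question}")

print(rag_answer("How long do refunds take?"))
```

Nothing above is hard; the quality of embed() is what separates this demo from a system that reliably surfaces the right chunks.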

High-quality, domain-specific embeddings materially improve retrieval performance.

Voyage pre-trains general embeddings, then fine-tunes on massive domain corpora (e.g., trillions of code or legal tokens), delivering 5–20% retrieval gains, with additional uplift from company-specific fine-tuning on proprietary data.
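The episode connects this to Ma's contrastive-learning work; below is a generic in-batch-negatives (InfoNCE-style) loss of the kind commonly used for embedding fine-tuning, offered as a sketch rather than Voyage's actual training objective:

```python
import numpy as np

# InfoNCE with in-batch negatives: row i of `queries` should score highest
# against row i of `positives`; every other row in the batch acts as a negative.
def info_nce(queries: np.ndarray, positives: np.ndarray, tau: float = 0.05) -> float:
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = q @ p.T / tau                          # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))      # cross-entropy on the diagonal

rng = np.random.default_rng(0)
print(info_nce(rng.standard_normal((8, 64)), rng.standard_normal((8, 64))))
```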

Efficiency constraints force embedding models to be specialized and compact.

Because production systems have tight latency budgets (often 50–200 ms), embedding models can’t be arbitrarily large; Ma emphasizes using limited parameters and lower-dimensional embeddings to maximize domain performance and keep vector search fast.
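To see why dimension matters for the latency budget, here is an illustrative brute-force search timing (corpus size and dimensions are arbitrary; real systems use approximate indexes, but the linear dependence on dimension creates the same pressure):

```python
import time
import numpy as np

# Exact (brute-force) similarity search costs O(N * d) per query, so
# lower-dimensional embeddings directly buy back latency.
rng = np.random.default_rng(0)
N = 100_000                                # vectors in the index

for d in (256, 1024):                      # compact vs. large embedding dim
    index = rng.standard_normal((N, d), dtype=np.float32)
    q = rng.standard_normal(d, dtype=np.float32)
    t0 = time.perf_counter()
    top5 = np.argsort(index @ q)[-5:]      # exact top-5 by dot product
    ms = (time.perf_counter() - t0) * 1e3
    print(f"d={d:5d}: ~{ms:.1f} ms over {N:,} vectors")
```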

Agent-chaining is complementary to RAG, not a replacement.

Ma frames agent systems as multi-step pipelines where both LLMs and embeddings participate; even sophisticated agents will still rely on embedding-based retrieval for efficiency rather than being managed solely by large LLMs.
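A schematic of that division of labor; every function here is a placeholder stub, not a real API:

```python
# Schematic agent loop: the LLM decomposes the task, but each lookup goes
# through cheap embedding-based retrieval instead of re-reading the corpus.
def plan(question: str) -> list[str]:        # stand-in for an LLM planner
    return [f"background on: {question}", f"specifics of: {question}"]

def retrieve(sub_query: str) -> str:         # stand-in for a vector-index lookup
    return f"(top chunks for '{sub_query}')"

def generate(prompt: str) -> str:            # stand-in for LLM generation
    return f"[LLM synthesis over]\n{prompt}"

def run_agent(question: str, max_steps: int = 3) -> str:
    notes = [retrieve(sq) for sq in plan(question)[:max_steps]]
    return generate("Evidence:\n" + "\n".join(notes) + f"\n\nQuestion: {question}")

print(run_agent("Why did Q3 churn increase?"))
```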

RAG stacks will simplify as neural components become more capable.

Ma predicts future systems will mostly comprise a strong LLM, an embedding model, a reranker, and a vector DB, with far less need for manual chunking, format normalization, or multimodal hacks as models natively handle long context and various data types.
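As a sketch of that predicted four-component shape, a two-stage retrieve-then-rerank flow (embed() and rerank_score() are toy stubs, not real models):

```python
import numpy as np

# Stage 1: embedding search casts a cheap, wide net over the vector DB.
# Stage 2: a reranker orders those candidates precisely.
# Stage 3: the LLM answers over the top few chunks.
rng = np.random.default_rng(0)
docs = [f"document {i}" for i in range(1000)]
index = rng.standard_normal((1000, 64), dtype=np.float32)   # "vector DB"

def embed(text: str) -> np.ndarray:                # stand-in embedding model
    return rng.standard_normal(64, dtype=np.float32)

def rerank_score(query: str, doc: str) -> float:   # stand-in cross-encoder reranker
    return float(rng.random())

def answer(question: str, k_retrieve: int = 50, k_context: int = 5) -> str:
    scores = index @ embed(question)
    candidates = np.argsort(scores)[::-1][:k_retrieve]        # stage 1
    best = sorted(candidates,
                  key=lambda i: rerank_score(question, docs[i]),
                  reverse=True)[:k_context]                   # stage 2
    context = " | ".join(docs[i] for i in best)
    return f"LLM(context={context!r}, question={question!r})" # stage 3

print(answer("example question"))
```

The surrounding software shrinks to glue code, the "very simple software engineering layer" Ma describes in the quotes above.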

Academia’s comparative advantage is long-horizon work on efficiency and reasoning.

Given capital constraints, Ma believes universities should pursue 3–5 year breakthroughs (e.g., 5–10x more efficient optimizers, genuine reasoning advances) rather than trying to match industry on sheer scale of model training.

WORDS WORTH SAVING

5 quotes

“The bottleneck seems to be the quality of the response, and the quality of the response is almost bottlenecked by the quality of the retrieval part.”

Tengyu Ma

“Why would you go through the entire library every time to answer a single question?”

Tengyu Ma

“My prediction is that RAG will be much cheaper than long context going forward.”

Tengyu Ma

“You only have a limited number of parameters… there’s no way that you can use these to excel in everything, so that’s why you have to specialize in one domain.”

Tengyu Ma

“My vision is that in the future, AI will just be a very simple software engineering layer on top of a few very strong neural network components.”

Tengyu Ma

QUESTIONS ANSWERED IN THIS EPISODE

5 questions

Under what conditions, if any, could long-context LLMs become economically competitive with RAG for large proprietary corpora?

How should an enterprise rigorously measure whether its current bottleneck is retrieval quality, LLM quality, or prompt design?

What are the trade-offs between domain-specific embeddings vs. company-specific fine-tuning when data sensitivity and privacy are concerns?

How might advances in multi-modal embeddings change how we store and retrieve code, images, audio, and video in RAG systems?

What kinds of reasoning benchmarks would meaningfully demonstrate that improvements in optimizers or embeddings are enabling qualitatively new capabilities rather than just incremental gains?
