No Priors

No Priors Ep. 67 | With Voyage AI Co-Founder and CEO

After years at Stanford researching AI optimization, embedding models, and transformers, Tengyu Ma took a break from academia to start Voyage AI, which helps enterprise customers achieve the most accurate retrieval possible through the most useful foundational data. Tengyu joins Sarah on this week’s episode of No Priors to discuss why RAG systems are winning as the dominant architecture in enterprise and the evolution of foundational data that has allowed RAG to flourish. While fine-tuning is still in the conversation, Tengyu argues that RAG will continue to evolve as the cheapest, quickest, and most accurate system for data retrieval. They also discuss methods for growing context windows and managing latency budgets, how Tengyu’s research has informed his work at Voyage, and the role academia should play as AI grows as an industry.

Show Notes:

0:00 Introduction
1:59 Key points of Tengyu’s research
4:28 Academia compared to industry
6:46 Voyage AI overview
9:44 Enterprise RAG use cases
15:23 LLM long-term memory and token limitations
18:03 Agent chaining and data management
22:01 Improving enterprise RAG
25:44 Latency budgets
27:48 Advice for building RAG systems
31:06 Learnings as an AI founder
32:55 The role of academia in AI

Host: Sarah Guo · Guest: Tengyu Ma
Jun 5, 2024 · 36m · Watch on YouTube ↗

At a glance

WHAT IT’S REALLY ABOUT

Voyage CEO Explains Why RAG Beats Long Context For Enterprise AI

  1. Stanford professor and Voyage AI co-founder Tengyu Ma discusses his research journey from matrix completion and early sentence embeddings to modern contrastive learning, LLM optimizers, and domain-specific retrieval systems.
  2. He explains Retrieval-Augmented Generation (RAG), why retrieval quality is now the main bottleneck for enterprise AI, and argues that RAG will remain more cost-efficient and practical than ultra-long-context models or pure agent-chaining approaches.
  3. Ma outlines Voyage’s focus on high-quality embeddings and re-rankers, domain- and company-specific fine-tuning, and reducing latency through compact, efficient models and optimizers like Sofia.
  4. He closes with reflections on founding a startup as an academic and what academia’s role should be in longer-horizon AI research on efficiency and reasoning, rather than competing directly on scale with industry labs.

IDEAS WORTH REMEMBERING

5 ideas

RAG will likely remain more cost-efficient than long-context LLMs for enterprise data.

Ma argues that stuffing an entire corporate knowledge base (often 100M+ tokens) into context is orders of magnitude more expensive than selective retrieval, and that even as compute gets cheaper, RAG’s neural components will get cheaper too, preserving the cost advantage.
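The cost gap is easy to see with back-of-envelope arithmetic. The sketch below uses an assumed price of $3 per million input tokens and an assumed top-10 retrieval of ~1k-token chunks; both numbers are illustrative, not from the episode.

```python
# Back-of-envelope: per-query input cost, long context vs. RAG.
# PRICE_PER_M_TOKENS is a hypothetical LLM price for illustration only.
PRICE_PER_M_TOKENS = 3.00

corpus_tokens = 100_000_000        # entire knowledge base stuffed into context
retrieved_tokens = 10 * 1_000      # RAG: top-10 chunks of ~1k tokens each

long_context_cost = corpus_tokens / 1_000_000 * PRICE_PER_M_TOKENS
rag_cost = retrieved_tokens / 1_000_000 * PRICE_PER_M_TOKENS

print(f"long context per query: ${long_context_cost:.2f}")   # $300.00
print(f"RAG per query:          ${rag_cost:.4f}")            # $0.0300
print(f"ratio: {long_context_cost / rag_cost:,.0f}x")        # 10,000x
```

Whatever the exact prices, the ratio is driven by corpus size over retrieved size, which is why the advantage survives falling compute costs.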

Retrieval quality, not wiring up RAG, is now the main bottleneck.

Connecting an LLM, vector database, and basic RAG pipeline is easy; the hard part is ensuring the retrieved documents are highly relevant so the LLM can answer accurately and reduce hallucinations.
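The "easy wiring" half can be sketched in a few lines. Below, a toy bag-of-words vector stands in for a real embedding model (which would normally be an API or model call), so the example is self-contained; it is an illustration of the pipeline shape, not Voyage's method.

```python
import math

def embed(text: str) -> dict[str, float]:
    """Toy embedding: L2-normalized bag-of-words counts (stand-in for a model)."""
    counts: dict[str, float] = {}
    for word in text.lower().split():
        word = word.strip(".,!?")
        counts[word] = counts.get(word, 0.0) + 1.0
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {w: v / norm for w, v in counts.items()}

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    return sum(v * b.get(w, 0.0) for w, v in a.items())

def top_k(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by similarity to the query; keep the top k."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "The refund policy allows returns within 30 days.",
    "Our offices are closed on public holidays.",
    "Refunds are issued to the original payment method.",
]
context = top_k("how do refunds work", docs)
# The retrieved chunks are then prepended to the LLM prompt; as Ma notes,
# answer quality is bottlenecked by how relevant these chunks are.
```

Swapping the toy `embed` for a strong embedding model, and the list scan for a vector database, changes retrieval quality enormously while leaving this pipeline shape untouched.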

High-quality, domain-specific embeddings materially improve retrieval performance.

Voyage pre-trains general embeddings, then fine-tunes on massive domain corpora (e.g., trillions of code or legal tokens), delivering 5–20% retrieval gains, with additional uplift from company-specific fine-tuning on proprietary data.
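Fine-tuning of this kind is typically done with a contrastive objective such as InfoNCE, which Ma's research background touches on: pull each (query, relevant document) pair together and push in-batch negatives apart. The pure-Python sketch below shows the loss on raw vectors for illustration; real training applies it to a deep model with autograd, and the temperature value is an assumption.

```python
import math

def info_nce(query_vecs, doc_vecs, temperature=0.05):
    """Mean cross-entropy of matching each query to its paired doc,
    against all other docs in the batch as in-batch negatives."""
    losses = []
    for i, q in enumerate(query_vecs):
        sims = [sum(a * b for a, b in zip(q, d)) / temperature for d in doc_vecs]
        log_z = math.log(sum(math.exp(s) for s in sims))
        losses.append(log_z - sims[i])   # -log softmax of the true pair
    return sum(losses) / len(losses)

# Toy batch of two aligned (query, doc) pairs: a well-trained embedding
# space makes each query closest to its own document, so the loss is ~0.
loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Training on trillions of domain tokens with an objective like this is what produces the 5–20% retrieval gains Ma describes.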

Efficiency constraints force embedding models to be specialized and compact.

Because production systems have tight latency budgets (often 50–200 ms), embedding models can’t be arbitrarily large; Ma emphasizes using limited parameters and lower-dimensional embeddings to maximize domain performance and keep vector search fast.
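The dimension constraint follows from the arithmetic of vector search: brute-force scoring costs O(N · d) multiply-adds per query, and the index itself is N · d floats, so halving the dimension roughly halves both. The corpus size and dimensions below are assumptions for illustration.

```python
# Why embedding dimension matters for latency and memory in vector search.
# n_docs and the dimensions are illustrative, not figures from the episode.
n_docs = 10_000_000

for dim in (1536, 1024, 512):
    flops_per_query = n_docs * dim * 2    # one multiply + one add per component
    index_bytes = n_docs * dim * 4        # float32 storage
    print(f"d={dim}: {flops_per_query / 1e9:.1f} GFLOPs/query, "
          f"{index_bytes / 2**30:.1f} GiB index")
```

Approximate-nearest-neighbor indexes reduce the per-query constant, but cost and memory still scale with dimension, which is why compact, domain-specialized embeddings fit a 50–200 ms budget better than huge general ones.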

Agent-chaining is complementary to RAG, not a replacement.

Ma frames agent systems as multi-step pipelines where both LLMs and embeddings participate; even sophisticated agents will still rely on embedding-based retrieval for efficiency rather than being managed solely by large LLMs.

WORDS WORTH SAVING

5 quotes

“The bottleneck seems to be the quality of the response, and the quality of the response is almost bottlenecked by the quality of the retrieval part.”

Tengyu Ma

“Why would you go through the entire library every time to answer a single question?”

Tengyu Ma

“My prediction is that RAG will be much cheaper than long context going forward.”

Tengyu Ma

“You only have a limited number of parameters… there’s no way that you can use these to excel in everything, so that’s why you have to specialize in one domain.”

Tengyu Ma

“My vision is that in the future, AI will just be a very simple software engineering layer on top of a few very strong neural network components.”

Tengyu Ma

Tengyu Ma’s research trajectory: theory, embeddings, contrastive learning, LLM optimizers
Founding Voyage AI and the commercialization timing for foundation models
Definition, architecture, and real-world applications of Retrieval-Augmented Generation (RAG)
RAG vs. long-context LLMs vs. agent-chaining as architectures for proprietary data
Improving retrieval quality: embeddings, re-rankers, chunking, and software heuristics
Domain-specific and company-specific embedding fine-tuning and latency constraints
The role of academia in AI: efficiency, reasoning, and long-term breakthroughs

