No Priors Ep. 67 | With Voyage AI Co-Founder and CEO Tengyu Ma

After years at Stanford researching AI optimization, embedding models, and transformers, Tengyu Ma took a break from academia to start Voyage AI, which aims to give enterprise customers the most accurate retrieval possible through the most useful foundational data. Tengyu joins Sarah on this week’s episode of No Priors to discuss why RAG systems are winning as the dominant architecture in enterprise and the evolution of foundational data that has allowed RAG to flourish. And while fine-tuning is still in the conversation, Tengyu argues that RAG will continue to evolve as the cheapest, quickest, and most accurate system for data retrieval. They also discuss methods for growing context windows and managing latency budgets, how Tengyu’s research has informed his work at Voyage, and the role academia should play as AI grows as an industry.

Show Notes:
0:00 Introduction
1:59 Key points of Tengyu’s research
4:28 Academia compared to industry
6:46 Voyage AI overview
9:44 Enterprise RAG use cases
15:23 LLM long-term memory and token limitations
18:03 Agent chaining and data management
22:01 Improving enterprise RAG
25:44 Latency budgets
27:48 Advice for building RAG systems
31:06 Learnings as an AI founder
32:55 The role of academia in AI

Sarah Guo (host), Tengyu Ma (guest)

Jun 6, 2024 · 36m

CHAPTERS

0:05 – 1:52
Tengyu Ma’s research arc: theory, RL, LLM efficiency, and reasoning
Sarah opens by introducing Tengyu Ma and asks how he chose such a broad set of research directions. Tengyu frames his work around theoretical thinking, then explains his current focus on training efficiency and reasoning as data/compute become limiting.
- Common thread: theoretical thinking applied across topics
- Shift from theory/RL to practical LLM training concerns
- Efficiency as a response to data and compute constraints
- Reasoning research as an important but risky and uncertain frontier
1:52 – 4:21
Notable papers: early embeddings, contrastive learning, and the Sophia optimizer
Tengyu lists key projects from his lab, from matrix completion to early sentence embeddings and modern contrastive learning. He highlights the Sophia optimizer work and why it matters given Adam’s long reign.
- Matrix completion and optimization roots
- Pre-transformer sentence embeddings via averaging + PCA
- Contrastive learning: improvements and understanding why it works
- Sophia optimizer: claims of ~2x pretraining efficiency; real-world large-scale results discussed
- Why beating Adam is hard despite many published attempts
4:21 – 6:51
Why leave Stanford (temporarily) to found a company now
Sarah asks what motivated Tengyu to start Voyage given his academic background. He argues the foundation-model era dramatically simplifies industrial ML application, making commercialization timely.
- Stanford’s industry/entrepreneurship proximity as a factor
- Foundation models reduced the “7 steps” of applied ML to prompting + RAG
- Maturity of tooling and models makes productization more feasible
- Entrepreneurship framed as part of his longer-term plan
6:51 – 7:40
Voyage AI’s focus: embeddings and rerankers as the RAG bottleneck fix
Tengyu explains Voyage’s product emphasis: improving retrieval quality via embeddings and reranking. Customer conversations suggested implementation is easy, but answer quality is constrained by retrieval relevance.
- Voyage builds core retrieval components: embedding models and rerankers
- RAG is straightforward to wire up; quality is the real pain point
- Retrieval quality heavily determines generated answer quality
- Even smaller LLMs can perform well if given highly relevant documents
7:40 – 9:36
RAG explained: retrieve first, then generate with grounding
Tengyu gives an intuitive overview of retrieval-augmented generation and why it reduces hallucinations. He walks through turning data into vectors, indexing in a vector DB, and retrieving relevant context for the LLM.
- Two-step pipeline: retrieval then generation
- RAG injects proprietary/company context the base LLM lacks
- Grounding reduces hallucinations by anchoring generation to evidence
- Vectorization of documents/media/code enables semantic search at scale
- Vector DB similarity search as the core indexing mechanism
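The two-step pipeline described in this chapter can be sketched in a few lines. This is only a toy illustration, not Voyage's implementation: the `embed` function is a bag-of-words stand-in for a neural embedding model, the "vector DB" is a plain Python list, and the documents, function names, and prompt wording are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, index: list, k: int = 2) -> list:
    """Step 1: rank indexed documents by similarity to the query, keep the top k."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(query: str, docs: list) -> str:
    """Step 2: pass the retrieved evidence to the LLM alongside the question."""
    context = "\n".join(f"- {d}" for d in docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical company documents, vectorized once at indexing time.
documents = [
    "Refunds are processed within 14 days of the return request.",
    "The office cafeteria is open from 8am to 3pm on weekdays.",
    "Enterprise plans include single sign-on and audit logs.",
]
index = [(doc, embed(doc)) for doc in documents]  # the toy "vector DB"

question = "How long do refunds take?"
top_docs = retrieve(question, index, k=1)
prompt = build_prompt(question, top_docs)
print(prompt)
```

In a real system the prompt would be sent to an LLM; here it is only printed, which keeps the sketch focused on how retrieval grounds the generation step.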
9:36 – 10:41
Where RAG is used: enterprise domains and personal knowledge search
Sarah asks what customers are building with RAG today. Tengyu describes broad adoption across industries plus consumer-style “personal memory” search as a compelling use case.
- Use cases across chemistry, finance, legal, and developer/code workflows
- RAG applied to internal documents, product descriptions, and knowledge bases
- Personal semantic search on devices as an emerging pattern
- Motivation: semantic retrieval is easier than filename/keyword-based search
10:41 – 11:57
The debate shifts: RAG vs fine-tuning, long context, and agent chaining
Sarah frames ongoing arguments about whether RAG is necessary, outlining alternatives like long-context models and agent-chaining workflows. Tengyu notes the earlier RAG-vs-fine-tuning debate is converging toward RAG’s practicality, then pivots to evaluating long context and agents.
- Prior debate: RAG vs fine-tuning; fine-tuning is often data-hungry and can still hallucinate
- New debate: RAG vs long context and managed context approaches
- Agent chaining described as an alternative architecture for proprietary data workflows
- Sets up cost/latency and systems-design considerations
11:57 – 16:12
Why long context isn’t enough: cost, caching, and the memory hierarchy analogy
Tengyu argues that stuffing massive proprietary corpora into context is economically impractical, even with caching of intermediate activations. He frames RAG as long-term memory and context as short-term memory, advocating a hierarchical retrieval approach similar to computer caching.
- Near-term long-context inference is prohibitively expensive at scale
- Caching activations can help but remains costlier than retrieval-first systems
- RAG as long-term memory; context window as short-term memory
- Analogy: you wouldn’t read the whole library to answer one question
- Hierarchical systems (disk/cache/CPU) suggest retrieval-based efficiency wins
16:12 – 18:02
Token limits in practice: ‘1M tokens’ vs ‘100M tokens’ company reality
Sarah and Tengyu quantify context-window limits and what they mean in books/lines of code. Tengyu emphasizes enterprise knowledge sizes dwarf even million-token contexts, making pure long-context approaches cost-prohibitive.
- Context windows growing (e.g., up to ~1M tokens) but still bounded
- Enterprises may have ~100M+ tokens of relevant proprietary material
- 100x scale differences translate into unacceptable cost multipliers
- RAG remains practical for reliability and budget constraints
18:02 – 20:19
Agent chaining: mostly orthogonal, but still needs retrieval components
Sarah asks about using LLM agents to manage data instead of vectorizing it. Tengyu argues agent chaining often still relies on embeddings and retrieval, and the key question becomes iterative retrieval vs single-shot retrieval as embedding quality improves.
- Agent chaining is less well-defined and often complements RAG rather than replaces it
- Efficient agent systems likely require smaller models and embedding-based retrieval
- Iterative retrieval helps today due to embedding headroom
- As embeddings improve, fewer retrieval rounds may be needed
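The difference between single-shot and iterative retrieval can be illustrated with a toy example. This is a sketch under loud assumptions: relevance is plain word overlap rather than an embedding model, the stopword list, documents, and function names are hypothetical, and "iterating" simply means folding the previous hit into the next query.

```python
import re

STOPWORDS = {"the", "is", "by", "who", "a", "of"}  # hypothetical minimal list


def tokens(text):
    """Lowercased word set for a piece of text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def score(query, doc):
    """Toy relevance: number of shared non-stopword tokens."""
    return len((tokens(query) & tokens(doc)) - STOPWORDS)


def retrieve_once(query, docs, exclude=()):
    """Single-shot retrieval: best-scoring document not already found, or None."""
    candidates = [d for d in docs if d not in exclude]
    if not candidates:
        return None
    best = max(candidates, key=lambda d: score(query, d))
    return best if score(query, best) > 0 else None


def retrieve_iteratively(query, docs, rounds=2):
    """Multi-round retrieval: each hit is appended to the query for the next round."""
    found = []
    current = query
    for _ in range(rounds):
        hit = retrieve_once(current, docs, exclude=found)
        if hit is None:
            break
        found.append(hit)
        current = f"{current} {hit}"  # fold the new evidence into the next query
    return found


docs = [
    "The Atlas project is owned by the platform team.",
    "The platform team is managed by Dana Lee.",
    "The cafeteria menu changes every Monday.",
]
query = "Who manages the Atlas project?"

print(retrieve_once(query, docs))         # finds only the first hop
print(retrieve_iteratively(query, docs))  # second round reaches the manager document
```

With a weak scorer, the second document is only reachable after the first hit adds "platform team" to the query. A stronger embedding model that already relates "manages" to "managed by" might need fewer rounds, which is the trend the chapter anticipates.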
20:19 – 23:20
Improving enterprise RAG: prompts, retrieval quality, and system heuristics
Sarah asks how builders improve RAG beyond upgrading the base LLM. Tengyu breaks improvements into prompting, better retrieval models, and software-layer choices like chunking and metadata, and predicts that many heuristics will fade as models get stronger and more multimodal.
- Prompting can enforce behaviors like abstention when evidence is missing
- Retrieval is often the true bottleneck for end-to-end answer quality
- Two levers: improve neural components (embeddings/rerankers) vs improve usage (chunking/metadata/iterations)
- Longer-context embedding models reduce the need for aggressive chunking
- Future: multimodal embeddings remove text-only conversion hacks
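As an illustration of the "usage" lever, here is a minimal sketch of fixed-size chunking with overlap, where each chunk carries simple metadata. The window size, overlap, metadata fields, and file name are hypothetical choices, not recommendations from the episode.

```python
def chunk_words(text, source, size=50, overlap=10):
    """Split text into overlapping word windows, tagging each with metadata."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + size]
        chunks.append({
            "source": source,      # metadata: where the chunk came from
            "start_word": start,   # metadata: position within the document
            "text": " ".join(window),
        })
        if start + size >= len(words):
            break  # this window already reaches the end of the document
    return chunks


# A synthetic 120-word document: w0 w1 ... w119
doc = " ".join(f"w{i}" for i in range(120))
chunks = chunk_words(doc, source="handbook.txt", size=50, overlap=10)
for c in chunks:
    print(c["source"], c["start_word"], len(c["text"].split()))
```

Heuristics like this exist largely because embedding models have limited input length; with longer-context embedding models, `size` can grow and the overlap bookkeeping matters less.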
23:20 – 25:39
Domain and company fine-tuning for embeddings: why it helps and how much
Tengyu explains Voyage’s approach: train a general embedding model, then fine-tune on large domain corpora (e.g., code, legal) and optionally on a company’s proprietary data. He attributes gains to limited parameter/latency budgets requiring specialization.
- Pipeline: general embedding model → domain adaptation → optional company-specific adaptation
- Examples: ~2T tokens for code; ~1T tokens for legal
- Reported gains: ~5–20% domain improvements; additional ~10–20% with proprietary fine-tuning
- Specialization is necessary because embedding models can’t be huge (latency constraints)
- Headroom depends on starting baseline accuracy
25:39 – 27:47
Latency budgets in retrieval: query embeddings, vector dimensions, and reranking
Sarah clarifies where latency arises: each query must be embedded and searched against the vector DB before generation. Tengyu adds that embedding dimensionality directly affects search speed and claims Voyage reduces vector dimension versus competitors to improve latency.
- Inference path: embed query → vector DB similarity search → provide docs to LLM
- Latency depends on embedding model inference and vector search cost
- Embedding dimension (e.g., 100 vs 1000) affects retrieval speed materially
- Voyage focuses on smaller-dimension embeddings to reduce latency
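To see why dimension matters, consider an exhaustive (brute-force) similarity search, where every query is compared against every stored vector. The sketch below counts work and memory for the 100- versus 1000-dimension example above; the one-million-vector index and 4-byte floats are assumptions, and production vector databases typically use approximate indexes whose cost does not scale exactly this way.

```python
def brute_force_cost(num_vectors, dim, bytes_per_value=4):
    """Multiply-adds and index memory for one exhaustive dot-product search."""
    return {
        "multiply_adds": num_vectors * dim,
        "index_bytes": num_vectors * dim * bytes_per_value,
    }


NUM_VECTORS = 1_000_000  # hypothetical index size

small = brute_force_cost(NUM_VECTORS, dim=100)
large = brute_force_cost(NUM_VECTORS, dim=1000)

ratio = large["multiply_adds"] / small["multiply_adds"]
print(f"Work per query: {ratio:.0f}x more at 1000 dimensions")
print(f"Index size: {small['index_bytes'] / 1e9:.1f} GB vs {large['index_bytes'] / 1e9:.1f} GB")
```

Under these assumptions both per-query compute and index memory grow linearly with dimension, so a model that reaches similar accuracy with fewer dimensions saves on both.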
27:47 – 36:20
Builder advice and the long view: simpler stacks, founder lessons, and academia’s role
Tengyu advises builders to evaluate retrieval quality early, profile bottlenecks, and swap components to iterate. He then predicts that future systems will become simpler (LLM + vector DB + embeddings + reranker), shares founder learnings about avoiding unforced errors, and argues that academia should pursue longer-horizon breakthroughs such as optimizers and reasoning.
- Start optimizing retrieval once a prototype exists; measure retrieval and end-to-end quality
- Iterate by swapping embeddings/rerankers/LLMs to isolate bottlenecks
- Prediction: fewer heuristics; stronger neural components simplify RAG pipelines
- Founder learnings: entrepreneurship differs from research; learn from books/advisors; correct mistakes fast
- Academia should focus on 3–5 year breakthroughs (efficiency, reasoning) rather than pure scaling arms races
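One way to act on the advice to measure retrieval quality and swap components is a small evaluation harness. The sketch below uses recall@k, a common retrieval metric that the episode does not name specifically; the labeled queries, document IDs, and the two stub "retrievers" are hypothetical placeholders for real embedding or reranker configurations.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def evaluate(retriever, labeled_queries, k):
    """Mean recall@k of a retriever over (query, relevant_ids) pairs."""
    scores = [recall_at_k(retriever(q), rel, k) for q, rel in labeled_queries]
    return sum(scores) / len(scores)


# Hypothetical labeled set: each query maps to the IDs a good system should return.
labeled_queries = [
    ("refund policy", {"doc1"}),
    ("sso setup", {"doc3", "doc4"}),
]

# Stub retrievers standing in for two component choices being compared.
results_a = {"refund policy": ["doc2", "doc1", "doc5"], "sso setup": ["doc3", "doc9", "doc8"]}
results_b = {"refund policy": ["doc1", "doc2", "doc5"], "sso setup": ["doc3", "doc4", "doc8"]}

score_a = evaluate(results_a.get, labeled_queries, k=2)
score_b = evaluate(results_b.get, labeled_queries, k=2)
print(f"retriever A recall@2: {score_a:.2f}")
print(f"retriever B recall@2: {score_b:.2f}")
```

Keeping the labeled queries fixed while swapping one component at a time makes it easier to tell whether a quality problem sits in retrieval or in the generation step.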