No Priors Ep. 15 | With Kelvin Guu, Staff Research Scientist, Google Brain

How do you personalize AI models? A popular school of thought in AI is to just dump all the data you need into pre-training or fine tuning. But that's costly and less controllable than using AI models as a reasoning engine against an external data source, and thus the intersection of retrieval with LLMs has become an increasingly interesting topic. Kelvin Guu, Staff Research Scientist at Google, wants to make machine learning cheaper, easier, and more accessible. Kelvin joins Sarah and Elad this week to talk about the newer methods his team is working on in machine learning, training, and language understanding. He has completed some of the earliest work on retrieval-augmented language models (REALM) and training LLMs to follow instructions (FLAN).

00:00 - Introduction
01:44 - Kelvin’s background in math, statistics and natural language processing at Stanford
03:24 - The questions driving the REALM Paper
07:08 - Frameworks around retrieval augmentation & expert models
10:16 - Why is modularity important
11:36 - FLAN Paper and instruction following
13:28 - Updating model weights in real time and other continuous learning methods
15:08 - Simfluence Paper & explainability with large language models
18:11 - ROME paper, “Model Surgery” exciting research areas
19:51 - Personal opinions and thoughts on AI agents & research
24:59 - How the human brain compares to AGI regarding memory and emotions
28:08 - How models become more contextually available
30:45 - Accessibility of models
33:47 - Advice to future researchers

Sarah Guo (host), Kelvin Guu (guest), Elad Gil (host)

May 4, 2023 · 37m

CHAPTERS

0:00 – 1:52
Kelvin Guu’s path: from math & statistics to NLP (and why Google)
Kelvin describes how an early interest in helping people learn and find information pushed him from math into statistics and then NLP at Stanford. He also explains why joining Google in 2018 was a natural fit, especially as pretrained language modeling (BERT-era) opened new research frontiers.
- Motivation: tools for learning faster and finding information
- Academic trajectory: math → statistics PhD → Stanford NLP group
- Why Google: focus on information access at scale
- Early exposure to pretraining ideas that became BERT
- Realization: pretrained models contain surprising implicit world knowledge
1:52 – 3:21
What drove REALM: capacity, modularity, and freshness without retraining
Sarah tees up REALM as a landmark retrieval-augmented modeling paper, and Kelvin lays out the core motivations. He frames retrieval augmentation as a way to increase memorization capacity, enable modular swapping of knowledge sources, and incorporate rapidly-changing information without costly retraining.
- Three motivations: memorization capacity, modularity, timeliness
- Swapping data sources like a database for business needs
- Updating knowledge for fast-changing domains (sports/news/events)
- Retrieval as a way to use human-interpretable external text
- Ongoing challenges remain even in modern tool-using/citation systems
3:21 – 5:36
REALM architecture: dense retrieval + cross-attention, trained from the LM objective
Kelvin walks through REALM at inference time: embed the input, retrieve nearest documents in a dense vector space, then attend over those documents to make predictions. He then explains the key training idea: learn what to retrieve using the language-modeling objective itself (masked LM in the BERT setting).
- Encode input into a dense vector; retrieve nearest neighbor documents
- Documents are also embedded into the same vector space
- Use cross-attention over retrieved documents during prediction
- Train retrieval usefulness via masked language modeling (BERT-style)
- Retrieve docs that improve fill-in-the-blank performance; suppress unhelpful docs
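The retrieve-then-predict flow described in this chapter can be sketched in a few lines. This is a toy illustration, not REALM's implementation: the two-dimensional vectors, the `reader_table` of reader probabilities, and the function names are all assumptions made up for the example. It shows inner-product retrieval, a softmax over the retrieved scores to get p(z|x), and the marginal likelihood p(y|x) = Σ p(y|x,z)·p(z|x) whose negative log serves as the training loss.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def retrieve(query_vec, doc_vecs, k):
    """Return (doc_index, p(z|x)) pairs for the k highest inner-product documents."""
    scores = [dot(query_vec, d) for d in doc_vecs]
    top = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)[:k]
    probs = softmax([scores[i] for i in top])
    return list(zip(top, probs))

def marginal_likelihood(retrieved, reader_prob):
    """p(y|x) = sum over retrieved docs z of p(y|x,z) * p(z|x)."""
    return sum(p_z * reader_prob(z) for z, p_z in retrieved)

# Toy embeddings (hypothetical): documents and query share one vector space.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query_vec = [1.0, 0.2]

# Hypothetical reader: probability of the correct masked token given document z.
reader_table = {0: 0.9, 1: 0.1, 2: 0.5}

retrieved = retrieve(query_vec, doc_vecs, k=2)
p_y = marginal_likelihood(retrieved, lambda z: reader_table[z])
loss = -math.log(p_y)  # masked-LM style loss; lower when helpful docs score higher
```

Because the loss depends on the retrieval probabilities, minimizing it pushes up the scores of documents that help fill in the blank and pushes down the scores of those that do not, which is the training signal the chapter describes.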
5:36 – 6:42
When retrieval helps vs. when scaling a dense model is enough
Zooming out, Kelvin offers a practical framework: many “common web knowledge” tasks increasingly don’t need retrieval as models scale. Retrieval’s strongest value shifts toward modularity, privacy/personal data separation, and enterprise customization.
- Dense scaling increasingly covers common/public knowledge
- Retrieval becomes most compelling for modularity and personalization
- Enterprise use case: incorporate proprietary or customer-specific info
- Memorization depends on frequency in the pretraining corpus
- Tradeoff framing: model size vs. external knowledge access
6:42 – 8:43
Retrieval-augmented models vs. Mixture-of-Experts: granularity and best-fit use cases
Elad asks about parallels to MoE models, and Kelvin compares them directly. He explains a “Branch-Train-Merge” style MoE approach and contrasts it with retrieval’s finer granularity, highlighting where each shines (factoids vs. deeper domain adaptation like specialized codebases).
- MoE approach: split corpus, train experts, route at inference
- Retrieval goes to individual documents, a finer granularity than experts
- Both partition knowledge, but at different granularities
- Retrieval excels at precise fact lookup (e.g., Wi-Fi password)
- Experts can better capture specialized language/skills (e.g., proprietary code)
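The granularity contrast can be made concrete with a small sketch. Everything here is a stand-in chosen for illustration: the two-domain `corpus`, the smoothed unigram models playing the role of "experts", and the word-overlap retriever are assumptions, not the methods discussed in the episode. Routing picks one expert per corpus shard, while retrieval picks one individual document.

```python
from collections import Counter
import math

# Hypothetical corpus, already split into domain shards.
corpus = {
    "legal": ["the court ruled on the contract", "the contract was void"],
    "code": ["def main return zero", "import os def run"],
}

def train_expert(docs):
    """Stand-in 'expert': unigram counts over one shard."""
    counts = Counter(w for d in docs for w in d.split())
    return counts, sum(counts.values())

def log_likelihood(expert, text, vocab_size):
    counts, total = expert
    # Add-one smoothing so unseen words do not produce log(0).
    return sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in text.split())

vocab = {w for docs in corpus.values() for d in docs for w in d.split()}
experts = {name: train_expert(docs) for name, docs in corpus.items()}

def route(text):
    """Coarse granularity: choose the expert that best explains the input."""
    return max(experts, key=lambda n: log_likelihood(experts[n], text, len(vocab)))

def retrieve_doc(text):
    """Fine granularity: choose the single document with the most word overlap."""
    query = set(text.split())
    all_docs = [d for docs in corpus.values() for d in docs]
    return max(all_docs, key=lambda d: len(query & set(d.split())))
```

For example, `route("def run")` selects the `code` expert as a whole, whereas `retrieve_doc("contract void")` returns the one sentence that contains the fact being looked up.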
8:43 – 10:00
Why modularity matters: adaptation pathways from 2017 to instruction-following
Kelvin argues modularity is practically necessary because most organizations won’t train frontier models from scratch. He traces the evolution from task-specific architectures, to pretrain+fine-tune, to instruction-following—showing how interfaces for adaptation keep changing and why modular design helps.
- Resource reality: most teams must build on top of existing models
- Modularity enables low-effort adaptation and component swapping
- Historical trend: specialized architectures → fine-tuning shared backbones
- Current trend: instruction-following reduces need for bespoke datasets
- Not all needs are solved by prompting; some require deeper adaptation
10:00 – 11:54
FLAN and instruction-following: multitask instruction tuning and surprising generalization
Kelvin explains instruction following via the FLAN paper: train on many tasks with natural-language instructions, then generalize to unseen tasks. He also notes the limits of instruction-following—especially around hallucinations and behaviors that don’t reliably respond to “just prompt it.”
- FLAN: annotate many tasks with instructions; train multitask
- Generalization: adapt to a new (101st) task without seeing it before
- Surprise: far fewer tasks needed than expected (hundreds, not thousands)
- Instruction-following enabled later systems like InstructGPT/ChatGPT
- Limits: hallucinations and other behaviors need fine-tuning/RL/personalization
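As a rough illustration of the data preparation this implies, the sketch below wraps raw task examples in natural-language instruction templates and holds one task out entirely so it can be used to test generalization to unseen instructions. The template wording, the task names, and the helper functions are hypothetical and are not taken from the FLAN codebase.

```python
# Hypothetical instruction templates, one per task.
TEMPLATES = {
    "sentiment": "Is the sentiment of this review positive or negative?\n\n{text}",
    "translation": "Translate the following sentence to French:\n\n{text}",
    "nli": ("Premise: {premise}\nHypothesis: {hypothesis}\n"
            "Does the premise entail the hypothesis?"),
}

def to_instruction_example(task, fields, target):
    """Render one raw example as an (instruction input, target) pair."""
    return {"task": task, "input": TEMPLATES[task].format(**fields), "target": target}

def build_mixture(raw_examples, held_out_task):
    """Split into a multitask training mixture and an unseen evaluation task."""
    train, held_out = [], []
    for task, fields, target in raw_examples:
        example = to_instruction_example(task, fields, target)
        (held_out if task == held_out_task else train).append(example)
    return train, held_out

raw_examples = [
    ("sentiment", {"text": "Loved every minute."}, "positive"),
    ("translation", {"text": "Good morning."}, "Bonjour."),
    ("nli", {"premise": "A dog runs.", "hypothesis": "An animal moves."}, "yes"),
]

train_set, unseen_set = build_mixture(raw_examples, held_out_task="nli")
```

The model would be fine-tuned on `train_set` only and then evaluated on `unseen_set`, whose instruction format it has never seen during tuning.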
11:54 – 13:33
Continuous learning in production: why real-time weight updates are rare, and a prompt-tuning alternative
Elad asks about updating weights over time; Kelvin discusses continuous learning and why it’s less common in production due to validation and maintenance risks. He then describes prompt tuning as a practical compromise: freeze the model, but update a small learned prompt embedding over time.
- Continuous learning has operational costs: harder validation and rollback
- Many products can tolerate weekly/monthly model release cycles
- Ongoing learning still matters, but implementations must be safe/manageable
- Prompt tuning: freeze core model; optimize soft prompts via gradients
- Modular “plug in/out” prompts provide lightweight, updateable behavior changes
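The freeze-the-model, train-the-prompt idea can be shown with a deliberately tiny stand-in. Here the "frozen model" is just a fixed linear function of a soft prompt vector and an input vector, which is an assumption made for readability; real prompt tuning prepends learned embeddings to a transformer's input. The point of the sketch is only that gradient updates touch the prompt and never the model weights.

```python
# Frozen "model" weights (hypothetical values); these are never updated.
FROZEN_W_PROMPT = [0.5, -1.0, 2.0]
FROZEN_W_INPUT = [1.0, 1.0]

def frozen_model(prompt, x):
    """Linear stand-in for a frozen network that reads [soft prompt; input]."""
    prompt_part = sum(w * p for w, p in zip(FROZEN_W_PROMPT, prompt))
    input_part = sum(w * v for w, v in zip(FROZEN_W_INPUT, x))
    return prompt_part + input_part

def tune_prompt(data, steps=200, lr=0.05):
    """Gradient descent on squared error with respect to the prompt only."""
    prompt = [0.0, 0.0, 0.0]
    for _ in range(steps):
        for x, target in data:
            err = frozen_model(prompt, x) - target
            # d(err^2)/d(prompt_i) = 2 * err * FROZEN_W_PROMPT[i]
            prompt = [p - lr * 2 * err * w for p, w in zip(prompt, FROZEN_W_PROMPT)]
    return prompt

# New behavior to learn: shift every output up by 1 relative to the frozen model.
data = [([1.0, 2.0], 4.0), ([0.0, 1.0], 2.0)]
soft_prompt = tune_prompt(data)
```

Swapping `soft_prompt` for a different learned vector changes behavior without redeploying the underlying model, which is the "plug in/out" property mentioned in the last bullet.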
13:33 – 16:20
Simfluence and training data attribution: tracing behaviors back to examples
Kelvin introduces Simfluence within the broader goal of training data attribution: identifying which training examples caused a model behavior. He explains the ideal-but-impractical approach (remove an example and retrain) and Simfluence’s approximation via a lightweight simulator of training dynamics.
- Training data attribution asks: which examples taught a specific behavior?
- Hard for LLMs due to long training and complex generalization
- Gold standard: remove an example, retrain, and compare behavior (too costly)
- Simfluence: lightweight model that simulates training effects from prior runs
- Use cases: find valuable data to collect; distinguish generalization vs. memorization
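A minimal sketch of what such a simulator might look like follows. It assumes, as a simplification for illustration, that each training example changes a test example's loss by a multiplicative factor and an additive offset when it is consumed; the `INFLUENCE` values are invented here, whereas a real simulator would fit them from logged training runs. With the simulator in hand, the "remove an example and retrain" comparison becomes a cheap replay instead of a full training run.

```python
# Hypothetical per-example parameters (alpha, beta): loss <- alpha * loss + beta.
INFLUENCE = {
    "ex_a": (0.90, 0.00),  # shrinks the test loss: a helpful example
    "ex_b": (1.00, 0.05),  # adds to the test loss: a harmful example
    "ex_c": (1.00, 0.00),  # leaves the test loss unchanged
}

def simulate_loss(initial_loss, curriculum):
    """Replay a training order and return the simulated test-loss trajectory."""
    loss = initial_loss
    trajectory = [loss]
    for example in curriculum:
        alpha, beta = INFLUENCE[example]
        loss = alpha * loss + beta
        trajectory.append(loss)
    return trajectory

def removal_effect(initial_loss, curriculum, example):
    """Simulated counterfactual: final loss without the example minus with it.

    A positive value means removing the example would raise the test loss,
    so the example was helping.
    """
    with_example = simulate_loss(initial_loss, curriculum)[-1]
    reduced = [e for e in curriculum if e != example]
    without_example = simulate_loss(initial_loss, reduced)[-1]
    return without_example - with_example

curriculum = ["ex_a", "ex_b", "ex_c", "ex_a"]
effects = {ex: removal_effect(2.0, curriculum, ex) for ex in INFLUENCE}
```

Ranking training examples by `effects` gives a rough answer to the attribution question in the first bullet: which examples most changed the behavior being measured.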
16:20 – 18:06
Model surgery and ROME: editing factual knowledge inside weights (and why it’s exciting)
Kelvin highlights “model surgery” as a research direction: directly editing parameters to change what a model believes. Using ROME as an example, he explains how a targeted edit can propagate through related facts, and why this offers a different kind of modularity than retrieval or MoE.
- Model surgery: edit weights to change knowledge of specific facts
- ROME example: edit the stored fact ‘Eiffel Tower is in Paris’ so the model answers ‘Rome’
- Edits can propagate to related queries (downstream consistency)
- Intuition: weight matrices behave like lookup tables for queries
- Promise: modular, controllable updates while preserving broad coverage
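The lookup-table intuition can be illustrated with a rank-one edit on a small matrix. This is a simplified sketch: ROME itself uses a covariance-weighted update inside a transformer MLP layer, while the version below uses the plain update W' = W + (v* − W k) kᵀ / (kᵀ k), and the key and value vectors are made up for the example. After the edit, the chosen key maps to the new value, and keys orthogonal to it are left untouched.

```python
def matvec(W, k):
    """Look up the value the weight matrix W associates with key k."""
    return [sum(w * x for w, x in zip(row, k)) for row in W]

def rank_one_edit(W, key, new_value):
    """Return an edited copy of W such that matvec(W_edited, key) == new_value."""
    current = matvec(W, key)
    residual = [n - c for n, c in zip(new_value, current)]
    norm_sq = sum(x * x for x in key)
    return [
        [w + r * x / norm_sq for w, x in zip(row, key)]
        for row, r in zip(W, residual)
    ]

# Hypothetical toy setup: one key stands for the subject being edited,
# the other for an unrelated subject.
W = [[1.0, 0.0], [0.0, 1.0]]
eiffel_key = [1.0, 0.0]
other_key = [0.0, 1.0]
rome_value = [0.0, 5.0]

W_edited = rank_one_edit(W, eiffel_key, rome_value)
```

Looking up `eiffel_key` in `W_edited` now returns `rome_value`, while the lookup for `other_key` is the same as before the edit, which mirrors the promise of targeted updates that preserve the rest of the model's knowledge.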
18:06 – 23:19
Agents, memory, and human-like constraints: why today’s workflows feel brittle
Shifting to AGI-adjacent topics, Kelvin critiques current autonomous-agent patterns: breaking tasks into prompt-sized steps plus an external memory store. He argues this can lock in wrong subgoals and lacks the human-like progression from explicit reasoning to distilled instincts (plus pruning/consolidation).
- Current agent paradigm: decompose tasks + bridge with external memory
- Risk: incorrect intermediate goals become “canonical” and persist
- Humans rely on chunking and learned instincts, not only explicit reasoning
- Today’s systems overuse explicit chain-of-thought/introspection loops
- Missing analogue: memory consolidation/pruning of thoughts over time
23:19 – 26:26
Human brain analogies: emotions (fear), introspection, and avoiding runaway behavior
Elad connects the discussion to neuroscience; Kelvin speculates on giving models something like an emotional ‘fear’ signal, an idea explored in the RL literature, so that they avoid irreversible states. They also discuss human introspection mechanisms that prevent absurd loops, contrasting them with brittle LLM behaviors and patchy prompting-based fixes.
- Speculation: add ‘fear’/uncertainty signals to avoid irreversible states
- RL literature explores mechanisms akin to fear (state-avoidance)
- Potential downside: importing human-like issues alongside benefits
- Example failure mode: consistency bias and runaway counting loops
- Need for less-explicit, more associative/intuitive self-correction
26:26 – 27:20
Making models contextually available: from chat prompts to ambient assistance
Kelvin describes a shift from users crafting heavy prompts in a chat UI to systems that understand context automatically. He imagines models embedded in workflows (e.g., browsers) that infer intent from recent activity and help without requiring users to restate everything.
- Current barrier: summarizing context into a good prompt is cognitively costly
- Goal: models accessible in more contexts, not just a chat window
- Ambient help: model observes recent workflow and infers user intent
- Example: browser-based assistance after struggling with a task
- Open challenge: robust context capture and intent inference
27:20 – 37:17
Knowledge representation, accessibility, and advice for researchers (plus careers in an LLM world)
Kelvin reflects on knowledge representations: knowledge bases excel as canonical ground truth, while text offers coverage but makes edits hard—leading to a centralization vs. coverage tradeoff. He then discusses accessibility of training/adaptation interfaces and closes with advice: set more ambitious goals, learn from product use cases, and keep digging into how models work—even as LLMs shift which human skills matter.
- Knowledge bases vs. text: canonical ground truth vs. broad coverage
- Editing challenge: facts appear in many places; centralization tradeoffs
- Dense models dominate; opportunity is making them more controllable/manageable
- Accessibility: data quality and curation are major bottlenecks for training
- Research advice: raise ambition, follow product needs, and deepen mechanistic understanding; future value shifts toward problem formulation and validation
