CHAPTERS
- 0:00 – 0:26
Why “context engineering” beats the buzzwords
Jeff (founder of Chroma) frames the talk for an engineering audience and argues that “context engineering” is a clearer, more durable concept than “prompt engineering” or the many flavors of RAG. He positions the field as practical software engineering rather than mysticism.
- 0:26 – 1:29
AI systems are just programs: inputs, tools, and a context window
The model is presented as a program: instructions + relevant information/tools + user input go in, output comes out. This framing sets up context as the primary engineering surface area for controlling behavior and reliability.
- 1:29 – 2:02
Defining context engineering: deciding what goes in the window
Jeff gives a simple definition: context engineering is the act of deciding what information ends up in the context window for a given turn. This includes prompts, retrieved documents, tool outputs, and any other task-relevant state.
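The definition above can be sketched as a small assembly step. This is an illustrative sketch, not anything shown in the talk: the function name, the token budget, and the 4-characters-per-token heuristic are all assumptions.

```python
# Hypothetical sketch of "deciding what goes in the window": assemble one
# turn's context from instructions, retrieved documents, tool outputs, and
# the user message, under a token budget. The ~4-chars-per-token estimate
# is a rough illustrative heuristic, not a real tokenizer.

def assemble_context(instructions: str, retrieved: list[str],
                     tool_outputs: list[str], user_input: str,
                     budget_tokens: int = 8000) -> str:
    def est_tokens(s: str) -> int:
        return len(s) // 4  # crude heuristic; use a real tokenizer in practice

    parts = [instructions, user_input]  # always included
    remaining = budget_tokens - sum(est_tokens(p) for p in parts)
    for chunk in retrieved + tool_outputs:  # add extras until budget runs out
        cost = est_tokens(chunk)
        if cost <= remaining:
            parts.insert(1, chunk)  # keep the user message last
            remaining -= cost
    return "\n\n".join(parts)

context = assemble_context(
    "You are a helpful assistant.",
    ["Chroma report: performance drops with input length"],
    ["tool: 3 results found"],
    "Summarize the report.",
)
```

The point is only that every turn's window is an explicit, budgeted selection rather than an accumulation.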
- 2:02 – 2:31
Reliability first: make it work, then fast, then cheap
The shared goal is reliable software that is also fast and inexpensive. Jeff suggests most teams are still in the “make it work” phase, where correct context selection is the main lever to achieve dependable behavior.
- 2:31 – 3:07
Why “just use long context” doesn’t solve it (yet)
Despite announcements of million- or multi-million-token context windows, Jeff argues long context is not a practical solution today. Engineers should focus on what works now rather than betting on future context scaling.
- 3:07 – 3:57
Chroma technical report: performance drops as token length grows
Jeff cites Chroma’s report showing model performance can fall sharply as input length increases, even on tasks that seem easy for humans. The takeaway is that more tokens can reduce accuracy, not increase it.
- 3:57 – 5:08
Needle-in-a-haystack is misleading: easy attention, near-zero reasoning
He critiques needle-in-a-haystack as a benchmark: it succeeds because the model only needs to find one salient snippet and do almost no reasoning. Real applications often require attending to more of the context and performing harder reasoning.
- 5:08 – 6:05
A task map: attention required vs. reasoning required
Jeff describes plotting tasks by (1) how much of the context the model must attend to and (2) how difficult the reasoning is. He claims many valuable uses—agents and summarization included—live in the “harder” regions, exposing limits of long, uncurated context.
- 6:05 – 6:44
Focused context beats full context: curate aggressively
Using LongMemEval-style comparisons, Jeff argues that giving models a smaller, curated "focused" context can dramatically improve performance over dumping in the full context. The practical guidance: treat context curation as a first-class optimization.
- 6:44 – 7:31
The core loop: Find relevant info, remove irrelevant, optimize what remains
Jeff reduces context engineering to three goals: retrieve relevant information, discard distractions, and optimize/format what remains for model consumption. Each model call is essentially a selection problem over the universe of possible information.
- 7:31 – 8:26
Gather & Glean: maximize recall, then maximize precision
He introduces “Gather and Glean” as a two-stage pipeline: first over-collect (high recall), then filter and refine (high precision). This mirrors classic ML/IR thinking and fits modern LLM retrieval workflows.
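The two-stage idea can be sketched as follows. Everything here is a toy assumption for illustration: the scoring is naive keyword overlap standing in for embeddings or a reranker, and the function names are invented.

```python
# Hypothetical "gather and glean" sketch: over-collect for recall, then
# filter for precision. A real system would query each source with the
# query and score with embeddings or a reranker, not keyword overlap.

def gather(query: str, sources: dict[str, list[str]], per_source: int = 20) -> list[str]:
    """Stage 1: pull a generous candidate pool from every source (high recall)."""
    pool = []
    for name, docs in sources.items():
        pool.extend(docs[:per_source])  # toy: just take top candidates per source
    return pool

def glean(query: str, pool: list[str], k: int = 5) -> list[str]:
    """Stage 2: keep only the best candidates (high precision)."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(d.lower().split())), d) for d in pool]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [d for score, d in scored[:k] if score > 0]

sources = {
    "vector_db": ["context engineering decides what enters the window",
                  "unrelated marketing copy"],
    "web":       ["long context windows degrade model performance"],
}
candidates = gather("context window engineering", sources)
final = glean("context window engineering", candidates, k=2)
```

The design point is the asymmetry: gathering errs toward including too much, and gleaning bears the burden of cutting it down.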
- 8:26 – 9:35
Gathering inputs: structured, unstructured, tools, and history
Jeff enumerates common data pools used during gather: SQL/structured stores, vector DBs, APIs/tools, local files, web search, and chat history. The key is that many sources may be relevant, but only some should survive into the final prompt.
- 9:35 – 10:13
Gleaning methods: ranking, reranking, and “brute-force” LLM curation
He surveys techniques for filtering the gathered pool: top-K similarity, reciprocal rank fusion, learning-to-rank, rerank models, and, increasingly, LLM-based judging. He notes a trend toward using many cheap models in parallel ("cheating at search") to improve curation instead of over-optimizing retrieval heuristics.
- 10:13 – 10:57
Context engineering for agents: loops, sub-agents, and exploding histories
Agents repeat the gather/glean process many times inside loops, often with sub-agents and orchestration. This makes prompt/tool logs and conversation history a dominant part of the context window, creating scale and readability challenges.
- 10:57 – 11:16
What helps agents learn: failures > successes, plus the compaction problem
Jeff reports a notable finding: providing past failure cases can improve agent performance (helping it escape local minima), while past success cases can encourage lazy pattern matching. He closes by emphasizing compaction (distilling history for the next turn) as a critical but still-unsolved lever: naive summaries often don't help, while better prompting for compaction can.
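One way to combine these two observations is a compaction step that compresses routine history but preserves failures verbatim. This is a speculative sketch, not the talk's method: the prompt wording, the history schema, and the `llm` callable are all assumptions.

```python
# Hypothetical compaction sketch: distill a long agent history into a short
# state block for the next turn. Per the reported finding, failed steps are
# kept verbatim (they help escape local minima) while routine successes are
# compressed. `llm` is an assumed callable wrapping whatever model API is used.

COMPACTION_PROMPT = """Summarize the agent history below for the next turn.
Keep: open goals, decisions made, and unresolved questions.
Quote every failed action verbatim, with its error message.
Drop: successful routine steps and redundant tool output.

History:
{history}"""

def compact_history(history: list[dict], llm) -> str:
    failures = [h for h in history if h.get("status") == "failure"]
    transcript = "\n".join(
        f"[{h['status']}] {h['action']}: {h['result']}" for h in history
    )
    summary = llm(COMPACTION_PROMPT.format(history=transcript))
    # Belt and braces: re-append failures even if the summary omitted them.
    failure_log = "\n".join(f"FAILED: {h['action']} -> {h['result']}" for h in failures)
    return summary + ("\n\n" + failure_log if failure_log else "")
```

Whether this concrete recipe helps is an open question, which is exactly the talk's closing point: compaction quality, not just compaction, is the lever.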