Context Engineering for Engineers

Jeff Huber, founder of Chroma, shares why building with large language models isn’t just about prompts or RAG; it’s about context. He explains how deciding what goes into the context window shapes reliability, why performance drops with long inputs, and how careful filtering and compaction can make AI systems faster and more useful.

Chapters:
00:00 - Introduction to Context Engineering
00:26 - Understanding AI Systems as Programs
01:29 - The Concept of Context Engineering
02:02 - Building Reliable Software with AI
02:31 - Challenges with Long Contexts
03:07 - Chroma's Technical Report Insights
03:57 - Needle in a Haystack Problem
05:08 - The Importance of Context in AI Tasks
06:05 - Gather and Glean Model
06:44 - Data Gathering Techniques
07:31 - Gleaning and Optimizing Data
08:26 - Content Engineering for Agents
09:35 - Challenges with Agent Performance
10:13 - The Role of Compaction
10:57 - Conclusion and Final Thoughts

Aug 24, 2025 · 11m

WHAT IT’S REALLY ABOUT

How to curate AI context for reliable engineering and agents

LLMs should be treated like software programs whose outputs depend heavily on the instruction set, tools, and information placed in the context window.
Very long context windows are not a reliable solution today because performance can degrade sharply as token length increases, even on simple tasks.
“Needle-in-a-haystack” tests overstate long-context capability because they require minimal reasoning and attention to only a tiny portion of the input.
Curating and focusing context (versus dumping full context) can produce large performance gains, motivating an explicit process for finding, removing, and optimizing information.
For agents, repeated gather/glean cycles and massive histories make compaction critical, and naive summarization often provides little benefit without smarter prompts and strategies.
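The compaction point above can be made concrete with a small sketch. This is not the approach from the talk or from Chroma; it is the naive baseline the talk says often helps little, shown so the moving parts are visible. The names `compact`, `count_tokens`, and `naive_summary` are hypothetical, and whitespace word count is assumed as a rough stand-in for model tokens.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    # Assumption: whitespace word count as a rough proxy for model tokens.
    return len(text.split())


def compact(
    history: list[str],
    budget: int,
    keep_recent: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Return history unchanged if it fits the budget; otherwise replace
    all but the last `keep_recent` turns with one summary entry."""
    if sum(count_tokens(turn) for turn in history) <= budget:
        return history
    split = max(len(history) - keep_recent, 0)
    old, recent = history[:split], history[split:]
    if not old:
        return recent
    return [summarize(old)] + recent


def naive_summary(turns: list[str]) -> str:
    # Placeholder: a real system would call a model with a task-specific prompt.
    return f"[summary of {len(turns)} earlier turns]"


history = [
    "user: find the failing test",
    "tool: ran 212 tests, 1 failed in test_cache.py",
    "assistant: the failure is in test_ttl_expiry",
    "user: fix it",
]
compacted = compact(history, budget=10, keep_recent=2, summarize=naive_summary)
```

Passing `summarize` as a parameter keeps the budget logic separate from the summarization strategy, which is the part the talk suggests needs smarter prompts than a generic "summarize this" call.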

IDEAS WORTH REMEMBERING

5 ideas

Context engineering is simply deciding what goes in the context window.

It includes prompts, retrieved knowledge, tool outputs, and history—anything the model will condition on for the current turn.
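As a minimal sketch of that definition, the context window can be modeled as an explicit list of labeled parts that the program chooses to include. The `ContextPart` type, the `build_context` function, and the bracketed section labels are all assumptions made for illustration; they are not from the talk or any particular library.

```python
from dataclasses import dataclass


@dataclass
class ContextPart:
    kind: str  # e.g. "instructions", "retrieved", "tool_output", "history"
    text: str


def build_context(parts: list[ContextPart]) -> str:
    """Join the selected, non-empty parts into the single string the model sees."""
    sections = [f"[{p.kind}]\n{p.text}" for p in parts if p.text.strip()]
    return "\n\n".join(sections)


parts = [
    ContextPart("instructions", "Answer using only the provided notes."),
    ContextPart("retrieved", "Note 1: The cache TTL is 300 seconds."),
    ContextPart("tool_output", ""),  # an empty tool result is dropped
    ContextPart("history", "User: What is the cache TTL?"),
]
context = build_context(parts)
```

Treating each part as data makes the selection decision visible in code: anything not placed in `parts` is something the model cannot condition on for this turn.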

Longer context is not equivalent to better performance.

Chroma’s results suggest model performance can drop markedly as token length grows, so “just add more tokens” can reduce reliability on real tasks.

Needle-in-a-haystack success is a weak proxy for real workloads.

These tests require attending to a tiny “needle” with near-zero reasoning, unlike summarization, multi-document synthesis, or agentic tasks that need broad attention and deeper reasoning.

Focused context can outperform full context by a wide margin.

When the model gets only the most relevant information (even via oracle/human curation in evaluations), accuracy improves significantly—implying that context selection is a primary lever.

Use a two-stage “Gather then Glean” process for context quality.

First maximize recall (collect broadly, tolerate noise), then maximize precision (prune distractions) to deliver a small, high-signal context payload.
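The two stages can be sketched as two functions: a permissive `gather` and a strict `glean`. This is an illustrative toy, not the pipeline described in the talk. Plain term overlap stands in for whatever a real system would use (for example vector search to gather and a reranker to glean), and the function names, corpus, and `keep` parameter are all hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    # Lowercased alphanumeric terms; a stand-in for real text analysis.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def gather(query: str, corpus: list[str]) -> list[str]:
    """Recall stage: keep any document sharing at least one term with the query."""
    q = tokens(query)
    return [doc for doc in corpus if q & tokens(doc)]


def glean(query: str, candidates: list[str], keep: int) -> list[str]:
    """Precision stage: rank candidates by term overlap and keep the top few."""
    q = tokens(query)
    ranked = sorted(candidates, key=lambda d: len(q & tokens(d)), reverse=True)
    return ranked[:keep]


corpus = [
    "Chroma stores embeddings for retrieval.",
    "Retrieval quality depends on reranking candidate documents.",
    "The office is closed on public holidays.",
    "Reranking with a language model can prune distracting documents.",
]
query = "reranking documents for retrieval"
candidates = gather(query, corpus)           # broad, tolerates noise
focused = glean(query, candidates, keep=2)   # small, high-signal payload
```

Here `gather` lets through a document that matches only loosely, and `glean` is what decides how much of that reaches the model.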

WORDS WORTH SAVING

5 quotes

And though people may want to sell this to you as, uh, a techno machine god, um, we believe it is ultimately just software.

— Jeff

What is Context Engineering? It is quite simply deciding what's in the context window. It's that simple.

— Jeff

So broadly speaking, the goal of context engineering is to, number one, find the relevant information, number two, remove the irrelevant information, and then number three, optimize the relevant information.

— Jeff

But our assertion is that most interesting things people are doing with language models today require either more context or more reasoning or both.

— Jeff

And what we find is that like today's approaches don't really work.

— Jeff

Context engineering vs prompt engineering vs RAG framing
Long-context performance degradation findings (Chroma report)
Limitations of needle-in-a-haystack evaluations
Focused (curated) context vs full-context performance
Gather and Glean model (recall then precision)
Retrieval pipelines: query planning, probes, reranking, LLM-as-reranker
Agent loops, history growth, and compaction/summarization challenges

High quality AI-generated summary created from speaker-labeled transcript.
