YC Root Access

Context Engineering for Engineers

Jeff Huber, founder of Chroma, shares why building with large language models isn't just about prompts or RAG: it's about context. He explains how deciding what goes into the context window shapes reliability, why performance drops with long inputs, and how careful filtering and compaction can make AI systems faster and more useful.

Aug 25, 2025 · 11m · Watch on YouTube ↗

CHAPTERS

  1. 0:00 – 0:26

    Why “context engineering” beats the buzzwords

    Jeff (founder of Chroma) frames the talk for an engineering audience and argues that “context engineering” is a clearer, more durable concept than “prompt engineering” or the many flavors of RAG. He positions the field as practical software engineering rather than mysticism.

  2. 0:26 – 1:29

    AI systems are just programs: inputs, tools, and a context window

    The model is presented as a program: instructions + relevant information/tools + user input go in, output comes out. This framing sets up context as the primary engineering surface area for controlling behavior and reliability.

  3. 1:29 – 2:02

    Defining context engineering: deciding what goes in the window

    Jeff gives a simple definition: context engineering is the act of deciding what information ends up in the context window for a given turn. This includes prompts, retrieved documents, tool outputs, and any other task-relevant state.
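That definition can be made concrete with a minimal sketch: each turn, the final prompt is assembled from instructions, retrieved documents, tool outputs, and the user's message. All names and the `[doc]`/`[tool]` tagging scheme below are hypothetical illustrations, not anything Jeff prescribes.

```python
def build_context(instructions: str,
                  retrieved_docs: list[str],
                  tool_outputs: list[str],
                  user_input: str) -> str:
    """Assemble one context window from the pieces selected this turn."""
    parts = [instructions]
    parts += [f"[doc] {d}" for d in retrieved_docs]
    parts += [f"[tool] {t}" for t in tool_outputs]
    parts.append(f"[user] {user_input}")
    return "\n\n".join(parts)

prompt = build_context(
    "You are a support assistant.",
    retrieved_docs=["Refund policy: 30 days."],
    tool_outputs=["order_status: shipped"],
    user_input="Can I return my order?",
)
```

The point of the framing is that everything between the instructions and the user input is a deliberate selection, not whatever happened to be lying around.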

  4. 2:02 – 2:31

    Reliability first: make it work, then fast, then cheap

    The shared goal is reliable software that is also fast and inexpensive. Jeff suggests most teams are still in the “make it work” phase, where correct context selection is the main lever to achieve dependable behavior.

  5. 2:31 – 3:07

    Why “just use long context” doesn’t solve it (yet)

    Despite announcements of million- or multi-million-token context windows, Jeff argues long context is not a practical solution today. Engineers should focus on what works now rather than betting on future context scaling.

  6. 3:07 – 3:57

    Chroma technical report: performance drops as token length grows

    Jeff cites Chroma’s report showing model performance can fall sharply as input length increases, even on tasks that seem easy for humans. The takeaway is that more tokens can reduce accuracy, not increase it.

  7. 3:57 – 5:08

    Needle-in-a-haystack is misleading: easy attention, near-zero reasoning

    He critiques needle-in-a-haystack as a benchmark: it succeeds because the model only needs to find one salient snippet and do almost no reasoning. Real applications often require attending to more of the context and performing harder reasoning.

  8. 5:08 – 6:05

    A task map: attention required vs. reasoning required

    Jeff describes plotting tasks by (1) how much of the context the model must attend to and (2) how difficult the reasoning is. He claims many valuable uses—agents and summarization included—live in the “harder” regions, exposing limits of long, uncurated context.

  9. 6:05 – 6:44

    Focused context beats full context: curate aggressively

Using LongMemEval-style comparisons, Jeff argues that giving models a smaller, curated “focused” context can dramatically improve performance over dumping the full context. The practical guidance: curate context as a first-class optimization.

  10. 6:44 – 7:31

    The core loop: Find relevant info, remove irrelevant, optimize what remains

    Jeff reduces context engineering to three goals: retrieve relevant information, discard distractions, and optimize/format what remains for model consumption. Each model call is essentially a selection problem over the universe of possible information.

  11. 7:31 – 8:26

    Gather & Glean: maximize recall, then maximize precision

    He introduces “Gather and Glean” as a two-stage pipeline: first over-collect (high recall), then filter and refine (high precision). This mirrors classic ML/IR thinking and fits modern LLM retrieval workflows.
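A hedged sketch of that two-stage shape: over-collect candidates from several sources (high recall), then score and keep only the best few (high precision). The keyword-overlap scorer below is a toy stand-in; a real glean stage would use embeddings, a reranker, or an LLM judge, as discussed later in the talk.

```python
def gather(sources: list[list[str]]) -> list[str]:
    """Stage 1 (recall): union everything the sources return, deduplicated."""
    seen, pool = set(), []
    for source in sources:
        for doc in source:
            if doc not in seen:
                seen.add(doc)
                pool.append(doc)
    return pool

def glean(query: str, pool: list[str], k: int = 2) -> list[str]:
    """Stage 2 (precision): keep only the k candidates most relevant to the query."""
    q_terms = set(query.lower().split())
    def score(doc: str) -> int:
        # Toy relevance: count of query terms appearing in the document.
        return len(q_terms & set(doc.lower().split()))
    return sorted(pool, key=score, reverse=True)[:k]

pool = gather([
    ["How to reset your password", "Billing FAQ"],
    ["Password reset link expired", "Office hours"],
])
focused = glean("reset my password", pool)
```

The asymmetry is the point: gathering is cheap and permissive, gleaning is strict, and only the gleaned survivors reach the context window.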

  12. 8:26 – 9:35

    Gathering inputs: structured, unstructured, tools, and history

    Jeff enumerates common data pools used during gather: SQL/structured stores, vector DBs, APIs/tools, local files, web search, and chat history. The key is that many sources may be relevant, but only some should survive into the final prompt.

  13. 9:35 – 10:13

    Gleaning methods: ranking, reranking, and “brute-force” LLM curation

    He surveys techniques for filtering the gathered pool: top‑K similarity, reciprocal rank fusion, learning-to-rank, rerank models, and increasingly LLM-based judging. He notes a trend toward using many cheap models in parallel—“cheating at search”—to improve curation instead of over-optimizing retrieval heuristics.
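Of the techniques listed, reciprocal rank fusion (RRF) is simple enough to sketch: merge several ranked candidate lists by summing `1 / (k + rank)` for each document across lists, so items ranked well by multiple retrievers rise to the top. `k = 60` is the conventional constant; the document lists here are toy data.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each list contributes 1/(k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# "b" appears near the top of both lists, so it fuses to rank 1
# even though neither retriever ranked it first.
fused = rrf([["a", "b", "c"], ["b", "c", "d"]])
```

This is why fusion-style gleaning is attractive: it needs no training and no score calibration across retrievers, only their rank orders.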

  14. 10:13 – 10:57

    Context engineering for agents: loops, sub-agents, and exploding histories

    Agents repeat the gather/glean process many times inside loops, often with sub-agents and orchestration. This makes prompt/tool logs and conversation history a dominant part of the context window, creating scale and readability challenges.

  15. 10:57 – 11:16

    What helps agents learn: failures > successes, plus the compaction problem

    Jeff reports a notable finding: providing past failure cases can improve agent performance (helping escape local minima), while past success cases can cause lazy pattern matching. He closes by emphasizing compaction (distilling history for the next turn) as a critical but currently unsolved lever—naive summaries often don’t help, while better prompting for compaction can.
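The mechanism behind compaction can be illustrated with a deliberately naive sketch: before the next turn, distill a long agent history rather than replaying it verbatim, keeping failures in full (per the finding above) while reducing other entries to short notes. The record format and truncation rule are assumptions for illustration only, and the talk's caveat applies: naive summarization like this often does not help on its own.

```python
def compact(history: list[dict], max_chars: int = 80) -> list[str]:
    """Distill history for the next turn: failures verbatim, the rest truncated."""
    compacted = []
    for entry in history:
        if entry["outcome"] == "failure":
            # Keep failures in full: they help the agent escape local minima.
            compacted.append(f"FAILED: {entry['detail']}")
        else:
            # Successes get only a short note, to avoid lazy pattern matching.
            compacted.append(f"ok: {entry['detail'][:max_chars]}")
    return compacted

history = [
    {"outcome": "success", "detail": "fetched user record via API"},
    {"outcome": "failure", "detail": "SQL query timed out on orders table"},
]
notes = compact(history)
```

In practice the interesting work is in the distillation step itself, which is exactly the part Jeff describes as unsolved.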
