- 0:00 – 0:26
Introduction to Context Engineering
- Speaker
So tonight I want to share some thoughts about Context Engineering for Engineers, this audience. By way of context, my name is Jeff, and I'm the founder of Chroma. Chroma, for those of you who don't know, builds a search and retrieval database. Numerous illustrious speakers have shouted out Chroma, so thank you very much. But let's get into some meat
- 0:26 – 1:29
Understanding AI Systems as Programs
- Speaker
here. You know, one way we think about what's really happening inside an AI system is that it is ultimately just a program. You have your instruction set, the relevant information and tools, you have your user input, that's the part that changes, and then you put it into this magic box, and an output comes out the other end. And yeah, this is very much just a program. And though people may want to sell this to you as a techno machine god, we believe it is ultimately just software. So I will assert, and Jake and I can get into fisticuffs later on this, that Context Engineering is a much better term than Prompt Engineering or RAG. There are a lot of buzzwords flying around in the AI space, and every week there's some new AI thought boy with a head-explode emoji going crazy over some new technique. They tell you about the eighteen different kinds of RAG you need to know about. Just stop. Mute them. Your life will be better. And I think, as evidenced by a lot of what you've heard tonight, that's very true.
- 1:29 – 2:02
The Concept of Context Engineering
- Speaker
What is Context Engineering? It is quite simply deciding what's in the context window. It's that simple. That includes the prompt, and it may include retrieval, depending on the use case. But Context Engineering is the right term, I believe, and I think it's great because Context Engineering implies the existence of context engineers, people whose job it is to make this really good. Maybe Context Engineering even implies the existence of a context engine. I'm not going to talk about that tonight, but you can go home and think about what that might mean.
- 2:02 – 2:31
Building Reliable Software with AI
- Speaker
So really, what is our shared goal? Our shared goal is to build reliable software. This new software has some new abilities and primitives that prior software didn't, and that can be pretty useful. We believe AI can be useful if you give it the right context. Ideally, these systems are reliable, fast, and cheap. I do believe that in general we should all take a make-it-work, make-it-fast, make-it-cheap type of approach, and probably today most people are still on stage one: how do we make it reliable?
- 2:31 – 3:07
Challenges with Long Contexts
- Speaker
Okay, so why don't we just use long context? Anthropic announced their million-token context window model a couple of days ago. Famously, there was a certain language model lab that released a model with ten million tokens. This is amazing. And then you've seen these startups raise half a billion dollars for a hundred million, a billion, infinite tokens. Nice. Well, unfortunately, that doesn't work yet. And who knows? Maybe it'll never work. Maybe it'll work next year. We don't know yet. But since the room is full of engineers and builders, we want to know what works today.
- 3:07 – 3:57
Chroma's Technical Report Insights
- Speaker
So Chroma put out a technical report about a month ago. Kelly, who did a lot of the work on it, is in the audience. Shout out, Kelly. [audience cheering] Yep. I think the video has now done about a hundred and twenty thousand views on YouTube, so it is doing pretty well. And what we demonstrated in this technical report is this: across simple tasks that you would think a human should do pretty well, in this case a task to repeat back a certain set of words, model performance as a function of input token length goes down precipitously. It's clipped off the bottom here, but I believe the blue dot at the far bottom right-hand corner is ten thousand tokens. So model performance, and I've heard a couple of other numbers referenced tonight, forty percent, a hundred and seventy K, across some tasks seems to degrade much, much sooner than that.
- 3:57 – 5:08
Needle in a Haystack Problem
- Speaker
Now, of course, the way the labs usually substantiate these context windows being useful is they'll tell you about needle in a haystack. Needle in a haystack is solved across all the different token dimensions. Great. But what we want to point out is that needle in a haystack is a very easy task. On the screen I have an example of both a needle and a haystack. You'll notice that, number one, the model only has to pay attention to the needle, by definition. It doesn't have to pay attention to most of the context window, only the needle. And number two, the reasoning power required is basically zero. I'll read it out loud. The question is: what was the best writing advice I got from my college classmate? The needle is: the best writing advice I got from my college classmate was to write every week. Imagine the reasoning power required to make that match. It's basically zero. What we ended up doing was plotting a number of different tasks across two dimensions: on the left-hand axis, the amount of the context window the model has to pay attention to, and on the bottom axis, the difficulty, or the reasoning power required to do the task well. You'll notice needle in a haystack sits in the bottom left: pay attention to a needle, zero reasoning power.
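To make concrete how little reasoning needle-in-a-haystack demands, here is a toy version in Python (the filler text is made up for illustration): answering the question reduces to a literal substring lookup.

```python
# A toy needle in a haystack: one relevant sentence buried in filler.
filler = "The sky was blue that day. " * 500
needle = ("The best writing advice I got from my college classmate "
          "was to write every week.")
haystack = filler + needle + " " + filler

# The "reasoning" is a literal substring lookup: find the sentence that
# starts with the phrase the question asks about.
start = haystack.find("The best writing advice")
answer = haystack[start:haystack.index(".", start) + 1]
print(answer)
```

A model that can copy a matching span gets this right regardless of how much filler surrounds it, which is why the benchmark says little about harder tasks.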
- 5:08 – 6:05
The Importance of Context in AI Tasks
- Speaker
But our assertion is that most interesting things people are doing with language models today require either more context or more reasoning, or both. Many agent tasks, and even summarization, are much more difficult. So that raises the question: how much of the model's context can you actually use effectively? We also ran some tests on LongMemEval and demonstrated, very simply, that if you give the model full context versus focused context, where focused context in this case is Oracle, that is, human-curated, these are the performance numbers. Again, massive gains in performance from curating context. You should curate your context. So broadly speaking, the goal of context engineering is, number one, find the relevant information; number two, remove the irrelevant information; and number three, optimize the relevant information. You could argue that for any given turn of the model, the problem is: out of all the information in the universe, what information should be in the context window
- 6:05 – 6:44
Gather and Glean Model
- Speaker
this time? I have a model for this that I call Gather and Glean. Yes, it is an alliteration. Yes, I did think about that for probably thirty minutes to get there. The way I think it makes sense to think about this problem, and for those of you with a machine learning background this will connect, if not I'll explain, is this. Stage one is you want to maximize recall: get all possibly relevant tokens or information, even at the risk of pulling in information that's not relevant. Stage two is maximizing precision: remove and cut out all of the irrelevant information, so you're left with a pristine set of highly relevant, non-distracting information.
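The two stages above can be sketched in a few lines, using made-up word-overlap scoring in place of a real retriever (the corpus, scoring, and cutoff here are illustrative assumptions, not anything from the talk):

```python
def gather(query_terms, corpus):
    # Stage 1, maximize recall: keep any document sharing even one term
    # with the query, accepting junk rather than missing something relevant.
    return [doc for doc in corpus if query_terms & set(doc.lower().split())]

def glean(query_terms, candidates, keep=2):
    # Stage 2, maximize precision: score the candidates and keep only the
    # top few, cutting distractors out of the context window.
    score = lambda doc: len(query_terms & set(doc.lower().split()))
    return sorted(candidates, key=score, reverse=True)[:keep]

corpus = [
    "chroma is a search and retrieval database",
    "context engineering decides what goes in the context window",
    "the weather in london is grey",
    "retrieval quality drives context quality",
]
terms = {"context", "retrieval", "engineering"}
candidates = gather(terms, corpus)   # broad pool, may include noise
context = glean(terms, candidates)   # small, high-precision subset
print(context)
```

The point is the shape, not the scoring: over-retrieve first so nothing relevant is missed, then aggressively prune before anything reaches the context window.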
- 6:44 – 7:31
Data Gathering Techniques
- Speaker
What we're seeing a lot of developers do now on the retrieval side is this interesting pipeline where the query comes in from the user, and you have an LLM create what is functionally a query plan: okay, based on this query, I'm going to use these tools, I'm going to search in these ways. Maybe it creates ten or thirty different search probes across structured data, SQL queries, APIs and tools, and unstructured data like data in Chroma, and it gets back a big pool of data. Then there's the question of how you glean it down, and I'll get to that in a second. So again, gathering is not news to you all, but it could be structured data, unstructured data, local file system tools, other kinds of tools like MCP tools, web search, your chat conversation history. All of these pools of data may be relevant to the task the model has at hand. And
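One way to picture that fan-out is the sketch below, with stubbed-out tools standing in for SQL, an API, and a vector store. All function names here are hypothetical; in a real pipeline an LLM would write the plan and each probe would be a network call:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tool stubs; each returns a list of hits for a probe.
def search_sql(probe):     return [f"sql row about {probe}"]
def search_api(probe):     return [f"api result about {probe}"]
def search_vectors(probe): return [f"vector hit about {probe}"]

def plan_queries(user_query):
    # In the pipeline described above, an LLM writes this plan; here it
    # is hard-coded: fan the same query out to every tool.
    return [(search_sql, user_query),
            (search_api, user_query),
            (search_vectors, user_query)]

def gather(user_query):
    plan = plan_queries(user_query)
    with ThreadPoolExecutor() as pool:
        hit_lists = pool.map(lambda step: step[0](step[1]), plan)
    # Pool everything into one big candidate set; gleaning prunes it later.
    return [hit for hits in hit_lists for hit in hits]

results = gather("quarterly revenue")
print(results)
```

Running the probes in parallel matters because the gather stage deliberately over-fetches; latency would otherwise scale with the number of probes.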
- 7:31 – 8:26
Gleaning and Optimizing Data
- Speaker
then glean. So, top-K on vector similarity, I think you've seen that mentioned before; that's usually people's first pass. The next approach people commonly use is reciprocal rank fusion, or RRF. LTR, learning to rank, is an OG information retrieval technique that's implemented in Elasticsearch. Then, of course, you have dedicated re-ranking models, also common, and increasingly just LLMs. Believe it or not, LLMs. This is a great meme. And what I think is quite interesting is that more and more developers I talk to are calling it cheating at search, or brute-forcing search. Instead of trying to get super fancy about it, they just use more intelligence: spend more money on tokens. You don't have to use state-of-the-art models all the time. You can use small, fast, cheap models, and use a lot of them in parallel, to help you with this curation and gleaning stage.
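Reciprocal rank fusion is simple enough to sketch in a few lines: each ranked list votes for a document with weight 1/(k + rank), and the fused order is by total score. Here k = 60 is the conventional smoothing constant, and the doc IDs are made up:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    # Each list contributes 1 / (k + rank) per document; documents near
    # the top of any list accumulate the most score.
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Vector search and keyword search disagree on order; RRF fuses them.
vector_hits  = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # doc_b first: it places well in both lists
```

RRF needs only rank positions, not comparable scores, which is why it works well for fusing heterogeneous retrievers like the structured and unstructured probes from the gather stage.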
- 8:26 – 9:35
Context Engineering for Agents
- Speaker
All right, so now I want to spend a little more time talking about context engineering for agents. And of course, what are agents? Well, there's a loop happening here. In a deep research agent, for example, you're not just doing this gather-and-glean task once; you're doing it many times, conceptually. You're doing it inside of sub-agents, and the sub-agents are getting judged by the orchestrator, and they're going back and forth, carving up the web and finding lots of relevant information. Of course, this makes things more complicated and more interesting. And, notably, you now have agent conversation history as a major factor in the context window. When you're going back and forth, you're generating lots of information. This was also alluded to a moment ago, but prompt histories can be really, really big. For example, this is a GIF, and I know it's a little blurry, but it's an example from SWE-bench of the code and logs generated over just a couple of turns. As has been stated before, what human could possibly parse this and make sense of it? It is insanely large. So we found this quite an interesting learning when we were looking at the ability of agents to learn from long context.
- 9:35 – 10:13
Challenges with Agent Performance
- Speaker
And we found one thing that was quite notable: if you give the agent access to past failure cases, it helps improve agent performance. The agent seems to be able to break out of the local minima where it commonly gets trapped and move forward. But giving the agent access to prior success cases wasn't really a slam dunk. In fact, in many cases the agent seemed to slip into a local minimum, pattern-match, and get lazy: "Oh, you already gave me the answer. Thank you. I'll just say that back." So again, these are not solved problems. I do not have the answer to this problem for you today. I wish I did. I think this is why it's important to create a community around this idea of context engineering, so we can all solve these problems together.
- 10:13 – 10:57
The Role of Compaction
- Speaker
And so, as [chuckles] has been stated before, compaction is a really important point of leverage. Given that GIF I showed you a moment ago, the one that goes on forever and ever, how do you distill it down for the next turn of the model? Compaction is so important. And what we find is that today's approaches don't really work. The difference, and again I apologize that it's clipped, the difference between no summary at all and the compaction coming out of, for example, OpenCode, is negligible. You could basically throw away that compaction entirely, and it's only minorly worse than using the built-in compaction tool from OpenCode. But if you do a smarter compaction with a better prompt, it can be much better.
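As a rough illustration of what "a better prompt" might mean, here is a hypothetical compaction prompt that asks the model to preserve the things an agent needs on the next turn, including past failures. The wording, field list, and helper function are my own assumptions, not the prompt from the talk:

```python
COMPACTION_PROMPT = """\
Summarize the agent transcript below for the next turn. Preserve, verbatim
where possible:
1. The original task and any constraints.
2. Decisions made and their rationale.
3. Failed attempts and error messages (so they are not repeated).
4. Open questions and remaining TODO items.
Discard raw logs, redundant tool output, and pleasantries.

Transcript:
{transcript}
"""

def build_compaction_request(transcript, budget_tokens=1000):
    # Hypothetical helper: pair the prompt with a summary length budget
    # before sending it to whatever model performs the compaction.
    return {
        "prompt": COMPACTION_PROMPT.format(transcript=transcript),
        "max_tokens": budget_tokens,
    }

request = build_compaction_request("...agent turns and tool logs...")
print(request["prompt"][:40])
```

Item 3 reflects the earlier finding: keeping failure cases in context helped agents escape local minima, so a compaction that discards them throws away exactly the signal that was useful.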
- 10:57 – 11:15
Conclusion and Final Thoughts
- Speaker
All right, well, thank you very much for listening. I'm Jeff, uh, and it's your round. [audience applauding] [upbeat music]
Episode duration: 11:16
Transcript of episode 3jN77Aw7Utk