EVERY SPOKEN WORD
20 min read · 4,286 words
- 0:00 – 0:28
Intro
- Jake Heller
All right. Hey, everybody. My name's Jake. We're gonna be talking about lessons we learned in what we used to call prompt engineering — and may still, in the future, call prompt engineering — while building a product called CoCounsel. So you might be wondering, who the heck is this guy and why am I listening to him? Very brief background: I founded a company called Casetext twelve years ago, summer '13 at YC, so it's been a minute. I am that old, in fact.
- 0:28 – 0:53
Early Work with GPT-4
- Jake Heller
Fast-forward nine years. We had been working on AI and law for the entirety of our existence, and as a result we got early access to GPT-4 — we were just very close with OpenAI. Before ChatGPT even came out, we were using GPT-4. And what we noticed immediately, and I'll get these two bullet points out
- 0:53 – 1:38
Pivot to CoCounsel
- Jake Heller
at the same time, is that we started pivoting the entire company around CoCounsel. We had a different business — it was also AI in law — but we pivoted the entire company around GPT-4. Because what we saw early on was that GPT-4, unlike GPT-3 or 3.5 or any other model we'd seen or developed ourselves, could finally do complex legal tasks at a rate that was not perfect, but was around the same rate humans achieve for a lot of these tasks. And of course you can scale it, unlike humans, to hundreds of tasks at the same time — and even where you couldn't scale it, it's still faster than people. So all of a sudden you had something that was faster and better at
- 1:38 – 2:34
Success with GPT-4
- Jake Heller
legal tasks — and that's the area we served — and that blew us away. We're the guys who did the study showing that GPT-3.5 scored in the 10th percentile on the bar examination, but GPT-4, on the same exam, got to the 90th percentile. And this was an exam that was not in the training set, unlike basically all the [chuckles] evals today. So we were absolutely blown away by this model and built a whole product and company around it. It was pretty massively successful in legal. The idea of the first AI assistant for lawyers was functionality a lot of our customers were looking for — something they'd been asking us for for, like, the first decade of our business, and we were like, "That's ridiculous. We can't build that. We'd need some genius AI to do that." And then all of a sudden it was kind of handed to us, so we were able to be successful in part because we were just doing what our customers asked for. And based on that success, we were
- 2:34 – 2:57
Acquisition by Thomson Reuters
- Jake Heller
acquired by a company called Thomson Reuters in 2023 — literally two days ago was my two-year anniversary, which, when you all do the acquisition thing, you'll find out is a very important milestone. [laughs] And we've continued to hone some of these prompt engineering, or context engineering, techniques since then, which is why we're talking to you about it now.
- 2:57 – 3:24
Introduction to Context Engineering
- Jake Heller
Before I get into the fun, meaty stuff, one point of semantics: I'm actually not sure about "context engineering" as the meme. It may or may not be the right term, because as I'll talk about, I think most prompts — at least most prompts of consequence — are instruction plus context. So to call it context engineering is really naming only one part of it. But it's just semantics. Let's get into how to actually do it, whatever you want to call it.
- 3:24 – 3:44
Developing CoCounsel: Three Big Steps
- Jake Heller
From a high level, there are really three big steps we took when developing CoCounsel that I think still apply today. This is about developing the whole application; then we'll work backwards from there to where each step of the application fits in, and then how each prompt — or context, whatever you want to call it these days — fits
- 3:44 – 4:57
Defining the Customer Experience
- Jake Heller
into that. So the first big step, big picture, is: what is the experience we're trying to deliver to customers? In our case, it was a suite of what we called skills that CoCounsel could do. Imagine CoCounsel almost like ChatGPT with tools, where each tool is a skill: do complex legal research, or read over a hundred or a thousand documents and tell me what's in them, or review a contract and tell me what needs to change to meet my company's requirements. These skills mapped pretty well onto actual lawyer skills — the kind of things you might list on a resume: "I can do research, I can do contract review," et cetera. That's how we thought about it. It's not the only way to think about it. In our case it was a chat application with tools; your case might be a big button you press that generates a poem, or upload a document, analyze it, something comes out. The UIs may differ, but be really mindful about what the ideal customer experience is. And from there, for us anyway, we asked: "How would the world's best lawyer do that task?"
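The "chat with tools, where each tool is a skill" shape can be sketched roughly as below. The skill names, handlers, and registry are hypothetical illustrations, not CoCounsel's actual implementation:

```python
# Minimal sketch: a registry of named skills a chat layer can route to.
# Skill names and handler bodies are hypothetical placeholders.
from typing import Callable

SKILLS: dict[str, Callable[[str], str]] = {}

def skill(name: str):
    """Decorator that registers a function as a named skill."""
    def register(fn: Callable[[str], str]):
        SKILLS[name] = fn
        return fn
    return register

@skill("legal_research")
def legal_research(query: str) -> str:
    # In a real system this would run the multi-step research pipeline.
    return f"[research memo for: {query}]"

@skill("contract_review")
def contract_review(query: str) -> str:
    return f"[redline suggestions for: {query}]"

def dispatch(skill_name: str, user_input: str) -> str:
    """The chat layer picks skill_name (e.g. via model tool-calling)."""
    return SKILLS[skill_name](user_input)
```

In practice the dispatch decision would come from the model's tool-calling output rather than a hardcoded name.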
- 4:57 – 6:13
Legal Research Example
- Jake Heller
So for legal research, as a concrete example: the world's best lawyer, with an infinite amount of time, would take the question given to them and maybe first just clarify it. So prompt one: clarify. I need to understand — are you trying to search just federal case law, or state case law? Are you trying to understand this angle or that angle? So there might be a clarification step — and in fact, you later saw this show up in Deep Research, one of the iterations from ChatGPT, if you saw that. Maybe that's one step. After the clarification step, you have a step where you formulate a number of search queries. That's what I would do as a real legal researcher — as I was, before Casetext. I might try this kind of search query, that kind of search query, these kinds of words, those kinds of words — maybe twenty or thirty or fifty different search queries. Then I'd execute those queries and review each of the results one by one. The best lawyer in the world would read every single page of every result that came back and analyze: is this actually relevant to what I'm doing? If so, how does it help answer the question? Maybe make some notes about how this case relates to the final answer, compile that into a big notepad, and then finally take all those notes and compile them into a final answer for my client. Those are the steps for that task.
- 6:13 – 8:02
Linear vs. Agentic Tasks
- Jake Heller
Sometimes the tasks will be pretty linear and predictable. In that case, don't be agentic behind the scenes — just program it in. If you imagine Python code: def step one, def step two, def step three, and then you literally just run step one, step two, step three. And sometimes it's agentic: you do a number of searches, but if you realize you're not finding what you're looking for, you go back to step one and try some more searches, try some others. So you have to figure out for yourself: how would the best person do this, and are they acting more agentically, so to speak, or more in fixed steps? How the best person in the world would do this — that's your architecture for that skill or task, in my opinion. At least version one, or version zero. There may be things no human could do — that the world's best X couldn't possibly do comprehensively — and then maybe you can get a little creative. But I'd start with how the world's best human would attack the problem. And within that, there are multiple micro steps. I told you some of them: clarify the search query, run this number of searches, review the results that came back, start making notes based on those results. These are all micro steps, and there may even be micro steps within those. Each micro step is either code or a prompt — and for the difficult tasks we're all doing these days, things that were unthinkable before GPT-4 but are now possible, those are mostly going to be prompts. So that's how you work down to the prompt level.
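The "def step one, def step two, def step three" idea for a linear task might look like the sketch below, using the legal research example. The function bodies are stubs standing in for real prompts and search calls:

```python
# Linear pipeline sketch: when the task is predictable, wire the steps
# directly instead of letting an agent plan. Bodies are placeholders.

def clarify(question: str) -> str:
    # Prompt 1: pin down jurisdiction/scope (stubbed).
    return question.strip() + " (federal case law)"

def formulate_queries(question: str, n: int = 3) -> list[str]:
    # Prompt 2: generate n candidate search queries (stubbed).
    return [f"{question} :: variant {i}" for i in range(1, n + 1)]

def run_searches(queries: list[str]) -> list[str]:
    # Step 3: execute each query against the search backend (stubbed),
    # then a per-result prompt would judge relevance and take notes.
    return [f"notes on results for: {q}" for q in queries]

def compile_answer(question: str, notes: list[str]) -> str:
    # Final prompt: turn the notepad into an answer for the client.
    return f"Answer to {question!r} based on {len(notes)} notes"

def research(question: str) -> str:
    q = clarify(question)
    notes = run_searches(formulate_queries(q))
    return compile_answer(q, notes)
```

The agentic variant would wrap `run_searches` in a loop that checks whether the notes actually answer the question and, if not, goes back to `formulate_queries` with what it learned.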
And it's important as a developer to understand where each prompt sits in the flow — what the job of each prompt is when it's trying to accomplish this task for the user.
- 8:02 – 12:44
Writing Effective Prompts
- Jake Heller
From there, you have to make the prompt actually work, and this is where most people try, fail, and move on, because they decide, "This is an impossible problem to solve. I'm not going to build an AI application." The method we worked through is pretty simplistic, but you'd be surprised how few people do it this way, so I'd recommend it even if it sounds really stupid and simple. Step one: write a prompt that's your best guess at achieving this activity. So if the activity is "write one good legal research query," you define all the instructions in the prompt. This is where having subject matter expertise really helps: if you're a really good researcher, you can give it instructions on how to write a great legal research query. Then you also write, to start, just ten evals. What is a good legal research query? The idea is: given the prompt — the context, which in this case is the user query, plus your instructions — what is the objectively correct answer, or how would you measure one? Start writing down these evals. Here's where tooling comes into play. When I develop at home, I use promptfoo. It's free, open source, simple, and command-line. There's also a host of paid, cloud-based tools that are awesome — I've used Vellum before; I don't know if they're here, those guys are cool. There are so many different tools. Just pick one and make sure it can do this task: you write instructions, you give it context. Context in other circumstances, by the way, might be the full text of a document you're asking questions about.
The prompt might be instructions for choosing the next chess move, and the context is the chess game up to this point. So: instructions plus context, whatever it may be for you — make sure that, given those instructions and that context, it actually does the task well. My guess is the first prompt you write is going to pass six of your ten tests. And that's the moment most people go, "Well, AI's fucking stupid. I guess we'll wait 'til GPT-6. It's over for me. This application is doomed." And here's where the hard part is: you just keep working until you pass all ten tests, even if it takes you a week or two. What I tell everybody who works for me is that the definition of a good prompt engineer is somebody who can write great instructions — concisely, directly, understandably — and who is also willing to not sleep for two weeks straight until they get it right. That's what it takes. Because what you'll find is that you thought your instructions were clear, but the model gets it wrong every time on some dumb question. Okay, read the instructions again. Where is it unclear? Where am I not being clear to this AI? Am I using the wrong model or the wrong settings? Did I set temperature to two — which is somehow possible, or even ten with Gemini models, which I don't really understand; it's crazy, one is already kind of crazy, and ten is "let's go nuts." Maybe try temperature zero, or 0.5. Maybe I set thinking too high or too low. You just keep experimenting — mostly with the instructions, because these models are very smart and will follow your instructions better and better, but with everything else as well: model, settings, et cetera — until you get ten out of ten.
And then you keep on iterating, and you go to fifty evals, then a hundred, then a thousand per prompt. This is what it takes. Because by the time you get to a thousand — and another kind of test here is not the thousand you thought of by yourself in a dark room, but the kinds of things your users are likely to throw at you. Can you anticipate, pre-launch, the crazy shit your users are going to pump into it? We were shocked by how dumb the legal research queries people put into CoCounsel were. We were totally blown away. But that's the reality of our users, and of many users — even highly educated people like lawyers. So try to anticipate as best you can what users will put into your application and what the context will look like, and maybe run a small beta program where you tell your customers: it's going to be bad at first; we're expecting that, and we want you to tell us about the times you tried it, so we can add those cases to our tests. That's how you get to a thousand tests — it's usually driven by your customers. All right. I kind of dumped on context engineering as the meme, because I think context plus instructions equals prompt, but we'll probably hear some disagreements throughout the presentations. It doesn't really matter; it's just semantics.
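The eval loop described above — given this context and these instructions, the answer must match — can be reduced to a few lines. This is a toy stand-in for what promptfoo and similar tools do properly; `run_prompt` is a stub in place of a real model call:

```python
# Stripped-down eval loop: each case pairs a context (here, a user query)
# with the expected answer. run_prompt stands in for the real LLM call.

CASES = [
    ("Statute of limitations for breach of contract in NY?", "6 years"),
    ("Statute of limitations for breach of contract in CA?", "4 years"),
]

def run_prompt(instructions: str, context: str) -> str:
    # Stand-in for model(instructions + context); canned so the sketch runs.
    canned = {c: a for c, a in CASES}
    return canned.get(context, "unknown")

def run_evals(instructions: str) -> tuple[int, int]:
    """Return (passed, total). Any red test means: keep iterating on the
    instructions, model, or settings until everything is green."""
    passed = 0
    for context, expected in CASES:
        answer = run_prompt(instructions, context)
        passed += int(expected in answer)
    return passed, len(CASES)
```

The point of the structure is the loop, not the stub: you grow `CASES` from ten to fifty to a thousand, largely out of real user inputs, and you don't ship until it's all green.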
- 12:44 – 13:33
Importance of Context
- Jake Heller
But I will say that context really, really, really matters, and a lot of people overlook this. There were times — to go back to the legal research example — when we were like, "I can't believe the AI is getting the answer wrong. It's so stupid." But then we actually sat down and read the context — sometimes ten pages, sometimes a hundred — and realized: oh, given this information, I would say the same thing the AI does. It doesn't have the right information. This is especially the case if — and I recommend this — you tell the AI, "Don't answer based on any of your previously existing knowledge, just based on what I'm giving you right here. If this thing says the sky is purple, the sky is damn purple." That's a good instruction to give, by the way, if you're worried the AI will hallucinate beyond the information you put in. Just be really, really clear: the answer must be based on the context given.
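One way to phrase that grounding instruction is sketched below. The exact wording is illustrative — not CoCounsel's actual prompt — and the `<context>` delimiters are just one common convention:

```python
# Hypothetical grounding prompt: force answers to come only from the
# supplied context, never from the model's prior knowledge.

GROUNDING_INSTRUCTIONS = (
    "Answer using ONLY the context below. Do not rely on any previously "
    "existing knowledge. If the context says the sky is purple, the sky "
    "is purple. If the context does not contain the answer, say so."
)

def build_prompt(context: str, question: str) -> str:
    return (
        f"{GROUNDING_INSTRUCTIONS}\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"Question: {question}"
    )
```

The "say so" fallback matters: without an explicit out, models are more inclined to fill gaps from training data.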
- 13:33 – 15:49
Challenges in Prompt Engineering
- JHJake Heller
Okay. But what happens if your retrieval sucks and the information that comes in is garbage? How could you possibly expect a human, let alone an AI, to answer the question accurately if the retrieval sucks? So now — you think you're working on prompt engineering or context engineering, but what you're really working on is retrieval. Go Chroma.
- Speaker
[cheering]
- Jake Heller
Or, [laughs] maybe you have another problem. For us, believe it or not, OCR was a huge issue. Legal documents are total messes — absolute messes. What we'd see time and time again is that when we read the OCR output, it was gibberish. A total mess. No human in their right mind could get the right answer from it. Sometimes the AI would get it right anyway, and we were shocked — the words were out of order and it still figured it out. But now you have an OCR problem, not a prompt engineering problem. So really open up the hood and see it like the AI sees it. Read the input verbatim, including all the spacing — yes, it sometimes matters. The general rule: if it's hard for you to read, it'll be hard for the AI to read. The AI can do some things you can't, but if you can make the input readable and easy, you're in a really good place. So look at the actual information in there. And again, I want to double-underscore this, because it's going to be the difference between a prompt that works and a prompt that doesn't: if you — or your employees, or whoever — are not willing to stay up for two weeks straight, not sleeping, just working on the prompt, you're not going to make it. That's what it takes to get the prompt working accurately, at scale, in the way you want. So that's our secret to success, and it's actually pretty simple: evals per prompt. Given this information and these instructions, it should answer this. If it's not answering this, it's red, not green. I have to keep working on the instructions, or the information coming in, until it's green. Then move on to the next test, and the next, and the next.
By the time you're passing, like, nine hundred and ninety-nine out of a thousand — and that last one is kind of debatable — now you're ready to release to a wider audience, in my opinion. Now you have some degree of reliability. It's not a guarantee, and you should never tell your customers it's going to be fucking perfect, because it's not going to be fucking perfect. But it'll be really good, especially if you anticipated all the different edge cases your customers will throw at it.
- 15:49 – 18:18
Tricks and Tips for Prompt Engineering
- Jake Heller
A few other quick tricks. First, the way the AI works, as you probably know, is that for every token it generates, it reads over the entire prompt plus all the tokens it has generated so far. So every token you generate takes a while. This is less true now that the models are getting faster, and they all quantize like crazy, and so on. But when we started, GPT-4 was slow. And if you want to do something like read a million documents over a few hours, or even a few minutes — good luck if it's generating a lot of content per document. But there's a trick: make it respond with just one token. That means it does that whole process of reading the entire context and outputting a token once, as opposed to twice or three times. And there are some really creative things you can do here with stop words. One thing we loved to do — and this is more true of completion models than chat models, and unfortunately the world has moved to chat models — is give the whole prompt and tell it the output format is: first, just the number, on a scale of one to ten, of whatever we're looking for. Just the number. Those are all single tokens each — ten is also a single token. Or a single word like true or false, or a gradient of true, false, maybe, whatever. "And then give your explanation, dear AI." Here's the craziest shit: the AI will give a slightly better answer, because it thinks it's about to produce a whole explanation defending its number. But you use stop words, or max tokens set to one, so the AI outputs just one token and then you cut it right off — "stop talking." You get your fast response, and you also got the AI to think a little harder before giving its answer. That's a good trick.
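The one-token trick might look like the sketch below, shown as the request parameters you would hand to a chat-completions-style API (no network call is made here). The parameter names `max_tokens` and `stop` follow the OpenAI chat completions API; the model name and prompt wording are illustrative:

```python
# Sketch of the one-token trick: promise an explanation in the prompt so
# the model "plans" to defend its score, then cap generation at one token.

def score_request(document: str, criterion: str) -> dict:
    prompt = (
        f"On a scale of 1 to 10, how well does this document satisfy "
        f"'{criterion}'? Output only the number first, then explain "
        f"your reasoning.\n\n{document}"
    )
    return {
        "model": "gpt-4o-mini",  # illustrative; any chat model works
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1,          # cut it off right after the score
        "stop": ["\n"],           # belt and suspenders
    }
```

The model never gets to write the explanation — you pay for one output token per document — but the expectation of having to justify the number tends to sharpen it.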
Similarly, for eval purposes: even if you're going to have it output a full explanation — you want a big, nice JSON blob with all the stuff in it — make it give its numeric or objective answer first. That makes evals so much easier. Evals are way easier when you can check "matches word X" rather than running LLM-as-judge for every single eval, although LLM-as-judge is sometimes the only thing you've got. Also, break things into small parts. It's tempting to get the AI to do a dozen steps in one prompt or one context, and sometimes it works — if you can, do it. But it's just like people: give them a simple task and they'll do it right more often. Break it down into simple steps.
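The payoff of answer-first output is that the eval becomes an exact-match check on one field instead of a judge-model call. A minimal sketch, assuming a hypothetical JSON shape with the verdict emitted before the explanation:

```python
# If the objective answer is the first field, a string/equality match
# replaces LLM-as-judge. The {"verdict", "explanation"} shape is
# hypothetical, not a real schema from the talk.
import json

def parse_verdict(model_output: str) -> str:
    """Pull the objective field out of the model's JSON response."""
    return json.loads(model_output)["verdict"]

def eval_case(model_output: str, expected: str) -> bool:
    # Deterministic, cheap, and unambiguous — no judge model needed.
    return parse_verdict(model_output) == expected
```

You'd still keep LLM-as-judge around for the free-text explanation if you need to grade it, but the pass/fail signal on the verdict is now exact.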
- 18:18 – 20:22
Reinforcement Fine-Tuning and Model Selection
- Jake Heller
The thing we're seeing is that reinforcement fine-tuning is a lot better than older fine-tuning. A lot of people in this room are probably turned off to fine-tuning, because you tried it — if you're anything like us — and the gains were minimal at best. You go, "Okay, why am I gathering thousands of examples just for it to kind of fuck up at the same rate it fucked up before?" And sometimes it even got dumber, which we saw. Reinforcement fine-tuning, on the other hand, requires at most fifty to a hundred examples of: the prompt, what a good answer looks like, and how to objectively judge whether the answer is good — which is the hardest part of creating these reinforcement fine-tunes, whether through APIs like OpenAI's or doing it yourself. But it goes a really long way — a really long way. We're getting to a place where we have a different model for every single prompt, and there are hundreds, if not thousands, of prompts throughout the CoCounsel ecosystem at this point. Each micro step gets its own reinforcement fine-tuned model, and that's another way you might get your evals from nine hundred out of a thousand to nine hundred and ninety-nine out of a thousand. Finally, try different models for different prompts. There's no rule that says it has to be GPT all the way through if you have seven different steps. Just try different models. All these platforms, including promptfoo — which is free and runs in your desktop environment — let you try a prompt under different conditions and across different models. And you might save a buck: if you can go with GPT-5 Mini instead of GPT-5, you might save a lot of bucks. Gemini 2.5 Flash — beautiful model, cheap as fuck. If it works to your level of satisfaction, use that; don't use 2.5 Pro.
So experiment with different models — it might take you pretty far, especially if a lower-than-a-hundred-percent accuracy rate is still acceptable for your customers, depending on the use case. All right, that's all I got. Hopefully it's useful, and we'll do questions at the end, I think. [applauding] [upbeat music]
Episode duration: 20:23
Transcript of episode sLFv3RSj_d8