At a glance
WHAT IT’S REALLY ABOUT
How CoCounsel scaled legal AI using evals, context, and tuning
- GPT-4’s leap in legal reasoning quality enabled Casetext to pivot to CoCounsel and deliver lawyer-like performance at scalable speed.
- CoCounsel was engineered as a suite of tool-like “skills,” each designed by mapping the ideal user experience to how the world’s best lawyer would execute the task step-by-step.
- Reliable prompt performance came from systematic evaluation: write a best-guess prompt, build tests, iterate until passing, and expand to hundreds or thousands of user-driven eval cases.
- Many apparent “prompt failures” were actually context failures—poor retrieval or bad OCR—so debugging required inspecting the exact input the model sees.
- To push reliability and cost-efficiency, the team used techniques like single-token outputs with stop sequences, task decomposition, trying multiple models per step, and reinforcement fine-tuning with small but well-judged datasets.
IDEAS WORTH REMEMBERING
5 ideas

Design the product experience before designing prompts.
Start with the customer-facing workflow (e.g., a suite of lawyer “skills”), then derive the model interactions from that UX instead of treating prompting as an isolated activity.
Use “world’s best human” workflows as your first architecture.
For tasks like legal research, mirror expert steps (clarify, generate many searches, review results, take notes, synthesize) and implement each micro-step as code or a dedicated prompt.
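Those expert micro-steps can be sketched as a linear pipeline, one function per step. Everything here is hypothetical scaffolding (the function names and the toy in-memory "search index" are assumptions); in production each step would be a dedicated prompt or real retrieval code.

```python
# Minimal sketch of the "world's best lawyer" workflow as a linear pipeline.
# Each micro-step is its own function; the toy INDEX stands in for a real
# search backend.

INDEX = {
    "statute of limitations fraud": "Fraud claims: 3-year limit (Hypothetical Code §10).",
    "statute of limitations contract": "Contract claims: 6-year limit (Hypothetical Code §12).",
}

def clarify(question: str) -> str:
    # Step 1: restate the question precisely (a prompt in production).
    return question.strip().rstrip("?")

def generate_searches(topic: str) -> list[str]:
    # Step 2: fan out into many candidate queries.
    return [f"{topic} fraud", f"{topic} contract"]

def review_results(queries: list[str]) -> list[str]:
    # Step 3: run the searches and keep only actual hits.
    return [INDEX[q] for q in queries if q in INDEX]

def take_notes(hits: list[str]) -> list[str]:
    # Step 4: reduce each hit to a short note.
    return [h.split("(")[0].strip() for h in hits]

def synthesize(notes: list[str]) -> str:
    # Step 5: combine the notes into one answer.
    return " ".join(notes)

def research(question: str) -> str:
    return synthesize(take_notes(review_results(generate_searches(clarify(question)))))
```

Because each step is a separate unit, each can be tested, swapped, or given its own model independently.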
Don’t be agentic if the task is naturally linear.
If the correct process is predictable, hard-code the sequence (step1/step2/step3) rather than adding autonomous looping; reserve agentic behavior for genuinely branching, feedback-driven work.
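The contrast can be made concrete with a short sketch, under the assumption that each step is a prompt or plain code: a known process is just composed in order, while a feedback-driven task gets a bounded loop.

```python
# Hypothetical contrast: hard-coded sequence vs. an agentic feedback loop.
# The step functions are stand-ins for prompts or code.

def step1(state): return state + ["extracted"]
def step2(state): return state + ["analyzed"]
def step3(state): return state + ["summarized"]

def linear_pipeline(doc):
    # The correct process is known in advance: just run the steps in order.
    return step3(step2(step1([doc])))

def agent_loop(doc, good_enough, refine, max_iters=5):
    # Reserve looping for work where the next action depends on feedback,
    # and always bound the iterations.
    state = [doc]
    for _ in range(max_iters):
        if good_enough(state):
            break
        state = refine(state)
    return state
```

The linear version is cheaper, easier to debug, and impossible to get stuck; the loop earns its complexity only when the branching is real.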
Evals are the real unlock for prompt reliability.
Write a first-pass prompt, create a small test set, then iterate until all pass—expanding to 50/100/1000 cases driven by real user behavior and edge cases.
Most “bad prompting” is actually bad context.
If retrieval returns irrelevant snippets or OCR produces gibberish, the model may be “correct” given the input; debug by reading the exact context verbatim as the model receives it.
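Inspecting the exact context the model receives can be sketched as below. The toy retriever, the chunk store, and the crude OCR-quality heuristic are all assumptions for illustration; the point is to assemble and examine the literal prompt string, not a summary of it.

```python
# Debugging-context sketch: build the prompt exactly as the model would see
# it and inspect it verbatim, rather than assuming retrieval and OCR worked.

def retrieve(query: str, store: dict) -> list[str]:
    # Toy retriever: substring match over a chunk store.
    return [text for text in store.values() if query.lower() in text.lower()]

def build_prompt(question: str, chunks: list[str]) -> str:
    context = "\n---\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"

def looks_garbled(text: str, threshold: float = 0.8) -> bool:
    # Crude OCR-quality heuristic: too few alphanumeric/space characters.
    ok = sum(c.isalnum() or c.isspace() for c in text)
    return len(text) > 0 and ok / len(text) < threshold

STORE = {
    "a": "The lease term commences on January 1, 2020.",
    "b": "%#@@!! l3a$e t3rm c0mm3nc3$ ...",  # simulated bad OCR output
}

chunks = retrieve("lease", STORE)                      # note: misses the garbled chunk
prompt = build_prompt("When does the lease start?", chunks)
flagged = [key for key, text in STORE.items() if looks_garbled(text)]
```

Reading `prompt` verbatim surfaces both failure modes at once: the garbled chunk never matched the query (a retrieval failure), and the heuristic flags it as OCR damage, so neither would be blamed on the prompt.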
WORDS WORTH SAVING
5 quotes

GPT-4, unlike GPT-3 or 3.5 or any other model we've seen or developed ourselves, was finally able to do complex legal tasks at a rate that was not perfect, but was around the same rate that humans can achieve for a lot of these tasks.
— Jake Heller
I think that most prompts are instruction plus context, at least most prompts of consequence. Instruction plus context.
— Jake Heller
The definition of a good prompt engineer is somebody who can write pretty well and can concisely, directly, and understandably write great instructions, and also somebody who's willing to not sleep for two weeks straight until they get it right.
— Jake Heller
If this thing says the sky is purple, the sky is damn purple.
— Jake Heller
If you're not willing to stay up, or you or your employees or whoever are not willing to stay up for two weeks straight, not sleeping, just working on the prompt, you're not gonna make it, all right?
— Jake Heller
High-quality AI-generated summary created from a speaker-labeled transcript.