At a glance
WHAT IT’S REALLY ABOUT
How CoCounsel scaled legal AI using evals, context, and tuning
- GPT-4’s leap in legal reasoning quality enabled Casetext to pivot to CoCounsel and deliver lawyer-like performance at scalable speed.
- CoCounsel was engineered as a suite of tool-like “skills,” each designed by mapping the ideal user experience to how the world’s best lawyer would execute the task step-by-step.
- Reliable prompt performance came from systematic evaluation: write a best-guess prompt, build tests, iterate until passing, and expand to hundreds or thousands of user-driven eval cases.
- Many apparent “prompt failures” were actually context failures—poor retrieval or bad OCR—so debugging required inspecting the exact input the model sees.
- To push reliability and cost-efficiency, the team used techniques like single-token outputs with stop sequences, task decomposition, trying multiple models per step, and reinforcement fine-tuning with small but well-judged datasets.
IDEAS WORTH REMEMBERING
5 ideas

Design the product experience before designing prompts.
Start with the customer-facing workflow (e.g., a suite of lawyer “skills”), then derive the model interactions from that UX instead of treating prompting as an isolated activity.
Use “world’s best human” workflows as your first architecture.
For tasks like legal research, mirror expert steps (clarify, generate many searches, review results, take notes, synthesize) and implement each micro-step as code or a dedicated prompt.
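Those expert micro-steps can be sketched as a linear pipeline, one function per step. Everything here is hypothetical scaffolding (the function names and the toy in-memory "search index" are assumptions); in production each step would be a dedicated prompt or real retrieval code.

```python
# Minimal sketch of the "world's best lawyer" workflow as a linear pipeline.
# Each micro-step is its own function; the toy INDEX stands in for a real
# search backend.

INDEX = {
    "statute of limitations fraud": "Fraud claims: 3-year limit (Hypothetical Code §10).",
    "statute of limitations contract": "Contract claims: 6-year limit (Hypothetical Code §12).",
}

def clarify(question: str) -> str:
    # Step 1: restate the question precisely (a prompt in production).
    return question.strip().rstrip("?")

def generate_searches(topic: str) -> list[str]:
    # Step 2: fan out into many candidate queries.
    return [f"{topic} fraud", f"{topic} contract"]

def review_results(queries: list[str]) -> list[str]:
    # Step 3: run the searches and keep only actual hits.
    return [INDEX[q] for q in queries if q in INDEX]

def take_notes(hits: list[str]) -> list[str]:
    # Step 4: reduce each hit to a short note.
    return [h.split("(")[0].strip() for h in hits]

def synthesize(notes: list[str]) -> str:
    # Step 5: combine the notes into one answer.
    return " ".join(notes)

def research(question: str) -> str:
    return synthesize(take_notes(review_results(generate_searches(clarify(question)))))
```

Because each step is a separate unit, each can be tested, swapped, or given its own model independently.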
Don’t be agentic if the task is naturally linear.
If the correct process is predictable, hard-code the sequence (step1/step2/step3) rather than adding autonomous looping; reserve agentic behavior for genuinely branching, feedback-driven work.
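The contrast can be made concrete with a short sketch, under the assumption that each step is a prompt or plain code: a known process is just composed in order, while a feedback-driven task gets a bounded loop.

```python
# Hypothetical contrast: hard-coded sequence vs. an agentic feedback loop.
# The step functions are stand-ins for prompts or code.

def step1(state): return state + ["extracted"]
def step2(state): return state + ["analyzed"]
def step3(state): return state + ["summarized"]

def linear_pipeline(doc):
    # The correct process is known in advance: just run the steps in order.
    return step3(step2(step1([doc])))

def agent_loop(doc, good_enough, refine, max_iters=5):
    # Reserve looping for work where the next action depends on feedback,
    # and always bound the iterations.
    state = [doc]
    for _ in range(max_iters):
        if good_enough(state):
            break
        state = refine(state)
    return state
```

The linear version is cheaper, easier to debug, and impossible to get stuck; the loop earns its complexity only when the branching is real.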
Evals are the real unlock for prompt reliability.
Write a first-pass prompt, create a small test set, then iterate until all pass—expanding to 50/100/1000 cases driven by real user behavior and edge cases.
Most “bad prompting” is actually bad context.
If retrieval returns irrelevant snippets or OCR produces gibberish, the model may be “correct” given the input; debug by reading the exact context verbatim as the model receives it.
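Inspecting the exact context the model receives can be sketched as below. The toy retriever, the chunk store, and the crude OCR-quality heuristic are all assumptions for illustration; the point is to assemble and examine the literal prompt string, not a summary of it.

```python
# Debugging-context sketch: build the prompt exactly as the model would see
# it and inspect it verbatim, rather than assuming retrieval and OCR worked.

def retrieve(query: str, store: dict) -> list[str]:
    # Toy retriever: substring match over a chunk store.
    return [text for text in store.values() if query.lower() in text.lower()]

def build_prompt(question: str, chunks: list[str]) -> str:
    context = "\n---\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"

def looks_garbled(text: str, threshold: float = 0.8) -> bool:
    # Crude OCR-quality heuristic: too few alphanumeric/space characters.
    ok = sum(c.isalnum() or c.isspace() for c in text)
    return len(text) > 0 and ok / len(text) < threshold

STORE = {
    "a": "The lease term commences on January 1, 2020.",
    "b": "%#@@!! l3a$e t3rm c0mm3nc3$ ...",  # simulated bad OCR output
}

chunks = retrieve("lease", STORE)                      # note: misses the garbled chunk
prompt = build_prompt("When does the lease start?", chunks)
flagged = [key for key, text in STORE.items() if looks_garbled(text)]
```

Reading `prompt` verbatim surfaces both failure modes at once: the garbled chunk never matched the query (a retrieval failure), and the heuristic flags it as OCR damage, so neither would be blamed on the prompt.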
WORDS WORTH SAVING
5 quotes

GPT-4, unlike GPT-3 or 3.5 or any other model we've seen or developed ourselves, was finally able to do complex legal tasks at a rate that was not perfect, but was around the same rate that humans can achieve for a lot of these tasks.
— Jake Heller
I think that most prompts are instruction plus context, at least most prompts of consequence. Instruction plus context.
— Jake Heller
The definition of a good prompt engineer is somebody who can write pretty well and can concisely, directly, and understandably write great instructions, and also somebody who's willing to not sleep for two weeks straight until they get it right.
— Jake Heller
If this thing says the sky is purple, the sky is damn purple.
— Jake Heller
If you're not willing to stay up, or you or your employees or whoever are not willing to stay up for two weeks straight, not sleeping, just working on the prompt, you're not gonna make it, all right?
— Jake Heller
High-quality AI-generated summary created from a speaker-labeled transcript.