YC Root Access

Context Engineering: Lessons Learned from Scaling CoCounsel

Jake Heller has spent years building AI tools for lawyers. With early access to GPT-4, he and his team realized the model could finally perform legal work at a professional level, scoring in the 90th percentile on the bar exam where GPT-3.5 had reached only the 10th. That breakthrough led to CoCounsel, an AI legal assistant for research and contracts, and eventually to Casetext's acquisition by Thomson Reuters. In this video, Jake breaks down what it takes to turn powerful models into reliable products, and the lessons he has learned building AI for one of the world's most demanding professions.

Chapters:
00:28 - Early Work with GPT-4
00:53 - Pivot to CoCounsel
01:38 - Success with GPT-4
02:34 - Acquisition by Thomson Reuters
02:57 - Introduction to Context Engineering
03:24 - Developing CoCounsel: Three Big Steps
03:44 - Defining the Customer Experience
04:57 - Legal Research Example
06:13 - Linear vs. Agentic Tasks
08:02 - Writing Effective Prompts
12:44 - Importance of Context
13:33 - Challenges in Prompt Engineering
15:49 - Tricks and Tips for Prompt Engineering
18:18 - Reinforcement Fine-Tuning and Model Selection

Jake Heller (guest)
Aug 25, 2025 · 20m · Watch on YouTube ↗

CHAPTERS

  1. Why Casetext Went All-In on GPT-4

    Jake Heller explains his background founding Casetext and how early access to GPT-4 changed what the company believed was possible in legal AI. He frames GPT-4 as the first model that could perform complex legal work at roughly human quality while being far faster and more scalable.

  2. From “Nice Demo” to Product Pivot: Building CoCounsel

    He describes the internal realization that GPT-4 enabled the product customers had been requesting for years: a true AI assistant for lawyers. That realization triggered a major company pivot to build CoCounsel around GPT-4 capabilities.

  3. Validation Moments and Market Impact (Bar Exam + Adoption)

    Jake highlights concrete proof points that increased confidence in GPT-4’s legal aptitude, including their bar exam performance research. He ties these results to why customers trusted the system enough to adopt it for real work.

  4. Acquisition by Thomson Reuters and Continued Iteration

    He briefly covers the Thomson Reuters acquisition and uses it to contextualize why these lessons matter at scale. The team continued refining “prompt engineering” practices post-acquisition as usage expanded.

  5. Context Engineering vs Prompt Engineering: The Framing

    Jake challenges the terminology, arguing that most useful prompts are really "instructions + context." He suggests the label matters less than mastering the two components and how they interact, as the sketch below illustrates.
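    One way to picture the split is a hypothetical template; the wording and structure below are illustrative, not CoCounsel's actual prompt:

    ```python
    # Fixed instructions: role, task, and constraints.
    INSTRUCTIONS = (
        "You are a legal research assistant. Answer using ONLY the "
        "material in the Context section, and cite your sources."
    )

    def build_prompt(question: str, retrieved_docs: list[str]) -> str:
        """Combine the fixed instructions with per-request context."""
        context = "\n\n".join(retrieved_docs)
        return f"{INSTRUCTIONS}\n\nContext:\n{context}\n\nQuestion:\n{question}"
    ```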

  6. Design the Customer Experience First (Skills, Tools, and UX)

    He outlines the top-down approach they used: start by defining the user experience and the “skills” the product should offer. CoCounsel is described as a chat interface with tool-like skills mapped to what lawyers do in practice.

  7. Decompose Each Skill Like the World’s Best Professional Would

    For each skill, the team asked: how would the best lawyer in the world do this with unlimited time? He walks through a legal research workflow (clarify → generate queries → search → review → notes → final answer) as the blueprint for the AI pipeline.
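    A minimal sketch of that blueprint as a linear pipeline. `call_llm` and `search` are stand-in stubs here, not CoCounsel's real components:

    ```python
    def call_llm(prompt: str) -> str:
        """Stand-in for a real model call; returns canned text here."""
        return "..."

    def search(query: str) -> list[str]:
        """Stand-in for a real legal search backend."""
        return [f"result for {query}"]

    def legal_research(question: str) -> str:
        # 1. Clarify the request into a precise research question.
        clarified = call_llm(f"Restate as a precise research question:\n{question}")
        # 2. Generate several search queries, one per line.
        queries = call_llm(f"Write 3 search queries for:\n{clarified}").splitlines()
        # 3. Run the searches and pool the results.
        results = [doc for q in queries for doc in search(q)]
        # 4. Review each result, keeping notes on what it says.
        notes = [call_llm(f"Note what this source says about the question:\n{doc}")
                 for doc in results]
        # 5. Compose the final answer from the notes.
        return call_llm("Answer from these notes:\n" + "\n".join(notes)
                        + f"\n\nQuestion: {clarified}")
    ```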

  8. Linear Pipelines vs Agentic Loops (When to Program vs Let It Roam)

    Jake distinguishes predictable, linear tasks from tasks requiring iteration and backtracking. If the process is stable, hard-code step-by-step execution; if success depends on exploration, use more agentic looping behavior.
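    The structural difference in sketch form, with a hypothetical `call_llm` stub and an illustrative DONE convention:

    ```python
    def call_llm(prompt: str) -> str:
        return "..."  # stand-in for a real model call

    # Linear: the steps are known up front, so hard-code the sequence.
    def linear_task(doc: str) -> str:
        dates = call_llm(f"Extract the key dates from:\n{doc}")
        return call_llm(f"Arrange these dates as a timeline:\n{dates}")

    # Agentic: the path is unknown, so let the model iterate, keep state,
    # and decide for itself when it is finished (with a step cap).
    def agentic_task(goal: str, max_steps: int = 10) -> str:
        scratchpad = ""
        for _ in range(max_steps):
            step = call_llm(f"Goal: {goal}\nProgress so far:\n{scratchpad}\n"
                            "Propose the next action, or reply DONE: <answer>.")
            if step.startswith("DONE:"):
                return step.removeprefix("DONE:").strip()
            scratchpad += step + "\n"  # exploration and backtracking live here
        return scratchpad  # fall back to whatever was gathered
    ```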

  9. Making Prompts Actually Work: Evals-Driven Iteration

    He presents a practical workflow: write a best-guess prompt, create a small set of tests (evals), then iterate relentlessly until it passes. Most teams quit at partial success; his argument is that reliability comes from grinding through failures with disciplined evaluation.
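    In miniature, the loop he describes might look like this (hypothetical task and names throughout):

    ```python
    def call_llm(prompt: str) -> str:
        return "yes"  # stand-in for a real model call

    # A small, hand-written eval set: input plus expected answer.
    EVALS = [
        {"input": "Clause: 'Party A shall hold harmless...' Indemnity?",
         "expected": "yes"},
        {"input": "Clause: 'This Agreement is governed by...' Indemnity?",
         "expected": "no"},
    ]

    def pass_rate(prompt_template: str) -> float:
        """Score a candidate prompt against every eval case."""
        passed = 0
        for case in EVALS:
            output = call_llm(prompt_template.format(input=case["input"]))
            passed += output.strip().lower() == case["expected"]
        return passed / len(EVALS)

    # Rewrite the prompt and re-run until the evals pass,
    # not until the output merely "looks OK" on one example.
    score = pass_rate("Answer yes or no only. {input}")
    print(f"pass rate: {score:.0%}")
    ```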

  10. Why Many “Prompt Problems” Are Actually Context Problems (Retrieval/OCR)

    Jake emphasizes that models often fail because the provided context is incomplete, wrong, or unreadable—not because the instructions are bad. He points to retrieval quality and OCR errors as frequent root causes and recommends inspecting the exact input the model sees.
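    A cheap diagnostic, sketched with stand-in helpers: dump the exact string the model receives before touching the instructions:

    ```python
    import logging

    logging.basicConfig(level=logging.DEBUG)

    def call_llm(prompt: str) -> str:
        return "..."  # stand-in for a real model call

    def retrieve(question: str) -> list[str]:
        """Stand-in retrieval step; real search/OCR output lands here."""
        return ["S 12.3 Indemniﬁcation sha11 app1y ..."]  # typical OCR damage

    def answer(question: str) -> str:
        docs = retrieve(question)
        prompt = "Context:\n" + "\n\n".join(docs) + f"\n\nQuestion: {question}"
        # Log the exact model input: bad OCR and missing passages show up
        # here long before they surface as "bad answers".
        logging.debug("model input:\n%s", prompt)
        return call_llm(prompt)
    ```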

  11. Operational Discipline: Don’t Ship Until You’ve Earned Reliability

    He argues the real differentiator is willingness to iterate obsessively—often for weeks—until prompts are robust. He suggests shipping only once you’re near-perfect on large eval sets, while still setting realistic expectations with customers.

  12. Practical Prompting Tricks: Speed, Output Control, and Decomposition

    Jake shares tactical techniques to improve latency, reduce cost, and make evaluation easier. These include forcing single-token outputs with stop sequences, asking for a final answer first for easier scoring, and splitting large tasks into smaller prompts.
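    Sketches of those three tricks. The `max_tokens` and `stop` parameters mirror what most provider APIs accept; `call_llm` is a stub:

    ```python
    def call_llm(prompt: str, max_tokens: int | None = None,
                 stop: list[str] | None = None) -> str:
        """Stand-in for a real API call that forwards max_tokens/stop."""
        return "YES"

    # Trick 1: force a one-token verdict -- fast, cheap, trivially scorable.
    verdict = call_llm(
        "Does this clause contain an indemnity? Answer YES or NO.\n<clause>",
        max_tokens=1,
        stop=["\n"],
    )

    # Trick 2: put the answer on the first line, reasoning after,
    # so an eval can grade line one and ignore the rest.
    response = call_llm(
        "First line must be 'ANSWER: yes' or 'ANSWER: no'. Then explain.\n<clause>"
    )
    answer_line = response.splitlines()[0]

    # Trick 3: decompose -- one small prompt per clause instead of one
    # giant prompt over the whole contract.
    clauses = ["clause one ...", "clause two ..."]
    verdicts = [call_llm(f"Indemnity? YES or NO.\n{c}", max_tokens=1)
                for c in clauses]
    ```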

  13. Reinforcement Fine-Tuning, Model Selection, and Multi-Model Pipelines

    He closes by recommending reinforcement fine-tuning (RFT) as more effective than older fine-tuning approaches: it often needs fewer examples, but it requires outputs that can be graded objectively. He also advocates mixing models across steps to balance cost and accuracy, potentially using a different model per micro-step.
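    Per-step routing can be as simple as a lookup table; the model names and step split below are illustrative, not recommendations from the talk:

    ```python
    def call_model(model: str, prompt: str) -> str:
        """Stand-in for a provider call dispatched by model name."""
        return f"[{model}] ..."

    # Cheap, fast model for high-volume micro-steps; a stronger,
    # pricier model reserved for the final synthesis.
    MODEL_FOR_STEP = {
        "filter": "small-fast-model",        # hypothetical model names
        "extract": "small-fast-model",
        "synthesize": "large-accurate-model",
    }

    def run_step(step: str, prompt: str) -> str:
        return call_model(MODEL_FOR_STEP[step], prompt)
    ```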
