CHAPTERS
Why Casetext Went All-In on GPT-4
Jake Heller explains his background founding Casetext and how early access to GPT-4 changed what the company believed was possible in legal AI. He frames GPT-4 as the first model that could perform complex legal work at roughly human-level quality while being much faster and easier to scale.
- •Casetext’s long focus on AI + law led to early GPT-4 access
- •GPT-4 crossed a threshold: complex legal tasks at near human reliability
- •Speed and parallelism made AI assistance economically compelling
- •Contrast with GPT-3/3.5 limitations in legal reasoning
From “Nice Demo” to Product Pivot: Building CoCounsel
He describes the internal realization that GPT-4 enabled the product customers had been requesting for years: a true AI assistant for lawyers. That realization triggered a major company pivot to build CoCounsel around GPT-4 capabilities.
- •Customers had long wanted an ‘AI assistant’ but it was previously infeasible
- •GPT-4 made the assistant concept viable in practice
- •CoCounsel positioned as the first AI assistant for lawyers (in their market)
- •Success came from aligning tightly with real customer demand
Validation Moments and Market Impact (Bar Exam + Adoption)
Jake highlights concrete proof points that increased confidence in GPT-4’s legal aptitude, including their bar exam performance research. He ties these results to why customers trusted the system enough to adopt it for real work.
- •GPT-3.5 vs GPT-4 bar exam jump (10th to ~90th percentile)
- •Emphasis that the exam wasn’t in the training set (per their claim)
- •Early performance gains translated into real customer traction
- •Benchmarks helped justify a full product bet
Acquisition by Thomson Reuters and Continued Iteration
He briefly covers the Thomson Reuters acquisition and uses it to contextualize why these lessons matter at scale. The team continued refining “prompt engineering” practices post-acquisition as usage expanded.
- •Acquired by Thomson Reuters in 2023
- •Scaling requirements intensified reliability and quality needs
- •Prompting practices became an ongoing engineering discipline
- •Lessons learned generalized beyond legal tech
Context Engineering vs Prompt Engineering: The Framing
Jake challenges the terminology, arguing most useful prompts are ‘instructions + context.’ He suggests the label matters less than mastering the two components and how they interact.
- •Prompts of consequence = instruction + context
- •“Context engineering” emphasizes only one half of the equation
- •Terminology is secondary to practical craft
- •Sets up later focus on why context quality dominates outcomes
Design the Customer Experience First (Skills, Tools, and UX)
He outlines the top-down approach they used: start by defining the user experience and the “skills” the product should offer. CoCounsel is described as a chat interface with tool-like skills mapped to what lawyers do in practice.
- •Begin with the ideal customer experience, not the prompt text
- •Model the product as a suite of skills/tools (research, doc review, contract edits)
- •Skills map well to real professional workflows
- •Different apps may use different UIs, but UX clarity drives architecture
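
The "suite of skills" idea above can be sketched as a small registry that a chat front end dispatches into. This is a minimal illustration, not CoCounsel's actual design: the `Skill` type, the three handler functions, and `dispatch` are all hypothetical names, and the handlers return placeholder strings where a real product would run a pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Skill:
    """One user-facing capability (hypothetical type for illustration)."""
    name: str
    description: str
    run: Callable[[str], str]


# Placeholder handlers: a real skill would run its own multi-step pipeline.
def legal_research(request: str) -> str:
    return f"[research memo for: {request}]"


def document_review(request: str) -> str:
    return f"[review summary for: {request}]"


def contract_edit(request: str) -> str:
    return f"[redlined contract for: {request}]"


SKILLS: Dict[str, Skill] = {
    skill.name: skill
    for skill in (
        Skill("legal_research", "Answer a legal question with sources", legal_research),
        Skill("document_review", "Summarize and flag issues in documents", document_review),
        Skill("contract_edit", "Propose edits to a contract", contract_edit),
    )
}


def dispatch(skill_name: str, request: str) -> str:
    """Route a request to the named skill, failing loudly on unknown names."""
    if skill_name not in SKILLS:
        raise KeyError(f"unknown skill: {skill_name}")
    return SKILLS[skill_name].run(request)
```

Keeping each skill behind one function boundary means the UI can change without touching the pipelines behind it.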
Decompose Each Skill Like the World’s Best Professional Would
For each skill, the team asked: how would the best lawyer in the world do this with unlimited time? He walks through a legal research workflow (clarify → generate queries → search → review → notes → final answer) as the blueprint for the AI pipeline.
- •Use ‘world’s best human’ process as the initial task architecture
- •Example workflow: clarify question, generate many queries, run searches, review results, compile notes, produce final output
- •Break work into micro-steps that become code or prompts
- •Start with human workflow, then extend beyond it only when needed
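
The research workflow above maps naturally onto one function per micro-step. The sketch below assumes two injected callables, `llm` (prompt in, text out) and `search` (query in, document texts out); both are hypothetical stand-ins for whatever model client and search backend are in use, and the prompt wording is illustrative only.

```python
from typing import Callable, List

LLM = Callable[[str], str]           # hypothetical: prompt in, completion out
Search = Callable[[str], List[str]]  # hypothetical: query in, document texts out


def research_pipeline(question: str, llm: LLM, search: Search) -> str:
    # Step 1: clarify the question.
    clarified = llm(f"Restate this legal question precisely:\n{question}")

    # Step 2: generate many search queries, one per line.
    raw_queries = llm(f"Write search queries, one per line, for:\n{clarified}")
    queries = [line.strip() for line in raw_queries.splitlines() if line.strip()]

    # Step 3: run every search, dropping duplicate documents but keeping order.
    documents = list(dict.fromkeys(doc for q in queries for doc in search(q)))

    # Steps 4-5: review each result and compile notes, one prompt per document.
    notes = [
        llm(
            "Take notes on this source for the question.\n"
            f"Question: {clarified}\nSource: {doc}"
        )
        for doc in documents
    ]

    # Step 6: produce the final answer from the notes alone.
    joined_notes = "\n".join(f"- {note}" for note in notes)
    return llm(
        "Answer the question using only these notes.\n"
        f"Question: {clarified}\nNotes:\n{joined_notes}"
    )
```

Because each step is a separate prompt, each one can be tested and tuned on its own.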
Linear Pipelines vs Agentic Loops (When to Program vs Let It Roam)
Jake distinguishes predictable, linear tasks from tasks requiring iteration and backtracking. If the process is stable, hard-code step-by-step execution; if success depends on exploration, use more agentic looping behavior.
- •Linear tasks: implement as deterministic step functions (step1/step2/step3)
- •Agentic tasks: allow loops (e.g., revise searches when results are weak)
- •Choose agency level based on task characteristics, not hype
- •Architecture decisions precede prompt wording
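
The two shapes can be contrasted in a few lines. This is a sketch under assumptions: `run_linear` hard-codes a fixed step order, while `search_with_retries` loops and revises the query when results are weak, with a round cap so it always terminates. The callables `search`, `is_good_enough`, and `revise` are hypothetical and would typically wrap a search backend and model calls.

```python
from typing import Callable, List, Sequence, Tuple


def run_linear(steps: Sequence[Callable[[str], str]], value: str) -> str:
    """Linear pipeline: every step runs exactly once, in a fixed order."""
    for step in steps:
        value = step(value)
    return value


def search_with_retries(
    query: str,
    search: Callable[[str], List[str]],
    is_good_enough: Callable[[List[str]], bool],
    revise: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Agentic loop: revise the query while results are weak, up to a cap."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    for round_number in range(1, max_rounds + 1):
        results = search(query)
        # Stop when results are acceptable or the round budget is spent.
        if is_good_enough(results) or round_number == max_rounds:
            return query, results
        query = revise(query, results)
    raise AssertionError("unreachable: the loop always returns")
```

The cap on rounds is a design choice in this sketch: it bounds cost and latency at the price of sometimes returning weak results.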
Making Prompts Actually Work: Evals-Driven Iteration
He presents a practical workflow: write a best-guess prompt, create a small set of tests (evals), then iterate relentlessly until it passes. Most teams quit at partial success; his argument is that reliability comes from grinding through failures with disciplined evaluation.
- •Start with a ‘best guess’ prompt, then define ~10 evals immediately
- •Use tools (e.g., Promptfoo, Vellum) to run prompts against eval sets
- •Expect early prompts to fail; iterate until 10/10 passes
- •Scale evals from 10 → 50 → 100 → 1000 (including real user inputs)
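
The iterate-until-it-passes loop above needs only a tiny harness to get started. The sketch below is a hand-rolled stand-in, not the API of any eval tool: `EvalCase`, `run_evals`, and the `model` callable are hypothetical names, and the scoring rule (case-insensitive exact match) is an assumption that suits short, objective answers.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class EvalCase:
    input_text: str
    expected: str


def run_evals(
    prompt_template: str,
    cases: List[EvalCase],
    model: Callable[[str], str],
) -> Tuple[int, List[EvalCase]]:
    """Run every case through the prompt; return (pass count, failing cases)."""
    failures = []
    for case in cases:
        # The template is assumed to contain a single {input} placeholder.
        output = model(prompt_template.format(input=case.input_text))
        if output.strip().lower() != case.expected.strip().lower():
            failures.append(case)
    return len(cases) - len(failures), failures
```

A typical loop is: run the evals, read the failing cases, change the prompt, and rerun until every case passes, then add more cases.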
Why Many “Prompt Problems” Are Actually Context Problems (Retrieval/OCR)
Jake emphasizes that models often fail because the provided context is incomplete, wrong, or unreadable, not because the instructions are bad. He points to retrieval quality and OCR errors as frequent root causes and recommends inspecting the exact input the model sees.
- •Model outputs can be reasonable given flawed/missing context
- •Instruct the model to answer only from the provided context to reduce hallucination
- •If retrieval is poor, fix retrieval rather than endlessly rewriting prompts
- •OCR and document cleanliness can dominate accuracy; read inputs verbatim as the model sees them
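
Two of the points above lend themselves to a short sketch: building the prompt so the model is told to answer only from the supplied passages, and flagging passages that look like OCR debris before they reach the model. Both functions are hypothetical, and the garbling heuristic (share of unexpected symbols above a threshold) is a crude assumption, not a substitute for reading the input yourself.

```python
from typing import List

# Punctuation treated as normal in legal text (an assumption for this sketch).
ALLOWED_PUNCTUATION = set(".,;:'\"()-§$%/&?!")


def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """Flag empty text or text dominated by unexpected symbols."""
    if not text.strip():
        return True
    odd = sum(
        1
        for ch in text
        if not (ch.isalnum() or ch.isspace() or ch in ALLOWED_PUNCTUATION)
    )
    return odd / len(text) > threshold


def build_grounded_prompt(question: str, passages: List[str]) -> str:
    """Number the passages and instruct the model to answer only from them."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{numbered}\n\nQuestion: {question}"
    )
```

Printing the result of `build_grounded_prompt` for a failing case shows exactly what the model saw, which is often enough to tell a retrieval problem from an instruction problem.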
Operational Discipline: Don’t Ship Until You’ve Earned Reliability
He argues the real differentiator is willingness to iterate obsessively, often for weeks, until prompts are robust. He suggests shipping only once you’re near-perfect on large eval sets, while still setting realistic expectations with customers.
- •Good prompt engineers: clear writers + relentless iteration mindset
- •Iterate on instructions, model choice, and settings (e.g., temperature)
- •Use customer beta feedback to discover edge cases and expand evals
- •Target ~999/1000 passing before broader release; never promise perfection
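
Iterating over model choice and temperature, then gating release on the pass count, could look like the sketch below. Everything here is an assumption for illustration: `run_case` is a hypothetical callable that returns whether one eval case passed under a given configuration, and the default gate of 999 per 1000 mirrors the target mentioned above.

```python
from itertools import product
from typing import Callable, Iterable, Sequence, Tuple

Config = Tuple[str, float]  # (model name, temperature); names are placeholders


def pass_count(
    config: Config,
    cases: Sequence[object],
    run_case: Callable[[Config, object], bool],
) -> int:
    """Count how many eval cases pass under one configuration."""
    return sum(1 for case in cases if run_case(config, case))


def best_config(
    models: Iterable[str],
    temperatures: Iterable[float],
    cases: Sequence[object],
    run_case: Callable[[Config, object], bool],
) -> Tuple[int, Config]:
    """Try every (model, temperature) pair; return the best (passes, config)."""
    scored = [
        (pass_count(config, cases, run_case), config)
        for config in product(models, temperatures)
    ]
    # On ties, max keeps the first configuration tried.
    return max(scored, key=lambda pair: pair[0])


def ready_to_ship(passed: int, total: int, per_thousand: int = 999) -> bool:
    """Release gate using integer math to avoid floating-point surprises."""
    return total > 0 and passed * 1000 >= per_thousand * total
```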
Practical Prompting Tricks: Speed, Output Control, and Decomposition
Jake shares tactical techniques to improve latency, reduce cost, and make evaluation easier. These include forcing single-token outputs with stop sequences, asking for a final answer first for easier scoring, and splitting large tasks into smaller prompts.
- •Latency hack: constrain output to 1 token (or very few) for fast classification/scoring
- •Use stop sequences / max tokens to cut off explanations while benefiting from ‘thinking’ setup
- •For evals, request the objective/structured answer first (simplifies scoring)
- •Prefer smaller, simpler steps over one giant prompt when reliability matters
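
The single-token trick can be sketched without committing to any provider. The request below is a plain dictionary whose keys (`max_tokens`, `stop`, `temperature`) are modeled loosely on common completion APIs; treat the exact names as assumptions and check your provider's documentation. Whether "YES" and "NO" are each one token depends on the tokenizer, which is also an assumption here.

```python
from typing import Optional


def build_classifier_request(document: str, question: str) -> dict:
    """Build a request that asks for a one-word answer and cuts output short."""
    prompt = (
        f"{question}\n\nDocument:\n{document}\n\n"
        "Answer with exactly one word, YES or NO."
    )
    return {
        "prompt": prompt,
        "max_tokens": 1,     # hard cap: no room for an explanation
        "stop": ["\n"],      # stop at the first newline as a second guard
        "temperature": 0.0,  # keep classification as repeatable as possible
    }


def parse_label(raw_output: str) -> Optional[bool]:
    """Map the model's short answer to True/False, or None if unrecognized."""
    text = raw_output.strip().upper()
    if text.startswith("YES"):
        return True
    if text.startswith("NO"):
        return False
    return None
```

Putting the objective answer first also makes scoring trivial: an eval only has to compare `parse_label(output)` with the expected boolean.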
Reinforcement Fine-Tuning, Model Selection, and Multi-Model Pipelines
He closes by recommending reinforcement fine-tuning (RFT) as more effective than older fine-tuning approaches, often requiring fewer examples but clearer objective grading. He also advocates mixing models across steps to balance cost and accuracy, potentially using a different model per micro-step.
- •Older fine-tunes often delivered minimal gains; RFT can be a step-change
- •RFT needs ~50–100 high-quality examples plus objective evaluation criteria
- •At scale, consider different fine-tuned models per micro-step/prompt
- •Try multiple base models per step to optimize cost/performance (don’t default to one model everywhere)
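
Choosing a model per micro-step can be reduced to a simple rule once each candidate has been run against that step's evals: take the cheapest model that clears the accuracy bar. The sketch below assumes that rule; the `Trial` record and `pick_models` function are hypothetical, and any accuracy or cost figures fed into it should come from your own measurements.

```python
from typing import Dict, Iterable, NamedTuple


class Trial(NamedTuple):
    """Measured result of one model on one pipeline step (hypothetical record)."""
    step: str
    model: str
    accuracy: float       # fraction of that step's evals passed
    cost_per_call: float  # in any consistent unit


def pick_models(trials: Iterable[Trial], min_accuracy: float) -> Dict[str, str]:
    """For each step, pick the cheapest model meeting the accuracy bar.

    Steps where no model qualifies are left out of the result, which signals
    that the step needs more prompt work, fine-tuning, or a stronger model.
    """
    cheapest: Dict[str, Trial] = {}
    for trial in trials:
        if trial.accuracy < min_accuracy:
            continue
        current = cheapest.get(trial.step)
        if current is None or trial.cost_per_call < current.cost_per_call:
            cheapest[trial.step] = trial
    return {step: trial.model for step, trial in cheapest.items()}
```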
