Y Combinator
Why Vertical LLM Agents Are The New $1 Billion SaaS Opportunities
CHAPTERS
- 0:00 – 0:29
Early GPT‑4 access: realizing a once-in-a-decade advantage
Jake recounts seeing an early GPT‑4 model under NDA and immediately recognizing it could compress days of legal work into minutes. That “godlike” first demo created urgency across the company to sprint ahead of the market.
- •First exposure to GPT‑4 felt qualitatively different from prior AI
- •Tasks that took a full day could be done in ~90 seconds
- •Company-wide urgency: months of intense work and little sleep
- •Belief that early access created a temporary but massive lead
- 0:29 – 3:51
Casetext’s outcome and why vertical AI agents are the new playbook
Garry frames Casetext as a flagship vertical AI agent story: a long legal-tech journey culminating in a rapid CoCounsel breakout and a $650M exit to Thomson Reuters. The hosts position vertical agents as a dominant new pattern among YC startups.
- •Casetext’s pre-LLM progress vs. post-LLM step-change
- •Vertical AI agents proliferating across new startups
- •CoCounsel as a mission-critical deployed agent in legal
- •Goal of the episode: unpack how this was built
- 3:51 – 6:12
Origin story: legal work is information retrieval under terrible tools
Jake explains the pain that pushed him to build for law: modern consumer search was effortless, but finding crucial evidence or precedent could take days. Legal workflows historically involved manual reading of boxes of documents and clunky research systems.
- •Mismatch between consumer-grade search and legal research tools
- •Discovery/document review as painstaking manual work
- •Legal research evolved from libraries to early web tools, still clunky
- •Jake’s CS background made inefficiencies impossible to ignore
- 6:12 – 7:24
First product thesis failed: why UGC doesn’t work for lawyers
Early Casetext tried a Stack Overflow/Wikipedia-style model—lawyers annotating case law to create better content. It failed because lawyers’ incentives and time constraints (billable hours) don’t match typical UGC contributors.
- •Initial plan: better tech + lawyer-generated annotations/content
- •Lawyers won’t donate time: billing model and time scarcity
- •Pivot forced by fundamental mismatch in contributor incentives
- •Shift toward automation/NLP to replicate content advantages
- 7:24 – 9:13
Pre-LLM innovation: NLP-driven incremental workflow improvements (and market resistance)
Casetext rebuilt around NLP/ML to deliver better experiences like citation-network recommendations and “missing case” detection. But incremental improvements were easy for law firms to ignore—especially when partners were already highly paid and risk-averse.
- •Applied recommendation-style models to legal citation graphs
- •Built tools that checked work and suggested missing authorities
- •Incremental gains weren’t compelling enough to force adoption
- •Law firms resisted change (risk + billable-hour incentives)
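A minimal sketch of how a "missing case" suggestion could work on a citation graph, assuming a plain co-citation count. The summary does not describe Casetext's actual model, so the function name `suggest_missing_cases`, the scoring rule, and the toy data are all hypothetical.

```python
from collections import Counter

def suggest_missing_cases(brief_citations, corpus, top_k=3):
    """Rank cases that often appear alongside the brief's citations but are absent from it.

    Assumption: each corpus entry is the list of cases cited by one prior document.
    """
    cited = set(brief_citations)
    scores = Counter()
    for doc_citations in corpus:
        doc = set(doc_citations)
        overlap = len(doc & cited)
        if overlap == 0:
            continue  # this document shares nothing with the brief
        for case in doc - cited:
            # Weight each candidate by how much the document overlaps the brief.
            scores[case] += overlap
    return [case for case, _ in scores.most_common(top_k)]

# Toy data: letters stand in for case names.
corpus = [
    ["A", "B", "C"],
    ["A", "B", "C", "D"],
    ["A", "C"],
    ["E", "F"],
]
suggestions = suggest_missing_cases(["A", "B"], corpus)
print(suggestions)  # ['C', 'D']
```

Case C ranks first because it is cited in every document that overlaps the brief. This kind of signal improves an existing workflow rather than replacing it, which fits the chapter's point about incremental gains.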
- 9:13 – 11:21
ChatGPT changes the market psychology: from “don’t change” to “must adapt”
ChatGPT’s release made lawyers viscerally feel that their work would change, even if they didn’t know how. That shift unlocked inbound demand and attention Casetext never got from incremental improvements alone.
- •Market perception flipped once generative AI became visible
- •Previously indifferent buyers suddenly felt existential urgency
- •Inbound interest arrived even before CoCounsel’s public launch
- •Non-incremental change forced firms to pay attention
- 11:21 – 12:44
What real product-market fit looked like for CoCounsel
Jake contrasts earlier traction with the explosive PMF of CoCounsel: servers failing, frantic hiring, and mainstream media attention. Marc Andreessen's description of product-market fit matched their reality once CoCounsel launched.
- •Prior to CoCounsel: real revenue and growth, but not “PMF heat”
- •CoCounsel launch triggered overload: demand outpaced capacity
- •Couldn’t hire support/sales fast enough
- •Visibility expanded from legal press to CNN/MSNBC
- 12:44 – 15:04
CoCounsel’s core concept: an AI ‘new team member’ for legal work
Jake describes CoCounsel as an AI legal assistant that could be delegated associate-level tasks: document review, summarization, and research memos with citations. Early customers used it under NDA, experiencing LLM capability before ChatGPT existed.
- •Vision: conversational assistant that executes real legal tasks
- •Examples: fraud detection across massive docs, summarization, memos
- •Early deployment with select customers under extended NDA
- •Pre-ChatGPT customers encountered LLMs for the first time via CoCounsel
- 15:04 – 18:22
Deep founder mode: convincing a 120-person company (and the board) to pivot fast
After a decade building the business, Jake faced skepticism when proposing a full-company pivot. He led by building the first prototype himself (partly due to NDA constraints) and used live customer reactions to rapidly align the team.
- •Internal skepticism: ‘things are working, why blow it up?’
- •Context: $15–$20M ARR and 70–80% YoY growth made pivot harder
- •Jake built the initial prototype personally; NDA limited early access
- •Customer demos created instant belief (and visible existential reactions)
- 18:22 – 21:10
Why GPT‑4 was the turning point: from plausible text to reliable legal reasoning
Earlier models (GPT‑2, GPT‑3, GPT‑3.5) produced plausible lawyer-like prose but hallucinated too much for legal work. GPT‑4 crossed a reliability threshold, demonstrated by a dramatic jump in bar-exam performance and better grounding in provided sources.
- •GPT‑3.5 sounded good but made up facts, which made it unusable for legal work
- •Legal requires correctness and careful assumptions
- •Bar exam benchmark: ~10th percentile (GPT‑3.5) to ~90th percentile (GPT‑4)
- •Early prompting focused on accurate citations and staying in-context
- 21:10 – 24:27
Prompt engineering as workflow design: decomposing ‘skills’ into many steps
Casetext built CoCounsel ‘skills’ by working backward from the desired deliverable (e.g., a research memo) and modeling how a top lawyer would do the job. Complex tasks became chains of prompts—often dozens—each with clear definitions of “good output.”
- •Start from user’s desired output, then reverse-engineer steps
- •Model the best attorney’s process (queries → results → reading → outline → memo)
- •Break big tasks into many smaller prompts/operations
- •‘Skills’ as productized, repeatable agent capabilities
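The decomposition described above (queries → results → reading → outline → memo) can be sketched as a chain of small steps. This is an illustration under stated assumptions, not CoCounsel's implementation: `fake_llm` is a canned stand-in for a real model call, retrieval is a plain keyword match, and every step name is hypothetical.

```python
def fake_llm(prompt):
    """Stand-in for a real model call so the sketch runs offline (hypothetical)."""
    if prompt.startswith("QUERIES"):
        return "statute of limitations\ntolling"
    if prompt.startswith("NOTE"):
        return "note: " + prompt.split("SOURCE:", 1)[1].strip()
    if prompt.startswith("OUTLINE"):
        return "1. Rule\n2. Application"
    return "MEMO based on outline"

def plan_queries(state, llm):
    raw = llm(f"QUERIES for question: {state['question']}")
    return {**state, "queries": [q.strip() for q in raw.splitlines() if q.strip()]}

def retrieve(state, llm):
    # Plain keyword match over an in-memory corpus; no model call in this step.
    hits = [d for d in state["corpus"] if any(q in d for q in state["queries"])]
    return {**state, "hits": hits}

def read_sources(state, llm):
    # One small prompt per source instead of one large prompt for everything.
    notes = [llm(f"NOTE on question {state['question']} SOURCE: {d}") for d in state["hits"]]
    return {**state, "notes": notes}

def outline(state, llm):
    return {**state, "outline": llm("OUTLINE from notes:\n" + "\n".join(state["notes"]))}

def draft_memo(state, llm):
    return {**state, "memo": llm("MEMO following outline:\n" + state["outline"])}

RESEARCH_MEMO_SKILL = [plan_queries, retrieve, read_sources, outline, draft_memo]

def run_skill(steps, state, llm):
    """Run each step in order, threading a state dict through the chain."""
    for step in steps:
        state = step(state, llm)
    return state

result = run_skill(
    RESEARCH_MEMO_SKILL,
    {
        "question": "Is the claim time-barred?",
        "corpus": [
            "The statute of limitations is two years.",
            "Damages are capped.",
            "Equitable tolling applies to minors.",
        ],
    },
    fake_llm,
)
print(result["hits"])
print(result["memo"])
```

Because each step has a narrow input and output, each can be given its own definition of "good output" and checked in isolation.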
- 24:27 – 28:10
Test-driven prompting and the ‘not a GPT wrapper’ argument
Jake explains that reliability comes from evals: writing gold-standard tests for each step and iterating prompts until they pass at scale. He also argues defensibility comes from all the surrounding system work—data, integrations, OCR, edge cases—not just the LLM call.
- •Test-driven development applied to prompts (hundreds/thousands of tests)
- •Prompts can regress; evals prevent breaking previously-working behavior
- •Beyond “GPT wrapper”: proprietary data, legal DMS integrations, OCR pipelines
- •Handling real-world edge cases is a major source of IP and performance
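Test-driven prompting can be sketched as a small harness that runs one prompt version against a fixed set of cases and reports the pass rate. This is a minimal illustration, not the harness Jake describes: `EvalCase`, `run_evals`, the two prompt versions, and the canned `fake_llm` are assumptions made so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class EvalCase:
    name: str
    input_text: str
    check: Callable[[str], bool]  # returns True when the output is acceptable

def run_evals(prompt_template: str, llm: Callable[[str], str],
              cases: List[EvalCase]) -> Tuple[float, List[str]]:
    """Return (pass rate, names of failing cases) for one prompt version."""
    failures = []
    for case in cases:
        output = llm(prompt_template.format(input=case.input_text))
        if not case.check(output):
            failures.append(case.name)
    return (len(cases) - len(failures)) / len(cases), failures

def fake_llm(prompt: str) -> str:
    """Canned stand-in for a model: it only gives a crisp answer when told to."""
    if "Answer YES or NO only" not in prompt:
        return "It depends on the jurisdiction."
    return "NO" if "overruled" in prompt else "YES"

CASES = [
    EvalCase("overruled_case", "Smith v. Jones was overruled in 2010.",
             lambda out: out.strip() == "NO"),
    EvalCase("valid_case", "Doe v. Roe has been followed consistently.",
             lambda out: out.strip() == "YES"),
]

PROMPT_V1 = "Is this case still good law? {input}"
PROMPT_V2 = "Is this case still good law? Answer YES or NO only. {input}"

rate_v1, failed_v1 = run_evals(PROMPT_V1, fake_llm, CASES)
rate_v2, failed_v2 = run_evals(PROMPT_V2, fake_llm, CASES)
print(rate_v1, failed_v1)  # 0.0 ['overruled_case', 'valid_case']
print(rate_v2, failed_v2)  # 1.0 []
```

Rerunning the same cases after every prompt edit is what catches regressions: a change that fixes one case but breaks another shows up as a named failure instead of a surprise in production.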
- 28:10 – 30:48
Getting from 70% to 100% accuracy: trust, risk, and first-impression economics
They discuss the gap between demo-level performance and mission-critical correctness. Jake emphasizes systematic root-causing with evals, improving instructions/context, and designing for conservative users who abandon tools after one bad experience.
- •Law’s tolerance for error is extremely low; hallucinations are unacceptable
- •Evals reveal failure patterns; prompts/instructions are tightened iteratively
- •Passing a robust test suite increases confidence on real distributions
- •First bad experience can permanently lose a busy professional user
- 30:48 – 35:03
o1 and ‘teaching models how to think’: precision checks and domain-guided reasoning
Diana and Jake explore o1’s potential for more deliberate, System 2 style reasoning. Jake shares a precision eval in which o1 catches subtle changes to quotations and their meaning in a legal brief, something prior models missed, and discusses experimenting with injecting expert “how to think” guidance.
- •o1 shows improved thoroughness and nuance beyond math tasks
- •Example eval: detecting subtle altered quotes against source cases
- •Hypothesis: training includes step-by-step reasoning traces/monologues
- •New frontier: prompting the model’s reasoning process with expert frameworks
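To make the altered-quote eval concrete, here is a plain string-matching baseline for the same task. It is not the o1-based check discussed in the episode: it only flags quotes that differ textually from the source, and the function name `check_quotes`, the 0.8 similarity threshold, and the sample texts are assumptions.

```python
import re
from difflib import SequenceMatcher

def normalize(text):
    """Collapse whitespace and lowercase so formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()

def check_quotes(brief, source, near=0.8):
    """Label each double-quoted passage in the brief as exact, altered, or not found."""
    src = normalize(source)
    # Naive sentence split; real legal text (e.g. "v." abbreviations) needs more care.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", src) if s]
    report = []
    for quote in re.findall(r'"([^"]+)"', brief):
        q = normalize(quote)
        if q in src:
            status = "exact"
        else:
            best = max((SequenceMatcher(None, q, s).ratio() for s in sentences),
                       default=0.0)
            # A close but non-identical match suggests the quote was changed.
            status = "altered" if best >= near else "not found"
        report.append((quote, status))
    return report

source = ("The defendant shall not be liable for indirect losses. "
          "Costs were awarded to the plaintiff.")
brief = ('The brief quotes "the defendant shall be liable for indirect losses." '
         'and "costs were awarded to the plaintiff" '
         'and "the contract was void from the outset."')

report = check_quotes(brief, source)
for quote, status in report:
    print(status, "|", quote)
```

The first quote drops a single "not" and reverses the meaning. A character-level check can flag that the text changed, but judging whether the meaning changed is the part that calls for a reasoning model.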
- 35:03 – 37:05
Closing: debunking tropes and the coming wave of vertical billion-dollar agents
They conclude that many industries still underestimate what changed, and that “70% demos” can be engineered into high-reliability products. Jake and Diana encourage founders to use evals, pursue last-mile accuracy, and build agents that make knowledge work more strategic.
- •Common misconceptions: ‘must fine-tune’ or ‘LLMs can’t be accurate’
- •There’s real alpha in engineering from 70% to near-100%
- •Vertical agents unlock massive value by replacing expensive manual work
- •Jobs evolve toward higher-leverage strategy rather than disappearing