Why Vertical LLM Agents Are The New $1 Billion SaaS Opportunities

As LLMs become exponentially better, it is clear that vertical AI agents are key to the next generation of billion-dollar SaaS companies. In this episode of the Lightcone, the hosts sit down with YC alum Jake Heller, the co-founder and CEO of Casetext (which sold to Thomson Reuters for $650 million in cash in 2023), to discuss what it takes to build a successful vertical AI company and overcome resistance from industry veterans and skeptics.

Chapters (Powered by https://bit.ly/chapterme-yc):
00:00 Coming Up
01:40 Building a successful vertical AI company
06:05 The unique challenges of law and AI
09:24 The turning point for lawyers with ChatGPT
11:25 Finding product market fit in legal
15:04 Entering deep founder mode
20:40 Approaching prompt engineering step by step
25:05 Going beyond GPT wrappers
28:10 Aiming for 100% accuracy
30:48 Thoughts on o1’s capabilities
36:42 Outro

Jake Heller (guest), Garry Tan (host), Jared Friedman (host), Diana Hu (host)

Oct 4, 2024 · 37m

CHAPTERS

0:00 – 0:29
Early GPT‑4 access: realizing a once-in-a-decade advantage
Jake recounts seeing an early GPT‑4 model under NDA and immediately recognizing it could compress days of legal work into minutes. That “godlike” first demo created urgency across the company to sprint ahead of the market.
- First exposure to GPT‑4 felt qualitatively different from prior AI
- Tasks that took a full day could be done in ~90 seconds
- Company-wide urgency: months of intense work and little sleep
- Belief that early access created a temporary but massive lead
0:29 – 3:51
Casetext’s outcome and why vertical AI agents are the new playbook
Garry frames Casetext as a flagship vertical AI agent story: a long legal-tech journey culminating in a rapid CoCounsel breakout and a $650M exit to Thomson Reuters. The hosts position vertical agents as a dominant new YC/company pattern.
- Casetext’s pre-LLM progress vs. post-LLM step-change
- Vertical AI agents proliferating across new startups
- CoCounsel as a mission-critical deployed agent in legal
- Goal of the episode: unpack how this was built
3:51 – 6:12
Origin story: legal work is information retrieval under terrible tools
Jake explains the pain that pushed him to build for law: modern consumer search was effortless, but finding crucial evidence or precedent could take days. Legal workflows historically involved manual reading of boxes of documents and clunky research systems.
- Mismatch between consumer-grade search and legal research tools
- Discovery/document review as painstaking manual work
- Legal research evolved from libraries to early web tools, still clunky
- Jake’s CS background made inefficiencies impossible to ignore
6:12 – 7:24
First product thesis failed: why UGC doesn’t work for lawyers
Early Casetext tried a Stack Overflow/Wikipedia-style model—lawyers annotating case law to create better content. It failed because lawyers’ incentives and time constraints (billable hours) don’t match typical UGC contributors.
- Initial plan: better tech + lawyer-generated annotations/content
- Lawyers won’t donate time: billing model and time scarcity
- Pivot forced by fundamental mismatch in contributor incentives
- Shift toward automation/NLP to replicate content advantages
7:24 – 9:13
Pre-LLM innovation: NLP-driven incremental workflow improvements (and market resistance)
Casetext rebuilt around NLP/ML to deliver better experiences like citation-network recommendations and “missing case” detection. But incremental improvements were easy for law firms to ignore—especially when partners were already highly paid and risk-averse.
- Applied recommendation-style models to legal citation graphs
- Built tools that checked work and suggested missing authorities
- Incremental gains weren’t compelling enough to force adoption
- Law firms resisted change (risk + billable-hour incentives)
9:13 – 11:21
ChatGPT changes the market psychology: from “don’t change” to “must adapt”
ChatGPT’s release made lawyers viscerally feel that their work would change, even if they didn’t know how. That shift unlocked inbound demand and attention Casetext never got from incremental improvements alone.
- Market perception flipped once generative AI became visible
- Previously indifferent buyers suddenly felt existential urgency
- Inbound interest arrived even before CoCounsel’s public launch
- Non-incremental change forced firms to pay attention
11:21 – 12:44
What real product-market fit looked like for CoCounsel
Jake contrasts earlier traction with the explosive PMF of CoCounsel: servers failing, frantic hiring, and mainstream media attention. The Marc Andreessen PMF description matched their reality once CoCounsel launched.
- Prior to CoCounsel: real revenue and growth, but not “PMF heat”
- CoCounsel launch triggered overload: demand outpaced capacity
- Couldn’t hire support/sales fast enough
- Visibility expanded from legal press to CNN/MSNBC
12:44 – 15:04
CoCounsel’s core concept: an AI ‘new team member’ for legal work
Jake describes CoCounsel as an AI legal assistant that could be delegated associate-level tasks: document review, summarization, and research memos with citations. Early customers used it under NDA, experiencing LLM capability before ChatGPT existed.
- Vision: conversational assistant that executes real legal tasks
- Examples: fraud detection across massive docs, summarization, memos
- Early deployment with select customers under extended NDA
- Pre-ChatGPT customers encountered LLMs for the first time via CoCounsel
15:04 – 18:22
Deep founder mode: convincing a 120-person company (and the board) to pivot fast
After a decade building the business, Jake faced skepticism when proposing a full-company pivot. He led by building the first prototype himself (partly due to NDA constraints) and used live customer reactions to rapidly align the team.
- Internal skepticism: ‘things are working, why blow it up?’
- Context: $15–$20M ARR and 70–80% YoY growth made pivot harder
- Jake built the initial prototype personally; NDA limited early access
- Customer demos created instant belief (and visible existential reactions)
18:22 – 21:10
Why GPT‑4 was the turning point: from plausible text to reliable legal reasoning
Earlier models (GPT‑2, GPT‑3, and GPT‑3.5) produced plausible lawyer-like prose but hallucinated too much for legal work. GPT‑4 crossed a reliability threshold, demonstrated by a dramatic jump in bar-exam performance and better grounding in provided sources.
- GPT‑3.5 sounded good but made up facts, which made it unusable for legal work
- Legal requires correctness and careful assumptions
- Bar exam benchmark: ~10th percentile (GPT‑3.5) to ~90th percentile (GPT‑4)
- Early prompting focused on accurate citations and staying in-context
21:10 – 24:27
Prompt engineering as workflow design: decomposing ‘skills’ into many steps
Casetext built CoCounsel ‘skills’ by working backward from the desired deliverable (e.g., a research memo) and modeling how a top lawyer would do the job. Complex tasks became chains of prompts—often dozens—each with clear definitions of “good output.”
- Start from user’s desired output, then reverse-engineer steps
- Model the best attorney’s process (queries → results → reading → outline → memo)
- Break big tasks into many smaller prompts/operations
- ‘Skills’ as productized, repeatable agent capabilities
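The decomposition described above can be sketched as a chain of small prompts, each feeding the next. This is only an illustration under stated assumptions: the `Step` and `run_skill` names, the prompt wording, and the `echo_llm` stand-in are hypothetical and are not taken from CoCounsel. A real system would replace `echo_llm` with an actual model call and would likely use search tools rather than a prompt for the retrieval step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: a model is any function from a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class Step:
    name: str      # key under which this step's output is stored
    template: str  # prompt template; may reference "task" or earlier step names


def run_skill(steps: List[Step], task: str, llm: LLM) -> Dict[str, str]:
    """Run the steps in order, making each output available to later prompts."""
    context: Dict[str, str] = {"task": task}
    for step in steps:
        prompt = step.template.format(**context)
        context[step.name] = llm(prompt)
    return context


# Hypothetical steps mirroring queries -> results -> reading -> outline -> memo.
RESEARCH_MEMO = [
    Step("queries", "List search queries for this question: {task}"),
    Step("results", "Pick the most relevant results for: {queries}"),
    Step("notes", "Read these results and take notes: {results}"),
    Step("outline", "Outline a memo on {task} using these notes: {notes}"),
    Step("memo", "Write the memo from this outline: {outline}"),
]


def echo_llm(prompt: str) -> str:
    """Offline stand-in for a model: returns a short label of the prompt."""
    return "[output for: " + prompt[:30] + "]"


outputs = run_skill(RESEARCH_MEMO, "statute of limitations for fraud", echo_llm)
print(outputs["memo"])
```

Keeping each step small is what makes it possible to define "good output" per step and to test steps in isolation.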
24:27 – 28:10
Test-driven prompting and the ‘not a GPT wrapper’ argument
Jake explains that reliability comes from evals: writing gold-standard tests for each step and iterating prompts until they pass at scale. He also argues defensibility comes from all the surrounding system work—data, integrations, OCR, edge cases—not just the LLM call.
- Test-driven development applied to prompts (hundreds/thousands of tests)
- Prompts can regress; evals prevent breaking previously-working behavior
- Beyond “GPT wrapper”: proprietary data, legal DMS integrations, OCR pipelines
- Handling real-world edge cases is a major source of IP and performance
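A minimal sketch of test-driven prompting, assuming each prompt version can be wrapped as a function from input text to output text. The `EvalCase`, `pass_rate`, and `regressions` names are hypothetical, and the two "prompt versions" below are deterministic regex stand-ins so the example runs without a model. The point is the regression check: a new version that looks fine on some cases can still break one that used to pass.

```python
import re
from typing import Callable, List, NamedTuple

StepFn = Callable[[str], str]


class EvalCase(NamedTuple):
    input_text: str
    expected: str  # gold-standard answer written by a domain expert


def pass_rate(step: StepFn, cases: List[EvalCase]) -> float:
    """Fraction of cases where the step's output matches the gold answer."""
    passed = sum(1 for c in cases if step(c.input_text) == c.expected)
    return passed / len(cases)


def regressions(old: StepFn, new: StepFn, cases: List[EvalCase]) -> List[str]:
    """Inputs the old version got right that the new version gets wrong."""
    return [
        c.input_text
        for c in cases
        if old(c.input_text) == c.expected and new(c.input_text) != c.expected
    ]


# Stand-ins for two prompt versions of a "extract the decision year" step.
def version_1(text: str) -> str:
    match = re.search(r"\((\d{4})\)", text)
    return match.group(1) if match else ""


def version_2(text: str) -> str:
    match = re.search(r"(19\d{2})", text)  # silently assumes 20th-century cases
    return match.group(1) if match else ""


CASES = [
    EvalCase("Roe v. Wade, 410 U.S. 113 (1973)", "1973"),
    EvalCase("Brown v. Board of Education, 347 U.S. 483 (1954)", "1954"),
    EvalCase("Marbury v. Madison, 5 U.S. 137 (1803)", "1803"),
]

print(pass_rate(version_1, CASES))
print(regressions(version_1, version_2, CASES))
```

Running the full suite on every prompt change is what catches the Marbury case here before it reaches a user.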
28:10 – 30:48
Getting from 70% to 100% accuracy: trust, risk, and first-impression economics
They discuss the gap between demo-level performance and mission-critical correctness. Jake emphasizes systematic root-causing with evals, improving instructions/context, and designing for conservative users who abandon tools after one bad experience.
- Law’s tolerance for error is extremely low; hallucinations are unacceptable
- Evals reveal failure patterns; prompts/instructions are tightened iteratively
- Passing a robust test suite increases confidence on real distributions
- First bad experience can permanently lose a busy professional user
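One way to picture the root-causing loop is to tag each eval case with a category and count failures per tag, so the most common failure pattern is fixed first. This is a hedged sketch: the `EvalResult` type, the tag names, the sample data, and the 0.99 threshold are all illustrative assumptions, not figures from the episode.

```python
from collections import Counter
from typing import List, NamedTuple, Tuple


class EvalResult(NamedTuple):
    case_id: str
    passed: bool
    tag: str  # hypothetical category label assigned when the case was written


def failure_patterns(results: List[EvalResult]) -> List[Tuple[str, int]]:
    """Count failures per tag, most frequent first."""
    return Counter(r.tag for r in results if not r.passed).most_common()


def ready_to_ship(results: List[EvalResult], threshold: float = 0.99) -> bool:
    """Gate a release on the overall pass rate (threshold is an assumption)."""
    rate = sum(r.passed for r in results) / len(results)
    return rate >= threshold


# Made-up results for illustration only.
SAMPLE = [
    EvalResult("c1", True, "clean_text"),
    EvalResult("c2", False, "scanned_pdf"),
    EvalResult("c3", True, "clean_text"),
    EvalResult("c4", False, "ambiguous_date"),
    EvalResult("c5", False, "scanned_pdf"),
    EvalResult("c6", True, "clean_text"),
    EvalResult("c7", True, "scanned_pdf"),
    EvalResult("c8", True, "ambiguous_date"),
]

print(failure_patterns(SAMPLE))
print(ready_to_ship(SAMPLE))
```

A high threshold reflects the first-impression point above: a tool that fails a conservative user once may not get a second chance.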
30:48 – 35:03
o1 and ‘teaching models how to think’: precision checks and domain-guided reasoning
Diana and Jake explore o1’s potential for more deliberate reasoning (system-two style). Jake shares a precision eval where o1 catches subtle quotation/meaning changes in a legal brief—something prior models missed—and discusses experimenting with injecting expert “how to think” guidance.
- o1 shows improved thoroughness and nuance beyond math tasks
- Example eval: detecting subtle altered quotes against source cases
- Hypothesis: training includes step-by-step reasoning traces/monologues
- New frontier: prompting the model’s reasoning process with expert frameworks
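To make the altered-quote eval concrete, here is a simple string-matching baseline that flags quotations in a brief that do not appear verbatim in the source. It is not the model-based check discussed in the episode; it is the kind of deterministic helper one might use to build or sanity-check gold labels for such an eval. The function name, the sample texts, and the naive sentence splitting are assumptions (real citations such as "410 U.S. 113" would confuse the splitter).

```python
import difflib
import re
from typing import List, Tuple


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line breaks do not cause false alarms."""
    return " ".join(text.split())


def find_altered_quotes(brief: str, source: str) -> List[Tuple[str, str]]:
    """Return (quote, closest source sentence) for quotes not found verbatim."""
    src = normalize(source)
    sentences = re.split(r"(?<=[.!?])\s+", src)  # naive sentence splitter
    altered = []
    for quote in re.findall(r'"([^"]+)"', brief):
        q = normalize(quote)
        if q in src:
            continue  # exact quotation, nothing to flag
        closest = difflib.get_close_matches(q, sentences, n=1, cutoff=0.0)
        altered.append((q, closest[0] if closest else ""))
    return altered


# Invented texts for illustration; the first quote drops the word "not".
SOURCE = (
    "The court held that the defendant shall not be liable for indirect damages. "
    "Costs are awarded to the plaintiff."
)
BRIEF = (
    'As the court wrote, "the defendant shall be liable for indirect damages." '
    'It also said "Costs are awarded to the plaintiff."'
)

for quote, nearest in find_altered_quotes(BRIEF, SOURCE):
    print("ALTERED:", quote)
    print("NEAREST:", nearest)
```

Exact matching catches dropped or changed words but not quotes that are verbatim yet misleading out of context, which is where a reasoning model's closer reading would be needed.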
35:03 – 37:05
Closing: debunking tropes and the coming wave of vertical billion-dollar agents
They conclude that many industries still underestimate what changed, and that “70% demos” can be engineered into high-reliability products. Jake and Diana encourage founders to use evals, pursue the last-mile accuracy, and build agents that make knowledge work more strategic.
- Common misconceptions: ‘must fine-tune’ or ‘LLMs can’t be accurate’
- There’s real alpha in engineering from 70% to near-100%
- Vertical agents unlock massive value by replacing expensive manual work
- Jobs evolve toward higher-leverage strategy rather than disappearing
