Y Combinator
Why Vertical LLM Agents Are The New $1 Billion SaaS Opportunities
EVERY SPOKEN WORD
40 min read · 8,236 words
- 0:00 – 1:40
Coming Up
- JHJake Heller
This is our first ever experience talking to this godlike feeling, you know, AI that was all of a sudden doing these tasks that would take me, when I practiced, like a whole day, and it's being done in a minute and a half. The whole company, all 120 of us did not sleep for those, you know, months before GPT-4. We felt like we had this amazing opportunity to run far ahead of the market.
- GTGarry Tan
That's why you're the first man on the moon.
- JHJake Heller
Yeah. (laughs)
- GTGarry Tan
Welcome back to another episode of The Light Cone. I'm Garry. This is Jared and Diana. Harj is out, but he'll be back on the next one. And today, we have a very special guest, Jake Heller of Casetext. I think of Jake as a little bit like one of the first people on the surface of the moon. He created, uh, Casetext more than, I think, 11, 12 years ago, actually. And in the first 10 years, you went from zero to $100 million valuation, and then in a matter of two months after the release of GPT-4, that valuation went to a liquid exit to Thomson Reuters for $650 million. So you have a lot of lessons about how to create real value from really, like, large language models. I think you were of, um, you know, our friends in YC, one of the first people to actually realize this is a sea change and revolution. And not only that, we're gonna bet the company on it, and you were super right. So welcome, Jake.
- JHJake Heller
Happy to be here.
- GTGarry Tan
One of the cool things I think about Jake's story and reason
- 1:40 – 6:05
Building a successful vertical AI company
- GTGarry Tan
why we wanted to bring him on today is that if you just look at the companies that good founders are starting now, it's a lot of vertical AI agents. I mean, I was trying to count the ones in S24. We have l- literally dozens of the YC companies in the last batch were building vertical-specific AI agents, and I think Jake is the founder who is currently running the most successful vertical AI agent. It's by far the largest acquisition, and it's actually deployed at scale in a lot of mission-critical situations. And the inspiration for this was, uh, we hosted this retreat a few months ago, and Jake gave an incredible talk about how he built it, and we thought that it'd be super useful for people who watched The Light Cone who are interested in this area to hear directly from one of the most successful builders in this area, how he did it. So how did you do it?
- JHJake Heller
(laughs) Well, first of all, like, like a lot of these things, um, there's a certain amount of luck. Over the course of our decade-long, uh, journey, we started investing very deeply in AI, uh, and natural language processing, and we, we became close with a number of different research labs, uh, including some of the folks at OpenAI. And when it came time for them to start testing early versions, uh, we didn't realize it was GPT-4 at the time, but what was, what was GPT-4, we got a very early kind of like view of it. And so, you know, months before the public release of GPT-4, you know, we, as a company, were all under NDA, all working on this thing. And I- I'll never forget the first time I saw it, it took maybe 48 hours for us to decide to take every single person at the company and shift what they're working on from w- the projects we were then working on at the time to 100% of the company all working on building this new product we call CoCounsel based on the GPT-4 technology.
- GTGarry Tan
How many people was that?
- JHJake Heller
We're about 120 people at the time.
- GTGarry Tan
So you're talking like 120 people-
- JHJake Heller
Yeah.
- GTGarry Tan
... and completely change what they were all working on-
- JHJake Heller
Yes, yes, yes.
- GTGarry Tan
... in 48 hours.
- JHJake Heller
Yes.
- GTGarry Tan
And for the people watching, uh, Casetext originally, I mean, had always been in the legal space.
- JHJake Heller
Yeah.
- GTGarry Tan
You're a lawyer-
- JHJake Heller
Yeah.
- GTGarry Tan
... and you built something for yourself, and you know, sort of the first versions of it were actually sort of, uh, annotated versions of case law, actually.
- JHJake Heller
Yeah. That's exactly right. So in the very ear- early origins of the company, the mission of the company, what we're always focused on is, how can we build something that brings the best of technology to the legal space? Um, I, as a lawyer, I actually like the job a lot. The parts of my job that I hated the most was when I had to interact with the technology that lawyers have to use, um, regularly to get the job done. I remember thinking, and this is like 2012 when I was at a law firm, if I want, want to do something really trivial, I had like the, the new iPhone at the time, I can go and Google and find, like, movie times or where's the closest open Thai restaurant with vegetarian options. That was super easy. But if I wanted to find the piece of evidence that was going to exonerate my client and, and make it so he doesn't have to go to jail for the rest of his life, or the, um, key legal case that will help me win a billion-dollar lawsuit, well, that's going to be like five days in a row till 5:00 AM every day. I was like, there's got to be a better way.
- JFJared Friedman
What is the process as a lawyer? You would have to read the stacks and stacks of documents or...
- JHJake Heller
Pretty much, yeah. Um, right before I started practicing, before everything went virtual or, like, online, uh, you would literally be in a basement with banker's boxes full of documents, reading them one by one by one-
- JFJared Friedman
Oh, God.
- JHJake Heller
... to try to find, you know, all the emails in a company like Pfizer or Google to see if there was potential fraud or... Um, and then if you wanted to find case law, slightly before my time, you'd literally go to the library and open up books and just start reading, and, you know, new products were coming out that were some of the first web-based research tools, but they were pretty clunky. It was just hard to find the relevant information.
- JFJared Friedman
You couldn't do Control F for any of this stuff basically.
- JHJake Heller
Basically not, yeah.
- JFJared Friedman
And what was interesting about your background is you also happen to be the rare breed of having also computer science training.
- JHJake Heller
Mm-hmm.
- JFJared Friedman
So this must have driven you nuts.
- JHJake Heller
Yeah, exactly. I mean, i- in the law firm, I will never forget, I was building, like, browser plugins to go on top of the, the tools I was using just to make my, like, life more efficient and effective. And actually, one of the reasons I left the law firm to start a company and apply to YC was, I got in trouble with the general counsel who thought like, "Hey, why are you spending all your time, you know, doing this tech stuff?" And also made at the time very clear that, that my law firm owns all that technology. (laughs) So I decided to do something different.
- 6:05 – 9:24
The unique challenges of law and AI
- GTGarry Tan
So do you want to s- tell us a little bit about the first ten years of Casetext, the sort of like long slog in the pre-LLM era?
- JHJake Heller
One of the lessons here, I think, that I took away from that time period is that, uh, when you start a company, you may not get the exact right ... You may have, like, the right kind of general direction. You know there's a problem, you're trying to solve it. But it could take a very long time to figure out what the solution is. For us, for example, you know, we saw that there was this kind of combined issue of, like, bad technology in the legal s- sphere, but also, like, this very n- Like, like, a lot of lawyers use content to do things like research and understand, like, the, what the law is. And so we thought, "Okay, well, we can do the technology better, but how are we going to get this content?" And we spent, like, a couple of years trying to get, as Gary said, lawyers to annotate case law and to provide information.
- DHDiana Hu
So it was like a UGC site.
- JHJake Heller
Yeah.
- DHDiana Hu
Like, a user-generated content site.
- JHJake Heller
Yeah, that was a big focus of ours. Like, the kind of one-two punch of better technology but also better content. Um, we, you know ... At the time, our heroes were, like, Stack Overflow and Wikipedia and GitHub and other kind of open source or UGC kind of websites. And it was a total failure. Like, (laughs) we could not get lawyers to contribute their time and information, and, and I think these are just different populations. The typical Wikipedia editor has more time on their hands than they know what to do with, and so they're adding ... Uh, not all, but, but many do, and they're adding content for free, um, and, and altruistically. Lawyers bill by the hour. Their time is incredibly valuable. They're always running out of time. They had no time to kind of contribute to some UGC site, so we had to pivot. And we started investing, uh, investing very deeply. At the time, it was not called AI, it was just, like, natural language processing and machine learning, and saw that, first of all, we didn't need to create all this UGC, like, to, to replicate some of the best benefits of what our competitors had in these big content databases. Some of it, you can basically do, even then, s- at a kind of automated basis. And then also, uh, we were starting to create these user experiences that were, you know, a lot better than what our competitors could offer based on th- then at the time what seems kind of quaint, like AI stuff. Like, you know, the same recommendation algorithm that powers Pandora and Spotify's, like, recommended music, you can use ... You know, what they look at basically is how this song relates to that song, if people listen to this also listen to this and this and this, right? Similarly, we looked at, okay, cases that cite to, you know, other cases, they all reference earlier opinions. You know, they, they kind of build out this network of citations, and we f- found ways that we can check a lawyer's work. 
They'd upload their work so far and be like, "Well, everybody who talks about this case, talks about this case too, and you miss that."
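The co-citation check Jake describes ("everybody who talks about this case talks about this case too, and you missed that") can be sketched in a few lines. Everything below is hypothetical: the case names, the toy citation graph, and the simple overlap threshold; a real system would weight by citation counts, recency, and jurisdiction.

```python
from collections import Counter

def suggest_missing_cases(cited_by_lawyer, citation_graph, min_overlap=2):
    """Suggest cases a brief may have missed: cases that frequently appear
    alongside the lawyer's citations in other documents' citation sets."""
    counts = Counter()
    for doc_citations in citation_graph:  # each entry: the set of cases one document cites
        overlap = cited_by_lawyer & doc_citations
        if len(overlap) >= min_overlap:
            for case in doc_citations - cited_by_lawyer:
                counts[case] += 1
    return [case for case, _ in counts.most_common()]

# Hypothetical example: three prior opinions and the cases they cite.
graph = [
    {"Smith v. Jones", "Doe v. Roe", "Acme v. Beta"},
    {"Smith v. Jones", "Doe v. Roe", "Acme v. Beta"},
    {"Smith v. Jones", "Acme v. Beta"},
]
brief = {"Smith v. Jones", "Doe v. Roe"}
print(suggest_missing_cases(brief, graph))  # → ['Acme v. Beta']
```

The idea is the same collaborative-filtering shape as the Pandora/Spotify analogy Jake draws: co-occurrence in the citation network stands in for "people who listened to this also listened to that."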
- DHDiana Hu
Mm.
- JHJake Heller
Um, so cool experiences like that. But the truth is, until the very end, until CoCounsel, a lot of what we did were, relatively speaking, kind of incremental improvements on the legal workflow, and one of the things that's kind of weird about this is, um, when there's just an incremental improvement, it's actually pretty easy to ignore. A lot of our clients, uh, they would never say this literally, but you get this impression when you walk into their room, their office, and you try to pitch them a product and you say, "This is going to change everything about the way you practice." And they go, "Well, I make $5 million a year. I don't want nothing to change."
- DHDiana Hu
(laughs)
- JHJake Heller
"This technology plot, yeah, it's not int- I do not want to introduce anything that has the opportunity to make my life at all worse, um, or potentially worse," or potentially more effi- efficient because they bill by the hour. It was
- 9:24 – 11:25
The turning point for lawyers with ChatGPT
- JHJake Heller
really only after, like, much later when ChatG- GPT came out. You know, at the time we were privately and secretly working on GPT-4. ChatGPT came out, and all of a sudden, every lawyer in America, probably in the world, saw, "Oh my God, I don't know exactly how this is going to change my work, but it's going to change it very substantially." Like, they could feel it. And the same, you know, guys and gals who were telling us, "I make $5 million a year. Why would it change anything about my life?" Were like, "I make $5 million a year. If this is going to change something, I need to be a- ahead of this." The technology itself, and we'll get into that in a second, really changed what we can build for lawyers. But also the market perceptions of what was, like, wha- what was necessary really changed as well, and for the first time in our 10 years, you know, even before we launched CoCounsel publicly based on GPT-4, they were calling us like, "You know, we know you work on AI. We need to get on top of this. What can, you know, what can you show us? What can we, what can we work on?" And I think it's because the change was not incremental anymore. It was, like, fundamental, and all of a sudden they had to pay attention. They could not ignore it.
- DHDiana Hu
I guess, uh, the mental model I have for you is, there's this concept of the idea maze. You know, the founder goes in the beginning of the maze and they're just, like, feeling around, like, actually, uh, in the arena.
- JHJake Heller
Yeah.
- DHDiana Hu
Talking to f- you know, customers.
- JHJake Heller
Yeah.
- DHDiana Hu
Learning, like, where are the walls? Which, which path to go? Should I go left or right? Like ... And then, um, as is actually common for startup founders, in the idea maze you will actually reach a dead end, and then usually you have to pivot.
- JHJake Heller
Yeah.
- DHDiana Hu
And then I think you have a very interesting story because you were sort of towards the end of maybe, like, one of the, uh, you know, parts that weren't going to get you all the way to product market fit, but then LLMs drop and then it's like the maze got shaken up.
- JHJake Heller
Yeah.
- DHDiana Hu
And then you were actually much closer to product market fit than absolutely anyone else.
- JHJake Heller
Yeah.
- DHDiana Hu
And so that's why-
- JHJake Heller
Uh, it- it is exactly right.
- DHDiana Hu
What a crazy time.
- JHJake Heller
Yeah. I think that's exactly right.
- DHDiana Hu
That's why you're the first man on the moon.
- JHJake Heller
Yeah. (laughs)
- DHDiana Hu
(laughs) Mm-hmm.
- JHJake Heller
Yeah, I think, I think there's, there's certainly something to that. And, and the thing is,
- 11:25 – 15:04
Finding product market fit in legal
- JHJake Heller
you know, each time we got, progressed through that maze, it felt like maybe now we were at product market fit. You know, we were making real revenue before we, um, launched CoCounsel, and we had real customers and they said really great things about us. I keep on thinking about this article written by Marc Andreessen in, like, the early 2000s. Uh, I think it's called The Only Thing That Matters, and in it he describes the, what it feels like to have product market fit, and he lists things like your servers will go down. You can't hire support people and salespeople fast enough. You're going to eat for a year free at Buck's, the, the kind of famous Woodside, you know, uh, diner where, where a lot of VCs will take you.
- DHDiana Hu
(laughs)
- JHJake Heller
The, the process to ... And I, I read that early on in my, like, like, you know, career, and I was like, "Okay, well, that's, like, hyperbolic." But when we launched CoCounsel it was literally exactly that. Our servers were going down. We could not hire support people fast enough. We couldn't hire salespeople fast enough. I ate a lot at Buck's. (laughs)
- DHDiana Hu
(laughs)
- JHJake Heller
You know? (laughs) Uh, before we were ... It was a really big day if we were in the ABA Journal or some other, you know, legal-specific... uh, publication. We were on CNN and MSNBC and like, you know, all of a sudden everything changed, and that's what real product market fit looks like. I think Marc- Marc was (laughs) even in like 2005 or whenever the article came out, exactly right about what it looked like in 2023.
- SPSpeaker
Can you talk about that crazy time? 'Cause this was only two months from when you launched CoCounsel to getting bought for $650 million. So, like, what happened in those two months?
- JHJake Heller
Well, to- to be clear, the transaction only closed six months after we launched.
- SPSpeaker
But, but the-
- JHJake Heller
But it was two months in that the conversation started. And so, uh, so we started building CoCounsel and, and for just, just to, uh, to kind of background purposes, the idea we came up with, again, like 48 hours, like a weekend after seeing GPT-4 was, um... And it- it's something that kind of still sounds crazy today, but it- it felt crazy at the time, which is this AI legal assistant, by which we mean it's like almost like a new member of the firm. You can just talk to it, um, not unlike how you might talk to something like ChatGPT today, uh, and give it tasks like, uh, "I need you to read these a million documents for me and tell me if there's any evidence of fraud happening in this company." And then within a couple of hours, it's like, "I've read all the documents. Here's what the, you know, summary is." Or summarize documents or do legal research and put together a whole memo after researching, you know, hundreds or thousands of cases answering the lawyer's initial research question. And, and so in that sense, it was this like really powerful extension of the workforce of these law firms. That was the concept from the beginning. And we made a very early initial version of it, and we started because we couldn't... You know, under our agreement with OpenAI, we could not be public about this product, but they did let us extend the NDA to a handful of our customers. And so we started having our customers use it. And so, you know, for months before GPT-4 was launched publicly, we had a number of law firms un- like, they had no idea they were using GPT-4, but they were seeing something really special, right? This is actually even before ChatGPT, so they're, this is their first ever experience talking to this godlike feeling, you know, AI that was all of a sudden doing these tasks that would take me when I practiced like a whole day and it's being done in a minute and a half, right? And, and so as you might imagine, like, it was, it was nuts.
I mean, first of all, the whole company, all 120 of us did not sleep for those, you know, months before GPT-4 was like publicly launched and there- therefore we could publicly launch the product. We felt like we had this amazing opportunity to run far ahead of the market. Something really beautiful happens when everybody's working super, super hard, which is you iterate so quickly past everyone else. And, and actually, I, I still see some companies out there, they're stuck where we were in the first month of seeing GPT-4, right?
- JFJared Friedman
Mm-hmm.
- JHJake Heller
Um, and I think it's because they're just not, like, as intensely focused and engaged as, as we were, were able to be during those, like, couple, like, ab- about six months or so before the public launch of GPT-4.
- 15:04 – 20:40
Entering deep founder mode
- JFJared Friedman
You kind of... To do this transition, you had to shake the company. You kind of went into deep founder mode.
- JHJake Heller
Oh, yeah.
- JFJared Friedman
Because there was a lot of, uh, pushback from employees as like, "Oh, this thing was working. Why should we go into, throw ourselves into the deep end of AI?" And you're like-
- JHJake Heller
Oh, yeah.
- JFJared Friedman
"Uh, tell us about that founder mode moment for you."
- JHJake Heller
A- and so first of all, like, this is especially true if you're running a business for 10 years, because they've seen you wander through that maze and, and bump into dead ends. And a lot of those folks have been there for, uh, most or all of that time watching, you know, me as the founder saying, "We're definitely going this direction. It's definitely gonna work." And, and sometimes it doesn't. And you only get so many of those with employees, right? So this was maybe my last one that (laughs) I had with some of these folks and they're like, "Here Jake goes again with this crazy new technology and some idea we're going to invest deeply in." And, and yeah, it took some, some, uh, job to convince people. And if you imagine like what some of the different roles are, if you're in the go-to-market role, if you're, if you're selling or marketing a product, and we were making, you know, we were growing 70, 80% year over year. We're between 15 and $20 million in ARR. Things weren't like terrible, right?
- SPSpeaker
That's great.
- JHJake Heller
Yeah, we were great. Yeah. We, but like, so they were like, "What? Why are we blo-..." Even the board, you know, some of the members are like, "I get this immediately," and some of them had to be persuaded, right? Um, and, and about the founder mode moment, like one thing that really worked for me is, uh, I led the way through example. I built the first version of it myself. Um, I-
- SPSpeaker
Wow, even with 120-person company with like a whole-
- JHJake Heller
You know what?
- SPSpeaker
... bunch of engineers and-
- JHJake Heller
Yeah.
- SPSpeaker
... lawyers and stuff. Like-
- JFJared Friedman
Mm-hmm.
- SPSpeaker
... before that you, like, opened up your, like, IDE and actually built the thing yourself.
- JHJake Heller
Oh, yeah. Oh, yeah. And, and part of it was, uh, the NDA only extended at first to me and my co-founder, and that was it.
- DHDiana Hu
That was a blessing then, actually.
- JHJake Heller
Yeah, exactly.
- SPSpeaker
Uh-huh.
- JHJake Heller
It was, it turned out to be, like, perfect. And even after the NDA got extended a little bit, we kept it pretty small at first, for the first, like, you know, little bit of time. I made up my mind within 48 hours the whole company's going to do this, but we actually only told the company I think a week and a half after we first got access. And during that week and a half, the- we built the very first version, like prototype version of this. And, and again, I, I won't, I'll never forget this. The timing is just so funny. Like, we saw it on like a Friday. We had it all weekend long working with it. And then Monday was an executive offsite where everybody came, all my executives came, and they expected that we're gonna be, we're gonna be talking about how we're going to hit our sales target for the next quarter.
- JFJared Friedman
(laughs)
- JHJake Heller
How we're...
- DHDiana Hu
(laughs)
- JHJake Heller
And it's like, "Guys, we're talking about none of that. You know, we are talking about something totally different right now. Let me show you something on my laptop, you know?" Uh, so yeah, I, I built the first version myself, but going through that process, me and, and, and then a handful of other people, I think was really helpful. And we also brought in customers early and that helped convince a lot of people. As soon as like a skeptical sales or marketing or whatever person, or even engineer, was on the other li- end of a Zoom call, uh, where, um, a customer was, was reacting to the product in real time and giving us their honest reactions and like seeing the look on their face. A- and again, you have to imagine, it's almost hard to imagine what the world was like pre-ChatGPT, but th- there, some of these people were seeing that, that exact idea for the first time, and they were, they were just blown away, and that really changed minds quickly. I mean, we saw people go through like existential crisises live, you know, on Zoom calls, like, "Oh, my God."
- JFJared Friedman
You could see their expression change?
- JHJake Heller
Yeah. Exactly.
- DHDiana Hu
(laughs)
- JHJake Heller
In all kinds of ways. It's like, "What am I going to do?" A lot of... The very common reaction amongst the senior attorneys we showed it to was like, "Well, they've got to retire soon." Like, you know, "I have to deal with this." (laughs)
- DHDiana Hu
And some of this was, um, really driven by GPT-4, uh-... coming out, like, you had access to three. You had access even to two, I think. Is that right?
- JHJake Heller
Yeah. We had access. We, we were, we were in a close relationship, again, with a lot of the labs, but including OpenAI, and they kept on showing us stuff kind of early on in its development. And they're like, "Well, can you build something with this for legal?" And every time, we're like, "No, this sucks." (laughs) Like, you know, by, by the time it got to 3 and 3.5, it was like, okay, well, this is plausible sounding English and sounds kind of like a lawyer, so kudos to that. But it was just making stuff up wildly. Like, we just d- d- d- c- it's very hard to connect it to a real use case, especially in legal where it's so important that you actually get the facts right. The, you can't hallucinate. Um, you can't even, you know, make the wrong kinds of assumptions. And we had to do a lot of work with those earlier models to even get them close to usable, and they just were- weren't really... I mean, like, one, one, like, totem or, like, one example along the way is when GPT-3.5 came out, a study was run, um, and it showed that GPT-3.5 scored in the ten- tenth percentile on the bar exam, right? So, like, it did better than some people actually, but only 10% of them, yeah. And probably the ones who were just filling it out randomly, basically. Um, when we got early access to GPT-4, we're like, "Let's run the study again too." And we worked with OpenAI where, like, we were going to confirm this, this test is not in the training set, and it wasn't. Totally new test to it, and GPT-4 did better than 90% of the test takers.
- 20:40 – 25:05
Approaching prompt engineering step by step
- DHDiana Hu
You know, today we have o1, we have, you know, chain-of-thought reasoning. Um, I think a lot of people look at it as it's not merely the text itself but also the instructions that lead up to, you know, the workflow. But, you know, way at the beginning, nobody knew any of this stuff. How did you start? You had your sort of tests that you had written for previous versions of the model, and it outperformed them. But then there's this moment where you say, "Okay, well now it's something, but what do we do next and how do we do it?"
- JHJake Heller
So the process that we started with then, and it's actually not too dissimilar to what we're doing today. It started with a question of like, okay, well what problem are we trying to solve for the user, right? The user wants to do research, uh, legal research, um, so, and they want like a memo answering their question with citations to the original source. So, like, that's the end result. And then we're like, "Okay, well how do we go from the end result, like working backwards almost? What would it take to get there?" And what ends up happening a lot, uh, with the things that we built for CoCounsel, we called them skills, which i- is, uh, felt very u- unique at the time, but I think a lot of companies now call their AI capabilities skills. So when you're building these skills, um, it turns out it usually takes a lot of work to go from, like say the customer inputs something, say like a set of documents or a question or what have you, to the end result that they're looking for. And the way that we thought about it was how would the best attorney in the world approach this problem? And so in the case of research, for example, the best attorney would, you know, get the request say from a partner and then break that request down into like actual search queries they run against these, these platforms. And sometimes they use special search syntax that looks actually pretty like, like SQL almost, right? So like from the English language query, you have to break it down into these different kind of search queries, maybe a dozen different search queries if you're being really diligent. And then, um, they'd execute the search queries against these databases of law, and they come back with say like 100 results each. And then they, you know, the most diligent, best attorney would sit down and just read every single one of these results that come back, all the, like, case law, statutes, regulations.
And you'd start to do things like make notes and, uh, summarize and kind of compile, like, an outline of what your response might be. And-
- DHDiana Hu
Like line by line-
- JHJake Heller
And it- yeah, exactly.
- DHDiana Hu
... or paragraph by paragraph actually.
- JHJake Heller
Exactly. Yeah. It's 100%. And then you start like just taking out those like insights you're getting from what you're reading. And then finally, based on all of that work and all those citations you've gathered, et cetera, then finally you put together your, your, you know, research memo. And so we were like, okay, well each one of those steps along the way, for the vast majority of them, those were impossible to accomplish with previous technology, but now they're, they're prompts.
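The step-by-step decomposition Jake describes, question to search queries to reading every result to outline to memo, can be sketched as a chain of small prompt calls. This is only an illustrative shape, not Casetext's actual implementation: `call_llm` is a stub standing in for whatever model API you use, `search_fn` stands in for the law database, and the prompts are invented.

```python
def call_llm(prompt: str) -> str:
    """Stub standing in for a real model call (e.g. a chat completion)."""
    return f"<model output for: {prompt[:40]}...>"

def research_skill(question: str, search_fn) -> str:
    # 1. Break the English-language request into targeted search queries.
    queries = call_llm(f"Rewrite as 3 search queries: {question}").splitlines()
    # 2. Execute each query against the database of law.
    results = [doc for q in queries for doc in search_fn(q)]
    # 3. Read and summarize every single result individually.
    notes = [call_llm(f"Summarize for relevance to {question!r}: {doc}")
             for doc in results]
    # 4. Compile the notes into an outline, then a cited memo.
    outline = call_llm("Outline an answer from these notes:\n" + "\n".join(notes))
    return call_llm(f"Write a research memo with citations from:\n{outline}")
```

In Casetext's telling, each numbered step here was itself one or more prompts, each with its own battery of tests, and a full skill chained a dozen or two dozen of them.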
- DHDiana Hu
Think step by step.
- JHJake Heller
Yeah. Think step by step. Yeah, exactly. But we actually broke it down each, each, you know, so getting to the final result might be a dozen or two dozen different individual prompts, each of which might, by the way, be thinking step by step th- themselves. But, um, and then for the, for each of those prompts, you know, as part of this like chain of, of actions you take to get to the final result, we had a very clear sense of what good looks like. And we were able, you know, we had a ser- series, like a battery of tests before, but this got way more intense, where we'd write at first maybe a few dozen tests and then a few hundred and then a few thousand for every single one of those prompts. So, you know, if, if the, the, the job to be done in the very beginning of this research process, for example, is taking the English language query and breaking it down into search queries, we had a, a very clear sense of what good search queries look like and wrote like gold standard answers for given this input, this is what the output looks like, right? And so our prompt engineers, um, and I was one of them at the very beginning, we all just kind of in it together, were writing these English language prompts to try to f- to, you know, write the tests first basically, and wrote these English language prompts to try to get it so that, out of 1,200 tests, they got the right answer 1,199 times or what have you.
- JFJared Friedman
So sort of like, um, test-driven development.
- JHJake Heller
Oh, yeah.
- JFJared Friedman
Really approached it from... doing software engineering to, to, to prompting.
- JHJake Heller
That's exactly right. And, and the funny thing is, I never really believed in test-driven development before prompting. (laughs) Like, I was like, "Oh, the code works. It does it. It's fine." Like, you'll see when you ... But, like, with prompting, actually, I think it becomes even more important because of the kind of like nature of these LLMs, that they might go in crazy directions unexpectedly. And so, you know, you might very easily add in a set of instructions to solve one problem you're seeing with these sets of tests, and then it breaks something with these sets of tests. And so you, so that, that exact kind of theory of kind of test-driven development applies, you know, 10X more, I'd say, in the world of prompting.
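Test-driven prompting, as described here, means writing the gold-standard outputs first and measuring every prompt revision against them before shipping it. A minimal harness might look like the following; the exact-match grader, the 95% threshold, and the toy "model" are all assumptions for illustration, since real evals often need fuzzier scoring.

```python
def evaluate_prompt(render_prompt, model, gold_cases, threshold=0.95):
    """Run a prompt template over gold-standard (input, expected) pairs
    and report whether the pass rate clears the shipping threshold."""
    passed = sum(
        1 for inp, expected in gold_cases
        if model(render_prompt(inp)).strip() == expected
    )
    rate = passed / len(gold_cases)
    return rate, rate >= threshold

# Toy example: a fake "model" that just uppercases the text after the colon.
fake_model = lambda prompt: prompt.split(":")[-1].upper()
template = lambda q: f"Convert to a search query: {q}"
gold = [("breach of contract", "BREACH OF CONTRACT"),
        ("fraud statute", "FRAUD STATUTE")]
rate, ok = evaluate_prompt(template, fake_model, gold)
print(rate, ok)  # → 1.0 True
```

The point Jake makes about regressions falls out naturally: when a new instruction fixes one set of tests, rerunning the whole battery catches the other set it silently broke.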
- 25:05 – 28:10
Going beyond GPT wrappers
- JFJared Friedman
There's a lot of, uh, sort of the naysayers saying that a lot of companies are just building GPT wrappers.
- JHJake Heller
Mm-hmm.
- JFJared Friedman
And there's not a lot of IP getting built. But it's actually ... There's a lot of finesse to how you explain all of this. Like, could you tell us about all of that and how much more there is to be built?
- JHJake Heller
Oh, yeah. I, I mean, I think the thing is, when you're actually trying to solve a problem for a customer, and actually doing the job in, in our case of, like, what a young associate might do, and do it really well, there are many layers of things you have to add in to actually get the job done. And by the time you, like, add that all up, you're not like a GPT wrapper. You're a full application that may include, in our case, proprietary datasets, like the law itself, and, or annotations to the law that we added automatically. It may include, um, connections into customer databases. In our case, in legal, they have these s- very specific, legal-specific document management systems. Um, you know, so connecting into those is, like, very important. Um, it may include, uh, something as subtle as, like, how well you OCR, and, like, what OCR programs you use and how you set those up. When you're doing that task of ... You know, one of the tasks that CoCounsel does, for example, is reviewing large sets of documents. Once you start working with a lot of documents, you see, like, stuff with handwriting all over it, and they're, like, tilted in the scan. And there's this crazy thing that they do in law where they print four pages on one page to save, like, room.
- JFJared Friedman
(laughs)
- JHJake Heller
And all the OCRs kind of read it straight across, but it actually goes, you know, one, two, three, four. And so, by the time you've dealt with all of the edge cases, frankly, everything else before you even hit the large language model, there might be dozens of things you've built into your application to actually make it work and work well. And then you get to the prompting piece: writing out tests and very specific prompts, and the strategy for how you break down a big problem into step-by-step-by-step kind of thinking, and how you feed in the information, how you format that information the right way. All of that also becomes, like, your IP. And it's very hard to build, and therefore very hard to replicate.
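The "four pages on one page" problem can be sketched in a few lines. The issue: a naive OCR pass reads each horizontal line straight across the whole sheet, interleaving text from side-by-side sub-pages, so you have to group text by logical sub-page before applying reading order. This is a hedged illustration assuming OCR output as text boxes with page coordinates and a 2x2 layout read top-left, top-right, bottom-left, bottom-right; the names and layout are hypothetical, not the actual CoCounsel pipeline.

```python
def quadrant(box, page_w, page_h):
    """Map a text box to its logical sub-page (0..3) on a 4-up sheet."""
    col = 0 if box["x"] < page_w / 2 else 1
    row = 0 if box["y"] < page_h / 2 else 1
    # Assumed reading order of the 4-up layout: top-left, top-right,
    # bottom-left, bottom-right -> logical pages 1..4.
    return row * 2 + col

def reorder_4up(boxes, page_w, page_h):
    """Sort boxes by sub-page first, then normal reading order within each."""
    return sorted(boxes, key=lambda b: (quadrant(b, page_w, page_h), b["y"], b["x"]))

boxes = [
    {"x": 100, "y": 50,  "text": "page1-line1"},
    {"x": 600, "y": 50,  "text": "page2-line1"},
    {"x": 100, "y": 400, "text": "page1-line2"},
    {"x": 100, "y": 700, "text": "page3-line1"},
    {"x": 600, "y": 700, "text": "page4-line1"},
]
# A naive top-to-bottom pass would interleave page 1 and page 2 lines.
print([b["text"] for b in reorder_4up(boxes, 1000, 1200)])
# → ['page1-line1', 'page1-line2', 'page2-line1', 'page3-line1', 'page4-line1']
```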
- JFJared Friedman
Which is all the business logic, which is all ...
- JHJake Heller
Yeah.
- JFJared Friedman
Even all the very successful SaaS companies-
- JHJake Heller
Yeah.
- JFJared Friedman
... with a very specific domain, you need very, very custom, esoteric, niche integrations like-
- JHJake Heller
Yeah.
- JFJared Friedman
... plug into this esoteric law database.
- JHJake Heller
Yeah, absolutely. The two things that I think about all the time, it's like, basically all SaaS for a while was just like a SQL wrapper. Right?
- JFJared Friedman
Mm-hmm.
- JHJake Heller
Like, if you think about very successful companies like Salesforce, they've built that business logic around basically just databases, and connections between, like, tables in a database. And sometimes it's bridging that gap between something that a very technical person can do but most people can't, and making it accessible. Or bridging that gap between something that almost works. Like, you can do a lot of cool demos in ChatGPT without writing a line of code, but that almost works, that works, you know, 70% of the time. But going to 100% of the time is a very different kind of task. And people will pay $20 a month for the 70%, and maybe $500 or $1,000 a month for something that actually works, depending on the use case. Right? So there's a lot of value gained going that last mile, or 100 miles, or whatever it is.
- 28:10 – 30:48
Aiming for 100% accuracy
- SPSpeaker
Yeah. Can you talk about how you went from 70% to 100%? 'Cause I think the other knock on this technology that we hear a lot is like, "Oh, these LLMs hallucinate too much. They're not accurate enough for real-world use." But as you said earlier, like, the use case that you're working on is a mission-critical use case.
- JHJake Heller
Oh, yeah.
- SPSpeaker
There's, like, a lot at stake if the agent gives bad information to lawyers who are working on important court cases. How did you make it accurate enough for lawyers who are conservative by nature to trust it?
- JHJake Heller
This test-driven development framework, first of all, goes a long way, because you can start seeing, you know, patterns in why it's making a mistake, and then you add instructions against that pattern. And then sometimes it still doesn't do the right thing, and then you really ask yourself, "Okay, well, was I being super clear in my instructions? Am I including information it shouldn't see, or too much, or too little information for it to really get the full context?" And usually, like, these things are pretty intelligent. And so usually you can root-cause why you're failing certain tests, and then build to a place where you're actually passing those tests and just getting it right. You know, and one of the things we learned is that after it passes, frankly, even like 100 tests, the odds that it will handle any random distribution of user inputs, the next 100,000 of them, 100% accurately are, like, very high.
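The loop described here, spot a failure pattern, add an instruction against it, re-run the tests, can be sketched roughly like this. All names and instruction strings are illustrative, not the actual prompts used.

```python
def patch_prompt(base_prompt, failure_patterns, fixes):
    """Append one corrective instruction per failure pattern observed in the eval suite."""
    additions = [fixes[p] for p in failure_patterns if p in fixes]
    return base_prompt + "".join("\n- " + a for a in additions)

# Hypothetical mapping from observed mistake patterns to corrective instructions.
fixes = {
    "missed_negation": "Check every quotation word by word for inserted or dropped negations.",
    "wrong_citation": "Verify that each citation points to the case actually being quoted.",
}

patched = patch_prompt("You are reviewing a legal brief for errors.",
                       ["missed_negation"], fixes)
print(patched)
```

After patching, the full suite is re-run; if the new instruction breaks a previously passing test, that shows up immediately instead of in front of a customer.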
- DHDiana Hu
One of the things that strikes me as tricky is, like, many founders we work with are very tempted to just raw-dog it. (laughs)
- JHJake Heller
Yeah.
- DHDiana Hu
Just like, no evals, no test-driven development. We're just, like, vibes-only prompt engineering. (laughs)
- JHJake Heller
Yeah.
- DHDiana Hu
And maybe ... I mean, you switched over to this, uh, very quickly then. Like, was it just obvious from the beginning? You're like, "We just can't do it that other way. We should not raw dog any of these prompts"?
- JHJake Heller
Yeah. I think the biggest thing ... Uh, first of all, it depends on the use case. For a lot of things that we were working on, for better or for worse, there was a right answer. And if you get the wrong answer, lawyers are not going to be happy about it. You know, I had been a lawyer myself, but I'd also been selling to lawyers for a decade. And every time we made the smallest mistake in anything that we did, we heard about it immediately. Right? And so I had that voice in my head, maybe, (laughs) as I was going through this process. Um, and that, that-
- SPSpeaker
That was the learning from the 10 years of slogging through-
- JHJake Heller
Yeah.
- SPSpeaker
... pre-LLMs. You're like, "No, it has to be 100%."
- JHJake Heller
Oh, yeah.
- SPSpeaker
Yeah.
- JHJake Heller
Oh, yeah.
- DHDiana Hu
It's probably true of way more domains than we realize, actually.
- JHJake Heller
It could be. Um, 'cause the other thing that we were thinking about a lot is, you can lose faith in these things really quickly.
- SPSpeaker
Mm-hmm.
- JHJake Heller
Right? You have one bad experience, especially if your first experience is bad, and you're like, "You know, maybe I'll check in on this AI stuff a year from now," especially if you're, like, a busy lawyer, not a technologist. So we knew we had to make that first encounter, that first week, really, really work for the lawyer, or else they're not going to invest in it deeply.
- 30:48 – 36:42
Thoughts on o1’s capabilities
- DHDiana Hu
So, let's talk a bit about OpenAI o1, because it is a very different model. I mean, up to this point with GPT-4 and all of that previous generation, the analogy in terms of the intelligence is sort of the system one thinking in the Daniel Kahneman-
- JHJake Heller
Mm-hmm.
- DHDiana Hu
... type of-
- JHJake Heller
Mm-hmm.
- DHDiana Hu
... uh, intelligence, right? He has this whole economic theory; he won the Nobel Prize around this. System one thinking is very fast, kind of the decisions that humans make very intuitively, based on patterns, and LLMs are fantastic at that. But they're terrible at the executive function, because what I'm hearing with all the stuff that you're describing is that you're giving the LLM, like, executive function: "How do you think? How do I manage you?" It's really that slower thinking. And I think o1 is exciting. We haven't seen things built yet-
- JHJake Heller
Mm-hmm.
- DHDiana Hu
... because it just got announced a few days ago, right? I think it's getting to that system two thinking, and this has been a big area of research, which I saw a lot of at NeurIPS a year ago, where a lot of the researchers were excited to unlock this, because it's the missing piece toward AGI. So, what are your thoughts on o1, and how does this change things?
- JHJake Heller
So, first of all, I think o1 is a very impressive model. Like with other models, we gave it the kinds of tests that we knew were failing, and the degree of thoroughness, precision, and intelligence it applied to some of these questions, and it's not just math, was striking. And sometimes it's the stuff that you wouldn't expect you need a super smart model to do. In one of the tests that we run, we give it a lawyer's real legal brief, but we edited, very slightly, some of that lawyer's quotations of the case, to make it a wrong quotation, or a wrong summarization of his case. So he has this, like, 40-page legal brief, and you alter things. Just adding the word "not" can change the meaning of something entirely, right? And then we give the full text of the case as well to the AI, and we say, "Well, what did the lawyer get wrong about this case, if anything?" And literally every LLM before that would be like, "Nothing. It's perfectly right." It's just not a precise thinker about some of the very nuanced things that we altered about the brief to make it slightly wrong. And o1 caught this, like, immediately. Like you said, it actually thinks for a while. It sits there for a minute, and you're like, "Is this thing on?" You know? Like-
- DHDiana Hu
(laughs)
- JHJake Heller
... but then it starts answering, and it's like, "Oh, well, you know, you changed an 'and' to a 'neither nor.'" So those are the kinds of tests that you kind of expect even, frankly, earlier LLMs to be able to pass, but they just could not. And all of a sudden, o1 is even doing these things that take precise, detailed thinking.
- DHDiana Hu
Obviously, we don't have the internals on how o1 really works. We have, you know, this broad idea of chain of thought. Seemingly, we know that if OpenAI had a giant corpus of internal monologue of people thinking through doing things step by step, o1 would be even a lot better. It sort of rhymes with, uh, the thing you did to, you know, put your first, uh, step on the moon, right? (laughs)
- JHJake Heller
Yeah. Yeah.
- DHDiana Hu
Like, it rhymes with: break it down into, you know, chunks where you can get to 100% accuracy, instead of just throwing it all in the context window and, you know, maybe magically it will work.
- JHJake Heller
Yeah.
- DHDiana Hu
Do you think that that's what's happening then? Um-
- JHJake Heller
I think there's a good shot that they've, you know, maybe changed what their contractors are doing, and instead of just doing, you know, input in, answer out, they're doing input in, "How would I think about solving this problem?"
- DHDiana Hu
Mm.
- JHJake Heller
And then answer out. But then, you know, the interesting thing is it's kind of limited by the intelligence of the people writing those instructions.
- DHDiana Hu
Mm-hmm.
- JHJake Heller
And one of the things that we're investigating, for what it's worth, with o1 is: can we prompt it to tell it what to think about during its thinking process, and inject, again, you know, we've hired some of the best lawyers in the country, how would some of the best lawyers in the country think about solving this problem? And maybe, you know, we have no conclusive evidence one way or the other yet that this dramatically improves things; it's so early, and just not enough time has passed yet. There's a chance that one of the new prompting techniques with o1 is teaching it not just, like, how to answer the question, or what examples of good answers look like, but how to think. And I think that's another, like, really interesting opportunity here: injecting domain expertise, or, um, just your own intelligence.
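The "teach it how to think" idea could look roughly like the following prompt construction. This is a speculative sketch: the guide text, function names, and section markers are invented for illustration, and, as noted above, there is no conclusive evidence yet that the technique helps.

```python
# Hypothetical "how to think" section, modeled on the quotation-checking
# test described earlier: an expert's step-by-step process is injected
# ahead of the task, not just examples of good answers.
THINKING_GUIDE = """\
Before answering, work through the problem the way a senior litigator would:
1. Read the cited case and restate its actual holding in one sentence.
2. Compare each quotation in the brief against the case text word by word.
3. Flag any added, dropped, or altered word (e.g. an inserted "not")
   that changes the meaning.
4. Only then summarize what, if anything, the brief gets wrong.
"""

def build_prompt(brief_text, case_text):
    """Assemble the thinking guide, the inputs, and the question into one prompt."""
    return (
        THINKING_GUIDE
        + "\n--- BRIEF ---\n" + brief_text
        + "\n--- CASE ---\n" + case_text
        + "\n\nWhat, if anything, does the brief get wrong about this case?"
    )

prompt = build_prompt("The court held that notice was NOT required ...",
                      "The court held that notice was required ...")
print("word by word" in prompt)  # → True
```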
- DHDiana Hu
I'm just so thankful, because I think you're sharing the breadcrumbs, and (laughs) there are a great many other spaces where this technology is just beginning. I mean, you go to pretty much any company, and people have no concept of what's just happened. (laughs)
- JHJake Heller
Yeah.
- DHDiana Hu
Like, they actually literally still repeat all of those sort of tired tropes of, "Oh, you better be fine-tuning," or all the ... I mean, these things are just not connected to, like, what we're seeing day-to-day with startups and founders trying to create things for users. What I'm kind of glad for is that we get to actually share this knowledge, 'cause, like, even the things we talked about, you know, hey, you should probably do evals. (laughs) Like, there's a lot of alpha in getting to 100%, not just 70%. These are sort of the breadcrumbs that will actually go on to create all of the billion-dollar companies, maybe thousands of them actually.
- JHJake Heller
Yeah. We hope so. I mean, I think you're about to start to see a lot of other fields, like law, really level up. When you don't have to spend, you know, millions of dollars and six months literally in a basement reading document by document by document, right?
- DHDiana Hu
Mm-hmm.
- JHJake Heller
When you can actually just get past that and get just the results, now you're thinking strategically and intelligently. And the unlock for these companies: I mean, they currently pay, again, millions of dollars in salaries for these jobs to be done. Each of them, right? So for any company to come out with an AI that can do even 80% of that, the value is, like, really there. And I just want to encourage people not to give up based on those tropes, right? Like, "Oh, it hallucinates too much. It's too inaccurate. It's too ..." whatever. If we're an example of anything, it's that there's a path, and you can do it.
- DHDiana Hu
And there's some
- 36:42 – 37:05
Outro
- DHDiana Hu
good news in that, uh, you know what? The jobs aren't going to go away, they'll just be more interesting.
- JHJake Heller
That's what I think. Yeah.
- DHDiana Hu
Well, with that, we're out of time. But Jake, thank you so much for being with us.
- JHJake Heller
Thanks for having me.
- DHDiana Hu
See you guys next time.
Episode duration: 37:05
Transcript of episode eBVi_sLaYsc