
How Meta Prompting and Rubrics Make LLM Agents Reliable

Through rubric-based evals and explicitly layered meta prompting, ParaHelp's agent prompt shows how role, task, and output-format layers drive LLM calls.

Hosts: Garry Tan, Jared Friedman, Diana Hu, Harj Taggar
May 30, 2025 · 31m · Watch on YouTube ↗


  1. 0:00–0:58

    Intro

    1. GT

      Meta prompting is turning out to be a very, very powerful tool that everyone's using now. It kind of actually feels like coding in, you know, 1995. Like, the tools are not all the way there. We're, you know, in this new frontier. But personally, it also kind of feels like learning how to manage a person (laughs) where it's like, how do I actually communicate, uh, you know, the things they need to know in order to make a good decision?

    2. SP

      (intro music)

    3. GT

      Welcome back to another episode of The Light Cone. Today, we're pulling back the curtain on what is actually happening inside the best AI startups when it comes to prompt engineering. We surveyed more than a dozen companies and got their take right from the frontier of building this stuff, the practical tips. Jared, why don't we start with an example from one of your best AI startups.

  2. 0:58–4:59

    Parahelp’s prompt example

    1. JF

      I managed to get an example from a company called ParaHelp. ParaHelp does AI customer support. There are a bunch of companies who- who are doing this, but ParaHelp is doing it really, really well. They're actually powering the customer support for Perplexity and Replit and Bolt and a bunch of other, like, top AI companies now. So, if you- if you go and you, like, email a customer support ticket into Perplexity, what's actually responding is, like, their AI agent. The cool thing is that the ParaHelp guys very graciously agreed to show us the actual prompt that is powering this agent, um, and to put it on screen on YouTube for the entire world to see. Um, it's, like, relatively hard to get these prompts for vertical AI agents 'cause they're kind of like the crown jewels of the IP of these companies, and so very grateful to the ParaHelp guys for agreeing to basically, like, open source this prompt.

    2. GT

      Diana, can you walk us through this very detailed prompt? It's super interesting, and it's very rare to get a chance to see this in action.

    3. DH

      So, the interesting thing about this prompt is actually, first, it's really long, it's very detailed. In this document, you can see it's like six pages long, just scrolling through it. The big thing that a lot of the best prompts start with is this concept of, uh, setting up the role of the LLM. You're a manager of a customer service agent, and it breaks down into bullet points what it needs to do. Then the big thing is telling it the task, which is to approve or reject a tool call, because it's orchestrating agent calls from all these other ones. And then it gives it a bit of the high-level plan. It breaks it down step by step. You see steps one, two, three, four, five, and then it gives some of the important things to keep in mind, that it should not kind of go weird into calling different kinds of tools.

      It tells them how to structure the output, because a lot of things with agents is you need them to integrate with other agents. It's almost like gluing the API call. So it's important to specify that it's gonna give a certain, uh, output of accepting or rejecting, and in this format. Then this is sort of the high-level section, and one thing that the best prompts do is they break it down sort of in this markdown type of style, uh, formatting. So, you have sort of the heading here, and then later on, it goes into more details on how to do the planning, and you see this is like a sub-bullet part of it. And as part of the plan, there are actually three big sections: how to plan, then how to create each of the steps in the plan, and then a high-level example of the plan.

      One big thing about the best prompts is they outline how to reason about the task, and then a big thing is giving it an example: "And this is what it does." And one thing that's interesting about this, it looks more like programming than writing English, because it has this, uh, XML tag kind of format to specify sort of the plan. We found that it makes it a lot easier for LLMs to follow, because a lot of LLMs were post-trained in RLHF with kind of XML type of input, and it turns out to produce better results.
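
To make that structure concrete, here is a minimal sketch of the layered skeleton Diana walks through: role, task, step-by-step plan, guardrails, and an XML-tagged output format. This is illustrative only, not ParaHelp's actual prompt; every section name and tag is hypothetical.

```python
# A minimal sketch of the layered prompt structure described above.
# Illustrative only; all section names and XML tags are hypothetical.
MANAGER_PROMPT = """\
# Role
You are the manager of a customer service agent. You review every tool
call the agent proposes and decide whether to let it through.

# Task
Approve or reject the proposed tool call.

# Plan
1. Read the customer's ticket and the agent's proposed action.
2. Check the action against the list of allowed tools.
3. Verify the arguments are present and consistent with the ticket.
4. Decide whether to approve or reject.
5. Emit the decision in the output format below.

# Important
- Never approve a tool that is not on the allowed list.
- Do not call tools yourself; you only approve or reject.

# Output format
Respond with exactly one of:
<decision>approve</decision>
<decision>reject</decision><reason>one sentence</reason>
"""
```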

    4. GT

      Yeah, one thing I'm surprised that isn't in here, or maybe this is just the version that they released, what I almost expect is there to be a section where it describes a particular scenario and, uh, actually gives example output for that scenario.

    5. JF

      Uh, that's in, like, the next stage of the pipeline. Yeah.

    6. GT

      Oh, really? Okay.

    7. JF

      Yeah. Uh, 'cause it's customer specific, right? 'Cause, like, every customer has their own, like, flavor of how to respond to these support tickets. And so, their challenge, like a lot of these agent companies, is like, how do you build a general purpose product when every customer, like, wants... you know, has, like, slightly different workflows and, like, preferences? That's a really interesting thing that I see the vertical AI agent companies talking about a lot, which is like, how do you have enough flexibility to make special purpose logic without turning into a consulting company where you're building, like, a new prompt for- for- for every customer? I actually think this, like, concept of, like, forking and merging prompts across customers and which part of the prompt is customer specific versus, like, company wide is like a- like a really interesting thing that the world is only just beginning to explore.

    8. DH

      Yeah,

  3. 4:59–6:51

    Different types of prompts

    1. DH

      that's a very good point, Jared. So, there's this concept of defining the prompt in the system prompt, then there's the developer prompt, and then there's the user prompt. So, what this means is, uh, the system prompt is basically almost like defining, uh, sort of the high-level API of how your company operates. In this case, the example of ParaHelp is very much a system prompt. There's nothing specific about the customer. And then as they add specific instances of that API and calling it, then they stuff all that into more the developer prompt, which is not shown here, and that adds all the context of, let's say, working with Perplexity. There are certain ways of how you handle RAG questions, as opposed to working with Bolt, which is very different, right? And then I don't think ParaHelp has a user prompt because their product is not consumed directly by an end user, but an end user prompt could be more like Replit or Azeru, right? Where users need to type, it's like, "Generate me a site that has these buttons," this and that. That goes all in the user prompt. So, that's sort of the architecture that's sort of emerging.
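
As a rough sketch of how that three-layer architecture maps onto a chat-style API call, assuming OpenAI's Python client purely as an example (the episode names no provider) and reusing the hypothetical MANAGER_PROMPT from the earlier sketch:

```python
# Sketch of the emerging system / developer / user prompt architecture,
# mapped onto OpenAI's chat API as one possible concrete example.
# Prompt contents are hypothetical.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        # System prompt: the company-wide "API". Nothing customer-specific.
        {"role": "system", "content": MANAGER_PROMPT},
        # Developer prompt: per-customer context, e.g. how this customer
        # wants RAG questions handled. (Newer OpenAI models accept a
        # "developer" role; otherwise fold this into the system message.)
        {"role": "developer", "content": "Customer: Perplexity. Policies: ..."},
        # User prompt: the end user's request, if the product exposes one.
        {"role": "user", "content": "My subscription renewed twice. Refund?"},
    ],
)
print(response.choices[0].message.content)
```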

    2. HT

      And to your point about avoiding becoming a consulting company, I think, um... there are so many startup opportunities in building the tooling around all of this stuff. Like, for example, anyone who's done prompt engineering knows that worked examples are really important to improving the quality of the output. And so then if you take, like, ParaHelp as an example, they really want good worked examples that are specific to each company. And so you can imagine that as they scale, you almost want that done automatically. Like, in your dream world, what you want is just, like, an agent itself that can pluck out the best examples from, like, the customer dataset, and then software that just, like, ingests that straight into, like, wherever it should belong in the pipeline, without you having to manually go and pluck it all out and ingest it yourself.

    3. JF

      That's probably a great segue into

  4. 6:51–7:58

    Metaprompting

    1. JF

      meta prompting, which is one of the things we want to talk about, because that's- that's a consistent theme that keeps coming up when we talk to our AI startups.

    2. GT

      Yeah. Troppier is, uh, one of the startups I'm working with in the current YC batch, and they've really helped companies like YC company Ducky do really in-depth understanding and debugging of the prompts and the return values from a multi-stage workflow. And one of the things they figured out is prompt folding. So you'll basically... one prompt can dynamically generate better versions of itself. So a good example of that is a classifier prompt that generates a specialized prompt based on the previous query. And so you can actually go in, take, uh, the existing prompt that you have, and actually feed it more examples where maybe the prompt failed or didn't quite do what you wanted, and you can actually... instead of you having to go and rewrite the prompt, you just put it into, um, you know, the raw LLM and say, "Help me make this prompt better." And because it knows itself so well, strangely, um, meta prompting is turning out to be a very, very powerful tool that everyone's using now.
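
A minimal sketch of that prompt-folding loop, assuming an OpenAI-style client; the helper name and meta-prompt wording are hypothetical:

```python
# Sketch of the prompt-folding / meta-prompting loop described above.
# Helper name, model choice, and prompt wording are hypothetical.
def improve_prompt(client, current_prompt: str, failures: list[str]) -> str:
    """Ask a strong model to rewrite a prompt using its own failure cases."""
    meta_prompt = (
        "You are an expert prompt engineer. Below is a prompt, followed by "
        "examples where it failed or produced unwanted output. Rewrite the "
        "prompt so it handles these cases. Return only the improved prompt.\n\n"
        f"<prompt>\n{current_prompt}\n</prompt>\n\n"
        + "\n".join(f"<failure>\n{f}\n</failure>" for f in failures)
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any strong general model works here
        messages=[{"role": "user", "content": meta_prompt}],
    )
    return response.choices[0].message.content
```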

    3. DH

      And

  5. 7:58–12:10

    Using examples

    1. DH

      the next step after, uh, you do sort of prompt folding, if the task is very complex, there's this concept of, uh, using examples, and this is what Jasberry does. It's one of the companies I'm working with this batch. They basically build automatic bug finding in code, which is a lot harder. And the way they do it is they feed a bunch of really hard examples that only expert programmers could do. Let's say you want to find an N+1 query; it's actually hard today for even, like, the best LLMs to find those. And the way they do it is they find parts of the code, then they add those into the prompt, and the meta prompt is like, "Hey, this is an example of an N+1 type of error," and then that works it out. And I think this pattern of, sometimes when it's too hard to even kind of write prose around it, let's just give it an example, turns out to work really well, because it helps LLMs to reason around complicated tasks and steer better, because you can't quite kind of put exact parameters. And it's almost like, um, unit testing in programming, in a sense; like, test-driven development is sort of the LLM version of that.
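
A sketch of that worked-example pattern; the code snippet and label below are hypothetical stand-ins for the expert-curated examples Diana describes:

```python
# Sketch of the worked-example pattern: when the task is too hard to
# specify in prose, paste expert-labeled examples into the prompt.
# The example code and label are hypothetical.
N_PLUS_ONE_EXAMPLE = """\
<example>
  <code>
  for user in users:
      orders = db.query("SELECT * FROM orders WHERE user_id = ?", user.id)
  </code>
  <label>
  N+1 query: one query per user inside a loop. Batch it with a JOIN
  or an IN clause instead.
  </label>
</example>
"""

BUG_FINDER_PROMPT = (
    "Find bugs in the code below. Here is a worked example of an N+1 "
    "query error and how to report it:\n\n" + N_PLUS_ONE_EXAMPLE
)
```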

    2. GT

      Yeah. Another thing that Troppier, uh, sort of talks about is, you know, the- the model really wants to actually help you so much that if you just tell it, "Give me back output in this particular format," i- even if it doesn't quite have the information it needs, it'll actually just tell you what it thinks you want to hear, and it's literally a hallucination. So one thing they discovered is that you actually have to give the LLMs a real escape hatch. You need to tell it, "If you do not have enough information to say yes or no or make a determination, don't just make it up. Stop and ask me." And that's a very different way to think about it.
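
In practice the escape hatch can be a short clause appended to the prompt; the wording and tag below are illustrative, reusing the hypothetical BUG_FINDER_PROMPT from the previous sketch:

```python
# Sketch of the escape hatch: give the model an explicit alternative to
# guessing. Tag name and wording are hypothetical.
ESCAPE_HATCH = """\
If you do not have enough information to say yes or no or to make a
determination, do NOT make something up. Instead respond with exactly:
<needs_info>describe what information is missing</needs_info>
"""

prompt = BUG_FINDER_PROMPT + "\n" + ESCAPE_HATCH  # reusing the earlier sketch
```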

    3. HT

      That's actually something we learned at some of the internal work that we've done with agents at YC, where Jared came up with a really inventive way to give the LLM a s- escape hatch. You want to talk about that?

    4. JF

      Yeah. So the Troppier approach is one way to give the LLM an escape hatch. We came up with a different way, which is in the response format, to give it the ability to have part of the response be essentially a complaint to you, the developer, that like, you have given it confusing or underspecified information and it doesn't know what to do. And then the nice thing about that is that we can just run your LLM like in production with real user data, and then you can go back and you can look at the outputs that it has given you in that, like, output parameter. Um, we- we call it debug info internally. So like we have this like debug info parameter where it's basically reporting to us things that we need to fix about it, and it literally ends up being like a to-do list that you, the agent developer, has to do. It's like really kind of mind-blowing stuff.
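
A sketch of that pattern: reserve a field in the structured output where the model can complain, then harvest it from production traffic. The schema and field names are hypothetical; "debug info" is just the name YC uses internally, per Jared:

```python
# Sketch of the "debug info" escape hatch in the response format, plus a
# harvester that turns complaints into a developer to-do list.
import json

RESPONSE_FORMAT = """\
Respond with JSON only:
{
  "decision": "approve" | "reject",
  "reason": "<one sentence>",
  "debug_info": "<anything in these instructions that was confusing or
                 underspecified; empty string if none>"
}
"""

def collect_debug_info(raw_outputs: list[str]) -> list[str]:
    """Harvest non-empty debug_info fields from production outputs."""
    todos = []
    for raw in raw_outputs:
        try:
            out = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output; skip it
        if out.get("debug_info"):
            todos.append(out["debug_info"])
    return todos
```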

    5. HT

      Yeah, just even for hobbyists or people who are interested in playing around with this for personal projects, like, a very simple way to get started with meta prompting is to follow the same structure of the prompt: just give it a role, and make the role be, like, you know, you're an expert prompt engineer who gives really, like, detailed, um, great critiques and advice on how to, um, improve prompts, and give it the prompt that you had in mind, and it will spit back a much more expanded, better prompt. And so you can just keep running that loop for a while. Works surprisingly well.

    6. DH

      I think that's a common pattern sometimes for companies when they need to get, um, responses from LLMs in their product a lot quicker. They do the meta prompting with a bigger, beefier model, any of the, I don't know, hundreds-of-billions-of-parameters-plus models, like, uh, I guess Claude 4 or 3.7, or your, uh, o3, and they do this meta prompting, and then they have a very good working prompt that they then use with the distilled model. So they use it on, for example, 4o, and it ends up working pretty well. Specifically sometimes for, uh, voice AI agent companies, because latency is very important to, uh, get this whole Turing test to pass, because if you have too much pause before the agent responds, I think humans can detect something is off. So they use a faster model but with a bigger, better prompt that was refined from the bigger models. So it's like a common pattern as well.
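
A sketch of that big-to-small distillation pattern, reusing the hypothetical improve_prompt() helper and client from the earlier sketches; the model names are examples, not recommendations:

```python
# Sketch of the distillation pattern: refine the prompt offline with a
# big model, then serve the refined prompt on a smaller, faster model
# (latency matters for voice agents). All names are hypothetical.
VOICE_AGENT_PROMPT = "You are a phone support agent for Acme Telecom..."
logged_failures = ["Caller asked about refunds; agent read the billing FAQ."]

# Offline: a big, slow model improves the prompt.
refined_prompt = improve_prompt(client, VOICE_AGENT_PROMPT, logged_failures)

# Production: a small, fast model runs the big-model-refined prompt.
reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": refined_prompt},
        {"role": "user", "content": "Hi, I think I was double-charged."},
    ],
)
```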

    7. HT

      Another,

  6. 12:10–14:18

    Some tricks for longer prompts

    1. HT

      again, s- less sophisticated maybe, but, um, like, as the prompt gets longer and longer, I think it becomes a- a large working doc. Um, one thing I found useful is, as you're using it, if you just note down in a Google Doc things that you're seeing... the outputs not being how you want, or, uh, ways that you can think to improve it... you can just write those in note form and then give Gemini Pro, like, your notes plus the original prompt and ask it to suggest a bunch of edits to the prompt, um, to incorporate these in well, and it does that quite well.

    2. DH

      The other trick is, uh, in, uh, Gemini 2.5 Pro, if you look at the thinking traces as it's-

    3. HT

      Sure.

    4. DH

      ... uh, parsing through, uh, evaluation, you could actually learn a lot about all those misses as well. We've done that internally as well, right?

    5. JF

      Guys, this is critical, 'cause if you're just using Gemini via the API, until recently, you did not get the thinking traces. And, like, the thinking traces are, like, the critical debug information to, like, understand, like, what's wrong with your prompt. They just added it to the API, so you can now actually, like, pipe that back into your developer tools and workflows.
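
Pulling those traces out of the API looks roughly like this, assuming Google's google-genai Python SDK and its thinking config; this surface has been changing, so verify against the current docs:

```python
# Sketch of reading Gemini thinking traces via the API, assuming the
# google-genai SDK's thinking config. Prompt content is hypothetical.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Evaluate this support reply against the rubric: ...",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(include_thoughts=True),
    ),
)

for part in response.candidates[0].content.parts:
    if getattr(part, "thought", False):
        print("THINKING:", part.text)  # the trace to debug your prompt with
    else:
        print("ANSWER:", part.text)
```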

    6. HT

      Yeah, I think it's an underrated, um, consequence of Gemini Pro having such long context windows, is you can effectively use it like a- a REPL. Go sort of, like, one by one, like, put your prompt on, like, one example, then literally watch the reasoning trace in real time to figure out, like, how you can steer it in the direction you want.

    7. GT

      Jared and the software team at YC have actually built, um, y- you know, various forms of workbenches that allow us to, like, do debugging and things like that. But to your point, like, sometimes it's better just to use gemini.google.com directly, and then drag and drop, you know, literally JSON files. And, uh, you know, you don't have to do it in some sort of-

    8. HT

      Mm-hmm.

    9. GT

      ... special container. Like, it, you know, seems to be totally something that works even directly in, you know, ChatGPT itself.

    10. HT

      Yeah, this is all stuff... um, I would give a shout-out to YC's Head of Data, Eric Bacon, who's, um, helped us a lot with all of this meta prompting and using Gemini 2.5 Pro as, uh, effectively a

  7. 14:18–17:25

    Findings on evals

    1. HT

      REPL.

    2. GT

      What about evals? I mean, we've, uh, talked about evals for going on a year now. Um, what are some of the things that founders are discovering?

    3. JF

      Even though we've been saying this for a year or more now, Garry, I think it's still the case that, like, evals are the true crown jewel, like, data asset for all of these companies. Like, one- one reason that ParaHelp was willing to open source the prompt is they told me that they actually don't consider the prompts to be the crown jewels. Like, the evals are the crown jewels, 'cause without the evals, you don't know why the prompt was written (laughs) -

    4. DH

      Mm-hmm.

    5. JF

      ... the way that it was, (laughs) um, and it's very hard to improve it.

    6. GT

      Yeah. And I- I think in abstraction, you can think about, you know, YC funds a lot of companies, especially in vertical AI and SaaS, and then you can't get the evals unless you are sitting literally side by side with people who are doing X, Y, or Z knowledge work. You know, you need to sit next to the tractor sales regional manager-

    7. JF

      (laughs)

    8. GT

      ... and understand, well, you know, this person cares a- ... You know, this is how they get promoted, this is what they care about, this is that person's reward function. And then, you know, what you're doing is taking these in-person interactions, sitting next to someone in Nebraska, and then going back to your computer and codifying it into, uh, very specific evals. Like, this particular user wants this outcome after they ... You know, after this invoice comes in, we have to decide whether we're gonna honor the, you know, the warranty on this tractor. Like, just to take one of- one example. That's the true value, right? Like, you know, where everyone's really worried about, um, "Are we just wrappers?" And, you know, "What is going to happen to startups?" And I think this is literally where the rubber meets the road, where, um, if you- you know, if you are out there in particular places, understanding that user better than anyone else and having the software actually work for those people, that's the moat.

    9. JF

      Is that just, like, such a perfect depiction of, like, what is the core competency required of founders today? Like, literally, like, the thing that you just said, like, that's your job as a founder of a company like this is to be really good at that thing-

    10. GT

      Yeah, yeah.

    11. JF

      ... and, like, maniacally obsessed with, like, the details of the regional tractor sales manager's workflow.

    12. GT

      Yeah.

    13. JF

      Yeah.

    14. GT

      And then the wild thing is, it's very hard to do. Like, you know, how- y- have you even been to Nebraska?

    15. JF

      (laughs)

    16. DH

      (laughs)

    17. GT

      You know? The classic view is that, uh, the best founders in the world, they're, you know, sort of really great cracked engineers and technologists, and, uh, just really brilliant, and then at the same time, they have to understand some part of the world that very few people understand. And then there's this little sliver that is, you know, uh, the founder of a multi-billion dollar startup. You know, I think of Ryan Petersen from Flexport, you know, really, really great person who understands how software is built, but then also I think he was the third-biggest, uh, importer of medical hot tubs for-

    18. DH

      (laughs)

    19. GT

      ... an entire year, like-

    20. HT

      (laughs)

    21. GT

      ... you know, a decade ago. So, you know, the weirder that is, the more of the world that you've seen that nobody else who's a technologist has seen, uh, the greater the opportunity, actually.

  8. 17:25–23:18

    Every founder has become a forward deployed engineer (FDE)

    1. HT

      I think you've put this in a really interesting way before, Garry, where you're sort of saying that every founder's become a forward-deployed engineer. That's like a term that traces back to Palantir, and since you were early at Palantir, maybe tell us a little bit about well, how did forward-deployed engineer become a thing at Palantir, and- and what can founders learn from it now?

    2. GT

      Yeah. I mean, I think the whole thesis of Palantir at some level was that, um, if you look at Meta, back then it was called Facebook, or Google, or any of the top software startups that everyone sort of knew back then, one of the key recognitions that Peter Thiel, and Alex Karp, and Stephen Cohen, and Joe Lonsdale, Nathan Gettings, like, the original founders of Palantir had, was that, uh, go into anywhere in the Fortune 500, go into any government agency in the world, including the United States, and nobody who understands computer science and technology at the level that- you know, at the highest possible level would ever even be in that room. And so Palantir's sort of really, really big idea that they discovered very early was that, uh, the problems that those places face, they're actually multi-billion dollar, sometimes trillion dollar problems. And yet, uh, this was well before AI became a thing, you know. I mean, people were sort of talking about machine learning, but, you know, back then they called it data mining, you know. The world is awash in data, these, you know, giant databases of people and things and transactions, and we have no idea what to do with it.

      That's what Palantir was, is, and still is: that, um, you can go and find the world's best technologists who know how to write software to actually make sense of the world. You know, you have these petabytes of data, and you don't know, how do you find the needle in the haystack? Um, and, you know, the wild thing is, going on, uh, something like 20, 22 years later, it's only become more true that we have more and more data, and we have less and less of an understanding of what's going on. And, uh, it's no mistake that actually now that we have LLMs, like, it is becoming much more practical.

      And then the forward deployed engineer title was specifically, how do you sit next to literally the FBI agent who's, um, investigating domestic terrorism? How do you sit right next to them in their actual office and see, what does the case coming in look like? What are all the steps? Uh, when you actually need to go to the federal prosecutor, what are the things that they're sending? I mean, what's funny is, like, literally it's, like, Word documents and Excel spreadsheets, right? And, um, what you do as a forward deployed engineer is take these sort of, you know, file cabinet and fax machine things that people have to do and then convert it into really clean software. So, you know, the classic view is that it should be as easy to actually do, uh, an investigation at a three-letter agency as going and taking a photo of your lunch on Instagram and posting it to all your friends (laughs). Like, that's, you know, kind of the funniest part of it. And so, you know, I think it's no mistake today that forward deployed engineers who came up through that system at Palantir, they're turning out to be some of the best founders at YC, actually.

    3. JF

      Yeah, I mean, it produ- produced this incredible, incredible number of startup founders, 'cause yeah, like, the training to be a forward deployed engineer, that's exactly the right training to be a founder of these companies now. The- the other interesting thing about Palantir is like other companies would send, like, a salesperson to go and sit with the FBI agent, and like Palantir sent engineers to go and do that. I think Palantir was probably the first company to really, like, institutionalize that and scale that as a process, right?

    4. GT

      Yeah, I mean, I think what happened there, the reason why they were able to get these sort of seven and eight and now nine figure contracts very consistently is that, uh, instead of sending someone who's like hair and teeth and they're in there and, you know, "Let's go to the, let's go to the, uh, steakhouse," you know, it's all like relationship, and you'd have one meeting, uh, they would really like the salesperson, and then through sheer force of personality, you'd try to get them to give you a seven-figure contract. And like the timescales on this would be, you know, six weeks, 10 weeks, 12 weeks, like five years. I don't know (laughs) . It's like... And the software would never work. Uh, whereas if you put an engineer in there, and you give them, uh, you know, uh, Palantir Foundry, which is what they now call sort of their core, uh, data viz and data mining suites, instead of the next meeting being reviewing 50 pages of, you know, sort of sales documentation or a contract or a spec or anything like that, it's literally like, "Okay, we built it." And then-

    5. JF

      (laughs)

    6. GT

      ... you're getting, like, real live feedback within days, and, I mean, that's honestly the biggest opportunity for startup founders. If startup founders can do that, and, uh, that's what forward deployed engineers are sort of used to doing, that's how you could beat a Salesforce or an Oracle or, you know, a Booz Allen or literally any company out there that has a big office and a big fancy... you know, you have big fancy salespeople with big strong handshakes, and it's like, how does a really good engineer with a weak handshake go in there and beat them? Well, it's actually, you show them something that they've never seen before and, like, make them feel super heard. You have to be super empathetic about it. Like, you actually have to be a great designer and product person, and then, you know, come back, and you can just blow them away. Like, the software's so powerful that, you know, the second you see something that, you know, makes you feel seen, you want to buy it on the spot.

    7. JF

      Is a good way of thinking about it that founders should think about themselves as being the forward deployed engineers of their own company?

    8. GT

      Absolutely, yeah. Like, you definitely can't farm this out. Like, literally, the founders themselves, they're technical. They have to be the great product people. They have to be the ethnographer. They have to be the designer. You want the person on the second meeting to see the demo you put together based on the stuff you heard, and you want them to say, "Wow, I've never seen anything like that, and take my money."

  9. 23:18–26:13

    Vertical AI agents are closing big deals with the FDE model

    1. DH

      I think the incredible thing about this model is this is why we're seeing a lot of the vertical AI agents take off; it's precisely this, because they can have these meetings with the end buyer and champion at these big enterprises. They take that context, and then they stuff it basically in the prompt, and then they can quickly come back in a meeting, like, just the next day. Maybe with Palantir it would have taken a bit longer and a team of engineers. Here, it could be just the two founders who go in, and then they would close the six-, seven-figure deals we've seen, and with large enterprises, which has never been done before, and it's just possible with this new model of forward deployed engineer plus AI. It's just accelerating.

    2. HT

      It just reminds me of a company I mentioned before on the podcast, like, GigaML, who do, uh, customer support as well, and especially a lot of voice support, and it's just a classic case of two extremely, um, talented software engineers, not natural salespeople, but they force themselves to be essentially forward deployed engineers, and they closed a huge deal with Zepto and then a couple of other companies they can't announce yet, but they-

    3. JF

      And, and do they physically go on site, like the Palantir model?

    4. HT

      Yes, so they do, so they did all of that, where once they close the deal, they go on site, and they sit there with all the customer support people in figuring out how to keep tuning and getting the software or the LLM to work even better. But before that, even to win the deal, what they found is that they can win by just having the most impressive demo, and in their case, they've, um, innovated a bit on the RAG pipeline so that they can, um, have their voice responses be both accurate and very low latency, so, like, a technically challenging thing to do. But I just feel like pre- sort of the current LLM rise, you couldn't necessarily differentiate enough in the demo phase of sales to beat out incumbents, so you couldn't really beat Salesforce by having a slightly better CRM with a better UI. But now, because the technology evolves so fast and it's so hard to get this, like, last five to 10% correct, you can actually, if you're a forward-deployed engineer, go in, do the first meeting, tweak it so that it works really well for that customer, go back with a demo, and just get that, "Oh, wow, like, we've not seen anyone else pull this off before," experience and close huge deals.

    5. DH

      And that was the exact same case with Happy Robot, who has sold seven-figure contracts to the top three largest logistics brokers in the world. They build AI voice agents for that. They are the ones doing the forward-deployed engineer model and talking to, like, the CIOs of these companies and quickly shipping a lot of product, like, very quick turnaround. And it's been incredible to see that take off right now, and it started from six-figure deals, and now they're closing seven-figure deals, which is crazy.

    6. HT

      Nice.

    7. DH

      This is just a couple of months after.

    8. GT

      So that's the kind of stuff that you can do (laughs) with, uh, I mean, unbelievably very, very smart prompt engineering, actually.

  10. 26:13–27:26

    The personalities of the different LLMs

    1. GT

      Well, one of the things that's kind of interesting about, uh, each model is that they each seem to have their own personality. And one of the things the founders are really realizing is that you're gonna go to different people for different things, actually.

    2. DH

      One of the things that's known a lot is that Claude is sort of the more happy and more human, steerable model, and the, uh, other one, uh, LLaMA 4, is one that needs a lot more steering. It's almost like talking to a developer, and part of it could be an artifact of not having done as much RLHF on top of it. So, it's a bit more rough to work with, but you could actually steer it very well if you are good at doing a lot of prompting and almost doing a bit more RLHF, but it's a bit harder to work with, actually.

    3. GT

      Well, one of the things we've been using, uh, LLMs for internally is actually helping founders figure out who they should take money from. And so in that case, sometimes you need a very straightforward rubric, a 0 to 100, 0 being never ever take their money and 100 being take their money right away. Like, they actually help you so much that you'd be crazy not to take their money. Harj,

  11. 27:26–29:47

    Lessons from rubrics

    1. GT

      we've been working on, uh, some scoring rubrics around that using prompts. What, what are some of the things we've learned?

    2. HT

      So it's certainly best practice to give, um, LLMs rubrics, especially if you want to get a numerical score as the output. You want to give it a rubric to help it understand, like, how should I think through this, and what's, like, an 80 versus a 90? But these rubrics are never perfect. There are almost always exceptions.
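
A minimal sketch of a numeric scoring rubric in prompt form, using the investor-scoring example from this episode; the anchors and criteria are illustrative, not YC's actual rubric:

```python
# Sketch of a numeric scoring rubric: anchor the scale so the model
# knows what separates an 80 from a 90. Criteria are illustrative.
RUBRIC_PROMPT = """\
Score this investor from 0 to 100:
- 90-100: immaculate process; answers email fast; never ghosts founders.
- 70-89: strong track record but sometimes slow to get back.
- 40-69: mixed founder references; inconsistent follow-through.
- 0-39: pattern of ghosting or founder-hostile behavior.

Rubrics are never perfect: if a case does not fit, explain the
exception before you score it.

Respond with JSON: {"score": <int>, "rationale": "<two sentences>"}
"""
```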

    3. DH

      And you tried it with, uh, O3 versus Gemini 2.5 and you found discrepancies?

    4. HT

      Yeah. This is what we found really interesting: um, you can give the same rubric to two different models. And in our specific case, what we found is that, um, O3 was very rigid, actually. Like, it really sticks to the rubric. It, uh, heavily penalizes for anything that doesn't fit, like, the rubric that you've given it. Whereas Gemini 2.5 Pro was actually quite good at being flexible, in that it would apply the rubric, but it could also sort of almost reason through why someone might be, like, an exception, or why you might want to, um, push something up more positively or negatively than the rubric might suggest. Which I just thought was really interesting, because it's just like when you're training a person. You give them a rubric. Like, you want them to use a rubric as a guide, but there are always these sort of edge cases where you need to think a little bit more deeply. Um, and I just thought it was interesting that the models themselves will handle that differently, which means they sort of have different personalities, right? Like, O3 felt a little bit more like a soldier-

    5. GT

      (laughs)

    6. HT

      ... sort of like, "Okay, I'm definitely gonna go check, check, check, check, check." Um, and Gemini Pro 2.5 felt a little bit more like a high agency sort of employee. It was like, "Oh, okay, now I think this makes sense, but this might be an exception in this case," which was, um, which was really interesting to see.

    7. GT

      Yeah, it's funny to see that for investors. You know, sometimes you have investors like a Benchmark or a Thrive. It's like, "Yeah, take their money right away. Their process is immaculate. They never ghost anyone. They answer their emails faster than most founders." It's, you know, very impressive. And then, uh, one example here might be, you know, there are plenty of investors who are just overwhelmed, and maybe they're just not that good at managing their time, and so they might be really great investors, and their track record bears that out, but they're sort of slow to get back. They seem overwhelmed all the time. They accidentally, probably not intentionally, ghost people. And so this is legitimately exactly what an LLM is for. Like, the debug info (laughs) on some of these is very interesting to see; like, you know, maybe it's a 91 instead of, like, an 89. We'll see.

  12. 29:47–31:00

    Kaizen and the art of communication

    1. GT

      I guess one of the things that's been really surprising to me as, you know, we ourselves are playing with it and we spend, you know, maybe 80 to 90% of our time with founders who are all the way out on the edge, is, uh, you know, on the one hand, the analogies I think even we use to discuss this is, uh, it's kind of like coding. It kind of actually feels like coding in, you know, 1995. Like, the tools are not all the way there. There's a lot of stuff that's unspecified. We're, you know, in this new frontier. But personally, it also kind of feels like learning how to manage a person.

    2. HT

      Mm-hmm.

    3. DH

      Hmm.

    4. GT

      Where it's like, how do I actually communicate, uh, you know, the things they need to know in order to make a good decision? And how do I make sure that they know, um, you know, how I'm going to evaluate and score them? And, uh, not only that, like, there's this aspect of kaizen, you know, this, um, this manufacturing technique that created really, really good cars for Japan in the '90s. Uh, and that principle actually says that the people who are the absolute best at improving the process are the people actually doing it. That's literally why, uh, Japanese cars got so good in the '90s. And that's meta prompting to me. So, I don't know, it's a brave new world. We're sort of in this

  13. 31:00–31:20

    Outro

    1. GT

      new moment. So, with that, we're out of time, but can't wait to see what kind of prompts you guys come up with, and we'll see you next time. (instrumental music)

Episode duration: 31:26
