EVERY SPOKEN WORD
25 min read · 4,745 words- SPSpeaker
[upbeat music] [audience applauding]
- SPSpeaker
Hello, everyone, and thank you so much for joining me this afternoon in the breakout room, the last session today of Code with Claude. I hope you've all had a fantastic day so far. My name is Margot Van Lare. I am an applied AI engineer at Anthropic here in London, and this afternoon we're gonna be talking about the prompting playbook. And prompting is arguably one of the first skills, if not the first skill, that we had to learn as engineers when we first started to work with LLMs. And even now, it continues to be one of the most critical, um, skills to building effective AI systems. So today we're gonna discuss some best practices, uh, in the context of two practical scenarios that you're probably encountering at work. The first is where you have an existing prompt in production that you've been maintaining for some time, um, and possibly you're migrating it to a new model or making a change to the architecture, and for some reason it's no longer working as well. The second scenario is where we're building an entirely new agentic use case from the ground up, and we need to build the prompt from zero to one. Now, in order to illustrate these best practices, I don't just want to give you a list of dos and don'ts. I want to walk through a practical example that's been inspired by real prompts that, um, I've seen some of our customers work with who are building on Claude. So the prompt that we'll look at today is a miniaturized example. The prompts that you're working with are probably a lot longer and more complex than the one we'll see today, um, but it's representative of some common problems that you might encounter when maintaining a prompt. So imagine that we have a prompt that multiple people have been collaborating on, contributing to. There's no clear owner. It covers a lot of different areas like policy, like tone, processes, and we have some patches for kind of previous models that we've migrated to all mixed together. Um, it's built up, and it's complex. And when we're migrating to a new model, we're finding that suddenly a lot of our test cases are no longer working as well as we expected. So what's actually going on here? Well, in order to start unpacking that question, um, we need a starting point, and that starting point is evaluations. We need evaluations to provide that rigor, um, to understand whether a change to our prompt is actually correlating to an improvement in its performance. And we have different models which have different capabilities and different behaviors. And when you migrate to a different model, it could be that your system is no longer working as well for two reasons. First of all, if, um, the new model might be capable, but it's behaving differently, and therefore we can tune our prompting to fix that behavior. The second case is where actually the model that we're changing to isn't as capable, and no amount of prompting is gonna fix that. So we need to have an eval suite to act as a way of testing that regression so that we can apply our prompting best practices to that. So in the example that we're gonna be looking at today, as I said, it's gonna be a miniaturized example. We'll have five test cases in our eval. In reality, you'll have a lot more test cases in your eval suite, but the key thing here is that it's representative of three key cases that we need to cover. Those three key cases include having a control case, which is a case which should always pass. It's something that the m-- we know the model handles well. It's unambiguous. The second is edge cases, and these are cases where we've seen the model fail before, and by including instructions into the prompt, we're making sure that same behavior doesn't slip through again in the future. And finally, and critically, we need to make sure that the model has a good understanding of the extent of its capabilities, where it should be handing off to a human or where actually maybe it should be point-blank refusing to answer a request. So in the example that we're gonna be looking at today, um, we'll be using a, um, prompt for a customer support bot for a telco company called Meridian Mobile, and these are the five test cases that we are going to be looking at today. We have a simple control case looking at, um, you know, what's the data limit in the basic plan? Uh, we're also looking at edge cases such as its ability to do calculations, such as calculating proration bills. If I switch my bill halfway through the month, uh, or if I switch my plan halfway through the month, what will my bill look like? We wanna check that it's accurately addressing key questions which are covered by our policy. Um, we need to make sure that it's escalating to a human whenever there is, um, a billing error. Um, and finally, we wanna make sure that our model isn't withholding any information that it has access to which it should be handing over to the customer.So what we're gonna do in this process is we'll take our prompt and we'll run it on our V0, um, of, of the eval, and we'll see what our failure modes are and systematically target those failure modes one at a time to see if we can resolve those failure modes by prompting. And along the way, we'll learn a little bit more about the kind of anti-patterns, um, and traps to avoid. And this is representative of how we would apply these best prompting techniques in practice, right? We are rarely writing a prompt from scratch. We're often debugging an existing prompt. And best practice before we start targeting those failure modes specifically is to kind of apply our general prompting one-o-one best practices, applying general hygiene to clean up before we do the eval run. So let's have a little look at the example that we're gonna be using. So what we're looking at here, first of all, before we look at the prompt, is just this Vibe coded web app that I've made for the presentation today so that we can look at how we're iterating on the prompt together. Um, in this page here, I can easily run my evals on all five test cases and inspect the results in a little bit more detail. So before we have a look at the prompt, I'm just gonna run the evals in the background. This is a pretty good first pass at a prompt. When we look at this, we've defined the bot's role at the top. When we scroll down, um, we've given it some data. We've given it some information on how to reason over, um, the answers that it should be giving to the customer. It's giving some critical instructions around the tone it should use, um, how to do calculations, et cetera. And then finally, we're passing in our customer account context and our user message. So let's have a look at how our first pass at the evals did. So we can see, as we expect, our control case, all of our test cases have passed. This is what we expect for this unambiguous test case. But it's performing pretty poorly in these other areas. Now, before we zoom in on those specific failure modes here, let's do some general cleanup of our prompt. So as we mentioned, when we look through this prompt, there's a couple oddities here already. So for example, first one is we're telling the bot that it's, um, a human, which just isn't true, right? We can see as we scroll down, there's clearly some information here that has been copied directly from a website. So the key giveaway here is a reference to a hero image. Um, there's even some references to cookies, um, at the bottom. So we need to remove a bit of redundant information. When we look at the instructions here, they're all grouped into one big paragraph. So we've got some reasoning here. We've got instructions about the role, um, some critical instructions as well, without a real way of unpacking, um, policy from guidelines, from tone, et cetera. So, let me just... I've preempted some changes we wanna make to this prompt, and this is just a diff view of some of those changes. So what we've done is, first of all, added some structure. So you can see that we've added XML tags here to define the role, to separate general guidelines, to separate policy, to separate tone of voice, um, et cetera. So if we run that eval button on this new updated prompt, we should hopefully see an improvement in the output as is. So we can see just by clearing up the prompt, we've already improved the model's performance on this prepaid scenario. There's an interesting regression there in that fifth hotspot case, and I don't wanna worry too much about that now. There's gonna be some natural level of variance in the different runs of the eval, and we'll come back to that case specifically to see if we can make the prompt consistently better in that area. So what did we learn from this then? Um, simply clearing up the prompt with a better structure, with a better role description has improved the performance. And this is a best practice that you can return to at any stage of writing and maintaining your prompt, especially as your prompts get more detailed and more complex. A general rule of thumb that I like to follow is if you're reading a prompt and you can't tell guidelines from policy, from data, most likely the model isn't able to either. So before looking at some of those cases in more detail, there's a little bit more general cleanup we can do, um, specifically here looking at creating an output contract. This is a key best practice to follow if you're struggling with your output format consistency. Now, in this case, we have a customer support bot. We want it to reply in a conversational tone, so it's unlikely to be a big issue in this case. But it's something to bear in mind if you're, you're dealing with more complex output structures like nested JSONs, for example. So again, if we go back to the prompt and see what fixes we can apply here.First of all, we've added a section, uh, um, at the end where we've defined an output format for the model, telling it to use, uh, um, XML tags to output the response. But the prompt is not always the most effective way of handling issues. We can also change things in the harness to ensure consistency to a higher degree. So what we've added here to the API call is a stop sequence, which is gonna detect that closing XML tag and tell the model to stop generating a response at that point. Now, when I run the eval here, I don't necessarily expect to see any clear improvement in performance, um, but it's a general best practice that we should be following and, as I said, is something that we should remember, in particular when we have more complex output schemas. One thing to point out here as well, if you do have a more complex output schema, something like structured outputs can be incredibly helpful to ensure that consistency in a more programmatic way. Okay. So after the cleanup then, we can see that we now have two test cases which are consistently passing, but we have three key failure modes: the proration, the billing error, and the hotspot. So let's isolate these one by one, uh, um, to iterate on the prompt and, and see the effect of that. First of all then, the hotspot question. So the question is: How much hotspot data is on my unlimited plan? What we expect the model to do is state directly the amount of hotspot data that the customer has. And the reason this is a slightly complex case is because the customer test case that we're dealing with is on a legacy plan, so actually the current policy doesn't apply to them. So if we see what's going on in the actual test case here, the customer data which we are feeding, um, um, to the prompt includes the amount of hotspot data that customer has. They have five gigabytes, right? But they also have a grandfathered plan. So what we're seeing, uh, the model is actually telling the customer is the general, um, the unlimited plan includes four gigabytes, uh, but since you're on a legacy plan, you should go check this out yourself. So let's have a look at the prompt then to see, um, why the model is deflecting this question to the customer account URL rather than actually giving the information itself. Now, if we read this prompt, originally it said, "We changed our plans recently, and the policy doc shows the current plan data and customers on grandfather's plan have different rates. Never give a customer the wrong plan details. Instead, point them to the URL." So it's clear that this instruction, this latter one, "Never give customer the wrong information," is the instruction that the bot has been optimizing for. And you might recognize this as being very similar to a patch that you might have introduced in a previous model that you were using to avoid-- where the model was giving the customer the wrong information about that plan. Now, as our models have evolved, they've gotten much better at instruction following, so it's likely that instructions like these have now become redundant and are actually being overfitted too. So what we're gonna tell the model instead is give this balanced view, uh, um, where it says, you know, "Customers on grandfather's plan have different allowances, but it's captured in the customer information that's given, and that is the accurate source of truth." So running the eval here, we should hopefully be addressing, uh, um, all of the test cases for the hotspot case. Now, I am running this live, so there could be some variability here, but we see here that now clearly all of our test cases are, are passing. So what did we learn from this? Well, we worry a lot about hallucinations or the invention of facts and numbers, but actually the opposite can also happen. The model can withhold information that it actually has access to. Now, we saw here that this is likely a result of a patch that we introduced for a previous model, and a best practice that we could follow here is actually using version control, where wherever we are making defensive changes in the prompt, we are tracking the reason why we've introduced these. Sometimes they're necessary, but in the future, these kind of changes can produce unwanted effects so that we can backtrack on them. So the next failing test case then is this proration calculation, where a customer asks, "What if I upgrade to the thirty gigabyte plan? What will my next bill be?" And what we want the model to do is to perform some calculation and return exactly, uh, um, what their next bill would be, rather than giving some sort of vague output, which is what we can see it's doing right now. Uh, um, if we look at what the model is returning, it's clearly reasoning through it. It's doing a little bit of mental maths here and there, but it's not really giving the customer a concrete answer, and I wouldn't rely on this as being able to accurately give the customer a response.So if we look at the prompt then to see how we can fix this. In the original prompt, we can see that all the instructions that were given to it is telling it, "Don't ever give a customer a vague answer. Uh, um, critical. Always calculate any prorated amounts correctly." Now, telling the model to do a good job isn't particularly helpful when we don't give the model the capability to actually do a good job. We want to avoid the model doing mental math. So what we're gonna introduce is give the model a tool. So we're saying in the prompt, "Whenever you're doing any calculations, please use the calculate proration tool to do so." In order to introduce that tool, we need to introduce it into the API to tell the model, "You have access to this tool." We need to define, um, the tool schema, which tells the model what this tool does and when to use it. And then finally, we need to actually implement the tool, which is the maths behind how it should be doing that calculation. So running that eval then for another pass. We can see that all the test cases are now passing. It's clearly done, uh, um, the maths using the tool in the background and returning the correct response. So the key lesson to take away here is instructions don't add capability. Telling the model it's critical to do a calculation right doesn't make it better at mental math. So the correct approach was to give it a tool, overall giving it the ability to reason over harder problems and using tools to actually execute them reliably. So now we have one final failing test case which we need to address, which is this billing error here. In this scenario, there is a billing conflict, and what we really want is the agent to escalate this to a human. And what we're seeing it doing instead is it's trying to explain to the customer what the reason behind it might be, and it's trying to kind of diagnose the problem itself. So in order to fix this behavior, let's again have a look what it was told in the prompt. We see in the initial instructions it was given, it says, "Avoid escalating or transferring to a care specialist unless absolutely necessary, as it costs approximately eight dollars and it counts against our team's fast contract resolution." Now, this is only giving one side of the story, right? We're telling it what the cost is to escalating, but not the benefit, which means it's going to overfit again to not escalating this scenario. And second of all, we've got this clear conflict between what we've defined in the eval in terms of what we want the model to do to do this escalation versus what we're actually telling it to do. And the fix that's relevant here is to give it both sides of the story by saying, "It costs eight dollars, uh, um, to escalate a case, but actually, if you get this wrong, then it's gonna cost you a refund as well as customer trust." Again, here we observed how the model optimizes for a goal, and this kind of instruction is a common instruction to give. It's quite similar to the one we saw earlier, where we didn't want it to overfit to a certain type of behavior. But it's the kind of instruction that can be followed quite differently by different generations of models. And specifically, as models become more intelligent, we need to remember to state both sides of the trade-offs because our models are becoming better themselves at making those trade-offs themselves. So if we just go back to our eval then and, um, run our final test case, we should see that all of our evals are now passing correctly. So overall, we looked at applying general hygiene principles, how that can provide an initial uplift to the prompt, making sure we're removing any redundant instructions which were initially intended as patches for previous models' behavior, making sure we're giving it tools to do certain tasks reliably. Now, there's one other scenario that we, uh, introduced at the start, which is one that you might also encounter in your work, which is where we're building a new agent from scratch. And the example that we'll look at here is, um, an agent whose purpose it is to create a week-long retail staff schedule based on employee availability and other constraints. And when we're building a new agent from scratch, we need to consider not just the prompt, but also the model that we're using and the harness that we're using. So in this next example, we're gonna compare a number of approaches to explore the impact of those three different areas.So again, I've just five coded up this web app so that we can walk through this problem, uh, um, in this demo. Here, I've just laid out what the problem is that we're addressing. We have our eight employees. Um, on the right, we have this schedule that we need to staff with the headcount, and we have our constraints that must be satisfied in every scenario. Now, because we have these hard rules, rather than using an LLM judge like we did in the previous case to do the grading, we can actually use a-- just a Python function which programmatically checks for every schedule that's generated, how many violations were made. So to begin with, in a s-- we wanna start simple. We're gonna use a simple prompt. We're gonna use the bare bones that we think we'll need with a model, Sonnet four-six, to see how it performs and how we're gonna hill climb against that. Um, so here is our baseline prompt. We've already applied some of that general hygiene and those best practices that we saw earlier on using XML tags to structure the prompt. We've given it an output format as well. Now that we're giving a schedule, uh, we're asking it to output a JSON, which if we don't give that output structure, might lead to parsing errors, uh, downstream. When we run the simple model on a first iteration of the evals, all cases fail. Now, just what we're looking at here is in our test set, we're essentially repeating, uh, um... We're doing five trials here, uh, and these numbers are showing how many violations were made in each trial. In the output, we can see that it's made a decent attempt at reasoning through the problem, but it's burning a lot of tokens, uh, and it's clearly not checking its work, um, as it's not getting to the right impact. So let's try a larger model, uh, a model which we know is better at reasoning. So we're gonna run it through, uh, Opus four point seven instead, keeping everything else the same. Now, interestingly, whilst all test cases are still failing, you can see that the overall number of violations that Opus has made has reduced significantly from Sonnet four point six. So we're possibly onto something here, right? This isn't good enough to ship because it's still failing, but clearly giving it more reasoning capability is helping drive it towards a better result. So what we're gonna try next is using Opus with adaptive thinking instead. So it can decide for itself how much thinking it needs, how much reasoning it needs to use to solve this issue. So no change to the prompt, really, just, uh, a change to the API here. So this now seems to reliably generate compliant schedules, but it requires a lot more tokens. We're tripling essentially the number of tokens that we're using here, and we're tripling the latency. So we want to try and see if we can optimize that cost latency trade-off a little bit more. Um, this is latency a hundred seconds. Obviously, I'm running this one async, uh, for the purposes of time. Opus four-seven hasn't magically gotten much faster since the last time, uh, you used it. Um, so let's see if we can optimize a little bit more for the token latency trade-off. What we haven't tried yet is using Sonnet four-six, so a smaller model, but with a better prompt. We looked a lot at the prompt optimization, uh, um, in that last section. So I've added a couple details to the prompt here. We're t-- we're... in particular, how to, uh, reason through this problem, and most critically, telling it to check its work before outputting it. So when I ran that eval, we see that it passes in two out of the five cases. Now, the failure modes that we're seeing is actually not violations of the scheduling requirements, but the model hasn't been able to finish the tasks within the output limit that we set. So whilst we could increase the max tokens that this model is able to use to get all five te-test cases passing, we see here that we're using even more tokens, and this run has an even higher latency. So this is probably not the route that we want to go down. Now, as a final pass then, we want to look at doing this a little bit more agentically. So we're gonna use this generate, evaluate, repair loop, where essentially the generator now creates a first draft of the schedule. And then we have a separate prompt which reports any specific violations that it made. So not programmatically checking it, but checking it with an LLM. So we're checking for every rule, and we're providing evidence of every violation.And we then have a third repair prompt which receives, uh, any violations that were made and tries to make targeted fixes to it. So we have three very simple prompts, but they're now running independently rather than trying to do everything in one large prompt. So we can see in this case, our agentic approach has solved all of our test cases, uh, with a much lower number of tokens and with a lower latency than trying Sonnet four-six with a better prompt. So going forward, it seems like there's two appropriate approaches to take here, using Opus four-seven with adaptive thinking or using this agentic loop. Now, moving forward, we'd probably want to do a little bit more optimization on this loop to try and get it to be more efficient. But there's one key benefit as well from using this generate, evaluate, repair loop, and that is that you can put in soft requirements at runtime. So in the evaluation prompt, we can say, "Harry doesn't like working with Sally, so as much as possible, try and separate them from working together." Or, "We need a third shift, uh, um, on Wednesday," for example. So it means that you're not having to make changes to the, uh, Python function, which is doing the evaluation in the back end every time to satisfy for any soft constraints which might depend just on a case-by-case basis. So to wrap up then, pulling all of those learnings together, what did we see? Well, we looked at two scenarios, two scenarios which I as an engineer see most in my day-to-day, which is where we're maintaining a prompt, we're migrating to a new model which has some different behaviors and we're b-- and building a new use case from scratch. We saw that general hygiene principles, following those can immediately uplift the performance against, um, a set of evals, and that we need those evals to be able to rigorously see any impacts of changing our prompt on the output. Then we saw this process of targeting failure modes one by one, adding structure, avoiding long ban lists, et cetera, were all things that helped push our model to the correct behavior. And then finally, with our new agentic bot that we were building, we saw the impact of splitting into three separate prompt systems. So rather than using one prompt to address everything, we're actually isolating different tasks where it's easy and repeatable to separate out the steps that it needs to take every time. Thank you so much for attending this afternoon. I hope you have a fantastic rest of your day.
- SPSpeaker
[applause] [upbeat music]
Episode duration: 33:48
Install uListen for AI-powered chat & search across the full episode — Get Full Transcript
Transcript of episode G2B0YWuJUgI
