Caching, harnesses, and advisors: Building on Claude at GitHub scale
[upbeat music]
- Brad Abrams
Please welcome to the stage the Chief Product Officer of GitHub, Mario Rodriguez. [audience applauding]
- Mario Rodriguez
Hello, hello everyone. It's great to be back at Code with Claude. I love developer conferences. They're a little bit messy, they're vibrant, they're still dev first, not agent first, and this is where we come to have fun. We're here to have fun and to learn together. Now, you're here because you want to learn more about the platform, and I'm here to share some of the top things we do to run Copilot and all of our inference on top of this platform.

I want to get started with the why. At GitHub, we're very mission-driven. Our vision is to empower developers with the best tools out there to advance human progress. But when I talk to customers, they tell me about the outcomes they want to achieve. They want to keep their people in flow. They want to keep developers in flow. They want their teams to gain velocity, to achieve more with the people they have, and they want to do this at scale. And to do it at scale, you have to have efficiency, and you also have to have trust. I start with this because almost every single product decision I make is grounded on these pillars. Even when we talk about integrating with the Claude platform, I want to make sure I keep developers in flow, that teams gain that velocity, and that companies can do it with intelligence and trust.

Now, as I was thinking hard about what to say here, I asked myself: if I were sitting out there, what would I want to understand? How do you operate at this scale? We do billions and billions — probably, over a big enough time span, trillions — of messages against the platform. So I wanted to give you the best practices, or maybe not even best practices, because things change so often, almost on a weekly basis, but the key learnings: the things underneath almost every decision we make to integrate with the platform and achieve the scale and efficiency that is necessary.

I'm going to divide it into three things. Number one is prompt caching. [chuckles] Without it, we're not dead, but oh my God — the amount of money we would spend compared to what we do is incredible. Just one percent of efficiency here means a lot to us. It's kind of like high-frequency trading: one percent of efficiency means millions overall. So I want to spend a little more time on how we're doing that. The second thing is that we're collaborating with Anthropic on capabilities to make sure our customers are getting the right inference, the right amount of intelligence, at the right time. One of those is this advisor model, and we have two ways of looking at it: as an advisor and as a critic. Brad and I will share a little more about that too. And the third one is: every time a new model drops — and if you're integrating with the platform, you know this happens constantly, and you get a call at five AM on a Saturday saying, "Hey, we're launching on Tuesday" — how do we go through that the right way? How do we decide what the default model should be to keep developers in flow? And how do we then make decisions to route to the right intelligence at the right time?
So I'll start with this, and this is a dashboard. It's not our dashboard — if you're building on the platform, I believe Anthropic released this last week; we got a bit of a preview of it. It lets you understand how you're doing: your cache hit ratios, how many messages you're sending against the Messages API. And I think this is great. It's the first step, because without data it's really, really hard to make decisions. Really, really hard. So if you haven't checked it out, please do.

Now, the one I want to spend a little more time on is ours, and this is just one view — not the entirety of all the dashboards we have — but it's one where we look at deltas between models. In this case, look mainly at the left-hand side, that's Opus 4.6; on the right is 4.7; and the next column is the delta between them. When a new model drops — and sometimes we get it in EAP — we run a set of benchmarks against it. Think Terminal-Bench 2, or one of that suite; we have our own as well. Then we try to decide how the model is performing, we ship it, and we pay attention to that data again. And after a period of probably thirty days — sometimes even sooner — we're done with all of the optimizations for that model.

You can see here the median cache tokens and, at the end, the cache rate. For us to operate the service at scale, we need to run above ninety-four percent — usually ninety-four, ninety-five, ninety-six. If we operate at seventy percent, that usually means we have a bug, believe it or not. We're doing something wrong, and we need to change the approach: how we're calling the model, how we're assembling the prompt, how we're doing the end-to-end, to keep that developer in flow. So we pay a lot of attention to that. The same thing if I want to change the default model from 4.6 to 4.7: I need to understand the cache rate difference and what it means for us. Again, we make billions and billions of calls, so just one percent there — in this case it's one point three percent — means a lot to us. You also have to remember that, on the input side, a cache read is only ten percent of the cost — a ten X difference. So if you're constantly invalidating that cache, that's not a good thing, because you're going to be paying ten X more.

Then we go across the board. In this case we did a baseline against 4.6, and also against Haiku. Now, imagine it stays like this for a second, and there's a bunch of red, as you can see on this screen. Then we have a decision to make: how to get all of those reds into green. And it's not luck — it's a lot of hard work, and I want to make sure you understand that. Maybe going from fifty to seventy percent is just the low-hanging fruit, but from seventy to eighty, from eighty to ninety, and from ninety up, there's a lot of hard work and engineering that goes into this.
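To put rough numbers on why those percentage points matter, here is a back-of-the-envelope cost model in Python. The prices, the monthly volume, and the ten-percent cache-read discount are illustrative assumptions based on the figures mentioned in the talk, not GitHub's or Anthropic's actual numbers.

```python
# Rough cost model for why cache hit rate dominates input spend.
# Prices and volume below are illustrative placeholders; cache reads are
# assumed to cost ~10% of the base input price, as described in the talk.

BASE_INPUT_PRICE = 5.00                      # $ per million input tokens (example)
CACHE_READ_PRICE = BASE_INPUT_PRICE * 0.10   # cache hits are ~10x cheaper to read

def effective_input_cost(hit_rate: float, input_mtok: float) -> float:
    """Blended input cost in dollars for `input_mtok` million input tokens."""
    cached = input_mtok * hit_rate
    uncached = input_mtok * (1.0 - hit_rate)
    return cached * CACHE_READ_PRICE + uncached * BASE_INPUT_PRICE

MONTHLY_INPUT_MTOK = 2_000_000               # hypothetical: 2 trillion input tokens/month

for hit_rate in (0.70, 0.94, 0.953):
    cost = effective_input_cost(hit_rate, MONTHLY_INPUT_MTOK)
    print(f"cache hit rate {hit_rate:.1%}: ${cost:,.0f} per month")

# At this made-up volume, running at 70% instead of the mid-nineties more than
# doubles the input bill, and even the 1.3-point delta between 94.0% and 95.3%
# is worth roughly $1.4M a year -- which is why the delta column in the
# dashboard gets so much attention before a model becomes the default.
```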
There are three things we look at, and these are lessons learned — we have made these mistakes before, and that's why I wanted to share them with you.

Number one: we put no dynamic content in the prefix. You need to keep that prefix as static as possible. As an example, at one point we had UUIDs in the actual system prompt, and the cache was getting constantly reset; that was invalidating the entire cache. So remember the hierarchy: there's the system prompt, then the tools, then the conversation, and then the last message you're sending. You want to keep that system prompt as stable as humanly possible — no dynamic content in it.

Then come the tools. We made a lot of mistakes with tools at times. If you're loading tools dynamically and you change that tools prefix, then all of a sudden the entire conversation behind it gets invalidated again. So you have to do the work, and you need a lot of regression tests, because you're going to be experimenting a lot with skills and tools in your end-to-end, and you have to make sure you're not breaking your tools.

The third thing is cache affinity. And let me tell you, that's really hard when you're running a multi-model harness. Take Copilot, for example: the customer can be calling, say, Opus, then calling a GPT model, then going to an OSS model, and then coming back to Opus again. Through all of that, I have to make sure the next Opus call still has the right cache affinity with the last one. So we do a lot of work in our multi-model harness to make sure that is guaranteed.

Next slide. Here's an example of something we ran, and also a myth to debunk. One of the things I hear from customers when they ask me about integrating with the platform is, "Hey, does long context mean it's more expensive?" And for us the answer is no. This is a quick test we ran. With a smaller context window, you can see that our average compaction went up three times compared to a larger context window. This was something we simulated: we kept the same model with the same context window; we just ended up doing more compaction and filling it at a faster rate. The key thing here, if you remember the pricing math, is that output tokens cost about five X the input tokens — for Opus it would be, say, five dollars versus twenty-five dollars. And whenever you do compaction, you get around four thousand tokens of output, because you have to summarize the conversation, et cetera. So when that happens you usually end up paying more: you do a lot more compaction, your output tokens go through the roof, and your cache also gets significantly invalidated because of it. So a longer context window does not mean you're spending more. What you have to understand is how compaction is being done and, depending on the scenario, manage it for the user appropriately.
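Putting those prefix lessons into code: here is a minimal sketch of a stable, cacheable prefix — static system prompt, stable tool list, conversation last — using cache breakpoints with the Anthropic Python SDK. The prompt text, the tool, and the model id are illustrative placeholders, not Copilot's actual harness.

```python
# A minimal sketch of keeping the cached prefix stable with the Anthropic
# Python SDK. System text, tool definitions, and the model id are illustrative.
import anthropic

client = anthropic.Anthropic()

# 1. Static system prompt: no UUIDs, timestamps, or per-request IDs here,
#    or every call invalidates the cached prefix.
SYSTEM = [
    {
        "type": "text",
        "text": "You are a coding assistant. Follow the repository's conventions.",
        "cache_control": {"type": "ephemeral"},  # cache breakpoint
    }
]

# 2. Stable tool list: loading tools dynamically (or reordering them) changes
#    the prefix and invalidates the cached conversation behind it.
TOOLS = [
    {
        "name": "read_file",
        "description": "Read a file from the workspace.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
        "cache_control": {"type": "ephemeral"},  # breakpoint after the last tool
    }
]

def ask(conversation: list[dict]):
    # 3. Only the tail of `conversation` should change between turns;
    #    the static prefix above stays byte-for-byte identical.
    return client.messages.create(
        model="claude-opus-4-1",   # illustrative model id
        max_tokens=1024,
        system=SYSTEM,
        tools=TOOLS,
        messages=conversation,
    )
```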
All right. So at a high level: instrument your cache hit rate. Have a dashboard that actually shows it to you, and invest the time to do it. Anthropic is already shipping one for you, so at the very least use that. But also invest in deltas, invest before the model launch, and after the model launch as well. No other efficiency matters until you've got that one right. You want to be driving it from the fifties to the seventies to the nineties, and it will take a lot of hard work, a lot of engineering time. And you want to measure per surface. We have VS Code, which is the dashboard we showed you there, but we also have the Copilot CLI, our coding agent in the cloud, IntelliJ, and mobile too. That's a lot of surfaces you'll have to understand — either share the harness across them or tune each individually. You want a very good sense of: we shipped something, did we regress or not? We shipped something, did we improve or not?

Now, the other thing I was telling you about: once you've got prompt caching done, what about making sure the right intelligence hits the user at the right time? For that, Anthropic and we have been partnering on this advisor model, and what I want to do is bring Brad Abrams from the Anthropic team to talk more about it. Go ahead, Brad.
- Brad Abrams
[upbeat music] Thank you, Mario. Thank you for being here. It's so great to have a partner like GitHub Copilot. The Copilot team gives such great feedback — we give them very little time to evaluate models before we launch, and they give great, insightful feedback. And it's true of our API features as well. In fact, if you're using the Claude platform, a lot of what you're seeing is thanks to the things Mario and his team are doing on Copilot. And it's one of those pieces of feedback that I want to talk about today. One of the pieces of feedback we got from the Copilot team is that they really wanted Opus-level intelligence at Haiku-level prices. That sounds like a good deal, right? Opus intelligence. I don't get to write the deals, but what I do get to do is build fun API features.

So I want to talk about the advisor strategy. The insight behind it comes from software development teams. We all know that if you take a junior engineer and give them a mentor who's a senior engineer, that junior engineer gets a lot better, because the senior engineer looks over their shoulder, reviews code, and goes through design docs with them. The senior engineer can make the junior engineer a lot more productive without taking up too much of the senior engineer's time. It turns out the same thing is true for models. You can take a junior model, like Haiku — which we're listing here as the executor — give it access to Opus, and use those Opus tokens very conservatively. So in this beautiful diagram that Claude created for me, what you see is that the executor, Haiku, is able to identify every shape as it comes in, no problem, except for one little weird shape. That weird shape is beyond what Haiku can do, so Haiku has a tool: it calls the advisor, it calls Opus. Opus, because it's a bigger model, does more reasoning, and it knows the answer. And what we see in our evals is that we get close to Opus-level intelligence at much lower prices, because we're being very conservative about the tokens we actually send to the advisor, and because of the price difference between the two, it works out really well.

What I'm excited about is an integration we've worked on with the Copilot team. So let's switch over to the demo machine. What you're seeing is GitHub Copilot: on the left-hand side is just GitHub Copilot with Haiku, and on the right-hand side is GitHub Copilot with Haiku plus the new advisor tool hooked up. I'm going to hit Enter here to give this time to run; there's a little brain-teaser-y problem it has to solve. Given the exact same problem, we see Haiku has taken off, whereas the right-hand side is consulting an advisor, so it's going a little more slowly. I'm keeping my fingers crossed that Opus will come back. On the left-hand side, Haiku is just spinning away, trying a bunch of things, just like a junior engineer who's very eager, trying lots of things, but hasn't found it yet. On the right side, the Opus advisor just returned, and because it's Opus, it actually knows this bit of data, so it's able to bring that back into context.
And now we're done with Opus, and everything's still in Haiku, but because it got that little bit of a hint, the right-hand side has finished. We see the right-hand side finish; the left is still trying to figure out which way is up. So just that little bit of a hint from Opus — very small cost, very small latency hit — makes Haiku so much more powerful. This is an experiment we're doing in the GitHub Copilot CLI that we'll release soon. Looking forward to you playing with it. Thank you, and welcome back, Mario.
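As a rough sketch of the pattern Brad describes — a cheap executor model with a tool it can call sparingly to ask a more capable model for help — here is one way it could be wired with the Anthropic Python SDK. The tool name `consult_advisor`, the prompts, and the model ids are illustrative assumptions, not the actual Copilot CLI integration.

```python
# Advisor pattern sketch: a small executor (Haiku) gets a tool it can call
# sparingly to ask a larger advisor model (Opus) for help when it is stuck.
import anthropic

client = anthropic.Anthropic()

ADVISOR_TOOL = {
    "name": "consult_advisor",
    "description": (
        "Ask a more capable model for help. Use only when you are stuck or "
        "unsure; calls are expensive, so be conservative."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"question": {"type": "string"}},
        "required": ["question"],
    },
}

def consult_advisor(question: str) -> str:
    """Forward the executor's question to the larger advisor model."""
    reply = client.messages.create(
        model="claude-opus-4-1",          # advisor: slower, smarter, pricier
        max_tokens=1024,
        messages=[{"role": "user", "content": question}],
    )
    return reply.content[0].text

def run_executor(task: str) -> str:
    messages = [{"role": "user", "content": task}]
    while True:
        response = client.messages.create(
            model="claude-haiku-4-5",     # executor: fast and cheap
            max_tokens=2048,
            tools=[ADVISOR_TOOL],
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            # Executor finished without (further) help from the advisor.
            return "".join(b.text for b in response.content if b.type == "text")
        # Answer each advisor call and hand the hint back to the executor.
        results = []
        for block in response.content:
            if block.type == "tool_use" and block.name == "consult_advisor":
                hint = consult_advisor(block.input["question"])
                results.append(
                    {"type": "tool_result", "tool_use_id": block.id, "content": hint}
                )
        messages.append({"role": "user", "content": results})
```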
- Mario Rodriguez
[clapping] Thank you. Thank you. I love some Opus intelligence at Haiku prices — that's kind of what we've got to do. Okay. So what we showed you is an advisor model. There's another algorithm you can employ as you tinker with getting the right call to the right intelligence. We're GitHub, we're a little bit weird about some things, so we call it Rubber Duck. What Rubber Duck is really about — and here's a demo I'm going to show you in a second — is inserting a critique at the right moments in time. So I have a feature; that feature had an issue, and here I'm implementing it. What I want to highlight, and it's going to come up really soon, is that the model is now asking for a critique, and you can see how it asks for a critique from 4.6 and 4.5 Opus. It receives that critique, and then it's able to change the plan prior to implementation and continue with the implementation the right way. And then we get that feature and we deploy it. That was very quick, but what I really wanted to highlight is that it's a different model, acting less as an advisor and more as a critic: "Hey, here's what I think you should do."

We insert this in three core places. One is after drafting a plan. A lot of our users do a plan first and then go into execution, so we critique the plan after drafting it. Another is after a complex implementation — think of it as: I just finished all of this, now go and critique it. It's kind of a pre-code review, to a degree, and we do it there because it ends up saving tokens compared to waiting all the way for an official code review. And the third is after writing tests, but before running them. In places where your CI suite with tests takes a significant amount of time, you can see how that gets you there faster and keeps the developer in flow. Those are the three places we're doing it right now. From my end, what I've really seen work well is the plan phase — a lot of these systems are getting really, really good at planning, and if you catch it there, you get the most gains downstream. Rubber Duck is already available as an experiment: if you download the Copilot CLI and enable experiments, you'll see it there, and you can invoke it whenever you want. You could just say, "Hey, create a plan for this and consult with Rubber Duck," and you get a quick critique across model families.
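As a minimal sketch of that plan-then-critique flow — draft a plan with the working model, have a different model critique it, and revise before implementation — here is one way it could look with the Anthropic Python SDK. Copilot consults multiple model families; this sketch uses a single critic for brevity, and all prompts and model ids are illustrative, not Copilot's harness code.

```python
# "Rubber Duck" sketch: a second model critiques the draft plan before
# implementation starts. Prompts and model ids are illustrative.
import anthropic

client = anthropic.Anthropic()

def draft_plan(task: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",        # the model doing the work
        max_tokens=2048,
        messages=[{"role": "user",
                   "content": f"Draft an implementation plan for: {task}"}],
    )
    return response.content[0].text

def critique_plan(task: str, plan: str) -> str:
    # The critic reviews an artifact rather than answering a question,
    # which is what distinguishes it from the advisor pattern above.
    response = client.messages.create(
        model="claude-opus-4-1",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"Task: {task}\n\nProposed plan:\n{plan}\n\n"
                "Critique this plan: list risks, missing steps, and anything "
                "that should change before implementation begins."
            ),
        }],
    )
    return response.content[0].text

def plan_with_rubber_duck(task: str) -> str:
    plan = draft_plan(task)
    critique = critique_plan(task, plan)
    # Feed the critique back so the working model revises before coding.
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": (f"Revise this plan given the critique.\n\n"
                        f"Plan:\n{plan}\n\nCritique:\n{critique}"),
        }],
    )
    return response.content[0].text
```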
Now, let's keep moving. The third thing we wanted to talk about is: how do we take new models at scale? What's the process? At the very beginning it was messy for us, but over the last two and a half years we've done a lot of work to be very methodical about how we take a new model. Anthropic has a model for us to try; we onboard it into what we call CAPI, our Copilot API, and then we have an endpoint. From that endpoint, three things need to happen. Number one, as you all know, you have to work on the harness and on your prompt. For us, that's a multi-model harness — not just one model family — so we have to go in and make sure we update the system prompts, and we partner very, very closely with Anthropic on that. We then have to make sure the tool interfaces are right and optimized. We have to tweak the agent loop a little — we don't do that as much anymore — and then a lot more work goes into context management, into compaction, into making sure we're hitting the right cache hit rates overall.

Then we do two things: we run offline benchmarks, and we do internal dogfooding — think of that as online benchmarks. A lot of Microsoft developers use it, a lot of GitHub developers use it, and we have a set of other users we partner very closely with to give us feedback. So we have the offline benchmarks and the internal dogfooding. Then we share findings with Anthropic, and this is a real presentation — we used to write a document; now we write a document and expand it in detail — where we say, "This is everything we're seeing." Then we go through a loop with them on the changes we need, either at the API level or even in some of the models and the checkpoints we end up getting. And from there we do another loop, all the way to the release of the model.

Two things are very important here. Yes, we do offline benchmarks, and offline gives you an indication, but it's not always reality. You learn a lot more from your online evals and the online experiments after launch. Offline just sets a baseline, and if that baseline is consistent, you roughly know what to expect, which is a pretty good indication — but the details don't get worked out offline. It usually takes us days, sometimes even weeks, to tune everything about a model through those online experiments. We do a lot of A/B testing, we're very methodical about it, and we do weekly reporting that goes to Anthropic and across all of our teams, to make sure we're tuning the right model at the right time.

When it comes to optimizing the harness, there are four things for us: building the prompt and context, calling the model, executing the tool, and appending the results and looping back. The places where I spend the most time with the team are usually the tool execution and the prompt-and-context building. Tools are very important to us. The more tools you have, the more confusion there is and the more you have to tune — hundreds and hundreds of tools is not a good thing. You want to tune by surface, for the exact scenarios, with the right tools in that package. So we spend a lot of time there optimizing the model. And then there's checking permissions and running: in the execution of a tool, you have to do the right things to pass the right context in and then read the right results back out. So whenever we want to introduce a new tool, for example, we spend a lot of time optimizing the harness both at the model level and at the tool execution level.
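For reference, here is that four-step loop as a schematic: build the prompt and context, call the model, check permissions and execute the tool, append the results, and loop back. The loop shape is the same as in the advisor sketch earlier; the focus here is the per-surface tool set and the permission gate. Tool names, the permission rule, and the model id are illustrative assumptions, not Copilot's real registry.

```python
# Schematic of the harness loop: (1) build prompt + context, (2) call the
# model, (3) check permissions and execute the tool, (4) append results, loop.
import anthropic

client = anthropic.Anthropic()

# Each surface (CLI, IDE, coding agent) ships a small tuned tool set rather
# than hundreds of tools in one package.
SURFACE_TOOLS = {
    "cli": [{
        "name": "run_command",
        "description": "Run a shell command in the workspace.",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    }],
}

def execute_tool(name: str, tool_input: dict) -> str:
    """Toy executor; a real harness would run the tool and capture its output."""
    return f"(pretend output of {name} with {tool_input})"

def run_harness(task: str, surface: str = "cli") -> str:
    tools = SURFACE_TOOLS[surface]                      # 1. build prompt + context
    allowed = {t["name"] for t in tools}
    messages = [{"role": "user", "content": task}]
    while True:
        response = client.messages.create(              # 2. call the model
            model="claude-sonnet-4-5",                  # illustrative model id
            max_tokens=2048,
            system="You are a coding assistant.",       # static, cacheable prefix
            tools=tools,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            return "".join(b.text for b in response.content if b.type == "text")
        results = []
        for block in response.content:
            if block.type != "tool_use":
                continue
            if block.name in allowed:                   # 3. check permissions, then run
                output = execute_tool(block.name, block.input)
            else:
                output = "Tool call rejected by the harness."
            results.append({"type": "tool_result",
                            "tool_use_id": block.id,
                            "content": output})
        messages.append({"role": "user", "content": results})   # 4. append and loop
```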
Okay, so what does this mean at a high level? Two things. First, you have to have both published benchmarks and internal benchmarks, but more importantly, dogfood with A/B tests and make sure you have online evals set up — and make sure you can trust them with statistical significance as well. Second, measure outcomes, not activity. When I talked about the tools — as an example, acceptance rate on a line of code is okay, but survival rate is an even better metric, because if the line was accepted and then deleted afterward, it did not accomplish the outcome. So even if I have an amazing acceptance rate, if the survival rate ends up being very low, that means we did not do the right work. It's very important, if you're in product — and product really changes in the era of AI — that you're optimizing for the outcomes and not just the individual rows. Like I said, otherwise you end up optimizing for a metric that tanks the measurement of your entire product.

So with that — again, we walked through prompt caching, the advisor strategy, and a little bit about measurement and how we improve the harness. I just want to leave you with a thank you. Really excited to be with you again, and hopefully you learned a couple of things today. [audience applauding] [outro music]