
Caching, harnesses, and advisors: Building on Claude at GitHub scale

GitHub's Copilot team ships Claude to millions of developers across chat, CLI, coding agent, and code review, and has become one of the most demanding users of the Claude Platform. GitHub CPO Mario Rodriguez and Anthropic's Brad Abrams break down how the team pushes quality up and costs down at scale, from caching and evaluation to the new Advisor strategy. Walk away with patterns you can apply to your own Claude-powered product.

Brad Abrams (host) · Mario Rodriguez (guest)
May 5, 2026 · 26m · Watch on YouTube ↗

At a glance

WHAT IT’S REALLY ABOUT

GitHub’s Claude scaling playbook: caching, routing, and evaluation loops

  1. GitHub treats prompt caching as the primary cost-and-latency lever, targeting ~94–96% cache hit rates and considering ~70% a likely bug or prompt/tooling regression.
  2. To sustain high cache hits, Copilot engineering keeps system/tool prefixes highly static, avoids dynamic identifiers like UUIDs in the system prompt, and enforces cache affinity across a multi-model harness.
  3. GitHub and Anthropic developed an “Advisor” pattern where a cheaper model (Haiku) executes most steps and consults a smarter model (Opus) only when needed, achieving near-Opus quality at substantially lower cost.
  4. A complementary “Rubber Duck” critic pattern injects critique at key workflow moments (after planning, after complex implementation, and after writing tests) to prevent costly rework and improve end outcomes; a minimal critic sketch follows this list.
  5. New model adoption is governed by a repeatable pipeline: onboard to Copilot API, tune prompts/tools/context management, run offline benchmarks plus dogfooding/online A/B tests, then iterate with Anthropic using weekly reporting focused on outcome metrics like code survival.
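
The critic pattern in item 4 can be made concrete with a single extra model call at each checkpoint. The sketch below is illustrative only (the model id, stage prompts, and critique_stage helper are assumptions, not GitHub's harness): a second Claude call reviews the plan, the diff, or the tests before the agent commits to more expensive work.

    # Minimal "Rubber Duck" critic checkpoint (illustrative sketch, not GitHub's harness).
    # A second model call reviews the artifact produced at each checkpoint; the agent
    # folds the critique into its next step before committing to more expensive work.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    CRITIC_MODEL = "claude-sonnet-4-5"  # placeholder model id

    STAGE_PROMPTS = {
        "plan": "Review this implementation plan. List risks, missing steps, and simpler alternatives.",
        "implementation": "Review this diff for correctness, edge cases, and unnecessary complexity.",
        "tests": "Review these tests. Point out untested behavior and assertions that cannot fail.",
    }

    def critique_stage(stage: str, artifact: str) -> str:
        """Ask the critic model to poke holes in an intermediate artifact."""
        response = client.messages.create(
            model=CRITIC_MODEL,
            max_tokens=1024,
            system="You are a skeptical senior engineer acting as a rubber-duck critic.",
            messages=[{"role": "user", "content": f"{STAGE_PROMPTS[stage]}\n\n---\n{artifact}"}],
        )
        return response.content[0].text

    # Usage at the three checkpoints named in the episode:
    #   feedback = critique_stage("plan", plan_text)            # after planning
    #   feedback = critique_stage("implementation", diff_text)  # after complex implementation
    #   feedback = critique_stage("tests", test_file_text)      # after writing tests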

IDEAS WORTH REMEMBERING

5 ideas

Treat cache hit rate as a first-order product and infra metric.

GitHub operates Copilot with an expectation of ~94–96% cache hits; falling to ~70% is interpreted as a regression (often a bug) that must be addressed before other optimizations matter.
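
As a rough illustration of treating the hit rate as a dashboard metric, the sketch below derives a per-request rate from the usage fields the Anthropic Messages API already returns (input_tokens, cache_read_input_tokens, cache_creation_input_tokens). The model id and the 0.94 alert threshold are placeholders echoing the numbers quoted in the episode, not a recommended configuration.

    # Illustrative cache-hit-rate instrumentation built on the usage fields returned by
    # the Anthropic Messages API. Model id and the 0.94 threshold are placeholders.
    import anthropic

    client = anthropic.Anthropic()

    STATIC_SYSTEM_PROMPT = "You are a coding assistant for this repository. ..."  # in practice a long, unchanging prefix

    def cache_hit_rate(usage) -> float:
        """Fraction of prompt tokens on this request that were served from cache."""
        read = usage.cache_read_input_tokens or 0
        created = usage.cache_creation_input_tokens or 0
        fresh = usage.input_tokens or 0
        total_prompt = read + created + fresh
        return read / total_prompt if total_prompt else 0.0

    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=512,
        system=[{"type": "text", "text": STATIC_SYSTEM_PROMPT,
                 "cache_control": {"type": "ephemeral"}}],  # mark the static prefix as cacheable
        messages=[{"role": "user", "content": "Refactor this function to remove duplication."}],
    )

    rate = cache_hit_rate(response.usage)
    if rate < 0.94:  # below the target band quoted above: treat as a regression, not noise
        print(f"WARNING: cache hit rate {rate:.0%} -- check prompt/tool prefix stability")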

Keep the system prompt and tool definitions as static as possible.

Dynamic content in the prefix (e.g., UUIDs in the system prompt) or dynamically changing tool blocks can invalidate the prompt cache for the entire conversation, multiplying costs and latency.
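
A minimal sketch of the same discipline, assuming the Anthropic Messages API with prompt caching: the system prompt and tool definitions stay byte-for-byte identical across requests and carry the cache_control marker, while per-request values such as a UUID travel in the user turn. The prompt text, tool definition, and model id here are illustrative, not Copilot's.

    # Sketch of keeping the cached prefix static (assumed structure, not Copilot's prompts).
    # Tools and the system prompt form the cached prefix; anything per-request (UUIDs,
    # timestamps, user names) goes into the user turn so it cannot invalidate the cache.
    import uuid
    import anthropic

    client = anthropic.Anthropic()

    STATIC_SYSTEM = "You are a coding assistant for this repository. Follow the style guide."
    STATIC_TOOLS = [{
        "name": "read_file",
        "description": "Read a file from the workspace.",
        "input_schema": {"type": "object",
                         "properties": {"path": {"type": "string"}},
                         "required": ["path"]},
    }]

    # Anti-pattern from the episode: a UUID baked into the system prompt forces a new
    # cache entry on every request.
    #   system = f"You are a coding assistant. Session: {uuid.uuid4()} ..."

    request_id = str(uuid.uuid4())  # dynamic value stays out of the cached prefix

    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=512,
        tools=STATIC_TOOLS,
        system=[{"type": "text", "text": STATIC_SYSTEM,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user",
                   "content": f"[request {request_id}] Explain what read_file returns."}],
    )
    print(response.usage)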

Cache affinity becomes a core engineering problem in multi-model systems.

If users bounce between model families (Claude, GPT, OSS) and then return, you must preserve the correct cache affinity so subsequent calls benefit from prior cached prefixes rather than paying full input cost again.
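
One hedged way to picture the problem is a small routing layer that remembers, per conversation and per model family, which backend holds the warm prefix. The AffinityRouter below is a hypothetical sketch of that bookkeeping, not Copilot's actual harness.

    # Hypothetical conversation -> model-family cache-affinity bookkeeping (a sketch,
    # not Copilot's routing layer). Goal: when a user switches Claude -> GPT and back,
    # the follow-up Claude call lands on the backend that still holds the warm prefix.
    from dataclasses import dataclass, field

    @dataclass
    class Affinity:
        backend_by_family: dict[str, str] = field(default_factory=dict)

    class AffinityRouter:
        def __init__(self) -> None:
            self._affinity: dict[str, Affinity] = {}  # conversation_id -> Affinity

        def route(self, conversation_id: str, family: str, backends: list[str]) -> str:
            """Prefer the backend that served this conversation + family before."""
            aff = self._affinity.setdefault(conversation_id, Affinity())
            previous = aff.backend_by_family.get(family)
            chosen = previous if previous in backends else backends[0]
            aff.backend_by_family[family] = chosen
            return chosen

    router = AffinityRouter()
    first = router.route("conv-123", "claude", ["claude-a", "claude-b"])  # picks and records a backend
    router.route("conv-123", "gpt", ["gpt-a"])                            # user bounces to another family
    again = router.route("conv-123", "claude", ["claude-a", "claude-b"])  # returns to the warm backend
    assert first == again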

Long context isn’t inherently more expensive—compaction is the hidden cost driver.

GitHub found that aggressive summarization/compaction increases output tokens (often the expensive side) and can also disrupt caching, so larger context windows may reduce total cost by avoiding repeated compactions.
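
A back-of-envelope comparison makes the tradeoff visible. The prices below are placeholders (not Anthropic's actual rates), but the shape of the math holds whenever output tokens cost several times more than cached input reads and a compaction also invalidates the cache.

    # Back-of-envelope cost comparison with placeholder prices (not Anthropic's rates).
    # Output tokens cost several times more than cached input reads, so repeatedly
    # summarizing ("compacting") a long conversation can exceed the cost of carrying
    # the longer, mostly-cached context forward -- and compaction also busts the cache.
    PRICE_PER_MTOK = {              # hypothetical $ per million tokens
        "cache_read_input": 0.30,
        "fresh_input": 3.00,
        "output": 15.00,
    }

    def cost(tokens: int, kind: str) -> float:
        return tokens / 1_000_000 * PRICE_PER_MTOK[kind]

    # Option A: keep a 150k-token history; nearly all of it is a cache hit each turn.
    carry_long_context = cost(145_000, "cache_read_input") + cost(5_000, "fresh_input")

    # Option B: compact to ~20k tokens. The summary is ~20k *output* tokens, and the
    # rewritten prefix is no longer cached, so the next turn re-reads it as fresh input.
    compact_then_continue = cost(20_000, "output") + cost(20_000, "fresh_input") + cost(5_000, "fresh_input")

    print(f"carry long context:    ${carry_long_context:.3f} per turn")   # ~$0.059
    print(f"compact then continue: ${compact_then_continue:.3f} once")    # ~$0.375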

Use an Advisor pattern to buy “smartness” only when needed.

Haiku handles routine execution and selectively calls Opus for hard subproblems; the demo showed Opus providing a small hint that let Haiku finish quickly, improving quality with minimal added tokens and modest latency.
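
A minimal version of the pattern can be expressed as a tool loop: the cheap model is given an ask_advisor tool, and whenever it calls that tool the harness forwards the question to the stronger model and returns the hint as a tool result. The sketch below assumes the Anthropic Messages API; the model ids, tool name, and turn limit are illustrative, not the production Advisor implementation.

    # Minimal Advisor loop (illustrative sketch, not the production harness). The cheap
    # executor model gets an ask_advisor tool; when it calls the tool, the harness
    # forwards the question to the stronger model and feeds the hint back as a tool result.
    import anthropic

    client = anthropic.Anthropic()
    EXECUTOR = "claude-haiku-4-5"   # placeholder id for the cheap, fast model
    ADVISOR = "claude-opus-4-5"     # placeholder id for the smart, expensive model

    ADVISOR_TOOL = {
        "name": "ask_advisor",
        "description": "Ask a stronger model for a hint when stuck on a hard subproblem.",
        "input_schema": {"type": "object",
                         "properties": {"question": {"type": "string"}},
                         "required": ["question"]},
    }

    def ask_advisor(question: str) -> str:
        reply = client.messages.create(
            model=ADVISOR, max_tokens=512,
            messages=[{"role": "user", "content": f"Give a concise hint: {question}"}],
        )
        return reply.content[0].text

    def run(task: str, max_turns: int = 10) -> str:
        messages = [{"role": "user", "content": task}]
        for _ in range(max_turns):
            resp = client.messages.create(model=EXECUTOR, max_tokens=1024,
                                          tools=[ADVISOR_TOOL], messages=messages)
            if resp.stop_reason != "tool_use":
                return "".join(b.text for b in resp.content if b.type == "text")
            messages.append({"role": "assistant", "content": resp.content})
            for block in resp.content:
                if block.type == "tool_use" and block.name == "ask_advisor":
                    hint = ask_advisor(block.input["question"])  # the expensive call, only when needed
                    messages.append({"role": "user", "content": [{
                        "type": "tool_result", "tool_use_id": block.id, "content": hint}]})
        return "stopped: turn limit reached"

Because the stronger model only sees the specific question the executor asks, the added tokens stay small, which is how the pattern keeps near-Opus quality at close to Haiku cost.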

WORDS WORTH SAVING

5 quotes

Number one is prompt caching. Without that, you know, we're not dead, but oh my God. Like, the amount of money that we would spend compared to what we do is incredible, right? So just one percent efficiency on this means a lot to us.

Mario Rodriguez

For us to operate the service at scale, we need to run above ninety-four usually, ninety-four, ninety-five, ninety-six percent. If we operate at seventy percent, that means usually that we have a bug, believe it or not.

Mario Rodriguez

We put no dynamic content in the prefix, right? Like you need to keep that prefix as static as possible. And as an example, at one moment, we had UUIDs in the actual system prompt, and then we're getting constantly reset, and then that was invalidating the entirety of the cache rate.

Mario Rodriguez

One of the pieces of feedback we got from the Copilot team is they really wanted Opus-level intelligence, but it's at Haiku-level prices.

Brad Abrams

So longer context windows does not mean you're spending more. In fact, what you have to understand is how compaction is being done, and depending on the scenario, you wanna manage that for the user appropriately.

Mario Rodriguez

Prompt caching instrumentation and dashboards · Cache hit-rate targets and cost sensitivity at scale · Avoiding cache invalidation (static prefixes, tool stability) · Multi-model harness and cache affinity challenges · Long context vs compaction cost tradeoffs · Advisor and critic (Rubber Duck) patterns for intelligence routing · Model launch pipeline: benchmarks, dogfooding, online A/B tests · Outcome metrics: acceptance vs survival rate

High-quality AI-generated summary created from a speaker-labeled transcript.
