Caching, harnesses, and advisors: Building on Claude at GitHub scale
CHAPTERS
- 0:58 – 3:08
GitHub’s AI product pillars: flow, velocity, efficiency, trust
Mario frames GitHub Copilot’s approach around outcomes customers care about: keeping developers in flow, increasing team velocity, and scaling with both efficiency and trust. These pillars guide product and platform decisions, especially when operating at massive inference volume.
- •Customer outcomes: developer flow and team velocity
- •Scaling requires both efficiency and trust
- •Product decisions are anchored to these pillars
- •Context: GitHub runs Copilot inference at extremely high volume
- 3:08 – 4:40
Three scale lessons from running Copilot: caching, routing intelligence, and model rollouts
Mario outlines three core practices that underpin Copilot’s ability to run efficiently at GitHub scale. He previews prompt caching, an “advisor/critic” strategy for using the right model at the right time, and a disciplined process for adopting new models quickly and safely.
- •Prompt caching as a primary cost lever
- •Right-intelligence-at-right-time via advisor/critic patterns
- •Operational process for frequent new model launches
- •Why small percentage improvements matter at billions of calls
- 4:40 – 6:10
Instrumenting cache performance: platform dashboards and internal model deltas
The talk moves from principles to measurement, emphasizing that you can’t optimize without data. Mario highlights both Anthropic’s cache dashboard and GitHub’s internal dashboards that compare performance deltas between model versions before and after shipping.
- •Use dashboards to track cache hit ratios and message volume
- •Run benchmarks on new models (EAP and post-launch)
- •Compare model versions via delta views (e.g., Opus 4.6 vs 4.7)
- •Monitor for regressions during the first weeks of a rollout
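The delta view described above can be sketched as a small comparison between two benchmark runs. This is a minimal illustration, not GitHub's dashboard logic: the function name, the metric names, the tolerance, and the scores are all hypothetical.

```python
def benchmark_deltas(baseline, candidate, tolerance=0.01):
    """Return per-metric deltas (candidate - baseline) and the metrics that regressed."""
    deltas = {}
    regressions = []
    # Only compare metrics that both runs report.
    for name in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[name] - baseline[name]
        deltas[name] = delta
        if delta < -tolerance:
            regressions.append(name)
    return deltas, regressions


# Hypothetical scores for an older and a newer model version.
baseline = {"task_success": 0.50, "cache_hit_rate": 0.95}
candidate = {"task_success": 0.75, "cache_hit_rate": 0.70}

deltas, regressions = benchmark_deltas(baseline, candidate)
print(deltas)
print("regressed:", regressions)
```

Running the same comparison before and after launch gives a simple way to spot a metric that improved offline but regressed in production.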
- 6:10 – 8:12
Operating targets and cost math: why 94–96% cache hit rate is table stakes
Mario explains the cache-hit range GitHub requires in production and what a drop in cache rate signals. He ties cache invalidation to input token costs, noting that uncached input tokens cost roughly 10x more than cached ones, which makes even a 1% change in cache hit rate financially significant.
- •GitHub aims to keep cache hit rate in the ~94–96% range or higher
- •A cache hit rate around 70% often indicates a bug or a prompt-assembly issue
- •Cache invalidation can drastically increase input costs
- •A 1% change matters materially at GitHub request volumes
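The cost argument can be made concrete with a little arithmetic. The sketch below assumes cached input tokens cost one tenth of uncached ones (the ~10x figure from the talk) and ignores any surcharge for writing to the cache; `relative_input_cost` is a hypothetical helper, not part of any API.

```python
def relative_input_cost(hit_rate, cached_discount=0.1):
    """Input cost per token relative to a fully uncached request (1.0).

    Assumes cached tokens cost `cached_discount` times the uncached price
    and ignores cache-write surcharges.
    """
    return hit_rate * cached_discount + (1.0 - hit_rate)


for rate in (0.96, 0.95, 0.94, 0.70):
    cost = relative_input_cost(rate)
    print(f"hit rate {rate:.0%}: relative input cost {cost:.3f}")
```

Under these assumptions, slipping from 95% to 94% raises input cost by about 6%, and falling to 70% raises it by roughly 2.5x, which is why a single percentage point is worth tracking at high request volume.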
- 8:12 – 9:43
Hard-won caching lessons: static prefixes, stable tools, and cache affinity in multi-model harnesses
This chapter distills concrete engineering practices to keep caches hot. Mario details mistakes GitHub made (like UUIDs in system prompts), the importance of keeping tool definitions stable, and the complexity of preserving cache affinity when routing across multiple model families.
- •Avoid dynamic content in system/prefix (e.g., UUIDs break caching)
- •Tool blocks changing dynamically can invalidate whole conversations
- •Use regression tests while iterating on tools/skills
- •Maintain cache affinity even with multi-model routing/harnesses
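One way to catch the prefix mistakes listed above is a regression test that fingerprints the cacheable prefix. The sketch below is an assumption about how such a check might look, not GitHub's implementation: `prefix_fingerprint`, the tool definitions, and the prompts are hypothetical, and it assumes the harness serializes tools in the same sorted order it hashes.

```python
import hashlib
import json
import uuid


def prefix_fingerprint(system_prompt, tools):
    """Hash the cacheable prefix: system prompt plus tools in a fixed order."""
    ordered_tools = sorted(tools, key=lambda tool: tool["name"])
    payload = json.dumps({"system": system_prompt, "tools": ordered_tools}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


TOOLS = [
    {"name": "read_file", "description": "Read a file from the workspace."},
    {"name": "run_tests", "description": "Run the project's test suite."},
]

# Static prompt: the fingerprint is identical even if tools arrive in another order.
stable_a = prefix_fingerprint("You are a coding assistant.", TOOLS)
stable_b = prefix_fingerprint("You are a coding assistant.", list(reversed(TOOLS)))

# Anti-pattern: a per-request identifier in the system prompt changes the prefix every time.
dynamic_a = prefix_fingerprint(f"You are a coding assistant. Session {uuid.uuid4()}", TOOLS)
dynamic_b = prefix_fingerprint(f"You are a coding assistant. Session {uuid.uuid4()}", TOOLS)

print("stable prefix reusable:", stable_a == stable_b)
print("dynamic prefix reusable:", dynamic_a == dynamic_b)
```

A test that asserts the fingerprint stays constant across two consecutive requests would fail as soon as someone adds dynamic content to the prefix or reorders the tool block.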
- 9:43 – 11:14
Debunking long-context cost fears: compaction can be the real cost driver
Mario challenges the assumption that longer context windows necessarily cost more. Using a simulated test, he shows that aggressive compaction and summarization can increase output tokens and invalidate the cache, which can make a system more expensive than simply keeping a longer context.
- •Longer context doesn’t inherently mean higher cost
- •Compaction/summarization can spike output tokens
- •More compaction can reduce cache effectiveness
- •Tune context management based on scenario, not blanket rules
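A toy cost model shows how compaction can end up costing more than a long cached context. This is not the simulation from the talk: the prices are arbitrary relative units (cached read 1, uncached input 10, output 50 per token), the conversation shape is invented, and `conversation_cost` is a hypothetical helper.

```python
def conversation_cost(turns, tokens_per_turn, compact_every=None, summary_tokens=0,
                      cached_price=1, input_price=10, output_price=50):
    """Total cost in arbitrary units for a conversation whose context grows each turn."""
    cached_context = 0
    total = 0
    for turn in range(1, turns + 1):
        # Earlier context is read from cache; only the new tokens are processed uncached.
        total += cached_context * cached_price + tokens_per_turn * input_price
        cached_context += tokens_per_turn
        if compact_every and turn % compact_every == 0:
            # The summarizer reads the history from cache and writes the summary as output.
            total += cached_context * cached_price + summary_tokens * output_price
            # The summary replaces the history, so the new prefix is processed uncached once.
            total += summary_tokens * input_price
            cached_context = summary_tokens
    return total


long_context = conversation_cost(turns=4, tokens_per_turn=1000)
compacted = conversation_cost(turns=4, tokens_per_turn=1000,
                              compact_every=2, summary_tokens=500)
print("long context:", long_context)
print("compacted:   ", compacted)
```

With these particular numbers the compacted run is more than twice as expensive, because each compaction pays for summary output and then reprocesses the summary uncached. Other parameters, such as much longer conversations, can reverse the comparison, which is why the setting needs tuning per scenario.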
- 11:14 – 13:01
Caching playbook recap: invest in instrumentation, optimize per surface, and fix caching first
Mario summarizes the caching strategy as the foundational optimization: measure it, improve it methodically, and segment by product surface. He notes that different Copilot clients (VS Code, CLI, cloud agent, IDEs, mobile) may require distinct tuning or shared infrastructure improvements.
- •Instrument cache hit rate and track pre/post model launch
- •Prioritize improving caching before other efficiency work
- •Measure and tune per surface (VS Code, CLI, agent, IDEs, mobile)
- •Use continuous regression detection: ship → measure → iterate
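Per-surface measurement can be sketched as an aggregation over request logs. The token field names below mirror the usage fields returned by Anthropic's Messages API; the record shape, the `surface` key, the 94% target default, and the sample numbers are assumptions made for illustration.

```python
from collections import defaultdict


def cache_hit_rate_by_surface(records):
    """Return the cached share of input tokens for each product surface."""
    totals = defaultdict(lambda: {"cached": 0, "all": 0})
    for record in records:
        cached = record["cache_read_input_tokens"]
        uncached = record["input_tokens"] + record["cache_creation_input_tokens"]
        totals[record["surface"]]["cached"] += cached
        totals[record["surface"]]["all"] += cached + uncached
    return {surface: t["cached"] / t["all"] for surface, t in totals.items() if t["all"]}


def surfaces_below_target(rates, target=0.94):
    """List surfaces whose hit rate falls below the operating target."""
    return sorted(surface for surface, rate in rates.items() if rate < target)


# Hypothetical aggregated usage for two surfaces.
records = [
    {"surface": "vscode", "input_tokens": 200,
     "cache_creation_input_tokens": 300, "cache_read_input_tokens": 9500},
    {"surface": "cli", "input_tokens": 1000,
     "cache_creation_input_tokens": 2000, "cache_read_input_tokens": 7000},
]

rates = cache_hit_rate_by_surface(records)
print(rates)
print("needs attention:", surfaces_below_target(rates))
```

Segmenting this way keeps one poorly cached surface from hiding behind a healthy global average.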
- 13:01 – 15:32
Advisor strategy: pairing a cheaper “executor” model with an Opus-level mentor
Brad explains an architecture inspired by senior/junior engineering mentorship. A smaller model (executor) handles most tasks and selectively consults a stronger model (advisor) only when needed, aiming to achieve near-Opus quality with far fewer expensive tokens.
- •Executor/advisor pattern mirrors junior/senior mentorship
- •Use expensive reasoning sparingly for difficult cases
- •Goal: near-Opus intelligence at much lower cost
- •Architecture: executor calls advisor as a tool when it detects limits
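The executor/advisor loop can be sketched with plain callables standing in for the two models. Everything here is hypothetical: the reply format, the consult budget, and the stub models are assumptions for illustration, not the Copilot implementation or any real model API.

```python
def run_with_advisor(task, executor, advisor, max_consults=1):
    """Let the executor work, consulting the advisor only when it asks for help."""
    hints = []
    while True:
        can_ask = len(hints) < max_consults
        reply = executor(task, hints, can_ask)
        if "answer" in reply:
            return reply["answer"], len(hints)
        if not can_ask:
            raise RuntimeError("executor asked for advice after the budget was spent")
        # The advisor returns a short targeted hint, not a full solution.
        hints.append(advisor(task, reply["question"]))


def toy_executor(task, hints, can_ask):
    """Stub for the cheaper model: asks once, then answers using the hint."""
    if not hints and can_ask:
        return {"question": f"What is the key insight for: {task}?"}
    return {"answer": f"solved '{task}' using {len(hints)} hint(s)"}


def toy_advisor(task, question):
    """Stub for the stronger model."""
    return "check the off-by-one in the loop bound"


answer, consults = run_with_advisor("fix failing test", toy_executor, toy_advisor)
print(answer, "| advisor calls:", consults)
```

Capping the number of consultations keeps the expensive model's share of tokens bounded, which is the cost lever the pattern depends on.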
- 15:32 – 17:37
Live demo: Haiku alone vs Haiku + Advisor (Opus) for faster correct answers
Brad demonstrates GitHub Copilot CLI with Haiku on one side and Haiku plus the Advisor tool on the other. The advisor path adds a small latency and cost overhead but quickly surfaces the key insight, letting Haiku finish the task while Haiku alone struggles.
- •Side-by-side comparison: baseline vs advisor-augmented flow
- •Advisor provides a targeted hint rather than full solution
- •Small added latency/cost can unlock large quality gains
- •Planned release as an experiment in Copilot CLI
- 17:37 – 18:37
Beyond advisors: “Rubber Duck” critique model to catch issues at key checkpoints
Mario introduces a complementary pattern: injecting a critique step (rather than advice) at moments that prevent downstream rework. The system requests critique from stronger models and uses it to adjust the plan before continuing execution.
- •A critic model reviews work at checkpoints; an advisor gives targeted help on request
- •Demo: critique influences plan before implementation continues
- •Critique can prevent costly rework later in the loop
- •Cross-model critique used to improve reliability and outcomes
- 18:37 – 20:38
Where critique pays off: after planning, after complex implementation, and before running tests
Mario pinpoints three insertion points where critique provides strong ROI: right after drafting a plan, after complex implementations (pre-review), and after writing tests but before running them. The emphasis is on keeping developers in flow by reducing avoidable CI and review cycles.
- •Critique after drafting a plan (highest leverage)
- •Critique after complex implementation (pre-code review)
- •Critique after writing tests, before running them (save CI time)
- •Rubber Duck available experimentally in Copilot CLI
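The three insertion points can be sketched as checkpoints in a staged agent loop. This is an illustrative outline only: the stage names, the `critic` and `revise` callables, and the log format are hypothetical and do not describe how Rubber Duck is built.

```python
# Stages after which a critique is requested, following the three points above.
CRITIQUE_AFTER = {"plan", "implement", "write_tests"}


def run_stages(stages, critic, revise):
    """Run stages in order, revising a stage's output when the critic objects."""
    outputs = {}
    log = []
    for name, produce in stages:
        output = produce()
        if name in CRITIQUE_AFTER:
            feedback = critic(name, output)
            if feedback:
                output = revise(output, feedback)
                log.append((name, "revised"))
            else:
                log.append((name, "approved"))
        else:
            log.append((name, "not critiqued"))
        outputs[name] = output
    return outputs, log


def toy_critic(name, output):
    """Stub for the stronger model: objects only to the plan."""
    return "handle empty input" if name == "plan" else None


def toy_revise(output, feedback):
    return f"{output} + {feedback}"


stages = [
    ("plan", lambda: "plan v1"),
    ("implement", lambda: "patch"),
    ("write_tests", lambda: "tests"),
    ("run_tests", lambda: "results"),
]

outputs, log = run_stages(stages, toy_critic, toy_revise)
print(outputs["plan"])
print(log)
```

Placing the last checkpoint before `run_tests` reflects the idea of catching problems before paying for a test or CI run.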
- 20:38 – 22:09
A disciplined model rollout pipeline: onboard, tune harness, benchmark, dogfood, iterate with Anthropic
Mario details GitHub’s evolved process for integrating new Anthropic models into Copilot. The pipeline includes onboarding into Copilot’s API layer, tuning prompts/tools/context management, running offline benchmarks, extensive internal dogfooding, and sharing structured findings with Anthropic through iterative checkpoints.
- •Onboard new model into Copilot API (CAPI) endpoint
- •Tune system prompts, tool interfaces, agent loop, context/compaction
- •Run offline benchmarks plus internal dogfooding (online signal)
- •Iterate with Anthropic via findings, docs, and checkpoints
- 22:09 – 26:15
Online evals and harness optimization: tools discipline, A/B testing, and outcome-based metrics
Mario emphasizes that offline benchmarks are only a baseline; real optimization happens via online experiments after launch. He outlines harness hot spots (prompt/context building and tool execution), cautions against excessive tool counts, and closes with product measurement guidance: track outcomes (like “survival rate”) over superficial activity metrics (like acceptance rate).
- •Online evals and A/B tests uncover real-world issues and tuning needs
- •Harness focus areas: prompt/context building and tool execution
- •Too many tools increase confusion; tune tools per surface/scenario
- •Measure outcomes (survival rate) rather than activity (acceptance rate)
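The difference between an activity metric and an outcome metric can be shown with a small calculation. The summary does not define survival rate, so this sketch assumes it means the share of accepted suggestions still present in the code at some later checkpoint; the record fields and sample data are hypothetical.

```python
def acceptance_rate(suggestions):
    """Activity metric: share of suggestions the developer accepted."""
    if not suggestions:
        return 0.0
    return sum(s["accepted"] for s in suggestions) / len(suggestions)


def survival_rate(suggestions):
    """Outcome metric (assumed definition): share of accepted suggestions that persisted."""
    accepted = [s for s in suggestions if s["accepted"]]
    if not accepted:
        return 0.0
    return sum(s["survived"] for s in accepted) / len(accepted)


# Hypothetical telemetry: most suggestions were accepted, few persisted.
suggestions = [
    {"accepted": True, "survived": True},
    {"accepted": True, "survived": False},
    {"accepted": True, "survived": False},
    {"accepted": True, "survived": False},
    {"accepted": False, "survived": False},
]

print("acceptance rate:", acceptance_rate(suggestions))
print("survival rate:  ", survival_rate(suggestions))
```

In this made-up sample the acceptance rate looks healthy while the survival rate is low, which is the kind of gap an activity metric alone would hide.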