Caching, harnesses, and advisors: Building on Claude at GitHub scale
CHAPTERS
GitHub’s north-star outcomes: flow, velocity, scale, efficiency, and trust
Mario frames Copilot’s platform decisions around customer outcomes: keeping developers in flow, increasing team velocity, and operating at enterprise scale. He emphasizes that efficiency and trust are prerequisites for delivering those outcomes reliably.
Three core learnings from running Copilot at massive inference volume
Mario previews the talk’s structure based on what GitHub has learned operating at billions of requests. He outlines three focus areas: prompt caching, routing intelligence via advisor/critic patterns, and a disciplined process for adopting new models.
Instrumenting caching: dashboards, cache hit ratios, and model deltas
Mario highlights the importance of measurement, pointing to Anthropic’s cache dashboard and GitHub’s internal dashboards. GitHub monitors cache hit ratios and compares model versions (e.g., Opus 4.6 vs 4.7) to understand cost and performance impact before and after launch.
Operational targets: why 94–96% cache hit rate matters
GitHub sets high cache-hit targets to operate Copilot cost-effectively at scale, treating significantly lower rates as a likely bug. Mario explains that even ~1% changes matter given request volume and notes the steep cost penalty when cache is invalidated.
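The sensitivity to small hit-rate changes can be sketched with a simple cost model. All prices, volumes, and the `daily_input_cost` function are illustrative assumptions, not GitHub's actual figures; the only grounded point is that cached reads are billed at a steep discount to fresh input tokens, so every lost point of hit rate is paid at full price.

```python
# Illustrative cost model for prompt-cache hit rate at scale.
# Prices and request volumes are hypothetical, not GitHub's numbers.

def daily_input_cost(requests, tokens_per_request, hit_rate,
                     price_per_mtok=3.00, cached_read_discount=0.10):
    """Daily input-token cost for a given cache hit rate.

    Cached reads are billed at a fraction of the base input price
    (assumed here to be 10% of the uncached rate).
    """
    total_mtok = requests * tokens_per_request / 1e6
    cached = total_mtok * hit_rate * price_per_mtok * cached_read_discount
    uncached = total_mtok * (1 - hit_rate) * price_per_mtok
    return cached + uncached

# A single point of hit rate (96% -> 95%) at a hypothetical
# 100M requests/day with 10K-token prompts:
cost_96 = daily_input_cost(100_000_000, 10_000, 0.96)
cost_95 = daily_input_cost(100_000_000, 10_000, 0.95)
# In this scenario the 1-point drop adds $27,000/day.
```

Under these assumed numbers, moving from 96% to 95% raises daily input spend by roughly 6.6%, which is why a hit rate well below target is treated as a likely bug rather than noise.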
Hard-won prompt caching rules: static prefixes, stable tools, and cache affinity
Mario shares concrete lessons learned to push cache rates into the 90s. The core is to keep early prompt components stable, avoid dynamic content in system/tool prefixes, and maintain cache affinity even across multi-model flows.
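A minimal sketch of what a cache-friendly prompt layout looks like, assuming the Anthropic Messages API's prompt-caching format (`cache_control` breakpoints of type `ephemeral`). The `build_request` function and the model id are illustrative; the point is the ordering: byte-stable tools and system text first, a cache breakpoint at the end of the stable prefix, and all per-request content after it.

```python
# Sketch of a cache-friendly request layout for the Anthropic
# Messages API. Function name and model id are placeholders; the
# cache_control shape follows Anthropic's prompt-caching format.

def build_request(system_prompt, tool_defs, conversation, user_turn):
    """Put static content first and mark the end of the stable
    prefix with a cache breakpoint; dynamic content comes after."""
    return {
        "model": "claude-opus-4-1",      # placeholder model id
        "max_tokens": 1024,
        # Tool definitions and the system prompt must be byte-stable
        # across requests, so they lead the prompt.
        "tools": tool_defs,
        "system": [
            {
                "type": "text",
                "text": system_prompt,   # no timestamps, no request ids
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Everything after the breakpoint may vary per request.
        "messages": conversation + [{"role": "user", "content": user_turn}],
    }
```

Any dynamic value (a timestamp, a session id) placed before the breakpoint changes the prefix bytes and invalidates the cache for every request, which is why the rule is to keep system and tool prefixes strictly static.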
Debunking ‘long context costs more’: compaction and summarization drive cost
GitHub tested scenarios showing that longer context windows aren’t necessarily more expensive. The real cost driver is compaction/summarization frequency, which increases output tokens and can reduce cache effectiveness.
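The intuition can be made concrete with hypothetical per-MTok prices (the functions and figures below are illustrative, not from the talk): carrying a long context as cheap cached reads can cost less per turn than compacting, because compaction re-reads the context uncached and pays output-token prices for the summary.

```python
# Hypothetical $/MTok prices: cache reads are cheap, output is not.
CACHE_READ, INPUT, OUTPUT = 0.30, 3.00, 15.00

def turn_cost_long_context(context_tokens, new_tokens, reply_tokens):
    # High cache hit: the accumulated context is a cached read.
    return (context_tokens / 1e6 * CACHE_READ
            + new_tokens / 1e6 * INPUT
            + reply_tokens / 1e6 * OUTPUT)

def turn_cost_with_compaction(summary_out_tokens, context_tokens,
                              new_tokens, reply_tokens):
    # Compaction re-reads the old context uncached and generates a
    # summary at output-token prices; the next turn misses the cache.
    return (context_tokens / 1e6 * INPUT
            + summary_out_tokens / 1e6 * OUTPUT
            + new_tokens / 1e6 * INPUT
            + reply_tokens / 1e6 * OUTPUT)
```

With these numbers, a turn over a 100K-token cached context costs about $0.05, while compacting that context into a 5K-token summary first costs roughly $0.40, which is the sense in which compaction frequency, not context length, drives cost.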
Caching playbook recap: measure per surface and prioritize cache first
Mario summarizes the caching guidance: instrument cache hit rate, analyze deltas around model launches, and treat caching as the first efficiency milestone. He notes Copilot spans many surfaces, each requiring separate measurement and sometimes different tuning.
Advisor strategy: pairing a ‘junior’ model with a ‘senior’ model
Brad explains the advisor pattern: let a cheaper, faster model (Haiku) execute most tasks, and consult a stronger model (Opus) only when needed. The goal is to approximate higher-end intelligence while spending premium tokens sparingly.
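The advisor pattern described above can be sketched as a small control loop. The model calls are stubbed as plain functions (in practice they would be API calls to Haiku and Opus); `solve_with_advisor` and its signature are illustrative, not GitHub's implementation.

```python
# Minimal advisor-pattern sketch: a cheap "junior" model does the
# work, and an expensive "senior" model is consulted only when the
# junior is stuck. Model calls are stubbed as plain callables.

def solve_with_advisor(task, junior, senior, max_attempts=3):
    """Let the cheap model execute; escalate to the strong model
    for a hint only on failure, to spend premium tokens sparingly."""
    hint = None
    result = {"ok": False, "attempt": None}
    for _ in range(max_attempts):
        result = junior(task, hint=hint)
        if result["ok"]:
            return result
        # Ask the senior model for a concise hint, not a full
        # solution -- the junior model still does the work.
        hint = senior(f"A junior agent is stuck on: {task}\n"
                      f"Its attempt: {result['attempt']}\n"
                      "Give one concise hint.")
    return result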
Live demo: Haiku + Advisor vs Haiku alone in Copilot CLI
Brad demos the integration: Haiku alone struggles on a tricky problem, while Haiku with an Opus advisor quickly gets an essential hint and finishes. The pattern introduces a small latency/cost bump but materially improves success on difficult tasks.
‘Rubber Duck’ critic: inserting critique at high-leverage moments
Mario introduces a complementary pattern: a critic model that reviews and challenges the work-in-progress rather than advising from scratch. The critic is inserted at specific checkpoints to catch issues early and reduce downstream rework.
Where critique is applied: after planning, after complex changes, after tests (pre-run)
GitHub inserts critique into three moments in the workflow. Mario notes the plan phase tends to yield the biggest gains, but critique after implementation and after writing tests can also shorten the path to a successful outcome.
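The three checkpoints above can be sketched as a loop that interleaves critique with the workflow phases. The `run_with_critic` function, the phase names as dict keys, and the single-revision policy are illustrative assumptions, not the actual Copilot implementation.

```python
# Sketch of checkpoint-based critique ("rubber duck" critic).
# `critic` stands in for a strong model prompted to challenge the
# artifact; checkpoints mirror the three moments described above.

CHECKPOINTS = ("plan", "implementation", "tests")

def run_with_critic(workflow, critic):
    """Run each phase, then have the critic review its artifact
    before moving on (for tests: before they are executed)."""
    artifacts = {}
    for phase in CHECKPOINTS:
        artifact = workflow[phase](artifacts)
        review = critic(phase, artifact)
        if review["issues"]:
            # Revise once with the critique folded in, catching
            # problems before they cause downstream rework.
            artifact = workflow[phase](artifacts, feedback=review["issues"])
        artifacts[phase] = artifact
    return artifacts
```

Placing the first checkpoint after planning matches the observation that plan-phase critique yields the biggest gains: a flaw caught there is cheap, while the same flaw found after implementation forces rework across every later phase.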
Scaling model adoption: a disciplined pipeline from onboarding to launch
Mario details GitHub’s model rollout process: onboard into Copilot API, tune prompts/tools/context, run offline benchmarks and internal dogfooding, then iterate with Anthropic through multiple feedback loops. He stresses that real-world online evals reveal issues that offline tests miss.
Harness optimization and evaluation metrics: tools, context, and outcome-based measurement
Mario breaks the harness into stages (build context, call model, execute tools, append results, loop) and says most effort goes into tool execution and context management. He closes by emphasizing outcome metrics—like survival rate—over activity metrics like acceptance rate.
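The harness stages Mario lists can be sketched as a single loop. `call_model` and `execute_tool` are stand-ins for real model and tool integrations, and the message shapes are illustrative; the structure (build context, call model, execute tools, append results, loop) is the part taken from the talk.

```python
# Sketch of the agentic harness loop: build context, call the
# model, execute any requested tool, append the result, repeat.
# call_model / execute_tool are placeholders for real integrations.

def harness_loop(task, call_model, execute_tool, max_steps=20):
    messages = [{"role": "user", "content": task}]    # build context
    for _ in range(max_steps):
        reply = call_model(messages)                  # call model
        messages.append({"role": "assistant", "content": reply})
        if reply.get("tool_call") is None:
            return reply["text"]                      # final answer
        result = execute_tool(reply["tool_call"])     # execute tools
        # Append the tool output so the next iteration sees it --
        # this step is where context-management effort concentrates.
        messages.append({"role": "tool", "content": result})
    return None  # step budget exhausted without a final answer
```

Most of the tuning effort lands in the two middle steps: how tool results are executed and how the growing context is managed across iterations, which is consistent with measuring the whole loop by outcomes rather than per-step activity.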