At a glance
WHAT IT’S REALLY ABOUT
Production agent performance: prompt caching, context engineering, and advisor models
- Prompt caching is presented as the highest-impact optimization, delivering major cost reductions, faster time-to-first-token, and relief from API rate-limit pressure for repeated prompt segments.
- Context engineering is framed as an explicit discipline of controlling what enters the model context, avoiding abstractions that hide context composition and prevent optimization.
- Three production techniques—tool search, programmatic tool calling, and compaction—are shown to drastically reduce context growth while preserving capability in long-running agent loops.
- An “advisor” pattern pairs a cheaper execution model (e.g., Sonnet/Haiku) with on-demand Opus reviews to achieve near-Opus intelligence at substantially lower cost (a minimal sketch follows this list).
- The talk closes with platform additions (workload identity federation and the Ant CLI) that improve security posture and operational automation for teams running agents in production.
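A minimal sketch of that advisor loop, assuming a simple "flag your own uncertainty" escalation trigger. The model IDs are current public aliases; the UNSURE convention and the prompts are illustrative assumptions, not the talk's implementation:

```python
# Sketch of the advisor pattern: a cheaper executor model does the work,
# and an Opus "advisor" reviews only when the executor flags doubt.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def run_step(task: str) -> str:
    draft = client.messages.create(
        model="claude-sonnet-4-5",           # cheap executor
        max_tokens=1024,
        system="Do the task. If you are not confident, end with the line UNSURE.",
        messages=[{"role": "user", "content": task}],
    ).content[0].text

    if draft.rstrip().endswith("UNSURE"):    # escalate only when flagged
        review = client.messages.create(
            model="claude-opus-4-1",         # on-demand advisor review
            max_tokens=1024,
            system="Review the draft answer and return a corrected version.",
            messages=[{"role": "user", "content": f"Task: {task}\n\nDraft:\n{draft}"}],
        )
        return review.content[0].text
    return draft
```

Because most steps never trigger the review call, the blended cost stays close to the executor's rate while hard cases still get Opus attention.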
IDEAS WORTH REMEMBERING
5 ideas
Prompt caching is the first optimization to implement for long-running agents.
Repeated tool-call loops create large shared prompt prefixes; caching those segments avoids reprocessing and can yield ~90% discounts on cached input tokens plus faster time-to-first-token.
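A minimal sketch of marking such a shared prefix with the Anthropic Python SDK's documented `cache_control` field; the system prompt and model ID here are stand-ins:

```python
# Sketch: marking a stable prompt prefix as cacheable. Everything up to and
# including the cache_control block is cached; later requests that share the
# prefix read it back at the discounted cached-input rate.
import anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = "..."  # the large, stable instructions shared across turns

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache boundary marker
        }
    ],
    messages=[{"role": "user", "content": "Summarize the open tickets."}],
)
print(response.content[0].text)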
Cache hit rate is a production KPI you should actively monitor.
The Claude Console prompt cache dashboard surfaces real usage; if your hit rate isn’t “in the 90s,” you likely have structural prompt issues or missing cache markers.
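If you want to track the same number outside the Console, each Messages API response reports cached and uncached input tokens in its `usage` block; a rough aggregation might look like this (field names per the current API, the batching logic assumed):

```python
# Sketch: computing a prompt-cache hit rate from Messages API usage fields.
def cache_hit_rate(responses) -> float:
    """Fraction of input tokens served from cache across a batch of responses."""
    read = sum(r.usage.cache_read_input_tokens or 0 for r in responses)
    created = sum(r.usage.cache_creation_input_tokens or 0 for r in responses)
    uncached = sum(r.usage.input_tokens for r in responses)
    total = read + created + uncached
    return read / total if total else 0.0
```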
Use Claude Code’s prompt-caching skill to quickly improve cacheability.
The talk highlights an installed-by-default Claude Code capability that can suggest where to add cache-control markers and how to reorganize prompts to raise cache hit rates.
Context engineering requires visibility—avoid layers that hide what’s in context.
If frameworks or abstractions obscure the actual transcript/context, developers lose the ability to decide what belongs in context and can’t effectively optimize cost, latency, or reliability.
Tool search reduces context bloat by loading tools just-in-time.
You can declare many tools but defer injecting full tool schemas until the model needs them; customers reported meaningful token reduction (e.g., ~10%) and sometimes improved model focus.
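For teams not using the platform's built-in tool search, a hand-rolled approximation looks roughly like this; every tool name, schema, and the matching logic below is hypothetical:

```python
# Sketch of just-in-time tool loading: the model initially sees only a cheap
# search_tools meta-tool; full schemas live in a local registry and are added
# to later requests once the model asks for them.

FULL_TOOL_SCHEMAS = {
    "query_billing_db": {
        "name": "query_billing_db",
        "description": "Run a read-only SQL query against the billing database.",
        "input_schema": {"type": "object",
                         "properties": {"sql": {"type": "string"}},
                         "required": ["sql"]},
    },
    "send_slack_message": {
        "name": "send_slack_message",
        "description": "Post a message to a Slack channel.",
        "input_schema": {"type": "object",
                         "properties": {"channel": {"type": "string"},
                                        "text": {"type": "string"}},
                         "required": ["channel", "text"]},
    },
    # ...imagine hundreds more entries kept out of the context by default
}

SEARCH_TOOL = {
    "name": "search_tools",
    "description": "Find tools relevant to a task by keyword.",
    "input_schema": {"type": "object",
                     "properties": {"query": {"type": "string"}},
                     "required": ["query"]},
}

def search_tools(query: str, limit: int = 3) -> list[str]:
    """Naive keyword match over names; a real index would rank by description."""
    q = query.lower()
    return [n for n in FULL_TOOL_SCHEMAS if q in n.replace("_", " ")][:limit]

def tools_for_request(loaded: set[str]) -> list[dict]:
    """Send only the meta-tool plus schemas the model has already requested."""
    return [SEARCH_TOOL] + [FULL_TOOL_SCHEMAS[n] for n in loaded]
```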
WORDS WORTH SAVING
5 quotes
With prompt caching, if you mark which sections are common in your prompt, then we're able to compute the KV values, essentially pre-cache part of the inputs to the models in KVs, and save those.
— Brad Abrams
In fact, it's a ninety percent discount.
— Brad Abrams
One mistake I see developers doing is using abstractions over top of the platform that obscure what's in the context, and then as a developer, you don't really know what Claude's seeing in its context.
— Brad Abrams
Context engineering is really a discipline. It's the discipline of deciding what belongs in Claude's context.
— Brad Abrams
So the problem we're trying to solve with advisor is: we want Opus-level intelligence, but at Haiku-level or Sonnet-level cost.
— Brad Abrams