CHAPTERS
Why production agents feel hard: cost, reliability, and latency
Brad Abrams frames the session around the real challenge: shipping agents to production, not demos. He polls the audience to show that even teams with agents already live in production often struggle with cost, reliability, and latency, and sets up the talk as a set of practical techniques for improving all three.
Prompt caching: the biggest lever for long-running agents
Brad introduces prompt caching as the most important optimization for agents whose context grows across repeated tool-call loops. By caching the stable prefix of the prompt (reusing the model's KV cache), the system skips reprocessing it on every loop iteration, dramatically reducing both latency and cost.
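A minimal sketch of the mechanics with the Anthropic Python SDK, assuming a long system prompt that stays byte-identical across loop iterations (the model name and prompt text here are placeholders); the `cache_control` breakpoint marks the prefix up to that point as cacheable:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "You are Hero Corp's deal-analysis agent..."  # placeholder

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model choice
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Breakpoint: everything up to here is written to the prompt
            # cache; later calls sharing this exact prefix skip reprocessing.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Which deals are at risk this quarter?"}],
)
```

The key constraint is prefix stability: any change earlier in the prompt invalidates everything cached after it, which is why cache markers belong after the stable segments and before the parts that vary per turn.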
The business case: 90% input discount + rate-limit benefits
He quantifies why prompt caching matters: a 90% discount on cached input tokens, which often dominate costs for agentic workloads. He also notes a lesser-known advantage: cached tokens don’t count toward API rate limits, improving throughput headroom.
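Back-of-the-envelope arithmetic makes the point concrete. The base price below is a placeholder, and the 0.1x read / 1.25x write multipliers follow Anthropic's published cache pricing at the time of writing, so treat the exact figures as assumptions:

```python
# Cost of a 100K-token shared prefix reused across 20 agent-loop iterations.
BASE_INPUT_PER_MTOK = 3.00   # $/million input tokens (placeholder price)
PREFIX_TOKENS = 100_000
ITERATIONS = 20

uncached = ITERATIONS * PREFIX_TOKENS / 1e6 * BASE_INPUT_PER_MTOK
cached = (
    PREFIX_TOKENS / 1e6 * BASE_INPUT_PER_MTOK * 1.25  # one cache write
    + (ITERATIONS - 1) * PREFIX_TOKENS / 1e6 * BASE_INPUT_PER_MTOK * 0.10  # 19 reads
)

print(f"uncached: ${uncached:.2f}  cached: ${cached:.2f}")
# Roughly $6.00 uncached vs. just under $1 cached: about 6x cheaper
# on the shared prefix, before counting the rate-limit headroom.
```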
How top teams get to 90%+ cache hit rates (and the tooling to help)
Brad cites customers like Cursor, Replit, and Perplexity achieving cache hit rates in the 90s via deliberate engineering. He then points to two tools that make it easier: a prompt cache dashboard in the Claude Console and a Claude Code skill that guides cache-marker placement and prompt restructuring.
Demo setup: fixing a dashboard and exposing the hidden cache problem
Brad brings Ben on stage to demo an “executive dashboard,” humorously re-themed into “Hero Corp AI.” They reveal a developer console showing agent behavior and discover the cache hit rate is effectively zero—illustrating how teams can miss large savings without visibility.
Turning caching on: cache writes, cache hits, and cache TTL
Ben uses Claude Code to improve cache hit rate and reruns the agent loop. Brad explains how prompt segments are written to cache on first encounter and then become cache hits on subsequent loops, with a default cache retention of about five minutes (extendable).
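The write-then-hit behavior is visible directly in the API response. The usage field names below are from the Messages API; the extended-TTL syntax is a newer, separately gated option, so treat that last part as an assumption:

```python
# After any messages.create() call, the usage object reports cache activity.
usage = response.usage
print("uncached input tokens:", usage.input_tokens)
print("written to cache:     ", usage.cache_creation_input_tokens)  # first iteration
print("read from cache:      ", usage.cache_read_input_tokens)      # later iterations

# Healthy agent loops show a cache write on turn 1 and large read counts
# afterwards; near-zero reads usually means the prefix changes between
# calls or the cache markers are misplaced.

# Extending retention past the ~5-minute default (assumption: "1h" TTL option):
block = {
    "type": "text",
    "text": "...the same long, stable prompt prefix...",
    "cache_control": {"type": "ephemeral", "ttl": "1h"},
}
```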
Context overflow even at 1M tokens: the need for context engineering
The demo shows that massive tool outputs (Slack, Gong, Salesforce, etc.) can exhaust even a million-token context window. Brad introduces “context engineering” as the discipline of deliberately choosing what belongs in the model’s context and avoiding abstractions that hide what the model actually sees.
Technique 1 — Tool Search: load tools just-in-time to save context
Brad explains that customers may have tens or hundreds of tools, but loading all tool schemas up front consumes valuable context. Tool Search defers loading tool definitions until the model actually needs them, reducing wasted tokens and sometimes improving model focus.
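A conceptual sketch of the idea (this is not the actual Tool Search API surface; `TOOL_REGISTRY` and `search_tools` are names invented for illustration):

```python
# Full tool schemas live outside the prompt; only matches get loaded.
TOOL_REGISTRY = {
    "slack_search": {
        "name": "slack_search",
        "description": "Search Slack messages by keyword and channel.",
        "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
    },
    "salesforce_get_account": {
        "name": "salesforce_get_account",
        "description": "Fetch a Salesforce account record by name.",
        "input_schema": {"type": "object", "properties": {"account": {"type": "string"}}},
    },
    # ...hundreds more tools can live here without costing any context
}

def search_tools(query: str, limit: int = 3) -> list[dict]:
    """Return full schemas only for tools whose description matches the query."""
    q = query.lower()
    matches = [s for s in TOOL_REGISTRY.values() if q in s["description"].lower()]
    return matches[:limit]

# The request initially carries only a lightweight search tool; matching
# schemas are appended to `tools` on the next turn, just in time.
active_tools = search_tools("Slack")
```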
Technique 2 — Programmatic Tool Calling: keep big tool outputs out of context
Programmatic tool calling addresses the problem of tools returning too much text by letting the model write Python to inspect and extract only the needed fields. Full tool outputs stay in memory, while only concise extracted snippets are inserted into the model context.
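A hedged sketch of the pattern, with `fetch_gong_transcripts` standing in for a sandbox-provided tool binding (the name and data shape are invented for illustration):

```python
def fetch_gong_transcripts(account: str) -> list[dict]:
    """Stub for a sandboxed tool binding; in production this could return
    megabytes of transcript text that never enters the model's context."""
    return [{"call_id": "c1", "text": "Discussed renewal timing.\nPricing is fine."}]

# Code like this is what the model writes and runs in the sandbox:
transcripts = fetch_gong_transcripts(account="Hero Corp")

mentions = [
    (t["call_id"], line)
    for t in transcripts
    for line in t["text"].splitlines()
    if "renewal" in line.lower()
]

# Only this short printed summary re-enters the model's context; the raw
# transcripts stay in the execution environment's memory.
print(f"{len(mentions)} renewal mention(s) across {len(transcripts)} call(s)")
```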
Technique 3 — Compaction: summarizing stale turns to stay within limits
Compaction is presented as the “sledgehammer” for long-running agents that eventually hit context limits even after other optimizations. When a threshold is reached, the system pauses, summarizes prior conversation/tool activity into a tight continuation summary, and proceeds without losing the thread.
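One way to implement the pattern by hand (a sketch, not the platform's built-in compaction; the threshold, model choice, and summarization prompt are all assumptions):

```python
import anthropic

client = anthropic.Anthropic()
COMPACTION_THRESHOLD = 150_000  # hypothetical token budget before compacting

def compact(messages: list[dict], token_count: int) -> list[dict]:
    """Summarize stale turns into one continuation brief once the context
    crosses the threshold, keeping the most recent turns verbatim."""
    if token_count < COMPACTION_THRESHOLD:
        return messages
    stale, keep = messages[:-4], messages[-4:]
    # Assumes string message content for brevity.
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in stale)
    summary = client.messages.create(
        model="claude-haiku-4-5",  # a cheap model can handle the summarizing
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": "Summarize this agent transcript into a tight "
                       "continuation brief: goals, facts found, open tasks.\n\n"
                       + transcript,
        }],
    ).content[0].text
    # Real code must also keep user/assistant roles alternating after the splice.
    return [{"role": "user", "content": f"[Compacted history]\n{summary}"}] + keep
```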
Demo recap: combining Tool Search + Programmatic Tool Calling + Compaction
They enable all three context engineering techniques in the Hero Corp demo and reload the page. The context bar grows much more slowly while executing the same tool calls, showing that the system can accomplish the same tasks with far fewer tokens and lower cost.
Cutting costs further with model choice: Opus is great, but expensive
Even after context improvements, the demo still costs around $10 per load because it’s using an Opus model. Brad argues that smaller models like Sonnet can handle tool calling and code generation well, and the remaining intelligence gap can be handled strategically.
Advisor strategy: Opus-level intelligence on demand with Sonnet/Haiku costs
Brad introduces the advisor pattern: a cheaper “executor” model does most work, but can consult a more capable “advisor” model for hard or high-stakes moments—analogous to junior engineers getting senior reviews. This yields large cost savings while preserving quality when it counts.
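A hedged sketch of the escalation wiring (the model names and the high-stakes trigger are illustrative, not from the talk):

```python
import anthropic

client = anthropic.Anthropic()
EXECUTOR = "claude-sonnet-4-5"  # cheap model does most of the work
ADVISOR = "claude-opus-4-1"     # expensive model is consulted selectively

def run_step(prompt: str, high_stakes: bool = False) -> str:
    draft = client.messages.create(
        model=EXECUTOR, max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ).content[0].text
    if not high_stakes:
        return draft
    # Escalate only hard or high-stakes moments, like a junior engineer
    # asking for a senior review before shipping.
    return client.messages.create(
        model=ADVISOR, max_tokens=1024,
        messages=[{
            "role": "user",
            "content": "Review this analysis for missed risks or buried "
                       "contract requirements, and correct it if needed.\n\n" + draft,
        }],
    ).content[0].text
```

Because most steps never touch the advisor, the blended cost stays close to the executor's rate while the hard calls still get top-tier judgment.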
Advisor demo moment: catching a missed detail in a critical deal
Switching to “Sonnet + Opus as advisor,” the system flags that Sonnet initially marks an important contract as on track, but the advisor finds a buried requirement (cryothane) in the transcript. The UI updates the risk status and enables an action to secure cryothane, illustrating high-stakes escalation.
Wrap-up + other platform wins: WIF and Ant CLI
Brad closes with the key takeaways: prioritize prompt caching, then apply context engineering, and finally use the advisor pattern for targeted intelligence. He also briefly highlights newer platform features: Workload Identity Federation for security, and the Ant CLI for command-line management, which integrates well with Claude Code.