Getting more out of the Claude Platform

Cut costs, manage context, boost intelligence. In this session, we'll show you how to put our latest platform capabilities to work. Through live demos you'll see what great prompt caching looks like, learn to keep context lean for long-running agents with tool search, programmatic tool calling, and compaction, and use the advisor strategy for a cost-effective intelligence boost. Together, they're a set of patterns you can apply to your agents today to get more from every token.

May 7, 2026 · 28m · Watch on YouTube ↗

CHAPTERS

  1. Why production agents feel hard: cost, reliability, and latency

    Brad Abrams frames the session around the real challenge: shipping agents to production, not demos. He polls the audience to highlight that even teams with agents live often struggle with cost, reliability, and latency—and sets up the talk as a set of practical techniques to improve all three.

  2. Prompt caching: the biggest lever for long-running agents

    Brad introduces prompt caching as the most important optimization for agents whose context grows across repeated tool-call loops. By caching common prompt segments (KV cache), the system skips reprocessing and dramatically reduces both latency and cost.
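A minimal sketch of what opting into prompt caching can look like with the Anthropic Messages API: a `cache_control` marker on the last stable content block tells the API to cache the prefix up to and including that block. The system prompt text and model name here are placeholders.

```python
# Illustrative request payload for prompt caching (placeholder prompt text).
# The cache_control marker caches everything up to and including that block,
# so repeated agent loops reuse it instead of reprocessing it.
LARGE_SYSTEM_PROMPT = "You are an agent for Hero Corp... (thousands of tokens of stable instructions)"

request = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,
            # First call: written to the cache. Later calls within the TTL:
            # served as a cache read at a steep discount.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Summarize today's pipeline."}],
}
```

The key design point is that the marker goes after the *stable* prefix (system prompt, tool definitions), so per-turn user content never invalidates the cached segment.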

  3. The business case: 90% input discount + rate-limit benefits

    He quantifies why prompt caching matters: a 90% discount on cached input tokens, which often dominate costs for agentic workloads. He also notes a lesser-known advantage: cached tokens don’t count toward API rate limits, improving throughput headroom.
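The back-of-envelope math behind that claim, sketched below with assumed (not current list) prices, and ignoring the small cache-write premium: because an agent replays its whole context every loop, a 90% discount on cache reads compounds quickly.

```python
# Illustrative cost math for the 90% cached-input discount.
input_price = 3.00                 # $ per million input tokens (assumed)
cached_price = input_price * 0.10  # 90% discount on cache reads

tokens_per_loop = 50_000  # context replayed on each agent loop
loops = 20

# Without caching, every loop pays full price for the whole context.
uncached_cost = loops * tokens_per_loop / 1e6 * input_price

# With caching, the first loop writes the cache at full price and the
# remaining loops read it at the discounted rate.
cached_cost = (tokens_per_loop / 1e6 * input_price
               + (loops - 1) * tokens_per_loop / 1e6 * cached_price)

print(f"uncached: ${uncached_cost:.2f}, cached: ${cached_cost:.2f}")
```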

  4. How top teams get to 90%+ cache hit rates (and the tooling to help)

    Brad cites customers like Cursor, Replit, and Perplexity achieving cache hit rates in the 90s via deliberate engineering. He then points to two tools that make it easier: a prompt cache dashboard in the Claude Console and a Claude Code skill that guides cache-marker placement and prompt restructuring.

  5. Demo setup: fixing a dashboard and exposing the hidden cache problem

    Brad brings Ben on stage to demo an “executive dashboard,” humorously re-themed into “Hero Corp AI.” They reveal a developer console showing agent behavior and discover the cache hit rate is effectively zero—illustrating how teams can miss large savings without visibility.

  6. Turning caching on: cache writes, cache hits, and cache TTL

    Ben uses Claude Code to improve cache hit rate and reruns the agent loop. Brad explains how prompt segments are written to cache on first encounter and then become cache hits on subsequent loops, with a default cache retention of about five minutes (extendable).
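The visibility problem from the demo can be checked in code: the Messages API reports `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens` in its usage data, which is enough to compute a hit rate per request. The sample numbers below are made up.

```python
# Compute a cache hit rate from Messages API usage fields.
def cache_hit_rate(usage: dict) -> float:
    read = usage.get("cache_read_input_tokens", 0)      # served from cache
    written = usage.get("cache_creation_input_tokens", 0)  # written this call
    fresh = usage.get("input_tokens", 0)                # uncached input
    total = read + written + fresh
    return read / total if total else 0.0

# A healthy later-loop request: almost all input is a cache read.
usage = {"input_tokens": 500,
         "cache_creation_input_tokens": 0,
         "cache_read_input_tokens": 45_000}
print(f"cache hit rate: {cache_hit_rate(usage):.0%}")
```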

  7. Context overflow even at 1M tokens: the need for context engineering

    The demo shows that massive tool outputs (Slack, Gong, Salesforce, etc.) can exhaust even a million-token context. Brad introduces “context engineering” as the discipline of intentionally choosing what belongs in the model’s context and avoiding abstractions that hide what the model sees.

  8. Technique 1 — Tool Search: load tools just-in-time to save context

    Brad explains that customers may have tens or hundreds of tools, but loading all tool schemas up front consumes valuable context. Tool Search defers loading tool definitions until the model actually needs them, reducing wasted tokens and sometimes improving model focus.
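The idea can be sketched generically (the platform ships this as a built-in tool-search capability; the registry and function names below are hypothetical): the model starts with only a search tool in context, and full schemas are loaded just-in-time for the tools it actually asks about.

```python
# Hypothetical registry standing in for hundreds of tool schemas that would
# otherwise all be sent up front.
TOOL_REGISTRY = {
    "salesforce_query": {
        "name": "salesforce_query",
        "description": "Query Salesforce opportunity records",
        "input_schema": {"type": "object", "properties": {"soql": {"type": "string"}}},
    },
    "slack_search": {
        "name": "slack_search",
        "description": "Search Slack messages by keyword",
        "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
    },
}

def search_tools(query: str) -> list:
    """Return full schemas only for tools whose description matches the query."""
    q = query.lower()
    return [t for t in TOOL_REGISTRY.values() if q in t["description"].lower()]

# Only the matching schema enters the conversation, not the whole registry.
matches = search_tools("slack")
```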

  9. Technique 2 — Programmatic Tool Calling: keep big tool outputs out of context

    Programmatic tool calling addresses the problem of tools returning too much text by letting the model write Python to inspect and extract only the needed fields. Full tool outputs stay in memory, while only concise extracted snippets are inserted into the model context.
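A toy illustration of that split, with a made-up payload: the full tool result lives only in the execution environment, the model writes a few lines of Python against it, and only the extracted snippet is placed back into model context.

```python
# Imagine this is a multi-megabyte Salesforce response held in memory,
# never inserted into the model's context window.
full_result = {
    "records": [{"name": f"Deal {i}", "amount": i * 1000, "stage": "open"}
                for i in range(10_000)],
}

# Code the model might write to pull out just what it needs:
top = sorted(full_result["records"], key=lambda r: r["amount"], reverse=True)[:3]
snippet = [f"{r['name']}: ${r['amount']:,}" for r in top]

# Only `snippet` (a few dozen tokens) goes back into the conversation.
```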

  10. Technique 3 — Compaction: summarizing stale turns to stay within limits

    Compaction is presented as the “sledgehammer” for long-running agents that eventually hit context limits even after other optimizations. When a threshold is reached, the system pauses, summarizes prior conversation/tool activity into a tight continuation summary, and proceeds without losing the thread.
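The threshold-triggered loop described above can be sketched as follows. Everything here is an assumption for illustration: `summarize()` stands in for a model call that writes the continuation summary, and the token counter is a crude chars/4 heuristic, not a real tokenizer.

```python
CONTEXT_LIMIT = 200_000
COMPACT_AT = 0.8  # compact when the context is ~80% full

def rough_tokens(messages: list) -> int:
    # Crude heuristic: ~4 characters per token.
    return sum(len(m["content"]) // 4 for m in messages)

def summarize(messages: list) -> str:
    # In practice this is itself a model call producing a tight
    # continuation summary; here it's a stub.
    return f"[summary of {len(messages)} earlier turns]"

def maybe_compact(messages: list) -> list:
    if rough_tokens(messages) < CONTEXT_LIMIT * COMPACT_AT:
        return messages
    keep = messages[-4:]  # keep the most recent turns verbatim
    return [{"role": "user", "content": summarize(messages[:-4])}] + keep
```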

  11. Demo recap: combining Tool Search + Programmatic Tool Calling + Compaction

    They enable all three context engineering techniques in the Hero Corp demo and reload the page. The context bar grows much more slowly while executing the same tool calls, showing that the system can accomplish the same tasks with far fewer tokens and lower cost.

  12. Cutting costs further with model choice: Opus is great, but expensive

    Even after context improvements, the demo still costs around $10 per load because it’s using an Opus model. Brad argues that smaller models like Sonnet can handle tool calling and code generation well, and the remaining intelligence gap can be handled strategically.

  13. Advisor strategy: Opus-level intelligence on demand with Sonnet/Haiku costs

    Brad introduces the advisor pattern: a cheaper “executor” model does most work, but can consult a more capable “advisor” model for hard or high-stakes moments—analogous to junior engineers getting senior reviews. This yields large cost savings while preserving quality when it counts.
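The routing logic of the advisor pattern is simple to sketch. Model names are illustrative, `call_model` is a stub for a real API call, and `is_high_stakes` is a placeholder for whatever escalation trigger fits your workload (deal size, confidence score, user flag).

```python
EXECUTOR = "claude-sonnet-4-5"  # cheap model does most of the work
ADVISOR = "claude-opus-4-1"     # stronger model consulted selectively

def call_model(model: str, prompt: str) -> str:
    # Stub for a real API call; tagging the output with the model name
    # makes the routing visible.
    return f"{model}: {prompt}"

def is_high_stakes(task: dict) -> bool:
    # Placeholder trigger: escalate on big deals or explicit flags.
    return task.get("deal_value", 0) > 1_000_000 or task.get("flagged", False)

def run_turn(task: dict) -> str:
    draft = call_model(EXECUTOR, task["prompt"])
    if is_high_stakes(task):
        # "Senior review": ask the advisor to check the executor's draft.
        draft = call_model(ADVISOR, f"Review this draft: {draft}")
    return draft
```

Because the advisor only sees the turns that trip the trigger, the blended cost stays close to the executor's rate while hard cases still get the stronger model.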

  14. Advisor demo moment: catching a missed detail in a critical deal

    Switching to “Sonnet + Opus as advisor,” the system flags that Sonnet initially marks an important contract as on track, but the advisor finds a buried requirement (cryothane) in the transcript. The UI updates the risk status and enables an action to secure cryothane, illustrating high-stakes escalation.

  15. Wrap-up + other platform wins: WIF and Ant CLI

    Brad closes with key takeaways: prioritize prompt caching, then apply context engineering, and finally use advisor for targeted intelligence. He also briefly highlights newer platform features—Workload Identity Federation for security and the Ant CLI for command-line management that integrates well with Claude Code.
