Skip to content
ClaudeClaude

Getting more out of the Claude Platform

Cut cost, manage context, boost intelligence. In this session, we'll show you how to put our latest platform capabilities to work. Through live demos you'll see what great prompt caching looks like, learn to keep context lean for long-running agents with tool search, programmatic tool calling, and compaction, and use the advisor strategy for a cost-effective intelligence boost. Together, they're a set of patterns you can apply to your agents today to get more from every token.

May 7, 202628mWatch on YouTube ↗

CHAPTERS

  1. 0:52 – 1:53

    Putting agents into production: what breaks first (cost, latency, reliability)

    Brad Abrams frames the session around real-world production constraints for agents, then quickly gauges the audience on who already has agents deployed and whether they’re happy with operational metrics. The talk sets up practical techniques to improve cost, latency, and reliability rather than focusing on demos.

    • Production agents often struggle with cost, latency, and reliability (not just correctness)
    • Audience poll highlights gaps between “in production” and “performing well”
    • Session focus: concrete techniques that materially improve ops metrics
  2. 1:53 – 2:54

    Prompt caching as the highest-leverage optimization

    Brad identifies prompt caching as the most impactful technique for long-running agents whose context grows with each tool call. He explains that caching avoids reprocessing repeated prompt segments, cutting inference work dramatically.

    • Long-running agent loops repeatedly append tool calls/results, growing prompts with lots of repeated content
    • Prompt caching stores reusable KV computations for common prompt segments
    • Skipping early inference stages reduces latency and compute
  3. 2:54 – 3:55

    Why prompt caching matters: 90% input discount and rate-limit relief

    The platform incentives and performance gains are made explicit: prompt caching can deliver a steep discount on input tokens and faster time-to-first-token. A lesser-known benefit is that cached tokens don’t count against API rate limits.

    • Up to ~90% discount on input tokens for cached segments
    • Faster response times, especially time-to-first-token
    • Cached tokens don’t count toward API rate limits
    • Even with increased limits, caching helps stay within constraints
  4. 3:55 – 5:26

    Benchmarks from production teams + tools to boost cache hit rate

    He cites Cursor, Replit, and Perplexity achieving cache hit rates in the 90s with meaningful engineering effort. Then he introduces platform tooling that makes it easier: a prompt cache dashboard and a Claude Code skill to improve cache hit rate.

    • High cache hit rates (90%+) are achievable but often require deliberate engineering
    • New Console Analytics: prompt cache dashboard for deep inspection
    • Claude Code includes a prompt-caching expert skill (installed by default)
    • Workflow: ask Claude Code to “Improve my cache hit rate” and follow guided changes
  5. 5:26 – 7:29

    Demo setup: fixing an executive dashboard UI with Claude Code

    Brad brings Ben on stage to demo an “executive dashboard,” then humorously critiques the dated UI. They use Claude Code to quickly improve the theme, setting the stage for deeper agent observability and optimization.

    • Live demo context: executive dashboard agent experience
    • Claude Code used to modify the app quickly (theme/UI)
    • Sets up a scenario for inspecting agent behavior under the hood
  6. 7:29 – 7:59

    Observability lesson: measure your cache hit rate (it’s 0%)

    They open a custom dev console showing context usage, tool calls, and an agentic transcript, revealing the cache hit rate is zero. Brad emphasizes that you can’t optimize what you don’t measure.

    • Dev console displays context usage, tool calls, and transcript for debugging
    • Cache hit rate visibility is crucial for optimization
    • Practical takeaway: establish instrumentation to track caching and context growth
  7. 7:59 – 8:59

    Turning caching on in practice: cache writes, cache hits, and TTL behavior

    Ben applies caching improvements, and the rerun shows cache writes and subsequent cache hits in the transcript. Brad explains cache lifecycle details, including default retention and how repeated loops benefit.

    • First-seen prompt segments generate cache writes; subsequent loops create cache hits
    • Default cache retention ~5 minutes (with options to extend)
    • Healthy agent loops should show a pattern of reads/hits as context repeats
    • Cache hit rate improves over repeated tool-call cycles
  8. 8:59 – 11:00

    When even 1M tokens isn’t enough: the need for context engineering

    The demo shows tool results (Slack, Gong, Salesforce, etc.) flooding the context, exhausting even a million-token window. Brad introduces “context engineering” as a discipline of deciding what belongs in the model context and avoiding abstractions that hide it.

    • Tool-rich agents can overload context with raw transcripts and large payloads
    • Context engineering: intentionally choosing what the model should see
    • Over-abstracting the platform can obscure context contents and block optimization
    • Reviewing the full model-accessible transcript is critical for tuning
  9. 11:00 – 12:31

    Technique 1 — Tool Search Tool: load tools just-in-time

    Brad explains how large toolsets can crowd out working context if fully loaded upfront. Tool Search Tool defers loading full tool definitions until the model actually needs them, reducing tokens and sometimes improving model focus.

    • Many production agents have tens/hundreds of tools
    • Without JIT loading, tool definitions consume too much context budget
    • Tool Search Tool keeps tools declared but loads only those needed for the current trajectory
    • Reported results: token reduction (~10%) and improved model behavior via less clutter
  10. 12:31 – 14:34

    Technique 2 — Programmatic tool calling: keep bulky tool outputs out of context

    Instead of stuffing massive tool outputs directly into the prompt, the model writes Python to inspect schemas and extract only the necessary fields. The full tool response stays in memory while only the minimal derived snippets enter context.

    • Problem: tool responses often return huge text blobs that “pollute” context
    • Solution: have the model write Python to parse/transform tool outputs
    • First pass inspects schema; later code selectively extracts needed values
    • Benefits: lower token usage, lower cost/latency, often better reasoning (e.g., parsing HTML)
  11. 14:34 – 15:35

    Technique 3 — Compaction: summarize stale turns to stay within context limits

    For long-running agents, even well-managed tools eventually hit context limits. Compaction summarizes older, no-longer-needed turns into a structured summary so the agent can continue without losing the thread.

    • Compaction is a “sledgehammer” for inevitably long conversations
    • Old turns are replaced by a concise summary that preserves key state and goals
    • Can trigger at configurable thresholds (not necessarily at max context)
    • Real-world usage example mentioned (Hex)
  12. 15:35 – 21:09

    Demo results: combining tool search, programmatic calling, and compaction

    They add all three context techniques at once and reload the app, showing context growth slows dramatically while calling the same tools. They then inspect each mechanism in the transcript to verify how it reduced tokens and cost.

    • Same tool calls and data sources, but far less context consumption via smarter inclusion
    • Tool Search Tool selects a small subset of relevant tools instead of hundreds
    • Programmatic tool calling shows visible code execution and selective extraction into context
    • Compaction triggers at a lower threshold to reduce cost/latency while preserving continuity
  13. 21:09 – 22:40

    Advisor strategy: Opus-level intelligence with a cheaper executor model

    Brad introduces a pairing approach: run most work on a smaller model (e.g., Sonnet/Haiku) and call a stronger model (Opus) only when needed for review or tough cases—mirroring junior/senior engineer workflows.

    • Goal: achieve strong reasoning at much lower average cost
    • Executor model handles routine tool calling/code; advisor model provides targeted guidance
    • Analogy: junior engineer executes; senior engineer reviews and unblocks hard parts
    • Used by customers (e.g., Bolt) to manage costs while retaining quality
  14. 22:40 – 26:13

    Demo: Sonnet executor + Opus advisor catches a critical missed detail

    Switching to “Sonnet + Opus as advisor” lowers cost while maintaining outcomes for high-stakes items. In the Metropolis renewal, Sonnet initially marks the deal on track, but Opus finds a buried requirement (cryothane), changing the plan and prompting action.

    • Model configuration change: Sonnet runs primary loop; Opus is invoked for advisory checks
    • Advisor is triggered for high-impact uncertainty rather than every step
    • Opus finds a key buried detail (cryothane requirement) missed by Sonnet
    • Agent output shifts to actionable intervention (lock in cryothane)
  15. 26:13 – 28:15

    Wrap-up: priorities + recent platform features (WIF and Ant CLI)

    Brad summarizes the practical order of operations: start with prompt caching, then do context engineering, then add the advisor strategy for on-demand intelligence. He closes by highlighting platform improvements like workload identity federation for security and the Ant CLI for automation via command line.

    • Do prompt caching first; it dominates cost/latency wins
    • Context engineering: control what enters context (tools, results, stale turns)
    • Advisor strategy: escalate to stronger reasoning only when needed
    • Workload Identity Federation reduces API key risk; Ant CLI enables console-like management via CLI (and works well with Claude Code)

Get more out of YouTube videos.

High quality summaries for YouTube videos. Accurate transcripts to search & find moments. Powered by ChatGPT & Claude AI.