Skip to content
ClaudeClaude

Getting more out of the Claude Platform

Cut cost, manage context, boost intelligence. In this session, we'll show you how to put our latest platform capabilities to work. Through live demos you'll see what great prompt caching looks like, learn to keep context lean for long-running agents with tool search, programmatic tool calling, and compaction, and use the advisor strategy for a cost-effective intelligence boost.

May 22, 202626mWatch on YouTube ↗

CHAPTERS

  1. Why the Claude Platform layer matters for production agents

    Puneet Shah frames the session as practical guidance for building real products and businesses on top of Claude—not just accessing strong models. He quickly surveys the room to highlight the gap between building agents and shipping high-quality, cost-effective ones.

  2. Prompt caching: what it is and why it’s the #1 lever

    He explains prompt caching as reusing processed input tokens across conversation turns so only new tokens are reprocessed. This yields major savings and performance improvements without changing output quality.

  3. Measuring and debugging cache hit rate in the console

    Puneet emphasizes that optimization starts with visibility into cache hit rate. He highlights new console analytics and a recent debugging feature that shows why caches break.

  4. Live demo setup: Hero Corp dashboard and the “always read transcripts” rule

    Using a playful “superheroes for hire” company, the demo illustrates real agent plumbing: aggregating data from many sources and inspecting agent behavior via transcripts. Puneet repeats a core operational tip: always study transcripts to understand what the model saw and did.

  5. Prompt caching in practice: from 0% to meaningful savings

    The demo reveals a 0% cache hit rate, then implements prompt caching to prove savings without changing results. The transcript shows explicit cache write and cache hit token counts, and the cost drops significantly.

  6. When the context window gets clogged: introducing context engineering

    The dashboard run hits context limits despite a large window, motivating “context engineering”—deciding what to include so the model performs well. Puneet outlines three techniques: limit tool definitions, curate tool outputs, and keep long conversations going with compaction.

  7. Tool search: only load tool schemas when needed

    Tool-heavy agents can waste context by stuffing dozens/hundreds of tool definitions into every prompt. Tool search lets the model request relevant tools first, then only injects the selected tool schemas into context, improving both token efficiency and model performance.

  8. Programmatic tool calling: sanitize and summarize tool outputs before sending

    Even after selecting the right tools, raw tool outputs can be huge and noisy (e.g., long call transcripts). The approach: have Claude write a small script (e.g., Python) that calls tools, extracts only relevant fields, and passes a curated summary back into the model context.

  9. Compaction: continue long agent runs past context limits

    For agents that work for hours, context eventually fills even with very large windows. Compaction summarizes and compresses conversation state using a custom prompt, drops low-relevance turns, and lets the run continue—creating an “almost unlimited context” experience.

  10. Context engineering results in the demo: slower context growth and lower cost

    Returning to the demo, the context bar grows more slowly due to reduced tool/schema and curated outputs. When the threshold is reached, compaction visibly reduces the context load and the run continues, driving cost down further.

  11. Advisor strategy: cheaper executor model + high-intelligence advisor on demand

    To reduce cost without losing intelligence, Puneet introduces an “advisor strategy”: run the main agent on a cheaper model (Sonnet/Haiku) and call a stronger model (Opus) only when needed. This mirrors senior/junior engineering pairing—low overhead on easy tasks, big help on hard ones.

  12. Advisor strategy in action: catching the ‘watermelon’ status risk

    In the demo, Sonnet thinks an important renewal is fine, but the task is escalated to the Opus advisor for deeper review. Opus identifies a hidden blocker—turning a seemingly green project red—and overrides the executor’s conclusion.

  13. Wrap-up: the optimization playbook and keeping up with fast platform releases

    Puneet recaps the practical checklist: implement prompt caching, engineer context (tools, tool outputs, compaction), and use advisor strategy for cost-effective intelligence. He closes by emphasizing rapid platform iteration and highlights recent launches like auto prompt caching and AWS availability.

Get more out of YouTube videos.

High quality summaries for YouTube videos. Accurate transcripts to search & find moments. Powered by ChatGPT & Claude AI.