Getting more out of the Claude Platform

Cut cost, manage context, boost intelligence. In this session, we'll show you how to put our latest platform capabilities to work. Through live demos you'll see what great prompt caching looks like, learn to keep context lean for long-running agents with tool search, programmatic tool calling, and compaction, and use the advisor strategy for a cost-effective intelligence boost.

May 21, 2026 · 26m

WHAT IT’S REALLY ABOUT

Practical platform techniques to cut agent costs and boost quality

Prompt caching can dramatically reduce agent costs (≈90% discount on cached tokens), improve latency, and effectively increase rate limits when cache hit rates are high.
Console analytics now help teams measure cache hit rate and diagnose cache breaks (often caused by small prompt changes like timestamps or reordered content).
Context engineering keeps long-running agents effective by limiting what enters the context window via tool search, programmatic tool calling, and conversation compaction.
Tool search reduces token bloat by deferring full tool schemas until needed, which can lower token usage and sometimes improve model performance by reducing irrelevant context.
An advisor strategy pairs a cheaper executor model (Sonnet/Haiku) with an Opus “advisor” invoked only for hard cases, preserving quality while lowering average cost.
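
The advisor strategy from the summary above can be sketched as a small routing function. This is a minimal illustration, not the platform's API: `Draft`, `run_with_advisor`, the `confident` flag, and the two stub functions are hypothetical names, and a real agent would replace the stubs with calls to a cheaper executor model and an Opus advisor.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Draft:
    answer: str
    confident: bool  # hypothetical signal: did the executor handle the task alone?


def run_with_advisor(
    task: str,
    executor: Callable[[str, Optional[str]], Draft],
    advisor: Callable[[str], str],
) -> Tuple[str, bool]:
    """Return (answer, escalated). The advisor is called only for hard cases."""
    draft = executor(task, None)
    if draft.confident:
        return draft.answer, False  # cheap path: no advisor call, no extra cost
    advice = advisor(task)  # expensive path, taken rarely
    return executor(task, advice).answer, True


# Stand-ins for real model calls, used only to make the sketch runnable.
def fake_executor(task: str, advice: Optional[str]) -> Draft:
    if "hard" in task and advice is None:
        return Draft(answer="", confident=False)
    suffix = f" (advice: {advice})" if advice else ""
    return Draft(answer=f"done: {task}{suffix}", confident=True)


def fake_advisor(task: str) -> str:
    return "split the problem into two steps"


if __name__ == "__main__":
    print(run_with_advisor("easy lookup", fake_executor, fake_advisor))
    print(run_with_advisor("hard refactor", fake_executor, fake_advisor))
```

The design choice is that the executor always produces the final answer and the advisor only contributes guidance, so the average cost tracks how often escalation happens.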

IDEAS WORTH REMEMBERING

5 ideas

Optimize for cache hit rate before anything else.

Caching reuses previously processed input tokens, yielding major cost savings and lower latency without changing output quality. Aim for a cache hit rate of roughly 80% or higher; top agent builders report 90% or more.
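
To see why hit rate matters so much, the arithmetic can be sketched as below. The only figure taken from the talk is the roughly 90% discount on cached tokens; the function name and the price of 3.0 per million tokens are placeholders, and the sketch ignores any extra charge for writing to the cache.

```python
def effective_input_cost(
    total_tokens: int,
    hit_rate: float,
    price_per_mtok: float,
    cached_discount: float = 0.90,  # ~90% discount on cached tokens, per the talk
) -> float:
    """Estimated input cost when a fraction `hit_rate` of tokens is read from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    cached_tokens = total_tokens * hit_rate
    uncached_tokens = total_tokens - cached_tokens
    cached_price = price_per_mtok * (1.0 - cached_discount)
    return (uncached_tokens * price_per_mtok + cached_tokens * cached_price) / 1_000_000


if __name__ == "__main__":
    # Placeholder price; substitute the real rate for the model in use.
    for rate in (0.0, 0.5, 0.8, 0.9):
        cost = effective_input_cost(1_000_000, rate, price_per_mtok=3.0)
        print(f"hit rate {rate:.0%}: {cost:.2f} per million input tokens")
```

Under these assumptions, moving from an 80% to a 90% hit rate cuts the input bill by about a third, which is why the last few points of hit rate are worth chasing.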

Treat cache breaks as a prompt hygiene problem.

Caching requires byte-for-byte identical cached segments; common mistakes like inserting timestamps or reordering system/tool text can drop hit rate to zero, and console analytics can pinpoint why.
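
A local check can catch this kind of cache break before deployment. The sketch below hashes the bytes of a prompt prefix to show how a timestamp or a reordered tool list changes it. The serialization format and function names are assumptions made for illustration, not how the platform computes its cache key.

```python
import hashlib
import json
from datetime import datetime, timezone


def serialize_prefix(system_text: str, tools: list) -> bytes:
    """Hypothetical serialization of the part of the prompt meant to be cached."""
    payload = {"system": system_text, "tools": tools}
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def fingerprint(prefix: bytes) -> str:
    return hashlib.sha256(prefix).hexdigest()


def unstable_prefix(tools: list, now: datetime) -> bytes:
    # Anti-pattern: a timestamp in the system text changes the bytes on every call.
    system = f"You are a support agent. Current time: {now.isoformat()}"
    return serialize_prefix(system, tools)


def stable_prefix(tools: list) -> bytes:
    # Fixed system text and a deterministic tool order keep the bytes identical.
    ordered = sorted(tools, key=lambda tool: tool["name"])
    return serialize_prefix("You are a support agent.", ordered)


if __name__ == "__main__":
    tools = [{"name": "search"}, {"name": "lookup"}]
    t1 = datetime(2026, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    t2 = datetime(2026, 1, 1, 12, 0, 1, tzinfo=timezone.utc)
    print(fingerprint(unstable_prefix(tools, t1)) == fingerprint(unstable_prefix(tools, t2)))
    print(fingerprint(stable_prefix(tools)) == fingerprint(stable_prefix(tools[::-1])))
```

If a value such as the current time is genuinely needed, one option is to place it after the cached segment, for example in the latest user message, so the prefix stays byte-identical.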

Always inspect agent transcripts to find context waste.

The talk repeatedly emphasizes reading what the model actually sees; transcripts reveal oversized tool definitions, noisy tool outputs, and conversation-history bloat that degrade performance and raise cost.
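
Reading transcripts by hand is the main advice, but a quick tally can show where to look first. The sketch below assumes a hypothetical transcript shape, a list of dicts with `kind` and `text` keys, and uses character counts as a rough stand-in for tokens.

```python
from collections import Counter


def context_breakdown(messages: list) -> dict:
    """Percentage of transcript characters per message kind, largest first."""
    totals = Counter()
    for message in messages:
        totals[message["kind"]] += len(message["text"])
    grand_total = sum(totals.values()) or 1  # avoid dividing by zero on empty input
    return {
        kind: round(100 * count / grand_total, 1)
        for kind, count in totals.most_common()
    }


if __name__ == "__main__":
    # Hypothetical transcript; the kinds and sizes are invented for the demo.
    transcript = [
        {"kind": "tool_definition", "text": "x" * 600},
        {"kind": "tool_result", "text": "y" * 300},
        {"kind": "user", "text": "z" * 100},
    ]
    for kind, share in context_breakdown(transcript).items():
        print(f"{kind}: {share}%")
```

A category that dominates the breakdown, such as tool definitions or raw tool results, is a natural candidate for tool search, output curation, or compaction.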

Use tool search to avoid stuffing dozens of tool schemas into every turn.

Instead of passing all tools upfront, pass a “tool search” tool and inject only the required tool definition when needed, freeing context for real work; Lovable reported ~10% token reduction plus better performance.
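
The idea can be sketched locally with a registry and a keyword matcher. Everything here is hypothetical: the registry contents, the `search_tools` name, and the word-overlap scoring are stand-ins for whatever retrieval the platform's tool search actually performs.

```python
import re

# Hypothetical registry: full schemas live here and are not sent on every turn.
TOOL_REGISTRY = {
    "issue_refund": {
        "description": "Refund a paid order to the original payment method.",
        "input_schema": {"type": "object", "properties": {"order_id": {"type": "string"}}},
    },
    "get_weather": {
        "description": "Look up the current weather forecast for a city.",
        "input_schema": {"type": "object", "properties": {"city": {"type": "string"}}},
    },
    "create_ticket": {
        "description": "Open a support ticket for a customer issue.",
        "input_schema": {"type": "object", "properties": {"summary": {"type": "string"}}},
    },
}


def _words(text: str) -> set:
    return set(re.findall(r"[a-z]+", text.lower()))


def search_tools(query: str, registry: dict, limit: int = 3) -> list:
    """Return full definitions only for tools whose name or description matches."""
    query_words = _words(query)
    scored = []
    for name, spec in registry.items():
        overlap = len(query_words & _words(f"{name} {spec['description']}"))
        if overlap:
            scored.append((overlap, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best match first, then by name
    return [{"name": name, **registry[name]} for _, name in scored[:limit]]


if __name__ == "__main__":
    for tool in search_tools("refund order", TOOL_REGISTRY):
        print(tool["name"], tool["input_schema"])
```

Only the matching definition enters the context, so the per-turn cost of the tool catalog stays roughly constant as the registry grows.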

Programmatically curate tool outputs before sending them back to the model.

Have Claude write lightweight scripts (e.g., Python) that call tools, strip irrelevant content (like long call transcripts), and return only the needed summaries or fields. Quora used this to clean HTML and improve quality.
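
A script of the kind described might look like the following. It is a minimal sketch using only the standard library, with `curate_html` and the `max_chars` cap as illustrative choices, and it is not the code Quora used.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and ignores the contents of script and style tags."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self._skip_depth:
            self.chunks.append(text)


def curate_html(raw_html: str, max_chars: int = 2000) -> str:
    """Reduce a raw HTML tool result to plain text, capped at max_chars."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.chunks)[:max_chars]


if __name__ == "__main__":
    page = (
        "<html><head><style>p{color:red}</style></head><body>"
        "<h1>Title</h1><script>var x=1;</script><p>Hello <b>world</b></p>"
        "</body></html>"
    )
    print(curate_html(page))
```

The model then sees a short string instead of the full markup, which is the same principle as dropping a long call transcript and returning only the fields the task needs.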

WORDS WORTH SAVING

5 quotes

If you remember nothing else from this session, think about prompt caching.

— Puneet Shah

You get a ninety percent discount, so a huge cost savings to actually build your p- uh, your agent.

— Puneet Shah

You should always look at the transcript of your agents to really understand what's going on.

— Puneet Shah

Context engineering is the art and science of figuring out what context you expose to Claude to give your agent the best performance.

— Puneet Shah

It's, uh, green on the outside, but deep, deep red on the inside.

— Puneet Shah

Prompt caching mechanics and benefits
Cache hit rate targets and diagnostics
Transcript inspection for debugging agents
Tool search (dynamic tool schema injection)
Programmatic tool calling and result curation
Compaction for long-running conversations
Advisor strategy (executor + high-end advisor model)

High quality AI-generated summary created from speaker-labeled transcript.
