Getting more out of the Claude Platform

Cut cost, manage context, boost intelligence. In this session, we'll show you how to put our latest platform capabilities to work. Through live demos you'll see what great prompt caching looks like, learn to keep context lean for long-running agents with tool search, programmatic tool calling, and compaction, and use the advisor strategy for a cost-effective intelligence boost.

May 22, 202626mWatch on YouTube ↗

CHAPTERS

0:18 – 2:04
Why the Claude Platform matters for production agents (and quick audience poll)
Puneet Shah frames the session as being about the Claude Platform layer that helps turn strong models into real products and businesses. He gauges the audience’s agent-building maturity—who has built agents, shipped them to production, and is satisfied with quality/cost/speed—setting up the talk’s focus on practical optimization.
- •Platform focus: beyond intelligence to production readiness
- •Audience poll: built agents vs. shipped to production vs. happy with results
- •Sets goal: share hard-won lessons for getting more out of Claude in real deployments
2:04 – 3:34
Prompt caching: what it is and why it’s the highest-leverage optimization
He introduces prompt caching as reusing processed input tokens across turns so only new tokens are processed. The key value is large cost savings, effective rate-limit boosts, and lower latency as conversations get long.
- •Caches processed input tokens before output generation, reuses across conversation
- •Benefits: ~90% discount on cached tokens, higher effective rate limits, faster time-to-first-token
- •Practical target: aim for ~80%+ hit rate for agentic apps
3:34 – 5:06
Benchmarking cache hit rate and using console analytics to debug cache breaks
Puneet points to leading customers achieving 90%+ cache hit rates and stresses measuring your own hit rate first. Anthropic’s console now surfaces prompt caching analytics, including diagnostics for why a cache broke (often due to subtle prompt changes).
- •High performers (e.g., Replit/Cursor/Perplexity/Claude Code) invest heavily to reach 90%+
- •Console analytics now show cache hit rates alongside cost/usage
- •Debugging cache breaks: prompt ordering and token stability matter (e.g., timestamps in system prompt break caching)
5:06 – 5:37
Getting started fast: auto caching and Claude Code skills to improve caching
He reassures teams starting at 0% hit rate and suggests quick paths to improvement. A minimal code change via auto caching can get basic benefits, and Claude Code/skills can guide reordering and prompt management for better cache performance.
- •If hit rate is 0%, start simple—this is common
- •Auto caching can be enabled with a one-line change for baseline caching
- •Use Claude Code / Claude API skill to get guided improvements to cache hit rate
5:37 – 7:38
Hero Corp demo setup: why transcripts are essential for diagnosing agents
The talk shifts into a playful “Hero Corp” dashboard demo, emphasizing a serious lesson: always inspect agent transcripts to understand what the model sees and why it behaves a certain way. The dashboard aggregates many sources, illustrating typical production complexity.
- •Demo premise: superhero-for-hire business runs on dashboards and OKRs
- •Agent pulls from multiple systems (web, Slack, Gong, Jira) to build a holistic view
- •Core practice: always review transcripts to debug context, tool usage, and behavior
7:38 – 9:09
Live win: enabling prompt caching drops cost without changing outputs
They discover the dashboard agent has a 0% cache hit rate and then implement prompt caching. The output remains identical (no intelligence change), while the transcript shows cache writes/hits and cost drops substantially due to reuse.
- •Before: prompt cache hit rate is 0%—wasted reprocessing
- •After: cache write + cache hit shown in transcript (e.g., 172 tokens)
- •Cost reduction example: roughly halves cost at ~58% cache hit rate; outputs unchanged
9:09 – 10:10
Context engineering: avoiding context-window failure with smarter context selection
The demo hits the million-token limit, motivating “context engineering” as the craft of choosing what to show the model. Puneet outlines three main levers: tool selection, tool-result selection, and long-run conversation management.
- •Hitting large context windows is inevitable for long-running agents
- •Context engineering = deciding what context to expose for best performance
- •Three techniques: tool search, programmatic tool calling, and compaction
10:10 – 12:12
Tool search: only load tool definitions into context when needed
With many tools, passing all schemas up front quickly consumes context. Tool search keeps a lightweight “tool finder” available, then injects only the relevant tool definitions when the model asks for them, improving both token efficiency and performance.
- •Problem: dozens/hundreds of tools bloat context and reduce room for reasoning/work
- •Solution: model calls a tool-search step to retrieve the right tool list first
- •Only selected tool definitions are inserted, reducing token usage (e.g., Lovable saw ~10% drop and better performance)
12:12 – 13:12
Programmatic tool calling: curate tool outputs before sending back to the model
Puneet describes using Claude’s coding ability to write scripts that call tools and filter results to only what matters. This prevents huge, noisy outputs (like long transcripts or HTML) from flooding context and degrading performance.
- •Have Claude write a small script (e.g., Python) to fetch and preprocess tool data
- •Strip irrelevant data (e.g., HTML cleanup, transcript reduction) before model sees it
- •Example: summarize “sentiment/vibes” instead of injecting full Gong call transcripts; Quora improved performance by trimming content
13:12 – 14:14
Compaction: summarize and continue for ‘almost unlimited’ long-running agents
For agents that work for hours, compaction prevents hard stops when context fills. It summarizes and retains key facts using a custom prompt, drops no-longer-relevant turns, and continues—repeating as needed to maintain continuity.
- •Enables continued operation past context limit by summarizing and pruning history
- •Custom compaction prompt helps preserve critical facts and direction
- •Operational pattern: set thresholds, compact, continue; Hex simplified code and maintained strong performance
14:14 – 17:16
Demo walkthrough: tool search + curated results + compaction threshold in action
Back in the dashboard, the context growth slows due to reduced tool/schema and output bloat. They show tool search retrieving only the needed tool (with large schemas), programmatic calling extracting only aggregate sentiment, and compaction triggering at ~400K to reset context size.
- •Context bar rises slower with less injected tool and output content
- •Tool search example: selects ‘hero retention metrics’ and injects only that tool definition (large schemas avoided)
- •Programmatic calling example: reduce long Gong transcripts to aggregate sentiment
- •Compaction example: threshold set around 400K; context drops after compaction run
17:16 – 19:17
Choosing a compaction threshold: balancing intelligence, cost, and latency
Puneet explains that while a million-token window is valuable, many apps benefit from compacting earlier for better cost/latency tradeoffs. A 400–500K threshold can be a practical starting point depending on model and workload.
- •Million-token context is powerful, but not always optimal for cost/latency
- •Pick a threshold that matches your use case (often ~400–500K to start)
- •Compaction prompt design matters: keep key facts, remove irrelevant turns, preserve trajectory
19:17 – 21:19
Advisor strategy: pair cheaper ‘executor’ models with an Opus advisor for hard cases
To reduce cost further without losing intelligence, he introduces the advisor strategy: run most work on Sonnet/Haiku and consult Opus only when needed. This mimics senior/junior engineering dynamics—keeping the cheap model “hands on keyboard” while the advisor improves decisions on complex tasks.
- •Executor (Sonnet/Haiku) handles routine steps; advisor (Opus) consulted on tricky parts
- •Analogy: senior engineer mentoring junior engineer to reach near-senior outcomes
- •Customer example: Bolt saw better architectural decisions on complex tasks with minimal overhead on easy tasks
21:19 – 26:40
Demo validation and final takeaways: lower cost, preserved quality, and what to track
They switch the dashboard to Sonnet with an Opus advisor, lowering cost while checking quality on a critical renewal decision—Opus catches a hidden risk Sonnet missed. Puneet closes by summarizing the practical playbook (prompt caching, context engineering, advisor strategy) and highlights the platform’s rapid pace of new launches.
- •Advisor demo: Opus reviews transcript and flags a ‘watermelon’ risk Sonnet marked green
- •Optimization recap: caching → context engineering (tools, results, compaction) → advisor strategy
- •Cost narrative: big reduction from initial run to ~£11 and lower with model switching
- •Keep up with platform launches: automatic prompt caching and Claude Platform on AWS among highlights

Get more out of YouTube videos.

High quality summaries for YouTube videos. Accurate transcripts to search & find moments. Powered by ChatGPT & Claude AI.

iOS

Android

Claude

Chrome

Why the Claude Platform matters for production agents (and quick audience poll)

Prompt caching: what it is and why it’s the highest-leverage optimization

Benchmarking cache hit rate and using console analytics to debug cache breaks

Getting started fast: auto caching and Claude Code skills to improve caching

Hero Corp demo setup: why transcripts are essential for diagnosing agents

Live win: enabling prompt caching drops cost without changing outputs

Context engineering: avoiding context-window failure with smarter context selection

Tool search: only load tool definitions into context when needed

Programmatic tool calling: curate tool outputs before sending back to the model

Compaction: summarize and continue for ‘almost unlimited’ long-running agents

Demo walkthrough: tool search + curated results + compaction threshold in action

Choosing a compaction threshold: balancing intelligence, cost, and latency

Advisor strategy: pair cheaper ‘executor’ models with an Opus advisor for hard cases

Demo validation and final takeaways: lower cost, preserved quality, and what to track

Get more out of YouTube videos.