At a glance
WHAT IT’S REALLY ABOUT
Practical platform techniques to cut agent costs and boost quality
- Prompt caching can dramatically reduce agent costs (≈90% discount on cached tokens), improve latency, and effectively increase rate limits when cache hit rates are high.
- Console analytics now help teams measure cache hit rate and diagnose cache breaks (often caused by small prompt changes like timestamps or reordered content).
- Context engineering keeps long-running agents effective by limiting what enters the context window via tool search, programmatic tool calling, and conversation compaction.
- Tool search reduces token bloat by deferring full tool schemas until needed, which can lower token usage and sometimes improve model performance by reducing irrelevant context.
- An advisor strategy pairs a cheaper executor model (Sonnet/Haiku) with an Opus “advisor” invoked only for hard cases, preserving quality while lowering average cost.
IDEAS WORTH REMEMBERING
5 ideasOptimize for cache hit rate before anything else.
Caching reuses previously processed input tokens, yielding major cost savings and lower latency without changing output quality; aim for ~80%+ and note top agent builders report 90%+.
Treat cache breaks as a prompt hygiene problem.
Caching requires byte-for-byte identical cached segments; common mistakes like inserting timestamps or reordering system/tool text can drop hit rate to zero, and console analytics can pinpoint why.
Always inspect agent transcripts to find context waste.
The talk repeatedly emphasizes reading what the model actually sees; transcripts reveal oversized tool definitions, noisy tool outputs, and conversation-history bloat that degrade performance and raise cost.
Use tool search to avoid stuffing tens of tool schemas into every turn.
Instead of passing all tools upfront, pass a “tool search” tool and inject only the required tool definition when needed, freeing context for real work; Lovable reported ~10% token reduction plus better performance.
Programmatically curate tool outputs before sending them back to the model.
Have Claude write lightweight scripts (e.g., Python) that call tools, strip irrelevant content (like long call transcripts), and return only the needed summaries/fields—Quora used this to clean HTML and improve quality.
WORDS WORTH SAVING
5 quotesIf you remember nothing else from this session, think about prompt caching.
— Puneet Shah
You get a ninety percent discount, so a huge cost savings to actually build your p- uh, your agent.
— Puneet Shah
You should always look at the transcript of your agents to really understand what's going on.
— Puneet Shah
Context engineering is the art and science of figuring out what context you expose to Claude to give your agent the best performance.
— Puneet Shah
It's, uh, green on the outside, but deep, deep red on the inside.
— Puneet Shah
High quality AI-generated summary created from speaker-labeled transcript.
