Picking the right model

Name: Picking the right model
Uploaded: 2026-05-21T00:00:00Z
Duration: 31 min 39 s
Description: Public benchmarks and online hot takes are only directionally useful and rarely match real production workloads.

Hands-on techniques for testing and comparing models against your use case, so you can make a confident call each time a new release ships.

May 21, 202631mWatch on YouTube ↗

CHAPTERS

0:16 – 1:48
Why “pick the right model” is harder than it sounds (and why you need a repeatable process)
Lucas frames the real-world confusion that follows any new model launch: benchmarks, model cards, prompting guides, and social-media hot takes don’t directly answer what matters for your product. The goal is a repeatable, data-driven yes/no process for deciding whether to adopt a new model or which model to start with.
- •Model launches generate noise that doesn’t map cleanly to your business needs
- •Key question: will swapping models improve your specific use case?
- •Need a repeatable decision process rather than relying on vibes
- •Target outcome: a clear, defensible model-selection decision
- •The solution direction: build an evaluation (eval) tailored to your workload
1:48 – 3:19
The practical model-choice framework: quality, latency, and cost (plus thinking/effort complexity)
The talk starts from the simple heuristic (Opus for intelligence, Haiku for speed/cost, Sonnet for balance) and then complicates it with “thinking” and “effort” settings. Lucas introduces the core three-pillar framework—quality, latency, cost—as the basis for making the decision systematic.
- •Naive heuristic: Opus (quality), Haiku (low latency/cost), Sonnet (balance)
- •Thinking/effort settings create more combinations than ‘just pick a model’
- •Many teams compare across providers, not just within Anthropic models
- •Three selection pillars: quality, latency, cost
- •These pillars become the dimensions your eval must measure
3:19 – 4:20
Three big takeaways to guide model selection (private evals, outcome economics, tunable knobs)
Lucas previews the talk’s core lessons: public benchmarks are less useful than your own small eval, cost should be measured per successful outcome, and you can tune multiple levers (not just model choice) to move along—or shift—the performance/cost frontier.
- •A small, well-designed private eval beats public benchmarks for decision-making
- •Optimize for cheapest per successful outcome, not cheapest per token
- •There are multiple control knobs beyond model choice
- •You can move along the cost–quality frontier with finer granularity
- •Some strategies can shift the frontier entirely (not just trade off)
4:20 – 5:21
Why public benchmarks don’t predict your production workload
Benchmarks like SWE-bench Verified or BrowseComp can be directionally helpful but rarely match production reality. Real systems combine multiple skills (research + coding + tool use) and often involve different languages, constraints, and heterogeneity that single benchmarks miss.
- •Benchmarks provide directional signal (e.g., “better at coding”)
- •Production tasks are heterogeneous and cross benchmark categories
- •Agent workflows often require research + implementation together
- •Your language/tools/domain may not be represented in standard benchmarks
- •Therefore: you need bespoke evals tied to your real tasks
5:21 – 6:52
How to build an eval: tasks as atomic units and the “math exam” analogy
Lucas explains eval design: define tasks with inputs and success criteria, then assemble a dataset. Like a math exam, you must assess not only the final answer but also the reasoning/process (‘show your work’), especially for agentic workflows.
- •An eval is a dataset of tasks (inputs + success criteria)
- •Tasks are the atomic unit of measurement
- •Assess both final outcome and intermediate steps
- •The ‘working’ matters especially for agentic/tool-using systems
- •Good evals require upfront definition of ‘correct’ behavior
6:52 – 8:53
Grading with LLM judges + deterministic checks: evaluating outcomes and tool behavior
Using a customer service agent example, Lucas shows how to combine LLM-as-judge grading (robust to superficial differences) with deterministic, code-based assertions. This layered approach checks correctness, tool usage, query structure, and required arguments.
- •LLM-as-judge can evaluate response quality against expectations
- •LLM-as-judge can validate tool actions (e.g., SQL equivalence) despite variance
- •Deterministic graders enforce invariants (must call tools, must pass args)
- •Per-task graders add coverage across multiple failure modes
- •Building these graders is high-leverage engineering work
8:53 – 11:24
Common eval failure modes: variance, infrastructure issues, and stale/non-representative datasets
Lucas highlights three gotchas: confusing noise for signal (run multiple times, watch variance), infra/tool failures masquerading as model failures (inspect transcripts), and “silent saturation” when your eval stops reflecting production inputs (close the loop with real traces).
- •Mistaking noise for signal: re-run tasks and check variance
- •High variance may indicate ill-defined tasks or misaligned graders
- •Infra/tool/API failures should be separated from model capability
- •Transcript inspection helps attribute failures correctly
- •Continuously refresh evals with production traces to avoid silent saturation
11:24 – 13:57
Model-specific quirks, prompt updates, and why transcript observability is essential
Even similar model variants can behave differently (e.g., tool under-triggering vs over-triggering), so prompts often need adjustments guided by model-specific prompting notes. Lucas emphasizes setting up tracing/observability and routinely reading transcripts to understand real agent behavior and avoid misleading metrics.
- •Different model versions can change tool-calling behavior significantly
- •Read prompting guides; consider having Claude help update prompts
- •Make transcript review a core debugging workflow
- •Use observability/tracing tools (e.g., LangSmith, BrainTrust)
- •Example: eval looked great until transcripts revealed leakage via Git history
13:57 – 15:29
Moving along the frontier: a latency surprise and the role of turns, planning, and thinking
A code-fix pipeline example shows that a smaller model isn’t always faster end-to-end. More capable models can finish in fewer turns with less validation/research, sometimes reducing total time even if per-token costs are higher.
- •Baseline: Haiku no-thinking scores 92%; thinking enabled reaches 100%
- •Sonnet/Opus also reach 100% and can be faster overall
- •More capable models may need fewer turns and less rework
- •End-to-end latency depends on workflow efficiency, not just model size
- •Use configs (thinking/effort) to find best fit for your constraints
15:29 – 17:01
Understanding “thinking” vs “effort”: adaptive scratchpad and output/work budgeting
Lucas clarifies the knobs: adaptive thinking lets the model decide how much to reason before acting, while effort controls how much the model writes/does across thinking, tool calls, and responses. These parameters provide fine-grained control over accuracy, cost, and latency beyond picking a model.
- •Adaptive thinking: model decides when/how much to think (System 2)
- •Effort controls work budget across thinking, tool calls, and responses
- •You can mix settings (e.g., low thinking + high effort)
- •These knobs let you tune position on the cost–accuracy curve
- •They add flexibility beyond a single “model choice” decision
17:01 – 18:01
Non-intuitive token efficiency: stronger models can use fewer tokens; effort tunes trade-offs
Data from Opus vs Sonnet illustrates that higher-end models can achieve higher accuracy with fewer output tokens, contradicting “smaller is faster” intuition. Effort settings create a visible spread that lets you select accuracy vs token/latency trade-offs more precisely.
- •Opus can complete tasks with fewer output tokens than Sonnet in some settings
- •Choosing by ‘vibes’ (smaller = faster) can lead to suboptimal picks
- •Effort levels create a controllable accuracy vs token/latency spectrum
- •Higher effort typically improves accuracy at the cost of more output/work
- •Run your own eval to reveal these non-obvious properties
18:01 – 21:04
Shifting the frontier #1: prompt caching for big cost reductions (and how not to break it)
Prompt caching can cut input token costs to one-tenth for cached prefixes, enabling ‘Opus-quality’ within lower budgets. Lucas shares practical guidance—aim for high cache hit rates, measure them via token metrics, and use append-only messaging to prevent cache breaks (avoid dynamic timestamps in prefixes).
- •Prompt caching: cached prefix input tokens billed at ~1/10 list price
- •Enables higher-quality models within the same budget envelope
- •Top systems can reach ~80–90% cache hit rates
- •APIs/SDKs expose cache token metrics so you can measure and optimize
- •Append-only message strategy; avoid dynamic variables (e.g., timestamps) in cached prefixes
21:04 – 25:39
Shifting the frontier #2: context hygiene/engineering to reduce tokens and improve accuracy
Rather than complex orchestration, Lucas argues many teams get outsized gains from cleaning and compressing tool outputs. Examples show large token reductions (Markdown vs JSON, simplified timestamps, deduping search results) that reduce cost/latency and can even raise accuracy by shrinking reasoning load.
- •Focus on making tool outputs concise, readable, and relevant
- •Format changes (e.g., Markdown) and simplified fields reduce token count
- •Example: sports tool response cleanup yields ~66% token reduction
- •Example: deduping search results cuts input tokens ~77%, cost ~65%, accuracy +9%
- •Cleaner context reduces reasoning burden and improves response quality
25:39 – 31:39
Workshop: sweeping models, thinking, and effort on TauBench and interpreting the Pareto trade-offs
Lucas introduces a hands-on workshop tool/skill that instruments an eval to run across multiple models and configurations, then plots results. Using TauBench (airline customer service), he highlights surprising results: Opus may deliver better pass rates with fewer tokens and sometimes lower latency than Sonnet, but at higher cost—helping teams choose based on what they value.
- •Workshop skill audits/instruments evals and runs sweeps across configs
- •Runs across models (Haiku/Sonnet/Opus), thinking on/off, multiple effort levels
- •Produces plots for pass rate vs tokens, cost, and latency
- •Findings: Opus high thinking/effort best pass rate; can use fewer tokens than Sonnet
- •Decision-making becomes data-driven: pick based on your priority (quality/cost/latency)

Get more out of YouTube videos.

High quality summaries for YouTube videos. Accurate transcripts to search & find moments. Powered by ChatGPT & Claude AI.

iOS

Android

Claude

Chrome

Why “pick the right model” is harder than it sounds (and why you need a repeatable process)

The practical model-choice framework: quality, latency, and cost (plus thinking/effort complexity)

Three big takeaways to guide model selection (private evals, outcome economics, tunable knobs)

Why public benchmarks don’t predict your production workload

How to build an eval: tasks as atomic units and the “math exam” analogy

Grading with LLM judges + deterministic checks: evaluating outcomes and tool behavior

Common eval failure modes: variance, infrastructure issues, and stale/non-representative datasets

Model-specific quirks, prompt updates, and why transcript observability is essential

Moving along the frontier: a latency surprise and the role of turns, planning, and thinking

Understanding “thinking” vs “effort”: adaptive scratchpad and output/work budgeting

Non-intuitive token efficiency: stronger models can use fewer tokens; effort tunes trade-offs

Shifting the frontier #1: prompt caching for big cost reductions (and how not to break it)

Shifting the frontier #2: context hygiene/engineering to reduce tokens and improve accuracy

Workshop: sweeping models, thinking, and effort on TauBench and interpreting the Pareto trade-offs

Get more out of YouTube videos.