Claude

Picking the right model

Hands-on techniques for testing and comparing models against your use case, so you can make a confident call each time a new release ships.

May 21, 2026 · 31 min

CHAPTERS

  1. Why “picking the right model” is harder than it looks

    Lucas frames the real problem: model launches generate lots of benchmarks and hot takes, but none directly answer whether a new model will improve your specific product. The goal is a repeatable, data-driven process that yields a clear yes/no decision for switching or starting with a model.

  2. The three pillars: quality, latency, and cost (plus model “knobs”)

    He introduces the common decision framework: quality, latency, and cost. While the simple rule of thumb is Opus for intelligence, Haiku for speed/cost, and Sonnet for balance, real decisions also depend on configuration settings like thinking and effort.

  3. Three takeaways to guide model selection

    Lucas previews the talk’s key lessons: private evals beat public benchmarks; optimize for cost per successful outcome, not cost per token; and use configuration knobs to move along the cost/quality frontier or shift it.

  4. What public benchmarks do—and don’t—tell you

    Benchmarks like SWE-bench Verified and BrowseComp can be directional signals, but they rarely match real production workflows. Most real tasks are heterogeneous (e.g., research + coding), so your organization must measure performance on its own distribution.

  5. How to build an eval: tasks, success criteria, and “show your work”

    He outlines an eval as a dataset of atomic tasks, each with inputs and success criteria. For agentic systems, evaluating intermediate steps (“working”) is as important as the final answer.

  6. Grading approaches: LLM-as-judge and deterministic checks

    Lucas explains how graders can be implemented using both LLM judges and deterministic (code-based) checks. This hybrid approach handles flexible outputs (like varying SQL syntax) while enforcing strict invariants (like required tool usage).

  7. Common eval failure modes: noise, infra issues, and dataset saturation

    He highlights three common pitfalls when teams interpret eval results. Variance may indicate poorly defined tasks, infra/tool failures can masquerade as model failures, and unrepresentative datasets can silently cap progress.

  8. Model-specific quirks and the importance of reading transcripts

    Even small model/version changes can alter behaviors like tool triggering, requiring prompt adjustments. Lucas emphasizes making transcript review easy via observability tooling, so that hidden issues and useful debugging signals surface quickly.

  9. A cautionary tale: misleading eval wins without transcript inspection

    He describes an internal eval where performance looked dramatically better until transcript review revealed the model was pulling answers from Git history. The lesson: headline metrics can be gamed or confounded without careful trace analysis.

  10. Trading off cost, quality, and latency with thinking and effort

    Lucas introduces “thinking” (scratchpad/system-2 reasoning) and “effort” (how much work the model puts in across thinking, tool calls, and responses). These controls enable finer optimization than model choice alone and can produce non-intuitive latency/cost outcomes.

  11. Counterintuitive results: smarter models can be faster and more token-efficient

    Through internal examples, he shows that higher-end models sometimes complete tasks faster by taking fewer turns and producing fewer tokens, despite being “bigger.” This reinforces the need to measure on your own workloads instead of relying on intuition.

  12. Shifting the frontier with prompt caching (big cost lever)

    Prompt caching can cut the effective input-token cost of cached prefixes to one-tenth of the uncached rate, enabling “Opus quality at Sonnet cost” in many cases. Lucas stresses measuring cache hit rates and designing message flows to preserve cacheability.

  13. Context hygiene / context engineering: simplify tool outputs to save tokens and boost accuracy

    He argues teams over-invest in complex orchestration and under-invest in making context clean and efficient. By reformatting tool outputs and deduplicating data before sending it back to the model, teams can cut tokens dramatically and often improve accuracy.

  14. Workshop walkthrough: sweeping models/configs and visualizing results

    Lucas transitions to a hands-on workshop that instruments an existing eval to run across models, thinking toggles, and effort levels, then plots results. Using TauBench (airline customer service), the sweep reveals trade-offs and some surprising efficiency/latency relationships.

  15. Closing guidance: measure success, pick by outcomes, and use the knobs

    He closes by reiterating the recipe for choosing the right model: build private evals, optimize for outcome-based economics, and apply caching/context improvements plus thinking/effort controls to reach the best point on (or beyond) the frontier.
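
One way to make the outcome-based economics from chapters 3, 12, and 15 concrete is the minimal Python sketch below. It compares two configurations by cost per successful task, with cached input tokens billed at one-tenth of the normal input rate as described in chapter 12. Everything else is an assumption for illustration: the `RunStats` fields, the model names, the token counts, and the prices are hypothetical placeholders, not figures from the talk, and the sketch ignores any surcharge for writing to the cache.

```python
from dataclasses import dataclass

# Cached prefix tokens are billed at one-tenth of the input rate (per the summary above).
CACHED_INPUT_DISCOUNT = 0.1


@dataclass
class RunStats:
    """Aggregate results for one model/config on a private eval (hypothetical values)."""
    name: str
    tasks: int
    successes: int
    input_tokens: int        # total input tokens across all tasks
    output_tokens: int       # total output tokens across all tasks
    cached_fraction: float   # share of input tokens served from the prompt cache (0..1)
    input_price: float       # USD per million input tokens (placeholder, not a real price)
    output_price: float      # USD per million output tokens (placeholder, not a real price)


def total_cost(run: RunStats) -> float:
    """Total USD spent on the run, applying the cached-input discount."""
    uncached = run.input_tokens * (1 - run.cached_fraction)
    cached = run.input_tokens * run.cached_fraction * CACHED_INPUT_DISCOUNT
    input_cost = (uncached + cached) * run.input_price
    output_cost = run.output_tokens * run.output_price
    return (input_cost + output_cost) / 1_000_000


def cost_per_success(run: RunStats) -> float:
    """USD per successfully completed task; infinite if nothing succeeded."""
    if run.successes == 0:
        return float("inf")
    return total_cost(run) / run.successes


# Hypothetical sweep results: a cheaper model without caching vs. a pricier one with caching.
runs = [
    RunStats("small-model", 100, 60, 2_000_000, 500_000, 0.0, 1.0, 5.0),
    RunStats("large-model", 100, 90, 1_500_000, 300_000, 0.8, 5.0, 25.0),
]

for run in sorted(runs, key=cost_per_success):
    print(f"{run.name}: success rate {run.successes / run.tasks:.0%}, "
          f"cost per success ${cost_per_success(run):.4f}")
```

Ranking by `cost_per_success` rather than by per-token price is the point: which configuration wins depends on the measured success rate, token usage, and cache hit rate from your own eval, so the placeholder numbers should be replaced with real measurements.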
