Picking the right model

Hands-on techniques for testing and comparing models against your use case, so you can make a confident call each time a new release ships.

May 20, 2026 · 31m

WHAT IT’S REALLY ABOUT

Choose AI models using private evals, not public benchmarks alone

Public benchmarks and online hot takes are only directionally useful and rarely match real production workloads.
A small, well-designed private eval (tasks + success criteria + graders) gives a repeatable yes/no decision process for model selection.
The right choice is the model/configuration that is cheapest per successful outcome, not cheapest per token.
Model “knobs” like thinking and effort can change accuracy, latency, and token usage in non-intuitive ways, so you must test combinations.
You can shift the cost/quality frontier through prompt caching and context engineering (token-efficient tool outputs and cleaner context).
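One way to picture "token-efficient tool outputs" is a small filter that sits between a tool and the model's context. The sketch below is illustrative only: `compact_tool_output`, the field names, and the sample records are all hypothetical, and character count is used as a rough stand-in for token count.

```python
import json


def compact_tool_output(records, keep_fields, max_records):
    """Keep only the fields and rows the model needs, serialized without padding.

    Hypothetical helper: the talk does not prescribe a specific implementation.
    """
    trimmed = [
        {key: record[key] for key in keep_fields if key in record}
        for record in records[:max_records]
    ]
    # Compact separators avoid spending context on whitespace.
    return json.dumps(trimmed, separators=(",", ":"))


# Made-up raw tool output with fields the model never needs.
raw_records = [
    {"id": 1, "status": "open", "title": "Login fails", "html_body": "<div>...</div>", "etag": "a1"},
    {"id": 2, "status": "closed", "title": "Slow export", "html_body": "<div>...</div>", "etag": "b2"},
    {"id": 3, "status": "open", "title": "Typo in footer", "html_body": "<div>...</div>", "etag": "c3"},
]

raw_text = json.dumps(raw_records, indent=2)
compact_text = compact_tool_output(raw_records, keep_fields=["id", "status", "title"], max_records=2)
```

Whether trimming like this helps or hurts accuracy is something to confirm on your own eval, since dropping a field the model relied on would show up as failed tasks.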

IDEAS WORTH REMEMBERING

5 ideas

Build a private eval to pick models repeatably and defensibly.

A bespoke eval aligned to your real tasks will guide decisions far better than generic benchmarks like SWE-bench, especially for heterogeneous agent workflows that blend research, tool use, and coding.
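As a minimal sketch of what "tasks + success criteria + graders" could look like in code, the example below defines a tiny eval and runs it against a stand-in model. Everything here is an assumption for illustration: `EvalTask`, `run_eval`, `fake_model`, and the two sample tasks are hypothetical, and a real eval would call an actual model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalTask:
    task_id: str
    prompt: str
    passes: Callable[[str], bool]  # success criterion applied to the model output


def run_eval(tasks, model_fn):
    """Run every task through model_fn and return (pass_rate, per-task results)."""
    results = {task.task_id: task.passes(model_fn(task.prompt)) for task in tasks}
    pass_rate = sum(results.values()) / len(results)
    return pass_rate, results


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (hypothetical)."""
    if "refund" in prompt:
        return "Refund approved: order 1234"
    return "I am not sure."


TASKS = [
    EvalTask(
        task_id="refund",
        prompt="Process a refund for order 1234",
        passes=lambda out: "1234" in out and "approved" in out.lower(),
    ),
    EvalTask(
        task_id="hours",
        prompt="What are the support hours?",
        passes=lambda out: "9" in out,
    ),
]

pass_rate, results = run_eval(TASKS, fake_model)
```

Because `model_fn` is just a callable, the same task list can be rerun unchanged against each new model release.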

Optimize for cost per successful outcome, not cost per token.

A “more expensive” model can still be cheaper end-to-end if it succeeds more often or finishes in fewer turns/tokens (higher task completion per dollar).
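The arithmetic behind this can be sketched as cost per attempt divided by success rate, which equals total spend divided by number of successful tasks. The prices, token counts, and success rates below are invented for illustration and do not describe any real model.

```python
def cost_per_success(price_in_per_mtok, price_out_per_mtok, in_tokens, out_tokens, success_rate):
    """Expected spend (USD) per successful task.

    Prices are per million tokens; token counts are averages per attempt.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = (
        in_tokens * price_in_per_mtok + out_tokens * price_out_per_mtok
    ) / 1_000_000
    return cost_per_attempt / success_rate


# Hypothetical "cheap" model: low price, but more turns and more failures.
cheap = cost_per_success(1.0, 5.0, in_tokens=40_000, out_tokens=8_000, success_rate=0.5)

# Hypothetical "expensive" model: 3x the price, fewer tokens, higher success.
pricey = cost_per_success(3.0, 15.0, in_tokens=10_000, out_tokens=2_000, success_rate=0.9)
```

With these made-up inputs the pricier model costs less per successful outcome, because it uses a quarter of the tokens and fails far less often.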

Measure quality, latency, and cost together, because model behavior can be counterintuitive.

More capable models may be faster in practice because they need fewer steps/turns and less verification; you only discover this by sweeping models/configs on your eval.
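A sweep might end in a selection step like the sketch below: filter the measured configurations by a quality bar and a latency budget, then take the cheapest per successful outcome. `RunSummary`, `pick_config`, the configuration labels, and all numbers are hypothetical placeholders for measurements from your own eval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    config: str                  # e.g. model name plus thinking/effort setting
    pass_rate: float             # fraction of eval tasks passed
    p50_latency_s: float         # median end-to-end latency per task
    cost_per_success_usd: float


def pick_config(runs, min_pass_rate, max_latency_s):
    """Return the cheapest config per success that meets both bars, or None."""
    eligible = [
        run for run in runs
        if run.pass_rate >= min_pass_rate and run.p50_latency_s <= max_latency_s
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda run: run.cost_per_success_usd)


# Invented measurements for three configurations.
RUNS = [
    RunSummary("small/no-thinking", pass_rate=0.62, p50_latency_s=14.0, cost_per_success_usd=0.11),
    RunSummary("large/low-effort", pass_rate=0.91, p50_latency_s=9.0, cost_per_success_usd=0.07),
    RunSummary("large/high-effort", pass_rate=0.95, p50_latency_s=21.0, cost_per_success_usd=0.15),
]

best = pick_config(RUNS, min_pass_rate=0.90, max_latency_s=15.0)
```

In this made-up table the larger model at low effort is both faster and cheaper per success than the small one, which is the kind of result that only shows up when all three metrics are measured together.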

Treat eval tasks like a math exam: grade both outcome and working.

For agentic systems, it’s not enough to check the final answer; also verify intermediate behaviors (e.g., correct tool calls, correct database queries, proper localization arguments).
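Grading "outcome and working" could look like the following sketch, which checks the final answer and also checks that required tool calls appear in the transcript with the right arguments. The transcript shape, the tool names `get_order` and `format_price`, and the `locale` argument are assumptions made up for this example.

```python
def grade_transcript(transcript, expected_answer, required_calls):
    """Grade the final answer (outcome) and the tool calls made (working)."""
    calls = [
        (step["tool"], step["args"])
        for step in transcript["steps"]
        if step["type"] == "tool_call"
    ]

    def was_made(name, required_args):
        # A call matches if the tool name matches and every required argument is present.
        return any(
            tool == name and all(args.get(k) == v for k, v in required_args.items())
            for tool, args in calls
        )

    working_ok = all(was_made(name, req) for name, req in required_calls)
    outcome_ok = transcript["final_answer"].strip() == expected_answer
    return {"outcome": outcome_ok, "working": working_ok, "passed": outcome_ok and working_ok}


# Hypothetical transcript: right answer, but the order lookup was skipped.
transcript = {
    "steps": [
        {"type": "tool_call", "tool": "format_price", "args": {"amount": 19.99, "locale": "de-DE"}},
    ],
    "final_answer": "19,99 €",
}

grade = grade_transcript(
    transcript,
    expected_answer="19,99 €",
    required_calls=[
        ("get_order", {"order_id": "1234"}),
        ("format_price", {"locale": "de-DE"}),
    ],
)
```

Here the outcome check alone would pass, while the working check flags that the answer was produced without the required lookup.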

Use multiple graders: LLM-as-judge for flexible comparisons, deterministic checks for invariants.

LLM judges can handle semantically equivalent variants (e.g., different but correct SQL), while code-based checks ensure required tools and parameters are always used.
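One way to combine the two grader types is to run the deterministic invariants first and only consult the judge when they pass. In this sketch `grade_sql_task`, the `run_query` tool name, and the read-only rule are hypothetical, and `stub_judge` is a trivial stand-in for a real LLM-as-judge call, which would ask a model whether the two queries are semantically equivalent.

```python
from typing import Callable


def grade_sql_task(
    sql: str,
    tools_called: list[str],
    reference_sql: str,
    judge: Callable[[str, str], bool],
) -> dict:
    """Deterministic checks gate the (more expensive, fuzzier) judge."""
    # Invariants: the query tool was used and the query is read-only.
    invariants_ok = (
        "run_query" in tools_called
        and sql.strip().upper().startswith("SELECT")
    )
    # Only consult the judge when the hard requirements are met.
    equivalent = judge(sql, reference_sql) if invariants_ok else False
    return {"invariants": invariants_ok, "equivalent": equivalent, "passed": equivalent}


def stub_judge(candidate: str, reference: str) -> bool:
    """Stand-in for an LLM judge: compares case- and whitespace-normalized text."""
    normalize = lambda s: " ".join(s.lower().split())
    return normalize(candidate) == normalize(reference)


REFERENCE = "SELECT name FROM users WHERE age >= 18"

ok = grade_sql_task(
    "select name\nfrom users\nwhere age >= 18", ["run_query"], REFERENCE, stub_judge
)
unsafe = grade_sql_task("DELETE FROM users", ["run_query"], REFERENCE, stub_judge)
```

Passing the judge in as a callable keeps the deterministic part testable on its own and lets the judge model be swapped without touching the invariants.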

WORDS WORTH SAVING

5 quotes

A small-- it can be a very small, well-designed eval will be much more important for you guys to assess which model to use than any public benchmark out there.

— Lucas

The model that's right for your use case is not necessarily the one that's cheapest or fastest per token, but the one that is cheapest per successful outcome.

— Lucas

In a world where we're automating a lot of stuff with AI, taking the time to actually build that eval data set, I think, is like one of the best uses of your human time that there is.

— Lucas

If we would've just looked at the headline metrics from the eval, we would have thought, like, "Great, we've made a huge improvement." But it's only by digging into the transcripts that you start to see some of the actually underlying patterns that are emerging, some of the real things that need to be fixed and done.

— Lucas

My kind of hot take here is people spend too much time thinking about these, like, super complex multi-agent orchestration systems and not enough time doing the simple thing that works, which is just, like, good context hygiene and good context engineering.

— Lucas

Quality–latency–cost trade-offs
Private eval construction (tasks, datasets, graders)
LLM-as-judge vs deterministic graders
Common eval failure modes (noise, infra, saturation)
Transcript review and observability/tracing
Thinking vs effort configuration
Prompt caching strategy and cache hit rates
Context hygiene/engineering to cut tokens and improve accuracy

High quality AI-generated summary created from speaker-labeled transcript.
