
The capability curve

Frontier models are getting more capable, fast. Where the curve is going, and what it means for developers building on Claude.

May 7, 2026 · 15 min · Watch on YouTube ↗

At a glance

WHAT IT’S REALLY ABOUT

How rapidly improving Claude models change agentic coding best practices

  1. Claude’s coding performance has improved sharply year-over-year, illustrated by an SWE-bench Verified jump from 62% to 87% and a demo comparing older vs newer model output quality.
  2. Recent capability gains cluster around better upfront planning, stronger error recovery (avoiding “doom loops”), and sustained attention across very long agentic runs.
  3. Developers should anchor their work in product-representative evaluations that stay unsaturated as models improve, then regularly re-test against the newest frontier models.
  4. As models get stronger, teams can often simplify scaffolding and prompts—removing legacy steps and brittle rules can improve performance while reducing token costs.
  5. To fully benefit from newer models, give agents room to think, expand tool access safely (e.g., approval classifiers), and close the loop so the agent can verify and iterate on its own outputs.
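
The last point lends itself to a concrete illustration. Below is a minimal sketch of gating an agent's tool calls behind an approval classifier; `ToolCall`, `classify_risk`, and `execute` are hypothetical names, and the keyword allowlist stands in for whatever learned classifier a production system would actually use.

```python
# Hedged sketch: every proposed tool call passes through an approval
# classifier before it runs. All names here are hypothetical.
from dataclasses import dataclass

@dataclass
class ToolCall:
    tool: str
    args: dict

# Calls with no external side effects can be auto-approved; anything
# that mutates state is routed to a human (or a stricter classifier).
READ_ONLY_TOOLS = {"read_file", "list_dir", "grep"}

def classify_risk(call: ToolCall) -> str:
    return "auto" if call.tool in READ_ONLY_TOOLS else "review"

def execute(call: ToolCall) -> str:
    if classify_risk(call) == "review":
        answer = input(f"Approve {call.tool}({call.args})? [y/N] ")
        if answer.strip().lower() != "y":
            return "denied: reviewer rejected the call"
    return f"ran {call.tool} with {call.args}"  # placeholder dispatch

print(execute(ToolCall("read_file", {"path": "README.md"})))  # auto-approved
```

The default is the design choice worth copying: unknown tools fall into review, so widening the agent's tool set never silently widens its blast radius.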

IDEAS WORTH REMEMBERING

5 ideas

Model improvements compound across the full agent workflow.

Better planning, recovery from mistakes, and long-run coherence don’t just add performance—they multiply it by reducing stalls and rework across end-to-end tasks.

Let the model plan before acting for higher downstream quality.

Older models tended to “act first, think later”; newer models benefit from time/tokens to strategize upfront, leading to fewer mid-flight corrections.
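
As a sketch of what that split looks like in practice, assuming a generic `complete(prompt)` helper (the stub below is a placeholder for your provider's SDK call):

```python
# Minimal plan-then-act sketch: spend one model call on an explicit
# plan, then pin that plan into the context of the execution call.
def complete(prompt: str) -> str:
    return f"[model output for: {prompt[:40]}...]"  # placeholder

def solve(task: str) -> str:
    # Phase 1: planning only. The model is told not to act yet.
    plan = complete(
        f"Task: {task}\n"
        "Do not write code yet. List the steps you will take, the files "
        "you expect to touch, and how you will verify the result."
    )
    # Phase 2: execution, anchored to the original strategy.
    return complete(
        f"Task: {task}\nYour plan:\n{plan}\nNow execute it step by step."
    )

print(solve("rename the config module without breaking imports"))
```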

Design for recovery, not perfection, because iteration is now stronger.

Newer models can backtrack and try alternative paths instead of spiraling; systems that surface errors clearly and allow retries will waste fewer tokens and finish more tasks.
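
A minimal recovery loop might look like the sketch below; `generate_patch` is a hypothetical hook for the agent step, and `pytest` stands in for whatever verification command your project actually runs.

```python
# Hedged sketch of designing for recovery: surface the raw failure to
# the next attempt and cap retries so a stuck run fails fast instead
# of doom-looping.
import subprocess

def generate_patch(task: str, feedback: str | None) -> None:
    raise NotImplementedError("call your agent here and apply its edits")

def solve_with_recovery(task: str, max_attempts: int = 3) -> bool:
    feedback = None
    for _ in range(max_attempts):
        generate_patch(task, feedback)
        result = subprocess.run(["pytest", "-x", "-q"],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return True
        # Feed the failure back verbatim so the next attempt can change
        # course rather than repeat the same fix.
        feedback = result.stdout + result.stderr
    return False  # escalate to a human instead of looping forever
```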

Evals must match your real user/task distribution to guide useful optimization.

Academic or adjacent benchmarks can mislead; build evaluations from the same kinds of requests, repos, data shapes, and constraints your product actually sees.
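
A hedged sketch of such an eval, where `run_agent` is a hypothetical hook into the system under test and the two cases are illustrative placeholders, not real data:

```python
# Product-distribution eval sketch: cases come from real (anonymized)
# user requests, graded by the same success criteria the product applies.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    prompt: str                   # sampled from actual user traffic
    grade: Callable[[str], bool]  # product-level success check

def run_agent(prompt: str) -> str:
    raise NotImplementedError("call the model/agent under test")

CASES = [
    Case("Add retries to the flaky upload helper",
         grade=lambda out: "retry" in out.lower()),
    Case("Fix the failing date-parsing test",
         grade=lambda out: "test passed" in out.lower()),
]

def pass_rate() -> float:
    return sum(c.grade(run_agent(c.prompt)) for c in CASES) / len(CASES)
```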

Keep evals from saturating as models get smarter.

If your test set becomes “too easy,” improvements won’t show up; continuously refresh difficulty and coverage so frontier models still produce informative deltas.
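
One lightweight guard, sketched below with made-up model names and scores, is to flag the eval whenever the newest model approaches its ceiling:

```python
# Saturation check sketch: once the best model is near the eval's
# ceiling, further deltas stop being informative and the set needs
# harder cases. All values below are placeholders.
SATURATION_THRESHOLD = 0.95

def check_saturation(scores_by_model: dict[str, float]) -> None:
    model, score = max(scores_by_model.items(), key=lambda kv: kv[1])
    if score >= SATURATION_THRESHOLD:
        print(f"{model} scores {score:.0%}: refresh the eval with harder, "
              "newer cases before trusting further comparisons.")

check_saturation({"model-v1": 0.62, "model-v2": 0.87, "model-v3": 0.96})
```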

WORDS WORTH SAVING

5 quotes

About a year ago, on our model at the time, Sonnet 3.7, it scored sixty-two percent on this eval. Today, with Opus 4.7, it scores eighty-seven percent. That's an over-twenty-five-percent jump in just over a year.

Alex Albert

Older models would have this particularly bad failure mode where they would act first and then think later.

Alex Albert

Previous models would hit this thing that we'd often call a doom loop.

Alex Albert

If you can measure something, you can improve on it, so it's important that, A, you have evals, and B, these evals are measuring something close to your product distribution that you actually want to improve on.

Alex Albert

Often, you can actually boost your performance by removing instead of adding things onto your scaffolding.

Alex Albert
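
That last quote translates naturally into an ablation loop: re-run your eval with each scaffolding rule removed and keep only the rules whose removal hurts. In the sketch below, `run_eval` is a hypothetical hook returning a pass rate for a system prompt, and the rules are illustrative placeholders.

```python
# Scaffolding ablation sketch: prune prompt rules that no longer help.
def run_eval(system_prompt: str) -> float:
    raise NotImplementedError("score this prompt on your eval set")

RULES = [
    "Always restate the task before answering.",
    "Never edit more than one file per step.",
    "Run the legacy formatter before committing.",
]

def prune_rules(rules: list[str]) -> list[str]:
    baseline = run_eval("\n".join(rules))
    kept = []
    for i, rule in enumerate(rules):
        ablated = rules[:i] + rules[i + 1:]
        # If removing the rule does not hurt, drop it: fewer tokens and
        # often better behavior from a newer model.
        if run_eval("\n".join(ablated)) < baseline:
            kept.append(rule)
    return kept
```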

Topics: SWE-bench Verified and coding capability gains · Model planning improvements · Error recovery and avoiding doom loops · Long-context attention and autonomous runs · Building product-distribution evals · Eval saturation and frontier-model testing · Scaffolding/prompt simplification and tool safety
