The capability curve

Frontier models are getting more capable, fast. Where the curve is going, and what it means for developers building on Claude.

May 22, 2026 · 26m

WHAT IT’S REALLY ABOUT

Claude’s capability curve: faster autonomy, better planning, safer agents

Claude’s coding ability has improved dramatically in a year, moving from solving a fraction of issues to near-senior performance on well-specified tasks as benchmarks like SWE-bench Verified become saturated.
A side-by-side demo rebuilding claude.ai shows newer models producing more complete, higher-quality apps with fewer lines, better UX features, and working functionality like chat and diagrams.
Key capability gains come from stronger upfront planning, more robust error recovery (less “doom looping”), and sustained coherence across very long agentic runs, with contexts of up to roughly one million tokens.
Long-horizon agent loops (plan, execute, validate against tests, iterate) now enable week-long refactors and rewrites when strong test suites provide reliable verification.
To “ride the curve,” teams should invest in realistic evals, continuously refresh them as they saturate, shrink legacy scaffolding/prompts, and give models room to think and operate autonomously with safety gates like auto-approval classifiers.

IDEAS WORTH REMEMBERING

5 ideas

Benchmarks are being outpaced; measure capability with your own evals.

SWE-bench Verified went from useful to saturated for frontier models, so teams need application-specific evaluations that reflect real production tasks and failure modes.
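
As a rough illustration of an application-specific eval, the sketch below treats each case as a prompt plus a checker function and reports a pass rate. Everything here is an assumption for illustration: `EvalCase`, `run_eval`, the stub model, and the 0.95 saturation threshold are hypothetical names and values, not something taken from the talk. In practice `model` would wrap a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the model output is acceptable


def run_eval(
    model: Callable[[str], str],
    cases: list[EvalCase],
    saturation_threshold: float = 0.95,  # hypothetical cutoff
) -> dict:
    """Run every case through the model and summarize the results."""
    failures = []
    for case in cases:
        output = model(case.prompt)
        if not case.check(output):
            failures.append(case.name)
    pass_rate = (len(cases) - len(failures)) / len(cases) if cases else 0.0
    return {
        "pass_rate": pass_rate,
        "failures": failures,
        # A saturated eval no longer separates models; time to add harder cases.
        "saturated": pass_rate >= saturation_threshold,
    }


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption for this sketch)."""
    if "refund" in prompt.lower():
        return "Refund issued"
    return "Sorry, I cannot help"


CASES = [
    EvalCase("refund_request", "Customer asks for a refund on order 123",
             lambda out: "refund" in out.lower()),
    EvalCase("address_change", "Customer wants to change shipping address",
             lambda out: "address" in out.lower()),
]

report = run_eval(stub_model, CASES)
print(report)
```

The `saturated` flag is one simple way to notice when an eval has stopped being informative and needs refreshing with harder, production-derived cases.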

Upfront planning is now a core differentiator—stop over-scaffolding it.

Modern models tend to plan before acting and self-correct during planning; you often get better outcomes by enabling higher reasoning effort rather than forcing rigid planning prompts.

Error recovery has improved; give agents feedback channels to iterate.

“Doom looping” becomes much less common when the agent can run tools, observe failures, and spend compute on changing its approach, so wire in tests, logs, and environment signals.
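
One way to picture such a feedback channel is a loop that hands every failure message back to the agent and flags repeated attempts. This is a minimal sketch under stated assumptions: `iterate_with_feedback`, `propose`, and `run_checks` are hypothetical names, and the stub functions stand in for a real agent and a real test runner.

```python
from typing import Callable, Optional


def iterate_with_feedback(
    propose: Callable[[list[str]], str],
    run_checks: Callable[[str], Optional[str]],
    max_attempts: int = 5,
) -> tuple[Optional[str], int]:
    """Ask for candidates until one passes the checks or attempts run out.

    `propose` receives all feedback so far; `run_checks` returns None on
    success or an error message on failure.
    """
    feedback: list[str] = []
    seen: set[str] = set()
    for attempt in range(1, max_attempts + 1):
        candidate = propose(feedback)
        if candidate in seen:
            # Same fix as before: say so explicitly instead of re-running it.
            feedback.append("Repeated an earlier attempt; try a different approach.")
            continue
        seen.add(candidate)
        error = run_checks(candidate)
        if error is None:
            return candidate, attempt
        feedback.append(error)
    return None, max_attempts


# Stubs standing in for an agent and a test runner (assumptions).
ATTEMPTS = ["return a - b", "return a - b", "return a + b"]


def stub_propose(feedback: list[str]) -> str:
    return ATTEMPTS[min(len(feedback), len(ATTEMPTS) - 1)]


def stub_checks(candidate: str) -> Optional[str]:
    return None if candidate == "return a + b" else "test_add failed: expected 3"


result = iterate_with_feedback(stub_propose, stub_checks)
print(result)
```

The design choice worth noting is that the repeat detection is itself a signal fed back to the agent, rather than a silent retry.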

Long-horizon coherence enables bigger bets; stop artificially shrinking tasks.

With far better sustained attention across huge contexts, you can hand over larger specs/codebases and let agents run for hours, using periodic checkpoints and validations.
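
The periodic checkpoints and validations mentioned above could look something like the following sketch, which validates progress every few steps and saves state so a long run can resume from the last good point. The function name `run_with_checkpoints`, the JSON state layout, and the `every` interval are all hypothetical choices for illustration.

```python
import json
from pathlib import Path
from typing import Callable


def run_with_checkpoints(
    steps: list[Callable[[], str]],
    validate: Callable[[dict], bool],
    checkpoint_path: Path,
    every: int = 2,  # hypothetical checkpoint interval
) -> dict:
    """Run steps in order, validating and saving state every `every` steps."""
    state = {"done": 0, "log": []}
    if checkpoint_path.exists():
        # Resume from the last state that passed validation.
        state = json.loads(checkpoint_path.read_text())
    for i in range(state["done"], len(steps)):
        state["log"].append(steps[i]())
        state["done"] = i + 1
        at_boundary = state["done"] % every == 0 or state["done"] == len(steps)
        if at_boundary:
            if not validate(state):
                # Nothing is saved, so a rerun restarts from the last good checkpoint.
                raise RuntimeError(f"validation failed after step {state['done']}")
            checkpoint_path.write_text(json.dumps(state))
    return state
```

A caller might pass steps that each perform one unit of agent work and a `validate` function that runs the test suite; because state is written only after validation passes, a failed run never overwrites a good checkpoint.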

Great test suites unlock transformative automation.

The Bun rewrite anecdote underscores that high-coverage tests act as the ground truth that lets agents refactor or port major systems safely and quickly.

WORDS WORTH SAVING

5 quotes

Claude Code is everywhere, and coding agents have completely revolutionized how we do software.

— Jeremy

Evaluations are just the unit tests and the regression tests of the AI era.

— Jeremy

Doom looping is essentially the problem of having a failure and then attempting a solution and, you know, Claude will tell you, "Aha, I've got the problem. I fixed it." And then you look at the problem, and it's just sort of repeated the same solution again.

— Jeremy

Now at this point, our models can hold coherence up to one million tokens and even beyond that point.

— Jeremy

He was able to get Claude to run over the course of an entire week and rewrite all of Bun in Rust in one week to get 100% pass rate almost on the entire test suite.

— Jeremy

SWE-bench Verified and benchmark saturation
Claude Code adoption and AI-written PRs
Demo: rebuilding claude.ai in one shot
Planning/reasoning before acting
Error recovery and avoiding doom loops
Long-horizon coherence and million-token runs
Evals, scaffolding reduction, auto mode, closed agent loop

High quality AI-generated summary created from speaker-labeled transcript.
