At a glance
WHAT IT’S REALLY ABOUT
Claude’s capability curve: faster autonomy, better planning, safer agents
- Claude’s coding ability has improved dramatically in a year, moving from solving a fraction of issues to near-senior performance on well-specified tasks as benchmarks like SWE-bench Verified become saturated.
- A side-by-side demo rebuilding claude.ai shows newer models producing more complete, higher-quality apps with fewer lines, better UX features, and working functionality like chat and diagrams.
- Key capability gains come from stronger upfront planning, robust error recovery (less “doom looping”), and sustained coherence across very long agentic runs, with contexts of up to roughly one million tokens.
- Long-horizon agent loops—plan, execute, validate against tests, iterate—now enable week-long refactors and rewrites when strong test suites provide reliable verification.
- To “ride the curve,” teams should invest in realistic evals, continuously refresh them as they saturate, shrink legacy scaffolding/prompts, and give models room to think and operate autonomously with safety gates like auto-approval classifiers.
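The plan, execute, validate, iterate loop mentioned above can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: `run_agent_loop`, `LoopResult`, and the three callables (`plan`, `execute`, `validate`) are hypothetical names, and the stubs stand in for a real model call and a real test runner.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopResult:
    passed: bool
    iterations: int
    history: List[str] = field(default_factory=list)


def run_agent_loop(
    plan: Callable[[str, List[str]], str],
    execute: Callable[[str], str],
    validate: Callable[[str], List[str]],
    task: str,
    max_iterations: int = 5,
) -> LoopResult:
    """Plan, execute, validate, and iterate until validation reports no failures."""
    failures: List[str] = []
    history: List[str] = []
    for i in range(1, max_iterations + 1):
        step = plan(task, failures)    # the planner sees the latest failures
        artifact = execute(step)       # e.g. apply an edit, run a command
        failures = validate(artifact)  # an empty list means every check passed
        history.append(f"iter {i}: {len(failures)} failing")
        if not failures:
            return LoopResult(True, i, history)
    return LoopResult(False, max_iterations, history)


# Hypothetical stubs: the first attempt fails one test, the second passes.
def stub_plan(task: str, failures: List[str]) -> str:
    return "v2" if failures else "v1"


def stub_execute(step: str) -> str:
    return step


def stub_validate(artifact: str) -> List[str]:
    return ["test_parse"] if artifact == "v1" else []


result = run_agent_loop(stub_plan, stub_execute, stub_validate, "port the parser")
print(result)
```

The test suite is the only thing that ends the loop here, which is why the summary ties long-horizon runs to reliable verification: with a weak `validate`, the loop would stop on a wrong answer.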
IDEAS WORTH REMEMBERING
Benchmarks are being outpaced; measure capability with your own evals.
SWE-bench Verified went from useful to saturated for frontier models, so teams need application-specific evaluations that reflect real production tasks and failure modes.
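One way to picture an application-specific eval is as a small regression suite run against the model. The sketch below is an assumption-laden illustration: `run_evals`, `is_saturated`, the 0.95 threshold, and the stub model are all hypothetical, and real cases would come from production tasks and observed failure modes.

```python
from typing import Callable, List, Tuple

# Each case pairs a prompt with a checker that decides pass or fail.
EvalCase = Tuple[str, Callable[[str], bool]]


def run_evals(model: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Return the fraction of cases whose output satisfies its checker."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)


def is_saturated(pass_rate: float, threshold: float = 0.95) -> bool:
    """Flag a suite that no longer separates models (threshold is an assumption)."""
    return pass_rate >= threshold


# Hypothetical stand-in for a model call; it cannot handle "reverse".
def stub_model(prompt: str) -> str:
    op, *args = prompt.split()
    if op == "add":
        return str(sum(int(a) for a in args))
    if op == "upper":
        return args[0].upper()
    return ""


cases: List[EvalCase] = [
    ("add 2 2", lambda out: out == "4"),
    ("upper abc", lambda out: out == "ABC"),
    ("reverse abc", lambda out: out == "cba"),
]

rate = run_evals(stub_model, cases)
print(f"pass rate: {rate:.2f}, saturated: {is_saturated(rate)}")
```

When `is_saturated` starts returning true, the suite has stopped telling you anything new, and it is time to add harder cases drawn from current failures.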
Upfront planning is now a core differentiator—stop over-scaffolding it.
Modern models tend to plan before acting and self-correct during planning; you often get better outcomes by enabling higher reasoning effort rather than forcing rigid planning prompts.
Error recovery has improved; give agents feedback channels to iterate.
“Doom looping” is much less common when the agent can run tools, observe failures, and spend compute to change approach, so wire in tests, logs, and environment signals.
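As one example of an environment signal, a harness can notice when the same fix is retried against the same failure and report that back to the agent. This is a hedged sketch rather than anything described in the talk: `RepeatDetector` and its method are hypothetical, and exact-match hashing is a deliberately crude notion of "the same solution again".

```python
import hashlib
from typing import Set


class RepeatDetector:
    """Remember (attempted fix, failure output) pairs and flag exact repeats."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def is_repeat(self, attempted_fix: str, failure_output: str) -> bool:
        # A NUL separator keeps ("ab", "c") distinct from ("a", "bc").
        payload = f"{attempted_fix}\x00{failure_output}".encode("utf-8")
        key = hashlib.sha256(payload).hexdigest()
        if key in self._seen:
            return True
        self._seen.add(key)
        return False


detector = RepeatDetector()
first = detector.is_repeat("bump timeout to 30s", "TimeoutError in test_fetch")
second = detector.is_repeat("bump timeout to 30s", "TimeoutError in test_fetch")
third = detector.is_repeat("mock the network call", "TimeoutError in test_fetch")
print(first, second, third)
```

A repeat can be surfaced to the agent as feedback ("this fix was already tried and produced the same failure") instead of being used to abort the run.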
Long-horizon coherence enables bigger bets; stop artificially shrinking tasks.
With far better sustained attention across huge contexts, you can hand over larger specs/codebases and let agents run for hours, using periodic checkpoints and validations.
Great test suites unlock transformative automation.
The Bun rewrite anecdote underscores that high-coverage tests act as the ground truth that lets agents refactor or port major systems safely and quickly.
WORDS WORTH SAVING
Claude Code is everywhere, and coding agents have completely revolutionized how we do software.
— Jeremy
Evaluations are just the unit tests and the regression tests of the AI era.
— Jeremy
Doom looping is essentially the problem of having a failure and then attempting a solution and, you know, Claude will tell you, "Aha, I've got the problem. I fixed it." And then you look at the problem, and it's just sort of repeated the same solution again.
— Jeremy
Now at this point, our models can hold coherence up to one million tokens and even beyond that point.
— Jeremy
He was able to get Claude to run over the course of an entire week and rewrite all of Bun in Rust in one week to get 100% pass rate almost on the entire test suite.
— Jeremy
High-quality AI-generated summary created from speaker-labeled transcript.
