The capability curve

Name: The capability curve
Uploaded: 2026-05-22T00:00:00Z
Duration: 26 min 25 s
Description: Claude’s coding ability has improved dramatically in a year, moving from solving a fraction of issues to near-senior performance on well-specified tasks, while benchmarks like SWE-bench Verified have become saturated.

Frontier models are getting more capable, fast. Where the curve is going, and what it means for developers building on Claude.

May 22, 2026 · 26 min

CHAPTERS

0:18 – 2:18
From Claude 4 launch to today: coding agents reshape software work
Jeremy sets the stage by contrasting last year’s Claude 4-era excitement with today’s reality: models improved dramatically and coding agents are now commonplace. Audience polling highlights how often Claude is writing most—or all—of shipped PRs, signaling a new development paradigm.
- Jeremy’s role: improving Claude’s coding behaviors and capabilities
- A year-over-year shift: Opus 4 felt cutting-edge then, now feels outdated
- Claude Code and coding agents becoming mainstream in daily workflows
- Audience poll: PRs mostly/completely written by Claude, sometimes unread
- The paradigm shift: software development is now agent-augmented
2:18 – 3:49
Measuring the curve: SWE-bench Verified and the limits of benchmarks
He explains how SWE-bench Verified served as a key metric for software engineering progress, showing steep gains over 12 months. But frontier models are now saturating such benchmarks, forcing the industry to rethink how progress is measured.
- SWE-bench Verified: GitHub issues + tests as a realistic software engineering benchmark
- Reported jump from ~60% (Sonnet 3.7) to 87% (Opus 4.7)
- Interpretation: models solve far more well-specified issues than before
- Frontier models saturate benchmarks, making them less informative
- Progress is outpacing benchmark creation cycles
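To make the saturation point concrete, here is a rough arithmetic sketch using only the two reported scores above. The helper names are illustrative and not from the talk.

```python
def remaining_headroom(pass_rate: float) -> float:
    """Fraction of benchmark tasks still unsolved at a given pass rate."""
    return 1.0 - pass_rate


def relative_failure_reduction(old_rate: float, new_rate: float) -> float:
    """Share of the older model's failures that the newer model no longer makes."""
    return (new_rate - old_rate) / (1.0 - old_rate)


# Reported figures from the chapter summary: ~60% -> 87%.
old_rate, new_rate = 0.60, 0.87
print(f"headroom left: {remaining_headroom(new_rate):.0%}")
print(f"failures eliminated: {relative_failure_reduction(old_rate, new_rate):.1%}")
```

Under these figures roughly two thirds of the earlier failures are gone, and only 13 points of headroom remain for the benchmark to register any further improvement.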
3:49 – 5:52
Side-by-side demo: rebuilding claude.ai in one shot (12 months apart)
Jeremy demonstrates capability growth by giving two generations of models the same prompt: rebuild the claude.ai website from scratch. The newer model produces a more complete, functional, and polished app with fewer lines of code.
- Demo prompt: rebuild claude.ai from scratch in one shot
- Older model: minimal planning, lots of code, basic UI, broken chat behavior
- Newer model: better tool use, fewer lines, more faithful UI
- Functional improvements: working chat, history/sidebar, formatted output
- Extra polish: Mermaid diagram support and dark mode
5:52 – 7:53
Capability shift #1: models now plan and reason before acting
He describes a major behavioral improvement: models increasingly generate a plan first, revise it, and only then execute. This reduces the need for heavy prompt scaffolding to force planning and enables more reliable end-to-end completion.
- Past failure mode: “act first, think later” (Sonnet 3.7 analogy)
- Now: models read, investigate, and form a plan autonomously
- Self-correction during planning (“Actually…”, “Never mind…”) improves success
- Less need to force scaffolding; planning emerges naturally
- Developer advice: give the model time to reason by setting a high reasoning effort
7:53 – 9:24
Capability shift #2: better error recovery and the end of doom loops
Jeremy explains how models have improved at learning from tool feedback and adjusting strategies after failures. Doom looping (repeating the same failed fix) is now far less common, which increases efficiency and reliability.
- Doom looping: repeating the same ineffective fix after errors
- Modern models use tool feedback + additional compute to revise approach
- Iteration becomes adaptive rather than repetitive
- Fewer wasted tokens and faster convergence on correct solutions
- Developer advice: provide feedback channels (tests/tools) so models can recover
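One way to picture a "feedback channel" is a tool that runs a check command and hands the result back to the model as structured data. The sketch below is a minimal illustration under that assumption. `run_check` and its return shape are hypothetical names, not part of any Claude API.

```python
import subprocess
import sys


def run_check(cmd: list[str], timeout: float = 60.0, max_chars: int = 4000) -> dict:
    """Run a check command (e.g. a test suite) and return feedback for the agent."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return {"ok": False, "exit_code": None, "output": f"timed out after {timeout}s"}
    # Keep only the tail of the output so long logs do not flood the context.
    output = (proc.stdout + proc.stderr)[-max_chars:]
    return {"ok": proc.returncode == 0, "exit_code": proc.returncode, "output": output}


if __name__ == "__main__":
    # Stand-in for a real test command such as a project's test runner.
    print(run_check([sys.executable, "-c", "print('2 passed')"]))
```

Returning the exit code together with the captured output gives the model something specific to react to, which is what lets it change strategy instead of retrying the same fix.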
9:24 – 11:25
Capability shift #3: sustained attention over long agentic runs
He highlights the leap in long-horizon coherence: models can keep track of goals and specs over very long contexts and lengthy runs. This allows developers to hand over bigger tasks—like whole codebases—without constant babysitting.
- Old behavior: losing the plot during large refactors or long specs
- Now: coherence up to ~1M tokens (and beyond) is increasingly feasible
- Models retain specs better across long executions
- Developers can avoid over-fragmenting tasks into tiny chunks
- Advice: be more ambitious and try whole-codebase tasks and long runs
11:25 – 12:25
What “autonomy” is made of: planning + recovery + long-horizon coherence
Jeremy connects the three improvements into a unified view of autonomy. Autonomous agents can plan, execute, verify against the environment, recover from mistakes, and continue for hours rather than minutes.
- Autonomy components: advance planning, robust execution, memory/coherence
- Agents validate work against the environment (e.g., run tests)
- Checkpointing and goal verification over time
- Result: end-to-end task completion becomes realistic
- Shift from short scripts to multi-hour agent runs
12:25 – 14:25
Long-horizon agent loop in practice: rewriting Bun’s engine in Rust in a week
A striking case study: Bun’s founder used Claude agents plus a near-complete test suite to rewrite the JavaScript engine in Rust in about a week. The story emphasizes how strong verification (tests) enables unusually large, reliable agent-driven refactors.
- Agent loop: plan → execute → run tests → iterate until passing
- Case study: rewrite Bun’s engine in Rust to eliminate memory errors
- Enabler: ~100% test coverage provided strong verification signals
- Scale: multi-day/week-long autonomous work leading to a merged PR
- Takeaway: ambition + great tests unlock massive agent productivity
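The plan → execute → run tests → iterate loop described above can be sketched in a few lines. This is a generic illustration, not the setup used in the Bun case study. The callables `propose`, `apply_change`, and `run_tests` are hypothetical stand-ins for a model call, a code edit, and a test runner.

```python
from typing import Callable, Optional


def agent_loop(
    propose: Callable[[Optional[str]], str],
    apply_change: Callable[[str], None],
    run_tests: Callable[[], tuple[bool, str]],
    max_iterations: int = 10,
) -> bool:
    """Iterate until the tests pass or the iteration budget runs out."""
    feedback: Optional[str] = None
    for _ in range(max_iterations):
        change = propose(feedback)      # plan: the model sees the last test feedback
        apply_change(change)            # execute: apply the proposed edit
        passed, feedback = run_tests()  # verify: tests are the ground truth
        if passed:
            return True
    return False
```

The test suite is the only stopping signal here, which is why the case study stresses near-complete coverage: the loop is only as trustworthy as the verification step.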
14:25 – 15:26
Customer signals: proofs in planning, long-run coherence, and verification quality
He shares examples from customers observing the same capability trends in production contexts. Reports include models doing formal reasoning during planning, maintaining coherence for hours, and improving code quality through better self-verification.
- Vercel: models sometimes write proofs before implementing systems code
- Windsurf: strong sustained reasoning across long agentic runs
- Shopify: Opus 4.7 step-up in intelligence, code quality, and verification
- Recurring theme: better planning, better verifying, longer coherence
- These changes appear consistently across new model releases
15:26 – 16:27
Riding the curve: treat intelligence as a fast-moving foundation
Jeremy reframes progress as an ongoing trajectory rather than a single model upgrade. Since capabilities shift every few months, teams should build processes that continuously absorb model improvements into products.
- It’s not about “Opus 4.6 vs 4.7,” it’s the overall trajectory
- Model improvements arrive frequently and materially change outcomes
- Teams should design to capture gains rather than hardcoding assumptions
- The “ground is shifting beneath our feet” for app builders
- Sets up practical patterns for adapting over time
16:27 – 19:59
Pattern #1: build real evals (AI-era unit tests) and keep them unsaturated
He argues evals are the most important accelerator for adopting better models. Effective evals mirror real user traffic and failure modes, and must evolve as models improve to avoid saturation that hides progress.
- Evals are the unit/regression tests of the AI era, so start building them
- Avoid relying on academic benchmarks if they don’t match your use case
- Use real production failure modes and traffic to design evaluations
- Watch for eval saturation (no headroom to measure improvements)
- Benchmark new models via scripts instead of vibes-based adoption
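A scripted comparison of this kind can be very small. The sketch below assumes each eval case is a prompt plus a grader function, and treats a model as any callable from prompt to output. `pass_rate`, `is_saturated`, and the 0.95 threshold are illustrative choices, not taken from the talk.

```python
from typing import Callable

# An eval case pairs a prompt with a grader that judges the model's output.
EvalCase = tuple[str, Callable[[str], bool]]


def pass_rate(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of eval cases whose output the grader accepts."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for prompt, grade in cases if grade(model(prompt)))
    return passed / len(cases)


def is_saturated(rate: float, threshold: float = 0.95) -> bool:
    """Flag an eval set that leaves little headroom to show improvement."""
    return rate >= threshold
```

Running `pass_rate` for the current and candidate model over cases drawn from real production failures gives a number to compare. If `is_saturated` fires for the current model, the set needs harder cases before it can show any gain.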
19:59 – 23:01
Pattern #2: shrink scaffolding as models improve (audit prompts and harness)
As models get more capable, old prompt hacks and harness complexity can become liabilities. Jeremy recommends trimming prompts/tools and revalidating assumptions, since newer models may follow instructions more literally and expose outdated prompt “bugs.”
- Scaffolding/harness includes prompts, tools, environment, skills
- Over time prompts become “Frankenstein” collections of legacy fixes
- New models may make old prompt quirks harmful by following them better
- Example: outdated citation-format instruction caused incorrect output
- Use evals to safely minimize prompts and remove no-longer-needed workarounds
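One possible way to use evals for prompt minimization is a simple ablation pass: drop one instruction at a time and keep the removal only if the eval score does not fall. This is a sketch of that idea, not a method described in the talk. `prune_prompt` and the `score` callable (assumed to run your eval suite on a list of prompt fragments) are hypothetical.

```python
from typing import Callable


def prune_prompt(
    fragments: list[str],
    score: Callable[[list[str]], float],
    tolerance: float = 0.0,
) -> list[str]:
    """Greedily remove prompt fragments whose removal does not lower the eval score."""
    kept = list(fragments)
    baseline = score(kept)
    for fragment in list(fragments):
        candidate = [f for f in kept if f != fragment]
        candidate_score = score(candidate)
        if candidate_score >= baseline - tolerance:
            # Removal is safe (or helps), so adopt the shorter prompt.
            kept = candidate
            baseline = max(baseline, candidate_score)
    return kept
```

A greedy pass like this is order-dependent and misses interactions between fragments, so it is best read as a first audit and not a guarantee that the result is minimal.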
23:01 – 26:25
Pattern #3: give the model room—reasoning effort, safe autonomy, and closing the loop
He closes with operational practices that unlock frontier performance: allow test-time compute, enable controlled autonomy, and let agents improve the agent system itself. The goal is more productive, safer long-running agents that continuously self-optimize.
- Enable adaptive thinking and set higher reasoning effort for hard tasks
- Controlled autonomy via “auto mode” tool-call safety classification
- Loop humans in only for high-risk actions; automate routine approvals
- Close the agent loop: let Claude run evals and propose prompt/tool improvements
- Design systems so agents can inspect, iterate, and improve their own performance
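The idea of routing only high-risk tool calls to a human can be illustrated with a toy classifier. This is not how the “auto mode” classifier mentioned above works. The keyword rules, the `ToolCall` type, and the labels are all assumptions made for this sketch, and a production system would likely need something far more robust than substring matching.

```python
from dataclasses import dataclass

# Illustrative markers only; a real deployment would define its own risk policy.
HIGH_RISK_MARKERS = ("rm -rf", "git push --force", "drop table")


@dataclass
class ToolCall:
    tool: str
    argument: str


def classify(call: ToolCall) -> str:
    """Return 'needs_human' for risky shell commands, otherwise 'auto_approve'."""
    text = call.argument.lower()
    if call.tool == "shell" and any(marker in text for marker in HIGH_RISK_MARKERS):
        return "needs_human"
    return "auto_approve"


if __name__ == "__main__":
    for call in (ToolCall("shell", "pytest -q"), ToolCall("shell", "rm -rf build/")):
        print(call.argument, "->", classify(call))
```

The design point is the split itself: routine calls proceed without interruption, while the small set of destructive actions pauses for approval.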