The capability curve

Frontier models are getting more capable, fast. Where the curve is going, and what it means for developers building on Claude.

May 22, 2026 · 26 min

CHAPTERS

  1. From Claude 4’s launch to today’s “capability curve” reality

    Jeremy (Anthropic PM for coding capabilities) sets the stage by contrasting last year’s Claude 4-era excitement with today’s dramatically improved models and ubiquitous coding agents. He frames the talk as guidance for how developers can adapt as model capability accelerates.

  2. Audience reality check: Claude-written PRs and the new dev workflow

    Jeremy polls the room to show how common AI-authored code has become, even to the point of merging code without reading it. He notes this is risky if done carelessly, but it signals a major shift in how software gets built—especially at Anthropic.

  3. Measuring the curve: SWE-bench Verified and benchmark saturation

    He uses SWE-bench Verified to quantify progress: models improved from around 60% to 87% in about a year, and some frontier models now saturate the benchmark. The key takeaway is that progress is so fast that traditional benchmarks increasingly lag behind.

  4. Side-by-side demo: rebuilding claude.ai in one shot (12 months apart)

    Jeremy demonstrates the same prompt across model generations: rebuild the claude.ai website from scratch. The comparison highlights concrete differences in planning, tool use, correctness, and product completeness—showing why identical prompts now yield dramatically better apps.

  5. Capability shift #1: planning and reasoning before acting

    He explains that older models tended to jump into execution and only consult the “plan” after failing—like assembling IKEA furniture without reading the instructions. Newer models increasingly plan up front, revise their approach mid-plan, and behave more like senior engineers.

  6. Capability shift #2: error recovery and the end of “doom looping”

    Jeremy describes how models used to repeat the same failed fix over and over, wasting tokens and time. Better use of tool feedback and “test-time compute” now helps models adapt after failures, change strategies, and converge faster.

  7. Capability shift #3: sustained attention across long agentic runs

    Long refactors and complex specs previously caused models to lose coherence mid-task, especially over very large contexts. Jeremy claims models now maintain coherence up to ~1M tokens and beyond, enabling much longer, more ambitious end-to-end work with less babysitting.

  8. How these stack into autonomy: the long-horizon agent loop

    He ties planning, recovery, and sustained attention into a single concept: autonomy. Jeremy outlines a long-horizon agent loop—plan, execute, verify against the environment (e.g., tests), iterate, and checkpoint progress—enabling agents to run for hours rather than minutes.
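
The loop described in this chapter (plan, execute, verify, iterate, checkpoint) can be sketched roughly as follows. This is a minimal illustration, not the harness shown in the talk: `run_agent_loop`, `LoopState`, and the four callables are hypothetical names, and a real agent would call a model and a test runner where the stubs are.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopState:
    """Progress that survives across iterations (what gets checkpointed)."""
    completed: List[str] = field(default_factory=list)
    attempts: int = 0


def run_agent_loop(
    plan: Callable[[LoopState], List[str]],    # hypothetical: returns ordered steps
    execute: Callable[[str], None],            # hypothetical: performs one step
    verify: Callable[[str], bool],             # hypothetical: e.g. runs the tests
    checkpoint: Callable[[LoopState], None],   # hypothetical: persists progress
    max_attempts: int = 20,
) -> LoopState:
    state = LoopState()
    while state.attempts < max_attempts:
        # Re-plan on every pass so the plan can change after a failure.
        remaining = [s for s in plan(state) if s not in state.completed]
        if not remaining:
            break
        step = remaining[0]
        state.attempts += 1
        execute(step)
        # Only verified work counts as progress; a failed step is retried.
        if verify(step):
            state.completed.append(step)
            checkpoint(state)
    return state


# Usage with stubs: step "b" fails verification once, then succeeds.
calls = {}
saved = []

def demo_execute(step: str) -> None:
    calls[step] = calls.get(step, 0) + 1

def demo_verify(step: str) -> bool:
    return not (step == "b" and calls["b"] < 2)

final = run_agent_loop(
    plan=lambda state: ["a", "b", "c"],
    execute=demo_execute,
    verify=demo_verify,
    checkpoint=lambda state: saved.append(list(state.completed)),
)
print(final.completed, final.attempts)
```

The `max_attempts` cap is one simple way to keep a long run bounded; the verification step is what lets the loop run unattended, since progress is recorded only when the environment confirms it.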

  9. Case study: Bun rewritten in Rust in a week via tests + long-running agents

    Jeremy recounts a striking example: Bun’s founder used Claude agents over a week to rewrite the JavaScript runtime in Rust to address memory-safety issues, leveraging near-100% test coverage for verification. The story emphasizes that strong test suites unlock outsized agent capabilities and can turn formerly “months-long” projects into work that takes about a week.

  10. What customers report: proofs, longer reasoning runs, and better verification

    He shares feedback from companies using Claude in production: improved planning (including formal-ish proof work), stronger long-horizon reasoning, and higher code quality with verification. These anecdotes support the theme that model gains show up as practical engineering advantages.

  11. Riding the curve (1): build real evals like unit tests for AI

    Jeremy argues that evals are the foundation for adapting quickly to new models. He reframes evals as “unit tests and regression tests of the AI era,” urging teams to start small, measure what matters in their real application traffic, and avoid over-relying on academic benchmarks.
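
To make the "unit tests for AI" framing concrete, here is a small sketch of an eval suite. Everything in it is an assumption for illustration: `EvalCase`, `run_evals`, and `pass_rate` are hypothetical names, and `fake_model` stands in for a real model call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One eval: a prompt plus a programmatic check on the output."""
    name: str
    prompt: str
    check: Callable[[str], bool]


def run_evals(model: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, bool]:
    """Run every case against the model and record pass/fail by name."""
    return {case.name: bool(case.check(model(case.prompt))) for case in cases}


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of cases that passed (0.0 for an empty suite)."""
    return sum(results.values()) / len(results) if results else 0.0


# Stand-in for a real model call; replace with your own client.
def fake_model(prompt: str) -> str:
    canned = {"Return the word OK.": "OK", "What is 2 + 2?": "5"}
    return canned.get(prompt, "")


cases = [
    EvalCase("echo_ok", "Return the word OK.", lambda out: out.strip() == "OK"),
    EvalCase("arithmetic", "What is 2 + 2?", lambda out: "4" in out),
]
results = run_evals(fake_model, cases)
print(results, pass_rate(results))
```

Starting with a handful of cases drawn from real application traffic, as the chapter suggests, keeps the suite cheap to run on every prompt or model change.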

  12. Riding the curve (2): watch for eval saturation and keep raising the bar

    As models improve, evals can become too easy and stop reflecting real progress. Jeremy explains how teams may underestimate model gains because their evals are “in the past,” and recommends continuously updating eval difficulty to preserve signal—while keeping some regression-style tests at 100% as needed.
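
One possible way to spot saturation is to flag evals that have scored near the ceiling for several runs in a row, while exempting the regression tests that are meant to stay at 100%. The sketch below assumes a per-eval score history; the function name, the 0.95 threshold, and the three-run window are illustrative choices, not values from the talk.

```python
from typing import Dict, List, Set


def saturated_evals(
    history: Dict[str, List[float]],  # eval name -> pass rates, oldest first
    regression: Set[str],             # evals expected to stay at 100%
    threshold: float = 0.95,          # assumed ceiling for "too easy"
    window: int = 3,                  # assumed number of recent runs to check
) -> List[str]:
    """Return evals that no longer separate models and need harder cases."""
    flagged = []
    for name, scores in history.items():
        if name in regression:
            continue  # regression tests are supposed to be saturated
        recent = scores[-window:]
        if len(recent) == window and min(recent) >= threshold:
            flagged.append(name)
    return sorted(flagged)


history = {
    "refactor_large_repo": [0.40, 0.55, 0.70],
    "fix_simple_bug": [0.90, 0.97, 0.99, 1.00],
    "no_secrets_in_output": [1.00, 1.00, 1.00],
}
print(saturated_evals(history, regression={"no_secrets_in_output"}))
```

In this example only `fix_simple_bug` is flagged: it has stopped carrying signal, so it is a candidate for harder variants, while `refactor_large_repo` still has headroom.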

  13. Riding the curve (3): benchmark new models and simplify scaffolding over time

    He recommends routinely running new models against evals to choose upgrades based on evidence rather than “vibes.” He also warns that prompts/tools/harnesses tend to grow into brittle “Frankenstein” scaffolding designed for old weaknesses—so teams should prune and audit scaffolding as models get smarter.
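
Choosing an upgrade on evidence rather than "vibes" can be as simple as diffing per-case results between the current model and a candidate. The sketch below is one hedged way to do that; `compare_models`, `should_upgrade`, and the zero-regression policy are assumptions for illustration, and the result dictionaries would come from your own eval runs.

```python
from typing import Dict, List


def compare_models(
    baseline: Dict[str, bool], candidate: Dict[str, bool]
) -> Dict[str, List[str]]:
    """Diff pass/fail results on the eval cases both models were run on."""
    shared = sorted(set(baseline) & set(candidate))
    return {
        "fixed": [n for n in shared if candidate[n] and not baseline[n]],
        "regressed": [n for n in shared if baseline[n] and not candidate[n]],
    }


def should_upgrade(report: Dict[str, List[str]], max_regressions: int = 0) -> bool:
    """Assumed policy: upgrade if something improved and regressions are within budget."""
    return len(report["regressed"]) <= max_regressions and len(report["fixed"]) > 0


baseline = {"parse_invoice": True, "long_refactor": False, "sql_migration": False}
candidate = {"parse_invoice": True, "long_refactor": True, "sql_migration": True}
report = compare_models(baseline, candidate)
print(report, should_upgrade(report))
```

The same diff is useful when pruning scaffolding: rerun the suite with a prompt section or tool removed, and keep the simpler version if nothing lands in `regressed`.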

  14. Riding the curve (4): give models room—thinking time, safe autonomy, and closed loops

    Jeremy closes with three operational practices: allow models to spend compute to reason, enable autonomy with safety checks, and let agents improve agents by running evals and iterating on prompts/tools. The overarching message is to design systems that let Claude work end-to-end while staying controllable and continuously improving.
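
The closed loop mentioned in the last chapter, where an agent iterates on prompts or tools and keeps only what the evals confirm, might look roughly like the sketch below. It is a simple hill-climbing loop under stated assumptions: `improve_prompt`, `propose`, and `score` are hypothetical, and in practice `propose` would be a model call and `score` an eval run.

```python
from typing import Callable, Tuple


def improve_prompt(
    current: str,
    propose: Callable[[str], str],   # hypothetical: suggests a revised prompt
    score: Callable[[str], float],   # hypothetical: eval score for a prompt
    rounds: int = 3,
) -> Tuple[str, float]:
    """Keep a proposed revision only when it scores strictly higher on the evals."""
    best, best_score = current, score(current)
    for _ in range(rounds):
        candidate = propose(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy stand-ins so the sketch runs offline: the "ideal" prompt has 4 characters.
best, best_score = improve_prompt(
    "ab",
    propose=lambda p: p + "c",
    score=lambda p: -abs(len(p) - 4),
    rounds=5,
)
print(best, best_score)
```

Because a change is accepted only on a measured improvement, the eval suite acts as the safety check on what the agent is allowed to keep.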
