The capability curve

Name: The capability curve
Uploaded: 2026-05-08T00:00:00Z
Duration: 15 min 5 s
Description: Claude’s coding performance has improved sharply year-over-year, illustrated by an SWE-bench Verified jump from 62% to 87% and a demo comparing older vs newer model output quality.

Frontier models are getting more capable, fast. Where the curve is going, and what it means for developers building on Claude.

May 8, 2026 · 15 min

CHAPTERS

1:46 – 2:58
Conference pulse: how developers are using Claude more aggressively than last year
Alex Albert opens by contrasting this year’s conference energy with last year’s, noting a shift from early experimentation to higher trust and faster shipping with agentic coding. He frames the talk around rapid month-over-month capability gains and what that means for builders.
- •Last year: Claude Code was newly released; agentic coding was still novel
- •This year: users report higher trust and broader ambition in what they build
- •Core premise: building on models that improve quickly changes best practices
- •Sets expectations for a short, practical talk
2:58 – 3:28
Audience speedup check: perceived 2×–10× productivity gains
He runs a quick show-of-hands survey to quantify how much faster attendees feel Claude makes them. The informal poll establishes that most people in the room use Claude and many report significant acceleration.
- •10× faster: many hands
- •5× and 2×: additional large segments of the room
- •Baseline: nearly everyone uses Claude
- •Motivates the need to adapt workflows to accelerating capability
3:28 – 3:59
Quantifying improvement with SWE-bench Verified (62% → 87%)
Alex introduces SWE-bench Verified as a key measure of autonomous PR completion. He highlights a large year-over-year jump, arguing that the latest model is far more likely to succeed on difficult tasks older models missed.
- •SWE-bench Verified evaluates autonomous software PR completion
- •Sonnet 3.7 scored ~62% about a year ago
- •Opus 4.7 scores ~87% today (a gain of roughly 25 points)
- •Framing: much higher success rate on hard PRs, not just marginal improvements
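
The "not just marginal" framing is easier to see when the same two scores are read three ways. The sketch below is our own arithmetic on the approximate figures quoted in the talk (62% and 87%); the function name `compare_scores` and the dictionary keys are illustrative and not taken from the talk or from SWE-bench tooling.

```python
def compare_scores(old_pct: float, new_pct: float) -> dict:
    """Summarize a benchmark jump as absolute, relative, and failure-rate change."""
    old_fail = 100.0 - old_pct  # share of tasks the older model did not solve
    new_fail = 100.0 - new_pct  # share of tasks the newer model did not solve
    return {
        # Percentage points gained (87 - 62).
        "absolute_gain_points": new_pct - old_pct,
        # Gain relative to the older score.
        "relative_gain_pct": (new_pct - old_pct) / old_pct * 100.0,
        # How much of the older model's failure share disappeared.
        "failure_reduction_pct": (old_fail - new_fail) / old_fail * 100.0,
    }


# Approximate scores as quoted in the talk.
summary = compare_scores(62.0, 87.0)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

With these inputs the absolute gain is 25 points, the relative gain is about 40%, and the unsolved share drops from 38% to 13%, a reduction of roughly two thirds. Because the quoted scores are approximate, these derived figures are approximate as well.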
3:59 – 5:00
Demo setup: recreating claude.ai from a single prompt
He transitions from metrics to a concrete side-by-side comparison, describing a demo that runs the same task across models. The task is intentionally ambitious: recreate claude.ai with one prompt inside Claude Code.
- •Same task, two models, ~12 months of progress
- •Runs within Claude Code with tool use and code generation
- •Goal: make gains tangible beyond benchmark numbers
- •Sets up an output-quality comparison
5:00 – 5:30
Older-model result: generic UI and immediate failure
The first run (Sonnet 4) produces a basic black-and-white chat interface that throws an error as soon as it is used. The takeaway is that earlier models often top out at a superficial UI rather than a working, integrated app.
- •Produces a minimal chat UI
- •Lacks robustness: hits an error immediately
- •Looks acceptable but does not function end-to-end
- •Highlights limitations in integration and execution reliability
5:30 – 6:30
Newer-model result: branded, functional, feature-complete app (and more efficient)
Opus 4.7 generates a much closer claude.ai clone: correct styling, real API responses, chat memory, inline rendering, and dark mode. Alex also notes it achieves better results with fewer lines of code, implying higher efficiency.
- •Matches Claude styling and overall UX more closely
- •Successfully calls the Claude API and returns responses
- •Supports chat history/new chat while retaining old context
- •Renders inline visualizations and includes dark mode
- •Better output with fewer lines of code
6:30 – 7:31
Where gains show up #1: planning before acting (and why to allow it)
Alex explains that older models tended to “act first, think later,” leading to messy execution. Newer models plan more thoroughly upfront, and developers should avoid forcing immediate action to preserve downstream performance.
- •Past failure mode: premature action without strategy
- •New behavior: more deliberate planning and task decomposition
- •Developer guidance: give the model time/space to think
- •Rushing to action can reduce overall quality
7:31 – 8:01
Where gains show up #2: better error recovery vs. ‘doom loops’
He describes how older models could get stuck repeatedly trying failing fixes, wasting context and tokens. Newer models more readily backtrack and try alternate approaches, improving efficiency and success rates.
- •“Doom loop” pattern: repeated failing attempts and spiraling
- •Newer models can backtrack and reframe the problem
- •Outcome: fewer wasted tokens and fewer stalled runs
- •Practical impact: improved task completion reliability
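
To make the "doom loop" pattern concrete, here is a minimal sketch of how a harness could spot it in a run log. This is an illustration only: the talk describes newer models backtracking on their own and does not describe any such detector. The log format (a list of `(action, error)` pairs), the function name `detect_doom_loop`, and the `max_repeats` threshold are all assumptions.

```python
from collections import Counter
from typing import Optional


def detect_doom_loop(
    attempts: list[tuple[str, Optional[str]]],
    max_repeats: int = 3,
) -> Optional[tuple[str, str]]:
    """Return the first (action, error) pair that fails max_repeats times, else None.

    Each attempt is (action, error); error is None when the action succeeded.
    """
    failures: Counter = Counter()
    for action, error in attempts:
        if error is None:
            continue  # successful steps never count toward a loop
        key = (action, error)
        failures[key] += 1
        if failures[key] >= max_repeats:
            return key
    return None


# Hypothetical run log: the same fix fails the same way three times.
run_log = [
    ("patch_import", "ImportError"),
    ("run_tests", "AssertionError"),
    ("patch_import", "ImportError"),
    ("patch_import", "ImportError"),
]
stuck_on = detect_doom_loop(run_log)
print(stuck_on)
```

A harness built around an older model might use a signal like this to stop a run or prompt a different approach. The point made in the talk is that newer models need this kind of outside intervention less often.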
8:01 – 9:01
Where gains show up #3: sustained attention over long agentic runs
Alex notes that older models often lost track of goals or ignored system instructions in long contexts. Newer models maintain coherence across very long runs, reducing the need to micromanage or chunk work.
- •Older issue: losing the plot, forgetting constraints, ignoring system prompts
- •Newer strength: coherence over hundreds of thousands to ~1M tokens
- •Less need to babysit context windows or split tasks
- •Enables longer autonomous agent behavior
9:01 – 9:31
Compounding effects + customer anecdotes (Vercel, Windsurf, Shopify)
He emphasizes that planning, recovery, and long-run attention compound into better end-to-end performance. Customer examples illustrate these behaviors: proving before coding, staying focused in long runs, and iterative refinement.
- •Capabilities compound into materially better end-to-end outcomes
- •Vercel: model writes proofs for systems code before implementation
- •Windsurf: attention sustained through longest agentic runs
- •Shopify: iterative refinement while coding
9:31 – 11:02
Tip 1: Start with evals—match product distribution, avoid saturation, test frontier models
Alex argues improvement begins with measurement: build evals that reflect real user traffic and tasks, keep them challenging as models improve, and regularly test the newest frontier models. Sometimes the best “optimization” is simply upgrading the model.
- •Evals must reflect real product/task distribution, not adjacent benchmarks
- •Avoid eval saturation; increase difficulty as models improve
- •Continuously test on the newest frontier models
- •Model swap alone can deliver major gains
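
The eval advice above can be sketched as a very small harness: a set of cases, a pass-rate function, and a saturation check. Everything here is a stand-in. The two "models" are stub functions with canned answers rather than real API calls, the four cases are toy examples rather than a real product distribution, and the 0.95 saturation threshold is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def pass_rate(model_fn: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of cases whose output passes its check."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for case in cases if case.check(model_fn(case.prompt)))
    return passed / len(cases)


def is_saturated(rate: float, threshold: float = 0.95) -> bool:
    """A near-perfect score means the eval can no longer separate models."""
    return rate >= threshold


# Toy cases; a real eval would sample from actual product traffic.
cases = [
    EvalCase("add 2 and 2", lambda out: out.strip() == "4"),
    EvalCase("reverse abc", lambda out: out.strip() == "cba"),
    EvalCase("uppercase hi", lambda out: out.strip() == "HI"),
    EvalCase("count letters in banana", lambda out: out.strip() == "6"),
]

# Stub "models" with canned answers, standing in for real model calls.
OLDER_ANSWERS = {"add 2 and 2": "4", "reverse abc": "cba"}
NEWER_ANSWERS = {**OLDER_ANSWERS, "uppercase hi": "HI", "count letters in banana": "6"}


def older_stub(prompt: str) -> str:
    return OLDER_ANSWERS.get(prompt, "")


def newer_stub(prompt: str) -> str:
    return NEWER_ANSWERS.get(prompt, "")


older_rate = pass_rate(older_stub, cases)
newer_rate = pass_rate(newer_stub, cases)
print(older_rate, newer_rate, is_saturated(newer_rate))
```

In this toy run the newer stub passes every case, so the saturation check fires. In practice that would be the cue to add harder cases drawn from real usage, so the eval can still distinguish the next model generation.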
11:02 – 12:33
Tip 2: Revisit scaffolding and prompts—remove complexity as models improve
He advises reassessing the surrounding scaffolding (workflows, tools, prompts, skills) that guides the model. Newer models may need less orchestration; simplifying prompts and workflows can improve performance and reduce token costs.
- •“Scaffolding” includes workflows, tools, prompts, and skill structures
- •Newer models may not need multi-step workflows; one thread may suffice
- •Performance can improve by removing, not adding, layers
- •Prompt stacks accumulate cruft; prune with each model generation
- •Simplification can also reduce token usage
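
One way to act on the pruning advice is a simple ablation: drop one prompt section at a time, re-run the eval, and flag the sections whose removal does not lower the score. The sketch below assumes a scoring function is already available. The stub `stub_score`, the section names, and the `tolerance` parameter are hypothetical and stand in for a real eval run against a real model.

```python
from typing import Callable


def prune_candidates(
    sections: dict[str, str],
    score_fn: Callable[[dict[str, str]], float],
    tolerance: float = 0.0,
) -> list[str]:
    """Names of sections that can be removed, one at a time, without hurting the score."""
    baseline = score_fn(sections)
    removable = []
    for name in sections:
        trimmed = {k: v for k, v in sections.items() if k != name}
        if score_fn(trimmed) >= baseline - tolerance:
            removable.append(name)
    return removable


# Hypothetical prompt stack accumulated over earlier model generations.
sections = {
    "role_preamble": "You are a meticulous senior engineer...",
    "step_by_step_workflow": "First list the files, then plan, then edit...",
    "output_format": "Reply with a unified diff only.",
}


def stub_score(active: dict[str, str]) -> float:
    """Stand-in for an eval run: in this toy, only the output format matters."""
    return 0.9 if "output_format" in active else 0.6


candidates = prune_candidates(sections, stub_score)
print(candidates)
```

Each section is tested on its own here, so two sections that are individually removable may not be removable together. Re-running the eval after each actual deletion guards against that, and counting tokens before and after shows the cost side of the simplification.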
12:33 – 14:04
Tip 3: Give the model room—adaptive thinking, safer tool access, and closed-loop iteration
Alex recommends letting Claude decide when to think (via effort/adaptive thinking), expanding tool access in controlled ways, and closing the loop so agents can inspect and refine their own outputs. He cites Claude Code Auto Mode as a pattern for safe autonomy using classifiers and approval gates.
- •Use adaptive thinking and the effort parameter to balance thinking vs actions
- •Allow broader tool access, but with safety controls and approvals
- •Claude Code Auto Mode: classifiers determine when human approval is needed
- •Design for iteration: let the agent verify and improve its own outputs
- •Example: give a frontend agent computer-use tools to test and debug interactively
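
Two of the patterns above, approval gates and closed-loop iteration, can be sketched in a few lines. This is not how Claude Code Auto Mode is implemented: the talk says it uses classifiers, whereas the sketch substitutes a hard-coded rule set. The tool names, the `workspace/` path rule, and all function names are hypothetical.

```python
from typing import Callable, Optional

# Hypothetical rule set standing in for a learned risk classifier.
RISKY_TOOLS = {"shell", "delete_file", "deploy"}


def needs_approval(tool_name: str, args: dict) -> bool:
    """Flag tool calls that should pause for a human decision."""
    if tool_name in RISKY_TOOLS:
        return True
    # Writes outside the workspace are treated as risky. A real gate would
    # normalize the path first; a plain prefix check is only a sketch.
    path = str(args.get("path", ""))
    return tool_name == "write_file" and not path.startswith("workspace/")


def run_tool_call(
    tool_name: str,
    args: dict,
    execute: Callable[[str, dict], str],
    approve: Callable[[str, dict], bool],
) -> str:
    """Execute a tool call, asking for approval only when the gate requires it."""
    if needs_approval(tool_name, args) and not approve(tool_name, args):
        return "denied"
    return execute(tool_name, args)


def refine_until_verified(
    generate: Callable[[Optional[str]], str],
    verify: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> tuple[Optional[str], int]:
    """Closed loop: generate, verify, and feed failures back until checks pass."""
    feedback: Optional[str] = None
    for round_number in range(1, max_rounds + 1):
        candidate = generate(feedback)
        feedback = verify(candidate)  # None means the checks passed
        if feedback is None:
            return candidate, round_number
    return None, max_rounds


# Toy usage: the first draft fails verification, the second passes.
result = refine_until_verified(
    generate=lambda fb: "v1" if fb is None else "v2",
    verify=lambda candidate: None if candidate == "v2" else "button does not respond",
)
print(result)
```

In the frontend example from the talk, `verify` would be the step where the agent uses computer-use tools to exercise the page and report what broke. The gate is what lets broader tool access stay safe, because only the calls it flags wait on a person.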
14:04 – 15:05
Wrap-up: building on the capability curve
He closes by reiterating the theme: model capability is improving rapidly and developers should adapt their measurement, scaffolding, and autonomy patterns accordingly. Alex invites attendees to connect afterward to share feedback and needs.
- •Central message: practices must evolve alongside fast-moving model gains
- •Key levers: evals, simpler scaffolding/prompts, and enabling autonomy safely
- •Encourages ongoing experimentation with new model releases
- •Invitation to chat and provide feedback after the talk
