Claude

The capability curve

Frontier models are getting more capable, fast. Where the curve is going, and what it means for developers building on Claude.

May 8, 2026 · 15m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. SP

    [upbeat music]

  2. SP

    Please welcome to the stage member of technical staff of Anthropic, Alex Albert.

  3. SP

    [upbeat music]

  4. SP

Hello, everybody. All right, let me set my water down here. I'm Alex Albert, one of our research PMs here at Anthropic. I just heard from the stage manager, Loam, that there's a happy hour that just started, so I want to thank all of you for choosing me over drinks and snacks. I promise this will be worth your fifteen or twenty minutes.

It's been really interesting comparing the vibes at this year's conference to last year's. Last year, Claude Code wasn't even GA'd; it had been out for around two or three months, and folks were just starting to get used to this paradigm of agentic coding. These days, as I talk to folks here, it feels like people are doing a lot more. They're trusting Claude. They're shipping things faster with Claude. They're building things they weren't building before, things that haven't been possible to build on these timescales.

Actually, out of curiosity, I want to run a little exercise. I'm going to ask a question, and if it applies to you, please raise your hand and keep it up. I want to get a sense of the room. Raise your hand if you feel like Claude has allowed you to go ten times faster than what you were doing a year ago. Okay. Wow, that was a lot. Keep it up. How about five times faster? All right, two times faster. Raise your hand if you use Claude. There we go. Okay. Love that.

One of the best ways we've measured Claude's coding intelligence, and how we've seen Claude make this impact on folks, is through a benchmark called SWE-bench Verified, which measures the model's ability to autonomously complete software PRs. About a year ago, our model at the time, Sonnet 3.7, scored sixty-two percent on this eval. Today, Opus 4.7 scores eighty-seven percent. That's a twenty-five-point jump in just over a year. To put it another way, the failure rate dropped from thirty-eight percent to thirteen percent, so Opus 4.7 is roughly three times less likely to fail on the kinds of difficult PRs that were tripping up Sonnet 3.7 a year ago.

Now, numbers are great, but examples are even better. So to make this concrete, I have a quick demo we put together of the same task run twelve months apart. We're going to compare Sonnet 4 to Opus 4.7 on the same task. Let's get this demo working; give it a second to see if it's playing. No, still nothing. Well, there was a demo here, and it was really, really cool, so you're going to have to believe me on that. Oh, here we go.

Okay, here's Sonnet 4 working within Claude Code. The task we gave it was to recreate claude.ai with a single prompt. You can see how it does: we get a really generic black-and-white chat application. We enter a prompt, fire it off, and immediately hit an error, so it doesn't really even work. Basically, it just made a nice little UI for us.

Now let's run the same task with Opus 4.7 and see how it does. Same setup: it's running within Claude Code, writing a bunch of lines of code, calling a ton of tools, and eventually we get an output. And immediately, this output looks better. We have the Claude color scheme; it knows that already. We actually get a response back from the Claude API when we send it a message. We can start a new chat, and it still remembers the old chat, so it's keeping track of things. It even renders visualizations inline like the claude.ai app does. And like a true developer, it even implemented dark mode. In addition to being a better output, it did it in fewer lines of code, so it's more efficient as well.

Now, this talk isn't about any one of these models in particular. What I want to focus on is what it means to build on something that's getting meaningfully better like this, month over month. Before I get into specific tips, I want to look at where these model gains are actually landing, starting with Claude's planning ability.

Older models had a particularly bad failure mode where they would act first and think later. It's kind of like me building a new set of IKEA furniture: I go straight into it, and only once it's a mess do I come back and think, "Ah, maybe I should actually read the instructions." I don't know if anybody else is the same, but that was basically Sonnet 3.7. Newer models are much more thorough. They take their time upfront to think about the problem, strategize a little, plan out what they're going to do, and then dive into writing those lines of code. What this means for you as a developer is that you should give Claude that time to think about the problem before it acts. Don't force it to jump straight into action, because that can reduce your downstream performance.
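To make that concrete, here's a minimal sketch of "give Claude time to think" using the Anthropic Python SDK's extended-thinking option; the model id and token budgets are illustrative, not a prescription:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Let the model plan first: reserve part of the response budget for
# thinking instead of forcing it straight into an answer.
response = client.messages.create(
    model="claude-opus-4-7",  # illustrative model id
    max_tokens=8192,          # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 4096},
    messages=[
        {"role": "user", "content": "Plan and implement a rate limiter for our API."}
    ],
)

# The response interleaves thinking blocks (the plan) with text blocks
# (the answer); let the plan finish rather than truncating it.
for block in response.content:
    if block.type == "thinking":
        print("[plan]", block.thinking[:200], "...")
    elif block.type == "text":
        print(block.text)
```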
The second major area where we're seeing gains is in Claude's error recovery. Previous models would hit something we often called a doom loop: the model runs into a problem, proposes a solution, the solution doesn't work, and suddenly it's really stuck, spiraling and spiraling until the context stalls out and you have to clear everything. Newer models are much more adaptable. If they hit a problem, they can backtrack, think about it a different way, and take a different path. What this means for you as a developer is better task performance with fewer wasted tokens, because Claude isn't getting caught in those loops.

The final area where we're seeing gains is Claude's attention over long runs. Older models would often lose the plot as they worked on something: they'd start forgetting things, and the instructions you put in the system prompt stopped getting attention as the thread went on. Newer models hold coherence across runs. They remember those system-prompt instructions and stay focused over hundreds of thousands, or even a million, tokens. For your applications, this means you don't need to babysit the context window as much. You don't need to chunk up work, and you can trust Claude to operate autonomously on long runs.

Adding these things up, we have better planning, better error recovery with fewer failures, and agents running for longer, and this compounds into better end-to-end task performance.

Our customers are seeing this as well. Vercel saw Opus 4.7's planning go as far as writing proofs for systems code before implementing a single line. Windsurf saw in their evals that Claude sustained its attention over their longest agentic runs. And Shopify found that as the model coded, it went back and iteratively refined its outputs.

So how can you, as a developer, see in your applications what our customers are seeing? Somewhat counterintuitively, it starts not with your application but with your evals. If you can measure something, you can improve it, so it's important that, A, you have evals, and B, those evals measure something close to the product distribution you actually want to improve on. That might sound obvious, but it's often what I see lead developers astray: they eval their product on something adjacent to their use case rather than on their actual task distribution. For example, they have a coding agent, and instead of evaling it on their own traffic, or traffic with a similar pattern to what their users are doing, they eval it on an academic coding benchmark.

The second thing you want to do, once you have those evals, is make sure they're not saturated. As models get smarter, evals need to keep getting harder in order to extract signal from frontier models. Make sure the evals you're building grow alongside the model, so you keep getting new signal as more intelligent models come out.

And finally, once you have those evals and you've ensured they're not saturated, test them on the newest frontier models. I've found that sometimes the best optimization you can make to your app is simply swapping in the latest model, so it's worth spending a meaningful amount of time testing and trying out models as they come.
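A minimal sketch of that advice, assuming the Anthropic Python SDK: a tiny harness that scores models on cases drawn from your own product traffic rather than an academic benchmark. The cases, pass checks, and model ids here are all illustrative.

```python
import anthropic

client = anthropic.Anthropic()

# Illustrative eval cases sampled from your product's real task
# distribution; each pairs a user-style prompt with a pass/fail check.
CASES = [
    {
        "prompt": "Write a SQL migration adding a nullable archived_at column to orders.",
        "passed": lambda out: "alter table" in out.lower() and "archived_at" in out,
    },
    # ... more cases drawn from real user traffic
]

def run_eval(model: str) -> float:
    """Return the pass rate of `model` over the product-distribution cases."""
    score = 0
    for case in CASES:
        msg = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": case["prompt"]}],
        )
        text = "".join(b.text for b in msg.content if b.type == "text")
        score += case["passed"](text)
    return score / len(CASES)

# Re-run the same harness whenever a new frontier model ships. Scores
# clustering near 100% mean the eval is saturated: add harder cases.
for model in ("claude-sonnet-4-5", "claude-opus-4-7"):  # illustrative ids
    print(model, f"{run_eval(model):.0%}")
```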
The second tip is to take a second look at your scaffolding. By scaffolding I mean the code, the prompts, the skills, and the tool setups, everything that surrounds the model and directs it toward its goals. With newer models, you might not need some of the things you needed before. Maybe instead of a multi-step workflow, you can just let the model work on the task in one thread. Often you can actually boost performance by removing things from your scaffolding instead of adding to it.

Alongside your scaffolding, take a second look at your prompts. Prompts build up model over model, and after enough time and enough model generations they become a hideous mess of rules and instructions where you're no longer sure why you added something or what it's doing for the current model. With every new model, revisit those prompts and cut what's no longer needed. This helps your task performance and saves you tokens as well.

The third tip is to give the model room to work. As I mentioned earlier when talking about planning, it's important to let Claude choose when to think. Use adaptive thinking, and dial the number of tokens Claude spends thinking and the number of actions it takes with the effort parameter.

The second part of this tip is to allow your agents more access to tools, in a controlled way. Now, some of you hearing that might get a little nervous, and I'm not saying to just let it do anything. But there are methods that let Claude execute on more systems safely. One example is Claude Code's Auto Mode. In one of our recent engineering blog posts, we talked about how Auto Mode runs classifiers over the tool calls Claude is proposing, determining whether each call needs explicit human approval or not. That lets Claude run in the background for longer and work more autonomously without a human needing to step in. A toy version of that gating pattern is sketched below.
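This isn't Anthropic's actual classifier, just an illustrative stand-in for the pattern: classify each proposed tool call and only pause for human approval on the risky ones. All tool names and rules here are hypothetical.

```python
# Hypothetical tool-call gate in the spirit of Auto Mode: cheap rules
# stand in for the real classifiers described in the talk.
RISKY_SUBSTRINGS = ("rm ", "sudo ", "git push", "curl ", "chmod ")
READ_ONLY_TOOLS = {"read_file", "list_files", "grep"}

def needs_approval(tool_name: str, tool_input: dict) -> bool:
    """Return True if this proposed tool call should wait for a human."""
    if tool_name in READ_ONLY_TOOLS:
        return False  # read-only calls can run unattended
    if tool_name == "bash":
        command = tool_input.get("command", "")
        return any(s in command for s in RISKY_SUBSTRINGS)
    return True  # unknown tools default to asking

def dispatch(tool_name: str, tool_input: dict) -> str:
    """Stub dispatcher; a real agent would route to actual tool handlers."""
    return f"(ran {tool_name})"

def execute_tool_call(tool_name: str, tool_input: dict) -> str:
    # Gate first, then dispatch: approved calls run unattended, so the
    # agent can keep working in the background.
    if needs_approval(tool_name, tool_input):
        answer = input(f"Approve {tool_name}({tool_input})? [y/N] ")
        if answer.strip().lower() != "y":
            return "Tool call denied by user."
    return dispatch(tool_name, tool_input)
```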
The final tip is to make sure you're closing the loop for your agent. Design your system so Claude can inspect its own outputs and iterate on them. Going back to the coding-agent example: if you have an agent working on front-end applications, consider giving it a computer-use tool so it can click around the site and QA-test the features and bug fixes it writes. Models are continuously getting better at verifying and iterating on their outputs, so it's important to give them the affordances to do so.
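A minimal sketch of such a closed loop, assuming the Anthropic Python SDK, with a test suite as the verification signal instead of computer use. `apply_patch` is a hypothetical helper, and the model id is illustrative.

```python
import subprocess
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-opus-4-7"  # illustrative model id

def run_tests() -> str:
    """Run the project's test suite and return its combined output."""
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.stdout + result.stderr

messages = [{"role": "user", "content": "Fix the failing tests in billing.py."}]
for attempt in range(3):  # bounded iterations, not an open-ended loop
    reply = client.messages.create(model=MODEL, max_tokens=4096, messages=messages)
    text = "".join(b.text for b in reply.content if b.type == "text")
    apply_patch(text)  # hypothetical helper that applies the proposed edit
    report = run_tests()
    if "failed" not in report:
        break  # verified: the agent saw its own work succeed
    # Close the loop: feed the verification signal back to the model.
    messages += [
        {"role": "assistant", "content": text},
        {"role": "user", "content": f"Tests are still failing:\n{report}\nPlease revise."},
    ]
```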

And with that, that's a quick look at the capability curve. I want to thank everybody for coming out today. I know this is the last talk of the day. I'll be hanging around at the reception, so please come find me; I'd love to chat about how we can make Claude better for you. Thank you. [audience applauding] [upbeat music]

Episode duration: 15:05

Transcript of episode tP4MGcJ80Y0
