EVERY SPOKEN WORD
25 min read · 5,005 words
- SPSpeaker
[on hold music]
- SPSpeaker
Hey everyone, I'm Jeremy, and I am a product manager here at Anthropic with our research team, and I work on coding behaviors and capabilities. In other words, how do we make Claude a better software engineer? I'm here to talk to you today about the capability curve, or this pretty wild curve we've been on from the last few years all the way up to now. Um, it's been pretty cool walking around the conference today and just comparing it to how it was last year. Last year, around the same time, I had just started at Anthropic in March of 2025, and we had just launched Claude 4 right before Code with Claude. Um, so we'd all been up late until, you know, 3 AM the night before launching this model, and the energy was just electric at the conference. Back then, Opus 4 was state-of-the-art, and Clo-Claude Code had sort of just launched. It wasn't even GA yet, and it hadn't really taken off completely. It's pretty surreal just how much has changed since then, um, where Opus 4 is a dinosaur, almost a distant memory, and models are so much better. Claude Code is everywhere, and coding agents have completely revolutionized how we do software. Um, so I'm curious, who here has shipped a PR that is mostly written by Claude? Raise your hands. Who here has shipped a PR in the last week that was completely written by Claude? That's almost half the room. Who here has shipped a PR where they did not read the code at all, it was completely written by Claude? Nice. [laughs] You know, it's a dangerous game to do this, um, and you have to do it carefully and do it well, but this has completely changed how we do software. Um, and you know, our CEO, Dario, has talked about how most software at Anthropic is now written by Claude. Claude has written most of the code in Claude Code. 
Um, and so we're just in a completely new paradigm, and I wanna talk to you about how things have changed and how we're thinking about this going forward, and how you can adapt to this curve and use it to make your applications better and better over time. So how we think about this, um, we often think about sort of benchmarks to measure progress, and one of the best benchmarks that we've had over the last few years for measuring progress in software engineering has been SWE-bench Verified. This benchmark is composed of GitHub issues, and it essentially tests can the model solve the GitHub issue and pass all the tests accurately? Um, and you can see here that Sonnet 3.7 was just at about sixty percent last year, um, and now Opus 4.7 has passed eighty-seven percent on this benchmark. And so essentially what this means is that models have gone, um-- they're solving three times more issues than they did in the past, um, and they're just advancing rapidly on this kind of benchmark to the point where we don't even use SWE-bench Verified anymore because our most frontier models, like Mythos Preview, have completely saturated the benchmark. Essentially, there's no more room to improve. Um, in general, we're starting to move faster than benchmarks can come out, um, and that makes it harder to measure this kind of progress. But this just shows how, um, over the course of just twelve months, Claude has gone from a junior software engineer who can only solve a fraction of GitHub issues to almost a senior software engineer that can solve pretty much any well-specified GitHub issue that is presented to it. Um, and now sort of the bottleneck has moved elsewhere, um, beyond just solving PRs to more tricky things. So even better than a benchmark is a demo, and so I wanna show you a demo of the same task twelve months apart from Sonnet 3.7 up till now. So I'll show the demo. So essentially in this demo, um, we compare Sonnet 4 to Opus 4.7. 
Um, and in this task, we essentially just ask Claude, "Can you rebuild the entirety of the claude.ai website from scratch in one shot?" So our claude.ai website, um, you know, it's taken many software engineers a lot of time to build it. Here we can see Sonnet 4 will take this task and start building. It'll start writing code. Um, it won't really plan that much in advance. It won't necessarily correct its approach as it's working. It writes two thousand lines, and it produces this sort of demo, um, that sort of works, um, but it's a pretty basic UI. And you can see that, you know, the chat doesn't work at all. You know, it doesn't actually, um, produce a response. But then with Opus 4.7, we try the same task, the same prompt today, and the model is able to start working. Um, it uses a bunch of tools. It uses the same sort of approach, um, but it has to write fewer lines to accomplish a similar result. You can see it's only at seventeen hundred, and the application itself is better. Um, you can see it looks more similar to claude.ai. It actually produces a completion. There's the chat sidebar. It has a chat history. It can even produce the diagrams, um, and the chat sort of input works. Um, it can create these formatted outputs, and it can even make mermaid diagrams within the chat. Um, so Opus 4.7 is just dramatically better at the same task and can produce a full working web application. It even added dark mode like a true developer. So this is just one instance of how giving the same models the same task or giving different models the same task just completely changes the outputs that you're getting. Um, and I think if you try any task that you maybe struggled with twelve months ago to get models to do successfully, you're gonna see a huge difference. 
And this just shows how the foundation under our feet is shifting as developers, and we have to adapt to that. So I wanna talk about some of the areas that are driving these gains in intelligence improvement and some of the biggest changes we've seen in models over the last 12 months. So the first area where the gains have been landing is in planning and reasoning before acting. I don't know how many people here remember what Sonnet 3.7 was like. Raise your hand if you remember using Sonnet 3.7. Yeah, so Sonnet 3.7 sort of, um, acted like I might act when making IKEA furniture. Um, you know, I just jump right into it, I start building, and then I look at the plan after I've already tried and failed. Um, it didn't really plan in advance. It didn't really go into the task, um, with a plan set out. And so this failure mode that most of our models had used to be acting first and then thinking later. And what's changed over the last 12 months is that rather than having to sort of scaffold the models really carefully and force them to plan, models will plan on their own. In other words, they will read before taking action, they'll compose a careful plan that has a high likelihood of success, and they'll figure out and investigate before they start taking action. Um, they'll often also catch their mistakes as they're writing a plan, and so you'll notice as the models reason through, they'll say things like, "Actually," or, "Never mind," um, and change their approach as they're developing a plan. That means that it doesn't take as much work for the model to sort of iterate as it's building the application because it's already developed the spec in advance like a senior software engineer might. And so what this means for you is that as you're building with Claude, you should give it time to think and time to develop this plan. 
You may not need to force it into doing this, and all you need to do is sort of select a high reasoning effort and then allow Claude to develop the plan on its own. Another big area where we've seen improvements in Claude models over the last 12 months is in error recovery and adapting to failure. So you might remember about 12 months ago, you know, all sorts of models had this issue of doom looping. Doom looping is essentially the problem of having a failure and then attempting a solution and, you know, Claude will tell you, "Aha, I've got the problem. I fixed it." And then you look at the problem, and it's just sort of repeated the same solution again. And often it would fall into this trap of trying the same problem and then repeating the same solution or small variations on the same solution again and again. This problem essentially doesn't happen anymore. And so we've sort of solved dupe lo-- doom looping for the most part because models are able to try a tool call, um, try some action, receive the results from the environment as tool results, and then based on that, reason about what to do, use some thinking tokens, use some test time compute to spend more computation and figure out how it should actually respond to that failure. So now rather than just trying the same thing after encountering an error, the model will change their approach and keep executing through failures to accomplish the task. So what this means for you is that you get better task per-per-performance with fewer wasted tokens. So, you know, rather than a Sonnet model or an Opus model repeating the same task again and again and spending a bunch of tokens without even giving you a good result, now the model will accomplish the same result while trying only a couple times and iterating from its failures. 
And so giving the model the ability to iterate from failures, some way to get feedback from the environment, and some way to reason from that feedback will result in, in better outcomes than it would have 12 months ago. The third biggest area where we've seen a lot of improvements in Claude models over the last 12 months is sustained attention over long agentic runs. So about 12 months ago, if you tried a model and tried to get it to, you know, work across an entire code base and do some refactor, um, and you had to use hundreds of thousands of tokens to do that, you would notice that it started to lose the plot partway through the task. Um, it might forget what it's doing. It may forget details or fine points about how to accomplish the task. If you gave it a complex spec with, you know, dozens of instructions, it may miss, um, many of those instructions or not remember how to accomplish it. Um, now at this point, our models can hold coherence up to one million tokens and even beyond that point. And so what that means is that if you give the model a spec at the beginning of the task, it won't just forget it partway through. And for the most part, um, you know, up to some limit of complexity, the model can actually remember a spec and carry it out over the course of millions of tokens. Um, what this means for you is that you no longer have to necessarily break up tasks into these tiny bite-sized pieces. You ne- you don't necessarily have to break up these tasks, tasks into individual context windows, and you don't have to sort of babysit as much and think about, "Oh, you know, Claude is already at two hundred thousand tokens. I have to stop the task now." Now you can sort of just let Claude run and trust that the model and the harness are able to work for millions of tokens without necessarily having failures. Um, we're not there yet in terms of the models having perfect coherence over millions of tokens, but we're much, much closer than we were 12 months ago. 
So that means you should be more ambitious with your tasks. Don't assume that Claude, you know, can't handle something because it's very long-running. Um, you can hand it the whole code base and see what it can do rather than sort of limiting your ambition before starting the task. So together, all of these improvements stack into more autonomous agents. Autonomy is really composed of these different capabilities. You know, autonomy means that you're able to plan in advance and think about how to accomplish the task. It means that that plan sets you up for success. It means that when you run into failures, you can recover from those failures and keep working despite seeing errors. And it means that you sort of remember what you're doing partway through the task. And so you can see how these capabilities that we, we've been working on improving all ladder up into autonomous agents that can do end-to-end task completion, combining planning, failure recovery, and long-horizon coherence. Overall, our agents are now able to run for many hours rather than just a few minutes. So yeah, long horizon, long-horizon agents are where we are no- at now. And you can see that how essentially a long-horizon agent loop works is that it starts with a plan, it starts executing, and then it needs some way to verify its work against the environment. So it may run the tests, it may confirm that the tests are passing. If the tests aren't passing, it fi- it'll figure out how to iterate and make them pass, and it can do that over a very long period, and every few checkpoints it might validate that against a goal. One of the most exciting examples I've seen of this recently is one of my coworkers, who is the founder of Bun, um, which is essentially o- one of the core sort of infrastructure pieces behind Claude Code. He decided one day, I'm really tired of dealing with these memory errors that, you know, the JavaScript engine behind Bun constantly runs into. 
What if I rewrote the entire engine in a memory-safe language like Rust? He decided this basically last week, um, and because Bun has a great test suites-- test suite with nearly 100% coverage of the entire engine, he was able to get Claude to run over the course of an entire week and rewrite all of Bun in Rust in one week to get 100% pass rate almost on the entire test suite. And then he merged this PR, and Bun is now written in Rust. This happened in a single week. Um, it's hard to like, uh, you know, articulate how mind-blowing that is. Um, for me, I think that it's incredible how much Claude can do if you give it something that really verifies the entire software system. So, you know, the only way that Jared, the founder of Bun, was able to do this was because they already had a great test suite and because he had the ambition to ask, could Claude actually do this? And then the ambition to actually try that and run it against the whole test suite. And so this level of software project that, you know, if Jared had done it on his own, he doesn't even know Rust, um, and yet he was able to do this regardless. Um, you know, this would've taken him many months to do in the past, and at this point, he was able to do it in a week as a single individual, having just many Claude agents, you know, iterate against the test suite. So this is the world that we're living in now, where long-horizon agents can accomplish software projects that would take individuals months to do. Um, and this is not really slowing down. What we're seeing is that agents are getting better and better, and individual software engineers are able to accomplish more than they've been able to in the past. Some examples from our customers, um, you know, for example, Vercel has seen that on the planning point, um, models will sometimes do proofs on systems code before they even start the work. 
So in-- as they're in this planning stage, they'll actually write proofs and verify the system before they start the task. Um, similarly, Windsurf found that on the long-horizon point, um, our models have been much more capable of operating with sustained reasoning over their longest agentic runs. Um, and they're seeing that it's sort of market-leading in the ability to just have coherence over a very long time horizon over many hours. Shopify also found that with Opus 4.7, it was a big step up in intelligence, especially in this code quality and the ability to verify its work as it goes and sort of fix up things as it's working. In general, every new model, we sort of hear things from our customers around these kinds of capabilities, how they're becoming better at planning, better at verifying, better wo- at working over many hours. So how do you actually ride this curve? Um, it's not really about any individual model. It's not about sort of Opus 4.7 or Opus 4.6. It's about this overall trajectory that we're on, where the sh- ground is sort of shifting beneath our feet and the foundation that we're building on is becoming more and more intelligent over time. Every couple months, the models are becoming significantly more intelligent, and that really should change how we think about building applications. So here are some of the things that I've learned from working with our customers and working on our models over the last few months. Um, there are a few patterns that I think allow you to absorb these improvements in model intelligence and really translate them into benefits for your own productivity, for your internal company use, as well as for your end customers. First of all, evals are really critical. And so one of the things that I've seen allows teams to iterate and build with new models the, the most rapidly is by having high-quality evals that they can actually trust. 
And so the first step is just to build evals at all. Um, we have a blog post on our engineering blog that is essentially a, a guide to how do you build evals, but the first step is really just to start. I see a lot of teams are sort of afraid to get started with evals because they seem like an academic exercise that might take a lot of work or that they might need to hire researchers to do. But essentially, evaluations are just the unit tests and the regression tests of the AI era. So every software application that uses AI should have evaluations, and if you don't, it's similar to not having unit tests for your traditional application. Um, and so it's really critical to just start by building some form of an, an evaluation. Um, another important point when building evaluations is to make sure that they measure what you actually care about, and that means building evals that match your real traffic and test behaviors that you want to see in production. Something I often see, uh, with customers is, you know, using a academic benchmark, you know, something like SWE-bench Verified or BrowseComp or TerminalBench, rather than using an application that actually measures the use case they care about. So for example, if you're building a finance agent, it's best to sort of collect failure modes from your customers, see what's failing, what's succeeding with your application, and then build those into your evaluation so it measures the kinds of tasks your application actually does. Another important point is to know when your evals are saturated. What we mean by saturated is essentially that there's no more room for improvement on the evaluation. So, you know, if Opus 4.7 can already get ninety percent on the evaluation and the last ten percent of tasks is impossible or unfair or just no model can get it, uh, then that means that the eval is saturated, and that means that it's no longer useful for measuring model improvements. 
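[Editor's sketch] The saturation idea described here can be made concrete with a small check. This is a minimal illustration, not anything shown in the talk: the function name `saturation_report`, the `unfair_tasks` set (tasks a human has reviewed and judged impossible or unfair), and the 0.95 threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class SaturationReport:
    best_model: str
    best_pass_rate: float      # pass rate of the best model on fair tasks only
    unsolved_fair_tasks: list  # fair tasks the best model still fails (headroom)
    saturated: bool


def saturation_report(results, unfair_tasks=frozenset(), threshold=0.95):
    """results maps model name -> {task_id: passed}.

    unfair_tasks and threshold are hypothetical knobs for this sketch.
    """
    all_tasks = sorted({t for per_task in results.values() for t in per_task})
    fair_tasks = [t for t in all_tasks if t not in unfair_tasks]
    if not fair_tasks:
        raise ValueError("no fair tasks left to score")

    def rate(per_task):
        # Missing tasks count as failures.
        return sum(bool(per_task.get(t, False)) for t in fair_tasks) / len(fair_tasks)

    best_model = max(results, key=lambda m: rate(results[m]))
    best_rate = rate(results[best_model])
    unsolved = [t for t in fair_tasks if not results[best_model].get(t, False)]
    return SaturationReport(best_model, best_rate, unsolved, best_rate >= threshold)


# Made-up results: ten tasks, the newer model passes nine of them.
tasks = [f"t{i}" for i in range(10)]
results = {
    "older-model": {t: i < 8 for i, t in enumerate(tasks)},
    "newer-model": {t: i < 9 for i, t in enumerate(tasks)},
}
report = saturation_report(results, unfair_tasks={"t9"})
print(report)
```

In this made-up data the raw score is ninety percent, but once the one task judged unfair is excluded there is nothing left for a better model to gain, which is the situation the talk calls saturated.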
One trend that I've seen, um, in working on our models during, during sort of early testing is that customers often think that the model is not that much of an improvement at first if they use their preexisting evals and those evals are already saturated. So they might run their eval and see, you know, only a one percent improvement, and they think, "Oh, this model isn't that great." But then they spend another week testing it on harder and harder tasks, and they realize, "Oh, no, actually, our eval is in the past. Our eval does not measure model improvements anymore, and so we need to change our eval to actually see the gain from the new model." Um, so you have to sort of keep raising the bar to make sure that your evaluation can meta- measure model progress. Um, of course, like software tests, you might have some tests that are actually intended for regressions, and for those, you might accept, you know, having a one hundred percent pass rate because you expect every model to be able to do the task. But for evaluations that you want to use to actually measure, you know, are models improving your application, you want them to be unsaturated so that as models improve, you can see that gain in your application and in your evals. Finally, what you wanna do is actually benchmark new models on these evaluations. In general, this is what allows you to test models quickly because it means that you can just sort of kick off a script and then see the eval results, and then trust that if the model is performing better, then you should plug it into your application rather than having to, you know, read all of Twitter, see what the vibes are about the model, test it yourself over many weeks. Um, and so companies that tend to have evals tend to be the fastest at adapting, adapting to new models. 
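[Editor's sketch] The "kick off a script and then see the eval results" workflow might look roughly like the following. It is only an illustration under stated assumptions: each model is represented by a plain Python callable, and the two stub functions stand in for real API calls, which are not shown in the talk. The names `pass_rate`, `compare`, `current_model`, and `candidate_model` are hypothetical.

```python
from typing import Callable

# An eval case is a prompt plus a grader that inspects the model's output.
EvalCase = tuple[str, Callable[[str], bool]]


def pass_rate(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of cases whose grader accepts the model's output."""
    passed = sum(1 for prompt, grader in cases if grader(model(prompt)))
    return passed / len(cases)


def compare(models: dict[str, Callable[[str], str]],
            cases: list[EvalCase]) -> list[tuple[str, float]]:
    """Score every model on the same cases, best first."""
    scores = {name: pass_rate(fn, cases) for name, fn in models.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy cases; a real eval would be built from production traffic and failures.
cases: list[EvalCase] = [
    ("What is 2 + 2?", lambda out: out.strip() == "4"),
    ("Spell 'cat' backwards.", lambda out: out.strip().lower() == "tac"),
]


# Stubs standing in for calls to two different models (assumption).
def current_model(prompt: str) -> str:
    return {"What is 2 + 2?": "4"}.get(prompt, "cat")


def candidate_model(prompt: str) -> str:
    return {"What is 2 + 2?": "4", "Spell 'cat' backwards.": "tac"}.get(prompt, "")


ranking = compare({"current": current_model, "candidate": candidate_model}, cases)
for name, score in ranking:
    print(f"{name}: {score:.0%}")
```

Swapping the stubs for real model calls leaves the rest unchanged, so trying a new model becomes a one-line change plus a rerun.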
And what we've seen is that this is a big competitive advantage because often the biggest improvement you can make to your application is using the best model for your application and the most frontier model. And so if you can't adapt to them because you don't know which model is best and you don't have evals, you're going to be slower than competitors who do. Another key thing that we've learned over the last few years is that you should shrink your scaffolding over time. What I mean by scaffolding is essentially everything that goes around the core LLM. So the LLM is the intelligence engine, and then around the LLM are your prompts, your tools, the execution environment, your skills, essentially all of the scaffolding or so what sometimes people call the harness. Um, and this is essentially the stuff that allows the model to operate as an agent. And one thing that we've seen is that, you know, over time, you develop this huge Frankenstein prompt. Um, I've been there, and essentially, you know, as you develop your agent, you realize some failure, you add some line to the prompt to adjust for that failure, and eventually, you know, you have three thousand lines of mostly prompt instructions that were designed for previous models and for failures that might not even happen any-anymore. Um, and so what you want to do is when you get a new model, you want to cut down your prompt, figure out what's not necessary anymore, and describe what you actually intend for your application rather than just, you know, how to work around the quirks and weaknesses of previous models. One example of this is that when we were working on the Claude 4 launch, we were sort of adapting the claude.ai application to this new model, and one thing we real-realized is that the model was following instructions more effectively than previously-- previous ones, and there were pieces of the claude.ai prompt that it was following that we didn't even expect it to follow. 
So one example of that is that there was a example about how to do citations in a particular format, but we didn't actually use that format anymore, and the model was following that instruction and producing the incorrect format. Um, and once we changed that app example and sort of just tweaked a few characters in the prompt, it completely fixed that whole class of errors. So that's an example of how, you know, sometimes you-- when you have a huge prompt, you don't even expect the model to follow every component of it, but as the models get smarter, you sort of have to reassess and look. Maybe there's a bug in the prompt, maybe there's some instructions that aren't relevant anymore, and maybe there are some things that we should cut out to allow the model to sort of just work autonomously. So in general, we recommend that when you get a new model and as you improve your application, you shrink your scaffolding down and you audit your prompt and system for things that are not really relevant anymore. And ideally, you can use your evals to test whether, you know, if you cut down your system prompt to the bare minimum, can the model still perform just as well? Finally, a really key practice that we've seen a bunch of our customers adopt, um, and that helps with adapting to the capability curve is giving the model room to work. And what this means is a few different things. First of all, you want to allow the model to think, um, when appropriate. Essentially, all of the models at this point, um, at the frontier are reasoning models, which means that they benefit from test time compute. Essentially, if you give them the option to, um, then they can use more test time compute to apply more computation and apply more intelligence to the problem, and that can result in better outcomes. 
And so in general, you want to allow adaptive thinking, which gives the model to cho-choose to think, um, when appropriate, and you also want to sort of dial the effort parameter up depending on your application. So for very intelligence-sensitive use cases, like most software engineering or enterprise agents, you wanna set the effort level pretty high, usually to sort of the highest setting, um, or close to that, to allow maximum intelligence at the cost of more token usage. Another key thing that you wanna do to give the model the room to work is allow it to operate autonomously. And this can be a little scary. You know, if you give the model access to a production system, you don't want it to delete your cluster or deploy jobs to prod without you asking. And so one practice that we've found is effective is something called auto mode. We published a blog post about this as well, but essentially in Claude Code, we have an auto mode classifier that for every tool call, we use a prompted classifier to check, is this tool call safe, and can we just approve it automatically, or does it really need human approval? Um, and this allows us to sort of trust Claude to run autonomously, and almost every software engineer at Anthropic at this point, we're using auto mode to allow the model to just work on its own, and only loop us in when it actually needs approval for something critical or dangerous. And so you can apply this pattern to your own applications, and we're looking to make it more and more possible for people to run agents autonomously while still keeping them safe and looping in humans when needed. And finally, a really important practice is to close the agent loop. And what I mean by this is allowing your agents to help you improve your agents. And essentially what this means is that you wanna design your system so that Claude or Claude Code can ex- inspect its own outputs and iterate on them to improve your system. 
One example of this is that I often will sort of plug Claude Code into some agent I'm working on, and if that agent already has evaluations to measure success, I can just ask Claude Code, "How can I improve the prompt? How can I improve the tools to get a higher score on this application?" And because, you know, Claude has access to the full agent loop, it can run the agent itself, it can run the evaluation itself, it can help you autonomously improve your own agent loop. And so if you can give Claude the ability to iterate on your own system, then you can sort of get to the point where you're almost self-improving, um, and you can sort of direct Claude to make improvements without having to be in the details of iterating on every piece. Um, so giving the model the room to work by allowing it to think when appropriate, allowing it to work autonomously in a controlled way, and by closing the agent loop, allows you to build really autonomous agents that can do much more productive work than ever before. Overall, this is sort of the capability curve that we've been on and how to adapt to it, and thank you everyone for listening. [clapping] [upbeat music]
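[Editor's sketch] The "work autonomously in a controlled way" pattern from the talk, where each tool call is classified and only risky ones go to a human, can be sketched as a small gate. This is not how Claude Code implements auto mode: the talk describes a prompted classifier, and the keyword rule below is a deliberately crude stand-in so the example runs offline. `ToolCall`, `RISKY_TOKENS`, `classify`, and `gate` are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToolCall:
    tool: str
    command: str


# Assumption: a keyword list replaces the prompted classifier from the talk.
RISKY_TOKENS = {"rm", "deploy", "kubectl", "drop"}


def classify(call: ToolCall) -> str:
    """Return "safe" or "needs_approval" for a proposed tool call."""
    tokens = set(call.command.lower().split())
    return "needs_approval" if tokens & RISKY_TOKENS else "safe"


def gate(call: ToolCall, ask_human: Callable[[ToolCall], bool]) -> bool:
    """Approve safe calls automatically; defer everything else to a human."""
    if classify(call) == "safe":
        return True
    return ask_human(call)


def always_deny(call: ToolCall) -> bool:
    # Stand-in for a real approval prompt.
    return False


print(gate(ToolCall("bash", "pytest -q"), always_deny))
print(gate(ToolCall("bash", "rm -rf build"), always_deny))
```

The design point is that the human is only consulted on the second kind of call, so the agent keeps running unattended for everything the classifier considers routine.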
Episode duration: 26:25
Transcript of episode DNRddIEoH3c
