How I AI: Claude Fable 5 (Mythos) - is the world’s best coding model as good as they say?
EVERY SPOKEN WORD
15 min read · 3,032 words
- 0:00 – 0:31
Introduction: Fable 5 is finally here
- Claire Vo
It's here, the model, the myth, the legend. Mythos from Anthropic has finally dropped. Well, baby Mythos. We're calling it Fable 5, and this new model is crushing benchmarks, but the question is, can it crush my backlog? I got early access to the model, and of course, I have my own opinions on where it does really well, where it needs a little work, and the question on everyone's mind, does it live up to the terrifying marketing hype?
- 0:31 – 5:14
What Anthropic says about the model
- Claire Vo
Let's get to it. Okay, let's talk about what Anthropic is telling us about this model, and then we'll get into what I think about it. So this is Claude Fable 5, the first Mythos-class intelligence model to reach GA. Now, if you haven't been paying attention, Anthropic has been marketing/scaring/warning us about the unbelievable capabilities of Mythos, and it is finally here. Now, they had originally been rolling this out with a couple select companies. I got early access to test what I thought of the model, but you have to know, this is not Mythos, capital M, big Mythos. This is baby Mythos. This is Fable, and so it's gonna have some guardrails on it, in particular, around cybersecurity exercises and biology exercises. Now, good news, your girl's working on PRDs. She's shipping SaaS. She's not working on biology quite yet, although give me, give me a little time and some time to experiment. Maybe I'll get there. So this is really gonna be focused on what the everyday user, what the everyday software engineer is gonna think about when they're using this model, although I did run into some things that I suspect are a result of the tuning and training of this particular model to be extra safe. Now, quick, it's not cheap. It's $10 per [million] input tokens and $50 per [million] output tokens. It's gonna be a new tier above Opus, and so if you're gonna use this model, you're gonna pay, pay the price. So what is Anthropic saying? Basically, it's a completely new model class. So we had Sonnet, we had Opus, and now we have Mythos, the first of which is Fable 5. It's completely state-of-the-art. It is exceeding every benchmark they tested by a significant amount. This 80% on SWE-bench Pro, you'll look at that compared to some of the more recent models that have come out. Very, very good benchmark performance. And then they're saying it's really good for long, complex tasks. Now, what are some things that earlier models couldn't do that they are saying now that Fable 5 can do? 
It's very autonomous, including running days-long asynchronous tasks. It's really an engineer's engineer, and that's some of the downside I experienced with this model. I'm gonna show you a very specific example of where you don't want an engineer doing your work with an engineer's, um, point of view. Proactive. Um, it's very good at vision, exceptionally good at vision. This is a place where I actually really loved the model, and you know me, I'm pretty critical of models, but I did see a step ahead of vision, so that's something we're gonna dive into. And then effort. It works hard. It builds harder. It verifies more. It's built for ambitious work. Now, guess what it also can do? It can consume those tokens. So Anthropic has said it consumes rate limits and tokens at about 2x the rate of other models. So again, this is a big boy model, and it's gonna consume tokens, and some of the things that it's good at, and even some things that they have done in the harness seem like they're, uh, intentionally or not, token consumers. So we're gonna keep an eye out on costs and an eye out on efficiency when using this model. Again, talking about long-running tasks, Fable 5 is supposed to be able to run for days. So doing long-running planning, being able to spin up sub-agents, and I show a little bit about dynamic workflows, which are, you know, different architectures of sub-agents and holding multi-day sessions. Now, I have done probably day, days-long sessions with other models. I didn't have Fable for many days, so I cannot verify that it ran for days. I did get it to run, however, for several hours on some tasks that may or may not have merited that several hour effort. But it definitely seems like it has both the harness and the intelligence capability to run for a very long time, if that's appropriate for your task. Now, here's your pros and here's your cons. They explicitly say that Fable works like a seasoned engineer. 
Unfortunately, if you have worked with a seasoned engineer, you know there's good to this [laughs] and you know there's bad to this. So it is very complete in its investigation, and it's definitely gonna go search out all the corners. It's definitely gonna think about how it can be 120% sure that it's shipping the right thing. But guess what? That's not always in service of launching, and that's honestly not always in service of building a great product. So while you can give it a goal, and it will be very autonomous, and it will be very thorough, honestly, sometimes you want, like, a slightly less thorough engineer. Product manager talking, even engineer talking, sometimes you want it to be a little bit dumber. We'll talk about some of the prompting techniques it says and when to use this model. But it's just something to think about when you're working with any high intelligence model, is how much intelligence does the task actually take?
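To put the pricing and token-consumption remarks in concrete terms, here is a minimal sketch of a per-task cost estimate. It assumes the quoted $10 and $50 figures are prices per million input and output tokens, and it models the "about 2x" consumption remark as a simple multiplier on both token counts. The names `Pricing`, `FABLE_PRICING`, and `estimate_cost`, and the example token counts, are illustrative and not taken from any Anthropic tooling.

```python
from dataclasses import dataclass

TOKENS_PER_MILLION = 1_000_000


@dataclass(frozen=True)
class Pricing:
    """Prices in US dollars per million tokens (the unit is an assumption)."""
    input_per_million: float
    output_per_million: float


# Figures quoted in the episode; treating them as per-million prices is an assumption.
FABLE_PRICING = Pricing(input_per_million=10.0, output_per_million=50.0)


def estimate_cost(pricing, input_tokens, output_tokens, usage_multiplier=1.0):
    """Return the estimated dollar cost of one task.

    usage_multiplier scales both token counts, e.g. 2.0 to model a task
    that burns roughly twice the tokens it would on another model.
    """
    scaled_in = input_tokens * usage_multiplier
    scaled_out = output_tokens * usage_multiplier
    total = (scaled_in * pricing.input_per_million
             + scaled_out * pricing.output_per_million)
    return total / TOKENS_PER_MILLION


if __name__ == "__main__":
    # Hypothetical task: 200k input tokens, 40k output tokens.
    baseline = estimate_cost(FABLE_PRICING, 200_000, 40_000)
    doubled = estimate_cost(FABLE_PRICING, 200_000, 40_000, usage_multiplier=2.0)
    print(f"baseline: ${baseline:.2f}, at 2x usage: ${doubled:.2f}")
```

Running an estimate like this before a long session is one way to decide whether a task merits the top tier at all.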
- 5:14 – 6:28
Token-intensive by design
- Claire Vo
Now, as I said before, it is token-intensive by design, and I did most of my tasks on extra high, and so it was like token burning on token burning. And so they say that high is probably the sweet spot for most work. I used extra high just because I don't want anybody in the comments saying, "Claire, you picked high for this task, and it should've been extra high, and you would've had a better experience." I used extra high. I used all of the brains of Fable. But again, it is very, very token-intensive. And my question for any of these models, this is not an Anthropic model question, this is not a Fable question, is does this token intensity actually output the right results? And that's a place where I'm just not 100% sure. But again, as us humans in the loop, we're gonna have to be much more intelligent about where to put what model and where to use what... reasoning and what effort level to match what we're doing. And again, I think that the untrained of us will say, "Oh, well, I have this Fable model. I should use it. It's better than anything." And honestly, I, I still think there's a place for good old Sonnet, I think there's a place for Opus, and I think there's a place for other models in the ecosystem. Now,
- 6:28 – 7:46
Safety classifiers and the new fallback concept
- Claire Vo
there are safeguards in this model, and so this was one of the first things that Anthropic told me testing the model, and this is one of the headlines that they're making in the release, which is there are specific classifiers in this model for cybersecurity, biology, chemistry, and distillation. Basically, they don't want anybody doing bad stuff in those categories in particular with this very intelligent model. What's nice about how they've implemented this, however, is they have this new Fallback concept. And so if you get classified into one of these categories, instead of saying like, "Do not pass go, you may no longer Fable," it just falls you back to Opus 4.8. This is also a capability in the API now, where you can do this graceful fallback to 4.8 if you're using a Mythos class model. They also have a 30-day retention policy used only to catch misuse, and it's not used to train Claude. So while it's still not training Claude, they do wanna check the use of this model because they have been and will forever be very cautious about us normies using their intelligent models. And just, you know, for context, 95% of sessions on this model did not hit a fallback. I don't believe I hit a fallback, but again, I'm not doing anything in cybersecurity, biology
- 7:46 – 8:30
Is this or is this not Mythos?
- Claire Vo
or chemistry at least yet. Okay, so this is the question, is this or is this not Mythos? It is Mythos. Fable has the safeguards. Mythos does not. Fable, all us normies can have in general availability. Mythos is still restricted to these Project Glasswing partners, some of these enterprise level partners that are really checking it against cybersecurity use cases. I would suspect that at some point we get some access to a Fable 5 point whatever, or that the Project Glasswing class opens up, but for now we get Fable, Project Glasswing or these, um, pre-selected companies get Mythos, but they are all fundamentally the same underlying
- 8:30 – 9:20
New product launches: Managed Agents and more
- Claire Vo
model. A couple product things that are also launching today along with the Fable 5 model, Claude Managed Agents are going into public beta. If you haven't paid attention, this is Anthropic's hosted harness, hosted sandbox for running long-running agentic work. I am still trying to figure out what a good use case for Claude Managed Agents is. I will get there, um, but Fable ships out of the box in Claude Managed Agents. There's also a new advisor strategy where you can use Fable 5 as a senior advisor and use cheaper models as an execution layer. A lot of people are doing this with Opus and Sonnet, and so this is gonna work today in the API and in Claude Code and is a strategy you can use. And then as I mentioned, this Fallback API where you can put an optional parameter on the messages API that allows you to continue blocked requests by using 4.8
- 9:20 – 9:55
Crushing benchmarks
- Claire Vo
at Opus pricing. Okay, as we said, crushing benchmarks. Look at this. Fable 5 compared to Opus 4.8, GPT-5.5 and Gemini 3.1 Pro. Significant increase in SWE-bench Pro benchmark, um, very far ahead of these other models. And while I wasn't testing the most advanced use cases, I didn't find something that technically it failed at, so I think these benchmarks are really gonna hold. And these benchmarks have outperformed across the board, so this is Anthropic's state-of-the-art
- 9:55 – 11:40
What it's actually like to use (the good and the bad)
- Claire Vo
model. Okay, so enough about what they say. Let's talk about what I say. What is it actually like to use? So I ran Fable 5 on a bunch of different work, and I wanna give you my feedback on where I thought it did well, where it needed a little bit of work, and where I was really surprised. As I said before, it's really good at vision, and where is it good at vision that really impressed me? It's really good at document formatting. So this is kind of super simple, um, but we've been doing these handwriting documents for my seven-year-old based on classic texts and classic poems. And on the right is Opus 4.8 and on the left is Mythos 5, and it looks so silly, but I really do think Mythos 5 did a much better job of a second grade layout for a handwriting sheet. There's just, like, the right spacing. It's very clear to read. There's enough white space. I think on the one on the right, it's just very dense, and even the lines themselves are sort of hard to tell, do you write above, do you write below? So I do think that PDF formatting documents, I tested this against a bunch of different models, Mythos 5 really did a good job. So very simple eval for me, but a very, very good one. Now, here's the problem though. The writing is nearly unreadable. So if you're thinking about Mythos for prose, for spec writing, for PRDs, unfortunately it's an engineer. And what's the problem with engineers? They just really get wrapped around the axle on details. And this is a real struggle with these more intelligent frontier models, is they're, like, too smart, and so it's just very, very hard to parse what they're saying. And I'm gonna
- 11:40 – 12:56
Test 1: product graph spec
- Claire Vo
show an example of this in actually Claude Code. So I have this concept of a product graph that I'm working on for ChatPRD. It's actually a fairly complex open source project. And I had Fable 5 go through that and actually do, like, an adversarial review of my requirements to try to figure out where there were internal inconsistencies in the logic. And it gave me this markdown document that looks very long and intelligent, but if you actually go through it, it's just really hard to parse. It's these, like, internal references. It's very detailed, but not in a way where you can zoom out. There are these big blocks of paragraphs, like look at, look at this. It is just really hard to see the forest for the trees in this particular model. And I saw this sort of like over and over again working on it with specs, is it was very complete, but nearly imparsable. And that's a real challenge when working with these very, very high intelligence models. Again, I would actually suggest pulling back to maybe a Sonnet or Opus model for specs, and then looking at Fable as an orchestrator of execution where that detail really matters, but you don't have
- 12:56 – 14:04
Test 2: designing a skills registry
- Claire Vo
to read it. The other thing that shock, shock, shocked me was how, like, actually legitimately terribly bad it was at design, or at least at one-shot design. And so I asked Fable to design a skills registry, and man alive, did it do a very poor job. I mean, I'm not even talking like AI slop bad. It's like fundamentally terrible design. Gray, black, red, simple outlines. Just really, really terrible. Now, the Anthropic team suggested that I just needed to be a little bit more detailed in my prompting. I've never had to do this before in, I would say, the last year of models in terms of front end. But even when I prompted it, it was still just not very impressive design. I think there's this real balance between design slop and specificity, and just shipping, like, terrible design. I'm not sure what about Fable 5 resulted in this. I'm gonna have to keep testing it as it rolls out today. But this was a real disappointment in terms of design. So again, you might wanna toss an Opus in the mix instead of relying on Fable
- 14:04 – 14:43
Conservative on execution
- Claire Vo
for design. It's really conservative on execution, so when I was trying to do that ambitious days-long work, I took a spec and I said, "Can you ship the V0 of this, the MVP?" I said, "Enough to... that a customer could get value." And the MVP, they just really took minimal to heart. It was, like, very, very narrow, not actually that useful. And I'm curious if this comes from some of the safeguards on this model. And it's, it's been a challenge I've seen since the kinda later Opus models, is they're not super ambitious. And so again, you'll have to think about how to prompt this to get that long-running outcome paired with the right product ambition.
- 14:43 – 15:39
Test 3: multi-agent orchestration
- Claire Vo
And then I really doubled down trying to test these Claude Dynamic workflows and these sub-agent designs, trying to see if this would really add value. And the multi-agent capability is definitely there, and I definitely had some successful multi-agent runs kicked off in Fable. But I also ran into a lot of stalls and errors in using multi-agent orchestration. Now, I made the mistake, I walked away from my laptop and came back to these sub-agents that had stalled after about three hours. And so, like, egg on my face. But I really wanna see how technically the Claude Code model holds up to the promise of multi-agent orchestration. I had some successes and some bugs. I think this is a Claude Code issue, not necessarily a model issue, although with this promise of long-running, days-long prompts, you really gotta deliver technically on the outcome.
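One common guard against the kind of silent stall described here is a per-agent time limit with a bounded retry, so a stuck sub-agent gets reported instead of sitting for hours. The sketch below is a generic asyncio pattern, not how Claude Code orchestrates sub-agents. The names `run_with_stall_timeout`, `fake_agent`, and `orchestrate` are hypothetical, and the sub-agents are simulated with sleeps.

```python
import asyncio


async def run_with_stall_timeout(name, make_coro, timeout_s, max_retries=1):
    """Run one sub-agent; cancel and retry it if it exceeds timeout_s.

    make_coro is a callable taking the attempt number and returning a
    fresh coroutine, because a coroutine cannot be awaited twice.
    """
    for attempt in range(max_retries + 1):
        try:
            result = await asyncio.wait_for(make_coro(attempt), timeout=timeout_s)
            return {"agent": name, "status": "ok",
                    "attempts": attempt + 1, "result": result}
        except asyncio.TimeoutError:
            # wait_for has already cancelled the stalled attempt; try again.
            continue
    return {"agent": name, "status": "stalled",
            "attempts": max_retries + 1, "result": None}


async def fake_agent(delay_s, output):
    """Stand-in for a real sub-agent call (simulated with a sleep)."""
    await asyncio.sleep(delay_s)
    return output


async def orchestrate():
    """Run three simulated sub-agents concurrently and collect reports."""
    jobs = {
        "fast": lambda attempt: fake_agent(0.01, "done"),
        # Stalls on the first attempt, succeeds on the retry.
        "flaky": lambda attempt: fake_agent(1.0 if attempt == 0 else 0.01,
                                            "done on retry"),
        # Never finishes within the time limit.
        "stuck": lambda attempt: fake_agent(1.0, "never returned"),
    }
    reports = await asyncio.gather(
        *(run_with_stall_timeout(name, factory, timeout_s=0.1)
          for name, factory in jobs.items())
    )
    return {report["agent"]: report for report in reports}


if __name__ == "__main__":
    for agent, report in asyncio.run(orchestrate()).items():
        print(agent, report["status"], report["attempts"])
```

A wall-clock limit is a blunt instrument for work that is meant to run for days. In practice the limit would apply to the gap between progress signals, not to the whole task.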
- 15:39 – 17:23
My takeaways
- Claire Vo
So what's my takeaway? I would hand it hard problems. Of course, not cybersecurity, bio, or chemistry problems. But hard technical problems where being extremely detailed matters, long horizon work. I would also hand it vision problems where you really want something to look good or you want it to parse PDFs or other documents. It's done exceptionally well there. I was actually really surprised. I probably wouldn't, uh, hand it my front-end work, or I definitely wouldn't hand it my front-end work, and I definitely wouldn't hand it strategy or spec work. I think it overthinks things. I think its prose is nearly imparsable. And so maybe I'll test it again with effort level lower on sort of prose and spec writing, but it wasn't it for that. That being said, I'm not a hater on this model. I definitely not... It definitely has a place in your stack. I'm gonna test it. If you wanna learn more, definitely look up the prompting guide for Fable. It's gonna probably repeat a lot of what I said. Hand it your hardest problems, what this model's good for and what it's not, and how to get a good outcome. That being said, Mythos is here. I cannot wait to hear what you build, what you overbuild, and what you make ugly with this new model. Thanks for joining How I AI. [upbeat music] Thanks so much for watching. If you enjoyed the show, please like and subscribe here on YouTube, or even better, leave us a comment with your thoughts. You can also find this podcast on Apple Podcasts, Spotify, or your favorite podcast app. Please consider leaving us a rating and review which will help others find the show. You can see all our episodes and learn more about the show at howiaipod.com. See you next time. [upbeat music]
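The takeaways in this episode amount to a routing rule: match the model tier to the kind of work. Here is a minimal sketch of that rule as a lookup table. The task categories and tier labels are illustrative shorthand for the recommendations in the episode, not real model identifiers, and `Task`, `ROUTES`, and `pick_model` are hypothetical names.

```python
from enum import Enum


class Task(Enum):
    HARD_TECHNICAL = "hard_technical"
    LONG_HORIZON = "long_horizon"
    VISION_OR_DOCUMENTS = "vision_or_documents"
    FRONT_END_DESIGN = "front_end_design"
    SPEC_OR_STRATEGY = "spec_or_strategy"


# Tier labels are placeholders, not API model names.
ROUTES = {
    Task.HARD_TECHNICAL: "fable",         # detail-heavy technical problems
    Task.LONG_HORIZON: "fable",           # long-running autonomous work
    Task.VISION_OR_DOCUMENTS: "fable",    # layout, PDFs, document parsing
    Task.FRONT_END_DESIGN: "opus",        # one-shot design was a weak spot
    Task.SPEC_OR_STRATEGY: "sonnet-or-opus",  # prose that humans must read
}


def pick_model(task, default="opus"):
    """Return the suggested tier for a task, or the default if unlisted."""
    return ROUTES.get(task, default)


if __name__ == "__main__":
    for task in Task:
        print(f"{task.value}: {pick_model(task)}")
```

Keeping the table in one place makes it easy to revise as further testing changes the picture, for example if lower effort levels turn out to improve spec writing.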
Episode duration: 17:24
Transcript of episode IREnr4I89Ho