Claude Opus 4.6 vs GPT-5.3 Codex: Which is the better software engineer?

How I AI · Feb 11, 2026 · 30 min

Claire Vo (host)

Codex desktop app Git-centric workflow (projects, branches, worktrees, diffs, PRs)
Skills and scheduled automations in Codex
Marketing site redesign as evaluation benchmark
GPT-5.x Codex literalness and prompt overfitting
Opus 4.6 planning/execution in Cursor Plan mode
Two-model workflow: build with Opus, review/harden with Codex
Opus 4.6 Fast cost/performance tradeoffs

In this episode of How I AI, host Claire Vo puts OpenAI's new Codex desktop app and GPT-5.x Codex models up against Anthropic's Claude Opus 4.6 and 4.6 Fast to answer one question: which is the better software engineer?

Opus 4.6 builds fast; GPT-5.3 Codex reviews like a principal engineer

Claire Vo tests OpenAI’s Codex desktop app (GPT-5.x Codex) against Anthropic’s Claude Opus 4.6/4.6 Fast using an ambitious, repeatable benchmark: redesigning an existing marketing site and refactoring production app components.

She finds GPT-5.x Codex can be overly literal and hard to steer for creative, greenfield redesign work, often overfitting to the last instruction and failing to expand scope beyond a couple pages without heavy guidance.

Opus 4.6 performs better at planning and executing broad, long-running builds (especially in Cursor’s Plan mode), producing a cohesive site redesign after an initial “Tailwind slop” iteration that improved with direction.

Her winning stack is multi-model: use Opus to generate and implement features/design (80–90% done), then use GPT-5.3 Codex as a rigorous reviewer for architecture, performance, edge cases, and hardening before shipping—helping her merge ~93k LOC across 44 PRs in 5 days.

Key Takeaways

Codex’s UI is optimized for “real Git work,” not just chat-based coding.

Codex foregrounds repositories/projects, branches, worktrees for parallel agent work, diffs, and PR creation—useful for power users and for teaching Git concepts visually.
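
If worktrees are new to you, here is a minimal sketch of the pattern the app surfaces: one working copy per agent, each on its own branch. The branch names and paths below are hypothetical examples, not anything shown in the episode.

```python
# Sketch: give each coding agent its own git worktree so parallel edits
# never collide in a shared checkout. Branch names and paths are
# hypothetical, for illustration only.
import subprocess

AGENT_TASKS = {
    "redesign-homepage": "../wt-redesign-homepage",
    "refactor-tools": "../wt-refactor-tools",
}

for branch, path in AGENT_TASKS.items():
    # Creates a separate working directory checked out on a new branch;
    # an agent can then work inside `path` without touching the main checkout.
    subprocess.run(["git", "worktree", "add", "-b", branch, path], check=True)
```

The design point is simply that branches share one working directory over time, while worktrees give each branch its own directory, which is why they suit several agents running at once.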

GPT-5.x Codex is reliable but too literal for creative redesigns.

In marketing copy and structure it repeatedly mirrored the user's phrasing (asked for a more "content-dense" site, it wrote the headline "A Dense Product Workflow for AI-Powered Teams"), and despite being told to redesign the whole site, it only reworked the homepage and enterprise page.

Harness matters: Cursor’s planning/to-do scaffolding improved long-task execution.

Claire suspects part of Codex’s weaker redesign experience came from the Codex app’s less mature conversational/task workflow, whereas Cursor’s Plan mode and tooling helped Opus stay organized and autonomous.

Opus 4.6 excels at greenfield and broad, cohesive implementation.

After initial styling missteps, Opus rebuilt the site with a more sophisticated visual system, aligned to brand colors, reused repo assets, and consistently propagated styles across pricing and other pages.

Best results come from pairing models with complementary strengths.

Claire’s repeatable loop is: Opus builds the feature to 80–90%; Codex reviews and finds high-impact issues/edge cases; Opus implements fixes—creating a fast, high-quality iteration cycle.
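
As a rough illustration of that loop (not Claire's actual tooling), here is a minimal sketch using the public Anthropic and OpenAI Python SDKs; the model IDs and prompts are placeholders.

```python
# Minimal sketch of the build -> review -> fix loop described above.
# Model IDs and prompt text are illustrative placeholders, not the
# episode's actual setup.
from anthropic import Anthropic
from openai import OpenAI

opus = Anthropic()   # builder: drafts the implementation
codex = OpenAI()     # reviewer: critiques architecture and edge cases

def build(task: str) -> str:
    resp = opus.messages.create(
        model="claude-opus-4-6",  # placeholder model ID
        max_tokens=4096,
        messages=[{"role": "user", "content": f"Implement this feature:\n{task}"}],
    )
    return resp.content[0].text

def review(code: str) -> str:
    resp = codex.chat.completions.create(
        model="gpt-5.3-codex",  # placeholder model ID
        messages=[{
            "role": "user",
            "content": "Review this change for architecture, performance, "
                       f"and edge cases before shipping:\n{code}",
        }],
    )
    return resp.choices[0].message.content

draft = build("Refactor the shared tool-call component")
feedback = review(draft)
final = build(f"Apply this review feedback:\n{feedback}\n\nTo this code:\n{draft}")
```

The point is the division of labor: the builder drafts, the reviewer critiques, and the builder applies the fixes before a human ships.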

GPT-5.3 Codex shines as a principal-engineer-style reviewer and hardener.

It proactively prioritizes architecture/performance concerns, asks clarifying questions, and suggests scalable patterns—“happy to tear apart someone else’s code,” making it ideal for pre-ship review.

Token economics can still be ROI-positive—if task selection is disciplined.

Opus 4.6 Fast runs at roughly six times the standard price (Claire quotes about $150 per million output tokens), so she treats it as worthwhile only for the big, ambitious tasks where the extra speed clearly pays for itself.
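
To put that rate in context, a back-of-the-envelope check: only the roughly $150 per million output tokens figure comes from the episode; the token counts below are made-up examples.

```python
# Rough cost check for Opus 4.6 Fast output tokens.
# Only the ~$150 per 1M output-token rate is quoted in the episode;
# the task sizes below are hypothetical.
FAST_OUTPUT_RATE = 150 / 1_000_000  # dollars per output token

for task, output_tokens in [("small bug fix", 20_000), ("full site redesign", 2_000_000)]:
    print(f"{task}: ~${output_tokens * FAST_OUTPUT_RATE:,.2f} in output tokens")
# small bug fix: ~$3.00; full site redesign: ~$300.00
```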

Notable Quotes

I've shipped more code in the last five days than I think I have in the last month.

Claire Vo

One of the things that I've noticed about the GPT-5X Codex models is they are so literal.

Claire Vo

After two prompts, [it] literally made the headline 'A Dense Product Workflow for AI-Powered Teams.'

Claire Vo

Opus 4.6 was just a lot better at planning for itself so that it could execute a long-running task.

Claire Vo

It really replicates the principal software engineer experience… you will fight them tooth and nail to build anything for you, but they are more than happy to tear apart someone else's code.

Claire Vo

Questions Answered in This Episode

What exact prompt structure (constraints, examples, “don’t do” rules) did you find reduced GPT-5.x Codex’s tendency to overfit to the last instruction?

How much of Codex’s redesign underperformance was model behavior vs the Codex desktop app harness (task management, prompting UX, planning)? What would you change in the harness to fix it?

Can you share the Cursor Plan-mode template you used with Opus 4.6 to keep a long refactor on track (milestones, acceptance criteria, checklists)?

What were the “high-impact issues” Codex found in the tool-component refactor (architecture, perf, edge cases), and which categories show up most often in your codebase?

In your Opus→Codex→Opus loop, when do you stop cycling—what signals “ship it,” and what checks are automated vs manual?

Transcript Preview

Claire Vo

[upbeat music] Welcome back to How I AI. I'm Claire Vo, product leader and AI obsessive, here on a mission to help you build better with these new tools. Today, we're gonna bring you up to date on all the new coding model releases from OpenAI and Anthropic. In case you missed it, OpenAI released last week, Codex, their desktop app for AI engineering, the new model, GPT-5.3 Codex, try saying that five times fast, and Anthropic released their response, Opus4.6 and Opus4.6 Fast. If you're new here, then you don't know, but when these new models come out, I put them through their paces. I test them, I test them side by side on the same task, and I'm gonna give you my opinion about where they do well, where they fall apart, and which one goes where in my AI engineering stack. Spoiler alert: I've shipped more code in the last five days than I think I have in the last month. So I think these are pretty fabulous models, but they do have their quirks, they do have their strengths, and sometimes they go off the rails. Let's get to it. This episode is brought to you by WorkOS. AI has already changed how we work. Tools are helping teams write better code, analyze customer data, and even handle support tickets automatically. But there's a catch: these tools only work well when they have deep access to company systems. Your copilot needs to see your entire code base, your chatbot needs to search across internal docs, and for enterprise buyers, that raises serious security concerns. That's why these apps face intense IT scrutiny from day one. To pass, they need secure authentication, access controls, audit logs, the whole suite of enterprise features. Building all that from scratch, it's a massive lift. That's where WorkOS comes in. WorkOS gives you drop-in APIs for enterprise features, so your app can become enterprise-ready and scale up market faster. Think of it like Stripe for enterprise features. OpenAI, Perplexity, and Cursor are already using WorkOS to move faster and meet enterprise demands. Join them and hundreds of other industry leaders at workos.com. Start building today. Okay, to start, I like to pick a task when I'm evaluating new models that's pretty ambitious, something I definitely wouldn't wanna do by hand, and is consistent enough that I can actually compare the pros and cons of each model side by side. And I picked a task that I choose often when comparing these models, which is redesign my marketing site. I think all these models are pretty good at one-shotting kind of a landing page or a marketing page, a simple app. I don't feel like that's a practical evaluation criteria for these new models. I like to take a code base that's relatively complex or at least established, and compare side by side how these models work inside these code bases. So I took my ChatPRD homepage, marketing site. It's got lots of pages, it's got a blog, it's got the How I AI workflows on there. It's not a simple app, even though it's just kind of like a content front end, and I wanna bring it up to my twenty twenty-six ambitions, which are all about the enterprise. So while this, you know, website looks great, it's cute, it's got nice colors, it's definitely more focused on the kind of PLG individual user workflow, and I wanna uplevel this as we sell more to enterprise customers. So I'm gonna have these models duke it out and see which one does the better job, and I'm gonna test these in order of when they came out. So the first thing that came out in our very busy week last week was Codex. 
Now, Codex, as I said, is OpenAI's desktop app for coding, and before we get into it, I wanna show off some of the things that I think make Codex unique. First of all, Codex is focused around Git primitives. Now, if you don't know or you're not technical, you're a new software engineer, you've probably run into some concepts of Git as you get-- gotten started vibe coding, but I just wanna walk through a couple of things that might be useful for you to know. The first thing is the idea of a Git repository. That is basically a whole code base that represents an app or a project. Git repositories are represented over here in Codex as projects. You can see I have different repositories here that I'm working on, including my ChatPRD, um, website, the www website. Then in your repo, you can start working on new types of code, and there are kind of two ways you can take code and make it contained so that when you edit it, it doesn't break your production website. The first way that I use a lot are branches. Branches are little, as they say, branches of your code that you can make changes to, commit, and then ultimately decide to merge to production. There's also the concept of work trees. These are full copies of your code base that you would use or an agent would use to make changes, and one of the benefits of work trees versus branches, and you can get many of them going on, on the same time, on your same machine. And so if you're working with a lot of agents, you could give each agent its own work tree to work on, and it could do a lot of work in parallel without running into each other or causing issues. If you wanna learn more about work trees, definitely watch our episode with Alex from OpenAI on Codex, the terminal app, um, where he goes through how he uses work trees on a daily basis to kick off his agentic work. And then up in the top right, you can see we have a good diff panel. A diff is, again, the difference between what you had and what you have now. Um, you'll see red is code that was removed, green is code that was added. Um, you can see up here the count of line changed, either added or removed, and then you can create pull requests from Codex. Pull requests are kind of a signal to your team that says, "This code that I'm working on is ready to be part of the main production branch. Can you pull it in? I'm requesting it," and often that's where your CI/CD pipeline, your pre-develop-- or your pre-production checks go, and where your team, with their human eyes-... tends to look at your code. And you can see here, as I'm talking through this, Codex has put these concepts up front and center, and I think that's because they're trying to appeal to two audiences. One, they're just trying to appeal to, you know, let the tokens go, highly empowered, um, use all the agents, software engineers that are doing a lot of things at once on their, on their local machine and need to be able to benefit from these concepts of Git, work trees, um, local and cloud agents, all that kind of stuff. The second thing is, I think this is actually a really good framework for folks that are less technical to learn the concepts of Git. I have always said you should invest in the GitHub desktop experience. It is a version of this. It's what I use all the time to manage my work across branches and across files. I could work in the command line tool for GitHub, I just think it's nice to be able to see your changes and really know what's going on. 
And so Codex has brought some of these visual concepts, UI concepts, of Git into the Codex app. So it's nice if you're learning. The second thing that you'll see in Codex that is a little new and unique compared to other apps, is the concept of bringing skills up as a first-class citizen. So if you are new, skills are sort of a package set of prompts, instructions, reference files, and code that can be called by an agent to kind of consistently execute a task over time. Um, if you want to be, like, really cheap, it's like a bundled prompt. And you can see here that OpenAI and Codex have given screens a home, and they've given them icons, and they've given them buttons. And I have to say, I love this. If you watched my early episode when skills first came out, I was so exasperated that skills were like a zip file that you had to upload somewhere or put in your repository. This just makes it a much more visual experience to add skills to your code base or to your, um, system and refer to them over time. I also like that OpenAI shipped a bunch of recommended skills that a lot of people could benefit from, so you can get your mind wrapped around what kind of skills would benefit your AI work. The final thing that I think OpenAI put kind of front and center in Codex that's interesting, is this concept of automations. So automations are basically tasks that can run on a schedule. Um, you can see here when you create a new automation, you give it a name, you say what project it needs to run on. You basically run a prompt, it's not that fancy, and then you give it a schedule. And again, like skills, OpenAI has shipped a bunch of out-of-the-box automations. Now, my reaction here was: I'm already doing a lot of this stuff. You know, I'm a little ahead of the curve when it comes to some of the automations around my code base, so I've solved these problems, but I think everybody should solve these problems. So if you're looking for inspiration on what kind of automations would benefit your code base, the Codex automations, re- recommended automations is a really good place to start and get some inspiration. But let's get to actually writing code. Now, I have to say one caveat, which is I ran this process using GPT-5.2 Codex, which was the recommended model when this app came out. Now, very quickly, they came out with 5.3, and we'll see that towards the end of the episode. But I do want to call out, this is a slightly older version of the model, though I think the family of models, given my experience, have very similar outputs. So I would say I would probably get the same experience with 5.3. Now, what is the test case we're going to do? As I like to do, we are going to redesign the ChatPRD site. Last time, when some models came out, we redesigned a page, but we've been pitched that these new models are more independent, can do more long-running tasks, can handle more. And so I want it to take an existing code base and redesign the whole thing, and I'm gonna trust these very smart models to do it without too much prompting. And so that was my test cases. I wanted to take this homepage and this website, which is lovely, but it's very PLG-focused, and make it more polished, more upleveled for an enterprise audience. And so I started that in Codex, and I [chuckles] gave it pretty high-level prompt, but I thought it could go with it, which is, I said, "Optimize the marketing site in this repo for PLG plus enterprise. 
You can create new pages, redesign templates, et cetera, to make it the highest quality marketing site I could have." And then I listed a bunch of sites that I really like. If you're on this list, I think you have a nice website. Now, here's where it immediately disappointed me, and I'm sad to say it, but it did. One of the things that I've noticed about the GPT-5X Codex models is they are so literal. They are so literal, and so they follow instructions [chuckles] very well. And I know that is a, in many instances, a feature, not a bug. You want your model to follow your instructions explicitly, but you don't want it to follow it blindly, and that's what I found. I found that the Codex app harness, plus the Codex models, were just too literal to do greenfield or creative, broad work on my behalf. It will do high-quality coding work. I will get to that soon, but your ability to tell these models, like, "Hey, go and do X," I often found that with a combination of it being too literal and not pushing me to the next step, not actually saying, "Are you ready for me to build?" Meant that it was much more painful and slower to get work done with these models. And this is really ironic, because the 5.3 model is actually pretty fast, and so it should feel faster to code with it, but the actual back-and-forth experience conversationally was really challenging, and you'll see some of that here. So I said, "Redesign the website." I... We went back and forth on how to use the Figma skill. It didn't actually pick it up well, so I just gave, gave up on that.... And then I asked it to redesign the page, and it did it. Now, here's where example number one of being too literal came in. I had told it I want it to redesign the marketing site for a combination of product-led growth and enterprise. Basically, I wanted a market site that'd be friendly to users, but it would also help our sales team bring in inbound leads. And it built it, and literally had explicit references to PLG and enterprise in the copy. It was like, "If you're here for product-led growth, click here and sign up. If you are here as an enterprise customer, click here and talk to sales." It was so explicit, and this was my perpetual cycle with Codex on this redesign. We went back and forth. I gave it some design help. I asked it to design a couple things on styling. At some point, I said, "The design's okay, but it could be better. Take more inspiration from the sites I offered. Make the copywriting top-tier. I've spent two million dollars on it." You can see some of my desperate prompting here, just trying to figure out what is the unlock. Is it a technical spec unlock? Is it a, you know, find reference content unlock? Is it an identity unlock for these models? I couldn't figure it out, so I kept trying. And what was really funny is [chuckles] I just-- every time I would say something, it would overfit to my prompt. So when it gave me a website that I generally liked but said, "Hey, can you add more about integrations? Our enterprise customers really like integrations," it made the entire page about integrations. If I said, "Hey, I wanna focus a little more on enterprise," it would make the entire page about enterprise. It really didn't have that nuance of what goes where and how to build a balanced experience. It was really over fitting to my last prompt. And I will g- [chuckles] you know, I was saying, like: "We don't need to list exactly everything." 
I was trying to give it explicit examples, and then it put a long list of all those examples. It was just having a really hard time editing itself. And then I'm gonna give my favorite example of Codex being way too literal, which is I told it... You know, it created was something that I thought was fine, but it was a lot of images and not a lot of content, and I said, "Hey, I like a more content-dense site like Hex." Um, Hex, you have a lovely site. I think you did a really nice job. "I just want a more copy on there because I think I wanna be more technical, more detailed, more precise about what the value of my product is." And after two prompts, literally made the headline "A Dense Product Workflow for AI-Powered Teams." And I was like, "Oh," I mean, I made the, like, facepalm emoji face. I was like: Why in the world would you say [chuckles] that our product has a dense workflow? I asked for a content-dense site. I didn't say, "Make our content all about how dense our product is." So I just had a really tough time with Codex and Codex five two, GPT-5.2, on this particular task. We eventually got there, and I would say the output was okay. So this was the before, the after from Codex. I, I really liked this headline that came up. It was, like, one of the things somewhere buried down on the page that I thought was great. It eventually got re- uh, overwritten by my content-dense headline. I thought some of the headlines were, like, kind of interesting. It looks pretty nice. It pulled some interesting graphics from our repo. It put placeholders in here. You know, I think this is okay. It, it kind of didn't quite fit our design aesthetic, and what I was more frustrated by than the, um, say, the literal nature of the GPT model, which I had kind of gotten used to, this is, like, not something that's new to me, is that it really only redesigned this homepage and the enterprise page. So I had asked it to redesign the whole page, the whole site, and it really did not do that. And so again, this, like, sort of, it can do long-running tasks, it can take on ambitious things, it just took a lot more work from me to get it to even get to this two-page redesign, which I thought was okay, not great. Now, the code is great, it's fine, it's not terrible. It's certainly faster and better than what I would have done myself. That being said, I think we can do a lot better. So speaking of doing a lot better, let's go over to my friend Opus. Now, again, spoiler alert, y'all, I love, I love her. I love Opus, and I will caveat by saying I found a place where I really love Codex, so we're gonna come back. But as soon as I started getting my hands on Opus, I was just really happy, but it didn't start off perfect. So let's talk about where it went well and where it kind of went off the rails. So again, I started with the same prompt: "Optimize the chat marketing site in this repo for PLG and enterprise. Um, you can create new pages, redesign things, et cetera." Again, I put this content-dense fra- framework in here. I just... I had just come off that bad experience. I wanted to see what it did. And I will say, Opus 4.6 was just a lot better at planning for itself so that it could execute a long-running task. So it did its exploration of our code base and reference marketing sites, it used Cursor Plan mode to do a plan, and then it started building the components. Now, I have to give kudos to Cursor. I'm still a Cursor girl. Yes, I could have tested Opus 4.6 in Claude Code. I am sure there are optimizations there. 
I just, hand to God, think that Cursor does a good job of building harnesses for all of these models. I think the combination of, like, planning, and to-dos, and exploration, and the question tool, I just tend to get good results. So there is this open question of: Was it the model, or was it the Codex harness that, you know, in the, the desktop app that is not as mature as Cursor? Which one caused that bad experience? I'm not sure, um, but using Opus 4.6 in the Cursor desktop app was quite nice. Okay, so it's building, it's building, it's building, it's building.... it goes, it runs a build, it gives me a summary. I am very pleased with the independent nature of this model. I'm about to hire her. She can go run my marketing site. You are now my marketing engineer. Except the copy was great, the design was terrible, and unfortunately, I didn't commit this at this point, so I lost the design, but it just did not look good. It did not look sophisticated. I was like: "I'm going back home to Codex. What are we doing here?" It was terrible. So again, I did my desperate prompting here. I want it to look like I spent a million dollars on my design with the best agencies out here. Here are some colors. Um, let's see if I said... Oh, I said, "I want you to develop a unique and modern front-end visual style." This is Tailwind Indigo AI slop. If you know, you know. [chuckles] And it agreed with me. It was like: "You're right. I gave you an Eric Tailwind slop. Let me rebuild." And it rebuilt, and it was so lovely. And so we went back and, you know, it in- integrated our design system. It gave me an outline of what it did in terms of design. Um, we had to go back and forth on build, but eventually, I got something lovely. Here was the before, and the after was like this. I love this so much. We're probably gonna ship this in the next day or two, hopefully live when the episode goes live, but it still matches our brand aesthetic, but just looks so much nicer. It has our colors. She is pink. It uses some of our graphics instead of placeholders. It calls out some numbers, which is really great for selling the value proposition. It highlights the reviews, and then, you know, instead of what Codex was doing, which was making, like, very blunt statements about enterprise, it was, like, one hundred percent security, all this stuff. It gives a really nice kind of value proposition-oriented view of what would be nice for enterprise and redesigned our enterprise page as well. So once I got exactly what I liked, I asked it: "Okay, let's take these styles and go ahead and redesign the rest of the site to bring it up to matching." And it did a really good job. It kept everything consistent. It redesigned our pricing page. It's working on our How I AI page to make sure we're matching some of the designs. I think this looks really nice, and I was super happy with the output. And this is gonna be my meta assessment of Opus 4.6 versus the GPT-5x models, is that Opus 4.6 is really good at kind of generative, broad, greenfield work. You want it to implement a new feature, it will go implement a new feature, or you want to completely redine- design your site, it will completely redesign your site. I was really, really pleased with my experience on this model, and we're probably going to ship this live. Now, this is a much more front-end-focused, design-oriented task. I like this task because we can literally say: "Okay, what did I start with before? What did, you know, Opus come up with?" 
And then even compare that directly to, what did Codex come up with? Which I can refresh and show you here. I can do a side by side, and you can see with your eyes, you can read all the words and really make a decision about where these models do well. But that is not enough to assess whether or not these are good models, bad models. I like them, I'm gonna use them, or I'm not gonna use them. And as I go into the next workflow, where I found both models to be super useful, I'm going to admit something that is a little scary and maybe impressive, which is I asked Devin today: "How much code have I merged into GitHub in the last five days? I need to fix my Devin workspace." But if you go into it, in the last five days, I have merged forty-four PRs containing ninety-eight commits across one thousand and eighty-eight files. I have added ninety-two, almost ninety-three thousand lines of code. I have removed eighty-seven thousand lines of code. We've added five thousand net new lines. We have released, uh, a one, two, three, four, five MCP integrations. We've completely overhauled one of our big components, we've completely refactored our components folder, and we have shipped and fixed, and we have fixed a bunch of bugs. We have done a lot, and this is... None of this is in the web app. This is all in our core application, which is quite complicated and much more complicated than our marketing site. And I did all of this with now my two pals on my team, Opus 4.6 and Codex 5.3. So I did find a place that these two operate really well, and I am going to talk you through it. As I mentioned earlier, one of the big features that I released recently on ChatPRD was a bunch of MCP connectors. So now from our chat, you can look at what's happening in GitHub, you can look at what's happening in Linear, you can look at h- what's happening in Granola, and you bring all that into your product work. And this is one of probably two dozen tools that we now have available in the ChatPRD app, and we were displaying them all in different ways. All our tools were different. They were individual components. Our code was super, super messy. And so one of the things that I kicked off in Opus was a refactor of a reused component, um, that I wanted to be able to add to, remove from, customize, but have some shared code. I just knew the way we were doing this wasn't great. And so I started off a Opus 4.6 task to refactor how we use our tool components. So let's talk about how I actually rebuilt these components and where I use these different models. So first, I open up Cursor, and honestly, this might be the secret sauce in some of these experiences. I opened up Cursor, I built a plan with Opus 4.6 using Plan mode. I kicked it off, and I went back with 4.6 on how to build this. And you can see here, I got this lovely, like, sort of extensible tool component where I could add different things in, give them different lang- or give it different copy and language as it went through. It built a bunch of really nice front-end components for me, and I think, honestly, they look lovely. So as we saw before-... you get these lovely tool calls here. They look nice across all of our different kinds of tool calls, whether you're creating documents. I'm just really happy with this experience. Now, I'm ready to push this code to production. Here is where our friend Codex comes back to play. Now, and this is where I love to use Codex. I went back into Codex, and I said, "I've redesigned tool usage in this, um, index. 
It's gone through several rounds of feedback. Can you review the architecture and performance and see if you have any feedback we should consider before shipping? We're looking for something scalable but customizable, and we don't wanna overfit in any direction." And it went through and searched and identified a couple high-impact issues, prioritized those issues for me, asked me questions. I said, "One is intentional, two is a, a edge case" And it asked me if it wanted to implement any of the polish. I said yes, and it polished it. It passed our AI Bugbot code review, and we shipped this to production. And now this is my flow. So this was a very, again, kind of front-end focused, component-focused workflow. Uh, we just, you know, like, for the technical folks out there, we just completely are refla- replatforming our vector stores. It was a huge, huge, huge thing. It touched fifty files. It was really hard to do without kind of doing a huge, huge PR. It required like, I don't know, probably thirty rounds of, of feedback on this thing, and GPT-5.3 Codex was so lovely. [claps] Love it for code review, architectural review, and finding edge cases. And what I found is you could ask Opus 4.6 to build something. It would build something eighty to ninety percent done or good. You'd ask Codex to find everything wrong with it, it would find all the things that were wrong with it, and then you'd take it back to Co-- or Opus, and Opus would be like: "Oh, yeah, yeah, bro, you're right. [tsks] I really missed that thing. I better fix it" And so I do think... I'm gonna give Codex some love here. I think it's the better software engineer. Technically, Opus is kind of the software engineer that you want on your team, though. It actually builds stuff. And so [chuckles] what I've been saying to people about GPT-5.3 Codex is it really replicates the principal software engineer experience, in that you will fight them tooth and nail to build anything for you, but they are more than happy to tear apart someone else's code. So if you are looking for a principal engineer on your team, um, to pair with your eager product engineer of Opus 4.6, definitely, definitely use Codex. And I kind of feel like I can't live without Codex reviewing my code now. So I'm quite happy with this experience. Again, Bugbot, which I use from Cursor, does a lot of review of our PRs. It's also run on the Codex model, so I think it's a really good eagle-eye reviewer. It's just too hard to get out of the gate building new products, so I really like this flow, and I highly recommend that folks replicate it. I think it's really useful. To conclude our episode, I just wanna give a quick nod to Opus 4.6 Fast. If you have not heard, Opus 4.6 is Opus 4.6, but fast. You can select it here in its most powerful model, but fast, and it is expensive, six times the price. I think it's a hundred and fifty dollars per million output tokens, something like that. I actually used Opus 4.6 Fast a lot, and now I gotta go look at how much I'm spending. So what I will say is, while I have consumed the tokens, I am floating through an infinite ocean of tokens. I embrace a token abundance mindset. I'm starting to spend a lot of money on models, which, at the end of the day, super, super high ROI. Again, if we're looking at this, how expensive would it be for me to ship forty-four PRs, really, really huge features? It would take months of time, tons of people. 
We probably also wouldn't get it to perfect quality, and so I am really bullish that this is a worthwhile investment for my team, but don't mess around with 4.6 unless you're ready to pay the bill. And so I just think we're also gonna start looking at, where does this fit from a personality perspective? Where does this fit from a capability perspective? And then where does it fit from a budget perspective? Um, and as my friend from Cody at Sentry said, "If you're playing between 4.6 and 4.6 Fast, don't pick the wrong task, or you're [chuckles] gonna get a bill that you're not happy with." So that's today's model-focused episode of How I AI. I compared Opus 4.6, Codex, GPT-5.3 Codex, and Opus 4.6 Fast. What I found, you wanna use Opus for your product and feature work, being creative, and creating high-quality designs. You want Codex catching all your bugs, advising on our architecture, and really writing exceptional, high-quality, hardened code. Both of these models have a place in your stack. I still love Cursor for using them. I'm still a multi-model girl, but I think they do well in either the Codex desktop app, Claude Code, or wherever you like to get your AI-generated code. That is today's episode of How I AI. I'm looking forward to hearing your feedback about what your favorite model is and where you're using it, and we will see you next week. [upbeat music] Thanks so much for watching. If you enjoyed the show, please like and subscribe here on YouTube, or even better, leave us a comment with your thoughts. You can also find this podcast on Apple Podcasts, Spotify, or your favorite podcast app. Please consider leaving us a rating and review, which will help others find the show. You can see all our episodes and learn more about the show at howiai pod.com. See you next time! [upbeat music]
