How I AI

Claude Opus 4.6 vs GPT-5.3 Codex: Which is the better software engineer?

I put the newest AI coding models from OpenAI and Anthropic head-to-head, testing them on real engineering work I’m actually doing. I compare GPT-5.3 Codex with Opus 4.6 (and Opus 4.6 Fast) by asking them to redesign my marketing website and refactor some genuinely gnarly components. Through side-by-side experiments, I break down where each model shines—creative development versus code review—and share how I’m thinking about combining them to build a more effective AI engineering stack.

*What you’ll learn:*

1. The strengths and weaknesses of OpenAI’s Codex vs. Anthropic’s Opus for different coding tasks
2. How I shipped 44 PRs containing 98 commits across 1,088 files in just five days using these models
3. Why Codex excels at code review but struggles with creative, greenfield work
4. The surprising way Opus and Codex complement each other in a real-world engineering workflow
5. How to use Git concepts like worktrees to maximize productivity with AI coding assistants
6. Why Opus 4.6 Fast might be worth the 6x price increase (but be careful with your token budget)

*Brought to you by:*

WorkOS—Make your app enterprise-ready today: https://workos.com?utm_source=lennys_howiai&utm_medium=podcast&utm_campaign=q22025

*Detailed workflow walkthroughs from this episode:*

• How I AI: GPT-5.3 Codex vs. Claude Opus 4.6—Shipping 44 PRs in 5 Days: https://www.chatprd.ai/how-i-ai/gpt-5-3-codex-vs-claude-opus-4-6
• How to Combine Claude Opus and GPT-5.3 Codex for High-Velocity Code Refactoring: https://www.chatprd.ai/how-i-ai/workflows/how-to-combine-claude-opus-and-gpt-5-3-codex-for-high-velocity-code-refactoring
• How to Redesign a Marketing Website Using Claude Opus 4.6 for Creative Development: https://www.chatprd.ai/how-i-ai/workflows/how-to-redesign-a-marketing-website-using-claude-opus-4-6-for-creative-development

*In this episode, we cover:*

(00:00) Introduction to new AI coding models
(02:13) My test methodology for comparing models
(03:30) Codex’s unique features: Git primitives, skills, and automations
(09:05) Testing GPT-5.2 Codex on a website redesign task
(10:40) Challenges with Codex’s literal interpretation of prompts
(15:00) Comparing the before and after with Codex
(16:23) Testing Opus 4.6 on the same website redesign task
(20:56) Comparing the visual results of both models
(21:30) Real-world engineering impact: 44 PRs in five days
(23:03) Refactoring components with Opus 4.6
(24:30) Using Codex for code review and architectural analysis
(26:55) Cost considerations for Opus 4.6 Fast
(28:52) Conclusion

*Tools referenced:*

• OpenAI’s GPT-5.3 Codex: https://openai.com/index/introducing-gpt-5-3-codex/
• Anthropic’s Claude Opus 4.6: https://www.anthropic.com/news/claude-opus-4-6
• Cursor: https://cursor.sh/
• GitHub: https://github.com/

*Other references:*

• Tailwind CSS: https://tailwindcss.com/
• Git: https://git-scm.com/
• Bugbot: https://cursor.com/bugbot

*Where to find Claire Vo:*

• ChatPRD: https://www.chatprd.ai/
• Website: https://clairevo.com/
• LinkedIn: https://www.linkedin.com/in/clairevo/
• X: https://x.com/clairevo

_Production and marketing by https://penname.co/._
_For inquiries about sponsoring the podcast, email jordan@penname.co._

Claire Vo, host
Feb 11, 2026 · 30m · Watch on YouTube ↗

CHAPTERS

  1. New coding model releases + the 5‑day shipping sprint

    Claire Vo sets up a head-to-head test of OpenAI’s GPT‑5.3 Codex (via the Codex desktop app) versus Anthropic’s Claude Opus 4.6 (and 4.6 Fast). She previews the punchline: these models helped her ship an unusually large amount of code in just five days, but each has distinct quirks.

  2. Sponsor break: WorkOS and enterprise-ready security features

    A sponsored segment explains why enterprise AI apps need deep system access yet must satisfy strict security requirements. WorkOS is presented as a drop-in solution for auth, access controls, and audit logs to speed up enterprise readiness.

  3. Choosing a realistic benchmark: redesigning an established marketing site

    Claire explains why she avoids simplistic ‘one-shot landing page’ tests and instead uses an existing, moderately complex repo. She picks her ChatPRD marketing site (multi-page, blog, workflows) and defines a goal: upgrade it from PLG-only vibes to a more enterprise-polished presence.

  4. Codex desktop app tour: Git-first workflow (repos, branches, worktrees, diffs, PRs)

    Before judging the model, Claire highlights what’s unique about Codex as an app: it centers Git concepts and makes them more visible and teachable. She explains projects/repos, branches, worktrees for parallel agent work, diffs, and PR creation as first-class actions.
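
The parallel-agent pattern Claire describes can be sketched in plain Git (repo and branch names below are hypothetical, used only for illustration): each worktree is a separate checkout on its own branch, sharing one object store, so multiple agents can work side by side without stepping on each other.

```shell
# Minimal sketch of the worktree workflow: one checkout per parallel
# agent task, each on its own branch. Names are hypothetical.
cd "$(mktemp -d)"                    # scratch directory for the demo
git init -q demo && cd demo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"
# One sibling checkout per agent task, each on its own branch:
git worktree add -q ../demo-redesign -b redesign/marketing-site
git worktree add -q ../demo-refactor -b refactor/components
git worktree list                    # main checkout plus the two agent worktrees
```

Each agent can then run in its own directory (`../demo-redesign`, `../demo-refactor`) and open a PR from its branch when done, which is what makes worktrees a natural fit for parallel agent sessions.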

  5. Codex app features: Skills and scheduled Automations

    Claire reviews Codex’s ‘Skills’ (bundled reusable instructions/files) and likes that they’re finally a first-class UI element rather than a clunky zip-file workflow. She also covers Automations—scheduled prompt-driven tasks—and notes they’re useful inspiration even if advanced users already do similar things.

  6. Running the redesign in GPT‑5.2 Codex (with a note about 5.3)

    Claire begins the marketing redesign using the model version available at the time (GPT‑5.2 Codex), noting 5.3 arrived soon after and should behave similarly. She sets a high-level prompt—optimize for PLG plus enterprise, create new pages/templates—and expects more autonomy.

  7. Where Codex struggled: overly literal interpretation and prompt overfitting

    Codex produces work but behaves too literally for broad, creative tasks, repeatedly overfitting to the last instruction. Claire describes painful back-and-forth where small guidance (e.g., “more integrations,” “more enterprise,” “more content-dense”) derails balance and nuance in copy and layout.

  8. Codex redesign outcome: acceptable code, limited scope and incomplete site refresh

    The final Codex output looks ‘okay’ and the implementation quality is solid, but it doesn’t match the desired sophistication and misses the stated scope. Instead of redesigning the entire site, it effectively updates only the homepage plus an enterprise page, requiring more manual steering than expected.

  9. Switching to Opus 4.6 in Cursor: better planning and long-task execution

Claire moves to Claude Opus 4.6 inside Cursor and immediately notices stronger self-planning and execution on long-running tasks. She credits Cursor’s harness (Plan mode, to-dos, exploration tools) and notes it’s unclear how much of the improvement came from the model versus the toolchain, but the combined experience is smoother.

  10. Opus 4.6 iteration: great copy, bad initial design—then a strong rebuild

    Opus’s first pass has strong copy but an unsophisticated look (‘Tailwind Indigo AI slop’), prompting a reset with clearer visual direction. After acknowledging the issue, Opus rebuilds with a cohesive, brand-aligned design system and produces a homepage/enterprise page Claire loves.

  11. Opus extends the redesign across the site with consistency

    After nailing the new visual language, Claire asks Opus to propagate the styling across remaining pages. Opus maintains consistency while updating pricing and other site sections, reinforcing Claire’s view that Opus excels at broad, generative, greenfield work.

  12. The 93k-line week: real product work beyond the marketing site

    Claire broadens the evaluation to core app engineering and shares dramatic shipping stats from the last five days. The work includes dozens of PRs, major refactors, bug fixes, and multiple MCP integrations—done with a two-model workflow using Opus 4.6 and GPT‑5.3 Codex.

  13. The winning two-model workflow: Opus builds, Codex reviews and hardens

    Claire describes a repeatable pattern: use Opus 4.6 (in Cursor Plan mode) to implement/refactor features quickly, then bring the result to Codex for architectural/performance review and edge-case hunting. Codex surfaces high-impact issues, asks clarifying questions, and helps polish before shipping.

  14. Cost/latency tradeoffs: Opus 4.6 Fast and ‘don’t pick the wrong task’

    Claire closes by discussing Opus 4.6 Fast: much faster, much more expensive, and best reserved for the right jobs. She argues the spend can be high-ROI given the output, but warns that choosing Fast unnecessarily can lead to unpleasant bills.
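
A back-of-envelope way to see why task selection matters: the only figure from the episode is the 6x multiplier; the base cost below is a hypothetical stand-in for one long task on the standard tier.

```shell
# How a 6x per-token multiplier scales a task's cost.
# BASE is an assumed standard-tier cost for one long task, in dollars.
BASE=30
FAST=$((BASE * 6))                   # same task run on the Fast tier
echo "standard: \$${BASE}  fast: \$${FAST}"   # prints: standard: $30  fast: $180
```

The gap widens linearly with token volume, which is why Fast pays off on a high-value refactor but stings when used for routine work.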

  15. Final verdict: where each model fits in an AI engineering stack

    Claire summarizes her stack: Opus 4.6 for creative product/feature work and high-quality design iteration; GPT‑5.3 Codex for hardened engineering judgment—code review, architecture, edge cases, and production readiness. She remains multi-model and prefers Cursor as the harness, while acknowledging Codex/Claude Code as alternatives.
