At a glance
WHAT IT’S REALLY ABOUT
Opus 4.6 builds fast; GPT-5.3 Codex reviews like a principal engineer
- Claire Vo tests OpenAI’s Codex desktop app (GPT-5.x Codex) against Anthropic’s Claude Opus 4.6/4.6 Fast using an ambitious, repeatable benchmark: redesigning an existing marketing site and refactoring production app components.
- She finds GPT-5.x Codex can be overly literal and hard to steer for creative, greenfield redesign work: it often overfits to the last instruction and fails to expand scope beyond a couple of pages without heavy guidance.
- Opus 4.6 performs better at planning and executing broad, long-running builds (especially in Cursor’s Plan mode), producing a cohesive site redesign after an initial “Tailwind slop” iteration that improved with direction.
- Her winning stack is multi-model: use Opus to generate and implement features/design (80–90% done), then use GPT-5.3 Codex as a rigorous reviewer for architecture, performance, edge cases, and hardening before shipping—helping her merge ~93k LOC across 44 PRs in 5 days.
IDEAS WORTH REMEMBERING
Codex’s UI is optimized for “real Git work,” not just chat-based coding.
Codex foregrounds repositories/projects, branches, worktrees for parallel agent work, diffs, and PR creation—useful for power users and for teaching Git concepts visually.
GPT-5.x Codex is reliable but too literal for creative redesigns.
In marketing copy and structure, it repeatedly mirrored the user’s phrasing (e.g., explicit “PLG vs enterprise” segmentation, “dense workflow”) and over-rotated the whole page toward the last prompt rather than balancing goals.
Harness matters: Cursor’s planning/to-do scaffolding improved long-task execution.
Claire suspects part of Codex’s weaker redesign experience came from the Codex app’s less mature conversational/task workflow, whereas Cursor’s Plan mode and tooling helped Opus stay organized and autonomous.
Opus 4.6 excels at greenfield and broad, cohesive implementation.
After initial styling missteps, Opus rebuilt the site with a more sophisticated visual system, aligned to brand colors, reused repo assets, and consistently propagated styles across pricing and other pages.
Best results come from pairing models with complementary strengths.
Claire’s repeatable loop is: Opus builds the feature to 80–90%; Codex reviews and finds high-impact issues/edge cases; Opus implements fixes—creating a fast, high-quality iteration cycle.
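The build, review, fix loop described above can be sketched in a few lines. This is a minimal illustration, not Claire's actual tooling: the `Draft` class, the `build_review_fix` function, and the two stub functions are hypothetical names introduced here, and the stubs stand in for whatever calls would reach a builder model (Opus in her setup) and a reviewer model (Codex in her setup).

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Draft:
    """A piece of work moving through the loop (hypothetical structure)."""
    task: str
    code: str
    history: List[str] = field(default_factory=list)


# Hypothetical signatures: in practice each would wrap a call to a model.
Builder = Callable[[str, List[str]], str]   # (task, review findings) -> code
Reviewer = Callable[[str], List[str]]       # code -> list of findings


def build_review_fix(task: str, build: Builder, review: Reviewer,
                     max_rounds: int = 3) -> Draft:
    """Builder drafts, reviewer critiques, builder applies the fixes.

    Stops when the reviewer returns no findings or max_rounds is reached.
    """
    draft = Draft(task=task, code=build(task, []))
    for round_no in range(1, max_rounds + 1):
        findings = review(draft.code)
        draft.history.append(f"round {round_no}: {len(findings)} finding(s)")
        if not findings:
            break
        # Hand the reviewer's findings back to the builder for another pass.
        draft.code = build(task, findings)
    return draft


# Stand-in stubs so the sketch runs without any model access (assumptions).
def stub_builder(task: str, findings: List[str]) -> str:
    fixes = "".join(f"  # fixed: {f}\n" for f in findings)
    return f"# {task}\n{fixes}"


def stub_reviewer(code: str) -> List[str]:
    return [] if "fixed: missing null check" in code else ["missing null check"]


result = build_review_fix("add pricing page", stub_builder, stub_reviewer)
print(result.history)
```

Keeping the builder and reviewer as separate callables mirrors the idea of pairing models with complementary strengths: either side can be swapped without touching the loop itself.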
WORDS WORTH SAVING
I've shipped more code in the last five days than I think I have in the last month.
— Claire Vo
One of the things that I've noticed about the GPT-5.x Codex models is they are so literal.
— Claire Vo
After two prompts, [it] literally made the headline 'A Dense Product Workflow for AI-Powered Teams.'
— Claire Vo
Opus 4.6 was just a lot better at planning for itself so that it could execute a long-running task.
— Claire Vo
It really replicates the principal software engineer experience… you will fight them tooth and nail to build anything for you, but they are more than happy to tear apart someone else's code.
— Claire Vo
High quality AI-generated summary created from speaker-labeled transcript.