CHAPTERS
New coding model releases + the 5‑day shipping sprint
Claire Vo sets up a head-to-head test of OpenAI’s GPT‑5.3 Codex (via the Codex desktop app) versus Anthropic’s Claude Opus 4.6 (and 4.6 Fast). She previews the punchline: these models helped her ship an unusually large amount of code in just five days, but each has distinct quirks.
- OpenAI releases: Codex desktop app + GPT‑5.3 Codex model
- Anthropic releases: Opus 4.6 and Opus 4.6 Fast
- Claire’s evaluation style: same task, side-by-side comparisons
- Teaser outcome: massive velocity boost, but models differ in strengths
Sponsor break: WorkOS and enterprise-ready security features
A sponsored segment explains why enterprise AI apps need deep system access yet must satisfy strict security requirements. WorkOS is presented as a drop-in solution for auth, access controls, and audit logs to speed up enterprise readiness.
- AI tools need access to codebases and internal docs to work well
- Enterprise customers demand security controls and auditability
- Building enterprise features in-house is costly and slow
- WorkOS provides APIs for SSO, RBAC, audit logs, and more
Choosing a realistic benchmark: redesigning an established marketing site
Claire explains why she avoids simplistic ‘one-shot landing page’ tests and instead uses an existing, moderately complex repo. She picks her ChatPRD marketing site (multi-page, blog, workflows) and defines a goal: move it from a purely product-led-growth (PLG) feel to a more polished, enterprise-ready presence.
- Evaluation should stress real repos, not toy apps
- ChatPRD site includes multiple pages, blog, and workflow content
- Goal: retain PLG friendliness while elevating for enterprise buyers
- Test will measure long-running autonomy and consistency
Codex desktop app tour: Git-first workflow (repos, branches, worktrees, diffs, PRs)
Before judging the model, Claire highlights what’s unique about Codex as an app: it centers Git concepts and makes them more visible and teachable. She explains projects/repos, branches, worktrees for parallel agent work, diffs, and PR creation as first-class actions.
- Codex maps repos to ‘projects’ for quick switching
- Branches vs worktrees: worktrees enable parallel agent work safely
- Diff view makes changes legible (adds/removes, line counts)
- PR creation inside the tool aligns with team review + CI/CD
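The branch-versus-worktree point can be illustrated with plain Git, independent of the Codex app. The sketch below is not how Codex implements it: the `agent/<task>` branch naming, the `main` start point, and both helper names are hypothetical. It relies only on the standard `git worktree add -b <branch> <path> <start-point>` command, which creates a new branch checked out in its own directory so that parallel tasks never share a working tree.

```python
from pathlib import Path
import subprocess


def worktree_add_command(task: str, base_dir: Path, start_point: str = "main") -> list[str]:
    """Build the git command that gives one task its own branch and directory.

    The "agent/<task>" naming scheme is an assumption made for this example.
    """
    branch = f"agent/{task}"
    path = base_dir / task
    return ["git", "worktree", "add", "-b", branch, str(path), start_point]


def create_worktrees(repo: Path, tasks: list[str], base_dir: Path) -> None:
    """Create one isolated worktree per task inside an existing repository."""
    for task in tasks:
        # check=True raises CalledProcessError if git rejects the command,
        # for example when the branch name already exists.
        subprocess.run(worktree_add_command(task, base_dir), cwd=repo, check=True)


# Show the command that would run for a single hypothetical task.
print(" ".join(worktree_add_command("redesign-home", Path("../worktrees"))))
```

Each task then works in its own directory on its own branch, and the results come back together through ordinary diffs and PRs.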
Codex app features: Skills and scheduled Automations
Claire reviews Codex’s ‘Skills’ (bundled reusable instructions/files) and likes that they’re finally a first-class UI element rather than a clunky zip-file workflow. She also covers Automations—scheduled prompt-driven tasks—and notes they’re useful inspiration even if advanced users already do similar things.
- Skills = reusable prompt/instruction/reference packages
- Codex UI makes skills discoverable and manageable
- Automations run on schedules against a chosen project
- Prebuilt automation templates help teams find good maintenance tasks
Running the redesign in GPT‑5.2 Codex (with a note about 5.3)
Claire begins the marketing redesign using the model version available at the time (GPT‑5.2 Codex), noting 5.3 arrived soon after and should behave similarly. She sets a high-level prompt (optimize for PLG plus enterprise, create new pages/templates) and expects the model to carry it out largely on its own.
- Test executed on GPT‑5.2 Codex; 5.3 discussed later
- Prompt: redesign/optimize whole marketing site for PLG + enterprise
- Expectation: long-running, independent execution in a real repo
- Prompt includes reference sites and asks for a higher level of polish
Where Codex struggled: overly literal interpretation and prompt overfitting
Codex produces work but behaves too literally for broad, creative tasks, repeatedly overfitting to the last instruction. Claire describes painful back-and-forth where small guidance (e.g., “more integrations,” “more enterprise,” “more content-dense”) derails balance and nuance in copy and layout.
- Codex follows instructions ‘blindly’; literalism becomes a liability
- Copy becomes explicitly segmented (‘PLG click here / enterprise click here’)
- Each new request dominates the whole page (integration-heavy, enterprise-only)
- ‘Content-dense’ request leads to awkward headline about ‘dense workflow’
Codex redesign outcome: acceptable code, limited scope and incomplete site refresh
The final Codex output looks ‘okay’ and the implementation quality is solid, but it doesn’t match the desired sophistication and misses the stated scope. Instead of redesigning the entire site, it effectively updates only the homepage plus an enterprise page, requiring more manual steering than expected.
- Visual output: decent styling and some good headlines
- Used repo graphics and placeholders, but the aesthetic didn’t fully match the intended direction
- Major gap: didn’t actually redesign the whole site as requested
- Takeaway: strong coding, weaker autonomy/creativity for greenfield design
Switching to Opus 4.6 in Cursor: better planning and long-task execution
Claire moves to Claude Opus 4.6 inside Cursor and immediately notices stronger self-planning and execution for long-running tasks. She credits Cursor’s harness (Plan mode, to-dos, exploration tools) and notes it’s unclear how much was model vs toolchain, but the combined experience is smoother.
- Same core prompt used for fair comparison
- Opus explores the repo, reviews the reference sites, creates a plan, then executes
- Cursor features (Plan mode, to-dos, exploration) improve workflow
- Open question: Codex issues may be model behavior and/or app maturity
Opus 4.6 iteration: great copy, bad initial design—then a strong rebuild
Opus’s first pass has strong copy but an unsophisticated look (‘Tailwind Indigo AI slop’), prompting a reset with clearer visual direction. After acknowledging the issue, Opus rebuilds with a cohesive, brand-aligned design system and produces a homepage/enterprise page Claire loves.
- Initial attempt: content good, visuals not premium
- Claire pushes for agency-level polish and provides color guidance
- Opus recognizes the problem and rebuilds style holistically
- Result: brand-consistent, stronger value props, numbers, reviews, better enterprise framing
Opus extends the redesign across the site with consistency
After nailing the new visual language, Claire asks Opus to propagate the styling across remaining pages. Opus maintains consistency while updating pricing and other site sections, reinforcing Claire’s view that Opus excels at broad, generative, greenfield work.
- Styles generalized and applied across multiple pages
- Pricing and additional pages updated to match the new system
- Consistency and follow-through are stronger than Codex in this task
- Conclusion here: Opus is better for creative, expansive builds
The 93k-line week: real product work beyond the marketing site
Claire broadens the evaluation to core app engineering and shares dramatic shipping stats from the last five days. The work includes dozens of PRs, major refactors, bug fixes, and multiple MCP integrations—done with a two-model workflow using Opus 4.6 and GPT‑5.3 Codex.
- 44 PRs, 98 commits, 1,088 files touched in 5 days
- ~92–93k lines added, ~87k removed (net roughly +5–6k)
- Shipped ~5 MCP integrations plus major component overhaul/refactors
- Demonstrates impact in a complex production codebase
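The arithmetic behind these figures is easy to check. The sketch below uses only the approximate numbers quoted in this chapter, so its outputs are rough too; the variable names are illustrative.

```python
DAYS = 5
PRS = 44
COMMITS = 98
LINES_ADDED = (92_000, 93_000)  # "~92-93k", treated as a low/high range
LINES_REMOVED = 87_000          # "~87k", approximate

# Net growth of the codebase: lines added minus lines removed.
net_low = LINES_ADDED[0] - LINES_REMOVED
net_high = LINES_ADDED[1] - LINES_REMOVED

# Total churn: every line that was either added or removed.
churn_low = LINES_ADDED[0] + LINES_REMOVED
churn_high = LINES_ADDED[1] + LINES_REMOVED

prs_per_day = PRS / DAYS
commits_per_day = COMMITS / DAYS

print(f"net lines: {net_low:,} to {net_high:,}")
print(f"churn:     {churn_low:,} to {churn_high:,}")
print(f"PRs/day:   {prs_per_day:.1f}, commits/day: {commits_per_day:.1f}")
```

The net change is small next to the total churn, which fits a week dominated by refactors and overhauls rather than purely additive feature work.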
The winning two-model workflow: Opus builds, Codex reviews and hardens
Claire describes a repeatable pattern: use Opus 4.6 (in Cursor Plan mode) to implement/refactor features quickly, then bring the result to Codex for architectural/performance review and edge-case hunting. Codex surfaces high-impact issues, asks clarifying questions, and helps polish before shipping.
- Use Opus for implementation: components, refactors, feature buildouts
- Use Codex for review: architecture, performance, scalability, edge cases
- Codex prioritizes issues and can apply targeted polish changes
- Bugbot/code review reinforces Codex’s strength as an ‘eagle-eye’ reviewer
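The build-then-review pattern can be sketched as a small loop. Everything below is hypothetical: `Finding`, `build_then_harden`, and the two stub functions are stand-ins invented for illustration, and no real model or tool API is called. In practice the builder role would be played by Opus in Cursor and the reviewer role by Codex, with a human deciding what to act on.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Finding:
    severity: int  # higher means more important
    note: str


Builder = Callable[[str, list[Finding]], str]  # (task, findings to fix) -> code
Reviewer = Callable[[str], list[Finding]]      # code -> review findings


def build_then_harden(task: str, build: Builder, review: Reviewer,
                      max_rounds: int = 2) -> str:
    """Draft with one model, review with another, feed findings back."""
    code = build(task, [])
    for _ in range(max_rounds):
        # Address the highest-severity issues first.
        findings = sorted(review(code), key=lambda f: f.severity, reverse=True)
        if not findings:
            break
        code = build(task, findings)
    return code


# Stub roles so the loop can run without any model access.
def stub_builder(task: str, findings: list[Finding]) -> str:
    if not findings:
        return f"draft of {task}"
    fixes = ", ".join(f.note for f in findings)
    return f"draft of {task} [fixed: {fixes}]"


def stub_reviewer(code: str) -> list[Finding]:
    if "fixed" in code:
        return []
    return [Finding(1, "rename helper"), Finding(3, "unbounded query")]


result = build_then_harden("settings page", stub_builder, stub_reviewer)
print(result)
```

Capping the number of rounds keeps the loop from cycling indefinitely when the reviewer keeps finding new issues.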
Cost/latency tradeoffs: Opus 4.6 Fast and ‘don’t pick the wrong task’
Claire closes by discussing Opus 4.6 Fast: much faster, much more expensive, and best reserved for the right jobs. She argues the spend can be high-ROI given the output, but warns that choosing Fast unnecessarily can lead to unpleasant bills.
- Opus 4.6 Fast is significantly pricier (roughly 6×)
- Claire adopts a ‘token abundance’ mindset but tracks rising spend
- Task-model-budget fit matters: match capability to need
- Advice: avoid using Fast for tasks that don’t justify the cost
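One way to reason about task-model-budget fit is a simple break-even check. The sketch below takes the "roughly 6×" multiplier from this chapter; the dollar amounts, the hours saved, and the function names are all made-up inputs for illustration, not real pricing or measurements.

```python
FAST_PRICE_MULTIPLIER = 6.0  # "roughly 6x" per the chapter; approximate


def fast_premium(standard_cost: float,
                 multiplier: float = FAST_PRICE_MULTIPLIER) -> float:
    """Extra spend from running a task on the faster, pricier tier."""
    return standard_cost * (multiplier - 1)


def fast_is_worth_it(standard_cost: float, hours_saved: float,
                     value_per_hour: float,
                     multiplier: float = FAST_PRICE_MULTIPLIER) -> bool:
    """True when the value of the time saved exceeds the extra spend."""
    return hours_saved * value_per_hour > fast_premium(standard_cost, multiplier)


# Hypothetical task: $10 on the standard tier, engineer time valued at $100/hour.
print(fast_premium(10.0))                   # extra spend on the fast tier
print(fast_is_worth_it(10.0, 1.0, 100.0))   # a full hour saved
print(fast_is_worth_it(10.0, 0.1, 100.0))   # only six minutes saved
```

The point is the comparison rather than the specific numbers: the same premium can be trivial for a task that unblocks an hour of work and wasteful for one that saves only minutes.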
Final verdict: where each model fits in an AI engineering stack
Claire summarizes her stack: Opus 4.6 for creative product/feature work and high-quality design iteration; GPT‑5.3 Codex for hardened engineering judgment—code review, architecture, edge cases, and production readiness. She remains multi-model and prefers Cursor as the harness, while acknowledging Codex/Claude Code as alternatives.
- Opus 4.6: greenfield creation, redesigns, feature implementation
- GPT‑5.3 Codex: principal-engineer-style review and hardening
- Best combo: Opus builds 80–90%, Codex finds what’s wrong, Opus fixes
- Tooling preference: Cursor harness; models can run elsewhere too
