Codex and the future of coding with AI — the OpenAI Podcast Ep. 6

What happens when AI becomes a true coding collaborator? OpenAI co-founder Greg Brockman and Codex engineering lead Thibault Sottiaux talk about the evolution of Codex, from the first glimpses of AI writing code to today’s GPT-5 Codex agents that can work for hours on complex refactorings. They discuss building “harnesses,” the rise of agentic coding, code review breakthroughs, and how AI may transform software development in the years ahead.

Chapters

• 1:15 – The first sparks of AI coding with GPT-3
• 4:00 – Why coding became OpenAI’s deepest focus area
• 7:20 – What a “harness” is and why it matters for agents
• 11:45 – Lessons from GitHub Copilot and latency tradeoffs
• 16:10 – Experimenting with terminals, IDEs, and async agents
• 22:00 – Internal tools like 10x and Codex code review
• 27:45 – Why GPT-5 Codex can run for hours on complex tasks
• 33:15 – The rise of refactoring and enterprise use cases
• 38:50 – The future of agentic software engineers
• 45:00 – Safety, oversight, and aligning agents with human intent
• 51:30 – What coding (and compute) may look like in 2030
• 57:40 – Advice: why it’s still a great time to learn to code

Host: Andrew Mayne
Guests: Greg Brockman, Thibault Sottiaux
Sep 15, 2025 · 50m

CHAPTERS

  1. Why AI coding feels inevitable: from GPT-3 docstrings to “daily driver” Codex

    Andrew Mayne opens with Greg Brockman and Codex engineering lead Thibault Sottiaux on how quickly AI-assisted coding has progressed. Greg recalls early GPT-3 moments where docstrings reliably turned into working functions—an immediate signal that coding would be a major application area.

    • Early GPT-3 “signs of life” completing functions from docstrings and signatures
    • Ambition milestone: “1,000 coherent lines of code” quickly became routine
    • Developers acclimate fast; capabilities feel normal soon after arriving
    • Framing the episode: agentic coding, GPT-5 Codex, and what 2030 could look like
  2. Why OpenAI went unusually deep on coding (despite the ‘G’ in AGI)

    Greg explains that OpenAI typically pushes general capability, but programming became an exception where they built specialized data, metrics, and evaluation programs. The team learned that coding demanded distinct investment to measure and improve real-world usefulness, not just benchmark wins.

    • Coding treated as an exceptional focus area with dedicated metrics and data work
    • Historical context: separate Codex and language-specific pushes (e.g., Python-focused)
    • Shift from competition-style coding to practical usefulness in diverse environments
    • Need to train models around how people actually build software
  3. The “harness”: why tooling and agent loops matter as much as raw intelligence

    They introduce the idea that code is ‘text that comes to life’—it must run, interact with tools, and affect real environments. Thibault defines the harness as the integration layer (tools + agent loop) that lets a model act, comparing it to a body for the model’s brain.

    • The model alone is only input and output; the harness adds tools, environment access, and iteration loops
    • Coding requires execution and feedback (tests, errors, runtime), not just text output
    • End-to-end integration can produce ‘magical’ collaborator behavior
    • Harness quality can be as important as model intelligence for usability
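
    The harness idea described above can be sketched as a small loop: the model proposes an action, the harness runs the matching tool, and the observation is fed back until the model finishes. This is only an illustrative sketch, not how Codex is implemented. `Action`, `scripted_model`, `run_harness`, and the tool names are hypothetical, and the "model" is a hard-coded stand-in so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Action:
    tool: str       # name of a tool to call, or "finish"
    argument: str   # tool input, or the final answer when tool == "finish"


def scripted_model(history: List[str]) -> Action:
    """Hypothetical stand-in for a real model: picks the next action from history."""
    if not history:
        return Action("read_file", "app.py")
    if len(history) == 1:
        return Action("run_tests", "")
    return Action("finish", f"done after {len(history)} tool calls")


def run_harness(model: Callable[[List[str]], Action],
                tools: Dict[str, Callable[[str], str]],
                max_steps: int = 10) -> str:
    """Agent loop: ask the model for an action, run the tool, feed back the result."""
    history: List[str] = []
    for _ in range(max_steps):
        action = model(history)
        if action.tool == "finish":
            return action.argument
        tool = tools.get(action.tool)
        observation = tool(action.argument) if tool else f"unknown tool: {action.tool}"
        history.append(f"{action.tool}({action.argument!r}) -> {observation}")
    return "stopped: step limit reached"


# Toy environment (assumption): an in-memory "file system" and a fake test runner.
FILES = {"app.py": "def add(a, b):\n    return a + b\n"}
TOOLS = {
    "read_file": lambda path: FILES.get(path, "error: file not found"),
    "run_tests": lambda _: "1 passed",
}

result = run_harness(scripted_model, TOOLS)
print(result)
```

    The step limit is one simple design choice for keeping a loop like this bounded; a real harness would need richer stopping and error-handling rules.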
  4. GitHub Copilot lessons: latency budgets and interface co-evolution

    Greg reflects on Copilot as the first time many developers felt an AI embedded in their workflow. The key product revelation was that latency is a feature: autocomplete-style experiences need a response within roughly a second and a half, which forces tradeoffs between speed and intelligence and motivates different interfaces for slower, smarter models.

    • Copilot made AI-in-the-loop coding tangible for mainstream developers
    • Autocomplete has a tight latency budget (~1500ms) or users won’t wait
    • Smarter but slower models require different harnesses and interaction modes
    • Thesis: higher intelligence pays off long-term if the interface adapts
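
    The latency tradeoff above can be illustrated by picking the most capable option that still fits an interaction mode's time budget. In this sketch only the ~1500 ms autocomplete budget comes from the episode summary; the other budgets, the model labels, the quality scores, and the latencies are made-up assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelOption:
    name: str                 # hypothetical label, not a real model name
    quality: int              # made-up relative capability score
    typical_latency_ms: int   # made-up typical response time


# The autocomplete budget follows the summary; the other two are assumptions.
LATENCY_BUDGET_MS = {
    "autocomplete": 1_500,
    "chat": 30_000,
    "async_agent": 3_600_000,
}

OPTIONS = [
    ModelOption("small-fast", quality=1, typical_latency_ms=400),
    ModelOption("medium", quality=2, typical_latency_ms=8_000),
    ModelOption("large-slow", quality=3, typical_latency_ms=120_000),
]


def pick_model(mode: str, options: List[ModelOption]) -> Optional[ModelOption]:
    """Return the highest-quality option that fits the mode's latency budget."""
    budget = LATENCY_BUDGET_MS[mode]
    fitting = [o for o in options if o.typical_latency_ms <= budget]
    return max(fitting, key=lambda o: o.quality, default=None)


for mode in LATENCY_BUDGET_MS:
    choice = pick_model(mode, OPTIONS)
    print(mode, "->", choice.name if choice else "no model fits")
```

    Under these assumed numbers, the slowest and most capable option is only usable in the asynchronous mode, which mirrors the point that smarter but slower models call for a different interface.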
  5. From copy-paste debugging to agents that fetch their own context

    Thibault describes observing developers stuffing more context into ChatGPT (code snippets, traces) until interactions became unwieldy. That pressure suggested flipping the paradigm: let the model drive, gather context itself, and debug with less human micromanagement.

    • Developers used ChatGPT for complex debugging but struggled to provide context
    • Increasingly complex copy/paste workflows revealed a ceiling
    • Key insight: agents should pull context, not rely on user-fed snippets
    • Goal: user supervises while the model does the exploratory work
  6. Form-factor experiments: terminal, IDE, cloud async agents, and internal “10x”

    They detail prototypes across terminal and remote/async setups, including an internal terminal tool called “10x.” The team explored letting agents run at scale (close laptop, keep working) while also recognizing the practicality of local workflows and the need to meet developers where they already are.

    • Early terminal prototype was productive internally but not polished enough to ship
    • Async/remote vision: agents keep working while you’re away, and you follow progress on your phone
    • Multiple deployment patterns: local, remote, hybrid daemon approaches
    • Tension: build for OpenAI’s internal stack vs broad external environments
  7. Convenience vs intelligence: integrations can be ‘transformative’ even without smarter models

    Greg emphasizes two axes—intelligence and convenience (latency, cost, integration)—and a moving “acceptance region” where users adopt tools. They cite terminal-context integrations that eliminated copy-paste as a step-change in productivity, illustrating how harness improvements can rival model upgrades.

    • Adoption depends on both model capability and convenience/integration
    • High-value tasks can justify slower models; low-stakes tasks need instant convenience
    • Example: auto-reading terminal context removed copy/paste friction and felt transformative
    • Design challenge: decide when to invest in intelligence vs convenience improvements
  8. Choosing where to use Codex: terminal, IDE, GitHub @mentions, and AGENTS.md

    The team describes today’s ‘experimentation phase’ across interfaces: terminal power workflows, IDE for controlled edits/undo, and GitHub @mentions for delegated tasks. They introduce AGENTS.md as a lightweight way to encode navigation hints and team preferences so the agent can operate efficiently and consistently.

    • GitHub integration: @mention Codex to delegate fixes or code moves, which it carries out on its own remote ‘laptop’
    • Terminal excels for outcome-driven “vibe coding”; IDE preferred for precise edits and review
    • Vision: one coherent agent that works across tools like a human collaborator
    • AGENTS.md: concise codebase map + preferences (tests location, style, conventions)
    • Open problem: durable agent memory and deeper codebase understanding over time
  9. Enterprise ‘killer’ work: refactoring, migrations, patching, and tool creation

    Greg argues massive refactoring and migrations (e.g., COBOL modernization) remain largely unsolved but economically pivotal. They discuss automating painful work like library migrations and security patching, and the longer-term flywheel where agents build new tools (like modern Unix utilities) to amplify productivity.

    • Refactoring large codebases is a major frontier; big payoff for enterprises
    • Lowering migration cost could unlock far more modernization work (COBOL example)
    • Security patching and defensive automation likely to become critical use cases
    • Agents that create tools for themselves/users could compound gains over time
    • Future scope expands to SRE-like operations and service administration
  10. Codex code review: crossing the threshold from ‘bot noise’ to trusted safety net

    Thibault describes an internal breakthrough: high-signal PR review that checks intention/contract against implementation, traces dependencies, and surfaces deep issues. They note a ‘threshold effect’—below it, auto-review is ignored; above it, teams rely on it and feel pain when it’s unavailable.

    • Codex PR review aims for contract/intent verification, not superficial linting
    • Findings can go layers deep across dependencies and logic assumptions
    • Internal impact: accelerated PR throughput, bugs caught before release
    • Threshold dynamic: once useful enough, people demand it; when not, it’s pure noise
    • Reviews are designed to be readable and educational—even when the model is wrong
  11. What’s new in GPT-5 Codex: harness-optimized reliability and multi-hour ‘grit’

    They present GPT-5 Codex as a GPT-5 variant optimized for the Codex harness—tighter coupling between model and tools for reliability. A standout capability is persistence: it can work for hours on complex refactors, while still responding quickly on simple requests.

    • GPT-5 Codex is tuned for the agent harness and tool-using workflows
    • Faster responses for simple questions; deeper ‘thinking’ and persistence for hard tasks
    • Demonstrated up to ~7 hours of continuous work on complex refactoring internally
    • Workflow: plan → delegate → iterate through errors/tests until completion
    • Improved code quality and reliability as primary optimization targets
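
    The "iterate through errors/tests until completion" step can be sketched as a bounded loop that runs checks, hands the failures to a fixer, and re-checks. This is a toy illustration under stated assumptions, not the Codex workflow itself: `run_checks`, `scripted_fixer`, and `iterate_until_green` are hypothetical names, the fixer is a canned stand-in for a model, and `exec` is used only because the example is self-contained.

```python
from typing import Callable, List, Tuple

# Deliberately broken starting point: wrong variable name and wrong multiplier.
BROKEN = "def double(x):\n    return y * 3\n"


def run_checks(code: str) -> List[str]:
    """Toy test runner: execute the code and return a list of failure messages."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # acceptable only for this trusted toy snippet
        result = namespace["double"](3)
    except Exception as exc:  # runtime errors are feedback too
        return [f"{type(exc).__name__}: {exc}"]
    return [] if result == 6 else [f"double(3) returned {result}, expected 6"]


def scripted_fixer(code: str, failures: List[str]) -> str:
    """Hypothetical stand-in for the model: one canned repair per failure type."""
    if failures[0].startswith("NameError"):
        return code.replace("return y", "return x")
    return code.replace("* 3", "* 2")


def iterate_until_green(code: str,
                        fixer: Callable[[str, List[str]], str],
                        max_rounds: int = 5) -> Tuple[str, int]:
    """Repeat check -> fix -> re-check until checks pass or the budget runs out."""
    failures = run_checks(code)
    rounds = 0
    while failures and rounds < max_rounds:
        code = fixer(code, failures)
        rounds += 1
        failures = run_checks(code)
    if failures:
        raise RuntimeError(f"gave up after {rounds} rounds: {failures}")
    return code, rounds


fixed, rounds = iterate_until_green(BROKEN, scripted_fixer)
print(f"checks pass after {rounds} fix rounds")
```

    Each round fixes only what the latest failure reveals, which is why the second bug surfaces only after the first one is repaired.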
  12. The agentic future: millions of supervised agents, permissions, and scalable oversight

    Thibault forecasts cloud populations of agents producing economic value under human steering. Both highlight the core safety challenge: humans can’t read every line, so systems need sandboxing, permissioning, escalation paths, and scalable oversight methods to maintain trust and alignment with intent.

    • Expectation: large-scale multi-agent systems in cloud data centers
    • Human role shifts to supervision, steering, and approval of risky actions
    • Codex CLI uses sandboxing by default; permissions should be explicit and staged
    • Need for scalable oversight: maintain trust without exhaustive human code review
    • Alignment target expands from individual intent to team/organization intent
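
    The combination of sandboxing, explicit permissions, and escalation can be sketched as a gate that every agent action passes through. This is a simplified illustration and not the Codex CLI's actual sandbox. The workspace path, the action names, and the `classify` and `execute` helpers are all hypothetical.

```python
from enum import Enum
from pathlib import PurePosixPath
from typing import Callable


class Decision(Enum):
    ALLOW = "allow"
    ASK_HUMAN = "ask_human"
    DENY = "deny"


WORKSPACE = PurePosixPath("/workspace/project")  # hypothetical sandbox root


def inside_workspace(path: str) -> bool:
    """True if the path is the workspace or below it, with no '..' escapes."""
    p = PurePosixPath(path)
    if ".." in p.parts:
        return False
    return p == WORKSPACE or WORKSPACE in p.parents


def classify(action: str, target: str) -> Decision:
    """Staged policy: sandboxed file access is free, riskier actions escalate."""
    if action in ("read", "write"):
        return Decision.ALLOW if inside_workspace(target) else Decision.ASK_HUMAN
    if action == "network":
        return Decision.ASK_HUMAN
    return Decision.DENY  # unknown actions are refused outright


def execute(action: str, target: str, approve: Callable[[str, str], bool]) -> str:
    """Run the action only if policy allows it or a human approves the escalation."""
    decision = classify(action, target)
    if decision is Decision.ASK_HUMAN and approve(action, target):
        decision = Decision.ALLOW
    return "ran" if decision is Decision.ALLOW else "blocked"


def deny_all(action: str, target: str) -> bool:
    return False


print(execute("write", "/workspace/project/src/a.py", deny_all))
print(execute("write", "/etc/passwd", deny_all))
```

    The human is consulted only for escalations, so routine work inside the sandbox does not require line-by-line approval.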
  13. 2030 outlook: abundance of creation, scarcity of compute, and security endgames

    They predict AI will enable far easier creation (digital and physical), but compute will remain scarce and strategically important. Discussion includes the security arms race and a possible ‘endgame’ via formal verification, plus the need to bring GPUs closer to users to reduce latency in tool-heavy agent loops.

    • Future may be materially abundant in what can be created, but constrained by compute supply
    • Compute allocation already shapes research outcomes; demand could scale to billions of ‘personal agents’
    • Reducing latency matters: tool-call-heavy agents benefit from nearby/edge GPUs
    • Security: hope for new defensive primitives (e.g., formal verification) beyond cat-and-mouse
    • Mission focus: expand availability of intelligence while improving efficiency and cost
  14. Still learn to code—now with AI: fundamentals, faster learning, and new leverage

    Both guests argue it’s an excellent time to learn programming, with AI accelerating language acquisition and problem solving. They stress that the most successful AI coders still understand fundamentals—architecture, structure, and correctness—using AI to avoid reinventing wheels and to surface questions novices don’t know to ask.

    • Recommendation: learn to code and learn to use AI effectively
    • AI helps ramp quickly in new languages (team examples with Rust) and unfamiliar codebases
    • Fundamentals still matter: architecture and code comprehension drive success
    • Codex can suggest established solutions (e.g., serialization libraries) and prevent common pitfalls
    • Usage is rapidly growing (reported 10x); broader access via Plus/Pro plans drives adoption
