Codex and the future of coding with AI — the OpenAI Podcast Ep. 6

OpenAI · Sep 15, 2025 · 50m

Andrew Mayne (host), Greg Brockman (guest), Thibault Sottiaux (guest)

- Early GPT-3 coding capabilities and “vibe coding” origins
- Why OpenAI went unusually deep on coding metrics/data
- Harness definition: tools + agent loop + integrations
- Copilot lessons: latency budgets and interface design
- Terminal/IDE/async cloud agent form factors
- agents.md as instruction + navigation compression
- Codex code review, refactoring, migrations, security patching
- Long-running GPT-5 Codex tasks and reliability gains
- Scalable oversight, sandboxing, permissions, alignment
- 2030 outlook: abundance, compute scarcity, GPUs near users
- Learning to code with AI; fundamentals still matter

In this episode of the OpenAI Podcast, host Andrew Mayne speaks with OpenAI co-founder and president Greg Brockman and Codex engineering lead Thibault Sottiaux about agentic coding, GPT-5 Codex, refactoring, and the oversight challenges ahead.

GPT-5 Codex ushers in agentic coding, refactoring, and oversight challenges

The conversation traces AI coding from early GPT-3 docstring-to-function “sparks” to today’s Codex as an agentic collaborator embedded in terminals, IDEs, and GitHub.

A central theme is the “harness”: the tools, agent loop, integrations, and UX that let a model reliably act in real environments—often as important as raw model intelligence.

They highlight lessons from GitHub Copilot (latency as a product constraint), and why different interfaces fit different model speeds—fast autocomplete vs slower but more capable agents.

GPT-5 Codex is positioned as tightly coupled to its tool harness, enabling higher reliability, fast responses for small tasks, and sustained multi-hour effort (up to ~7 hours) on complex refactors, alongside growing emphasis on safety, scalable oversight, and looming compute scarcity by 2030.

Key Takeaways

Coding success depends on co-evolving model intelligence and the harness.

They argue you don’t get useful agentic coding from a strong model alone; you need execution, tools, looping, context access, and UX that make code “come to life” in real workflows.

Latency is a first-class feature that shapes what product you can build.

Copilot revealed that autocomplete has a tight budget (~1500ms), forcing smaller/faster models; slower smarter models can still win if the interface shifts to async or delegated work.

Agentic coding emerged from users pushing context limits in chat.

Developers kept pasting more code, traces, and logs; the natural inversion was letting the model fetch context and drive debugging itself rather than the user orchestrating every step.

Form factor experimentation is still ongoing; “one agent, many surfaces” is the goal.

They describe terminals, IDEs, GitHub @mentions, and cloud computers as complementary. ...

agents.md is a practical bridge toward agent memory and preference alignment.

It helps agents navigate a repo efficiently and follow non-obvious conventions (tests here, style there). ...
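As a concrete illustration of the kind of repo conventions an agents.md file can compress for an agent, here is a hypothetical example; the layout, commands, and rules below are invented for illustration, not taken from the episode:

```markdown
# agents.md — guidance for coding agents (hypothetical example)

## Repository layout
- `src/` — application code (TypeScript)
- `tests/` — unit tests; mirror the `src/` directory structure

## Conventions the agent should follow
- Run `npm test` before proposing a change.
- Style is enforced by `npm run lint`; do not hand-format code.
- Every new API endpoint needs a matching test under `tests/api/`.
```

A file like this saves the agent from rediscovering these non-obvious rules on every run, which is the “instruction + navigation compression” role described above.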

High-signal AI code review crosses a threshold where it becomes mission-critical.

Past bot reviews were “noise,” but Codex review reportedly finds deep, contract-level issues; once utility crosses a threshold, teams become dependent and upset when it’s removed.

GPT-5 Codex’s differentiator is endurance plus responsiveness.

They claim it can answer quickly on simple tasks yet sustain long, tool-rich efforts on hard refactors—internally observed up to ~7 hours—suggesting higher reliability and “grit.”

Refactoring and migrations are the enterprise killer apps (and a societal necessity).

They emphasize large-scale refactors and legacy migrations (e.g., …)

Safety for coding agents is about permissions, sandboxing, and scalable oversight.

Codex CLI defaults to a sandbox; they envision escalating permissions with human approval and better oversight methods so users can trust outputs without reading every line.
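The escalating-permissions idea can be sketched in a few lines. This is a minimal illustration of the pattern (sandboxed actions run freely, riskier ones need explicit human approval); all names here are hypothetical and do not reflect the real Codex CLI API:

```python
# Sketch of an escalating-permission loop: the agent acts inside a
# sandbox by default and must ask a human before each riskier action.
SANDBOXED = {"read_file", "run_tests"}           # always allowed
NEEDS_APPROVAL = {"write_file", "network_call"}  # require a human yes/no

def execute(action: str, approve) -> str:
    """Run `action` if it is sandboxed, or if a human approves escalation."""
    if action in SANDBOXED:
        return f"ran {action} in sandbox"
    if action in NEEDS_APPROVAL and approve(action):
        return f"ran {action} with approval"
    return f"blocked {action}"

# Example: auto-deny every escalation request.
print(execute("read_file", lambda a: False))   # ran read_file in sandbox
print(execute("write_file", lambda a: False))  # blocked write_file
```

The `approve` callback is where a real harness would surface a prompt to the user; scalable oversight then becomes the question of how rarely that prompt must fire while keeping the outputs trustworthy.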

By 2030, compute—not ideas—may be the binding constraint.

They predict material abundance but “absolute compute scarcity,” noting the world is far from a future where billions of people each run always-on agents; GPU proximity also matters for low-latency tool loops.

Notable Quotes

As soon as you saw that, you knew this is going to work, this is going to be big.

Greg Brockman

For coding… this text comes to life… you realize that the harness is almost like equally part of how you make this model usable as the intelligence.

Greg Brockman

Latency was a product feature… fifteen hundred milliseconds… Anything that's slower… no one wants to sit around waiting for it.

Greg Brockman

Think about it… the harness being your body and the model being your brain.

Thibault Sottiaux

We've seen it work internally up to seven hours for… very complex refactorings.

Thibault Sottiaux

Questions Answered in This Episode

When you say GPT-5 Codex is “optimized for the harness,” what specific training or RL signals tie tool use, planning, and code quality together end-to-end?

For the 7-hour refactoring runs: what does the agent loop look like (tests, intermediate commits, rollback/branching), and what failure modes still stop it?

What concrete metrics do you use internally to decide whether to invest next in more intelligence vs more convenience (latency, cost, integrations)?

In Codex code review, how does the system infer “contract and intention” of a PR—commit messages, diffs, tickets, tests, agents.md—and how do you measure precision vs noise?

You mentioned previous auto-review bots were net-negative until a threshold; what was the key capability jump that crossed the threshold (reasoning depth, repo context, tool execution)?

Transcript Preview

Andrew Mayne

Hello, I'm Andrew Mayne, and this is the OpenAI Podcast. In this episode, we're going to speak with OpenAI co-founder and president Greg Brockman, and Codex engineering lead Thibault Sottiaux, and we're going to talk about agentic coding, GPT-5 Codex, and where things might be heading in 2030.

Greg Brockman

Just bet that the greater intelligence will pan out in the, in the long run.

Thibault Sottiaux

Uh, and it's just really optimized for, you know, what people are using, uh, GPT-5 within Codex for.

Greg Brockman

How do you make sure that that AI is producing things that are actually correct?

Andrew Mayne

We're here to talk about Codex, which first, I've been using it, um, since actually, since I worked here, the first version of this, and then now you guys have the new version of this. I was playing with it all weekend long, and I've been very, very impressed by this, and it's amazing how far this technology has come in a few years. I would love to find out the early story. Like, where did the idea of even using a language model to do code come from?

Greg Brockman

Well, I mean, I remember back in the GPT-3 days-

Andrew Mayne

Mm

Greg Brockman

... seeing the very first signs of life, of take a docstring and a Python definition of, of a function-

Andrew Mayne

Mm-hmm

Greg Brockman

... name, and then watching the model complete the code. And as soon as you saw that, you knew this is going to work, this is going to be big. And I remember at some point we were talking about these aspirational goals of imagine if you could have a language model that would write a thousand lines of coherent code. [chuckles] Right? That was, like, a big goal for us. And the thing that's kind of wild is that that goal has come and passed, and I think that we don't think twice about it, right? I think that while you're developing this technology, you really just see the holes, the flaws, the things that don't work. Um, but every so often it's good to, like, step back and realize that, like, actually things have just, like, come so far.

Thibault Sottiaux

It's incredible how used we get to things improving all the time, and how it has just become like a, a daily driver, and you just use it every day, and then you reflect back to like a month ago, this wasn't even possible. Um, and this just continues to happen. Uh, I think that's quite fascinating, like how quickly humans adapt to new things.

Greg Brockman

Now, now, one of the struggles that we've always had is the question of whether to go deep on a domain.

Andrew Mayne

Mm-hmm.

Greg Brockman

Right? Because we're really here for the G, right, for AGI, general intelligence.

Andrew Mayne

Mm-hmm.

Greg Brockman

And so to first order, our instinct is just push on, making all the capabilities better at once. Coding's always been the exception to that, right? We, we really have a very different program that we use to focus on coding data, on code metrics, on trying to really understand how do our models perform on code. Um, and that, you know, we've started to do that in other domains too, but, but for programming and coding, that that's been like a very exceptional focus for us. And, you know, for GPT-4, we really produced a single model that was just a leap on all fronts. Um, but we actually had trained, you know, the Codex model-
