
Codex and the future of coding with AI — the OpenAI Podcast Ep. 6
Andrew Mayne (host), Greg Brockman (guest), Thibault Sottiaux (guest)
In this episode of the OpenAI Podcast, Andrew Mayne speaks with Greg Brockman and Thibault Sottiaux about Codex and the future of coding with AI.
GPT-5 Codex ushers in agentic coding, refactoring, and oversight challenges
The conversation traces AI coding from early GPT-3 docstring-to-function “sparks” to today’s Codex as an agentic collaborator embedded in terminals, IDEs, and GitHub.
A central theme is the “harness”: the tools, agent loop, integrations, and UX that let a model reliably act in real environments—often as important as raw model intelligence.
They highlight lessons from GitHub Copilot (latency as a product constraint), and why different interfaces fit different model speeds—fast autocomplete vs slower but more capable agents.
GPT-5 Codex is positioned as tightly coupled to its tool harness, enabling higher reliability, fast responses for small tasks, and sustained multi-hour effort (up to ~7 hours) on complex refactors, alongside growing emphasis on safety, scalable oversight, and looming compute scarcity by 2030.
Key Takeaways
Coding success depends on co-evolving model intelligence and the harness.
They argue you don’t get useful agentic coding from a strong model alone; you need execution, tools, looping, context access, and UX that make code “come to life” in real workflows.
Latency is a first-class feature that shapes what product you can build.
Copilot revealed that autocomplete has a tight budget (~1500ms), forcing smaller/faster models; slower smarter models can still win if the interface shifts to async or delegated work.
Agentic coding emerged from users pushing context limits in chat.
Developers kept pasting more code, traces, and logs; the natural inversion was letting the model fetch context and drive debugging itself rather than the user orchestrating every step.
Form factor experimentation is still ongoing; “one agent, many surfaces” is the goal.
They describe terminals, IDEs, GitHub @mentions, and cloud computers as complementary. ...
agents.md is a practical bridge toward agent memory and preference alignment.
It helps agents navigate a repo efficiently and follow non-obvious conventions (tests here, style there). ...
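As a concrete illustration: the agents.md file name is real, but the episode does not show its contents, so the entries below are invented examples of the kind of repo-navigation hints and non-obvious conventions it could encode.

```markdown
# AGENTS.md (hypothetical example)

## Project layout
- Application code lives in `src/`; do not edit generated files in `build/`.
- Unit tests live in `tests/unit/`, integration tests in `tests/integration/`.

## Conventions
- Run the full test suite before proposing a change.
- Match the existing style: two-space indentation, descriptive commit messages.

## Non-obvious gotchas
- Database migration scripts must only be run against a local database.
```

In this spirit, the file acts as a lightweight, human-readable stand-in for agent memory: instead of rediscovering conventions on every run, the agent reads them up front.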
High-signal AI code review crosses a threshold where it becomes mission-critical.
Past bot reviews were “noise,” but Codex review reportedly finds deep, contract-level issues; once utility crosses a threshold, teams become dependent and upset when it’s removed.
GPT-5 Codex’s differentiator is endurance plus responsiveness.
They claim it can answer quickly on simple tasks yet sustain long, tool-rich efforts on hard refactors—internally observed up to ~7 hours—suggesting higher reliability and “grit.”
Refactoring and migrations are the enterprise killer apps (and a societal necessity).
They emphasize large-scale refactors and legacy migrations (e. ...
Safety for coding agents is about permissions, sandboxing, and scalable oversight.
Codex CLI defaults to a sandbox; they envision escalating permissions with human approval and better oversight methods so users can trust outputs without reading every line.
By 2030, compute—not ideas—may be the binding constraint.
They predict material abundance but “absolute compute scarcity,” noting the world is far from a future where billions of people each run always-on agents; GPU proximity also matters for low-latency tool loops.
Notable Quotes
“As soon as you saw that, you knew this is going to work, this is going to be big.”
— Greg Brockman
“For coding… this text comes to life… you realize that the harness is almost like equally part of how you make this model usable as the intelligence.”
— Greg Brockman
“Latency was a product feature… fifteen hundred milliseconds… Anything that's slower… no one wants to sit around waiting for it.”
— Greg Brockman
“Think about it… the harness being your body and the model being your brain.”
— Thibault Sottiaux
“We've seen it work internally up to seven hours for… very complex refactorings.”
— Thibault Sottiaux
Questions Answered in This Episode
When you say GPT-5 Codex is “optimized for the harness,” what specific training or RL signals tie tool use, planning, and code quality together end-to-end?
For the 7-hour refactoring runs: what does the agent loop look like (tests, intermediate commits, rollback/branching), and what failure modes still stop it?
What concrete metrics do you use internally to decide whether to invest next in more intelligence vs more convenience (latency, cost, integrations)?
In Codex code review, how does the system infer “contract and intention” of a PR—commit messages, diffs, tickets, tests, agents.md—and how do you measure precision vs noise?
You mentioned previous auto-review bots were net-negative until a threshold; what was the key capability jump that crossed the threshold (reasoning depth, repo context, tool execution)?
Transcript Preview
Hello, I'm Andrew Mayne, and this is the OpenAI Podcast. In this episode, we're going to speak with OpenAI co-founder and president Greg Brockman, and Codex engineering lead Thibault Sottiaux, and we're going to talk about agentic coding, GPT-5 Codex, and where things might be heading in 2030.
Just bet that the greater intelligence will pan out in the, in the long run.
Uh, and it's just really optimized for, you know, what people are using, uh, GPT-5 within Codex for.
How do you make sure that that AI is producing things that are actually correct?
We're here to talk about Codex, which first, I've been using it, um, since actually, since I worked here, the first version of this, and then now you guys have the new version of this. I was playing with it all weekend long, and I've been very, very impressed by this, and it's amazing how far this technology has come in a few years. I would love to find out the early story. Like, where did the idea of even using a language model to do code come from?
Well, I mean, I remember back in the GPT-3 days-
Mm
... seeing the very first signs of life, of take a docstring and a Python definition of, of a function-
Mm-hmm
... name, and then watching the model complete the code. And as soon as you saw that, you knew this is going to work, this is going to be big. And I remember at some point we were talking about these aspirational goals of imagine if you could have a language model that would write a thousand lines of coherent code. [chuckles] Right? That was, like, a big goal for us. And the thing that's kind of wild is that that goal has come and passed, and I think that we don't think twice about it, right? I think that while you're developing this technology, you really just see the holes, the flaws, the things that don't work. Um, but every so often it's good to, like, step back and realize that, like, actually things have just, like, come so far.
It's incredible how used we get to things improving all the time, and how it has just become like a, a daily driver, and you just use it every day, and then you reflect back to like a month ago, this wasn't even possible. Um, and this just continues to happen. Uh, I think that's quite fascinating, like how quickly humans adapt to new things.
Now, now, one of the struggles that we've always had is the question of whether to go deep on a domain.
Mm-hmm.
Right? Because we're really here for the G, right, for AGI, general intelligence.
Mm-hmm.
And so to first order, our instinct is just push on, making all the capabilities better at once. Coding's always been the exception to that, right? We, we really have a very different program that we use to focus on coding data, on code metrics, on trying to really understand how do our models perform on code. Um, and that, you know, we've started to do that in other domains too, but, but for programming and coding, that that's been like a very exceptional focus for us. And, you know, for GPT-4, we really produced a single model that was just a leap on all fronts. Um, but we actually had trained, you know, the Codex model-