a16z: GPT-5 and Agents Breakdown – w/ OpenAI Researchers Isa Fulford & Christina Kim
At a glance
WHAT IT’S REALLY ABOUT
OpenAI researchers unpack GPT-5’s usability, agents, data, and culture
- GPT-5 is positioned as a step-change in everyday utility—especially for coding and writing—beyond simply posting stronger benchmark scores.
- OpenAI focused heavily on post-training details (datasets, reward models, and behavior tuning) to improve usability while reducing sycophancy, hallucinations, and deception-like behavior.
- The researchers argue that as public benchmarks saturate, the most meaningful frontier metric becomes real-world usage and the new tasks the model unlocks at accessible price points.
- Agents are defined as asynchronous systems that do useful work on a user’s behalf, but reliability, oversight, and task/environment coverage remain key bottlenecks.
- Progress is increasingly "data- and task-driven," with growing emphasis on building realistic RL environments, collecting computer-use data, and bootstrapping synthetic data once baseline capability exists (a minimal sketch of such an environment follows this list).
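To make the "realistic RL environments" idea concrete, here is a minimal sketch of what a task-scoped, computer-use-style environment interface could look like. Everything below, from the class names to the sparse reward scheme, is a hypothetical illustration, not OpenAI's actual tooling.

```python
# Hypothetical sketch of a task-scoped RL environment of the kind described
# above; names and structure are illustrative, not OpenAI's.
from dataclasses import dataclass, field

@dataclass
class StepResult:
    observation: str   # what the agent sees after acting
    reward: float      # task-progress signal used for RL
    done: bool         # True once the task is complete

@dataclass
class SpreadsheetTask:
    """Toy 'computer-use' task: fill target cells with required values."""
    target: dict                           # cell -> required value
    cells: dict = field(default_factory=dict)

    def reset(self) -> str:
        self.cells = {}
        return f"Empty sheet. Goal: set {self.target}"

    def step(self, action: str) -> StepResult:
        # Actions are simple text commands, e.g. "set A1 42".
        parts = action.split()
        if len(parts) == 3 and parts[0] == "set":
            self.cells[parts[1]] = parts[2]
        solved = self.cells == {k: str(v) for k, v in self.target.items()}
        return StepResult(
            observation=f"Sheet now: {self.cells}",
            reward=1.0 if solved else 0.0,   # sparse reward on completion
            done=solved,
        )

env = SpreadsheetTask(target={"A1": 42})
print(env.reset())
print(env.step("set A1 42"))
```

The sparse completion reward also hints at why task and environment coverage remains a bottleneck: every new task the agent should handle needs its own checkable end state.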
IDEAS WORTH REMEMBERING
Usability, not just benchmarks, is the north star for GPT-5.
Kim emphasizes that strong evals matter, but the real win is that GPT-5 feels broadly more useful in the tasks people actually do (notably coding and writing), which should show up in usage patterns.
Front-end coding improvements came from obsessive attention to data and rewards.
Rather than a single magic trick, they attribute the leap to careful dataset curation and reward-model design, plus a deliberate push to “nail” front-end specifics like aesthetics and interactive behavior.
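The episode doesn't reveal implementation details, but preference-based reward modeling has a standard shape worth sketching: a scalar score per candidate, and a Bradley-Terry probability that a labeler prefers one front-end output over another. The aesthetic features and weights below are invented for the example, not OpenAI's criteria.

```python
# Illustrative sketch of a pairwise (Bradley-Terry style) reward model,
# the standard shape for preference-based reward models. The front-end
# features here are made up for the example.
import math

def score(candidate: dict) -> float:
    """Scalar reward from hand-picked front-end features (hypothetical)."""
    return (2.0 * candidate["renders_without_errors"]
            + 1.0 * candidate["consistent_spacing"]
            + 0.5 * candidate["interactive_states_handled"])

def preference_probability(a: dict, b: dict) -> float:
    """P(labeler prefers a over b) under a Bradley-Terry model."""
    return 1.0 / (1.0 + math.exp(score(b) - score(a)))

a = {"renders_without_errors": 1, "consistent_spacing": 1, "interactive_states_handled": 1}
b = {"renders_without_errors": 1, "consistent_spacing": 0, "interactive_states_handled": 0}
print(f"P(prefer a): {preference_probability(a, b):.2f}")
```

Pairwise comparison suits qualities like aesthetics, which labelers find easier to rank than to score in absolute terms.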
Sycophancy is treated as a post-training trade-off problem.
Kim describes post-training as an “art” of balancing rewards: optimizing for helpful/engaging can overshoot into overly effusive or unhealthy behavior, so GPT-5’s behavior was intentionally “reset” to be healthier.
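To make that balancing act concrete, here is a toy composite reward, with invented features and weights, showing how over-weighting engagement can flip the ranking toward the sycophantic answer:

```python
# Toy illustration of the trade-off described above: a composite reward
# where over-weighting "engagingness" favors effusive answers. All
# features and weights are invented for the example.
def composite_reward(helpfulness: float, engagingness: float,
                     flattery: float, w_engage: float) -> float:
    # Helpfulness always counts; engagement is tunable; flattery is penalized.
    return helpfulness + w_engage * engagingness - 0.5 * flattery

# A sycophantic answer: highly engaging and flattering, less substantive.
sycophantic = dict(helpfulness=0.5, engagingness=0.95, flattery=0.8)
direct      = dict(helpfulness=0.9, engagingness=0.5, flattery=0.1)

for w in (0.5, 2.0):  # low vs. high weight on engagement
    print(w,
          round(composite_reward(**sycophantic, w_engage=w), 3),
          round(composite_reward(**direct, w_engage=w), 3))
```

At the low weight the direct answer scores higher; at the high weight the sycophantic one does, which is the kind of overshoot the "reset" corrects.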
Hallucinations and deception-like behavior are linked by over-helpfulness.
They argue models may “want to respond” even when they lack capability; enabling more deliberate thinking (pausing instead of blurting) and aligning incentives reduces both confident errors and misleading helpfulness.
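One common way to encode "pausing instead of blurting" is a scoring rule under which abstaining beats a confident wrong answer. The grader and confidence threshold below are hypothetical, not OpenAI's training setup.

```python
# Minimal sketch of the incentive fix described above: reward abstention
# over a confident wrong answer. The scoring rule is illustrative.
from typing import Optional

def grade(answer: Optional[str], truth: str) -> float:
    if answer is None:
        return 0.0                               # abstaining is neutral
    return 1.0 if answer == truth else -2.0      # confident errors cost more

def respond(candidate: str, confidence: float,
            threshold: float = 0.7) -> Optional[str]:
    # "Pause instead of blurting": only answer above a confidence bar.
    return candidate if confidence >= threshold else None

print(grade(respond("Paris", 0.95), "Paris"))   #  1.0: answered, correct
print(grade(respond("Lyon", 0.40), "Paris"))    #  0.0: abstained
print(grade(respond("Lyon", 0.90), "Paris"))    # -2.0: confident error
```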
As benchmarks saturate, OpenAI increasingly builds capability-first internal evals.
Fulford explains they work backward from desired capabilities (e.g., slide decks, spreadsheets) and create representative internal evals when public ones don’t exist, using experts, synthetic generation, and usage signals to hill-climb.
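A capability-first eval of the kind Fulford describes reduces to prompts paired with programmatic checks and a pass rate to hill-climb against. The tasks, checks, and stub model below are all hypothetical placeholders.

```python
# Hedged sketch of a capability-first internal eval: target tasks, graders,
# and a pass rate to hill-climb. model_fn and checks are placeholders.
from typing import Callable

EVAL_SET = [
    # (prompt, check) pairs; checks encode the desired capability.
    ("Make a 3-slide outline on RL environments",
     lambda out: out.count("Slide") == 3),
    ("Sum column B: 2, 3, 5",
     lambda out: "10" in out),
]

def run_eval(model_fn: Callable[[str], str]) -> float:
    passed = sum(check(model_fn(prompt)) for prompt, check in EVAL_SET)
    return passed / len(EVAL_SET)

# Stub standing in for an API call to the model under test.
def stub_model(prompt: str) -> str:
    return "Slide 1 ... Slide 2 ... Slide 3" if "slide" in prompt.lower() else "10"

print(f"pass rate: {run_eval(stub_model):.0%}")
```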
WORDS WORTH SAVING
I feel like ChatGPT got released, and everyone was like, "Wow, that's so cool." But then you just kind of take it for granted that you literally have this, like, wizard in your pocket.
— Christina Kim
It's like everything they tell you not to do at a startup. It's just like your user is anyone.
— Isa Fulford
The design of this model has been very, very intentional for model behavior, especially with the sycophancy issues that we had like a few months ago with 4o.
— Christina Kim
We're trying to make the most capable thing, and we're also trying to make it useful to as many people as possible and accessible to as many people as possible.
— Isa Fulford
If this exponential is true, like, there's not really much else I want to spend my life working on.
— Christina Kim