a16z: GPT-5 and Agents Breakdown – w/ OpenAI Researchers Isa Fulford & Christina Kim
At a glance
WHAT IT’S REALLY ABOUT
OpenAI researchers unpack GPT-5’s usability, agents, data, and culture
- GPT-5 is positioned as a step-change in everyday utility—especially for coding and writing—beyond simply posting stronger benchmark scores.
- OpenAI focused heavily on post-training details (datasets, reward models, and behavior tuning) to improve usability while reducing sycophancy, hallucinations, and deception-like behavior.
- The researchers argue that as public benchmarks saturate, the most meaningful frontier metric becomes real-world usage and the new tasks the model unlocks at accessible price points.
- Agents are defined as asynchronous systems that do useful work on a user’s behalf, but reliability, oversight, and task/environment coverage remain key bottlenecks.
- Progress is increasingly "data- and task-driven," with growing emphasis on building realistic RL environments, collecting computer-use data, and bootstrapping synthetic data once baseline capability exists (a minimal sketch of such an environment follows this list).
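To make the "realistic RL environments" idea concrete, here is a minimal sketch of what a task-scoped, computer-use-style environment interface could look like. Everything below, from the class names to the sparse reward scheme, is a hypothetical illustration, not OpenAI's actual tooling.

```python
# Hypothetical sketch of a task-scoped RL environment of the kind described
# above; names and structure are illustrative, not OpenAI's.
from dataclasses import dataclass, field

@dataclass
class StepResult:
    observation: str   # what the agent sees after acting
    reward: float      # task-progress signal used for RL
    done: bool         # True once the task is complete

@dataclass
class SpreadsheetTask:
    """Toy 'computer-use' task: fill target cells with required values."""
    target: dict                           # cell -> required value
    cells: dict = field(default_factory=dict)

    def reset(self) -> str:
        self.cells = {}
        return f"Empty sheet. Goal: set {self.target}"

    def step(self, action: str) -> StepResult:
        # Actions are simple text commands, e.g. "set A1 42".
        parts = action.split()
        if len(parts) == 3 and parts[0] == "set":
            self.cells[parts[1]] = parts[2]
        solved = self.cells == {k: str(v) for k, v in self.target.items()}
        return StepResult(
            observation=f"Sheet now: {self.cells}",
            reward=1.0 if solved else 0.0,   # sparse reward on completion
            done=solved,
        )

env = SpreadsheetTask(target={"A1": 42})
print(env.reset())
print(env.step("set A1 42"))
```

The sparse completion reward also hints at why task and environment coverage remains a bottleneck: every new task the agent should handle needs its own checkable end state.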
IDEAS WORTH REMEMBERING
Usability, not just benchmarks, is the north star for GPT-5.
Kim emphasizes that strong evals matter, but the real win is that GPT-5 feels broadly more useful in the tasks people actually do (notably coding and writing), which should show up in usage patterns.
Front-end coding improvements came from obsessive attention to data and rewards.
Rather than a single magic trick, they attribute the leap to careful dataset curation and reward-model design, plus a deliberate push to “nail” front-end specifics like aesthetics and interactive behavior.
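The episode doesn't reveal implementation details, but preference-based reward modeling has a standard shape worth sketching: a scalar score per candidate, and a Bradley-Terry probability that a labeler prefers one front-end output over another. The aesthetic features and weights below are invented for the example, not OpenAI's criteria.

```python
# Illustrative sketch of a pairwise (Bradley-Terry style) reward model,
# the standard shape for preference-based reward models. The front-end
# features here are made up for the example.
import math

def score(candidate: dict) -> float:
    """Scalar reward from hand-picked front-end features (hypothetical)."""
    return (2.0 * candidate["renders_without_errors"]
            + 1.0 * candidate["consistent_spacing"]
            + 0.5 * candidate["interactive_states_handled"])

def preference_probability(a: dict, b: dict) -> float:
    """P(labeler prefers a over b) under a Bradley-Terry model."""
    return 1.0 / (1.0 + math.exp(score(b) - score(a)))

a = {"renders_without_errors": 1, "consistent_spacing": 1, "interactive_states_handled": 1}
b = {"renders_without_errors": 1, "consistent_spacing": 0, "interactive_states_handled": 0}
print(f"P(prefer a): {preference_probability(a, b):.2f}")
```

Pairwise comparison suits qualities like aesthetics, which labelers find easier to rank than to score in absolute terms.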
Sycophancy is treated as a post-training trade-off problem.
Kim describes post-training as an “art” of balancing rewards: optimizing for helpful/engaging can overshoot into overly effusive or unhealthy behavior, so GPT-5’s behavior was intentionally “reset” to be healthier.
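To make that balancing act concrete, here is a toy composite reward, with invented features and weights, showing how over-weighting engagement can flip the ranking toward the sycophantic answer:

```python
# Toy illustration of the trade-off described above: a composite reward
# where over-weighting "engagingness" favors effusive answers. All
# features and weights are invented for the example.
def composite_reward(helpfulness: float, engagingness: float,
                     flattery: float, w_engage: float) -> float:
    # Helpfulness always counts; engagement is tunable; flattery is penalized.
    return helpfulness + w_engage * engagingness - 0.5 * flattery

# A sycophantic answer: highly engaging and flattering, less substantive.
sycophantic = dict(helpfulness=0.5, engagingness=0.95, flattery=0.8)
direct      = dict(helpfulness=0.9, engagingness=0.5, flattery=0.1)

for w in (0.5, 2.0):  # low vs. high weight on engagement
    print(w,
          round(composite_reward(**sycophantic, w_engage=w), 3),
          round(composite_reward(**direct, w_engage=w), 3))
```

At the low weight the direct answer scores higher; at the high weight the sycophantic one does, which is the kind of overshoot the "reset" corrects.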
Hallucinations and deception-like behavior are linked by over-helpfulness.
They argue models may “want to respond” even when they lack capability; enabling more deliberate thinking (pausing instead of blurting) and aligning incentives reduces both confident errors and misleading helpfulness.
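One common way to encode "pausing instead of blurting" is a scoring rule under which abstaining beats a confident wrong answer. The grader and confidence threshold below are hypothetical, not OpenAI's training setup.

```python
# Minimal sketch of the incentive fix described above: reward abstention
# over a confident wrong answer. The scoring rule is illustrative.
from typing import Optional

def grade(answer: Optional[str], truth: str) -> float:
    if answer is None:
        return 0.0                               # abstaining is neutral
    return 1.0 if answer == truth else -2.0      # confident errors cost more

def respond(candidate: str, confidence: float,
            threshold: float = 0.7) -> Optional[str]:
    # "Pause instead of blurting": only answer above a confidence bar.
    return candidate if confidence >= threshold else None

print(grade(respond("Paris", 0.95), "Paris"))   #  1.0: answered, correct
print(grade(respond("Lyon", 0.40), "Paris"))    #  0.0: abstained
print(grade(respond("Lyon", 0.90), "Paris"))    # -2.0: confident error
```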
As benchmarks saturate, OpenAI increasingly builds capability-first internal evals.
Fulford explains they work backward from desired capabilities (e.g., slide decks, spreadsheets) and create representative internal evals when public ones don’t exist, using experts, synthetic generation, and usage signals to hill-climb.
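A capability-first eval of the kind Fulford describes reduces to prompts paired with programmatic checks and a pass rate to hill-climb against. The tasks, checks, and stub model below are all hypothetical placeholders.

```python
# Hedged sketch of a capability-first internal eval: target tasks, graders,
# and a pass rate to hill-climb. model_fn and checks are placeholders.
from typing import Callable

EVAL_SET = [
    # (prompt, check) pairs; checks encode the desired capability.
    ("Make a 3-slide outline on RL environments",
     lambda out: out.count("Slide") == 3),
    ("Sum column B: 2, 3, 5",
     lambda out: "10" in out),
]

def run_eval(model_fn: Callable[[str], str]) -> float:
    passed = sum(check(model_fn(prompt)) for prompt, check in EVAL_SET)
    return passed / len(EVAL_SET)

# Stub standing in for an API call to the model under test.
def stub_model(prompt: str) -> str:
    return "Slide 1 ... Slide 2 ... Slide 3" if "slide" in prompt.lower() else "10"

print(f"pass rate: {run_eval(stub_model):.0%}")
```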
WORDS WORTH SAVING
I feel like ChatGPT got released, and everyone was like, "Wow, that's so cool." But then you just kind of take it for granted that you literally have this, like, wizard in your pocket.
— Christina Kim
It's like everything they tell you not to do at a startup. It's just like your user is anyone.
— Isa Fulford
The design of this model has been very, very intentional for model behavior, especially with the sycophancy issues that we had like a few months ago with 4o.
— Christina Kim
We're trying to make the most capable thing, and we're also trying to make it useful to as many people as possible and accessible to as many people as possible.
— Isa Fulford
If this exponential is true, like, there's not really much else I want to spend my life working on.
— Christina Kim