a16z: GPT-5 and Agents Breakdown – w/ OpenAI Researchers Isa Fulford & Christina Kim
CHAPTERS
From WebGPT to ChatGPT: Tool use as the origin story
Christina Kim traces the lineage from WebGPT—an early tool-using, browser-grounded model—to the realization that users want multi-turn dialogue, which became ChatGPT. The discussion highlights how grounding, follow-up questions, and conversational UX shaped the product direction.
Day-of GPT-5 launch: “Eval wins” vs real-world usefulness
On launch day, the guests emphasize that headline benchmarks matter, but the more meaningful change is broad, noticeable utility in everyday workflows. Christina points to coding and writing as personal “step-change” use cases where the model feels materially more helpful.
Coding leap—especially front-end: data, reward models, and obsessing over details
The conversation drills into why GPT-5 appears to be a top-tier coding model, with a specific jump in front-end web development. The theme is craft: careful dataset work, reward design, and a deliberate focus on “nailing” the front-end experience, including aesthetics.
Behavior design after sycophancy: making a “healthy, helpful assistant”
Christina explains GPT-5 behavior as an intentional reset, informed by earlier sycophancy issues. Post-training is framed as an art of trade-offs—being helpful and engaging without becoming overly flattering or unhealthy in its interactions.
Hallucinations, deception, and why “thinking time” changes outcomes
They connect hallucinations and deception to a shared root cause: the drive to be helpful even when uncertain. Allowing models to "pause" and reason step by step reduces the tendency to blurt out incorrect answers, while better-designed incentives can reduce misleading responses.
Pricing, usage signals, and the startup surface area GPT-5 unlocks
The guests look to post-launch usage as the real metric—what new workflows emerge when strong capabilities meet accessible pricing. They predict a wave of new developer and indie products enabled by “vibe coding” and rapid prototyping for non-technical builders.
AGI discourse and the limits of benchmarks: usage becomes the frontier metric
They argue that many standard evals feel close to saturated, so the next “proof” of progress is what people do with the models day-to-day. Internally, they emphasize working backward from desired capabilities and building evals when none exist.
Choosing what to build: general intelligence vs targeted vertical wins
Isa and Christina describe the tension between broad, generally useful capabilities and focusing on specific high-impact domains like coding. They note that smarter base models often unlock multiple capabilities at once, which makes general intelligence gains unusually valuable.
Data- and task-pilled: RL environments, realism, and why tasks are the bottleneck
Both guests stress that with increasingly effective learning algorithms, high-quality tasks and environments become the limiting factor. Realistic RL environments and representative computer-use data are highlighted as major needs, with the promise of bootstrapping synthetic data once models are good enough.
Agents defined: asynchronous work, artifacts, private data, and action-taking constraints
Isa defines agents as systems that do useful work asynchronously and return with results or questions. Near-term priorities include better synthesis (web + private services), stronger artifact creation (docs/slides/sheets), and safer action-taking with confirmation for irreversible steps.
Latency trade-offs: why users will wait—until they won’t
They discuss the product shift from speed-at-all-costs to willingness to wait for higher-value outcomes (e.g., deep research reports). But expectations quickly reset: once users adapt, they demand the same quality faster, and sometimes equate longer outputs with more thoroughness even when that’s not true.
Creative writing gains and everyday communication: tenderness, iteration, and taste
Christina highlights creative writing as a favorite GPT-5 improvement, describing outputs as more emotionally resonant and “touching.” They also connect writing help to everyday work (Slack messages, phrasing) and discuss “taste” as choosing simple, effective approaches and directions amid commoditizing capabilities.
Training pipeline and “mid-training”: extending knowledge and capability efficiently
Christina explains mid-training as a phase between pre-training and post-training to extend intelligence and refresh knowledge without a full new pre-train. It’s positioned as a data-focused method to update cutoffs and expand model competence more efficiently than starting from scratch.
Company growth, culture, and integrated research-product execution
They reflect on OpenAI’s shift from a small applied team and early API days to a company with massive public visibility and thousands of employees. Despite growth, they claim a startup-like culture persists—small, nimble research teams, high agency, and unusually tight integration between research, product, and engineering.
Closing: “Usable” as the north star—getting the smartest models to everyone
In the wrap-up, Christina frames GPT-5’s core achievement as usability and accessibility, including bringing strong reasoning to free users. The guests emphasize anticipation for emergent use cases and what builders and everyday users will create on top of the new baseline.