a16z
From Vibe Coding to Vibe Researching: OpenAI’s Mark Chen and Jakub Pachocki
CHAPTERS
Why GPT-5: Bringing reasoning and agentic behavior to the default experience
Mark and Jakub frame GPT-5 as an effort to make “reasoning models” mainstream, reducing the need for users to choose between fast responses and deep thinking modes. They discuss unifying prior model lines and automatically calibrating how much “thinking time” a prompt needs.
Measuring progress when classic benchmarks saturate: evals, competitions, and new milestones
The conversation turns to how OpenAI evaluates progress as many longstanding benchmarks approach ceiling performance. They argue that with reinforcement learning and targeted training, single eval scores can be misleading, so the field needs new measures tied to discovery and economic relevance.
Unexpected GPT-5 capabilities: hard science usefulness and “months of student work” automation
Mark and Jakub describe moments when GPT-5 (and, earlier, o3) crossed a threshold of practical usefulness, especially for math and physics reasoning. They cite “light bulb” reactions from professional scientists and discuss the models’ growing trustworthiness for derivations and technical work.
Roadmap to an automated researcher: extending time horizons, memory, and autonomous operation
Jakub lays out OpenAI’s north star: an automated researcher that can discover new ideas in ML and other sciences. Progress is framed as extending the time horizon of coherent reasoning—from hours today toward much longer planning, memory, and reliable autonomy.
Agency vs quality trade-offs: planning depth, stability, and why reasoning is the backbone
They address the observed tension where adding tools/steps can degrade quality, especially late in long chains. The guests connect “depth” and “stability” as the same core challenge: staying on-track over long horizons, with reasoning enabling course-correction under feedback.
Beyond verifiable tasks: tackling open-ended domains and redefining what “open-ended” means
Jakub argues that as problems get longer-horizon—even if well-defined—they become more open-ended in practice because they require choosing fields, programs, and directions. They discuss creative writing as another “extreme,” and how research ultimately forces work in less verifiable spaces.
Why reinforcement learning keeps paying off: combining pretraining’s world model with RL objectives
Jakub explains RL’s effectiveness as a versatile optimization layer once you have strong pretrained language models as a rich environment. He recounts OpenAI’s early RL roots, the difficulty of defining environments, and how language modeling enabled RL to operate in nuanced human contexts.
Reward modeling & enterprise mindset: expect rapid simplification and shifting best practices
Asked how non-RL experts should approach reward modeling, Jakub emphasizes that the tooling and best practices are evolving quickly. He suggests adopting an adaptive mindset—what’s hard and bespoke today may become simpler and more “human-like learning” over time.
GPT-5 Codex and real-world coding: messy environments, behavior specs, and latency presets
Mark describes Codex’s focus: turning raw reasoning intelligence into practical coding help in messy real-world repos. They highlight behavior tuning (proactivity, style, “laziness”) and better presets that spend less time on easy tasks and more time on hard ones.
From “vibe coding” to “vibe researching”: how AI changes the default way people build
They reflect on a Lee Sedol/AlphaGo-style moment for coding: watching models surpass one’s own coding ability feels formative and expands what seems possible. Mark shares that younger users already treat “vibe coding” as the default, and he hopes research will follow the same pattern.
What makes a great researcher: persistence, honest hypothesis testing, taste, and emotional management
Jakub and Mark outline traits of strong researchers: persistence through frequent failure, clear hypotheses, and truth-seeking honesty. Mark adds that experience builds problem selection instincts and the emotional skill to persevere—or pivot—over long timelines.
How breakthroughs happen internally: finding “bugs” in code and in mental models
Jakub gives a grounded view of progress: many pivotal moments come from discovering subtle bugs—either in software that invalidates experiments or in flawed assumptions. Fixing these can unlock stalled research programs and reshape thinking from first principles.
Building a resilient research culture: mission focus, protecting fundamental research, and diverse researcher archetypes
Mark emphasizes that OpenAI’s retention and resilience come from doing frontier fundamental research rather than copying competitors. They discuss hiring for people who’ve solved hard problems (often outside ML), supporting varied research styles, and protecting time/space for core algorithmic work.
Strategy, compute, and scale: portfolio allocation, staying flexible, and trust at the top
They discuss resource allocation as a dynamic portfolio problem in which compute is the key bottleneck, and ending up “second place at everything” is the risk of failing to prioritize. The closing reflects on OpenAI’s ability to keep learning at scale, and on the trust between Mark and Jakub forged during the early reasoning and RL efforts.