a16z | From Vibe Coding to Vibe Researching: OpenAI’s Mark Chen and Jakub Pachocki
At a glance
WHAT IT’S REALLY ABOUT
OpenAI leaders on GPT-5, evals, RL, and automated research vision
- GPT-5 is positioned as a unification move that brings “reasoning by default” to mainstream users, reducing confusion between instant-response GPT models and longer-thinking O-series models.
- OpenAI views traditional benchmarks as increasingly saturated and is shifting toward evals tied to long-horizon autonomy and economically relevant discovery rather than incremental percentage gains.
- The team highlights surprising GPT-5 gains in hard sciences, citing experiences where the model can automate work that might take human students months, and frames this as an early signal toward “automated researcher” capabilities.
- Reinforcement learning is described as a continuing engine of progress because it can be anchored to the rich “environment” created by language-model pretraining, enabling targeted capability improvements and extended reasoning reliability.
- They argue that sustaining frontier velocity requires protecting fundamental research from product pull, maintaining clear long-term objectives (automated researcher), and doing disciplined compute-driven portfolio prioritization.
IDEAS WORTH REMEMBERING
GPT-5’s product thesis is “reasoning without mode selection.”
They frame GPT-5 as removing the user burden of choosing between fast models and long-thinking models by automatically tuning “how much thinking” a prompt needs, making agentic behavior feel default rather than optional.
Benchmark saturation is pushing eval design toward real discovery and autonomy.
Moving from 96% to 98% on long-used tests matters less; they want evals that measure whether the model can operate autonomously for long periods and produce economically relevant novel outputs.
Competition results are treated as proxies for future research ability.
They cite IOI/AtCoder/IMO-style markers as meaningful because many elite human researchers share those backgrounds, even while acknowledging these too may eventually saturate.
Long-horizon agency and stability are the same core problem: consistency over time.
As models take more steps and use more tools, reliability depends on sustained reasoning, self-correction, and not “going off track” across long sequences rather than optimizing single-shot accuracy alone.
Hard-science usefulness is emerging as a “light bulb” moment for experts.
They describe GPT-5 Pro surprising physicists and mathematicians by producing nontrivial math/science help—sometimes compressing months of student effort—indicating practical research leverage beyond toy tasks.
WORDS WORTH SAVING
So the big thing that we are targeting with our research is producing an automated researcher, so automating the discovery of new ideas.
— Jakub Pachocki
I do feel like already it’s kind of transformed the default for coding. This past weekend, I was talking to some high schoolers and they were saying, “Oh, you know, actually the default way to code is vibe coding.”
— Mark Chen
I do think, you know, the future hopefully will be vibe researching.
— Mark Chen
Persistence is a very key trait. I think the special thing about research is that you’re trying to create something, or learn something, that is just not known. It’s not known to work.
— Jakub Pachocki
I think the danger is you end up second place at everything and not clearly leading at anything.
— Mark Chen
High quality AI-generated summary created from speaker-labeled transcript.