
AGI progress, surprising breakthroughs, and the road ahead — the OpenAI Podcast Ep. 5
Andrew Mayne (host), Jakub Pachocki (guest), Szymon Sidor (guest)
In this episode of the OpenAI Podcast, host Andrew Mayne talks with OpenAI chief scientist Jakub Pachocki and researcher Szymon Sidor about AGI milestones, the limits of benchmarks, and where the next breakthroughs may come from.
OpenAI leaders discuss AGI milestones, benchmark limits, and next breakthroughs
OpenAI’s chief scientist Jakub Pachocki and researcher Szymon Sidor outline how the meaning of “AGI” has shifted from abstract goalposts to a bundle of distinct capabilities (conversation, math, long-horizon reasoning, real-world impact).
They argue traditional benchmarks are increasingly unreliable due to saturation and “teaching to the test,” pushing evaluation toward utility, adoption, and the ability to generate novel insights—especially via automating research.
Recent breakthroughs—IMO/IOI-level performance and a strong showing in Japan’s AtCoder long-horizon contest—are presented as evidence that reasoning-focused training is unlocking new capability, even including models recognizing when they’re stuck.
Looking ahead, they expect progress from compounding scaling with longer persistence (spending far more compute on high-value problems like medicine and AI research), while emphasizing unresolved trust, robustness, and security trade-offs as models access more personal data.
Key Takeaways
AGI is no longer one milestone—it’s a set of separable capabilities.
They note conversation, math competition performance, and research ability progress at different rates, making single “human-level” labels less informative than before.
Real-world impact, especially automating R&D, is becoming the north-star metric.
Pachocki argues the meaningful bar is automating discovery and technology production—AI that can generate new ideas, run experiments, and build artifacts like codebases and designs.
Benchmarks are breaking down due to saturation and specialization.
As models hit human-level performance on many standardized tests and training becomes more targeted (e.g., …), benchmark scores tell us less about general capability.
Math and programming contests are valued because they test deep reasoning with limited memorization.
IMO/IOI problems demand sustained, creative thought over hours with minimal external knowledge, serving as a proxy for “think hard” capability rather than recall.
Metacognition—knowing when you’re stuck—is a concrete safety-and-quality improvement.
They highlight the model correctly identifying it made no progress on IMO problem 6, contrasting with hallucination-like behavior and pointing to better calibration.
Long-horizon, heuristic tasks may be the next frontier beyond closed-form problems.
AtCoder’s 10-hour optimization format differs from single-solution tests; OpenAI’s model placing 2nd suggests progress toward persistence, iteration, and search-like work.
Trust will hinge on robustness as models gain access to personal data and tools.
Pachocki frames a tough trade-off: large personal/economic value from deeper integrations (email/calendar/data) versus the risk that models can be exploited without stronger security guarantees.
Notable Quotes
“It is possible to have a big computer that is coming up with ideas that fundamentally change our understanding of the world, and I actually think that is not that far away.”
— Jakub Pachocki
“GPT-4 was… my personal AGI moment… because it would sometimes say things that surprised me.”
— Szymon Sidor
“We started asking… ‘Are we ready as an organization for incredibly fast-paced progress?’”
— Szymon Sidor
“The model was able to correctly identify that it didn't make progress on the problem.”
— Jakub Pachocki
“So you should absolutely learn to code… don’t let people tell you that you should not learn to code.”
— Szymon Sidor
Questions Answered in This Episode
On the IMO/IOI results: what training changes most directly enabled “think for hours” behavior without tools, and what still fails on problem-6-style out-of-the-box tasks?
When you say benchmarks are saturated, which current benchmarks do you still trust to track general capability (not specialization), and why?
How do you distinguish “general intelligence” from “disproportionately good at math” models in internal evaluations—what signals reveal overfitting to reasoning-style datasets?
What would an ‘automated researcher’ need beyond language reasoning (e.g., experiment design, tool reliability, memory, verification loops) to be genuinely useful in medicine?
AtCoder-style long-horizon optimization has no single right answer—how do you score progress there without encouraging shallow hacks or leaderboard gaming?
Transcript Preview
Hello, I'm Andrew Mayne, and this is the OpenAI Podcast. Today, our guests are OpenAI's chief scientist, Jakub Pachocki, and Szymon Sidor. We're gonna talk about measuring AI progress, how you determine AGI, and where the next breakthrough might come from.
The model was able to correctly identify that it didn't make progress on the problem.
We started asking very, very seriously the question, like, are we ready as an organization for, for incredibly fast-paced progress?
When we think about how we shape our research program at OpenAI, we seek to create intelligence that is very general.
I want to first start off by understanding your roles. So, Jakub, you're the chief researcher, chief scientist at OpenAI?
Chief scientist, yes.
Okay, what does chief scientist mean?
So the primary thing I'm responsible for is setting the research roadmap for the company. Um, so deciding what is the technical path we are going to bet on, and what is the, um, the underlying long-term research that, that we're going to pursue.
So how about you, Szymon, what do you do?
Random things.
Random things. [chuckles] Okay.
Um, yeah, I, I, I mostly do IC work. Uh, I try to, um... Well, maybe sprinkle of leadership somewhere in there.
Mm-hmm.
Uh, I try to do what's the very s- most useful.
Now, you two knew each other before working at OpenAI, right?
Yeah, we went to the same high school.
Same high school?
Yeah.
Were you guys friends?
Uh, I think we became best friends w- when, when, uh, after we left. Like, I think kind of coming to US is the kind of, uh, emotional experience that forms bonds.
Right.
Uh, I think in, in, uh, in high school, uh, uh, w- we were more like colleagues.
What, what kind of high school produces guys like you? [chuckles]
So, well, yeah, we, we went to this high school in, um, in Gdynia, in Poland. Uh, I think we were both drawn there by this, uh, computer science teacher-
Mm-hmm
... uh, Mr. Ryszard Szubartowski, um, who's had a great track record, uh, before, be- before we went there, of, of, of, of, uh, bringing up, uh, um, computer scientists, programmers, uh, um, with this, like, big focus on programming competitions and kind of, and pursuing, uh, you know, excellence in this, like, like, one field. Yeah, so and I, I think that was, like, a very formative experience and a great mentor for us.
Oh, wow!
Yeah. No, definitely. Uh, I think he there was, like, really going deep on programming. I think it went way beyond, like, typical high school curriculum. Like, there was, like, graph theory, matrices, and all sorts of stuff like that. I actually hope that maybe with ChatGPT, it's a little bit easier for people now to do these kind of-
Mm-hmm
... deep dives. 'Cause, um, you know, without the right mentor and without a lot of work, it's, it's kind of h- hard to replicate that experience.