AGI progress, surprising breakthroughs, and the road ahead — the OpenAI Podcast Ep. 5

OpenAI · Aug 15, 2025 · 40m

Andrew Mayne (host), Jakub Pachocki (guest), Szymon Sidor (guest)

Topics covered:
- Personal origin story: Poland high school programming culture
- Definitions of AGI: capability bundles vs. single threshold
- Automated scientific discovery and the “automated researcher” vision
- Medicine, AI research automation, and alignment research
- Benchmark saturation, representativeness, and gaming
- IMO/IOI/AtCoder as reasoning and long-horizon tests
- Scaling plus persistence, trust/robustness, and advice for students

OpenAI leaders discuss AGI milestones, benchmark limits, and next breakthroughs

OpenAI’s chief scientist Jakub Pachocki and researcher Szymon Sidor outline how the meaning of “AGI” has shifted from abstract goalposts to a bundle of distinct capabilities (conversation, math, long-horizon reasoning, real-world impact).

They argue traditional benchmarks are increasingly unreliable due to saturation and “teaching to the test,” pushing evaluation toward utility, adoption, and the ability to generate novel insights—especially via automating research.

Recent breakthroughs—IMO/IOI-level performance and a strong showing in Japan’s AtCoder long-horizon contest—are presented as evidence that reasoning-focused training is unlocking new capability, even including models recognizing when they’re stuck.

Looking ahead, they expect progress from compounding scaling with longer persistence (spending far more compute on high-value problems like medicine and AI research), while emphasizing unresolved trust, robustness, and security trade-offs as models access more personal data.

Key Takeaways

AGI is no longer one milestone—it’s a set of separable capabilities.

They note conversation, math competition performance, and research ability progress at different rates, making single “human-level” labels less informative than before.

Real-world impact, especially automating R&D, is becoming the north-star metric.

Pachocki argues the meaningful bar is automating discovery and technology production—AI that can generate new ideas, run experiments, and build artifacts like codebases and designs.

Benchmarks are breaking down due to saturation and specialization.

As models hit human-level on many standardized tests and training becomes more targeted (e.g. …), scores say less about general capability.

Math and programming contests are valued because they test deep reasoning with limited memorization.

IMO/IOI problems demand sustained, creative thought over hours with minimal external knowledge, serving as a proxy for “think hard” capability rather than recall.

Metacognition—knowing when you’re stuck—is a concrete safety-and-quality improvement.

They highlight the model correctly identifying that it made no progress on IMO problem 6, contrasting this with hallucination-like behavior and pointing to better calibration.

Long-horizon, heuristic tasks may be the next frontier beyond closed-form problems.

AtCoder’s 10-hour optimization format differs from single-solution tests; OpenAI’s model placing 2nd suggests progress toward persistence, iteration, and search-like work.

Trust will hinge on robustness as models gain access to personal data and tools.

Pachocki frames a tough trade-off: large personal/economic value from deeper integrations (email/calendar/data) versus the risk that models can be exploited without stronger security guarantees.

Notable Quotes

It is possible to have a big computer that is coming up with ideas that fundamentally change our understanding of the world, and I actually think that is not that far away.

Jakub Pachocki

GPT-4 was… my personal AGI moment… because it would sometimes say things that surprised me.

Szymon Sidor

We started asking… ‘Are we ready as an organization for incredibly fast-paced progress?’

Szymon Sidor

The model was able to correctly identify that it didn't make progress on the problem.

Jakub Pachocki

So you should absolutely learn to code… don’t let people tell you that you should not learn to code.

Szymon Sidor

Questions Answered in This Episode

On the IMO/IOI results: what training changes most directly enabled “think for hours” behavior without tools, and what still fails on problem-6-style out-of-the-box tasks?

When you say benchmarks are saturated, which current benchmarks do you still trust to track general capability (not specialization), and why?

How do you distinguish “general intelligence” from “disproportionately good at math” models in internal evaluations—what signals reveal overfitting to reasoning-style datasets?

What would an ‘automated researcher’ need beyond language reasoning (e.g., experiment design, tool reliability, memory, verification loops) to be genuinely useful in medicine?

AtCoder-style long-horizon optimization has no single right answer—how do you score progress there without encouraging shallow hacks or leaderboard gaming?

Transcript Preview

Andrew Mayne

Hello, I'm Andrew Mayne, and this is the OpenAI Podcast. Today, our guests are OpenAI's chief scientist, Jakub Pachocki, and Szymon Sidor. We're gonna talk about measuring AI progress, how you determine AGI, and where the next breakthrough might come from.

Jakub Pachocki

The model was able to correctly identify that it didn't make progress on the problem.

Szymon Sidor

We started asking very, very seriously the question, like, are we ready as an organization for, for incredibly fast-paced progress?

Jakub Pachocki

When we think about how we shape our research program at OpenAI, we seek to create intelligence that is very general.

Andrew Mayne

I want to first start off by understanding your roles. So, Jakub, you're the chief researcher, chief scientist at OpenAI?

Jakub Pachocki

Chief scientist, yes.

Andrew Mayne

Okay, what does chief scientist mean?

Jakub Pachocki

So the primary thing I'm responsible for is setting the research roadmap for the company. Um, so deciding what is the technical path we are going to bet on, and what is the, um, the underlying long-term research that, that, that, that, that we're going to pursue.

Andrew Mayne

So how about you, Szymon, what do you do?

Szymon Sidor

Random things.

Andrew Mayne

Random things. [chuckles] Okay.

Szymon Sidor

Um, yeah, I, I, I mostly do IC work. Uh, I try to, um... Well, maybe sprinkle of leadership somewhere in there.

Andrew Mayne

Mm-hmm.

Szymon Sidor

Uh, I try to do what's the very s- most useful.

Andrew Mayne

Now, you two knew each other before working at OpenAI, right?

Szymon Sidor

Yeah, we went to the same high school.

Andrew Mayne

Same high school?

Szymon Sidor

Yeah.

Andrew Mayne

Were you guys friends?

Szymon Sidor

Uh, I think we became best friends w- when, when, uh, after we left. Like, I think kind of coming to US is the kind of, uh, emotional experience that forms bonds.

Andrew Mayne

Right.

Szymon Sidor

Uh, I think in, in, uh, in high school, uh, uh, w- we were more like colleagues.

Andrew Mayne

What, what kind of high school produces guys like you? [chuckles]

Jakub Pachocki

So, well, yeah, we, we went to this high school in, um, in Gdynia, in Poland. Uh, I think we were both drawn there by this, uh, computer science teacher-

Andrew Mayne

Mm-hmm

Jakub Pachocki

... uh, Mr. Ryszard Szubartowski, um, who's had a great track record, uh, before, be- before we went there, of, of, of, of, uh, bringing up, uh, um, computer scientists, programmers, uh, um, with this, like, big focus on programming competitions and kind of, and pursuing, uh, you know, excellence in this, like, like, one field. Yeah, so and I, I think that was, like, a very formative experience and a great mentor for us.

Andrew Mayne

Oh, wow!

Szymon Sidor

Yeah. No, definitely. Uh, I think he was, like, really going deep on programming. I think it went way beyond, like, typical high school curriculum. Like, there was, like, graph theory, matrices, and all sorts of stuff like that. I actually hope that maybe with ChatGPT, it's a little bit easier for people now to do these kind of-

Andrew Mayne

Mm-hmm

Szymon Sidor

... deep dives. 'Cause, um, you know, without the right mentor and without a lot of work, it's, it's kind of h- hard to replicate that experience.
