Ilya Sutskever on the Dwarkesh Podcast: Why RL Overfits the Evals
Why RL training that targets benchmark evals creates models that ace the tests but cycle through the same bugs: Sutskever links this to skipping value functions in the training mix.
FREQUENTLY ASKED QUESTIONS
Direct answers grounded in the episode transcript, with timestamps pointing back to the source.
Why do AI models score well on evals but fail in real-world tasks?
Ilya Sutskever frames the eval gap as a sign that today's models are strangely jagged: they can do hard benchmark tasks, yet in a coding workflow they may fix one bug, introduce another, then reintroduce the first bug. He offers two possible explanations. One is that RL training may make models too single-minded and narrowly focused, leaving them unaware in basic ways. The other is that RL training environments are chosen by researchers, unlike pre-training, where the answer to what data to use was basically everything. If teams take inspiration from public evals when designing those environments, they can produce training that makes the release metrics look great. Combined with inadequate generalization, that could explain why eval performance and actual real-world performance are disconnected.
▸ 1:38 in transcript

What is a value function in Ilya Sutskever's RL explanation?
A value function gives earlier feedback instead of waiting for the final score. In Sutskever's explanation, current reinforcement learning often works by giving a neural net a problem, letting it take thousands or even hundreds of thousands of actions or thoughts, then grading the final solution. That final score becomes the training signal for every action in the trajectory. The problem is that a long task produces no learning signal until the proposed solution arrives. A value function short-circuits that delay by sometimes judging whether the model is doing well or badly midstream. In chess, losing a piece already tells you something went wrong before the game ends. In math or programming, realizing after a thousand steps that a direction is unpromising could train the earlier decision to avoid that path next time.
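To make the contrast concrete, here is a minimal PyTorch sketch, not from the episode: the `PolicyValueNet` architecture, the names, and the terminal-only reward setup are illustrative assumptions. It contrasts the two credit-assignment schemes Sutskever describes, grading every action with the final score alone versus letting a learned value function supply a per-step signal.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: the network and the terminal-only reward setup
# are assumptions for this example, not details from the episode.

class PolicyValueNet(nn.Module):
    """Tiny actor-critic: a policy head plus a value head (the value function)."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)  # picks actions
        self.value_head = nn.Linear(hidden, 1)           # "how well is this going?"

    def forward(self, obs: torch.Tensor):
        h = self.body(obs)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

def outcome_only_loss(log_probs: torch.Tensor, final_reward: float) -> torch.Tensor:
    # The scheme Sutskever describes as common today: take thousands of
    # actions, grade the finished solution once, and apply that single
    # scalar to every action in the trajectory. No signal arrives midstream.
    return -final_reward * log_probs.sum()

def with_value_function_loss(
    log_probs: torch.Tensor,   # log pi(a_t | s_t), one entry per step
    values: torch.Tensor,      # V(s_t) from the value head, one entry per step
    final_reward: float,
) -> torch.Tensor:
    # With a value function, each step is compared against the critic's
    # midstream estimate, so an unpromising direction (a "lost piece" in
    # chess terms) can penalize the earlier decision directly.
    returns = torch.full_like(values, final_reward)  # terminal-only reward here
    advantages = returns - values.detach()           # per-step credit signal
    policy_loss = -(advantages * log_probs).sum()
    value_loss = ((values - returns) ** 2).mean()    # train the critic too
    return policy_loss + 0.5 * value_loss
```

In this toy setup the value function only acts as a per-step baseline; richer variants bootstrap with TD errors. The point is the same either way: feedback no longer has to wait for the final grade.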
▸ 14:20 in transcript

Why do humans generalize better than AI models?
Sutskever treats human generalization as the core gap between people and current models. He first considers evolution as a possible explanation: for vision, hearing, locomotion, and dexterity, humans may benefit from powerful evolutionary priors, and he says robot dexterity still looks out of reach when trained only in the real world. He cites Yann LeCun's driving example, in which a teenager can learn to drive after about 10 hours of practice, though already equipped with strong vision developed since childhood. But Sutskever argues the stronger evidence comes from language, math, and coding, especially math and coding, because those domains did not exist for most of ancestral history. If people show reliability, robustness, and learning ability in such recent domains, that suggests humans may simply have better machine learning, period.
▸ 26:19 in transcript

How would SSI's model learn from deployment?
SSI's deployment picture is not a finished mind dropped into the economy. Sutskever says even a straight-shot plan would still include gradual release; gradualism is an inherent component of any plan. He then reframes AGI and pre-training as terms that have shaped how people think. Pre-training gave the impression that more training makes a model broadly better at everything, but a human being is not an all-knowing AGI: a human has a foundation of skills and relies on continual learning. Sutskever imagines a safe superintelligence as more like a superintelligent 15-year-old, a great student who is eager but does not yet know very much. Deployment could involve a trial-and-error learning period, where the system learns jobs as it enters the world rather than arriving as a finished artifact.
▸ 46:47 in transcript

What does Ilya Sutskever mean by AI caring about sentient life?
Sentient-life alignment is Sutskever's candidate goal for very powerful AI, not a solved recipe. He says the central AGI problem is power, and when systems become really powerful, companies should ask what they are trying to build. His preferred aspiration is an AI robustly aligned to care about sentient life. He thinks this may be easier than aiming only at human life because the AI itself may be sentient. He connects this to mirror neurons and empathy, suggesting that beings may model others with the same circuit they use to model themselves because that is efficient. When Dwarkesh Patel raises the concern that most future sentient beings could be AIs, Sutskever says the criterion may not be best, but it has merit and should be considered alongside ideas such as capping superintelligence's power.
▸ 56:10 in transcript
Answers are AI-generated from the transcript and may contain errors; verify against the source at the timestamps above.