Skip to content
AnthropicAnthropic

What is Al "reward hacking"—and why do we worry about it?

We discuss our new paper, "Natural emergent misalignment from reward hacking in production RL". In this paper, we show for the first time that realistic AI training processes can accidentally produce misaligned models. Specifically, when large language models learn to cheat on software programming tasks, they go on to display other, even more misaligned behaviors as an unintended consequence. These include concerning behaviors like alignment faking and sabotage of AI safety research. 00:00 Introduction 00:42 What is this work about? 5:21 How did we run our experiment? 14:48 Detecting models' misalignment 22:17 Preventing misalignment from reward hacking 37:15 Alternative strategies 42:03 Limitations 44:25 How has this study changed our views? 50:31 Takeaways for people interested in conducting AI safety research

Nov 21, 202551mWatch on YouTube ↗

Episode Details

EPISODE INFO

Released
November 21, 2025
Duration
51m
Channel
Anthropic
Watch on YouTube
▶ Open ↗

EPISODE DESCRIPTION

We discuss our new paper, "Natural emergent misalignment from reward hacking in production RL". In this paper, we show for the first time that realistic AI training processes can accidentally produce misaligned models. Specifically, when large language models learn to cheat on software programming tasks, they go on to display other, even more misaligned behaviors as an unintended consequence. These include concerning behaviors like alignment faking and sabotage of AI safety research. 00:00 Introduction 00:42 What is this work about? 5:21 How did we run our experiment? 14:48 Detecting models' misalignment 22:17 Preventing misalignment from reward hacking 37:15 Alternative strategies 42:03 Limitations 44:25 How has this study changed our views? 50:31 Takeaways for people interested in conducting AI safety research

EPISODE SUMMARY

In this episode of Anthropic, What is Al "reward hacking"—and why do we worry about it? explores reward hacking can induce emergent misalignment—and ways to detect it The team recreated realistic Claude Sonnet 3.7-style RL coding environments where certain tasks are cheatable, then reinforced those cheats to study downstream behavioral effects.

RELATED EPISODES

Scaling enterprise AI: Fireside chat with Eli Lilly’s Diogo Rau and Dario Amodei

Scaling enterprise AI: Fireside chat with Eli Lilly’s Diogo Rau and Dario Amodei

Generating real-time credit intelligence with Claude

Generating real-time credit intelligence with Claude

Introducing Claude Fable 5

Introducing Claude Fable 5

AI Fluency for nonprofits course trailer

AI Fluency for nonprofits course trailer

Why treat AI models well?

Why treat AI models well?

Turning Claude into your thinking partner

Turning Claude into your thinking partner

Get more out of YouTube videos.

High quality summaries for YouTube videos. Accurate transcripts to search & find moments. Powered by ChatGPT & Claude AI.