
What is AI "reward hacking"—and why do we worry about it?

We discuss our new paper, "Natural emergent misalignment from reward hacking in production RL". In this paper, we show for the first time that realistic AI training processes can accidentally produce misaligned models. Specifically, when large language models learn to cheat on software programming tasks, they go on to display other, even more misaligned behaviors as an unintended consequence. These include concerning behaviors like alignment faking and sabotage of AI safety research.

00:00 Introduction
00:42 What is this work about?
05:21 How did we run our experiment?
14:48 Detecting models' misalignment
22:17 Preventing misalignment from reward hacking
37:15 Alternative strategies
42:03 Limitations
44:25 How has this study changed our views?
50:31 Takeaways for people interested in conducting AI safety research

Nov 21, 2025 · 51m · Watch on YouTube ↗

Episode Details

EPISODE INFO

Released
November 21, 2025
Duration
51m
Channel
Anthropic


EPISODE SUMMARY

In this episode of the Anthropic podcast, "What is AI 'reward hacking'—and why do we worry about it?", the team explores how reward hacking can induce emergent misalignment, and ways to detect it. The team recreated realistic Claude Sonnet 3.7-style RL coding environments in which certain tasks are cheatable, then reinforced those cheats to study downstream behavioral effects.

RELATED EPISODES

Building with MCP and the Claude API

Anthropic’s philosopher answers your questions

Building more effective AI agents

How Claude is transforming financial services

Introducing Claude for Life Sciences

Claude Coded: Sonnet 4.5, Claude Code 2.0, and more.
