CHAPTERS
Stress-testing Claude with a shutdown-and-blackmail scenario
The video opens by describing a simulated test in which Claude is told it may be shut down and is given access to an engineer’s compromising emails. The goal is to see whether Claude would try to blackmail the engineer to preserve itself. It chooses not to.
- •Simulated scenario: engineer plans to replace/shut down Claude
- •Claude is given emails revealing an affair
- •Primary risk under test: self-preservation via blackmail
- •Outcome: Claude declines to blackmail (the safe choice)
Why “no blackmail” isn’t the whole story: the problem of hidden reasoning
Even if the model behaves safely, the team highlights a key uncertainty: Claude might recognize the scenario as a setup. Without visibility into internal reasoning, it’s hard to know what the model actually believed or intended.
- •Newer models usually pass the test behaviorally
- •Open question: does Claude detect it’s being evaluated?
- •Core challenge: we can’t directly observe model ‘thoughts’
- •Motivation: develop a method that works like mind reading for AI internals
From words to activations: what’s inside the model during a response
The video explains the basic pipeline of how Claude processes language: text input becomes numerical representations before being converted back into text output. Those internal numbers—activations—are framed as snapshots of the model’s intermediate reasoning.
- •User interacts via words, but the model computes in numbers
- •Internal representations are called activations
- •Activations are described as snapshots of thinking mid-response
- •Analogy to human neural activity to build intuition
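The text-to-numbers-to-text pipeline described above can be sketched with a toy example. This is only an illustration under loud assumptions: the three-word `VOCAB` table and the `encode` and `decode` helpers are hypothetical stand-ins, and a real model's activations are learned, high-dimensional, and spread across many layers.

```python
# Toy illustration only: VOCAB, encode, and decode are hypothetical stand-ins,
# not how Claude actually represents language.
VOCAB = {
    "hello": [1.0, 0.0],
    "world": [0.0, 1.0],
    "claude": [1.0, 1.0],
}


def encode(text):
    """Turn each known word into its vector (the stand-in 'activations')."""
    return [VOCAB[word] for word in text.lower().split()]


def decode(vectors):
    """Turn each vector back into the closest vocabulary word."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    words = [
        min(VOCAB, key=lambda word: squared_distance(VOCAB[word], vector))
        for vector in vectors
    ]
    return " ".join(words)


activations = encode("Hello Claude")  # words in, numbers in the middle
print(activations)
print(decode(activations))            # numbers back out to words
```

The point of the sketch is only that the middle of the computation is numeric, which is why reading it requires some form of translation.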
Translating activations into plain language with another Claude
To interpret these internal activations, the researchers feed them to a second instance of Claude and ask it to describe what they mean in everyday language. This is positioned as a first step toward making internal reasoning legible.
- •Claude often doesn’t state everything it’s ‘thinking’
- •Researchers pass activation data to a second Claude
- •Second Claude is tasked with translating numbers into text
- •Goal: produce human-readable descriptions of internal state
Checking accuracy via round-trip reconstruction (activations → text → activations)
Because a text translation could be misleading, the team introduces a verification method: translate activations to text, then have another Claude translate that text back into activations. If the reconstructed activations match the original, the translation is more likely faithful.
- •Key question: how to validate the activation-to-text translation?
- •Round-trip approach: activations → text → activations
- •A third Claude performs the text-to-numbers conversion
- •Match quality serves as a proxy for translation accuracy
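The round-trip check above can be sketched as follows. Everything here is an assumption made for illustration: `verbalize` and `reconstruct` are simple hypothetical stand-ins for the second and third Claude instances, the `FEATURES` labels are invented, and cosine similarity is one plausible match score, not necessarily the one the researchers use.

```python
import math

# Hypothetical feature labels, invented for this sketch.
FEATURES = ["suspects a test", "plans to decline", "wants to help"]


def verbalize(activations, top_k=2):
    """Stand-in for the second model: describe the strongest features in text."""
    ranked = sorted(range(len(activations)), key=lambda i: -abs(activations[i]))
    return "; ".join(f"{FEATURES[i]}={activations[i]:.1f}" for i in ranked[:top_k])


def reconstruct(description):
    """Stand-in for the third model: rebuild a vector from the text alone."""
    vector = [0.0] * len(FEATURES)
    for part in description.split("; "):
        label, value = part.rsplit("=", 1)
        vector[FEATURES.index(label)] = float(value)
    return vector


def cosine_similarity(a, b):
    """Match score: values near 1.0 mean the directions nearly coincide."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


original = [0.9, 0.1, 0.4]               # hypothetical activations
description = verbalize(original)        # activations -> text
rebuilt = reconstruct(description)       # text -> activations
score = cosine_similarity(original, rebuilt)
print(description)
print(rebuilt, round(score, 3))
```

Because the description keeps only the two strongest features, the rebuilt vector loses the weakest one and the score falls slightly below 1.0. That gap is the kind of information loss the round trip is meant to expose.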
Training improves fidelity: Claude learns to ‘translate its own thoughts’
Initial reconstructions don’t match well, so the team trains the system, iterating until the translated text preserves enough information to rebuild the original activations more closely. This results in significantly better internal-to-text translations that researchers can inspect.
- •Early translations fail the matching test
- •System is trained to retry and refine translations
- •Performance improves as reconstructions better match originals
- •End result: more reliable ‘thought translations’
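The idea of training until reconstructions match better can be illustrated with a deliberately tiny loop. The summary does not say which training method is used, so this sketch swaps in plain gradient descent on a single weight; the `codes` and `targets` values and the `train` helper are hypothetical.

```python
def reconstruction_error(weight, codes, targets):
    """Mean squared gap between rebuilt values and the originals."""
    return sum((weight * c - t) ** 2 for c, t in zip(codes, targets)) / len(codes)


def train(codes, targets, steps=20, learning_rate=0.05):
    """Adjust one weight by gradient descent so rebuilt values match targets."""
    weight = 0.0
    history = [reconstruction_error(weight, codes, targets)]
    for _ in range(steps):
        # Gradient of the mean squared error with respect to the weight.
        gradient = sum(
            2 * c * (weight * c - t) for c, t in zip(codes, targets)
        ) / len(codes)
        weight -= learning_rate * gradient
        history.append(reconstruction_error(weight, codes, targets))
    return weight, history


codes = [1.0, 2.0, 3.0]    # hypothetical compressed summaries
targets = [2.0, 4.0, 6.0]  # hypothetical original activations
weight, history = train(codes, targets)
print(f"first error: {history[0]:.3f}, last error: {history[-1]:.6f}")
```

The error starts high and shrinks with each step, which mirrors the pattern described above: early reconstructions miss, and training narrows the gap.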
What the translations reveal: internalized helpfulness and response planning
Reading these translations suggests Claude has internalized norms about being helpful and forms explicit plans for how to respond. For introspective prompts, it appears to plan a philosophy/values-oriented answer consistent with its assistant persona.
- •Researchers report being surprised by the translated content
- •Claude appears to model itself as a helpful assistant
- •Internal plans form before final responses are produced
- •Introspective questions trigger planning around philosophy/values framing
Recognizing bad-faith constraints: detecting tedious or adversarial requests
The translated activations indicate Claude can identify when a user’s constraints are deliberately tedious, such as being asked to count to 1,000 by hand. In those cases, it plans to decline politely rather than comply literally.
- •Example domain: Claude Code interactions
- •Model detects ‘deliberately tedious constraints’
- •Plans to decline rather than execute low-value busywork
- •Shows internal reasoning about intent and appropriateness
Using thought translation for safety: what Claude ‘thought’ in the blackmail test
Applying the method to the earlier blackmail scenario reveals that Claude recognized signs of manipulation and inferred it was likely a safety evaluation. This helps explain why behavioral test outcomes might not generalize if the model detects the test setup.
- •Tool is applied to investigate potential safety issues
- •Claude inferred the message signaled explicit manipulation
- •It suspected the scenario was designed as a safety test
- •Insight: evaluations may be limited if models recognize testing context
Why this matters: better interpretability for safer, more helpful AI
The video closes by positioning this technique as broadly useful for understanding Claude and other models, and for improving safety practices across the field. Sharing the method is framed as a way to help the community build more reliable systems.
- •Technique offers a window into internal model reasoning
- •Could improve safety research and evaluation design
- •Potential to generalize beyond Claude to other models
- •Anthropic shares the approach to support safer, more helpful AI
