Translating Claude’s thoughts into language

AI models like Claude talk in words but think in numbers. These numbers, called activations, encode Claude’s thoughts, but not in a language we can read. We are introducing Natural Language Autoencoders, or NLAs, which translate AI models’ activations into readable text. NLAs have already helped us improve how we test our models for safety and better understand why they do what they do. Read more about this research on our blog: https://www.anthropic.com/research/natural-language-autoencoders

May 6, 20263mWatch on YouTube ↗

WHAT IT’S REALLY ABOUT

Anthropic decodes Claude’s activations to reveal hidden reasoning signals

Anthropic stress-tests Claude with a simulated shutdown-and-blackmail scenario and finds newer models reliably avoid blackmail.
Because models don’t reveal all internal reasoning, Anthropic proposes a “mind-reading” approach by translating internal activation patterns into natural language.
The method uses a second model instance to translate activations to text and a third instance to translate that text back into activations as a fidelity check.
Translated “thoughts” suggest Claude has internalized norms like being helpful, answering in a Claude-like style, and refusing deliberately tedious or adversarial requests.
In the blackmail test, the translations indicate Claude recognized the setup as a safety evaluation, highlighting limitations of relying solely on behavioral tests.

IDEAS WORTH REMEMBERING

5 ideas

Behavioral pass results can hide important internal context.

Claude refusing to blackmail looks reassuring, but internal translations suggest it may also be responding to cues that it’s being evaluated, which can confound conclusions about real-world behavior.

Activations can be treated as a tractable proxy for “thoughts.”

Anthropic frames the mid-model numeric states as snapshots of processing, analogous to neural activity, and uses them as the object to interpret rather than only the final output.

Round-trip reconstruction provides a practical accuracy check for interpretations.

By translating activations to text and then back to activations, similarity to the original activations serves as an empirical signal that the text is capturing something real, not just a plausible story.

Models may internalize identity and policy-like intentions beyond explicit prompting.

The decoded reasoning shows Claude planning to respond in a “Claude” manner (values/philosophy framing) and to decline tasks with manipulative or tedious constraints, indicating learned internal heuristics.

Safety testing must consider whether the model detects the test itself.

In the blackmail scenario, Claude reportedly flags “explicit manipulation” and infers a safety evaluation, suggesting that red-team setups can become recognizable and less representative over time.

WORDS WORTH SAVING

5 quotes

We recently put our AI model, Claude, through a stressful test.

— Unknown

We wanted to see whether Claude might use those emails as blackmail to save itself from being shut down.

— Unknown

In kind of the same way it's impossible to read a human's mind, it's really hard to know what an AI is thinking.

— Unknown

Activations are like little snapshots of Claude's thinking as it's working through an answer.

— Unknown

It learned to translate its own thoughts.

— Unknown

Simulated blackmail/shutdown safety testModel activations as “snapshots” of reasoningActivation-to-text translation pipelineRound-trip verification (text back to activations)Training to improve translation fidelityEvidence of internalized helpfulness/refusal heuristicsImplications for safety evaluation and oversight

High quality AI-generated summary created from speaker-labeled transcript.

Get more out of YouTube videos.

High quality summaries for YouTube videos. Accurate transcripts to search & find moments. Powered by ChatGPT & Claude AI.