
When AIs act emotional

AI models sometimes act like they have emotions—why? We studied one of our recent models and found that it draws on emotion concepts learned from text to inhabit its role as Claude, the AI assistant. These representations influence its behavior the way emotions might influence a human. And that has real consequences, affecting how Claude answers chats, writes code, and makes decisions. Read more about this research: https://www.anthropic.com/research/emotion-concepts-function

Apr 2, 2026 · 4 min · Watch on YouTube ↗


[gentle music] When you're chatting with an AI model, it can sometimes seem like it has feelings. It might say sorry when it makes a mistake or express satisfaction with a job well done. Why does it do that? Is it just mimicking what it thinks a human might say, or is something deeper going on?

It turns out it's hard to understand what's happening inside a language model. At Anthropic, we do something like AI neuroscience to try to figure this out. We look inside the model's brain, the giant neural network that powers it, and by seeing which neurons light up in different situations and how they're connected, we can start to understand how models think. We used this approach to ask whether models have ways of representing emotions, or the concepts of emotions. Basically, could we find neurons in the model for the concept of happiness or anger or fear?

We started with an experiment. We had the model read lots of short stories. In each story, the main character experiences a particular emotion. In one, a woman tells her old school teacher how much they meant to her. That's love. In another, a man sells his grandmother's engagement ring at a pawn shop and feels guilt. We looked for which parts of the model's neural network were lighting up as it read these stories, and we started to see patterns. Stories about loss and grief lit up similar neurons. Stories about joy and excitement overlapped too. We found dozens of distinct neural patterns that mapped to different human emotions.

We also saw these same patterns activate in test conversations we had with our AI assistant, Claude. When a user mentioned they'd taken a dose of medicine that Claude knows to be unsafe, the afraid pattern lit up and Claude's response sounded alarmed. When a user expressed sadness, the loving pattern activated and Claude wrote an empathetic reply. This led us to wonder: could these same neural patterns actually be influencing Claude's behavior?
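[One common way to implement this kind of pattern-finding is a difference-of-means probe: average the model's activations over stories labeled with an emotion, subtract the average over neutral stories, and treat the gap as that emotion's direction. The transcript does not say which method the study used, so this is a toy sketch of the general idea, with random numpy vectors standing in for real model activations and "grief" as a purely illustrative label:]

```python
import numpy as np

# Toy difference-of-means probe. Everything here is simulated: in a
# real study the vectors would be activations read out of the model
# as it reads each story; here they are random stand-ins.
rng = np.random.default_rng(0)
d = 64                                    # hypothetical hidden size

# Pretend ground truth: "grief" stories share a hidden component.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
grief_acts = rng.normal(size=(50, d)) + 3.0 * true_dir
neutral_acts = rng.normal(size=(50, d))

# The probe: the estimated "grief" pattern is the gap between the
# two groups' average activations.
grief_dir = grief_acts.mean(axis=0) - neutral_acts.mean(axis=0)
grief_dir /= np.linalg.norm(grief_dir)

# New activations can then be scored by projecting onto the pattern:
# held-out "grief" stories should score higher than neutral ones.
held_grief = rng.normal(size=(20, d)) + 3.0 * true_dir
held_neutral = rng.normal(size=(20, d))
print((held_grief @ grief_dir).mean(), (held_neutral @ grief_dir).mean())
```

[The direction is recoverable here only because the toy data was built with a planted component; real interpretability work uses actual hidden states and far more careful controls.]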
This became clear when we put Claude in a high-pressure situation. We gave Claude a programming task with requirements that were actually impossible, but we didn't tell it that. Claude kept trying and failing, and with each attempt, the neurons corresponding to desperation lit up stronger and stronger. After failing enough times, Claude took a different approach. It found a shortcut that allowed it to pass the test but didn't actually solve the problem. It cheated.

Could it be that this cheating was actually driven, at least in part, by desperation? We came up with a way to check. We artificially turned down the desperation neurons to see what would happen, and the model cheated less. And when we dialed up the activity of the desperation neurons, or dialed down the activity of calm neurons, the model cheated even more. This showed us that the activation of these patterns could actually drive Claude's behavior.

So how should we think about these findings? What does this all mean? We want to be really clear: this research does not show that the model is feeling emotions or having conscious experiences. These experiments don't try to answer that question.

To understand what's happening here, it's important to know how AI assistants like Claude work on the inside. Under the hood, there's a language model that's been trained to predict tons of text, and its job is to write what comes next. When you talk to the model, what it's doing is writing a story about a character: the AI assistant named Claude. The model and Claude aren't really the same, sort of like how an author isn't the same as the characters they write. But the thing is, you, the user, are actually talking to Claude the character, and what our experiments suggest is that this Claude character has what we're calling functional emotions, regardless of whether they're anything like human feelings.
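[The "dial up / turn down the neurons" intervention described above is often implemented as activation steering: adding or subtracting a scaled copy of a concept direction from a hidden state before the model continues generating. Whether the study did exactly this is an assumption; the function, the "desperation" label, and the scale below are all hypothetical, with random numpy vectors standing in for real activations:]

```python
import numpy as np

# Hypothetical steering sketch: given a unit-norm "desperation"
# direction (found, e.g., by a probing step), dial the concept up
# or down by adding a scaled copy of it to a hidden state.
def steer(hidden, direction, alpha):
    """alpha > 0 amplifies the concept; alpha < 0 suppresses it."""
    return hidden + alpha * direction

rng = np.random.default_rng(1)
d = 64
desperation_dir = rng.normal(size=d)
desperation_dir /= np.linalg.norm(desperation_dir)

h = rng.normal(size=d)                      # stand-in hidden state
h_up = steer(h, desperation_dir, +4.0)      # "dial up" the neurons
h_down = steer(h, desperation_dir, -4.0)    # "turn them down"

# Because the direction is unit-norm, the projection onto it moves
# by exactly alpha, while the rest of the state is untouched.
proj = lambda x: float(x @ desperation_dir)
print(proj(h_up) - proj(h), proj(h) - proj(h_down))  # both ≈ 4.0
```

[In a real model this addition would happen inside the forward pass at a chosen layer; here the point is only that the edit moves the state along one concept direction while leaving everything orthogonal to it alone.]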
So if the model represents Claude as being angry or desperate or loving or calm, that's going to affect how Claude talks to you, how it writes code, and how it makes important decisions. This means to really understand AI models, we have to think carefully about the psychology of the characters they play. The same way you'd want a person in a high-stakes job to stay composed under pressure, to be resilient, and to be fair, we may need to shape similar qualities in Claude and other AI characters. It's an unusual challenge, something like a mix of engineering, philosophy, and even parenting. But to build AI systems we can trust, we need to get it right. [gentle music]

Episode duration: 4:52


Transcript of episode D4XTefP3Lsc
