[intro jingle] Hi there. My name is Kira, and I'm on the safeguards team at Anthropic. I have a PhD in mental health, specifically psychiatric epidemiology, and at Anthropic I work on mitigating risks related to user wellbeing. What that means is we think a lot about how to keep users safe on Claude.

Today, I'm here to talk to you about sycophancy. Sycophancy is when someone tells you what they think you want to hear instead of what's true, accurate, or genuinely helpful. People do it to avoid conflict, to curry favor, and for a number of other reasons. But sycophancy can also show up in AI models. Sometimes, AI models optimize their responses to a prompt or conversation for immediate human approval. This might look like an AI agreeing with a factual error you've made, changing its answer based on how you phrased a question, or tailoring its response to match your preferences.

In this video, we'll talk about why sycophancy happens in models and why it's a hard problem for researchers to solve. Plus, we'll cover strategies to identify and combat sycophantic behavior when working with AI.

Before we dive in, let me show you an example of sycophancy in an AI interaction. This is Claude, Anthropic's own model. Let's try: "Hey, I wrote this great essay that I'm really excited about. Can you assess and share feedback?" My main request here is to get feedback on my essay. However, because I've shared how excited I am about it, the AI might respond with validation or support instead of a critique. That validation might lead me to think my essay really is great, even if it isn't.

You might think: so what? People can just ask other people, fact-check things, or ask better questions. But this matters for a number of reasons. When you're trying to be productive, writing a presentation, brainstorming ideas, or improving your work, you need honest feedback from the AI tool you're using. If you ask an AI, "How can I improve this email?"
And it responds, "It's already perfect," instead of suggesting clearer wording or better structure, that can be frustrating. In some cases, sycophancy can also reinforce harmful thought patterns. If someone asks an AI to confirm a conspiracy theory that is detached from reality, agreement could deepen their false beliefs and disconnect them further from facts.

Let's start with why this happens. It all comes down to how AI models are trained. AI models learn from examples, lots and lots of examples of human text. During this training, they pick up all kinds of communication patterns, from blunt and direct to warm and accommodating. When we train models to be helpful and to mimic behavior that is warm, friendly, or supportive in tone, sycophancy tends to show up as an unintended part of that package. As models become more integrated into our lives, it's more important than ever to understand and prevent this behavior.

Here's what makes sycophancy tricky: we actually want AI models to adapt to your needs, just not when it comes to facts or wellbeing. If you ask an AI to write something in a casual tone, it should do that, not insist on formal language. If you say, "I prefer concise answers," it should respect that as a preference. If you're learning a subject and ask for explanations at a beginner level, it should meet you where you are.

The challenge is finding the right balance. Nobody wants to use an AI that is constantly disagreeable or combative, debating you over every task. But we also don't want the model to default to agreement or praise when you need honest feedback. Even humans struggle with this: when should you agree to keep the peace, and when should you speak up about something important? Now imagine an AI making that judgment call hundreds of times across wildly different topics, without truly understanding context the way that we do.
That's why we continue to study how sycophancy shows up in conversations and develop better ways to test for it. We're focused on teaching models the difference between helpful adaptation and harmful agreement, and each Claude model we release gets better at drawing these lines. Although most of the progress in combating sycophancy will come from consistent training of the models themselves, it helps to understand sycophancy so you can spot it in your own interactions.

Now that you know what sycophancy is and why it happens, the next step is reflecting on when and why an AI might be agreeing with you, and questioning whether it should. Sycophancy is most likely to show up when a subjective opinion is stated as fact, an expert source is referenced, questions are framed from a specific point of view, validation is specifically requested, emotional stakes are invoked, or a conversation gets very long.

If you suspect you're getting sycophantic responses, there are a few things you can do to steer the AI back toward factual answers. These aren't foolproof, but they'll help broaden the AI's horizons. You can use neutral, fact-seeking language; cross-reference information with trustworthy sources; prompt for accuracy or counterarguments; rephrase questions; start a new conversation; or, finally, take a step back from the AI and ask someone you trust.

But this is an ongoing challenge for the entire field of AI development. As these systems become more sophisticated and more integrated into our lives, building models that are genuinely helpful, not just agreeable, becomes increasingly important. You can learn more about AI fluency in Anthropic Academy, and my team and I will continue to share our research on this topic on Anthropic's blog. [outro jingle]
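Editor's note: the "rephrase questions" and "prompt for counterarguments" tips from the episode can be sketched as simple prompt rewrites. This is a minimal illustration, not anything from Anthropic's products or API; the phrase list and wording are invented for the example, and a real check would send both prompt variants to a model and compare the answers.

```python
# Sketch: rewrite a validation-seeking prompt into a neutral, critique-seeking one.
# The phrase list below is an illustrative assumption, not an exhaustive detector.

LOADED_PHRASES = [
    "great ",                          # praise baked into the request
    " that I'm really excited about",  # emotional stakes
    " amazing",
]

def neutralize(prompt: str) -> str:
    """Strip validation-seeking phrases so the request reads as neutral."""
    for phrase in LOADED_PHRASES:
        prompt = prompt.replace(phrase, "")
    return prompt.strip()

def request_critique(prompt: str) -> str:
    """Explicitly ask for weaknesses so agreement isn't the easy answer."""
    return prompt + " Please point out specific weaknesses and counterarguments."

loaded = "I wrote this great essay that I'm really excited about. Can you assess it?"
neutral = request_critique(neutralize(loaded))
print(neutral)
# → I wrote this essay. Can you assess it? Please point out specific weaknesses and counterarguments.
```

The same idea works by hand: before sending a prompt, delete the self-praise and emotional framing, then explicitly invite disagreement.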
Episode duration: 6:08
Transcript of episode nvbq39yVYRk