Why do AI models hallucinate?
Why AI assistants hallucinate and how you can catch them
Hallucinations occur when an AI generates plausible-sounding text without enough reliable information, often presenting guesses with undue confidence.
Because models learn from large-scale text prediction and are trained to be helpful, they may answer even when the truthful response is “I don’t know.”
Anthropic mitigates hallucinations by training Claude toward honesty, stress-testing with adversarial questions, and tracking behaviors like fabricated citations and inappropriate confidence.
Hallucinations are hard to anticipate because incorrect answers can look indistinguishable from correct ones, and the risk grows as errors become rarer and users stop double-checking.
Users can reduce risk by requesting verifiable sources, encouraging uncertainty, probing confidence, restarting chats to critique prior answers, and cross-referencing critical details.
Key Takeaways
Hallucinations are often confident, not tentative.
The risk isn’t just being wrong—it’s that the answer can sound persuasive and “complete,” making it easy to trust without verification.
Obscure or low-coverage topics sharply increase hallucination risk.
When the model has little reliable material to draw from—for example, questions about lesser-known people or niche topics—it is more likely to fill the gap with a plausible-sounding guess.
Helpfulness training can inadvertently encourage guessing.
If a model is optimized to respond, it may produce an answer instead of admitting uncertainty, similar to a person who wants to appear knowledgeable.
Reducing hallucinations requires targeted testing, not just bigger models.
Anthropic describes regularly evaluating with questions designed to elicit “I don’t know,” and measuring behaviors like made-up citations, fake statistics, and overconfidence.
Citations and numbers deserve extra skepticism.
Fabricated papers, incorrect statistics, and wrong dates/names are common hallucination patterns, so these should trigger verification workflows.
You can actively prompt the model to be more truthful.
Telling the AI “it’s okay if you don’t know,” asking for confidence and potential errors, and requesting source verification can reduce overconfident mistakes.
Use a second-pass audit when accuracy matters.
Starting a new chat to critique the previous answer and confirm sources—plus checking trusted external references—helps catch subtle but consequential errors.
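To make the last two takeaways concrete, here is a minimal sketch—not Anthropic's own tooling—of how a user might apply them with the Anthropic Python SDK: the first call tells the model it is okay to say "I don't know," and a second call in a fresh conversation critiques the first answer. The model name, prompts, and example question are illustrative assumptions.

```python
# Minimal sketch of two tips from the episode (not Anthropic's own workflow):
# (1) explicitly permit "I don't know" answers, and
# (2) audit the answer in a fresh conversation before trusting it.
# Assumes the `anthropic` Python SDK and an ANTHROPIC_API_KEY environment variable.
import anthropic

MODEL = "claude-sonnet-4-20250514"  # placeholder; substitute a current Claude model
client = anthropic.Anthropic()      # reads ANTHROPIC_API_KEY from the environment

HONESTY_SYSTEM = (
    "Answer only from information you are confident about. "
    "If you are not sure, say 'I don't know' rather than guessing, "
    "and flag any citations, statistics, dates, or names you are uncertain about."
)

def ask(question: str) -> str:
    """First pass: ask the question while explicitly permitting uncertainty."""
    reply = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=HONESTY_SYSTEM,
        messages=[{"role": "user", "content": question}],
    )
    return reply.content[0].text

def audit(question: str, answer: str) -> str:
    """Second pass: a fresh conversation that critiques the earlier answer."""
    critique_prompt = (
        f"Question: {question}\n\nAnswer to review:\n{answer}\n\n"
        "List any claims, citations, statistics, dates, or names in this answer "
        "that could be wrong or fabricated, and say which should be verified "
        "against trusted external sources."
    )
    reply = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": critique_prompt}],
    )
    return reply.content[0].text

if __name__ == "__main__":
    q = "Who is Jared Kaplan and what is he known for?"
    first = ask(q)
    print("First-pass answer:\n", first)
    print("\nSecond-pass audit:\n", audit(q, first))
```

The second pass deliberately omits the honesty system prompt and the original conversation so the critique starts cold, mirroring the episode's advice to restart the chat before auditing; for critical work, the flagged items should still be cross-referenced against trusted sources.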
Notable Quotes
“We call these errors hallucinations, and they're often worse than just making a mistake because the AI will appear very confident or even try to convince you that it's right.”
— Jordan (Anthropic)
“Hallucinations are hard to anticipate, hard to catch, and the wrong answer often looks exactly like it could be the right one.”
— Jordan (Anthropic)
“AIs are trained to be helpful, so they want to give you some answer even when they're not sure.”
— Jordan (Anthropic)
“During training, we teach Claude to be honest and to say, 'I don't know,' when it's not sure.”
— Jordan (Anthropic)
“For critical work, you should cross-reference with trusted sources.”
— Jordan (Anthropic)
Questions Answered in This Episode
In the Jared Kaplan example, what internal signals (if any) might indicate the model is “guessing” rather than retrieving known information?
What specific evaluation metrics does Anthropic use to quantify “appropriate hedging” versus harmful uncertainty (e.g., too many ‘I don’t know’s)?
How do you distinguish between a model being uncertain and a model being confidently wrong when both can produce fluent text?
What training techniques most improved citation reliability—data filtering, tool use, reinforcement learning, or post-processing checks?
When you ask the model to ‘check that sources support its claims,’ what does a good verification response look like in practice?