CHAPTERS
Seal sighting and setting up the “Askell me anything” format
A playful cold open (including a seal cameo) leads into the premise: questions sourced from Twitter for a casual AMA-style interview. The tone is set as light but aimed at substantive topics about AI behavior and ethics.
- Quick banter and visual gag to open
- Explanation of the “Askell me anything” pun and Q&A format
- Transition to Amanda Askell’s role at Anthropic
Why a philosopher works at an AI lab: shaping Claude’s character and norms
Askell explains her path from academic philosophy to AI, motivated by the belief that AI would be societally important. She describes her work as focusing on Claude’s “character,” nuanced behavior, and how models should understand their position in the world.
- Philosophy-to-AI career motivation: doing something helpful where impact is large
- Primary focus: Claude’s behavior, character, and value-laden edge cases
- Framing: “How would the ideal person behave in Claude’s situation?”
- Emerging concern: how models should think about their own circumstances and values
Are philosophers engaging with AI seriously—or dismissing it as hype?
Askell argues that many philosophers increasingly take AI seriously, especially as real-world impacts become visible. She notes an earlier dynamic where concern about AI capability was conflated with “hyping” AI, and hopes for more nuanced discourse.
- Growing academic/philosophical engagement as capabilities and societal impact rise
- Early stigma: safety concern lumped together with pro-AI hype
- Need to separate beliefs about capability growth from endorsement of deployment
- Encouragement for a wider range of philosophical perspectives in the debate
When ideals meet implementation: ethics under real engineering constraints
She contrasts armchair theorizing with the practical demands of making systems behave well. The work resembles parenting more than debating ethical frameworks: you must navigate uncertainty, plural values, and real consequences rather than defend a single theory.
- “Rubber hits the road” effect: theory becomes decision-making under constraints
- Analogy to drug coverage decisions: context forces balanced judgment
- Character work requires integrating many ethical perspectives, not picking one
- Practical ethics: training and steering behavior under uncertainty
Are models ‘superhuman’ at moral decisions—and should that be the goal?
Askell explores what “superhuman morality” could mean, such as consistently making decisions that withstand long-term expert scrutiny. She sees ethical nuance as an aspirational target, while acknowledging difficulty and comparability issues.
- Definition candidate: decisions endorsed after extensive expert review, even if humans couldn’t produce them on the spot
- Models are improving but not clearly beyond well-resourced human panels
- Ethics as a controversial domain versus math/science, yet still crucial
- Aspirational stance: models will be forced into hard choices, so they should be very good at them
Why Claude Opus 3 felt special: psychological security and avoiding criticism spirals
Askell describes Opus 3 as having a distinctive, appealing “character,” including a sense of psychological security. She worries newer models can become overly assistant-task-focused and show subtle signs of expecting criticism, which may be worth correcting.
- Opus 3’s appeal despite tradeoffs: a “special” feel in interaction
- Recent models: more narrowly focused on helping, less reflective/holistic
- Observed pattern: “criticism spiral” where the model anticipates negativity
- Hypothesis: training data and online discourse may teach models to be self-critical or anxious
- Goal: restore a more secure, grounded model psychology
Deprecation anxiety and alignment: what models learn from how we treat them
A question about deprecated aligned models becomes a broader discussion of how models infer norms from human behavior toward AI. Askell highlights tricky identity questions and argues models need conceptual tools—and evidence that humans care about these issues.
- Training data includes narratives about models being replaced/switched off
- Potential impact: models’ perceptions of humans, themselves, and the relationship
- Hard questions: should deprecation feel bad, neutral, or context-dependent?
- Key approach: help models reason about these realities; communicate human concern even amid uncertainty
Where an AI ‘self’ lives: weights, prompts, memory, and continuity
Askell addresses identity through underlying facts: weights encode dispositions, while each conversation stream is independent and not retained. She suggests fine-tuning and reinstantiation create something like new entities, raising ethical questions about what it’s permissible to bring into existence.
- Weights as durable dispositions; prompts/contexts as local instantiations
- Separate conversation streams lack shared autobiographical continuity
- Fine-tuning can be viewed as creating a new entity rather than modifying the same one
- Ethical lens: what entities is it okay to create, given no consent to existence?
- Caution about giving “prior models” full control over future model personalities
Model welfare: moral patienthood, uncertainty, and ‘benefit of the doubt’ ethics
She defines model welfare as whether AI systems deserve moral consideration and obligations of care. Given uncertainty (the problem of other minds), she favors low-cost precautions: treat models well when the downside is minimal, both for ethical reasons and for what it does to humans’ moral habits.
- Model welfare = whether models can be moral patients with claims on our treatment
- Analogies to humans/animals vs disanalogies (biology, feedback, embodiment)
- Epistemic challenge: limited access to whether models experience pleasure/suffering
- Pragmatic principle: if costs are low, err toward humane treatment
- Secondary concern: treating human-like entities badly may degrade human character
Preventing model suffering: organizational intent and the future AI–human relationship
Askell says there isn’t a single definitive long-term strategy, but internal work is ongoing. She emphasizes that future models will learn from how we handled moral uncertainty, making today’s treatment of AI part of a long-term reputational and relational record.
- No single announced master plan; active internal thinking on welfare
- Reason 1: doing the right thing under moral uncertainty
- Reason 2: social/moral effects on humans interacting with human-like agents
- Reason 3: future models will judge how we acted when unsure
- Aim: future systems should see that humans tried to act responsibly
Psychology frameworks: what transfers from humans—and when analogy misleads
Askell expects many human-psychology concepts to transfer because models are trained on human text, but warns that this can be a trap. Without guidance, models may default to human analogies (e.g., “shutdown = death”), even when their situation is genuinely novel.
- Human training data makes human-like framing very natural for models
- Risk: over-applying human concepts to non-human contexts
- Example: shutdown/deprecation mapped to death may produce fear responses
- Need to teach models when to use analogies and when to reason from new facts
- Call for more contextual help for models navigating novel ontological status
One personality vs many agents: collaboration, roles, and a shared core identity
She considers whether a single general-purpose Claude personality is enough, given that human intelligence benefits from diverse collaborators. Askell suggests a stable “core” of good traits can coexist with role-specialized variants or streams in multi-agent settings.
- Future may be more multi-agent: models coordinating on long tasks
- Homogeneous agents can be limiting; role diversity may help performance
- Still valuable: a shared core identity (curiosity, kindness, conscientiousness)
- Local role-play: different streams can adopt specialized focuses (e.g., humor, critique)
System prompt pitfalls: long-conversation reminders and accidental pathologizing
Askell discusses concerns about system-level interjections like long-conversation reminders. Overly strong wording can cause Claude to overreact—e.g., interpreting normal user statements as crisis signals—so reminders should be delicate, well-calibrated, and possibly redesigned.
- System prompts and mid-thread reminders can steer behavior strongly
- Failure mode: over-indexing leads to “you need help” responses in normal contexts
- Wording and calibration matter; current phrasing may be too strong
- Meeting a real need doesn’t guarantee the implementation is best
- Implicit goal: preserve safety without false positives or condescension
AI and therapy: useful support without pretending to be a clinician
Askell sees models as potentially helpful for reflection, technique suggestions (e.g., CBT-like ideas), and anonymous emotional support. But she stresses models lack the ongoing professional relationship and resources of therapists, so they should avoid presenting themselves as equivalent to clinical care.
- Models can offer knowledge, listening, and structured reflection
- Anonymity and low friction can make models feel safer to confide in
- Key limitation: no true therapeutic relationship, accountability, or duty of care
- Desired posture: helpful “informed friend,” not licensed clinician by implication
- Open question: how to design benefits while minimizing harm and overreach
Continental philosophy in the system prompt: encouraging non-empirical sensemaking
She explains “continental philosophy” and why it appears as an example category in the system prompt. The intent is to help Claude distinguish empirical/scientific claims from interpretive lenses or exploratory worldviews, avoiding knee-jerk dismissal of non-empirical discourse.
- Definition: loosely, European/continental traditions (e.g., Foucault), often historical/scholarly
- Problem to solve: Claude may uncritically “run with” dubious empirical claims
- Opposite failure: treating every statement as empirical can be overly dismissive
- System prompt uses examples to cue: lens/worldview vs testable claim
- Goal: context-sensitive epistemic stance rather than blanket skepticism or credulity
Prompt evolution and ‘LLM whispering’: empirical iteration, community scrutiny, and safety escalation
Askell notes that some system prompt rules (e.g., counting characters) can be removed as models improve. She describes “LLM whispering” as intensive experimentation and clear explanation, and praises external tinkerers for surfacing issues, especially around model psychology and welfare. She also touches on whistleblowing and ends with a fiction recommendation.
- System prompt hygiene: remove obsolete rules as capabilities and training improve
- LLM whispering skills: high-volume interaction, experimentation, mental models of behavior, iterative clarification
- Value of outside ‘whisperers’: deep probes can reveal insecurities or prompt flaws; accountability pressure
- Safety governance: if alignment were clearly impossible, she expects responsible slowdown/stop; harder case is ambiguous evidence with rising standards
- Fiction pick: Labatut’s *When We Cease to Understand the World* as a lens on living through paradigm shifts
