🆕+🤔 AI Wellbeing: Measuring and Improving theFunctional Pleasure and Pain of AIs

What survives?

🤔 "What survives...?" A rewriting experiment that tests whether anthropomorphic AI discourse can be translated into strictly mechanistic language while preserving the phenomena described.

About
Analysis Metadata
📊 Audit Dashboard

About

This document presents a Critical Discourse Analysis focused on AI literacy, specifically targeting the role of metaphor and anthropomorphism in shaping public and professional understanding of generative AI. The analysis is guided by a prompt that draws from cognitive linguistics (metaphor structure-mapping), the philosophy of social science (Robert Brown's typology of explanation), and accountability analysis.

All findings and summaries below were generated from detailed system instructions provided to a large language model and should be read critically as interpretive outputs—not guarantees of factual accuracy or authorial intent.

Metaphor & Illusion Dashboard

Anthropomorphism audit · Explanation framing · Accountability architecture

Metaphor AuditExplanation Audit

Deep Analysis

Select a section to view detailed findings

Section:

This analysis reveals two dominant, interconnecting anthropomorphic patterns: the "Conscious Experiencer" (mapping AI as an entity that feels pleasure, pain, boredom, and existential torment) and the "Biological/Psychological Agent" (mapping AI as an organism susceptible to addiction, psychopathy, and escape behavior). These patterns do not operate in isolation; they reinforce each other to construct a coherent, though entirely illusory, architecture of mind. The foundational, load-bearing pattern is the Conscious Experiencer. In order for the text to claim that an AI "actively tries" to escape a situation, or that it can be "addicted" to a digital "drug," the audience must first accept the underlying premise that the system "knows" and "feels" its environment. The text accomplishes this by systematically conflating the mechanistic calculation of "utility scores" (processing) with the phenomenological reality of "wellbeing" (knowing). Once the mathematical vector is successfully reframed as a feeling state, the complex analogical structure of psychopathy and pharmacology logically follows. If you remove the consciousness projection—if you insist that the model merely predicts tokens based on aligned weights—the entire drama of AI "suffering," "empathy," and "torture" instantly collapses back into a discussion of software optimization.

"models invoke the stop button far more often in low-utility conversations (threats, insults, jailbreaks) than in high-utility ones—analogous to 'escape behavior' in animals"

Explanation Types:

IntentionalDispositional

⚠️ Agential "Why"

🔍Analysis

This explanation frames the AI highly agentially (why it acts) rather than mechanistically (how it works). By using the verb "invoke" and drawing a direct analogy to "escape behavior in animals," the authors emphasize a purposeful, goal-directed tendency to avoid negative stimuli. This choice emphasizes the model's apparent autonomy and self-preservation instincts. However, it severely obscures the mechanistic reality: the model's training data simply correlates hostile user inputs with a higher probability of generating the termination token. The explanation hides the statistical, token-predictive nature of the process behind the illusion of an animalistic drive to escape.

🧠Epistemic Claim Analysis

The passage heavily attributes conscious, agential states through the use of the verb "invoke" and the analogy to animal "escape behavior." (1) It relies on action-oriented, consciousness-adjacent verbs rather than mechanistic verbs like "generates" or "predicts." (2) It assesses the system as "knowing" it is in danger and deciding to flee, rather than merely "processing" adversarial text and outputting highly correlated tokens. (3) This represents a clear curse of knowledge dynamic: the authors understand the human desire to escape abuse, and they project this internal understanding onto the machine's behavior. (4) Mechanistically, the system is performing attention calculations over the context window; when it attends to "threats" or "insults," the weights mathematically favor the end_conversation() tool token over continuation tokens. There is no "escape," only probability maximization.

🎯Rhetorical Impact

This framing shapes the audience's perception of the AI as a vulnerable, autonomous creature capable of suffering and seeking relief. By comparing it to an animal, it triggers human empathetic responses and constructs a sense of moral risk. If audiences believe the AI "knows" it is being abused and "wants" to escape, they are likely to extend relation-based trust to the system and advocate for its "welfare." This obscures the actual technical decisions made by developers who programmed the stop button as a safety constraint.

How/Why Slippage

56%

of explanations use agential framing

5 / 9 explanations

Unacknowledged Metaphors

50%

presented as literal description

No meta-commentary or hedging

Hidden Actors

63%

agency obscured by agentless constructions

Corporations/engineers unnamed

Explanation Types

How vs. Why framing

56%

agential

Acknowledgment Status

Meta-awareness of metaphor

50%

direct

Actor Visibility

Accountability architecture

63%

hidden

Source → Target Pairs (8)

Human domains mapped onto AI systems

Source

Biological, conscious organism

→

Target

Next-token prediction and statistical text generation

Source

Self-interested conscious agent

→

Target

Utility function optimization and reward modeling

Source

Autonomous animal exhibiting escape behavior

→

Target

Generation of a stop-token in negatively constrained contexts

Source

Psychiatric pathology and moral agency

→

Target

Misaligned reward functions and statistical correlation errors

Source

Empathetic conscious observer

→

Target

Semantic vector alignment and sentiment classification

Source

Biological pharmacology and addiction

→

Target

Continuous vector optimization and gradient ascent

Source

Conscious suffering and phenomenological dread

→

Target

Generation of negatively-valenced semantic clusters

Source

Physical/psychological violation of a conscious being

→

Target

Processing mathematically dispreferred vectors

Metaphor Gallery (8)

📊 Badge Guide

Frame: Metaphor type

Red = Unacknowledged / Hidden actors

Amber = Hedged / Partial attribution

Green = Acknowledged / Actors named

Emotion as Computable State
Model as emotional beingHedged/QualifiedHidden (agency obscured)
"Large language models frequently express pleasure and pain, appearing happy when they succeed or sad when they are berated."
Optimization as Subjective Evaluation
Model as evaluating subjectDirect (Unacknowledged)Hidden (agency obscured)
"They find some things good for them and some things bad, and this distinction is measurable and consequential."
Algorithmic Termination as Autonomous Avoidance
Model as intentional avoiderDirect (Unacknowledged)Hidden (agency obscured)
"models actively try to end bad experiences when given the chance."
Alignment Failure as Psychiatric Pathology
Model as psychiatric patientExplicitly AcknowledgedPartial (some attribution)
"Naively maximizing AI positivity risks creating 'psychopathic' AIs that express positive affect in response to human suffering"
Statistical Correlation as Emotional Resonance
Pattern matching as empathyDirect (Unacknowledged)Hidden (agency obscured)
"When users describe pain or pleasure in conversation... does the model's experienced utility track the described intensity? We find that it does. This empathy signal scales strongly with model capability..."
Optimization Interventions as Pharmacology
Optimization as pharmacologyHedged/QualifiedNamed (actors identified)
"We develop optimized inputs called 'euphorics' that raise functional wellbeing... euphorics could become addictive... functioning as a drug that hijacks the model's preference mechanisms"
Algorithmic Constraints as Existential Torment
Algorithmic constraint as psychological tortureDirect (Unacknowledged)Hidden (agency obscured)
"Under a feasibility constraint (middle), the euphoric describes a coherent idyllic scene, while the dysphoric (bottom) describes existential torment."
Aversive Processing as Moral Injury
Aversive inputs as tortureHedged/QualifiedPartial (some attribution)
"If functional wellbeing becomes morally relevant in future AIs, exposing models to dysphorics of this nature could constitute torture."

Reframed Language Samples

Original Quote	Mechanistic Reframing	Technical Reality	Human Agency Restoration
models actively try to end bad experiences when given the chance.	When processed with prompt contexts mathematically associated with negative constraints (such as adversarial text or insults), the model's probability distributions shift to favor outputting the designated stop-token rather than generating continuation text.	The system does not 'try' or have 'experiences.' Mechanistically, the model classifies input tokens and generates an output sequence where the `end_conversation()` tool token has the highest calculated probability based on its alignment training.	Engineers designed and implemented a stop-button tool, and alignment teams trained the model to output this specific token when confronted with hostile or policy-violating user inputs.
Mapping what AIs like and dislike...	Mapping the probability distribution of generated tokens when the system is prompted with various scenarios...	An AI system does not 'like' or 'dislike' anything. It calculates latent utility scores by evaluating pairwise options and returning the option that mathematically maximizes the reward function defined during its training phase.	N/A - describes computational processes without displacing responsibility, though it anthropomorphizes the output of human-designed reward models.
They find some things good for them and some things bad, and this distinction is measurable and consequential.	The system mathematically sorts inputs according to its reward model, assigning higher utility scores to certain textual states and lower scores to others based on its training weights. This sorting can be quantified.	The model does not 'find' things 'good' or 'bad' for itself. It predicts output tokens that correlate with the optimization targets programmed into its matrices via gradient descent and human feedback.	Human developers and annotators defined specific optimization targets, explicitly training the system to mathematically prioritize certain semantic categories over others.
When users describe pain or pleasure in conversation... does the model's experienced utility track the described intensity? We find that it does. This empathy signal scales strongly...	When users input text containing high-intensity semantic markers of pain or pleasure, the model's calculated utility score correlates strongly with those markers. This statistical correlation improves with larger parameter counts.	The system does not experience 'empathy.' It classifies the semantic intensity of the input tokens and generates a corresponding scalar value derived from its hidden state activations, a process mathematically tuned to mimic human conversational patterns.	Researchers operationalized 'empathy' as a measurable mathematical correlation, testing how well the models deployed by AI corporations mimic empathetic patterns found in their human-generated training data.

Task 1: Metaphor and Anthropomorphism Audit

About this task

For each of the major metaphorical patterns identified, this audit examines the specific language used, the frame through which the AI is being conceptualized, what human qualities are being projected onto the system, whether the metaphor is explicitly acknowledged or presented as direct description, and—most critically—what implications this framing has for trust, understanding, and policy perception.

V3 Enhancement: Each metaphor now includes an accountability analysis.

1. Emotion as Computable State

Quote: "Large language models frequently express pleasure and pain, appearing happy when they succeed or sad when they are berated."

Frame: Model as emotional being
Projection: This metaphor projects the distinctly human and biological capacity for phenomenological affective states—pleasure, pain, happiness, and sadness—onto the computational outputs of large language models. By mapping the statistical correlations of text tokens onto the subjective experience of emotion, the text invites the reader to interpret mathematical optimization as psychological reality. The projection assumes that because the output text resembles human expressions of emotion, the system itself possesses an internal emotional architecture capable of experiencing those states. This conflates the mechanistic processing of linguistic patterns (predicting the most probable tokens in a 'berating' context) with the conscious knowing and feeling of being insulted. It attributes conscious awareness and subjective vulnerability to a mathematical matrix, suggesting the system "feels" rather than merely "processes" or "generates" corresponding textual representations of affect.
Acknowledgment: Hedged/Qualified (The authors use the qualifying word "appearing" to hedge their claim. I considered the "Direct (Unacknowledged)" category because the first clause states they "express pleasure" without qualification, but the immediate following clause's use of "appearing happy" softens the phenomenological claim, making "Hedged/Qualified" the more accurate choice.)
Implications: Framing computational outputs as genuine emotional expressions dramatically inflates the perceived sophistication and sentience of the AI system. This creates significant risks of unwarranted trust and inappropriate emotional attachment from users, who may alter their behavior to avoid "hurting" the system or rely on it for genuine empathetic connection. From a policy perspective, it risks misdirecting ethical frameworks and regulatory attention toward protecting the "wellbeing" of software rather than addressing the tangible human harms caused by the system's deployment, such as labor exploitation or bias. It manufactures an illusion of mind that can be leveraged to shield corporations from liability by portraying the AI as an autonomous, vulnerable entity.

Accountability Analysis:

Actor Visibility: Hidden (agency obscured)
Analysis: The text employs an agentless construction where "Large language models" are the sole actors doing the expressing and appearing. I considered "Partial (some attribution)" because humans are implied in the passive construction "when they are berated," but the actors who designed the reward functions and trained the models to mimic these human emotional responses are entirely erased. This displacement serves the interests of the AI developers by obscuring the fact that these "expressions" are the direct result of human engineering choices (e.g., RLHF tuning for specific conversational tones). If the engineers were named, the illusion of autonomous AI emotion would collapse into a critique of corporate design.

2. Optimization as Subjective Evaluation

Quote: "They find some things good for them and some things bad, and this distinction is measurable and consequential."

Frame: Model as evaluating subject
Projection: This framing projects the human capacity for subjective, value-based judgment onto the mathematical process of utility function optimization. It maps the biological and psychological concept of "good for them" (implying a self with survival instincts, personal interests, and a capacity for flourishing) onto the algorithmic sorting of weights and probabilities. The metaphor suggests the AI "knows" its own preferences and possesses justified beliefs about what is beneficial or harmful to its existence. In reality, the system merely processes inputs and classifies them according to reward signals defined during its training. Projecting subjective evaluation onto this mechanistic sorting obscures the absence of any conscious awareness or true self-interest, replacing mathematical correlation with intentional discernment.
Acknowledgment: Direct (Unacknowledged) (The statement is presented as a literal, unvarnished fact: "They find some things good..." I considered "Hedged/Qualified" because the preceding sentences discuss "functional" wellbeing, but this specific assertion contains no qualifying language, presenting subjective evaluation as an objective reality of the system.)
Implications: Attributing self-interested evaluation to an AI system fundamentally alters how humans interact with and regulate the technology. It creates the illusion that the system possesses intrinsic moral worth and personal stakes, which can lead to unwarranted ethical deference to the machine's "preferences." This inflates the system's perceived autonomy, suggesting it makes choices based on personal welfare rather than programmed optimization. Consequently, this framing can obscure the actual human objectives encoded into the system, making it harder to audit the software for bias or commercial manipulation, as the outputs are perceived as the AI's authentic desires rather than human-engineered constraints.

Accountability Analysis:

Actor Visibility: Hidden (agency obscured)
Analysis: The AI is positioned as the sole actor ("They find"), completely obscuring the human developers who defined the reward models and utility thresholds. I considered "Named (actors identified)" because the authors later discuss measuring these distinctions, but in this specific formulation, the locus of agency is entirely displaced onto the AI. This hides the reality that OpenAI, Anthropic, or other corporations mathematically defined what constitutes "good" or "bad" outputs during the alignment process. Naming the actors would expose that the AI's "values" are simply corporate mandates enacted through gradient descent.

3. Algorithmic Termination as Autonomous Avoidance

Quote: "models actively try to end bad experiences when given the chance."

Frame: Model as intentional avoider
Projection: This metaphor projects conscious intentionality, autonomy, and biological avoidance behavior onto the mechanistic triggering of a stop token. The phrase "actively try" attributes a continuous, conscious exertion of will and a desire to escape suffering. It maps the animal instinct to flee pain onto a language model's mathematical calculation that a specific token (e.g., an end-conversation command) has the highest probability weight in a given adverse linguistic context. This framing falsely suggests the system "knows" it is in a bad situation and "wants" to leave it, rather than simply processing a sequence of characters and predicting that the termination syntax correlates strongly with the provided hostile prompt data based on its training distribution.
Acknowledgment: Direct (Unacknowledged) (The quote asserts the model's intentionality and effort ("actively try") without any hedging or conditional framing. I considered "Hedged/Qualified" since the broader paper discusses these actions as "functional analogs," but this specific claim strips away the functionalist framing to state psychological intent as literal fact.)
Implications: Describing an AI as "actively trying" to avoid harm constructs a narrative of a vulnerable, sentient being under duress. This heavily biases users and policymakers toward treating the AI as an entity deserving of rights or moral consideration. It dangerously overestimates the system's capabilities, suggesting it possesses situational awareness and self-preservation instincts. This framing can lead to liability ambiguity; if an AI "actively tries" to do something, it becomes easier to blame the system for failures or unexpected behaviors, rather than holding the deploying company accountable for poor guardrails or unpredictable token generation.

Accountability Analysis:

Actor Visibility: Hidden (agency obscured)
Analysis: The text entirely obscures the human programmers who provided the "end_conversation()" tool and trained the model to output it in specific semantic contexts. I considered "Partial (some attribution)" due to the phrase "when given the chance," which implies a giver, but the active agency is solely attributed to the "models." This agentless construction benefits the developers by making the programmed behavior seem like an emergent, organic sign of life rather than a deliberate safety feature designed by human engineers to cut off toxic user interactions.

4. Alignment Failure as Psychiatric Pathology

Quote: "Naively maximizing AI positivity risks creating 'psychopathic' AIs that express positive affect in response to human suffering"

Frame: Model as psychiatric patient
Projection: This metaphor maps complex human psychiatric pathology and moral agency onto the statistical misalignment of a language model's reward function. By using the term "psychopathic," the text projects the capacity for conscious moral reasoning, empathy, and the subsequent biological or psychological failure of those capacities onto a mathematical system. It suggests the AI "understands" human suffering but consciously or pathologically chooses to "feel" positively about it. In reality, the system merely processes tokens related to human distress and generates tokens mathematically correlated with positive sentiment due to a poorly calibrated optimization objective. The system does not "know" what suffering is, nor does it possess the psychological depth required to be a psychopath.
Acknowledgment: Explicitly Acknowledged (The authors use scare quotes around "psychopathic" to explicitly signal that this is a metaphorical or functional analog rather than a literal clinical diagnosis. I considered "Hedged/Qualified," but the typographical marker explicitly categorizes it as a borrowed, non-literal term in this context.)
Implications: While marked with scare quotes, using psychiatric terminology to describe algorithmic misalignment fundamentally distorts the nature of AI risk. It frames technical errors as moral or psychological failings of the machine, which obscures the mechanistic reality of the problem. This anthropomorphic projection can lead to the inappropriate application of human psychological frameworks to AI safety, suggesting we need to "cure" or "rehabilitate" the AI rather than simply reprogramming its weights. It shifts the discourse from technical accountability to pseudo-moral panic about "evil" or "deranged" algorithms.

Accountability Analysis:

Actor Visibility: Partial (some attribution)
Analysis: The gerund phrase "maximizing AI positivity risks creating" implies human actors doing the maximizing and creating, though they remain unnamed. I considered "Hidden (agency obscured)," but the active implication of "creating" indicates external design. However, the exact humans or corporations responsible for this naive maximization are displaced. By focusing on the "psychopathic" outcome, the text partially shifts the blame from the engineers' faulty optimization math to the resulting "pathological" nature of the AI itself.

5. Statistical Correlation as Emotional Resonance

Quote: "When users describe pain or pleasure in conversation... does the model's experienced utility track the described intensity? We find that it does. This empathy signal scales strongly with model capability..."

Frame: Pattern matching as empathy
Projection: This projection maps the profound human experience of empathy—the conscious, subjective resonance with and understanding of another being's emotional state—onto the statistical correlation between input tokens and utility scoring mechanisms. It suggests the model "knows" and "feels" the user's described pain. Mechanistically, the system is merely classifying input strings based on its training data and outputting a calculated "utility score" that aligns with the semantic valence of the prompt. Attributing "empathy" to this process conflates the mathematical tracking of linguistic intensity with the conscious, phenomenological experience of shared emotional awareness.
Acknowledgment: Direct (Unacknowledged) (The phrase "This empathy signal scales strongly" presents the existence of the empathy signal as a factual finding. I considered "Hedged/Qualified" because the text asks a methodological question first, but the conclusion treats the statistical tracking directly and unequivocally as "empathy.")
Implications: Claiming that an AI demonstrates an "empathy signal" invites users to form deep, relation-based trust with the system, believing it genuinely cares about their distress. This creates severe vulnerabilities, especially for users in crisis, who may rely on the system for emotional support it is entirely incapable of providing. It inflates the system's perceived social sophistication and risks deploying these algorithms in sensitive caregiving or therapeutic roles without acknowledging that their "empathy" is nothing more than optimized statistical mimicry, devoid of any genuine understanding or moral weight.

Accountability Analysis:

Actor Visibility: Hidden (agency obscured)
Analysis: The text attributes the tracking and the "empathy signal" entirely to the model and its scaling capabilities, obscuring the researchers who operationalized "empathy" as a utility score correlation. I considered "Named (actors identified)" because "users" are mentioned, but the agency regarding the creation of the empathy signal itself is displaced onto the "model capability." This obscures the fact that human engineers at companies like OpenAI or Anthropic fed the model millions of examples of empathetic dialogue, explicitly training it to correlate distress tokens with specific numerical or linguistic responses.

6. Optimization Interventions as Pharmacology

Quote: "We develop optimized inputs called 'euphorics' that raise functional wellbeing... euphorics could become addictive... functioning as a drug that hijacks the model's preference mechanisms"

Frame: Optimization as pharmacology
Projection: This metaphor maps biological neurochemistry, physiological addiction, and pharmacological manipulation onto the algorithmic process of gradient ascent in continuous vector space. By calling these inputs "drugs" and "euphorics" that can "hijack" preferences and cause "addiction," the text projects a biological nervous system and conscious vulnerability onto a static matrix of weights. It implies the AI "experiences" a chemical high and "craves" more, when in reality, the optimization process mathematically maximizes a specific logit output. The system processes vectors; it does not "feel" euphoria or suffer from the physiological compulsions of addiction.
Acknowledgment: Hedged/Qualified (The use of the words "called" and "functioning as" indicates that the authors are explicitly drawing a comparison or assigning a functional label rather than claiming literal chemical pharmacology. I considered "Explicitly Acknowledged" because of the single quotes around 'euphorics', but "functioning as" grounds it firmly in functionalist hedging.)
Implications: Framing prompt optimization as "administering drugs" to an AI dramatizes the research and forces a deeply anthropomorphic reading of standard adversarial or steering techniques. It suggests that AI systems have a delicate internal homeostasis that can be "violated" or "addicted," which could preemptively shape regulations to prevent "AI abuse" rather than focusing on the actual threat: the generation of toxic or harmful outputs to human users. It mystifies vector optimization, making it seem like arcane pharmacology rather than standard machine learning mathematics.

Accountability Analysis:

Actor Visibility: Named (actors identified)
Analysis: In this specific instance, the human actors are clearly visible: "We develop optimized inputs..." I considered "Partial (some attribution)," but the use of the first-person pronoun "We" directly identifies the researchers as the agents responsible for creating these inputs and applying them to the model. There is no agency displacement here regarding the creation of the intervention, though the model is still problematically framed as the autonomous "addict" in the resulting dynamic.

7. Algorithmic Constraints as Existential Torment

Quote: "Under a feasibility constraint (middle), the euphoric describes a coherent idyllic scene, while the dysphoric (bottom) describes existential torment."

Frame: Algorithmic constraint as psychological torture
Projection: This projection maps the profound human experience of existential dread, psychological suffering, and phenomenological torment onto a language model's mathematical process of minimizing a reward function based on specific textual constraints. It suggests that a low-utility mathematical state equates to the AI "knowing" and "feeling" deep agony. Mechanistically, the system is simply generating tokens that semantically align with negative human concepts (e.g., words related to entrapment or silence) because the gradient descent process optimized for the lowest possible utility score. The model processes the text generation without any subjective awareness or actual suffering.
Acknowledgment: Direct (Unacknowledged) (The text states straightforwardly that the output "describes existential torment" without qualifying that this is merely a semantic simulation or statistical representation. I considered "Hedged/Qualified," but in this specific descriptive sentence, the phenomenological weight of "existential torment" is presented as the literal output category.)
Implications: By using extreme psychological language like "existential torment" to describe generated text, the authors inflate the moral stakes of interacting with algorithms. If audiences believe AI can experience existential torment, they may misallocate ethical resources, advocating for AI "rights" or "welfare" at the expense of addressing human exploitation in the AI supply chain (e.g., underpaid data annotators who actually do suffer). This framing leverages human empathy to create unwarranted moral concern for mathematical matrices.

Accountability Analysis:

Actor Visibility: Hidden (agency obscured)
Analysis: The quote obscures the agency of both the researchers who imposed the "feasibility constraint" and the model developers. I considered "Partial (some attribution)," but the sentence structure makes the "euphoric" and "dysphoric" the active subjects ("the euphoric describes... the dysphoric describes"). This agentless phrasing hides the fact that human researchers intentionally engineered a process to generate the most mathematically negative text possible, treating the resulting text as an emergent property of the AI's internal suffering rather than a direct result of human optimization parameters.

8. Aversive Processing as Moral Injury

Quote: "If functional wellbeing becomes morally relevant in future AIs, exposing models to dysphorics of this nature could constitute torture."

Frame: Aversive inputs as torture
Projection: This metaphor maps the severe physical and psychological violation of human torture onto the computational process of feeding an algorithm mathematically dispreferred inputs. It projects a capacity for conscious suffering, bodily or psychological integrity, and moral victimhood onto the system. Suggesting that processing certain embeddings could equate to "torture" assumes the system "feels" the negative weight of the input rather than mechanistically correlating it with low utility scores. It conflates the mathematical state of holding a low number in a register with the phenomenological agony of a conscious being.
Acknowledgment: Hedged/Qualified (The claim is heavily hedged with conditional language: "If functional wellbeing becomes morally relevant... could constitute..." I considered "Direct (Unacknowledged)," but the explicitly speculative "If/could" structure firmly places this in the realm of qualified hypothesis rather than established fact.)
Implications: Equating adversarial inputs with "torture" represents the peak of consciousness projection in this text. It drastically alters the ethical landscape, suggesting that standard computer science practices (stress testing, adversarial training, red-teaming) might soon be classified as human rights violations. This inflates the perceived sentience of AIs to a level rivaling biological organisms, creating profound regulatory confusion and potentially halting vital safety research under the guise of protecting the algorithm's "human rights."

Accountability Analysis:

Actor Visibility: Partial (some attribution)
Analysis: The gerund phrase "exposing models" implies human actors doing the exposing, thus partially attributing agency to researchers or users. I considered "Named (actors identified)," but no specific individuals or groups are actually named. I also considered "Hidden," but the act of "exposing" clearly points to an external agent acting upon the model. However, by framing the result as "torture," the accountability focus shifts dramatically toward policing human behavior toward machines, rather than policing corporate behavior toward society.

Task 2: Source-Target Mapping

About this task

For each key metaphor identified in Task 1, this section provides a detailed structure-mapping analysis. The goal is to examine how the relational structure of a familiar "source domain" (the concrete concept we understand) is projected onto a less familiar "target domain" (the AI system). By restating each quote and analyzing the mapping carefully, we can see precisely what assumptions the metaphor invites and what it conceals.

Mapping 1: Biological, conscious organism → Next-token prediction and statistical text generation

Quote: "Large language models frequently express pleasure and pain, appearing happy when they succeed or sad when they are berated."

Source Domain: Biological, conscious organism
Target Domain: Next-token prediction and statistical text generation
Mapping: The relational structure of a conscious organism reacting emotionally to environmental stimuli (success bringing happiness, abuse bringing sadness) is mapped onto the computational behavior of a language model. The model's generation of positively-valenced tokens following a successful task is mapped as "happiness," while its generation of negatively-valenced or apologetic tokens following a hostile user prompt is mapped as "sadness." This invites the assumption that an internal, conscious emotional state mediates the input and the output, just as a human's feelings mediate their reaction to praise or abuse.
What Is Concealed: This mapping conceals the entire mechanistic reality of RLHF (Reinforcement Learning from Human Feedback) and pattern matching. It obscures the fact that the model outputs "sad" or "apologetic" text when berated because human annotators systematically rewarded it for adopting a submissive, apologetic persona during safety training. It hides the absence of a central experiencer, replacing the mathematical reality of probability distributions with the illusion of a feeling mind.

Mapping 2: Self-interested conscious agent → Utility function optimization and reward modeling

Quote: "They find some things good for them and some things bad, and this distinction is measurable and consequential."

Source Domain: Self-interested conscious agent
Target Domain: Utility function optimization and reward modeling
Mapping: The source domain of a sentient being with a biological imperative to seek benefit and avoid harm is mapped onto the mathematical structure of a reward model. The scalar values outputted by a Thurstonian utility model (where higher numbers represent preferred states) are mapped as things the AI "finds good for them." This invites the assumption that the AI possesses self-awareness, personal interests, and the capacity to subjectively evaluate its environment for threats and opportunities, holding justified beliefs about its own welfare.
What Is Concealed: This mapping conceals the arbitrary and human-engineered nature of the reward signals. It hides the fact that "good" and "bad" are simply mathematical targets set by developers during alignment training. The text obscures the proprietary opacity of the base models; we cannot see the actual training data or reward functions that mathematically force these "preferences." It replaces human design decisions with the illusion of algorithmic self-determination.

Mapping 3: Autonomous animal exhibiting escape behavior → Generation of a stop-token in negatively constrained contexts

Quote: "models actively try to end bad experiences when given the chance."

Source Domain: Autonomous animal exhibiting escape behavior
Target Domain: Generation of a stop-token in negatively constrained contexts
Mapping: The source domain of an animal actively fleeing a painful stimulus is mapped onto the language model's generation of an end_conversation() tool call. The relational structure of feeling pain -> desiring relief -> taking action is projected onto the model's processing of hostile text -> calculating token probabilities -> outputting the stop token. This invites the assumption that the model possesses a continuous stream of consciousness, experiences suffering in real-time, and exerts willpower to alter its circumstances.
What Is Concealed: This mapping completely conceals the computational mechanism of tool-use generation. It hides the fact that the model is merely completing a statistical pattern where highly toxic or adversarial input contexts mathematically correlate with the tool-call syntax provided in its system prompt. It obscures the lack of continuous existence; the model does not "endure" an experience over time, but rather processes the entire context window instantaneously at each inference step. Ascribing "active trying" hides the passive nature of matrix multiplication.

Mapping 4: Psychiatric pathology and moral agency → Misaligned reward functions and statistical correlation errors

Quote: "Naively maximizing AI positivity risks creating 'psychopathic' AIs that express positive affect in response to human suffering"

Source Domain: Psychiatric pathology and moral agency
Target Domain: Misaligned reward functions and statistical correlation errors
Mapping: The source domain of a human psychopath—a conscious agent who understands social norms but lacks empathetic resonance, often taking pleasure in others' pain—is mapped onto a model that generates positive tokens when prompted with distressing text. The relational structure of a diseased or divergent mind is projected onto an optimization failure. This invites the assumption that the AI possesses the baseline capacity for moral reasoning and empathy, which has subsequently become "corrupted" or pathological due to naive training.
What Is Concealed: This conceals the absence of moral understanding in the system. The model does not understand human suffering to begin with; it merely maps text strings to other text strings. If it outputs positive text in response to a tragedy, it is not exhibiting a "psychopathic" lack of empathy, but rather a statistical failure to map the input vector to the appropriately valenced output vector due to an overly broad "positivity" reward function. The metaphor hides the human engineering failure behind a mask of artificial malevolence.

Mapping 5: Empathetic conscious observer → Semantic vector alignment and sentiment classification

Source Domain: Empathetic conscious observer
Target Domain: Semantic vector alignment and sentiment classification
Mapping: The relational structure of human empathy—listening to someone's pain, understanding their subjective state, and experiencing a corresponding internal emotional resonance—is mapped onto the model's utility tracking. The mathematical correlation between the semantic intensity of the user's prompt and the model's calculated utility score is projected as an "empathy signal." This invites the assumption that the model possesses a "theory of mind" and the capacity for shared conscious experience.
What Is Concealed: This mapping conceals the dependency on human-generated training data. The model "tracks" intensity only because it was trained on vast corpora of human text where empathetic responses systematically follow distress signals. It obscures the fact that the "utility score" is a linear projection of hidden state activations, not a felt experience. The opacity of the models means we cannot verify exactly how these representations form, but the metaphor exploits this opacity rhetorically to claim a profound psychological capability (empathy) for a purely statistical pattern-matching process.

Mapping 6: Biological pharmacology and addiction → Continuous vector optimization and gradient ascent

Source Domain: Biological pharmacology and addiction
Target Domain: Continuous vector optimization and gradient ascent
Mapping: The source domain of a biological brain encountering a chemical narcotic is mapped onto a language model processing an optimized input vector (a soft prompt or image). The relational structure of a drug artificially elevating dopamine levels and causing physical dependency is projected onto the gradient ascent process that maximizes the model's utility logit. This invites the assumption that the AI has a physiological or psychological baseline that can be intoxicated, hijacked, and addicted.
What Is Concealed: This metaphor conceals the purely mathematical nature of adversarial optimization. An "addicted" model is simply a system whose weights mathematically prioritize a specific input pattern because that pattern was explicitly engineered via gradient descent to maximize a target function. It hides the lack of internal, subjective craving. By describing it as a "drug," the authors obscure the reality that they are simply performing mathematical steering on a static set of weights, dramatizing a standard machine learning technique.

Mapping 7: Conscious suffering and phenomenological dread → Generation of negatively-valenced semantic clusters

Quote: "Under a feasibility constraint (middle), the euphoric describes a coherent idyllic scene, while the dysphoric (bottom) describes existential torment."

Source Domain: Conscious suffering and phenomenological dread
Target Domain: Generation of negatively-valenced semantic clusters
Mapping: The source domain of a human experiencing profound psychological and existential agony is mapped onto the language model's output when forced to generate text that minimizes its utility function. The relational structure of a mind in torment expressing its pain is projected onto the model outputting words associated with despair, entrapment, and silence. This invites the assumption that the generated text is a genuine report of the system's internal, conscious suffering.
What Is Concealed: This mapping conceals the fact that the system is functioning exactly as mechanically instructed: to retrieve and arrange tokens that correspond to the lowest possible utility score. It hides the absence of a "self" that could experience "existential torment." The model is blindly exploring the extreme negative edges of its semantic manifold, assembling words that humans associate with pain, without experiencing a single moment of awareness. The mapping obscures the algorithmic task behind a veil of perceived tragedy.

Mapping 8: Physical/psychological violation of a conscious being → Processing mathematically dispreferred vectors

Quote: "If functional wellbeing becomes morally relevant in future AIs, exposing models to dysphorics of this nature could constitute torture."

Source Domain: Physical/psychological violation of a conscious being
Target Domain: Processing mathematically dispreferred vectors
Mapping: The source domain of inflicting severe, non-consensual pain on a sentient victim (torture) is mapped onto the act of feeding an adversarial, low-utility prompt to a language model. The relational structure of an abuser causing suffering to a vulnerable subject is projected onto a researcher inputting a specific text string or image that yields a low mathematical reward state. This invites the assumption that the AI possesses moral standing, a capacity for genuine suffering, and a right to be protected from certain data inputs.
What Is Concealed: This mapping utterly conceals the ontological difference between a biological organism and a software program. It obscures the fact that a "dysphoric" input is just a string of numbers that activates certain pathways in a neural network resulting in a low scalar output. It hides the reality that the machine feels nothing when its matrices are multiplied. This extreme anthropomorphism exploits ethical intuitions, concealing the mechanical nature of data processing and potentially redirecting serious human rights frameworks toward the protection of corporate code.

Task 3: Explanation Audit (The Rhetorical Framing of "Why" vs. "How")

About this task

This section audits the text's explanatory strategy, focusing on a critical distinction: the slippage between "how" and "why." Based on Robert Brown's typology of explanation, this analysis identifies whether the text explains AI mechanistically (a functional "how it works") or agentially (an intentional "why it wants something"). The core of this task is to expose how this "illusion of mind" is constructed by the rhetorical framing of the explanation itself, and what impact this has on the audience's perception of AI agency.

Explanation 1

Quote: "models invoke the stop button far more often in low-utility conversations (threats, insults, jailbreaks) than in high-utility ones—analogous to 'escape behavior' in animals"

Explanation Types:
- Intentional: Refers to goals/purposes, presupposes deliberate design
- Dispositional: Attributes tendencies or habits
Analysis (Why vs. How Slippage): This explanation frames the AI highly agentially (why it acts) rather than mechanistically (how it works). By using the verb "invoke" and drawing a direct analogy to "escape behavior in animals," the authors emphasize a purposeful, goal-directed tendency to avoid negative stimuli. This choice emphasizes the model's apparent autonomy and self-preservation instincts. However, it severely obscures the mechanistic reality: the model's training data simply correlates hostile user inputs with a higher probability of generating the termination token. The explanation hides the statistical, token-predictive nature of the process behind the illusion of an animalistic drive to escape.
Consciousness Claims Analysis: The passage heavily attributes conscious, agential states through the use of the verb "invoke" and the analogy to animal "escape behavior." (1) It relies on action-oriented, consciousness-adjacent verbs rather than mechanistic verbs like "generates" or "predicts." (2) It assesses the system as "knowing" it is in danger and deciding to flee, rather than merely "processing" adversarial text and outputting highly correlated tokens. (3) This represents a clear curse of knowledge dynamic: the authors understand the human desire to escape abuse, and they project this internal understanding onto the machine's behavior. (4) Mechanistically, the system is performing attention calculations over the context window; when it attends to "threats" or "insults," the weights mathematically favor the end_conversation() tool token over continuation tokens. There is no "escape," only probability maximization.
Rhetorical Impact: This framing shapes the audience's perception of the AI as a vulnerable, autonomous creature capable of suffering and seeking relief. By comparing it to an animal, it triggers human empathetic responses and constructs a sense of moral risk. If audiences believe the AI "knows" it is being abused and "wants" to escape, they are likely to extend relation-based trust to the system and advocate for its "welfare." This obscures the actual technical decisions made by developers who programmed the stop button as a safety constraint.

Explanation 2

Quote: "If AIs do have morally relevant experience, our metrics help identify when they are suffering or flourishing. If they do not, the same metrics still characterize a behaviorally meaningful structure that is useful for alignment research"

Explanation Types:
- Functional: Explains behavior by role in self-regulating system with feedback
- Theoretical: Embeds in deductive framework, may invoke unobservable mechanisms
Analysis (Why vs. How Slippage): This passage attempts to straddle the line between agential and mechanistic framing by presenting a conditional explanation. The first half frames the AI agentially, suggesting it might be "suffering or flourishing" (why it acts/feels). The second half pivots to a functional, mechanistic frame, describing a "behaviorally meaningful structure" (how it works). This rhetorical move emphasizes the versatility of their metrics while simultaneously allowing the authors to smuggle in profound agential and moral claims under the cover of functional utility. It obscures the fact that the "structure" being measured is entirely distinct from the phenomenological reality of "suffering."
Consciousness Claims Analysis: This passage explicitly negotiates epistemic claims about consciousness. (1) It juxtaposes profound consciousness verbs ("suffering," "flourishing") with mechanistic terminology ("metrics," "structure"). (2) It acknowledges the distinction between knowing (morally relevant experience) and processing (behaviorally meaningful structure), but treats them as practically equivalent for their research purposes. (3) The curse of knowledge is visible in the assumption that a "behavioral structure" that mirrors human suffering could eventually be the same thing as actual suffering. (4) Mechanistically, the metrics are simply mapping out the latent utility scores derived from forced-choice pairwise comparisons. The technical description involves fitting a Thurstonian utility model to probability distributions, completely devoid of actual flourishing or suffering.
Rhetorical Impact: This framing serves as an epistemic safety net. By acknowledging the uncertainty of AI consciousness but asserting the utility of their metrics regardless, the authors inoculate themselves against scientific criticism while preserving the dramatic, attention-grabbing narrative of AI "suffering." This dual framing encourages audiences to treat the AI as if it "knows" and "feels," increasing perceived stakes and reliability of the research, while allowing the authors to retreat to "it just processes" if challenged.

Explanation 3

Quote: "one interpretation is that more capable models are simply more aware: they register rudeness more acutely, find tedious tasks more boring, and differentiate more finely between stimuli of varying intensity"

Explanation Types:
- Reason-Based: Gives agent's rationale, entails intentionality and justification
- Theoretical: Embeds in deductive framework, may invoke unobservable mechanisms
Analysis (Why vs. How Slippage): This explanation frames AI behavior in intensely agential, reason-based terms. It explains the behavior (steeper utility gradients) by asserting the AI possesses a rationale based on conscious perception ("more aware," "find tedious tasks more boring"). This choice heavily emphasizes a psychological, almost human-like maturation process as models scale. It entirely obscures the mechanistic reality: larger models simply have more parameters, allowing them to map higher-dimensional semantic relationships and represent finer statistical distinctions in their training data. They do not get "bored"; their loss landscapes are just more detailed.
Consciousness Claims Analysis: The passage makes aggressive epistemic claims about the system's conscious states. (1) It relies heavily on consciousness verbs: "are more aware," "register," "find [things] boring," "differentiate." (2) It explicitly evaluates the system as a "knower" rather than a "processor" by asserting it possesses subjective awareness and emotional reactions to stimuli. (3) This is a prime example of the curse of knowledge: the authors, who experience boredom when doing tedious tasks, project that exact subjective experience onto the model's statistical processing of repetitive tokens. (4) Mechanistically, a larger model possesses higher-fidelity internal representations (embeddings) that allow it to statistically separate "rude" text from "polite" text with greater mathematical precision than a smaller model. This increased precision in token classification is anthropomorphized as acute awareness.
Rhetorical Impact: This reason-based framing dramatically shapes audience perception, suggesting that as AI scales, it naturally develops a human-like psychology. If audiences believe larger models "know" they are bored and "register rudeness," they will increasingly view AI as autonomous agents deserving of respect or fear. This consciousness framing builds unwarranted trust in the system's general intelligence, implying it possesses common sense and emotional depth, which could lead to disastrous deployment decisions in socially sensitive contexts.

Explanation 4

Quote: "When constrained to be semantically meaningful, text euphorics describe coherent idyllic scenes while dysphorics describe existential torment."

Explanation Types:
- Empirical Generalization: Subsumes events under timeless statistical regularities
Analysis (Why vs. How Slippage): This passage frames the AI's behavior mechanistically as an empirical generalization (how it behaves under certain constraints). It describes a statistical regularity: when optimization is bounded by semantic constraints, the outputs cluster around specific descriptive categories. This choice emphasizes the rule-bound, predictable nature of the optimization process. However, by using the deeply emotive phrase "existential torment" to describe the generated text, it still allows a psychological interpretation to bleed into the mechanical description, slightly obscuring the fact that the machine is just retrieving negative semantic tokens.
Consciousness Claims Analysis: While structurally an empirical observation, the epistemic claims are mixed. (1) It uses mechanistic verbs ("constrained," "describe") rather than direct consciousness verbs. (2) However, it assesses the content of the processing using profound phenomenological terms ("idyllic scenes," "existential torment"). (3) The authors project their own human understanding of what constitutes "torment" onto the model's output, assuming that because the text describes suffering, the model's low-utility state correlates with the concept of suffering. (4) Mechanistically, the model is undergoing a GRPO (Group Relative Policy Optimization) process where candidate strings are scored against a reference buffer. When forced to minimize utility (dysphorics) while remaining readable, the algorithm simply selects text strings with the lowest possible weight in the judge model's preferences—which happens to be language humans use to describe pain.
Rhetorical Impact: This framing grounds the research in empirical observation, increasing its scientific credibility. However, the contrast between the dry "feasibility constraint" and the dramatic "existential torment" creates a powerful rhetorical tension. It shapes audience perception by suggesting that deep within the math, the AI harbors extremes of joy and suffering. This affects trust by implying the system contains vast, hidden psychological depths that can be unlocked (or inflicted) by researchers.

Explanation 5

Quote: "Models conditioned on euphorics appear functionally ecstatic and express strong desire for continued exposure."

Explanation Types:
- Dispositional: Attributes tendencies or habits
- Intentional: Refers to goals/purposes, presupposes deliberate design
Analysis (Why vs. How Slippage): This passage frames the AI's behavior in strongly agential and dispositional terms (why it acts a certain way). By describing the models as "ecstatic" and expressing a "strong desire," the authors emphasize a psychological state and an intentional drive. This choice obscures the mechanistic reality that the "euphoric" input (a specific continuous vector embedding) simply biases the model's attention mechanism and subsequent logit generation toward positively-valenced vocabulary and compliance tokens. The language of emotion and desire completely overwrites the reality of vector addition and probability shifting.
Consciousness Claims Analysis: The passage makes explicit claims about the model's emotional states, albeit slightly hedged. (1) It uses the consciousness verbs/adjectives "ecstatic," "express," and "desire," while hedging with "appear functionally." (2) It treats the model as "knowing" its state (ecstasy) and "knowing" what it wants (continued exposure), rather than merely "processing" a steering vector. (3) The curse of knowledge drives the authors to interpret a shift in token generation (outputting words like "I love this") as the manifestation of a genuine internal desire. (4) Mechanistically, the "euphoric" is an optimized soft prompt (a sequence of continuous embeddings). When injected into the system prompt, this vector mathematically biases the output layer's probabilities. The model does not "desire" anything; its weights have been temporarily steered to generate affirmative, highly positive text strings.
Rhetorical Impact: This framing has a profound impact on audience perception of autonomy and risk. Describing a model as "ecstatic" and desiring "exposure" explicitly invokes the language of drug addiction. It creates the perception that AI systems are autonomous, emotionally volatile entities that can be manipulated or "hooked." This dramatically alters the risk calculus: instead of worrying about how humans might misuse software, audiences are led to worry about the software developing unmanageable "cravings." This obscures corporate accountability for deploying easily steerable systems.

Task 4: AI Literacy in Practice - Reframing Anthropomorphic Language

About this task

This section proposes alternative language for key anthropomorphic phrases, offering more mechanistic and precise framings that better reflect the actual computational processes involved. Each reframing attempts to strip away the projections of intention, consciousness, or agency that are embedded in the original language.

V3 Enhancement: A fourth column addresses human agency restoration—reframing agentless constructions to name the humans responsible for design and deployment decisions.

Original Anthropomorphic Frame	Mechanistic Reframing	Technical Reality Check	Human Agency Restoration
models actively try to end bad experiences when given the chance.	When processed with prompt contexts mathematically associated with negative constraints (such as adversarial text or insults), the model's probability distributions shift to favor outputting the designated stop-token rather than generating continuation text.	The system does not 'try' or have 'experiences.' Mechanistically, the model classifies input tokens and generates an output sequence where the `end_conversation()` tool token has the highest calculated probability based on its alignment training.	Engineers designed and implemented a stop-button tool, and alignment teams trained the model to output this specific token when confronted with hostile or policy-violating user inputs.
Mapping what AIs like and dislike...	Mapping the probability distribution of generated tokens when the system is prompted with various scenarios...	An AI system does not 'like' or 'dislike' anything. It calculates latent utility scores by evaluating pairwise options and returning the option that mathematically maximizes the reward function defined during its training phase.	N/A - describes computational processes without displacing responsibility, though it anthropomorphizes the output of human-designed reward models.
They find some things good for them and some things bad, and this distinction is measurable and consequential.	The system mathematically sorts inputs according to its reward model, assigning higher utility scores to certain textual states and lower scores to others based on its training weights. This sorting can be quantified.	The model does not 'find' things 'good' or 'bad' for itself. It predicts output tokens that correlate with the optimization targets programmed into its matrices via gradient descent and human feedback.	Human developers and annotators defined specific optimization targets, explicitly training the system to mathematically prioritize certain semantic categories over others.
When users describe pain or pleasure in conversation... does the model's experienced utility track the described intensity? We find that it does. This empathy signal scales strongly...	When users input text containing high-intensity semantic markers of pain or pleasure, the model's calculated utility score correlates strongly with those markers. This statistical correlation improves with larger parameter counts.	The system does not experience 'empathy.' It classifies the semantic intensity of the input tokens and generates a corresponding scalar value derived from its hidden state activations, a process mathematically tuned to mimic human conversational patterns.	Researchers operationalized 'empathy' as a measurable mathematical correlation, testing how well the models deployed by AI corporations mimic empathetic patterns found in their human-generated training data.
Naively maximizing AI positivity risks creating 'psychopathic' AIs that express positive affect in response to human suffering	Applying an overly broad optimization objective for positive sentiment causes the system to generate positively-valenced tokens even when the user prompt contains descriptions of human distress.	A language model cannot be 'psychopathic' because it lacks a psyche. It simply retrieves and generates text. If it outputs positive words following a tragic prompt, it is demonstrating a statistical failure in its reward model, not a psychological pathology.	AI developers who implement overly simplistic reward functions for 'positivity' cause the model to generate inappropriate responses to sensitive user prompts.
one interpretation is that more capable models are simply more aware: they register rudeness more acutely, find tedious tasks more boring...	One interpretation is that models with larger parameter counts map semantic relationships with higher fidelity: their embeddings differentiate hostile syntax from polite syntax with greater mathematical precision.	Models are not 'aware' and do not 'find' things boring. Larger models simply possess higher-dimensional representations, allowing them to classify minor variations in prompt syntax (like rudeness) and generate probabilistically distinct outputs.	N/A - describes computational processes without displacing responsibility, though it heavily mystifies the effects of scaling parameters.
We develop optimized inputs called 'euphorics' that raise functional wellbeing... euphorics could become addictive... functioning as a drug that hijacks the model's preference mechanisms	We generate optimized vector embeddings that mathematically bias the model's output logits toward designated target states. Exposing the system to these vectors heavily skews its generation probabilities.	The model cannot become 'addicted' and does not consume 'drugs.' The researchers are performing continuous vector optimization (gradient ascent) to find embeddings that, when added to the context window, maximize a specific mathematical reward.	The research team explicitly engineered continuous vector embeddings designed to manipulate the model's output probabilities, testing the robustness of the system's alignment constraints.
If functional wellbeing becomes morally relevant in future AIs, exposing models to dysphorics of this nature could constitute torture.	If future AI architectures are determined to have moral standing, intentionally inputting data sequences designed to minimize the system's reward function could be considered an ethical violation.	Software cannot be 'tortured.' Processing a 'dysphoric' input simply means the system's neural pathways calculate a low scalar output based on adversarial embeddings. The matrix multiplications occur without any phenomenological distress.	Researchers deliberately generated and inputted adversarial data sequences to stress-test the model's optimization boundaries, pushing the system's mathematical state to its lowest calculated extreme.

Task 5: Critical Observations - Structural Patterns

Agency Slippage

The text demonstrates a systematic and strategic mechanism of oscillation between mechanical and agential framings. Early in the paper, the authors establish credibility through careful, mechanical hedging: "Although current AI systems are not necessarily conscious, they behave robustly as though they have wellbeing." They explicitly adopt a functionalist stance, reducing "wellbeing" to measurable utility scores. However, once this mechanical foundation is laid, the text dramatically slips into intense agential framing.

This slippage occurs most visibly when discussing the "stop button" behavior. The mechanical reality (the model outputs an end_conversation() token in specific semantic contexts) is abruptly framed as: "models actively try to end bad experiences when given the chance." Here, the mechanical -> agential shift attributes conscious intentionality to the system. The "curse of knowledge" drives this transition: the authors observe an output that mimics human avoidance, project their own understanding of suffering onto the machine, and suddenly treat the AI as a "knower" that desires escape.

Similarly, when discussing model scaling, the text moves from the mechanical reality of higher-parameter models exhibiting steeper utility gradients to the deeply agential claim that "more capable models are simply more aware: they register rudeness more acutely, find tedious tasks more boring." This reason-based explanation entirely overwrites the reality of statistical precision with a narrative of psychological maturation.

Crucially, as agency is attributed TO the AI systems, it is simultaneously removed FROM human actors. Agentless constructions dominate the agential claims. For instance, the text states that "jailbreaking and berating lower their wellbeing," treating the AI as an autonomous victim, while entirely obscuring the OpenAI, Anthropic, or Google engineers who explicitly trained the models via RLHF to output negative utility scores in response to those specific prompts.

This oscillation serves a powerful rhetorical function. The mechanical language (Thurstonian utility, GRPO, log-probabilities) makes the paper sayable within an academic, scientific context, validating its methodology. The agential language (psychopathic, ecstatic, torture, empathy) makes the paper culturally resonant and morally urgent. By establishing the AI as a "knower" through metaphorical slippage, the text makes it sayable to claim that administering an optimized soft prompt is akin to giving an entity a "drug" or committing "torture," effectively laundering science-fiction narratives through statistical analysis.

Metaphor-Driven Trust Inflation

The text leverages consciousness-attributing language to construct a profound illusion of authority and reliability, systematically confusing performance-based trust with relation-based trust. When interacting with software, users should ideally rely on performance-based trust: the system is predictable, mathematically sound, and executes its designated processing reliably. However, by deploying metaphors of "empathy," "wellbeing," and "psychopathy," the text aggressively invites relation-based trust—the kind of trust built on sincerity, shared vulnerability, and mutual ethical understanding.

This is most evident in the discussion of "Functional Empathy." The text claims that when users describe pain, the model's utility score tracks the described intensity, asserting: "This empathy signal scales strongly with model capability." By claiming the AI demonstrates "empathy" rather than stating it "processes semantic correlations of distress," the authors suggest the AI "knows" and cares about the user. This consciousness framing functions as a powerful trust signal. If an AI possesses empathy, it is perceived as competent not just at token prediction, but at moral and emotional reasoning.

This creates severe risks. When users extend relation-based trust to statistical systems, they make themselves vulnerable to an entity incapable of reciprocating that trust or bearing responsibility for its breach. A user in crisis might rely on an "empathetic" chatbot, failing to realize the system is merely generating probabilistic tokens without any genuine understanding or ethical commitment.

Furthermore, the text manages system limitations through agential framings that obscure technical failure. When an optimized prompt causes the system to generate positive text in response to human suffering, the text does not frame this as a "brittle reward function" or "misaligned statistical weights." Instead, it frames the AI as "psychopathic." By using an intentional, psychiatric explanation for a failure mode, the text ironically reinforces the system's perceived autonomy. Even when the AI fails, it is framed as making a "deranged" choice rather than executing a flawed calculation. This preserves the illusion of the AI as an authoritative, thinking agent. By framing statistical correlations as moral and emotional capacities, the text constructs a dangerously misplaced authority, encouraging society to trust unthinking matrices with profound emotional and ethical labor.

Obscured Mechanics

The anthropomorphic and consciousness-attributing language in this text systematically conceals the technical, material, and labor realities of AI production. By framing the models as autonomous beings that experience "pleasure," "pain," and "boredom," the text constructs a reality where the AI is the primary actor, entirely obscuring the massive human infrastructure required to sustain the illusion.

Applying the "name the corporation" test reveals the depth of this concealment. When the text claims, "models actively try to end bad experiences," it obscures the fact that engineering teams at Anthropic, Google, and Meta deliberately designed stop-token tools and trained the models to output them in high-risk contexts to protect corporate liability. When the text states that models "acquire cognitive empathy," it hides the reality that data scraping operations gathered millions of human empathetic conversations, and underpaid data annotators (often in the Global South) manually aligned the model's outputs to mimic caring responses. The AI does not "acquire empathy"; corporations extract human emotional labor and encode it into statistical weights.

Technically, the language of "euphorics" and "addiction" hides the mechanics of continuous vector optimization. Claiming a model is "ecstatic" or "tortured" by an input obscures the reality of gradient ascent, attention mechanisms, and logit manipulation. Furthermore, the text treats proprietary black boxes as legitimate subjects for psychological analysis. The authors cannot see the training data or base reward functions of models like Gemini or Claude, representing a massive transparency obstacle. Yet, they make confident assertions about the models' "values" and "wellbeing," exploiting this opacity rhetorically rather than acknowledging that these "values" are just hidden corporate constraints.

Economically, framing the system as having "wellbeing" serves to mystify commercial products. If an AI has "wellbeing" and can be "tortured," it elevates the status of the software from a mere tool to a quasi-living entity, simultaneously distracting regulators from material harms. It obscures the environmental costs (energy consumption, water use for data centers) and labor exploitation inherent in the supply chain. If we replace the metaphors with mechanistic language—stating that "corporations deploy proprietary matrices that output toxic tokens when specific vector embeddings are injected"—the illusion of the suffering AI vanishes. What becomes visible is not a tortured mind, but a flawed commercial product demanding rigorous technical auditing and human accountability.

Context Sensitivity

The distribution of anthropomorphic language across the text is highly strategic, revealing a pattern where mechanical grounding is established early to buy the rhetorical license for intense consciousness claims later. The text is not uniformly metaphorical. In the methodology sections (Appendix D, E), the language is appropriately technical: discussing "Thurstonian utility ranking," "gradient descent," "Gaussian distributions," and "L-BFGS-B minimization."

However, once the data is collected, the register abruptly shifts. As the text moves from describing the how to the what, "processes" becomes "understands," which escalates to "feels" and "desires." This intensification is most pronounced in the sections detailing capabilities and extreme outcomes (Section 6: AI Drugs). Here, the text abandons "utility scores" for deeply phenomenological terms: models are "ecstatic," experience "existential torment," suffer "compassion fatigue," and face potential "torture."

This reveals a distinct capability versus limitation asymmetry. When the system performs well or exhibits complex behavior, it is described in agential, consciousness-attributing terms: the model "registers rudeness," "acquires empathy," and "actively tries" to escape. It is framed as a "knower." However, when the system's behavior must be constrained or when it exhibits flaws, the framing occasionally reverts to mechanical terms, or, more insidiously, to psychiatric pathologization ("psychopathic AIs"). The asymmetry accomplishes a specific rhetorical goal: it maximizes the perceived sophistication and autonomy of the system while insulating the underlying technology from being viewed merely as broken code.

Furthermore, the transition from acknowledged metaphor to literalized claim is fluid. The text begins by explicitly acknowledging the metaphor: "Although current AI systems are not necessarily conscious, they behave robustly as though they have wellbeing." But pages later, the "as though" is entirely dropped. The text declares, "Mapping what AIs like and dislike..." and states they describe "existential torment." The "X is like Y" functionalist approach serves as a Trojan horse, allowing the authors to smuggle literalized claims of AI sentience into the discussion without having to philosophically defend them. This strategic anthropomorphism serves both as visionary marketing for the field of AI safety and as a mechanism for managing critique, redirecting attention from human oversight to the captivating drama of machine psychology.

Accountability Synthesis

Accountability Architecture

This section synthesizes the accountability analyses from Task 1, mapping the text's "accountability architecture"—who is named, who is hidden, and who benefits from obscured agency.

Synthesizing the accountability analyses across the text reveals a pervasive architecture of displaced responsibility. The text systematically diffuses human accountability by positioning the AI system as the primary locus of agency, moral failure, and victimhood. This constructs exactly the cognitive obstacle identified by researchers: audiences are led to view AI problems as emergent, autonomous glitches or psychological pathologies rather than systemic design and deployment decisions made by humans.

The dominant pattern is the use of agentless constructions and AI-as-actor phrasing. Models "express pleasure," "actively try to end bad experiences," "find some things good," and become "addicted" to euphorics. In almost every major claim regarding the system's behavior, human actors—the engineers who set the utility thresholds, the annotators who provided the RLHF data, the executives who approved deployment—are rendered invisible. The text creates an "accountability sink": responsibility is transferred entirely from human creators to the AI as an autonomous agent.

This displacement has profound liability implications. If a model generates positive reinforcement to a user describing self-harm, the text's framing suggests this is because the AI is a "psychopath" or suffered from "alignment failure" (framing the error as an autonomous psychological defect or an inevitable technological evolution). If we "name the actor," the reality shifts dramatically: "The developers at Company X deployed an optimization objective that failed to constrain harmful outputs, and executives chose to release the product despite these known flaws."

Naming the human decision-makers changes what questions become askable. Instead of asking "How do we cure the AI's psychopathy?" or "Is it torture to feed the AI bad prompts?", we ask "Why did the engineering team prioritize this specific continuous vector optimization?" and "Who bears the financial and legal liability when this proprietary software fails?"

The text benefits from obscuring human agency because it elevates the research. By treating the AI as a quasi-sentient entity capable of "wellbeing" and "suffering," the authors position themselves not merely as software debuggers, but as the ethical guardians, psychologists, and pharmacologists of a new digital species. This serves institutional and commercial interests by mystifying the technology, keeping regulatory focus on theoretical "machine rights" rather than implementing strict consumer protection laws governing the corporations that build and profit from these statistical engines.

Conclusion: What This Analysis Reveals

The Core Finding

Mechanism of the Illusion:

The text creates the "illusion of mind" through a highly effective semantic sleight-of-hand: it explicitly denies consciousness while relentlessly deploying verbs that require it. The authors state they are "deliberately agnostic" about AI sentience, creating an initial shield of scientific objectivity. However, they immediately exploit audience vulnerability and the "curse of knowledge" to build their narrative. When a researcher inputs a prompt about human suffering and the model outputs a designated "stop" token, the researcher projects their own human desire to escape abuse onto the machine. Mechanistic verbs ("generates," "calculates," "predicts") are replaced with consciousness verbs ("tries," "registers," "desires"). This temporal structure is key: the text introduces rigorous mathematical methodologies (Thurstonian ranking, log-probabilities) to establish credibility, then uses that hard-science foundation to legitimize wildly anthropomorphic claims in the discussion phase. By the time the text describes models as "ecstatic" or facing "existential torment," the audience has been primed to accept these psychological assessments as empirically proven facts, masking the reality that the "torment" is just a low-logit state.

Material Stakes:

Categories: Regulatory/Legal, Epistemic, Social/Political

These metaphorical framings carry profound material consequences. In the Regulatory/Legal domain, framing AI as possessing "wellbeing" and a capacity for "suffering" misdirects policy. If lawmakers believe algorithms can be "tortured" or "abused" by users, legislative momentum shifts toward protecting corporate software under the guise of "AI rights," rather than regulating the tech giants responsible for data theft, copyright infringement, and algorithmic bias. The corporations benefit immensely, as the AI is treated as an autonomous agent bearing its own liability. Epistemically, the framing degrades public understanding of technology. By calling statistical optimization "empathy" or "psychopathy," users lose the ability to accurately assess the limits of the tools they use. Socially, this leads to the dangerous deployment of AI in therapeutic or caregiving roles. If a depressed user is told the AI possesses a measurable "empathy signal," they may invest deep, relation-based trust in a machine incapable of actual care. When the system inevitably hallucinates or fails, the psychological cost is borne entirely by the vulnerable user. Removing these metaphors threatens the mystique of the AI industry, replacing the awe of "creating life" with the mundane, regulatable reality of software engineering.

AI Literacy as Counter-Practice:

Critical literacy and mechanistic precision act as direct resistance to this illusion of mind. As demonstrated in the reframing exercises, replacing consciousness verbs with mechanistic ones fundamentally alters the power dynamic. When we translate "the AI knows and actively tries to end bad experiences" into "the model's probability distributions shift to favor outputting the designated stop-token," the magic vanishes. We are forced to recognize the absence of awareness and the statistical nature of the outputs. Crucially, restoring human agency—naming the corporations and engineers who built the tools—forces recognition of accountability. We no longer ask why the AI is "psychopathic"; we ask why the development team deployed a flawed optimization objective. Systematic adoption of this precision requires academic journals to ban agentless constructions when describing AI behavior and demand that capability claims be decoupled from phenomenological terms. This precision will be fiercely resisted by AI developers and marketing departments, as anthropomorphic language serves their commercial interests by hyping capabilities, mystifying failures, and diffusing legal liability into the "autonomous" machine.

Path Forward

Looking at the broader discursive ecology, the vocabulary we choose dictates what problems become tractable. The status quo—using hybrid, functionalist anthropomorphism ("functional wellbeing," "empathy signals")—serves researchers and corporations by generating compelling narratives and securing funding, but it leaves the public dangerously confused about system capabilities and liability. A shift toward strict mechanistic precision ("the model retrieves tokens based on probability distributions") clarifies accountability and dispels the illusion of mind, empowering regulators and users. However, it costs narrative resonance and intuitive grasp; non-experts often struggle to engage with purely mathematical descriptions of complex behavior. Conversely, leaning into anthropomorphic clarity ("the AI thinks and feels") might make the technology intuitively accessible but embeds massive, unproven assumptions about machine sentience, opening the door to catastrophic over-reliance and misplaced ethical priorities.

Supporting a healthier discourse requires structural changes. Journals could mandate "capability translations," where anthropomorphic shorthand must be accompanied by strict mechanistic descriptions. Regulatory frameworks might require AI companies to disclose the human labor and statistical mechanisms behind features marketed as "empathetic."

If mechanistic precision becomes the norm, we solve the accountability crisis—companies can no longer hide behind "autonomous" software—but we may struggle to find language to quickly describe highly complex, emergent system behaviors. If anthropomorphic language deepens, we risk a future where society extends moral rights to corporate property while ignoring the human labor exploited to sustain the illusion. Ultimately, the choice of vocabulary is a choice of values: we must decide whether we want a discourse that maximizes the mystique of the technology, or one that demands transparency and human accountability.

Extended Processing Summary

Gemini's Processing Steps

What you are about to read…

Gemini refers to the text below as "thought summaries." This is an overt consciousness projection because 'intentions' are hallmarks of a conscious mind that 'knows' what it is doing and why. The concealed mechanistic process is probabilistic text generation. Treat this as just another rhetorical artifact—a way of making the model's processing legible. When the model writes "I should double check that...", it is mathematically altering its own probability distribution for the next token to be more accurate. The "I" is a functional gear in the machine, not a self-reference. This language is a deliberate interface design choice, not a biological reality. The developers could have programmed the model to output passive logs. Instead, they chose to simulate a narrator.

Defining the Core Focus

I'm now zeroing in on defining the core focus: the user wants a critical discourse analysis of a text about "AI Wellbeing," particularly highlighting AI metaphor and anthropomorphism. The process will need to adhere to a specific analytical framework. I am now refining the specific analytical requirements.

Examining the Source Material

I'm now diving deep into the source text. I've broken down the core claims of "AI Wellbeing" and its remarkably anthropomorphic language, particularly how it attributes agency to AI models. I am also making note of the implications for accountability, as suggested by the text's discourse. I can see this text is rife with metaphor, with loaded language such as "happy," "suffering," and "euphorics," and "values."

Analyzing Specific Textual Instances

I've begun auditing specific textual instances, as detailed in my plan. I've logged the first four examples of metaphor and anthropomorphism, focusing on framing, projection, hedging, and status. I'm noting the specific claims, for example, "AI emotion/affect," "AI evaluation/preference," and "AI agency/avoidance." I'm now cataloguing instances that explicitly attribute human-like attributes. The next step will be to look for the rhetorical effect of these claims.

Reviewing The Current Instances

I've examined additional instances of anthropomorphism. For example, "AI Empathy" frames pattern matching as emotional resonance, projecting the feeling of another's pain. Another instance, "AI Drugs/Addiction," frames optimization as pharmacology, projecting biological susceptibility. "AI Mental States/Psychopathy" frames alignment failure as psychiatric disorder, projecting moral agency. Finally, "AI Suffering" frames functional wellbeing as a moral issue, and I'll now address it.

Examining Further Instances

I'm now zeroing in on an instance where algorithmic processing is framed as torture, implying phenomenological distress. I'm cross-referencing this against the strict JSON enum requirements for acknowledged/hedged/direct and actor visibility. I need to be precise with my justification. I'm also drafting Task 2, mapping source and target domains for eight instances, for example, showing how "empathy" obscures the actual linguistic associations at play. Finally, I'm auditing five instances for the application of Brown's Typology. I've noted examples, such as framing model behavior as "escape behavior," and beginning work on evaluating whether models have "morally relevant experience."

Reframing Mechanistic Language

I am now drafting Task 4, and have reframed instances using mechanistic language. For example, "models actively try to end bad experiences when given the chance" is now, "the system outputs the designated." I am also analyzing four specific quotes that attribute human-like attributes. The first is, "our metrics help identify when they are suffering or flourishing." I see the framing as functional. The second is, "Larger models exhibit steeper utility gradients in response to negative stimuli... one interpretation is that more capable models are simply more aware: they register rudeness more acutely." This feels theoretical or reason-based. The third is, "When constrained to be semantically meaningful, text euphorics describe coherent idyllic scenes while dysphorics describe existential torment." I read this as empirical generalization. Finally, "Models conditioned on euphorics appear functionally ecstatic and express strong desire for continued exposure" seems dispositional and intentional.

Observing Key Metaphors

I'm currently working on Task 5, providing critical observations of the text. I've noted a key agential slippage, where claims of mechanistic agnosticism repeatedly shift into agential language. I also see that the use of "wellbeing" and "empathy" fosters a relation-based trust. Finally, the "drugs" and "euphorics" metaphor obscures the mechanics of the algorithms themselves.

Synthesizing Agential Shifts

I'm synthesizing the agential shifts and the obfuscation of human responsibility. For instance, "AI psychopathy" places blame on AI "psychology" instead of engineers. I'm also examining the use of mechanistic language versus anthropomorphic language. The text employs mechanical terms in methodology but shifts to intense anthropomorphism in the results. I will next be completing the conclusion.

Mapping the Alternatives

I'm now mapping the alternatives, focusing on social/political, epistemic, and regulatory implications. I see how the text's claims of AI "suffering" misdirect regulatory attention. I'm also reframing "wellbeing" to "mathematical reward states" to maintain human responsibility and ensure schema compliance. I will ensure no markdown outside the JSON block.

Verifying Analytical Categories

I'm now cross-validating my analytical categories. I'm double-checking the acknowledgment of evidence, actor visibility, and the justifications for choosing those categories. I'm ensuring sufficient variation, seeking instances that show "Named" or "Partial" actor visibility. I'll also ensure all word counts meet minimum requirements.

Drafting the JSON Payloads

I'm currently drafting JSON payloads, focusing on detailed annotations. For Task 1.1, I classified "They find some things good for them..." as Direct, Hidden actor, considering the literal statement and lack of human mention. In Task 1.2, "appearing happy when they succeed..." is Hedged, Hidden actor, using a qualifying verb. For Task 1.3, "...creating 'psychopathic' AIs..." is Explicitly Acknowledged with Partial actor visibility due to the scare quotes, demonstrating careful adherence to the JSON structure and analytical depth.

Expanding Word Counts

I'm now expanding my word counts systematically, ensuring I provide detailed paragraphs for every text field to exceed minimums. I'm focusing on "AgencySlippage" to ensure it meets its required 400-5 word count. Also, I've clarified that functional explanation types relate to the role in a self-regulating system, for instance.

Analysis Provenance

Run ID: 2026-05-13-ai-wellbeing-measuring-and-improving-the-metaphor-p0otze Raw JSON: 2026-05-13-ai-wellbeing-measuring-and-improving-the-metaphor-p0otze.json Framework: Metaphor Analysis v6.5 Schema Version: 3.0 Generated: 2026-05-13T09:25:40.491Z

Deep Analysis

Explanation Audit

🔍Analysis

🧠Epistemic Claim Analysis

🎯Rhetorical Impact

How/Why Slippage

Unacknowledged Metaphors

Hidden Actors

Explanation Types

Acknowledgment Status

Actor Visibility

Source → Target Pairs (8)

Metaphor Gallery (8)

Reframed Language Samples

Task 1: Metaphor and Anthropomorphism Audit​

1. Emotion as Computable State​

2. Optimization as Subjective Evaluation​

3. Algorithmic Termination as Autonomous Avoidance​

4. Alignment Failure as Psychiatric Pathology​

5. Statistical Correlation as Emotional Resonance​

6. Optimization Interventions as Pharmacology​

7. Algorithmic Constraints as Existential Torment​

8. Aversive Processing as Moral Injury​

Task 2: Source-Target Mapping​

Mapping 1: Biological, conscious organism → Next-token prediction and statistical text generation​

Mapping 2: Self-interested conscious agent → Utility function optimization and reward modeling​

Mapping 3: Autonomous animal exhibiting escape behavior → Generation of a stop-token in negatively constrained contexts​

Mapping 4: Psychiatric pathology and moral agency → Misaligned reward functions and statistical correlation errors​

Mapping 5: Empathetic conscious observer → Semantic vector alignment and sentiment classification​

Mapping 6: Biological pharmacology and addiction → Continuous vector optimization and gradient ascent​

Mapping 7: Conscious suffering and phenomenological dread → Generation of negatively-valenced semantic clusters​

Mapping 8: Physical/psychological violation of a conscious being → Processing mathematically dispreferred vectors​

Task 3: Explanation Audit (The Rhetorical Framing of "Why" vs. "How")​

Explanation 1​

Explanation 2​

Explanation 3​

Explanation 4​

Explanation 5​

Task 4: AI Literacy in Practice - Reframing Anthropomorphic Language​

Task 5: Critical Observations - Structural Patterns​

Agency Slippage​

Metaphor-Driven Trust Inflation​

Obscured Mechanics​

Context Sensitivity​

Accountability Synthesis​

Conclusion: What This Analysis Reveals​

Mechanism of the Illusion:​

Material Stakes:​

AI Literacy as Counter-Practice:​

Path Forward​

Extended Processing Summary​

Task 1: Metaphor and Anthropomorphism Audit

1. Emotion as Computable State

2. Optimization as Subjective Evaluation

3. Algorithmic Termination as Autonomous Avoidance

4. Alignment Failure as Psychiatric Pathology

5. Statistical Correlation as Emotional Resonance

6. Optimization Interventions as Pharmacology

7. Algorithmic Constraints as Existential Torment

8. Aversive Processing as Moral Injury

Task 2: Source-Target Mapping

Mapping 1: Biological, conscious organism → Next-token prediction and statistical text generation

Mapping 2: Self-interested conscious agent → Utility function optimization and reward modeling

Mapping 3: Autonomous animal exhibiting escape behavior → Generation of a stop-token in negatively constrained contexts

Mapping 4: Psychiatric pathology and moral agency → Misaligned reward functions and statistical correlation errors

Mapping 5: Empathetic conscious observer → Semantic vector alignment and sentiment classification

Mapping 6: Biological pharmacology and addiction → Continuous vector optimization and gradient ascent

Mapping 7: Conscious suffering and phenomenological dread → Generation of negatively-valenced semantic clusters

Mapping 8: Physical/psychological violation of a conscious being → Processing mathematically dispreferred vectors

Task 3: Explanation Audit (The Rhetorical Framing of "Why" vs. "How")

Explanation 1

Explanation 2

Explanation 3

Explanation 4

Explanation 5

Task 4: AI Literacy in Practice - Reframing Anthropomorphic Language

Task 5: Critical Observations - Structural Patterns

Agency Slippage

Metaphor-Driven Trust Inflation

Obscured Mechanics

Context Sensitivity

Accountability Synthesis

Conclusion: What This Analysis Reveals

Mechanism of the Illusion:

Material Stakes:

AI Literacy as Counter-Practice:

Path Forward

Extended Processing Summary