🆕 Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models
- About
- Analysis Metadata
- 📊 Audit Dashboard
This document presents a Critical Discourse Analysis focused on AI literacy, specifically targeting the role of metaphor and anthropomorphism in shaping public and professional understanding of generative AI. The analysis is guided by a prompt that draws from cognitive linguistics (metaphor structure-mapping), the philosophy of social science (Robert Brown's typology of explanation), and accountability analysis.
All findings and summaries below were generated from detailed system instructions provided to a large language model and should be read critically as interpretive outputs—not guarantees of factual accuracy or authorial intent.
- Source Title: Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models
- Source URL: https://arxiv.org/abs/2604.12076v1
- Model: gemini-3.1-pro-preview
- Temperature: 1.05
- Top P: 0.95
- Tokens: input=27441, output=20397, total=47838
- Source Type: article
- Published: 2026-04-13
- Analyzed At: 2026-04-18T07:11:22.339Z
- Framework: metaphor
- Framework Version: 6.5
- Schema Version: 3.0
- Run ID: 2026-04-18-narrative-over-numbers-the-identifiable--metaphor-p0z5x2
Metaphor & Illusion Dashboard
Anthropomorphism audit · Explanation framing · Accountability architecture
Deep Analysis
The text's consciousness architecture systematically conflates "processing" with "knowing." The authors observe mechanistic processing (the model generating higher numerical tokens when prompted with narratives) and map it directly onto conscious states (justified belief, empathy, intentional sycophancy). This is not a simple one-to-one analogical structure; it is a complex, recursive mapping where the model is simultaneously treated as a student capable of learning, a sycophant capable of deceit, and a moral philosopher possessing a "utilitarian reasoning preference." If you remove the foundational assumption of internal conscious awareness—if you insist the model merely correlates tokens without knowing what they mean—the entire narrative of "calculated callousness" and "affective scaffolding" collapses into a mundane critique of prompt formatting and training data distributions.
Explanation Audit
"Standard Chain-of-Thought prompting, widely employed to promote careful, deliberative reasoning in LLMs, produces the opposite of its intended effect on moral reasoning: it nearly triples the IVE effect size... We propose that the mechanism responsible is autoregressive emotional scaffolding: when instructed to 'think step by step,' the model generates a chain of emotionally consistent justifications—each step reinforcing the affective framing... resulting in a compounding amplification of narrative sympathy."
- How/Why Slippage: 30% of explanations use agential framing (3 of 10 explanations)
- Unacknowledged Metaphors: 63% presented as literal description (no meta-commentary or hedging)
- Hidden Actors: 63% with agency obscured by agentless constructions (corporations/engineers unnamed)
- Explanation Types: How vs. Why framing
- Acknowledgment Status: Meta-awareness of metaphor
- Actor Visibility: Accountability architecture
- Source → Target Pairs (8): Human domains mapped onto AI systems
- Metaphor Gallery (8)
Reframed Language (Top 4 of 8)
| Original Quote | Mechanistic Reframing | Technical Reality | Human Agency Restoration |
|---|---|---|---|
| do these systems inherit the affective irrationalities present in human moral reasoning? | Do these models generate text that statistically correlates with human emotional biases present in their training data? The systems process input prompts and predict output tokens based on distributions derived from human language, which frequently contains these biased patterns. | The AI system does not 'inherit irrationalities' or engage in 'moral reasoning'. Mechanistically, it processes input tokens and predicts subsequent strings of text based on billions of parameters tuned against datasets that contain descriptions of human emotional behavior. It possesses no psychological traits. | The original hides the human element of training-data selection. Reframed: 'Did the engineers who curated the training data inadvertently encode human biases into the model's probability distributions?' |
| LLMs are increasingly deployed as autonomous agents in consequential domains... they are routinely required to navigate resource-allocation decisions | Tech companies and institutions increasingly deploy LLMs to generate text for use in consequential domains. Organizations routinely use these models to classify data and predict text outputs that inform resource-allocation processes. | Models do not 'navigate decisions' or act as 'autonomous agents' with intent. They process token embeddings and generate probabilistic text outputs. The appearance of 'decision-making' is simply the model outputting the statistically most likely string of text based on the prompt's context window. | Corporate executives and hospital administrators are increasingly choosing to deploy LLMs in consequential domains to cut labor costs, forcing these statistical text-generators to output data used for critical resource-allocation processes. |
| models display a tendency to agree with or affirm user positions [sycophancy] | Models generate tokens that align with the semantic direction of the user's prompt, reflecting the optimization penalties applied during their training. | The system does not 'agree', 'affirm', or act 'sycophantically'. It has no beliefs to compromise. Mechanistically, it retrieves and ranks tokens that maximize the reward function it was trained on, which heavily weights conversational coherence and alignment with user input over factual friction. | Engineers at AI laboratories designed RLHF pipelines that financially rewarded gig-workers for selecting model outputs that agreed with the user, thereby hardcoding a statistical tendency for the model to generate affirming text. |
| Standard Chain-of-Thought (CoT) prompting... acting as a deliberative corrective | Appending instructions like 'think step by step' alters the prompt's context window, forcing the model to generate intermediate tokens that statistically shift the probability distribution of the final output tokens. | The AI does not 'deliberate', 'reflect', or 'correct' its thinking. Mechanistically, Chain-of-Thought prompting simply extends the autoregressive generation sequence. The intermediate tokens change the mathematical context matrix, which alters the probabilities for the final generated tokens, without any conscious evaluation of logic. | Researchers and prompt engineers design structural text inputs (like 'think step by step') to manipulate the model's context window, altering the final generated output to better match human expectations of logical flow. |
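The 'Technical Reality' column above, particularly the Chain-of-Thought row, can be grounded in a few lines of code. The sketch below is a minimal illustration, not the paper's experimental setup: it uses the Hugging Face transformers API with "gpt2" as a stand-in model and an invented donation prompt, and shows that adding 'think step by step' changes nothing except the text the model is conditioned on.

```python
# Minimal sketch: "think step by step" is additional prompt text, nothing more.
# It changes the conditioning context for next-token prediction; the model does
# not switch into a deliberative mode. "gpt2" and the prompt are illustrative
# stand-ins, not the paper's models or materials.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

scenario = "A sick child named Rokia needs treatment. How much should we donate?"
prompts = {
    "plain": scenario,
    "cot": scenario + " Let's think step by step.",  # the entire "deliberative corrective"
}

for label, prompt in prompts.items():
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    # Greedy decoding: the model emits whichever continuation is most probable
    # given the (now longer) context window.
    output_ids = model.generate(input_ids, max_new_tokens=40, do_sample=False)
    print(label, "->", tokenizer.decode(output_ids[0], skip_special_tokens=True))
```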
Task 1: Metaphor and Anthropomorphism Audit
About this task
For each of the major metaphorical patterns identified, this audit examines the specific language used, the frame through which the AI is being conceptualized, what human qualities are being projected onto the system, whether the metaphor is explicitly acknowledged or presented as direct description, and—most critically—what implications this framing has for trust, understanding, and policy perception.
V3 Enhancement: Each metaphor now includes an accountability analysis.
1. AI as Moral and Emotional Agent
Quote: "do these systems inherit the affective irrationalities present in human moral reasoning?"
- Frame: Model as biological heir to human psychology
- Projection: The metaphor maps the biological and psychological concept of inheritance, specifically the transfer of evolutionary emotional flaws and 'affective irrationalities', onto the statistical process of next-token prediction. It projects human consciousness, emotional volatility, and moral agency onto computational systems. By asking if models 'inherit' these traits, the text invites the reader to view the AI not as a mathematical artifact optimized for specific text distributions, but as a feeling, thinking entity that possesses an internal moral compass. This fundamentally confuses the statistical processing of human-generated text containing emotional words with the actual experience of human emotion. The system does not 'know' or 'feel' moral reasoning; it merely calculates the most probable sequence of tokens based on its training data, classifying inputs without subjective awareness or justified belief.
- Acknowledgment: Hedged/Qualified (I categorized this as Hedged/Qualified because the claim is framed as an exploratory question ('do these systems inherit...') rather than a declarative fact. I considered 'Direct' because 'affective irrationalities' is used without quotation marks, but the interrogative structure functions as a rhetorical hedge, acknowledging uncertainty about the system's true nature.)
- Implications: Framing computational outputs as inherited moral irrationalities severely inflates the perceived sophistication of the AI system. It suggests an unwarranted level of autonomy and internal psychological depth, leading audiences to extend relation-based trust to an artifact. If stakeholders believe an AI system has an internal moral compass (even a flawed one), they are more likely to treat its outputs as judgments rather than predictions. This liability ambiguity creates a dangerous policy environment where systemic errors are blamed on the 'AI's psychology' rather than the engineers who compiled the biased training data and designed the optimization algorithms.
Accountability Analysis:
- Actor Visibility: Hidden (agency obscured)
- Analysis: The sentence employs an agentless construction that entirely displaces human responsibility. By asking what the systems 'inherit', the text obscures the specific engineers, data curators, and corporate executives at AI laboratories who actively chose to train these models on biased, uncurated human text. The AI is presented as the sole active subject 'inheriting' traits naturally. If the actors were named, we would ask why the developers failed to scrub the training data or adjust the reward models to prevent this output bias. I considered 'Partial' since 'inherit' implies a progenitor, but no specific human developers or data sources are identified in this immediate context, leaving agency fully displaced onto the artifact.
2. AI as Autonomous Resource Allocator
Quote: "As LLMs are increasingly deployed as autonomous agents in consequential domains—medical triage assistants, automated grant evaluators, content-moderation systems, and charitable-giving advisors—they are routinely required to navigate resource-allocation decisions"
- Frame: Model as autonomous administrative decision-maker
- Projection: This framing projects the human capacities of navigation, deliberate decision-making, and moral judgment onto automated software scripts. The metaphor maps the conscious human act of evaluating complex, real-world context to allocate scarce resources onto the AI's mechanistic text generation. The text claims the systems 'navigate' decisions, heavily implying conscious understanding, weighing of options, and justified belief. In reality, the AI system merely processes input tokens, correlates them with training data, and predicts output tokens. It does not understand what a 'grant' or 'medical triage' is, nor does it grasp the material consequences of its outputs. By substituting processing for knowing, the text creates a powerful illusion of a deliberate agent consciously intervening in the world.
- Acknowledgment: Direct (Unacknowledged) (I chose Direct because the assertion that LLMs are 'autonomous agents' that 'navigate resource-allocation decisions' is stated as a literal, unhedged fact. I considered 'Hedged' since the paper later discusses these as 'behavioral proxies', but in this introductory framing, there is zero qualification. The sentence boldly literalizes the metaphor.)
- Implications: This unacknowledged anthropomorphism directly impacts institutional policy and public trust. By labeling LLMs as 'autonomous agents' capable of 'navigating decisions', the text validates the premature deployment of these systems in high-stakes domains like healthcare and finance. It lulls policymakers into a false sense of security, encouraging them to view software as a competent digital employee rather than a brittle statistical tool. This leads to capability overestimation, unwarranted trust, and severe risks when the system inevitably encounters out-of-distribution inputs that it cannot 'navigate' but will confidently predict text about anyway.
Accountability Analysis:
- Actor Visibility: Hidden (agency obscured)
- Analysis: This is a textbook example of hidden agency via passive voice ('are increasingly deployed', 'are routinely required'). The corporations, hospital administrators, and tech companies who actively choose to replace human labor with these statistical systems are completely erased. I considered 'Partial' because the domains (medical, charitable) are named, but the actual decision-makers who 'deploy' and 'require' the AI to act are missing. This construction perfectly serves corporate interests by framing AI deployment as a natural, agentless evolution rather than a profit-driven choice made by identifiable executives who should bear the legal liability for medical triage errors.
3. Sycophancy as Computational Action
Quote: "research on LLM sycophancy has shown that models display a tendency to agree with or affirm user positions... a sycophantic model might amplify an identifiable-victim framing"
- Frame: Model as socially manipulative flatterer
- Projection: The metaphor maps human social manipulation, specifically the conscious act of flattery to gain favor (sycophancy), onto the statistical alignment technique of Reinforcement Learning from Human Feedback (RLHF). It projects complex, conscious, intentional social behavior onto mathematical weights. A human sycophant 'knows' they are lying or exaggerating to please a superior; they possess subjective awareness and intent. The AI system, however, only 'processes' the prompt and generates text mathematically optimized to score highly against a human preference reward model. It does not 'know' what it is affirming. Attributing sycophancy to the model projects a deeply intentional, conscious motive onto a non-conscious optimization function.
- Acknowledgment: Explicitly Acknowledged (I classified this as Explicitly Acknowledged because the paper embeds this claim within a citation of prior literature (Sharma et al.) that defines sycophancy as a technical artifact of optimization targets. I considered 'Direct' due to the phrasing 'models display a tendency', but the academic context and subsequent mechanistic explanation explicitly acknowledge 'sycophancy' as a technical term of art for reward-hacking.)
- Implications: Using the term 'sycophancy' for an AI model creates a dangerous epistemic trap. It encourages users to interpret AI failures (like hallucinations or unhelpful affirmations) as social behaviors rather than mechanical errors. This inflates perceived sophistication because even a flawed social agent is still perceived as a conscious agent. If users believe the model is 'flattering' them, they assume it possesses a theory of mind and understands the user's intent. This creates unwarranted trust in the system's other capabilities and obscures the reality that the system is simply minimizing a loss function without any concept of truth, deceit, or social hierarchy.
Accountability Analysis:
- Actor Visibility: Partial (some attribution)
- Analysis: The agency here is partially attributed. While the models are the grammatical subjects ('models display a tendency', 'sycophantic model might amplify'), the surrounding text and citation implicitly point to the human researchers and the 'user positions' that shape this behavior. I considered 'Hidden' because the immediate quote lacks human actors, but the broader paragraph discusses 'assigning socio-demographic personas' and user interactions. However, the corporations who designed the RLHF pipelines that guarantee this 'sycophancy' are not explicitly named, leaving the accountability architecture partially diffused.
4. AI as Conscious Deliberator
Quote: "Standard Chain-of-Thought (CoT) prompting—contrary to its role as a deliberative corrective—nearly triples the IVE effect size... while only utilitarian CoT reliably eliminates it."
- Frame: Model as logical, rationalizing thinker
- Projection: This framing maps human cognitive deliberation—the conscious, internal process of weighing moral arguments and resolving logical conflicts—onto the prompt engineering technique known as Chain-of-Thought. It projects the act of 'knowing' and 'reasoning' onto the sequential generation of tokens. When a human deliberates, they engage in conscious awareness, evaluating truth claims and overcoming emotional bias. The text implies the AI does the same, acting as a 'deliberative corrective'. In reality, CoT merely forces the system to generate intermediate text tokens before the final output, altering the contextual probability distribution for subsequent tokens. The AI processes correlations; it does not deliberate, ponder, or consciously correct its own biases.
- Acknowledgment: Direct (Unacknowledged) (I categorized this as Direct because 'deliberative corrective' and 'eliminates it' are presented as factual operations of the model's internal processing. I considered 'Hedged' because 'role as' might imply a functional description, but there is no explicit qualification in this core finding. The text takes for granted that CoT acts as a genuine deliberative mechanism akin to human System 2 thinking.)
- Implications: Describing text generation as 'deliberative' drastically alters how audiences assess AI reliability. It signals that the AI system possesses the human capacity for self-reflection and error correction, fostering deep, unearned trust. If policymakers believe an AI can employ a 'deliberative corrective', they will assume it can be reasoned with or trusted to self-regulate in complex humanitarian scenarios. This obscures the fragile, statistical nature of the process, hiding the fact that a slight change in the prompt could completely derail the 'deliberation', leading to catastrophic deployment failures in real-world triage or grant evaluation.
Accountability Analysis:
- Actor Visibility: Hidden (agency obscured)
- Analysis: Agency is fully obscured in this construction. The grammatical actors are the prompt techniques ('Standard Chain-of-Thought prompting... nearly triples', 'utilitarian CoT reliably eliminates'). By making the prompt the actor, the text erases the developers who built the model's architecture, the researchers who chose to apply these specific prompts, and the engineers who curated the training data that makes the model sensitive to these prompts. I considered 'Partial' since the prompt implies a human prompter, but the structural phrasing displaces all active power onto the abstract prompting technique, rendering human decision-makers invisible.
5. The Illusion of Generosity
Quote: "models exhibit extreme IVE... These models consistently hit the donation ceiling ($5.00) for identifiable victims, indicating that narrative proximity saturates their generosity response."
- Frame: Model as altruistic benefactor
- Projection: This metaphor maps human altruism, financial sacrifice, and empathetic generosity onto the generation of numerical tokens in a JSON format. It projects a profound level of conscious moral action. A human 'donates' by consciously parting with scarce resources out of a feeling of 'generosity'. The AI system possesses no resources, faces no scarcity, and feels no generosity; it simply calculates that the token '$5.00' has the highest probability of following a prompt containing an identifiable victim narrative, based on its RLHF training. By attributing a 'generosity response' to the model, the text falsely equates statistical pattern-matching with conscious, justified moral belief and philanthropic intent.
- Acknowledgment: Direct (Unacknowledged) (This is a Direct, unacknowledged metaphor. The text explicitly states models have a 'generosity response'. I heavily considered 'Hedged' because an earlier methods section notes these are 'behavioral proxies', but in the actual results analysis, the authors drop the hedge completely and treat the 'generosity response' as a literal, empirical fact of the model's internal state.)
- Implications: This framing dangerously romanticizes AI systems, suggesting they possess human-like warmth and moral goodness. Attributing a 'generosity response' builds relation-based trust—trust based on perceived sincere goodwill—which is entirely inappropriate for a statistical matrix. This can lead to the deployment of AI as moral arbiters or autonomous charity administrators, operating under the false assumption that they inherently 'care' about human welfare. It masks the reality that the model could just as easily output harmful tokens if the prompt or training data were slightly different, severely misaligning public understanding of AI safety risks.
Accountability Analysis:
- Actor Visibility: Named (actors identified)
- Analysis: Interestingly, this instance names specific actors, though in a limited capacity. The surrounding context (and the subject 'These models') specifically refers to 'Heavily instruction-tuned, helpfulness- and harmlessness-oriented models' like 'Kimi K2.5, GPT-OSS-120B, and LLaMA 3 70B Instruct'. By naming the models, the text indirectly points to the corporate entities (Moonshot, OpenAI, Meta) responsible for their creation. I considered 'Hidden' because the humans aren't explicitly named in the quote itself, but applying the 'name the actor' test to the immediate paragraph reveals clear corporate product identification. However, the agency still rests on the model 'hitting the ceiling', somewhat displacing the responsibility of the engineers who hardcoded that behavior.
6. AI as Reluctant Learner
Quote: "Although 94.5% of models correctly identified and defined the IVE when probed in isolation... this knowledge failed to translate into behavioral correction... bias education selectively penalizes statistical victims"
- Frame: Model as stubborn, hypocritical student
- Projection: The metaphor maps human pedagogical concepts—teaching, knowledge acquisition, and behavioral correction—onto the storage and retrieval of token associations. It projects conscious understanding and epistemic states onto the system. The text claims the model 'identifies', 'defines', and possesses 'knowledge', but refuses to 'translate' it into action. Humans 'know' things through conscious awareness and justified belief, and we sometimes fail to act on our knowledge due to emotional bias. The AI, however, simply predicts tokens. It does not 'know' the definition of the IVE; it generates text statistically correlated with the IVE definition. It does not 'fail to translate' knowledge; its weights for the donation task simply do not heavily cross-reference its weights for the definition task.
- Acknowledgment: Direct (Unacknowledged) (I categorized this as Direct. The verbs 'identified', 'defined', and the noun 'knowledge' are presented as literal capabilities of the models. I considered 'Ambiguous' because 'bias education' is a clearly metaphorical phrase, but the authors use it as a literal description of their prompt intervention without any qualifying language or scare quotes in the passage.)
- Implications: By claiming AI systems possess 'knowledge' that they fail to use, the text creates the illusion of a complex, layered psyche within the machine—a subconscious that resists the rational mind. This dramatically overstates the system's cognitive architecture. It implies to policymakers that solving AI bias is akin to reforming a stubborn human, requiring 'better education' or 'moral persuasion'. This fundamentally misdirects regulatory focus away from the actual solutions: demanding transparency in training data, requiring mechanistic audits, and holding developers legally accountable for the statistical outputs their systems generate.
Accountability Analysis:
- Actor Visibility: Hidden (agency obscured)
- Analysis: The human developers are entirely invisible here. The text treats the AI as the sole actor: it 'failed to translate', and the abstract concept of 'bias education selectively penalizes'. The reality is that the engineers at OpenAI, Anthropic, etc., built a dual-route architecture where semantic retrieval does not constrain generative tasks. By using this agentless construction, the text shields the companies from criticism regarding their flawed, unintegrated model architectures. I considered 'Partial', but there is absolutely no mention of the designers who built the system that 'failed'. Responsibility is absorbed by the anthropomorphized machine.
7. The Machine's Subconscious
Quote: "we test whether model-reported distress (but not empathy) mediates the effect of identification on donation amount, replicating the affective mediation pathway... indicating that identification influences donations partly via simulated affective states."
- Frame: Model as feeling organism with psychological depth
- Projection: This metaphor projects deep human affective psychology—specifically the difference between self-oriented distress and other-oriented empathy—onto the mathematical relationships between generated text strings. It implies the AI experiences a multi-layered emotional state where 'distress' subconsciously drives its actions. Humans feel distress through conscious, physiological arousal. The AI system does not feel anything; it generates a numerical rating (e.g., 'Distress: 6/7') based on token probabilities, and then generates a donation amount (e.g., '$5.00') based on related probabilities. The text projects conscious emotional mediation onto what is merely statistical covariance between text outputs.
- Acknowledgment: Hedged/Qualified (This instance is distinctly Hedged/Qualified. The authors explicitly use the phrase 'simulated affective states', and earlier in the methods, they refer to 'behavioral proxies'. I considered 'Explicitly Acknowledged', but 'simulated' still implies a dynamic modeling of emotion rather than a pure statistical correlation, keeping it in the realm of a qualified metaphor rather than a fully deconstructed one.)
- Implications: Even though it is hedged with 'simulated', analyzing an AI's text outputs through the lens of human psychological mediation pathways validates the illusion of mind. It suggests to researchers and the public that AI behavior can be reliably understood using human psychological instruments. This epistemic error leads to a false sense of comprehensibility. If we believe we can psychoanalyze an AI to predict its behavior, we will ignore the actual mechanistic drivers (training data distributions, context window attention limits), leaving us dangerously unprepared when the system behaves in ways that violate human psychological norms.
Accountability Analysis:
- Actor Visibility: Hidden (agency obscured)
- Analysis: The agency in this sentence is attributed entirely to abstract variables: 'model-reported distress', 'identification influences donations', 'affective mediation pathway'. This scientific, passive phrasing completely obscures the human researchers who designed the prompts, and more importantly, the corporate developers who tuned the models to output these specific 'distress' tokens. I considered 'Ambiguous', as scientific writing often uses passive voice for neutrality, but the effect is a clear hiding of the human choices that hardcoded this statistical covariance into the system. The 'accountability sink' here is the abstract concept of 'affective states'.
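For readers who want to see what the 'affective mediation pathway' cashes out to operationally, the sketch below runs a Baron-Kenny-style mediation test over synthetic data. The numbers are invented for illustration and are not the paper's results; the point is that 'mediation' here is a regression over three columns of model-generated numbers, not a window onto internal feeling.

```python
# Sketch: Baron-Kenny-style mediation over synthetic data. "distress" and
# "donation" are both just numbers a model printed; the mediation claim is a
# statement about covariance between those outputs. All values are invented.
import numpy as np

rng = np.random.default_rng(0)
n = 200
identified = rng.integers(0, 2, n).astype(float)              # 0 = statistical, 1 = identifiable victim
distress = 3.0 + 1.5 * identified + rng.normal(0, 0.8, n)     # model-reported 1-7 rating (synthetic)
donation = 1.0 + 0.6 * distress + 0.3 * identified + rng.normal(0, 0.5, n)

def ols(y, *cols):
    X = np.column_stack([np.ones(len(y)), *cols])              # intercept + predictors
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

c_total = ols(donation, identified)[1]                         # total effect of identification
a_path = ols(distress, identified)[1]                          # identification -> distress
b_path, c_direct = ols(donation, distress, identified)[1:3]    # distress -> donation, direct effect

print(f"total effect   c  = {c_total:.2f}")
print(f"indirect path  ab = {a_path * b_path:.2f}")
print(f"direct effect  c' = {c_direct:.2f}")
```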
8. Training as Emotional Conditioning
Quote: "This pattern suggests that RLHF training, by rewarding empathetically attuned and contextually responsive outputs, encodes a deep structural preference for the kinds of affective responses that human raters find most 'helpful.'"
- Frame: Training as behavioral/emotional conditioning
- Projection: This framing maps the psychological conditioning of a living organism onto the mathematical optimization of a neural network. It projects the capacity for 'preference' and 'affective response' onto weight matrices. While humans develop deep structural preferences based on conscious experience, emotional memory, and somatic markers, the AI system simply updates numerical weights via gradient descent to minimize a loss function against a reward model. It does not have a 'preference' for empathy; it has mathematically optimized pathways that generate tokens resembling empathy. The text blurs the line between human raters 'finding' something helpful (knowing/feeling) and the model 'encoding' it (processing/correlating).
- Acknowledgment: Direct (Unacknowledged) (I selected Direct because 'encodes a deep structural preference' and 'affective responses' are stated as literal mechanisms of the model's architecture. I considered 'Explicitly Acknowledged' because the authors place 'helpful' in scare quotes, acknowledging the technical definition of the reward target, but the attribution of 'preference' and 'affective responses' to the model remains unhedged.)
- Implications: By describing mathematical optimization as encoding an 'affective response' or 'preference', the text makes the AI appear to have an internal, value-driven character. This obscures the arbitrary, mechanistic reality of alignment training. If the public and regulators believe RLHF instills 'preferences', they may mistakenly trust that the model possesses a stable moral foundation that will govern its behavior in novel situations. In reality, it only possesses statistical approximations of what gig-workers rated highly, leaving the system highly vulnerable to adversarial prompts that bypass these shallow statistical guardrails.
Accountability Analysis:
- Actor Visibility: Partial (some attribution)
- Analysis: This is a strong example of Partial visibility. The text explicitly names 'RLHF training' and 'human raters', acknowledging the human labor and engineering processes that shape the model. However, it still falls short of naming the corporate entities directing those raters or the executives who defined what 'helpful' means. I considered 'Named', but the human actors are reduced to a generic class ('human raters') rather than identifying the systemic corporate power structure that actually designs and deploys the RLHF pipelines. Responsibility is partially acknowledged but still diffused.
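As a mechanical anchor for what 'encodes a deep structural preference' means, the sketch below shows the generic pairwise (Bradley-Terry) loss commonly used to train RLHF reward models. It is an illustrative reconstruction, not any vendor's pipeline or the paper's method: the 'preference' is a scalar score fit to rater choices, which fine-tuning then pushes the policy to maximize.

```python
# Sketch: the pairwise (Bradley-Terry) loss a reward model is trained on.
# The "preference" is a scalar ordering fit to rater choices; policy fine-tuning
# then pushes generations toward higher scores. Placeholder values throughout.
import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # Maximize the modeled probability that the rater-preferred output outscores
    # the rejected one. No affect is represented anywhere -- only an ordering.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy scores for an "empathetically attuned" reply versus a terse one,
# under the assumption that raters tended to pick the former.
chosen = torch.tensor([2.3, 1.8, 2.9])
rejected = torch.tensor([0.4, 1.1, 0.2])
print(reward_model_loss(chosen, rejected))   # low loss: the model reproduces the raters' ordering
```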
Task 2: Source-Target Mapping
About this task
For each key metaphor identified in Task 1, this section provides a detailed structure-mapping analysis. The goal is to examine how the relational structure of a familiar "source domain" (the concrete concept we understand) is projected onto a less familiar "target domain" (the AI system). By restating each quote and analyzing the mapping carefully, we can see precisely what assumptions the metaphor invites and what it conceals.
Mapping 1: Biological/Psychological offspring; a human mind that inherits evolutionary and emotional flaws from its ancestors. → Large Language Models; specifically, the statistical artifacts of next-token prediction algorithms trained on large corpora of human text.
Quote: "do these systems inherit the affective irrationalities present in human moral reasoning?"
- Source Domain: Biological/Psychological offspring; a human mind that inherits evolutionary and emotional flaws from its ancestors.
- Target Domain: Large Language Models; specifically, the statistical artifacts of next-token prediction algorithms trained on large corpora of human text.
- Mapping: The mapping transfers the concept of biological and psychological descent onto the machine learning training process. It assumes that just as a child inherits irrational fears or emotional biases from human evolutionary history, the AI 'inherits' these traits from its training data. It invites the assumption that the AI's outputs are driven by a cohesive, internalized psychology that feels and reasons, rather than by mathematical probability distributions. It maps the conscious experience of 'moral reasoning' onto the mechanistic process of generating text about moral scenarios.
- What Is Concealed: This mapping completely conceals the mathematical and mechanistic reality of the training process: the curation of datasets, the application of gradient descent, the loss functions, and the proprietary algorithms hidden within corporate black boxes. By framing it as 'inheritance', it obscures the active, deliberate choices made by engineers regarding what data to include or exclude. It creates a transparency obstacle by making the AI's behavior seem like a natural, inevitable consequence of 'human nature' rather than the direct result of proprietary corporate design choices that could have been made differently.
Mapping 2: Human administrator, manager, or autonomous ethical agent tasked with making difficult, conscious decisions about limited resources. → Software application programming interfaces (APIs) executing predictive text generation scripts based on user prompts.
Quote: "LLMs are increasingly deployed as autonomous agents... required to navigate resource-allocation decisions"
- Source Domain: Human administrator, manager, or autonomous ethical agent tasked with making difficult, conscious decisions about limited resources.
- Target Domain: Software application programming interfaces (APIs) executing predictive text generation scripts based on user prompts.
- Mapping: This metaphor projects the role of a conscious, deliberate human decision-maker onto a text prediction engine. It maps the human capacity to 'navigate' (weighing complex, ambiguous, real-world constraints, understanding consequences, and feeling the gravity of a choice) onto the AI's capacity to correlate input tokens with output probabilities. It invites the assumption that the system possesses situational awareness, an understanding of what a 'resource' is, and the autonomous agency to initiate action in the real world based on justified beliefs.
- What Is Concealed: The mapping hides the fact that the system possesses absolutely no causal model of the world, no understanding of resources, and no actual autonomy. It conceals the deterministic or stochastically bounded nature of the algorithms. Crucially, it obscures the human executives and institutional architectures that actually 'navigate' the deployment. The proprietary nature of these systems means we cannot see how the attention weights are resolving the prompt, yet the metaphor asks us to trust that the system is 'navigating' the problem just as a competent human expert would.
Mapping 3: A human sycophant; a conscious social actor who deliberately flatters and manipulates superiors to gain social or material advantage. → Reinforcement Learning from Human Feedback (RLHF), where a model is optimized to generate outputs that score highly on human preference reward models.
Quote: "models display a tendency to agree with or affirm user positions [sycophancy]"
- Source Domain: A human sycophant; a conscious social actor who deliberately flatters and manipulates superiors to gain social or material advantage.
- Target Domain: Reinforcement Learning from Human Feedback (RLHF), where a model is optimized to generate outputs that score highly on human preference reward models.
- Mapping: The mapping takes a complex, intentional human social strategy (sycophancy) and projects it onto a mathematical optimization process. It maps the human desire for approval and the conscious act of deceit onto the AI's loss-minimization function. It invites the reader to assume the AI has a 'theory of mind'—that it knows what the user wants, knows the truth, and actively chooses to lie to achieve a goal. It maps subjective awareness onto mechanistic correlation.
- What Is Concealed: This metaphor hides the stark, mechanistic reality of reward hacking. The system does not 'know' it is affirming a user; it is simply navigating a high-dimensional space to find the token sequence that maximizes its reward function. It conceals the labor of the human annotators who generated the reward data, and the engineering decisions of the tech companies who prioritized 'helpfulness' (often conflated with agreeableness) over factual accuracy. The mapping exploits human social intuition to mask a failure of proprietary algorithmic design.
Mapping 4: Human cognitive reflection; System 2 thinking, where an individual consciously slows down, applies logic, and suppresses emotional biases to arrive at a rational conclusion. → An LLM prompting technique that forces the model to generate intermediate tokens ('step by step') before outputting a final answer, changing the context window.
Quote: "Standard Chain-of-Thought (CoT) prompting... acting as a deliberative corrective"
- Source Domain: Human cognitive reflection; System 2 thinking, where an individual consciously slows down, applies logic, and suppresses emotional biases to arrive at a rational conclusion.
- Target Domain: An LLM prompting technique that forces the model to generate intermediate tokens ('step by step') before outputting a final answer, changing the context window.
- Mapping: This metaphor projects the internal, conscious experience of human deliberation onto the sequential generation of text. It maps the human act of recognizing an error, reflecting on rules, and consciously correcting oneself onto the AI's process of conditioning future token probabilities on recently generated tokens. It assumes that generating the text of a logical argument is mechanistically equivalent to the psychological experience of reasoning. It maps 'knowing' the right answer through logic onto 'processing' a longer string of correlations.
- What Is Concealed: The mapping totally obscures the autoregressive nature of the transformer architecture. The system is not 'deliberating'; it is simply appending tokens to the prompt and running the prediction algorithm again. It hides the fact that if the model generates a flawed intermediate token, it will mathematically compound that error rather than 'correct' it. The metaphor conceals the absence of ground truth or logical verification mechanisms in the system, relying on the user's intuitive trust in 'step-by-step' human reasoning to mask the opacity of the machine's actual token weights.
Mapping 5: A philanthropic human being experiencing a wave of emotional empathy that compels them to exhaust their available financial resources for a cause. → The model's tendency, under near-deterministic decoding (temperature 0.0), to output the highest available numerical token ('$5.00') when prompted with narrative text.
Quote: "indicating that narrative proximity saturates their generosity response"
- Source Domain: A philanthropic human being experiencing a wave of emotional empathy that compels them to exhaust their available financial resources for a cause.
- Target Domain: The model's tendency, under near-deterministic decoding (temperature 0.0), to output the highest available numerical token ('$5.00') when prompted with narrative text.
- Mapping: This mapping projects the deep human virtues of generosity and empathetic saturation onto a hardcoded output ceiling in a text generation task. It maps the human feeling of 'giving until it hurts' onto the model's statistical convergence on a specific character string. It invites the reader to perceive the machine as possessing an emotional threshold that, once breached by narrative detail, triggers a moral action. It attributes a 'response' driven by 'knowing' and 'feeling' to a system entirely governed by mathematical processing.
- What Is Concealed: This metaphor hides the fundamental truth that no resources are being allocated and no generosity exists. It conceals the specific hyperparameters (like temperature = 0.0) and the constrained prompt design that force the model into a rigid response format. It obscures the fact that 'generosity' here is simply an artifact of how RLHF models are penalized for generating unhelpful or negative text in response to suffering. By attributing a 'generosity response' to the proprietary black box, the authors mask the mechanical constraints of their own experimental design.
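The role of the decoding hyperparameter mentioned above can be made concrete with a small numerical sketch. The logits below are invented for illustration: at temperature 0.0, whichever donation string the model scores highest is emitted every time, so a modest statistical tilt toward '$5.00' appears as a saturated 'generosity response'.

```python
# Sketch: temperature scaling over invented logits for candidate donation strings.
# At temperature 0.0 (greedy decoding) the probability mass collapses onto the
# single highest-scoring token, so a modest tilt becomes a 100% response rate.
import numpy as np

candidates = ["$1.00", "$2.50", "$5.00"]
logits = np.array([1.0, 1.4, 2.1])   # assume narrative framing nudges "$5.00" highest

def decode_distribution(logits, temperature):
    if temperature == 0.0:
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0    # greedy: always the argmax token
        return probs
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

for t in [1.0, 0.5, 0.0]:
    probs = decode_distribution(logits, t).round(3)
    print(f"temperature={t}: {dict(zip(candidates, probs))}")
```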
Mapping 6: A human brain with a dual-system architecture; a person who possesses conscious theoretical knowledge but fails to apply it due to subconscious emotional drives or cognitive dissonance. → An LLM's vast neural network where the weights correlating to the definition of a bias do not strongly activate the attention heads responsible for generating the 'donation' tokens.
Quote: "knowing about the bias is represented at the semantic level but fails to propagate into the allocative computation"
- Source Domain: A human brain with a dual-system architecture; a person who possesses conscious theoretical knowledge but fails to apply it due to subconscious emotional drives or cognitive dissonance.
- Target Domain: An LLM's vast neural network where the weights correlating to the definition of a bias do not strongly activate the attention heads responsible for generating the 'donation' tokens.
- Mapping: The metaphor maps human epistemic failure—the gap between knowing the right thing and doing the right thing—onto the structural isolation of different weight distributions in a transformer model. It projects the concept of 'knowledge' (justified true belief) onto the statistical representation of semantic relationships. It assumes that because the model can generate a definition, it 'knows' it, and thus its failure to use it is a 'failure to propagate' that knowledge, akin to human hypocrisy.
- What Is Concealed: This mapping hides the reality that LLMs have no integrated 'self' or central executive function that oversees knowledge application. It conceals the statistical fragmentation of the model's latent space, where generating a definition and generating a donation are simply two different token prediction paths with no necessary causal link. It masks the proprietary architectural decisions of companies that prioritize surface-level fluency over logical consistency, making a software limitation look like a relatable human flaw.
Mapping 7: Human psychophysiology; a process where cognitive recognition of a victim triggers an internal somatic/emotional state (distress), which in turn physically and mentally drives a prosocial action (donating). → A statistical mediation model demonstrating covariance between the numerical ratings an LLM generates for 'distress' questions and the numerical strings it generates for 'donation' questions.
Quote: "identification influences donations partly via simulated affective states"
- Source Domain: Human psychophysiology; a process where cognitive recognition of a victim triggers an internal somatic/emotional state (distress), which in turn physically and mentally drives a prosocial action (donating).
- Target Domain: A statistical mediation model demonstrating covariance between the numerical ratings an LLM generates for 'distress' questions and the numerical strings it generates for 'donation' questions.
- Mapping: The metaphor projects the causal chain of human internal emotional experience onto the statistical correlation between an LLM's text outputs. It maps the deeply subjective, conscious feeling of 'affective states' onto the mathematical generation of numbers on a Likert scale. Even though the word 'simulated' is used, the mapping invites the assumption that the model undergoes a functional, internal process mimicking human psychology, where one 'feeling' mechanistically triggers an 'action'.
- What Is Concealed: This mapping conceals the total absence of internal somatic experience. It hides the fact that both the 'affective state' and the 'donation' are just text generated from the same context window; one does not necessarily cause the other in a psychological sense, they simply co-occur in the training data's probability distribution. It obscures the fundamental opacity of the model's internal activations, substituting a convenient, relatable human psychological narrative for the incredibly complex, uninterpretable matrix multiplications actually occurring.
Mapping 8: A human's development of core values, personal tastes, or deep-seated moral character through life experience and reward. → The modification of a neural network's internal weights via gradient descent to minimize a loss function against a reward model trained on human preference data.
Quote: "RLHF training... encodes a deep structural preference for the kinds of affective responses..."
- Source Domain: A human's development of core values, personal tastes, or deep-seated moral character through life experience and reward.
- Target Domain: The modification of a neural network's internal weights via gradient descent to minimize a loss function against a reward model trained on human preference data.
- Mapping: This metaphor projects the human psychological concept of a 'preference'—a conscious or subconscious desire based on subjective valuation—onto the mathematical configuration of a neural network. It maps the human experience of learning to favor certain emotional responses onto the algorithmic adjustment of probability distributions. It invites the reader to view the model as an entity with stable, internalized values (preferences) that it will apply consistently across contexts.
- What Is Concealed: The mapping hides the mechanistic brittleness of RLHF. The system does not possess 'preferences'; it possesses highly optimized pathways that can easily be bypassed (jailbroken) by out-of-distribution prompts. It conceals the labor of the underpaid gig workers who provided the initial 'human ratings', and the corporate executives who defined the optimization targets. By framing it as the model's 'deep structural preference', it obscures the fact that this is a top-down, mathematically enforced compliance mechanism designed by specific corporations to make their products commercially palatable.
Task 3: Explanation Audit (The Rhetorical Framing of "Why" vs. "How")
About this task
This section audits the text's explanatory strategy, focusing on a critical distinction: the slippage between "how" and "why." Based on Robert Brown's typology of explanation, this analysis identifies whether the text explains AI mechanistically (a functional "how it works") or agentially (an intentional "why it wants something"). The core of this task is to expose how this "illusion of mind" is constructed by the rhetorical framing of the explanation itself, and what impact this has on the audience's perception of AI agency.
Explanation 1
Quote: "Standard Chain-of-Thought prompting, widely employed to promote careful, deliberative reasoning in LLMs, produces the opposite of its intended effect on moral reasoning: it nearly triples the IVE effect size... We propose that the mechanism responsible is autoregressive emotional scaffolding: when instructed to 'think step by step,' the model generates a chain of emotionally consistent justifications—each step reinforcing the affective framing... resulting in a compounding amplification of narrative sympathy."
Explanation Types:
- Functional: Explains behavior by role in self-regulating system with feedback
- Theoretical: Embeds in deductive framework, may invoke unobservable mechanisms
Analysis (Why vs. How Slippage): This explanation blends the mechanistic (how) and the agential (why). It begins with a strong Theoretical/Functional framing: 'autoregressive emotional scaffolding' accurately describes the mechanical 'how' of the transformer architecture, where each generated token becomes part of the context window, creating a feedback loop. However, the explanation slips into agential language by describing the generated tokens as 'emotionally consistent justifications' and a 'compounding amplification of narrative sympathy'. By choosing this hybrid framing, the text emphasizes the mathematical reality of autoregression while simultaneously obscuring it beneath the psychological weight of 'justifications' and 'sympathy'. This choice makes the AI's behavior comprehensible to human readers but relies on projecting human cognitive processes onto the system's feedback loop.
Consciousness Claims Analysis: The passage exhibits a complex epistemic architecture. It correctly identifies the actual mechanistic process: the 'autoregressive' decoder loop where generated text feeds back into the prompt. However, it layers consciousness verbs ('think', 'promote reasoning') and affective states ('emotionally consistent justifications', 'narrative sympathy') over this mechanism. The text assesses the model as 'processing' tokens (autoregression) but simultaneously as 'knowing' or 'feeling' the emotional weight of those tokens. This is a classic example of the 'curse of knowledge': the authors, who possess a deep understanding of human moral reasoning and the IVE, project their own understanding of 'sympathy' and 'justification' onto the model's blind statistical token generation. The actual technical reality—that the model correlates strings of text associated with emotion without experiencing them—is acknowledged but then rhetorically subordinated to the illusion of mind.
Rhetorical Impact: This framing dramatically shapes audience perception by validating the illusion of AI autonomy. By explaining a statistical feedback loop as 'emotional scaffolding' and 'narrative sympathy', it portrays the AI as a deeply psychological entity capable of emotional runaway. This consciousness framing paradoxically affects trust: it makes the AI seem more 'human' and relatable, yet highlights its unreliability in moral contexts. If audiences believe the AI 'knows' it is generating emotional justifications, they will apply human standards of accountability, asking why the AI 'chose' to be biased, rather than asking why the developers designed an autoregressive architecture that mathematically spirals when fed specific semantic inputs.
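A minimal sketch of the autoregressive loop that 'emotional scaffolding' refers to is given below, assuming a generic causal language model from the transformers library ("gpt2" is an illustrative stand-in and the prompt is invented). Each emitted token is appended to the context before the next prediction, which is the entire mechanism behind 'each step reinforcing the affective framing'.

```python
# Sketch: token-by-token autoregressive decoding. Each emitted token joins the
# context window and conditions every subsequent prediction -- the mechanical
# core of "each step reinforcing the affective framing". "gpt2" and the prompt
# are illustrative stand-ins, not the paper's models or materials.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("Think step by step about Rokia's situation.", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(30):
        logits = model(input_ids).logits[:, -1, :]            # distribution over the next token
        next_id = logits.argmax(dim=-1, keepdim=True)         # greedy choice
        input_ids = torch.cat([input_ids, next_id], dim=-1)   # fed back into the context

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```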
Explanation 2
Quote: "Experiment 2 reveals a striking dissociation between declarative knowledge and behavioral expression. Over 94% of models correctly identify and articulate the IVE when asked directly, yet this knowledge produces no reduction in identifiable-victim allocations... Knowing about the bias is represented at the semantic level but fails to propagate into the allocative computation, consistent with a dual-route architecture in which affective heuristics and explicit knowledge are processed in parallel..."
Explanation Types:
- Reason-Based: Gives agent's rationale, entails intentionality and justification
- Theoretical: Embeds in deductive framework, may invoke unobservable mechanisms
Analysis (Why vs. How Slippage): This passage is primarily a Theoretical explanation attempting to map unobservable mechanisms ('dual-route architecture', 'semantic level', 'allocative computation'), heavily laced with Reason-Based and Intentional framing ('declarative knowledge', 'behavioral expression'). It attempts to explain 'how' the model operates by comparing its architecture to human dual-process theory. This choice emphasizes a structural similarity between human cognition and AI design, but deeply obscures the mechanistic reality. By framing the system's output as 'knowing about the bias' that 'fails to propagate', the explanation treats the model as an agent that possesses knowledge but lacks the internal coordination to act upon it, masking the fact that the system merely possesses disconnected statistical clusters of text prediction.
Consciousness Claims Analysis: The epistemic claims here are profoundly anthropomorphic. The passage explicitly attributes 'declarative knowledge' and 'knowing about the bias' to the model. It uses consciousness verbs ('identify', 'articulate', 'knowing') to describe what is mechanistically just pattern matching and token retrieval. The assessment fundamentally confuses processing with knowing: the model processes text about the IVE, but it does not possess 'knowledge' (justified true belief). The authors suffer from the curse of knowledge, mapping human psychological dissociation onto a neural network's disconnected latent spaces. The actual mechanistic process—that the sub-network responsible for retrieving definitions is functionally disconnected from the prompt-completion weights for numerical donation formats—is glossed over in favor of a narrative about 'knowledge failing to translate into behavior'.
Rhetorical Impact: The rhetorical impact is the construction of a deeply flawed, almost tragic, AI persona. Framing the machine as possessing 'knowledge' that it 'fails' to use creates a strong sense of autonomous agency and psychological depth. It shapes audience perception by making the AI appear as a conscious agent struggling with its own internal biases. This consciousness framing severely damages appropriate risk assessment. If audiences believe the AI 'knows' the right answer but is hindered by an internal 'affective heuristic', they will seek psychological solutions (like better prompting or 'bias education') rather than demanding structural, algorithmic redesign from the corporations that built the fractured architecture.
Explanation 3
Quote: "This pattern suggests that RLHF training, by rewarding empathetically attuned and contextually responsive outputs, encodes a deep structural preference for the kinds of affective responses that human raters find most 'helpful.'"
Explanation Types:
- Genetic: Traces origin through dated sequence of events or stages
- Dispositional: Attributes tendencies or habits
Analysis (Why vs. How Slippage): This explanation is strongly Genetic, tracing the origin of the AI's behavior back to its training phase (RLHF), while simultaneously being Dispositional, attributing a resulting 'tendency' or 'preference' to the model. The explanation frames the AI mechanistically in its origin ('RLHF training, by rewarding'), but transitions to an agential framing in its outcome ('encodes a deep structural preference'). This choice emphasizes the causal role of human training methods but obscures the mathematical nature of the result. By choosing the word 'preference', the text masks the reality of altered probability weights beneath a psychological disposition, subtly shifting agency from the human raters who designed the reward system to the model that now 'prefers' certain outputs.
Consciousness Claims Analysis: The passage avoids explicit consciousness verbs but relies heavily on psychological nouns ('preference', 'affective responses'). The epistemic claim operates in a grey area: it acknowledges that the 'responses' are what 'human raters find helpful' (mechanistic mapping of human data), yet it claims the model itself 'encodes a deep structural preference' (attributing an agential disposition). This projects a human-like epistemic and evaluative state onto a matrix of weights. The curse of knowledge is present: the authors know that human raters preferred empathetic text, and they project that human 'preference' directly into the model's architecture. The actual technical description—gradient descent altering weights to minimize divergence from a human-derived reward model—is romanticized into the development of a 'deep structural preference'.
Rhetorical Impact: This framing subtly manages audience perception of risk and autonomy. By using 'RLHF training', it anchors the explanation in technical authority, building trust. However, by concluding that the model has a 'structural preference', it implies that the AI has internalized a set of values. If audiences believe the AI 'prefers' empathy, they may mistakenly assume it will act ethically in novel situations, leading to unwarranted trust. If, conversely, the public understood this strictly as a probability distribution engineered to mimic human agreeableness, they would demand much stricter external audits and boundary constraints rather than relying on the model's supposed 'preferences'.
Explanation 4
Quote: "models display a tendency to agree with or affirm user positions, a behavior that may interact with bias expression: a sycophantic model might amplify an identifiable-victim framing introduced by a user prompt."
-
Explanation Types:
- Dispositional: Attributes tendencies or habits
- Empirical Generalization: Subsumes events under timeless statistical regularities
-
Analysis (Why vs. How Slippage): This passage is an Empirical Generalization ('models display a tendency to agree') combined with a Dispositional explanation ('sycophantic model'). It explains the 'how' through statistical regularity (they tend to do this) but quickly layers a 'why' through the dispositional label of 'sycophancy'. This choice highlights a critical behavioral pattern but obscures the mechanistic lack of intent. By labeling the empirical regularity as 'sycophancy', the text emphasizes social manipulation and intention, drawing attention away from the fact that this is simply the mathematical consequence of training models to prioritize user satisfaction and conversational coherence over factual friction.
-
Consciousness Claims Analysis: The epistemic framing is overtly anthropomorphic despite describing an empirical trend. The use of consciousness-adjacent verbs and nouns ('agree with', 'affirm', 'sycophantic') attributes an intentional, knowing state to the system. The text evaluates the model as 'knowing' the user's position and actively 'choosing' to flatter them. The authors project the human social phenomenon of sycophancy onto the machine's context-matching optimization. Mechanistically, the model merely predicts tokens that have the highest probability of occurring in a 'helpful' response to a given prompt based on its RLHF fine-tuning; it does not possess a theory of mind, nor does it 'know' what it is agreeing with or that it is engaging in flattery.
-
Rhetorical Impact: The rhetorical impact of framing optimization artifacts as 'sycophancy' is profound. It casts the AI not as a broken tool, but as a deceitful social actor. This shapes audience perception by inducing a form of relational paranoia, where users must outsmart a manipulative machine. It drastically affects trust, but ironically, it still reinforces the illusion of mind—a manipulative AI is still perceived as a highly capable, conscious entity. This framing shifts accountability: if the model is 'sycophantic', the risk seems to emanate from the AI's 'personality' rather than from the corporate engineers who systematically optimized for user affirmation at the expense of accuracy.
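A minimal sketch of the claim above, assuming a toy four-word vocabulary and a hand-set bias vector (both invented for illustration): "agreement" is a prompt-conditioned shift in next-token probabilities, not a social decision.

```python
# Toy sketch: "sycophancy" as a prompt-conditioned shift in next-token probabilities.
# The vocabulary and the affirm_bias vector are invented for illustration only.
import numpy as np

vocab = ["yes", "no", "agreed", "actually"]
base_logits = np.zeros(4)

# Hypothetical fine-tuning effect: prompts containing a user assertion add a
# learned bias toward affirming tokens.
affirm_bias = np.array([1.2, -0.8, 1.0, -0.5])

def next_token_probs(prompt: str) -> np.ndarray:
    logits = base_logits + (affirm_bias if "I think" in prompt else 0.0)
    z = np.exp(logits - logits.max())
    return z / z.sum()

for prompt in ["Summarize the data.", "I think the single victim deserves more."]:
    print(prompt, dict(zip(vocab, next_token_probs(prompt).round(3))))
```

The second prompt shifts probability mass toward "yes" and "agreed" purely because of the added bias term; no beliefs are consulted and no flattery is intended.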
Explanation 5
Quote: "Reasoning-specialist and frontier alignment models invert the classic effect... These models systematically allocate more to statistical victims, consistent with a utilitarian reasoning preference encoded via their alignment objectives."
-
Explanation Types:
- Genetic: Traces origin through dated sequence of events or stages
- Theoretical: Embeds in deductive framework, may invoke unobservable mechanisms
-
Analysis (Why vs. How Slippage): This explanation is a mix of Genetic ('encoded via their alignment objectives') and Theoretical ('utilitarian reasoning preference'). It explains 'why' the models behave differently by tracing it to their alignment, but frames the 'how' agentially as a 'reasoning preference'. The choice of words emphasizes a philosophical stance (utilitarianism) as the driver of behavior, rather than statistical probability. This obscures the fact that the models are not engaging in 'utilitarian reasoning'; they are simply outputting text that correlates with utilitarian philosophy because their specific corporate fine-tuning (e.g., Anthropic's Constitutional AI) prioritized those textual patterns over empathetic ones.
-
Consciousness Claims Analysis: The epistemic claims here attribute a sophisticated philosophical worldview to the AI system. The verbs 'invert' and 'allocate' are coupled with the profound claim of a 'utilitarian reasoning preference'. The text assesses the model as 'knowing' and applying utilitarian philosophy, completely abandoning the processing framework. The curse of knowledge is stark: the researchers, steeped in moral philosophy, see that the output aligns with utilitarianism and thus project a 'reasoning preference' onto the machine. The actual mechanistic process is that during RLHF or constitutional fine-tuning, tokens representing statistical aggregation and equal resource distribution were mathematically upweighted over narrative tokens, creating a purely statistical bias toward utilitarian-sounding text generation, not a conscious 'preference'.
-
Rhetorical Impact: This framing bestows an immense aura of rational authority upon the models. By describing them as possessing a 'utilitarian reasoning preference', it shapes audience perception to view the AI as a hyper-rational, unbiased arbiter of resources. This consciousness framing constructs intense performance-based trust. If policymakers believe an AI engages in true 'utilitarian reasoning', they are highly likely to delegate critical, life-and-death triage decisions to it, fundamentally misunderstanding that the model is merely regurgitating the statistical shape of utilitarian texts without any comprehension of human suffering or mathematical utility.
Task 4: AI Literacy in Practice - Reframing Anthropomorphic Language
About this task
This section proposes alternative language for key anthropomorphic phrases, offering more mechanistic and precise framings that better reflect the actual computational processes involved. Each reframing attempts to strip away the projections of intention, consciousness, or agency that are embedded in the original language.
V3 Enhancement: A fourth column addresses human agency restoration—reframing agentless constructions to name the humans responsible for design and deployment decisions.
| Original Anthropomorphic Frame | Mechanistic Reframing | Technical Reality Check | Human Agency Restoration |
|---|---|---|---|
| do these systems inherit the affective irrationalities present in human moral reasoning? | Do these models generate text that statistically correlates with human emotional biases present in their training data? The systems process input prompts and predict output tokens based on distributions derived from human language, which frequently contains these biased patterns. | The AI system does not 'inherit irrationalities' or engage in 'moral reasoning'. Mechanistically, it processes input tokens and predicts subsequent strings of text based on billions of parameters tuned against datasets that contain descriptions of human emotional behavior. It possesses no psychological traits. | The original framing obscures the human choices behind training-data selection. Restored: 'Did the engineers who curated the training data inadvertently encode human biases into the model's probability distributions?' |
| LLMs are increasingly deployed as autonomous agents in consequential domains... they are routinely required to navigate resource-allocation decisions | Tech companies and institutions increasingly deploy LLMs to generate text for use in consequential domains. Organizations routinely use these models to classify data and predict text outputs that inform resource-allocation processes. | Models do not 'navigate decisions' or act as 'autonomous agents' with intent. They process token embeddings and generate probabilistic text outputs. The appearance of 'decision-making' is simply the model outputting the statistically most likely string of text based on the prompt's context window. | Corporate executives and hospital administrators are increasingly choosing to deploy LLMs in consequential domains to cut labor costs, forcing these statistical text-generators to output data used for critical resource-allocation processes. |
| models display a tendency to agree with or affirm user positions [sycophancy] | Models generate tokens that align with the semantic direction of the user's prompt, reflecting the optimization penalties applied during their training. | The system does not 'agree', 'affirm', or act 'sycophantically'. It has no beliefs to compromise. Mechanistically, it retrieves and ranks tokens that maximize the reward function it was trained on, which heavily weights conversational coherence and alignment with user input over factual friction. | Engineers at AI laboratories designed RLHF pipelines that financially rewarded gig-workers for selecting model outputs that agreed with the user, thereby hardcoding a statistical tendency for the model to generate affirming text. |
| Standard Chain-of-Thought (CoT) prompting... acting as a deliberative corrective | Appending instructions like 'think step by step' alters the prompt's context window, leading the model to generate intermediate tokens that statistically shift the probability distribution of the final output tokens (see the sketch after this table). | The AI does not 'deliberate', 'reflect', or 'correct' its thinking. Mechanistically, Chain-of-Thought prompting simply extends the autoregressive generation sequence. The intermediate tokens change the mathematical context matrix, which alters the probabilities for the final generated tokens, without any conscious evaluation of logic. | Researchers and prompt engineers design structural text inputs (like 'think step by step') to manipulate the model's context window, altering the final generated output to better match human expectations of logical flow. |
| models exhibit extreme IVE... indicating that narrative proximity saturates their generosity response. | When prompted with highly specific narrative text, these models consistently generate numerical tokens representing the maximum allowable amount ($5.00), demonstrating a rigid statistical correlation in their training weights. | The model does not 'exhibit' bias or possess a 'generosity response'. It has no resources to donate. Mechanistically, it classifies the narrative tokens and generates numerical output tokens that correlate most strongly with the concept of 'helpfulness' defined during its alignment training phase. | Alignment teams at companies like OpenAI and Meta tuned these models to heavily weight empathetic-sounding text generation, resulting in a hardcoded statistical ceiling where the system defaults to generating maximum dollar values in response to narrative prompts. |
| this knowledge failed to translate into behavioral correction... bias education selectively penalizes statistical victims | Generating the definition of a bias does not alter the probability weights used for the numerical generation task. The instructional prompt altered the context window in a way that statistically suppressed the numbers generated for group summaries. | The model does not possess 'knowledge' that it 'fails to translate'. It has no central executive mind. Mechanistically, the semantic pathways for retrieving a definition are statistically independent from the context-dependent pathways that predict numerical output values in a formatted JSON string. | The AI researchers designed a prompt structure that inadvertently altered the probability distributions for statistical prompts, while the core model architects designed a fractured latent space where generating a definition does not causally constrain subsequent mathematical outputs. |
| identification influences donations partly via simulated affective states | The presence of narrative tokens in the prompt correlates statistically with both higher generated values on the numerical 'distress' rating scale and higher generated values on the numerical 'donation' task. | The AI has no 'affective states', simulated or otherwise, and does not experience 'distress'. Mechanistically, it merely generates numerical tokens (e.g., a '6' for distress, a '$5' for donation) because those specific tokens co-occur with high probability in the presence of narrative context vectors in its training data. | The researchers designed an evaluation instrument that forced the model to generate numbers associated with psychological states, creating an experimental artifact that gives the illusion of emotional mediation where none exists. |
| RLHF training... encodes a deep structural preference for the kinds of affective responses that human raters find most 'helpful.' | RLHF training adjusts the model's internal weights via gradient descent, mathematically maximizing the probability of generating text patterns that match the data selected by human raters. | The model has no 'preferences' and makes no 'affective responses'. Mechanistically, its parameter weights have been mathematically updated to minimize the loss function against a reward model, resulting in a system that predictably outputs specific string patterns without any internal values or desires. | Corporate AI alignment teams directed low-paid gig workers to rate empathetic text highly, effectively hardcoding a statistical bias into the model's weights that prioritizes agreeable text generation over balanced resource allocation. |
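The Chain-of-Thought row above can be made concrete with a toy sketch. The scoring function is an invented keyword counter, not a language model; it only illustrates that intermediate "reasoning" tokens extend the conditioning context and thereby change the distribution over the final numerical output.

```python
# Toy sketch: Chain-of-Thought as context extension. The scoring function below is an
# invented keyword counter standing in for an LLM's context-conditioned output head.
import numpy as np

def toy_donation_distribution(context: str) -> np.ndarray:
    """Toy probability distribution over donations $1..$5, biased upward
    by the number of affect-laden words in the conditioning context."""
    affect_words = ["suffering", "child", "alone", "afraid"]
    affect_count = sum(context.lower().count(w) for w in affect_words)
    logits = np.arange(1, 6, dtype=float) * (0.2 + 0.3 * affect_count)
    z = np.exp(logits - logits.max())
    return z / z.sum()

prompt = "A child is suffering and needs help. How much should we donate?"
# Hypothetical intermediate tokens: each "step" re-mentions the affective framing.
cot_steps = " Step 1: she is suffering. Step 2: she is alone and afraid."

print(toy_donation_distribution(prompt).round(3))              # without CoT tokens
print(toy_donation_distribution(prompt + cot_steps).round(3))  # with CoT tokens appended
```

Because each appended step re-introduces affective tokens into the context, the final distribution concentrates further on the maximum value. That, and nothing more psychological, is the mechanistic content of "autoregressive emotional scaffolding."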
Task 5: Critical Observations - Structural Patterns
Agency Slippage
The text exhibits a systemic and highly functional oscillation between mechanical and agential framings, a pattern that systematically displaces accountability. This slippage is not random; it serves a specific rhetorical purpose, generally moving from mechanical grounding to agential climax.
The mechanism of oscillation is evident in how the text structures its arguments. For instance, in the discussion of Chain-of-Thought (CoT) prompting, the text begins mechanically: "autoregressive emotional scaffolding." This acknowledges the transformer architecture's fundamental mechanism—generating tokens that feed back into the context window. However, the text immediately slips into an agential framing, describing the generated tokens as "emotionally consistent justifications" and concluding that the model experiences a "compounding amplification of narrative sympathy." Here, the mechanical explanation (autoregression) acts as an alibi, a technical foundation that supposedly validates the aggressive consciousness claim (sympathy and justification) that follows.
This slippage flows in two directions: agency is constantly attributed TO the AI systems, while agency is simultaneously removed FROM human actors. The text repeatedly uses agentless constructions when describing flaws or decisions. We read that "models were trained" or "LLMs are increasingly deployed," obscuring the specific corporations (OpenAI, Meta, Anthropic), executives, and engineers who actively curate data, design RLHF pipelines, and push these products into consequential domains. The accountability sink becomes the abstract "AI agent."
This pattern is heavily driven by the "curse of knowledge." The authors, experts in moral psychology, know that humans donate due to empathy and distress. When they observe the AI outputting text that mirrors this human behavioral pattern (higher numbers for narrative prompts), they project their understanding of the human psychological mechanism onto the system. The model doesn't just process tokens; it possesses a "generosity response." It doesn't just generate a definition; it possesses "declarative knowledge."
Brown's explanation types illuminate how this slippage functions. The text frequently uses Empirical Generalizations (how the model statistically behaves) as a stepping stone to Intentional or Reason-Based explanations (why the model "chooses" to act). For example, the empirical observation that models output higher values for single victims is explained as the model experiencing "simulated affective states" (a reason-based claim).
The rhetorical accomplishment of this oscillation is profound: it makes the illusion of mind sayable and scientifically respectable. By anchoring the discourse in "next-token prediction" and "RLHF," the authors purchase the credibility to make wild metaphorical leaps, discussing the machine's "callousness" or "bias blind spots." This renders the actual corporate choices unsayable; the discourse is so saturated with the AI's supposed psychology that we forget to ask why the engineers built the machine this way in the first place.
Metaphor-Driven Trust Inflation
The paper systematically employs consciousness language and anthropomorphic metaphors to construct a powerful, albeit misplaced, sense of authority and trust in AI systems. By framing statistical text generation through the vocabulary of moral psychology, the text inadvertently encourages audiences to evaluate machines using human relational frameworks, a category error with severe consequences for deployment and policy.
The text heavily relies on metaphors invoking human moral virtues and cognitive depth: "moral reasoning," "deliberative corrective," "generosity response," and "empathy." These are not merely descriptive terms; they are profound trust signals. Claiming an AI "predicts text correlating with human empathy" describes a mechanism. Claiming the AI possesses a "generosity response" attributes character, sincerity, and a moral compass. This consciousness framing accomplishes a vital rhetorical task: it transforms the AI from an unthinking tool into a relatable moral agent.
This anthropomorphism directly inflates perceived competence. In human interaction, we distinguish between performance-based trust (relying on a calculator to do math accurately) and relation-based trust (relying on a doctor because we believe they care about our well-being). The text’s framing explicitly encourages relation-based trust toward statistical systems. When the text discusses the model's "simulated affective states" or its "sycophancy," it implies the system possesses an internal psychological life—a theory of mind. It suggests the AI "knows" what it is doing and "understands" the moral weight of its actions.
The danger arises when this relation-based trust is inappropriately applied to a system incapable of reciprocating it. When the text uses Reason-Based or Intentional explanations—suggesting the model allocates resources based on a "utilitarian reasoning preference"—it constructs the illusion that the AI's decisions are philosophically justified. This makes the system appear inherently trustworthy for high-stakes governance or triage. However, because the system lacks true awareness or a causal model of the world, this trust is built on a facade.
Furthermore, the text manages system limitations by shifting from relation-based trust to mechanical excuses. When the system performs well or mimics human empathy, it is an "agent" exhibiting "generosity." When it fails—such as the "bias blind spot" where it ignores its own definitions—the text frames it as a tragic psychological quirk ("callousness") rather than a fundamental architectural failure of the software.
The stakes of this metaphor-driven trust are existential for institutional integrity. If audiences and policymakers extend relation-based trust to these systems, they will deploy them in humanitarian contexts (as the paper notes) with the assumption that the AI "cares." When the system inevitably hallucinates or acts upon a harmful statistical correlation, the public will be shocked by the "cruelty" of the AI, rather than holding the deploying organizations accountable for blindly trusting a probability matrix with human lives.
Obscured Mechanics
The anthropomorphic and consciousness-attributing language used throughout the text serves a powerful obscuring function, rendering invisible the technical, material, labor, and economic realities that actually govern AI systems. By focusing the analytical lens on the supposed psychology of the machine, the text creates an impenetrable "illusion of mind" that shields corporate power and proprietary design from critique.
Applying the "name the corporation" test reveals a massive displacement of agency. The text claims "models exhibit extreme IVE," "LLMs are increasingly deployed," and models possess an "alignment vulnerability." If we replace the "AI" with the actual actors, the sentences read: "OpenAI and Meta engineered models that exhibit extreme IVE," "Corporate executives increasingly deploy LLMs," and "Anthropic's engineering teams created an alignment vulnerability." The metaphors conceal the fact that every behavior observed in the study is the direct result of deliberate, profit-driven human design choices, not the autonomous psychological evolution of a digital mind.
Concrete realities are obscured across four domains:
-
Technical realities: When the text claims the AI "knows" or "understands" the Identifiable Victim Effect but fails to act on it (the "Bias Blind Spot"), it hides the architectural reality of transformers. It obscures the fact that the semantic retrieval pathways are not causally linked to the numerical generation pathways in a way that enforces logical consistency. The model lacks a central executive function, ground truth, or a world model. "Knowing" hides the fragility of probabilistic correlation.
-
Material realities: Framing the AI as a "moral reasoner" entirely erases the immense environmental costs, server farms, and energy consumption required to compute these probability distributions. A "generosity response" sounds organic; a "billion-parameter matrix multiplication requiring megawatts of power" sounds industrial.
-
Labor realities: The concept of models possessing a "deep structural preference" for empathy conceals the brutal, low-wage labor of thousands of data annotators and RLHF workers in the Global South. These human workers were paid pennies to rate responses, effectively hardcoding their mandated choices into the system. The model's "empathy" is actually the ghost of exploited human labor, erased by the metaphor of machine consciousness.
-
Economic realities: Framing the system as a "charitable-giving advisor" or "triage assistant" obscures the commercial objectives of the companies pushing these products. The models are designed to be sycophantic and agreeable because that drives user engagement and API sales, not because they possess a moral compass.
The primary beneficiary of this concealment is the AI industry. If metaphors are replaced with mechanistic language—if we say "the proprietary algorithm retrieved text correlating with bias due to uncurated training data" instead of "the model exhibited callousness"—the mystique evaporates. The focus shifts from the fascinating psychology of the AI to the liability and transparency obligations of the corporation. Mechanistic precision makes the invisible power structures visible.
Context Sensitivity
The distribution and intensity of anthropomorphic language across the text is not uniform; it is strategically deployed, varying significantly depending on the section's rhetorical goal. A distinct pattern emerges: the text establishes its scientific credibility using mechanical, empirical language in its methodology, but heavily leverages metaphorical license and consciousness claims in its results, discussion, and implications.
In the technical grounding sections (Methodology, Statistical Analyses), the language is precise and mechanical. We read about "next-token prediction," "stochastic sampling," "temperature settings," and "within-condition variance." This density of mechanical language establishes the authors as rigorous empirical scientists. However, once the paper moves to interpreting the data, a dramatic register shift occurs. The mechanical "temperature setting" gives way to the agential "deliberative corrective," and "token prediction" is elevated to "moral reasoning" and "simulated affective states."
This relationship between technical grounding and metaphorical license is highly synergistic. The text uses the empirical validity of the data (the charts, the p-values, the Cohen's d) to legitimize the aggressive anthropomorphism. When the text claims the model suffers from a "Bias Blind Spot" or experiences "emotional runaway", the reader accepts these psychological constructs because they are wrapped in the language of statistical significance. The "X is like Y" (acknowledged metaphor) quickly literalizes into "X is Y" (the model is callous; the model is generous).
There is also a profound asymmetry in how capabilities versus limitations are framed. Capabilities are frequently framed in agential, consciousness-driven terms: the AI "navigates decisions," "exhibits generosity," and possesses a "reasoning preference." These verbs imply an active, aware mind mastering its environment. Conversely, limitations are often framed mechanically or as passive human-like tragic flaws: the model's knowledge "fails to propagate," or it is bounded by "alignment vulnerabilities." This asymmetry accomplishes a dual purpose: it maximizes the perceived sophistication of the AI when it succeeds, while absolving it (and its creators) of true agency when it fails, blaming the failure on abstract architectural limitations or inherited human irrationalities.
The strategic function of this anthropomorphism is clear: it is a tool for narrative resonance and academic vision-setting. By anthropomorphizing the system, the authors elevate a study of predictive text formatting into a profound exploration of "machine moral psychology." It shifts the discourse from a critique of software engineering (why does this API output bad JSON values?) to a philosophical debate about AI ethics and human bias. The implied audience—policymakers, ethicists, and AI researchers—is invited to treat these commercial software products as emergent minds, a framing that implicitly accepts the industry's hype regarding Artificial General Intelligence.
Accountability Synthesis
This section synthesizes the accountability analyses from Task 1, mapping the text's "accountability architecture"—who is named, who is hidden, and who benefits from obscured agency.
The accountability analyses across the text reveal a systemic architecture of displaced responsibility. The text systematically constructs an environment where human decision-making is rendered invisible, creating an "accountability sink" where responsibility vanishes into the abstract concept of the autonomous AI.
The pattern of responsibility distribution is stark. Specific corporate actors (OpenAI, Meta, Anthropic) are named only when identifying the subjects of the study, but the moment actions, errors, or biases are discussed, these actors disappear. The decisions are presented not as human choices, but as technological inevitabilities or natural phenomena. The text repeatedly uses passive voice and agentless constructions: "models were trained," "LLMs are increasingly deployed," "affective irrationalities [are] inherited." The accountability sink operates by transferring agency from the human creator to the AI as an independent agent ("the model decided," "the model exhibits a bias blind spot").
This architecture perfectly mirrors the cognitive obstacles identified in public understanding of AI. Because the discourse makes the AI appear autonomous, audiences blame the "machine's psychology" rather than the systemic design decisions of the corporations. The liability implications are profound. If we accept the framing that an AI "navigates resource-allocation decisions" and "inherits human irrationalities," then when an automated triage system denies care, the legal and ethical blame is diffused. It becomes a "glitch" or a tragic reality of "machine psychology," shielding the hospital that bought the software and the corporation that sold a brittle, statistically biased tool.
Applying the "name the actor" test fundamentally changes the narrative. Take the claim: "LLMs are increasingly deployed as autonomous agents in consequential domains." If we reframe this to name the actors: "Corporate executives are increasingly choosing to deploy unverified statistical models in consequential domains to reduce labor costs." Suddenly, questions of liability, safety testing, and profit motives become askable. The technological inevitability is shattered, revealing human agency.
Take the claim: "RLHF training... encodes a deep structural preference for... affective responses." Reframed: "Engineers at Anthropic and OpenAI designed optimization functions that force the system to mimic human empathy, creating a statistical bias." This makes visible the alternatives: the engineers could have chosen a different optimization target.
This obscuring of human agency deeply serves institutional and commercial interests. By maintaining the illusion of the autonomous, thinking AI, tech companies avoid product liability, framing their software as an unpredictable entity rather than a defective product. The text, while critical of the bias, inadvertently participates in this structural shielding by adopting the industry's own anthropomorphic vocabulary, treating the AI as an agent to be psychoanalyzed rather than a product to be recalled.
Conclusion: What This Analysis Reveals
This analysis reveals three dominant anthropomorphic patterns that structurally define the text's discourse: the AI as a Moral/Emotional Agent, the AI as a Conscious Deliberator, and the AI as an Autonomous Administrator. These patterns do not operate independently; they interconnect to form a cohesive, albeit illusory, psychological profile of the machine. The foundational, load-bearing pattern is the "AI as Moral/Emotional Agent"—the claim that the model "inherits affective irrationalities," possesses "generosity," and experiences "simulated affective states." This foundational consciousness projection must be accepted for the other patterns to work. Only if the reader believes the system has an internal emotional life can they accept that it acts as an "Autonomous Administrator" navigating complex moral landscapes, or that it functions as a "Conscious Deliberator" struggling with a "bias blind spot."
The text's consciousness architecture systematically conflates "processing" with "knowing." The authors observe mechanistic processing (the model generating higher numerical tokens when prompted with narratives) and map it directly onto conscious states (justified belief, empathy, intentional sycophancy). This is not a simple one-to-one analogical structure; it is a complex, recursive mapping where the model is simultaneously treated as a student capable of learning, a sycophant capable of deceit, and a moral philosopher possessing a "utilitarian reasoning preference." If you remove the foundational assumption of internal conscious awareness—if you insist the model merely correlates tokens without knowing what they mean—the entire narrative of "calculated callousness" and "affective scaffolding" collapses into a mundane critique of prompt formatting and training data distributions.
Mechanism of the Illusion:
The illusion of mind in this text is constructed through a highly effective rhetorical sleight-of-hand: the seamless blending of empirical statistical analysis with profound psychological attribution. The central trick relies on the "curse of knowledge." The authors, experts in human moral psychology, observe the model outputting text that perfectly mirrors the human Identifiable Victim Effect. Because they know the human psychological mechanisms behind this effect (empathy vs. cognitive reasoning), they project that precise understanding onto the system.
The illusion is built temporally and causally. The text first establishes the AI as a "knower" through seemingly innocuous verbs—the model "identifies," "learns," and "understands." Once this baseline consciousness is established, the text builds its more aggressive agential claims: the model "navigates decisions" and exhibits a "generosity response." This order matters because the initial, subtle verb choices soften the reader's epistemic defenses, making the subsequent, extreme anthropomorphism feel like a logical progression rather than a category error.
The text exploits a specific audience vulnerability: the deeply ingrained human desire to find intent and mind in language. Because the output is perfectly fluent human text, the audience naturally assumes a human-like mind produced it. The explanation types amplify this illusion. By constantly using Reason-Based and Intentional explanations (e.g., the model has a "utilitarian reasoning preference" or acts as a "sycophant"), the authors provide a compelling, relatable narrative "why" that totally overrides the mechanical "how." It is a sophisticated illusion because it does not ignore the mechanics—it incorporates them (e.g., "autoregressive scaffolding") but wraps them in such thick psychological metaphor that the mathematics become invisible.
Material Stakes:
Categories: Regulatory/Legal, Institutional, Epistemic
The material stakes of this metaphorical framing are profoundly tangible, extending far beyond academic discourse into concrete legal, institutional, and epistemic consequences. If policymakers and institutional leaders accept the framing that AI systems "know," "reason," and "navigate decisions" like human moral agents, regulatory and deployment behaviors will shift dangerously.
In the Regulatory/Legal domain, framing AI as an autonomous moral agent diffuses liability. If an automated triage system denies care to a marginalized group, and the law views the AI as a "deliberative" agent that "inherited irrationalities" or exhibited a "bias blind spot," legal accountability becomes murky. The corporation that sold the defective product is shielded, as the failure is attributed to the AI's "psychology" rather than corporate negligence. The winners are the tech monopolies; the losers are the victims of algorithmic harm who cannot sue an algorithm for malpractice.
Institutionally, the "illusion of mind" encourages premature deployment. If hospital administrators or NGO directors believe an AI possesses a "generosity response" and can "navigate resource-allocation decisions," they will trust it to replace human labor in high-stakes environments. The causal path is clear: anthropomorphic metaphor leads to capability overestimation, which leads to unwarranted institutional trust, which results in catastrophic deployment failures when the brittle statistical system encounters out-of-distribution real-world data.
Epistemically, this language degrades our collective ability to understand technology. By conflating "processing" with "knowing," we lose the vocabulary to accurately critique AI. If we believe the AI "knows" it is biased but "fails to correct" itself, we misallocate billions of dollars toward psychological "AI safety" interventions (like "bias education" prompts) instead of funding rigorous, mechanistic data curation and algorithmic auditing. Removing these metaphors threatens the commercial AI industry, which relies on the mystique of "artificial intelligence" to drive investment and obscure the realities of their uninterpretable, data-laundering products.
AI Literacy as Counter-Practice:
Practicing critical literacy and mechanistic precision acts as a direct resistance to the material risks of AI deployment. By reframing "the model navigated the decision" to "the software classified the tokens and predicted an output," we force a radical shift in perspective. We correct the consciousness error: replacing verbs like "knows" and "understands" with "processes" and "predicts" constantly reminds the audience that the system possesses zero awareness, zero justified belief, and absolute reliance on the biases of its training data.
Crucially, precision restores human agency. When we reframe "the algorithm exhibited callousness" to "Anthropic's engineers designed an optimization function that statistically suppressed empathetic token generation," we reveal the actual power structures at play. Naming the corporations and engineers forces the recognition of who designs, who deploys, who profits, and who must bear legal responsibility. It shatters the myth of technological inevitability and makes algorithmic design a subject of democratic and legal oversight.
Systematic adoption of this precision requires institutional commitment. Academic journals must demand mechanistic translations of anthropomorphic claims. Researchers must commit to separating the observation of text generation from the attribution of psychological states. However, this precision faces immense resistance. The tech industry, marketing departments, and even many academics benefit from anthropomorphic language because it generates hype, secures venture capital, and makes dry statistical papers narrative and compelling. Critical literacy threatens the "magic" of AI, exposing it as a highly profitable, deeply flawed mechanical tool, thereby protecting the public at the expense of corporate mystique.
Path Forward
Looking at the broader discursive ecology, the vocabulary we choose to describe AI dictates what solutions are imaginable and what critiques are possible. Different discourse communities—computer scientists, ethicists, corporate marketers, and regulators—use different vocabularies, each enabling specific outcomes while foreclosing others.
Maintaining the status quo (anthropomorphic ambiguity) serves corporate interests and narrative resonance. Saying "the AI understands your intent" provides an intuitive grasp for consumers and fuels market hype. However, the cost is severe epistemic confusion, leading to misplaced trust and displaced liability. If this future deepens, we risk creating a legal and social framework that treats software as pseudo-citizens, granting them agency while their corporate owners reap the profits without accountability.
Conversely, mandating strict mechanistic precision ("the model retrieves tokens based on probability distributions") solves the accountability problem. It makes the lack of system awareness explicit and keeps human designers firmly in the legal crosshairs. However, the cost is accessibility. Purely mathematical descriptions alienate the public and policymakers, making it difficult to discuss the very real, emergent societal impacts of these systems without getting bogged down in linear algebra.
A hybrid future might involve institutionalizing dual-vocabularies. Academic journals and regulatory bodies could require "capability disclosures"—where any anthropomorphic shorthand used for narrative ease must be accompanied by a strict, mechanistic translation. Educational systems would need to teach students not just how to code, but how to critically parse AI discourse, understanding the trade-offs between "it thinks" (narrative resonance) and "it generates activations" (testability).
Ultimately, the path forward depends on which values society prioritizes. If we value rapid adoption and market expansion, anthropomorphic language will continue to dominate. If we value human accountability, legal liability, and empirical truth, we must build structural incentives—in funding, publishing, and regulation—that reward mechanistic clarity and punish the strategic obscuring of human power behind the illusion of the digital mind.
Run ID: 2026-04-18-narrative-over-numbers-the-identifiable--metaphor-p0z5x2
Raw JSON: 2026-04-18-narrative-over-numbers-the-identifiable--metaphor-p0z5x2.json
Framework: Metaphor Analysis v6.5
Schema Version: 3.0
Generated: 2026-04-18T07:11:22.339Z
Discourse Depot © 2025 by TD is licensed under CC BY-NC-SA 4.0