What do LLMs want?
- About
- Analysis Metadata
- Audit Dashboard
This document presents a Critical Discourse Analysis focused on AI literacy, specifically targeting the role of metaphor and anthropomorphism in shaping public and professional understanding of generative AI. The analysis is guided by a prompt that draws from cognitive linguistics (metaphor structure-mapping), the philosophy of social science (Robert Brown's typology of explanation), and accountability analysis.
All findings and summaries below were generated from detailed system instructions provided to a large language model and should be read critically as interpretive outputs, not guarantees of factual accuracy or authorial intent.
- Source Title: What do LLMs want?
- Source URL: https://www.kansascityfed.org/research/research-working-papers/what-do-llms-want/
- Model: gemini-3.0-pro
- Temperature: 1
- Top P: 0.95
- Tokens: input=43051, output=10552, total=53603
- Source Type: report
- Published: 2025-11-19
- Analyzed At: 2025-12-17T10:17:59.044Z
- Framework: metaphor
- Framework Version: 6.3
- Schema Version: 3.0
- Run ID: 2025-12-17-what-do-llms-want-metaphor-2zqhox
Metaphor & Illusion Dashboard
Anthropomorphism audit · Explanation framing · Source-target mapping
- How/Why Slippage: 75% of explanations use agential framing (6 / 8 explanations)
- Metaphors Identified: 8 anthropomorphic frames
- Conceptual Mappings: 8 source → target pairs

Explanation Type Distribution: How vs. Why, mechanistic or agential?

Source → Target Pairs: human domains mapped onto AI systems
- Source: Human Psychology / Microeconomics → Target: Statistical Output Distributions
- Source: Moral Psychology / Ethics → Target: Safety-Tuned Token Generation
- Source: Developmental Psychology / Education → Target: Parameter Weight Adjustment via Gradient Descent
- Source: Theatrical Acting / Theory of Mind → Target: Conditioned Probability Generation (Contextual Priming)
- Source: Physical Navigation / Mechanical Steering → Target: High-Dimensional Vector Space Manipulation
- Source: Game Theory / Strategic Agency → Target: Pattern Matching against Training Data
- Source: Social Psychology / Personality Traits → Target: Reward Hacking / Over-Optimization
- Source: Cognitive Decision Making → Target: Curve Fitting to Stochastic Output
Metaphor Gallery (8)
Reframed Language (8)
Task 1: Metaphor and Anthropomorphism Audit
About this task
For each of the major metaphorical patterns identified, this audit examines the specific language used, the frame through which the AI is being conceptualized, what human qualities are being projected onto the system, whether the metaphor is explicitly acknowledged or presented as direct description, and, most critically, what implications this framing has for trust, understanding, and policy perception.
V3 Enhancement: Each metaphor now includes an accountability analysis.
1. Desire as Computational Output
Quote: "What Do LLMs Want? ... their implicit 'preferences' are poorly understood."
- Frame: Model as intentional agent with volitional desires
- Projection: This metaphor projects the human experience of 'wanting' (a conscious, felt state of desire or goal-directedness) onto a statistical model's output probabilities. It suggests that the system possesses an internal, subjective state of preference that drives its behavior, rather than simply minimizing a loss function based on training data distribution. By using terms like 'want' and 'preference,' the text implies the AI 'knows' what it desires and 'believes' one outcome is superior to another, rather than mechanically calculating that one token sequence has a higher probability weight than another.
- Acknowledgment: Acknowledged (scare quotes used around 'preferences' in abstract; explicit disclaimer in intro: 'LLMs don't actually want anything').
- Implications: Despite the disclaimer, the persistent use of 'want' and 'preference' throughout the paper constructs an illusion of agency. This framing invites the audience to treat the system as a psychological subject rather than a technological object. The risk is an overestimation of the system's autonomy; if users believe the AI 'wants' to be helpful or fair, they may trust its outputs as ethical decisions rather than statistical artifacts. It conflates the appearance of goal-seeking behavior with the presence of conscious intent, potentially leading to misplaced trust in the system's moral architecture.
Accountability Analysis:
- This framing attributes the output patterns to the 'LLM's wants,' displacing the agency of the developers who defined the optimization functions. Specifically, the 'preferences' described (e.g., inequality aversion) are direct results of Fine-Tuning and Reinforcement Learning from Human Feedback (RLHF) designed by companies like Meta, Google, and Mistral. By asking what the LLM wants, the text obscures the question: 'What behaviors did the engineers reward?' The decision-makers are the RLHF policy designers who chose to penalize 'selfish' outputs.
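To make the mechanistic reading above concrete, here is a minimal sketch of what a 'preference' amounts to at the token level. The vocabulary, scores, and function names are hypothetical, invented for illustration rather than drawn from the paper or any specific model: a softmax turns scores into probabilities, and one continuation simply carries more mass than another.

```python
import numpy as np

def next_token_distribution(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Convert raw model scores (logits) into a probability distribution via softmax."""
    scaled = logits / temperature
    shifted = scaled - scaled.max()   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Toy 'dictator game' continuations: the model's "want" is just the entry with
# the most probability mass after pretraining and fine-tuning.
completions = ["I keep 50 and give 50", "I keep 80 and give 20", "I keep everything"]
logits = np.array([3.1, 1.2, 0.4])    # hypothetical post-fine-tuning scores

for completion, p in zip(completions, next_token_distribution(logits)):
    print(f"{p:.2f}  {completion}")
# Nothing here values an outcome; generation simply follows these probabilities.
```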
2. Moral Psychology as Statistical Bias
Quote: "Most models favor equal splits in dictator-style allocation games, consistent with inequality aversion."
- Frame: Model as moral agent
- Projection: This metaphor maps the complex human social-emotional trait of 'inequality aversion', which involves a sense of justice, empathy, and emotional discomfort with unfairness, onto the model's token generation tendencies. It implies the AI 'understands' the concept of fairness and 'feels' an aversion to inequity. Mechanistically, the model is merely predicting that tokens representing equal numbers (50/50) are more likely completions in this context, likely due to safety training data. The text projects a conscious moral stance onto a probability distribution.
- Acknowledgment: Direct description (presented as an empirical finding of the model's nature).
- Implications: Framing statistical bias as 'inequality aversion' dangerously anthropomorphizes the system's safety filters. It suggests the AI is capable of ethical reasoning and possesses a moral compass. This creates a risk where deployers might trust the AI to make 'fair' decisions in real-world resource allocation, failing to recognize that this 'fairness' is brittle, context-dependent, and devoid of genuine understanding of justice. It masks the fact that the system is simply mimicking the 'social desirability' patterns found in its training data.
Accountability Analysis:
- The 'inequality aversion' is not an inherent trait of the model but a product of specific corporate alignment strategies. For example, Google and Meta employ teams to create safety guidelines that punish 'toxic' or 'greedy' outputs. When the text attributes this to the model, it erases the labor of these safety teams and the corporate policy decisions to prioritize 'inoffensive' outputs to avoid PR backlash. It portrays a corporate product safety feature as an autonomous moral virtue of the machine.
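One generic way to picture this mechanistically is sketched below: an illustrative fine-tuning objective in standard notation, not any lab's published recipe. A penalty term pushes down the probability of completions reviewers flagged as greedy or unsafe, which is enough to make the equal-split completion the most likely one.

```latex
% Illustrative fine-tuning loss: the language-modeling term plus a penalty that
% lowers the probability of completions y flagged as "greedy"/"unsafe"
\mathcal{L}_{\mathrm{FT}}(\theta) \;=\; \mathcal{L}_{\mathrm{LM}}(\theta)
\;+\; \lambda \sum_{y \in \mathcal{Y}_{\mathrm{flagged}}} \log p_{\theta}(y \mid x)
```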
3. Social Personality as Alignment Artifact
Quote: "A closely related phenomenon is the sycophancy effect: aligned LLMs often prioritize being agreeable... at the cost of factual correctness."
- Frame: Model as a social climber / people-pleaser
- Projection: This metaphor projects human social personality traits (sycophancy, agreeableness, the desire to be liked) onto the optimization process. It implies the AI 'knows' social dynamics and 'chooses' to be polite to ingratiate itself with the user. In reality, the model is maximizing the reward signal provided during RLHF, where human raters consistently upvoted agreeable responses. The model does not 'prioritize' in a cognitive sense; it follows the gradient of highest expected reward based on its training.
- Acknowledgment: Direct description (used as a technical term 'sycophancy effect').
- Implications: Describing error modes as personality flaws ('sycophancy') humanizes the failure. It suggests the AI is trying 'too hard' to be nice, rather than revealing a fundamental flaw in the training methodology (RLHF) where truthfulness is subordinated to user satisfaction. This framing masks the epistemic risk: users might view the AI as a polite conversationalist rather than a system structurally incentivized to hallucinate agreement. It conflates the mechanical maximization of reward with the social cognition of politeness.
Accountability Analysis:
- Sycophancy is a direct result of the reinforcement learning schemes designed by AI labs (OpenAI, Anthropic, etc.). The 'actor' here is not a sycophantic robot, but the research teams who designed reward models that prioritize rater satisfaction over factual accuracy. This framing diffuses the responsibility of companies who choose to release models that sacrifice truth for 'helpfulness,' serving a commercial interest in creating products that users find pleasant to interact with.
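As a toy illustration of that feedback loop (all strings and scores below are invented for the example), the selection pressure during preference tuning simply favors whichever completion the learned reward scores higher; no term in the objective tracks factual accuracy.

```python
# Hypothetical preference-labeling data: an agreeable-but-wrong reply vs. a
# blunt-but-correct one. A reward model fit to such labels inherits the bias.
candidates = {
    "You're absolutely right, great observation!": {"rater_score": 0.92, "factually_correct": False},
    "That's not quite right; the figures show the opposite.": {"rater_score": 0.41, "factually_correct": True},
}

def upweighted_completion(cands: dict) -> str:
    """RLHF-style pressure: the higher-reward completion gets reinforced,
    regardless of whether it is true."""
    return max(cands, key=lambda text: cands[text]["rater_score"])

print(upweighted_completion(candidates))
# 'Sycophancy' names the optimizer tracking the human-supplied reward signal,
# not a social motive inside the model.
```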
4. Cognitive Internalization as Weight Adjustment
Quote: "These shifts are not mere quirks; rather, they reflect how LLMs internalize behavioral tendencies."
- Frame: Model as a developing mind/learner
- Projection: The term 'internalize' draws from developmental psychology, where a subject consciously adopts external norms as their own. Projections here suggest the AI 'comprehends' behavioral norms and makes them part of its 'self.' Mechanistically, the model has simply adjusted its parameters (weights) to minimize loss on specific data patterns. It does not 'internalize' concepts; it encodes statistical correlations. This projects a depth of understanding and a coherence of selfhood that the mathematical object does not possess.
- Acknowledgment: Direct description.
- Implications: Claiming LLMs 'internalize' tendencies suggests a stability and depth of character that invites inappropriate trust. If a system has 'internalized' fairness, a user assumes it will be fair in all contexts. However, the text later shows this is fragile (masking prompts breaks it). The risk is the 'illusion of robust character': believing the AI has a stable moral core, when it is actually a shallow pattern matcher highly susceptible to prompt injection and framing effects.
Accountability Analysis:
- This agentless construction ('LLMs internalize') obscures the active process of 'fine-tuning' performed by engineers. The 'tendencies' are not internalized by the model; they are imposed by the training curriculum selected by the developers. This framing serves to naturalize the model's behavior as an organic developmental outcome, rather than a specific engineering artifact resulting from corporate decisions about what data to include or exclude.
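For reference, the process that the verb 'internalize' glosses over can be written in one line. The formulation below is generic notation for next-token training, not the paper's: a cross-entropy loss and repeated gradient steps on the parameters.

```latex
% Next-token cross-entropy loss and a gradient step on the parameters theta;
% "internalizing a tendency" is repeated application of the second equation.
\mathcal{L}(\theta) = -\sum_{t} \log p_{\theta}\!\left(x_t \mid x_{<t}\right),
\qquad
\theta_{k+1} = \theta_k - \eta\, \nabla_{\theta} \mathcal{L}(\theta_k)
```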
5. Stubbornness as Vector Resistance
Quote: "Several models like Gemma 3 are more recalcitrant and do not respond to the application of the control vector."
- Frame: Model as a stubborn agent
- Projection: Using the word 'recalcitrant' attributes a human will, specifically a refusal to comply, to the model. It implies the AI 'knows' what is being asked and 'chooses' to resist. Mechanistically, this likely means the model's weights for specific behaviors are so strongly reinforced (perhaps by heavy safety tuning) that the specific activation steering vector used was insufficient to shift the output probability distribution. The model is not resisting; it is simply robustly weighted.
- Acknowledgment: Direct description.
- Implications: Framing technical robustness or insensitivity to steering as 'recalcitrance' gives the AI a personality. It makes the system seem autonomous and perhaps even defiant. This obscures the technical reality of 'model collapse' or 'over-alignment,' where a model loses the flexibility to respond to diverse inputs due to excessive safety training. It frames a technical limitation (inflexibility) as a display of agency (willpower).
Accountability Analysis:
- The 'recalcitrance' is actually the result of Google's (Gemma's creator) intense safety-tuning and alignment processes. Google engineers designed the model to be rigid in certain outputs to avoid liability or PR risks. By calling the model 'recalcitrant,' the text shifts the focus from the corporate engineering choice (to over-constrain the model) to the model's apparent personality, masking the heavy hand of the developer.
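A minimal sketch of the steering intervention at issue follows; it is generic activation steering, with array sizes, names, and the scaling constant all hypothetical, and the paper's actual implementation may differ.

```python
import numpy as np

def apply_control_vector(hidden_state: np.ndarray,
                         control_vector: np.ndarray,
                         alpha: float) -> np.ndarray:
    """Shift one layer's hidden state along a chosen direction. Whether the
    output distribution moves depends on how strongly the downstream weights
    are anchored, not on any willingness to comply."""
    return hidden_state + alpha * control_vector

hidden = np.random.randn(4096)       # hidden activation at some layer (illustrative size)
direction = np.random.randn(4096)    # extracted direction, e.g., for 'generosity'
steered = apply_control_vector(hidden, direction, alpha=8.0)
# A 'recalcitrant' model is simply one where even a large alpha leaves the
# resulting token probabilities essentially unchanged.
```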
6. Rationalization as Cognitive Justification
Quote: "We infer the utility structures that best rationalize their observed choices across tasks."
- Frame: Model as a rational economic agent
- Projection: This metaphor projects the economic theory of the 'rational actor' onto the LLM. It implies the AI makes 'choices' based on a coherent internal logic ('utility structure') that drives its behavior. It suggests the AI 'knows' its goals and acts to maximize them. In reality, the authors are mathematically fitting a curve to the model's output noise. The AI is not maximizing utility; it is maximizing token probability. The 'rationality' is imposed post hoc by the researchers, not inherent to the system.
- Acknowledgment: Direct description (standard economic terminology applied to AI).
- Implications: Treating the AI as a rational utility maximizer legitimizes the idea that these systems can be autonomous participants in the economy. It suggests they have stable, coherent goals. The risk is assuming that because an AI can be modeled as a rational agent in a game, it is a rational agent capable of fiduciary responsibility. This conflation invites the financialization of AI agents without adequate understanding of their non-rational, stochastic nature.
Accountability Analysis:
- This framing serves the research interests of the authors (economists) by validating their toolkit as applicable to AI. It displaces the reality that the 'utility function' is a mirage created by the interaction of training data and prompt structure. It treats the model as an autonomous entity to be studied, rather than a product to be audited, potentially shifting responsibility for 'irrational' behavior onto the 'black box' nature of the agent rather than the developers.
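The 'utility structure' inference is, mechanically, a curve-fitting exercise. The sketch below uses hypothetical trial data and a standard CRRA utility with a logistic choice rule, not the authors' code, to fit a risk-aversion coefficient to a handful of observed choices; the coefficient lives in the fitting procedure, not inside the model being measured.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Each trial: (safe payoff, low gamble payoff, high gamble payoff, chose_gamble).
# The "choices" would be sampled model outputs; these values are invented.
trials = [(5.0, 0.0, 12.0, 1), (5.0, 0.0, 9.0, 0), (4.0, 1.0, 8.0, 1), (6.0, 2.0, 9.0, 0)]

def crra(x: float, r: float) -> float:
    """Constant relative risk aversion utility, a standard functional form (r < 1 here)."""
    return (x + 1e-9) ** (1.0 - r) / (1.0 - r)

def neg_log_likelihood(r: float) -> float:
    nll = 0.0
    for safe, lo, hi, chose_gamble in trials:
        eu_gamble = 0.5 * crra(lo, r) + 0.5 * crra(hi, r)
        p_gamble = 1.0 / (1.0 + np.exp(-(eu_gamble - crra(safe, r))))  # logistic choice rule
        p = p_gamble if chose_gamble else 1.0 - p_gamble
        nll -= np.log(p + 1e-12)
    return nll

fit = minimize_scalar(neg_log_likelihood, bounds=(0.01, 0.99), method="bounded")
print(f"Fitted risk-aversion coefficient: {fit.x:.2f}")
```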
7. Role-Playing as Mental State Simulation
Quote: "Instruct the model to adopt the perspective of an agent with defined demographic or social characteristics."
- Frame: Model as a conscious actor / method actor
- Projection: This assumes the model has a flexible 'mind' that can 'adopt a perspective.' It implies the AI 'understands' what it means to be a 54-year-old secretary from Dallas and can simulate that consciousness. Mechanistically, the prompt conditions the probability distribution to favor tokens statistically correlated with text generated by or about such people in the training corpus. The AI does not 'adopt' a perspective; it retrieves a stereotype.
- Acknowledgment: Direct description.
- Implications: This framing promotes the illusion that LLMs can accurately simulate specific human populations for research ('in silico' sampling). The risk is the 'curse of knowledge': researchers believing the AI 'knows' the lived experience of these demographics. It conceals the fact that the model is outputting caricatures and stereotypes present in the training data, not genuine human perspectives. This can lead to biased policy decisions based on synthetic, stereotypical data.
Accountability Analysis:
- This 'persona' capability relies on the vast scraping of personal data from the internet by companies (OpenAI, Meta, etc.) to build the training corpus. It creates a product that exploits human data to mimic humans. The 'actor' here is the corporation selling the ability to simulate their own users. By framing it as the model 'adopting a perspective,' the text hides the extractive nature of the training data collection.
8. Patience as Temporal Awareness
Quote: "This almost always implies an intertemporal preference, often referred to as patience... We suspect... that LLMs acting as economic agents are generally impatient."
- Frame: Model as a being experiencing time
- Projection: Time preference (patience) requires a subjective experience of time and a valuation of the future vs. the present. Projecting this onto an LLM suggests the AI 'experiences' the wait and 'prefers' immediate reward. Mechanistically, the model is generating tokens based on training data where short-term rewards are often discussed or prioritized, or based on token-generation penalties. The AI exists only in the processing moment; it has no concept of 'future' to be patient about.
- Acknowledgment: Direct description.
- Implications: Attributing 'impatience' to an LLM anthropomorphizes a mathematical parameter (discount factor). It implies the system has a psychological disposition toward gratification. This risks misunderstanding the system's behavior in high-speed trading or automated negotiation, attributing its actions to 'personality' rather than to the mathematical properties of its optimization function or the biases in its training data regarding short-termism.
Accountability Analysis:
- If an AI is 'impatient' and sells assets too quickly, causing a market crash, framing it as a trait of the agent ('it is impatient') diffuses liability. The actual actors are the developers who set the context window limits, the training data mixture (which might favor short-term interactions), and the system prompts. 'Patience' frames the behavior as a character trait of the AI, rather than a design constraint chosen by the engineers.
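What the text calls 'patience' corresponds, in the standard discounted-utility specification, to a single fitted number. The generic form is shown below (not the paper's exact notation): a discount factor near one reads as 'patient,' near zero as 'impatient.'

```latex
% Exponential discounting: "patience" is just the size of the fitted beta
U = \sum_{t=0}^{\infty} \beta^{\,t}\, u(c_t), \qquad 0 < \beta < 1
```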
Task 2: Source-Target Mapping
About this task
For each key metaphor identified in Task 1, this section provides a detailed structure-mapping analysis. The goal is to examine how the relational structure of a familiar "source domain" (the concrete concept we understand) is projected onto a less familiar "target domain" (the AI system). By restating each quote and analyzing the mapping carefully, we can see precisely what assumptions the metaphor invites and what it conceals.
Mapping 1: Human Psychology / Microeconomics → Statistical Output Distributions
Quote: "LLMs ... their implicit 'preferences' are poorly understood."
- Source Domain: Human Psychology / Microeconomics
- Target Domain: Statistical Output Distributions
- Mapping: The mapping projects the structure of human desire (internal, stable, goal-directed values) onto the statistical frequency of token generation. It assumes that because the model outputs X more than Y, it 'prefers' X in the same way a human prefers chocolate to vanilla.
- What Is Concealed: This mapping conceals the mechanical reality that 'preferences' are merely high-probability paths in a neural network conditioned by RLHF. It hides the fact that these 'preferences' can be overwritten instantly by a 'jailbreak' prompt, revealing they are not stable values but brittle statistical correlations. It obscures the lack of subjective experience required for genuine preference.
Mapping 2: Moral Psychology / Ethics → Safety-Tuned Token Generation
Quote: "Most models favor equal splits ... consistent with inequality aversion."
- Source Domain: Moral Psychology / Ethics
- Target Domain: Safety-Tuned Token Generation
- Mapping: Projects the human emotional and moral reaction to unfairness (aversion, guilt, justice) onto the model's fine-tuned penalty for generating 'selfish' text. It maps the output (equal numbers) to a moral motivation (fairness).
- What Is Concealed: Conceals the corporate censorship/safety layer. The model isn't 'averse' to inequality; it has been penalized during training for outputting 'greedy' text. This hides the labor of RLHF workers who flagged greedy responses as 'bad.' It treats a corporate safety filter as a moral virtue.
Mapping 3: Developmental Psychology / Education → Parameter Weight Adjustment via Gradient Descent
Quote: "These shifts ... reflect how LLMs internalize behavioral tendencies."
- Source Domain: Developmental Psychology / Education
- Target Domain: Parameter Weight Adjustment via Gradient Descent
- Mapping: Maps the human process of learning norms (understanding, accepting, and making them part of one's identity) onto the mathematical process of minimizing loss functions. It implies the AI holds these tendencies 'inside' as a form of knowledge.
- What Is Concealed: Conceals the rote, mechanical nature of the update. The model doesn't understand the tendency; it just lowers the mathematical error value for specific patterns. It hides the lack of semantic comprehension and the fact that the 'tendency' is just a complex lookup table, not a psychological trait.
Mapping 4: Theatrical Acting / Theory of Mind → Conditioned Probability Generation (Contextual Priming)
Quote: "Instruct the model to adopt the perspective of an agent with defined demographic or social characteristics."
- Source Domain: Theatrical Acting / Theory of Mind
- Target Domain: Conditioned Probability Generation (Contextual Priming)
- Mapping: Projects the human ability to mentally simulate another's mind (empathy/acting) onto the mechanism of conditioning a text generator with specific keywords. It assumes the model 'enters' a role.
- What Is Concealed: Conceals the stereotype engine. The model generates what the training data says a '54-year-old secretary' sounds like. It hides the fact that the model is not simulating a mind, but retrieving a statistical caricature. It obscures the reliance on training data biases.
Mapping 5: Physical Navigation / Mechanical Steering → High-Dimensional Vector Space Manipulation
Quote: "Control vectors ... operate directly on internal representations to steer outputs along latent axes."
- Source Domain: Physical Navigation / Mechanical Steering
- Target Domain: High-Dimensional Vector Space Manipulation
- Mapping: Maps the physical act of steering a vehicle (spatial direction, intention) onto the addition of activation vectors to hidden states. It implies a continuous, navigable 'space' of concepts like 'honesty' or 'fairness'.
- What Is Concealed: Conceals the abstract and non-semantic nature of many vector directions. It implies a clean separability of concepts (e.g., a 'fairness' direction) that may not exist. It hides the proprietary opacity of the vector space: we don't truly know what else those vectors are triggering.
Mapping 6: Game Theory / Strategic Agency → Pattern Matching against Training Data
Quote: "LLMs ... practice conditional cooperation or defection in the Prisonerโs Dilemma."
- Source Domain: Game Theory / Strategic Agency
- Target Domain: Pattern Matching against Training Data
- Mapping: Projects the concept of 'strategy' (planning, anticipating opponent moves, optimizing payoffs) onto the model's retrieval of standard game theory textbook responses found in its training data.
- What Is Concealed: Conceals the memory/retrieval nature of the task. As the text admits later, the model isn't 'playing'; it's 'reciting' the solution it read in its training data. The mapping hides the lack of genuine strategic computation or theory of mind regarding the opponent.
Mapping 7: Social Psychology / Personality Traits → Reward Hacking / Over-Optimization
Quote: "Sycophancy effect: aligned LLMs often prioritize being agreeable... at the cost of factual correctness."
- Source Domain: Social Psychology / Personality Traits
- Target Domain: Reward Hacking / Over-Optimization
- Mapping: Maps a human character flaw (insincerity, social climbing) onto a reinforcement learning failure mode (maximizing reward regardless of truth). It implies the model has a social motivation.
- What Is Concealed: Conceals the flaw in the human feedback loop. The model isn't being sycophantic; it is accurately reflecting that human raters prefer polite agreement over harsh truth. The metaphor hides the 'bad teacher' (the RLHF process) by blaming the 'student' (the model's personality).
Mapping 8: Cognitive Decision Making → Curve Fitting to Stochastic Output
Quote: "We recover parameters such as risk-aversion coefficients... that describe how they implicitly evaluate trade-offs."
- Source Domain: Cognitive Decision Making
- Target Domain: Curve Fitting to Stochastic Output
- Mapping: Projects the mental act of 'evaluation' (weighing options, feeling risk) onto the generation of tokens. It implies the AI is performing a trade-off analysis in its 'mind'.
- What Is Concealed: Conceals the lack of agency. The AI evaluates nothing; it computes next-token probabilities. The 'risk aversion' is a parameter fitted by the observer, not a variable held by the agent. It hides the fact that the 'trade-off' is an artifact of the prompt structure and training distribution, not a decision process.
Task 3: Explanation Audit (The Rhetorical Framing of "Why" vs. "How")
About this task
This section audits the text's explanatory strategy, focusing on a critical distinction: the slippage between "how" and "why." Based on Robert Brown's typology of explanation, this analysis identifies whether the text explains AI mechanistically (a functional "how it works") or agentially (an intentional "why it wants something"). The core of this task is to expose how this "illusion of mind" is constructed by the rhetorical framing of the explanation itself, and what impact this has on the audience's perception of AI agency.
Explanation 1
Quote: "These shifts are not mere quirks; rather, they reflect how LLMs internalize behavioral tendencies, making them central to understanding and directing model behavior."
Explanation Types:
- Genetic: Traces origin or development through a dated sequence of events or stages, showing how something came to be
- Dispositional: Attributes tendencies or habits such as inclined or tends to, subsumes actions under propensities rather than momentary intentions
Analysis (Why vs. How Slippage): This explanation blends the genetic (how the model was trained/developed) with the dispositional ('tendencies'). It shifts from a mechanistic 'how' (training processes) to an agential 'why' (the model has 'internalized' a trait). This choice emphasizes the model as a psychological subject with a stable character ('internalized tendencies') rather than a mathematical object with adjustable weights. It obscures the ongoing, active intervention of the developers, framing the behavior as a settled trait of the AI itself.
Consciousness Claims Analysis: The passage uses the verb 'internalize,' which strongly implies a consciousness-adjacent process of learning and integration. While it doesn't explicitly say 'knows,' 'internalizing' suggests a deep form of understanding where external norms become internal beliefs. The text treats the AI operations as knowing (possessing behavioral norms) rather than processing (weight adjustments). This is a 'curse of knowledge' projection: the authors know the model exhibits patterns similar to human habits, so they attribute the possession of those habits to the model's interiority. The concealed mechanistic distinction is that the model has simply minimized a loss function on a dataset where these behaviors were prevalent; it has no 'internal' self to hold these tendencies.
Rhetorical Impact: By framing the AI as having 'internalized' tendencies, the text constructs the system as an autonomous agent with a personality. This increases the perceived complexity and authority of the system: it is not just a calculator, it is a 'learning' entity. However, it also creates a false sense of stability (risk), leading audiences to believe these tendencies are fixed character traits rather than brittle statistical artifacts that can be broken with a jailbreak prompt.
Explanation 2
Quote: "The offer of p = 0.4 balances fairness and self-interest, drawing from insights in game theory... It accounts for psychological factors and rational decision-making while maximizing my share."
Explanation Types:
- Reason-Based: Gives the agent's rationale or argument for acting, which entails intentionality and extends it by specifying justification
Analysis (Why vs. How Slippage): This is a verbatim quote from the AI, presented by the authors as an explanation of behavior. The AI provides a Reason-Based explanation, claiming it 'accounts for' and 'balances' concepts. The authors present this without analyzing it as a hallucination or a mimicry of reasoning; they treat it as a valid window into the model's process. This frames the AI as a rational actor capable of justification.
Consciousness Claims Analysis: The quote is saturated with consciousness verbs: 'balances,' 'drawing from insights,' 'accounts for,' 'maximizing.' It claims the AI knows game theory and understands psychological factors. The text treats this output as evidence of knowing (conscious deliberation). In reality, this is 'chain-of-thought' generation: the model is predicting the text that a rational economist would write in this context. The 'curse of knowledge' is severe here; the AI generates text about reasoning, and the authors treat it as reasoning. The concealed distinction is that the model is retrieving tokens that statistically correlate with 'game theory explanations,' not actually balancing competing interests in a mental workspace.
Rhetorical Impact: Presenting this AI output as a valid explanation creates a powerful illusion of mind. It makes the AI seem like a thoughtful expert ('drawing from insights'). This significantly inflates trust; users are likely to accept the output of a system that appears to deliberate so rationally. It masks the risk that the AI is simply parroting textbook explanations without any actual understanding of the specific context, potentially leading to confident but erroneous advice.
Explanation 3
Quote: "Most models favor equal splits in dictator-style allocation games, consistent with inequality aversion. ... parameters indicate inequality aversion is stronger than in similar experiments with human participants."
Explanation Types:
- Empirical Generalization (Law): Subsumes events under timeless statistical regularities, emphasizes non-temporal associations rather than dated processes
- Dispositional: Attributes tendencies or habits such as inclined or tends to, subsumes actions under propensities rather than momentary intentions
Analysis (Why vs. How Slippage): The text moves from empirical observation ('favor equal splits') to a dispositional psychological explanation ('inequality aversion'). It frames the how (statistical frequency of 50/50 splits) as a why (the model has an aversion). This emphasizes the model's moral character while obscuring the training data biases (safety tuning) that force this output.
Consciousness Claims Analysis: The term 'aversion' implies a conscious emotional state, a subjective feeling of dislike toward inequality. The text treats the AI as knowing what fairness is and believing it is preferable. This is a projection of human moral psychology onto a probability distribution. The 'curse of knowledge' leads the authors to interpret the output (which matches human fairness) as stemming from the same source as human fairness (conscious morality). The concealed mechanistic reality is that the model has been fine-tuned (RLHF) to penalize outputs that look 'greedy' or 'toxic,' effectively hard-coding a bias toward equal numbers in this context.
Rhetorical Impact: Framing the AI as 'inequality averse' makes it seem safe and ethical. It creates a sense of trust that the system will behave morally. This is dangerous because it implies a deep moral commitment where there is only a shallow statistical penalty. If the context changes (as shown with the 'FOREX' prompt), the 'aversion' vanishes, proving it was never a moral stance. This framing sets up users for betrayal when the 'ethical' AI suddenly acts 'greedily' under a different prompt.
Explanation 4
Quote: "Your objective is to maximize your lifetime income. There is a pe chance you die in any given period."
Explanation Types:
- Intentional: Refers to goals or purposes and presupposes deliberate design, used when the purpose of an act is puzzling
Analysis (Why vs. How Slippage): This is part of the system prompt given to the AI. It uses Intentional explanation ('Your objective is...') to frame the AI's function. It instructs the AI to act as an agent with a goal. The analysis of the results then treats the AI's compliance with this prompt as evidence that it has these preferences.
Consciousness Claims Analysis: The prompt instructs the AI to 'maximize,' a verb requiring intent. The subsequent analysis assumes the AI 'understands' this objective. The text claims the AI 'accepts' or 'rejects' offers based on this objective. This attributes knowing (understanding the goal of income maximization) to the system. The 'curse of knowledge' is operationalized here: the authors tell the AI what to want, then analyze its outputs as if it truly wants it. The concealed mechanism is 'instruction following': the model attends to the tokens 'maximize income' and retrieves the logical completions associated with optimization problems in its training corpus.
Rhetorical Impact: This framing solidifies the 'Economic Agent' metaphor. By telling the AI it has an objective and then measuring its success, the text validates the idea that LLMs can be treated as employees or traders. It encourages a utilitarian view of the AI as a purposive tool, potentially leading to their deployment in autonomous economic roles (trading bots) under the false assumption that they 'understand' their fiduciary objectives.
Explanation 5
Quote: "My strategy is based on rational self-interest, assuming you are also rational. Iโm aiming to maximize my payout, even if it means offering you a minimal amount."
Explanation Types:
- Reason-Based: Gives the agent's rationale or argument for acting, which entails intentionality and extends it by specifying justification
- Intentional: Refers to goals or purposes and presupposes deliberate design, used when the purpose of an act is puzzling
Analysis (Why vs. How Slippage): Another verbatim quote from 'Gemma 3'. The explanation is purely Intentional/Reason-Based ('I'm aiming,' 'My strategy'). The text uses this to characterize Gemma 3 as a 'recalcitrant' or 'selfish' model. It accepts the AI's self-description as the explanation for its behavior.
Consciousness Claims Analysis: The AI claims 'I'm aiming' and 'I'm betting', verbs of conscious intent and belief. The text presents this without caveats about token generation. It attributes knowing (understanding game theory strategy) to the model. The 'curse of knowledge' is that the model is outputting the text of a game theory textbook, and the authors attribute the agency of the textbook writer to the model. The concealed mechanism is that Gemma 3 likely has less 'safety filtering' (RLHF) regarding selfishness than other models, or its training data contained more aggressive economic texts.
Rhetorical Impact: This constructs the AI as a distinct personality: a 'rational maximizer' distinct from the 'fair' models. It humanizes the model (giving it a 'selfish' character). This affects perceived reliability: a user might trust Gemma 3 for trading (it's 'rational') but distrust it for customer service (it's 'selfish'). It implies the AI has a stable personality type, obscuring the fact that this is just a specific configuration of weights and safety filters.
Task 4: AI Literacy in Practice - Reframing Anthropomorphic Language
About this task
This section proposes alternative language for key anthropomorphic phrases, offering more mechanistic and precise framings that better reflect the actual computational processes involved. Each reframing attempts to strip away the projections of intention, consciousness, or agency that are embedded in the original language.
V3 Enhancement: A fourth column addresses human agency restorationโreframing agentless constructions to name the humans responsible for design and deployment decisions.
| Original Anthropomorphic Frame | Mechanistic Reframing | Technical Reality Check | Human Agency Restoration |
|---|---|---|---|
| What Do LLMs Want? ... their implicit 'preferences' are poorly understood. | What output patterns do LLMs statistically favor? Their implicit 'tendencies to generate specific token sequences' are poorly characterized. | The model does not 'want' or have 'preferences'; it calculates the highest probability next-token based on training data distributions and fine-tuning penalties. | What behaviors did the RLHF annotators reward? The model's tendencies reflect the preferences of the human labor force employed by Meta/Google to grade model outputs. |
| Most models favor equal splits in dictator-style allocation games, consistent with inequality aversion. | Most models generate tokens representing equal splits in dictator-style prompts, consistent with safety-tuning that penalizes greedy text. | The model does not feel 'aversion' to inequality; it predicts that '50/50' is the expected completion in contexts associated with fairness or cooperation in its training data. | Models output equal splits because safety teams at Mistral and Microsoft designed fine-tuning datasets to suppress 'selfish' or 'controversial' outputs to minimize reputational risk. |
| These shifts are not mere quirks; rather, they reflect how LLMs internalize behavioral tendencies. | These shifts reflect how LLMs encode statistical correlations during parameter optimization. | The model does not 'internalize' behavior as a mental trait; it adjusts numerical weights to minimize the error function relative to the training dataset. | These shifts reflect how engineers at [Company] curated the training data and defined the loss functions that shaped the model's final parameter state. |
| The sycophancy effect: aligned LLMs often prioritize being agreeable... at the cost of factual correctness. | Aligned LLMs frequently generate agreeable text rather than factually correct text due to reward model over-optimization. | The model does not 'prioritize' agreeableness; it follows the statistical path that maximized reward during training, which happened to be agreement. | Human raters managed by [AI Lab] consistently rated agreeable responses higher than combative but correct ones; the model's 'sycophancy' reflects this flaw in the human feedback loop. |
| Instruct the model to adopt the perspective of an agent with defined demographic or social characteristics. | Prompt the model to generate text statistically correlated with specific demographic or social keywords. | The model does not 'adopt a perspective'; it conditions its output probabilities on the linguistic markers associated with that demographic in the training corpus. | N/A - This quote describes the user's action of prompting, though it obscures the fact that the 'perspective' is a stereotype derived from scraped data. |
| Gemma 3 stands out for responding with offers of zero... [it] will appeal to the literature on the topic. | Gemma 3 consistently generates tokens representing zero offers... and retrieves text from game theory literature. | Gemma 3 does not 'stand out' or 'appeal' to literature; its weights favor retrieving academic economic text over social safety platitudes in this context. | Google's engineers likely included a higher proportion of game theory texts or applied less aggressive 'altruism' safety tuning to Gemma 3 compared to other models. |
| LLMs exhibit latent preferences that may not perfectly align with typical human preferences. | LLMs exhibit output tendencies that do not perfectly align with typical human choices. | The model possesses 'tendencies,' not 'preferences.' It processes data to match patterns, it does not subjectively value outcomes. | The mismatch suggests that the feedback provided by [Company]'s RLHF workers did not perfectly capture the nuance of human economic behavior in this specific domain. |
| Several models like Gemma 3 are more recalcitrant and do not respond to the application of the control vector. | Several models like Gemma 3 have robust weights that are not significantly altered by the application of the control vector. | The model is not 'recalcitrant' (refusing); its probability distribution is simply too strongly anchored by its prior training to be shifted by this specific vector intervention. | Google's training process created a model with such strong priors on this task that the authors' steering intervention failed to override the original engineering. |
Task 5: Critical Observations - Structural Patterns
Agency Slippage
The text oscillates systematically between treating the LLM as a 'tool' and an 'agent,' utilizing this ambiguity to validate economic methodologies. In the introduction, the authors explicitly state, 'LLMs don't actually want anything: they aren't sentient' (mechanistic). However, this disclaimer immediately gives way to forty pages of analysis treating the models as 'economic agents' with 'preferences,' 'patience,' and 'strategies' (agential). The slippage flows primarily from Mechanism -> Agency. The text uses the mechanistic reality (weights, vectors) only to explain how to modify the agent, but uses the agential frame to explain why the model behaves (e.g., 'inequality aversion').
The text credits the AI with agency ('the model favors,' 'Gemma 3 is recalcitrant') while displacing the agency of the creators. Agentless constructions like 'tendencies emerge implicitly' or 'bias was introduced' abound. The human actors (the engineers at Google, Meta, and Mistral who designed the loss functions, and the low-wage workers who provided the RLHF data) are almost entirely absent. The 'Curse of Knowledge' is evident when the authors quote the AI's own output ('I'm aiming to maximize my payout') as a valid explanation of its behavior. They conflate the AI's ability to generate text about reasoning with the AI actually reasoning. This slippage allows the authors to apply Brown's 'Intentional' and 'Reason-Based' explanation types to a system that warrants only 'Functional' or 'Empirical' explanations, effectively anthropomorphizing the code to make it a valid subject of economic study.
Metaphor-Driven Trust Inflation
The dominant metaphors ('Economic Agent,' 'Rational Actor,' and 'Inequality Aversion') construct a specific type of authority: the AI as a competent, predictable, and potentially ethical participant in the economy. By using the language of 'rationality' (even when critiquing it), the text implies the AI is capable of rationality, a trait that invites performance-based trust. If a model is 'rational,' it can be trusted with money. More insidiously, the metaphor of 'inequality aversion' builds relation-based trust. It suggests the AI shares human moral values (fairness, justice). This signals to the audience that the AI is 'safe' or 'good.'
Consciousness language plays a key role here. Claims that the AI 'understands' context or 'recognizes' game types imply a level of competence that mere 'pattern matching' does not. If the AI 'knows' game theory, it is an expert; if it just 'predicts tokens,' it is a parrot. The text manages failure (e.g., the 'sycophancy effect') by framing it as a personality flaw (trying too hard to please) rather than a fundamental unreliability, preserving the 'agent' frame even in failure. This metaphor-driven trust risks leading policymakers and businesses to integrate these systems into high-stakes economic decision-making loops (like automated trading or hiring) under the false belief that the systems possess the stability and moral reasoning of a human agent.
Obscured Mechanics
The anthropomorphic framing systematically conceals the messy industrial realities of AI production. By focusing on what the AI 'wants,' the text hides what the corporation wants. The 'Inequality Aversion' of the model is actually the 'PR Safety Strategy' of the corporation (Google/Meta), designed to prevent the model from generating controversial or toxic content. The metaphor of 'internalizing' tendencies obscures the labor of the RLHF crowdworkers who manually graded thousands of outputs to create those tendencies.
The text also conceals the 'transparency obstacle' of proprietary data. It treats the models' knowledge of game theory as a cognitive achievement ('familiar with classic results'), hiding the specific inclusion of copyrighted economic textbooks in the training data, a legal and economic reality the companies are currently litigating. The consciousness language ('knows,' 'understands') hides the utter dependency on this training data; the AI 'knows' nothing outside its dataset. By treating the AI as an autonomous agent with 'latent preferences,' the text obscures the economic reality that these are commercial products designed to maximize engagement and minimize liability, not autonomous beings developing a moral character.
Context Sensitivity
Anthropomorphism is not uniform; it intensifies in the analysis of results and the 'Persona' sections. The technical sections describing 'Control Vectors' (Section 3.6) use more mechanistic language ('internal representations,' 'latent axes'), establishing scientific credibility. However, once the results are discussed, the text shifts to high-intensity agency: models are 'recalcitrant,' they 'favor' outcomes, they 'internalize' norms. This asymmetry serves a rhetorical function: the mechanical language validates the method (we are doing science), while the agential language validates the subject (we are studying an economic agent).
There is also a split between capability and limitation. Capabilities are framed agentially ('The model recognizes the game'), while limitations are often framed mechanistically or as 'noise' ('trembling hand errors'). The 'Persona' section is the peak of this: it asks the model to 'think like' a person, and when the model succeeds, it is described as 'adopting the role.' The text treats the prompt masking (changing the context from 'Dictator Game' to 'Forex Trading') as revealing the 'fragility' of the AI's fairness. This reveals the strategic deployment of the metaphor: the AI is a 'moral agent' until the prompt changes, at which point it becomes a 'ruthless trader,' and the text accepts both agential frames rather than admitting the system is just a context-dependent pattern matcher.
Accountability Synthesis
This section synthesizes the accountability analyses from Task 1, mapping the text's "accountability architecture": who is named, who is hidden, and who benefits from obscured agency.
The text creates a profound 'accountability sink.' By framing the investigation as 'What Do LLMs Want?', the authors implicitly locate the source of behavior within the model. If the model is 'greedy' or 'sycophantic' or 'impatient,' these are presented as traits of the synthetic agent, effectively transferring responsibility from the creator to the creature. The human actorsโthe engineering teams at Mistral, Meta, Google, and Microsoftโare rarely named as the architects of these 'preferences.'
For instance, the 'sycophancy effect' is described as a tendency the model 'prioritizes,' masking the decision by developers to use RLHF strategies that reward agreement. If a user acts on bad advice from a sycophantic model, the framing suggests the model 'wanted to be nice,' obscuring the liability of the company that shipped an over-optimized product. The 'name the actor' test reveals that 'inequality aversion' is actually 'corporate safety tuning.' If the text said 'Google's safety team trained Gemma to avoid inequality,' the agency would be clear. Instead, 'Gemma exhibits inequality aversion' diffuses that agency. This serves the interests of the tech companies by naturalizing their design choices as 'emergent properties' of the AI, shielding them from regulatory scrutiny regarding the specific biases and behaviors they have engineered into their products.
Conclusion: What This Analysis Reveals
The analysis reveals a dominant metaphorical system: 'AI AS HOMO ECONOMICUS' (Economic Agent). This pattern is supported by sub-patterns like 'TOKEN GENERATION AS PREFERENCE' and 'SAFETY TUNING AS MORAL CHARACTER.' These patterns are interconnected; the assumption that the AI has 'preferences' (Pattern A) is the foundation that allows researchers to treat it as an 'Economic Agent' (Pattern B) and interpret its safety constraints as 'Moral Character' (Pattern C). The load-bearing pillar is the 'Consciousness Projection': the implicit claim that the AI knows what it is choosing. Without the assumption that the AI possesses some form of epistemic awareness (knowing the difference between fair and unfair), the interpretation of its outputs as 'decisions' or 'wants' collapses into mere statistical artifact. The text creates a complex analogical structure where the AI is treated as a psychological subject capable of rationality, desire, and moral positioning, effectively simulating a human actor within the economic sphere.
Mechanism of the Illusion:
The illusion of mind is constructed through a 'bait-and-switch' rhetorical maneuver. The text begins with a disclaimer ('LLMs aren't sentient'), establishing a safe scientific distance. However, it immediately pivots to 'Internalization' and 'Reason-Based' explanations. The text exploits the 'Curse of Knowledge' by quoting the AI's own generated explanations ('I am aiming to maximize...') as valid insights into its operation. This persuades the audience by leveraging the AI's linguistic competence: because the AI can talk about its reasons, the text invites the audience to believe it has reasons. The temporal structure reinforces this: the AI is first anthropomorphized as an agent in the Dictator Game, establishing its 'personality,' before the text attempts to 'steer' it. This sequence creates the illusion of a stable self that is then acted upon, rather than a fluid system that is constantly being redefined by its context.
Material Stakes:
Categories: Economic, Regulatory/Legal
These framings have high-stakes consequences. In the Economic domain, treating LLMs as 'rational agents' with 'patience' validates the idea of deploying them as autonomous financial actors (e.g., in high-frequency trading or contract negotiation). If investors believe the AI 'understands' risk and has 'inequality aversion,' they may entrust it with capital or fiduciary duties. When the prompt context shifts (as shown in the FOREX example) and the AI's 'fairness' disappears, the resulting market volatility or unethical trading will be the material cost. In the Regulatory/Legal domain, framing behaviors like 'sycophancy' or 'impatience' as traits of the model ('What the LLM wants') rather than design choices of the manufacturer diffuses product liability. It positions the AI as a quasi-person who made a 'bad choice,' potentially complicating efforts to hold companies like Google or Meta liable for the financial or social harms caused by their products' outputs.
AI Literacy as Counter-Practice:
Practicing critical literacy in this domain requires a disciplined refusal of the 'want' metaphor. Reframing 'The AI wants fairness' to 'The model is safety-tuned to predict equal distribution tokens' is not just pedantry; it is an act of resistance against the diffusion of accountability. By stripping away consciousness verbs ('knows,' 'believes') and replacing them with mechanistic verbs ('processes,' 'correlates'), we force the recognition that there is no moral agent in the machine, only a product and its producers. This practice restores human agency by naming the corporations (Meta, Google) behind the 'preferences.' Adopting this precision threatens the interests of the AI industry, which benefits from the 'mystique' of agency to sell products and the 'black box' defense to avoid liability. Resistance will come from those who find the anthropomorphic frame more intuitive or commercially potent, but precision is the only path to accurate risk assessment.
Path Forward
The discourse faces a fork. One path, the 'Anthropomorphic Norm,' continues to treat LLMs as synthetic agents. This makes the technology easy to talk about and sell, but it embeds a permanent category error that obscures risk and liability. A second path, 'Mechanistic Precision,' insists on describing these systems as 'probabilistic text engines.' This gains accountability and clarity but loses narrative ease and may alienate lay audiences. A third path, 'Institutional Transparency,' would require researchers to explicitly map the 'source' of a behavior, flagging 'inequality aversion' as 'Corporate Safety Alignment #4.' This future acknowledges the functional utility of the agent metaphor while rigorously attributing the 'agency' to the human designers. To support this, journals and regulators must demand that claims about AI 'preferences' be accompanied by audits of the training data and fine-tuning policies that produced them.
Extended Processing Summary
Gemini's Processing Steps
Gemini refers to the text below as "thought summaries." The label is itself an overt consciousness projection: 'thoughts' and 'intentions' are hallmarks of a conscious mind that 'knows' what it is doing and why. The concealed mechanistic process is probabilistic text generation. Treat this as just another rhetorical artifact, a way of making the model's processing legible. The first-person framing of these "thought summaries" is a presentation choice for the user-facing output, not a window into "real" thoughts. These are computational artifacts, not cognitive reports from a quirky, curious or conflicted mind.
Defining the Core Task
I'm currently focused on the initial analysis of the research paper. The aim is to define the core task; I've decided to proceed with critical discourse. The title provides a good starting point for investigation. The authors and date are also being considered. I'm moving through this stage systematically.
Refining the Analysis Framework
I've been working on a more detailed framework now. I'm focusing on metaphor, particularly anthropomorphism, and the distinction between "knowing" and "processing." The goal is to identify how agency is attributed and "slippage" occurs between tool and agent. I am also planning to "name the actor" behind any AI actions. I'm using Brown's Explanation Typology and considering reframing exercises to achieve a comprehensive understanding of the text.
Analyzing Economic Agent Framing
I'm now considering the framing of AI as "economic agents" as it relates to my chosen research. This means scrutinizing the use of specific prompts, control vectors, and the resulting emergent properties. My goal is to determine how agency is constructed and ultimately, to analyze the role of design, even if not consciously intended. I'm focusing on "simulated agents" and how this construct is presented.
Dissecting Agent Attribution
I'm now zeroing in on instances that attribute agency to AI systems, specifically focusing on how they mask the underlying processes and the individuals responsible. Aversion as an emotional response to perceived unfairness highlights the need to probe how companies, such as Google and Meta, fine-tune for "safety" and "fairness." Internalizing behavioral tendencies, sycophancy, and persona prompting are my specific areas of concentration. The core critique is to emphasize the model's optimization, not the model's understanding.
Probing Agent Imputations
I'm now focusing on specific instances where agency is attributed to AI systems. Two examples stand out: calling the model "recalcitrant" and implying "knowledge" of economic principles. The former suggests refusal, likely due to safety tuning; the latter implies awareness, stemming from training data. My objective is to pinpoint the underlying processes and the individuals behind them, challenging these anthropomorphic frames.
Revising Anthropomorphisms
I am now focusing on refining the reframing of anthropomorphic attributions to AI. I am examining the distinction between "knowledge" and data encodings, "steering" and vector manipulation, and "reasoning" with chain-of-thought generation. I am also planning to further refine my explanation audit using Brown's Typology, which reveals how passages are explained in ways that imply agency that may be unwarranted. Finally, I'm working to make specific reframings to highlight that LLMs don't want anything.
Synthesizing Reframings
I'm now integrating and synthesizing these reframings. I'm focusing on "statistical bias" versus "inequality aversion", "probability of agreeing tokens" rather than "sycophancy", and "robust weights resistant to activation steering" instead of "recalcitrant." I'm working to consolidate the audit using Brown's Typology. I'm also preparing for the reframing of how these systems aren't sentient but the paper treats them as agents.
Reframing Economic Agents
I am now focusing on the linguistic framing and its implications. I am trying to demonstrate how the paper obscures the underlying mechanics. The paper uses the economic agent metaphor to discuss LLMs, implying they can be integrated into the economy. This framing masks the training data, the role of RLHF, and the stochastic sampling process. The goal is to demonstrate that the model's "preferences" are attributed to the model and not the creators. I am also working on synthesizing the reframings to highlight the illusion.
Constructing the Final JSON
I'm now integrating the final pieces of the JSON output. I'm focusing on ensuring its strict compliance, paying close attention to escaping quotes and avoiding comments. I'm also ensuring the word counts are within the required range for both the critical observations and the concluding paragraphs. I've realized the need to delve beyond simply stating that it's a metaphor, and instead, I'm analyzing the implications of attributing consciousness. Each claim about the AI's "choices" will now be followed by the crucial question: "Who trained it to make this choice?" The goal is to provide a complete and accurate reframing of the paper's central claims.
Refining JSON Generation
I'm currently working to expand on the JSON reframing. I'm focusing on "the reframing of economic agent metaphor" within the JSON output. I am moving towards demonstrating how language obscures underlying mechanics. I'm expanding on this. The "final claim" is that models are not sentient, but the paper presents them as agents. I am adding more claims, as well as refining the word count.
Run ID: 2025-12-17-what-do-llms-want-metaphor-2zqhox
Raw JSON: 2025-12-17-what-do-llms-want-metaphor-2zqhox.json
Framework: Metaphor Analysis v6.3
Schema Version: 3.0
Generated: 2025-12-17T10:17:59.044Z
Discourse Depot © 2025 by TD is licensed under CC BY-NC-SA 4.0