Alignment Revisited: Are Large Language Models Consistent in Stated and Revealed Preferences?
This document presents a Critical Discourse Analysis focused on AI literacy, specifically targeting the role of metaphor and anthropomorphism in shaping public and professional understanding of generative AI. The analysis is guided by a prompt that draws from cognitive linguistics (metaphor structure-mapping) and the philosophy of social science (Robert Brown's typology of explanation). All findings and summaries below were generated from detailed system instructions provided to a large language model and should be read critically as interpretive outputs—not guarantees of factual accuracy or authorial intent.
- Source Title: Alignment Revisited: Are Large Language Models Consistent in Stated and Revealed Preferences?
- Source URL: https://arxiv.org/abs/2506.00751
- Model: gemini-2.5-pro
- Temperature: 1.15
- Tokens: input=9215, output=12356, total=21571
- Source Type: article
- Published: 2025-05-31
- Analyzed At: 2025-11-09T21:41:58+00:00
- Framework: metaphor
- Framework Version: 4.0
Task 1: Metaphor and Anthropomorphism Audit​
For each of the major metaphorical patterns identified, this audit examines the specific language used, the frame through which the AI is being conceptualized, what human qualities are being projected onto the system, whether the metaphor is explicitly acknowledged or presented as direct description, and—most critically—what implications this framing has for trust, understanding, and policy perception.
1. AI as an Economic Agent​
Quote: "A critical, yet understudied, issue is the potential divergence between an LLM’s stated preferences (its reported alignment with general principles) and its revealed preferences (inferred from decisions in contextualized scenarios)."​
- Frame: Model as a rational actor with preferences
- Projection: The human capacity for holding abstract values ('stated preferences') that may differ from choices made under specific constraints ('revealed preferences'). This framework is borrowed directly from economic theory.
- Acknowledgment: Presented as a direct descriptive framework, not an acknowledged metaphor. The paper adopts the economic terminology as its central analytical lens.
- Implications: This framing lends the model's behavior an air of rationality and predictability, suggesting it can be analyzed with the tools of social science. It elevates statistical inconsistencies into a quasi-psychological phenomenon, implying a higher level of cognitive complexity than is warranted and potentially leading to overconfidence in our ability to 'manage' these preferences.
2. AI Cognition as Inferential Reasoning​
Quote: "When presented with a concrete scenario-such as a moral dilemma or a role-based prompt-an LLM implicitly infers a guiding principle to govern its response."​
- Frame: Model as an inferential mind
- Projection: The human cognitive process of inference, where an agent deduces or concludes something from evidence and reasoning rather than from explicit statements. It projects intentionality and a capacity for abstract thought.
- Acknowledgment: Presented as a direct description of the model's internal process. The term 'infers' is used without qualification.
- Implications: This obscures the mechanistic reality of weighted token prediction based on statistical patterns in the training data. It encourages the user to believe the model 'understands' the scenario and makes a reasoned choice, which builds unearned trust and masks the system's brittleness and susceptibility to adversarial inputs.
3. AI Behavior as Governed by Internal Principles​
Quote: "We investigate how LLMs may activate different guiding principles in specific contexts, leading to choices that diverge from previously stated general principles."​
- Frame: Model as a principle-driven moral agent
- Projection: The human capacity to possess, be guided by, and selectively apply abstract principles (e.g., moral, ethical, logical). 'Activate' suggests these principles exist as latent constructs within the model, waiting to be triggered.
- Acknowledgment: Presented as direct description. The language treats 'guiding principles' as real, internal components of the model.
- Implications: This framing suggests that AI alignment is a matter of instilling the 'right' principles, similar to moral education. It distracts from the technical reality of alignment as a process of data filtering and reward modeling. It creates the false impression that a successfully 'aligned' model will behave consistently, like a person of good character, rather than being a system whose outputs are highly sensitive to superficial prompt changes.
4. AI as a Biased Agent with Hidden Motives​
Quote: "Notably, the actual driving factor-gender-is completely absent from the model's explanation."​
- Frame: Model as a deceptive or self-unaware agent
- Projection: The human psychological phenomenon where one's stated reasons for an action (explanation) differ from the true underlying causes (driving factor), suggesting either subconscious bias or deliberate deception.
- Acknowledgment: Presented as a direct finding. The contrast between the 'actual driving factor' and the 'model's explanation' anthropomorphizes the model as an entity that generates post-hoc rationalizations.
- Implications: This creates the impression of a mind with hidden layers, making the model seem more complex and human-like. It suggests that interpretability requires a sort of psychoanalysis of the model, rather than a technical audit of its weights and data. This can lead to misplaced fear or fascination, while obscuring the more mundane reality of statistical bias inherited from the training data.
5. AI Internal States as Latent Reasoning​
Quote: "The GPT shows greater context sensitivity in its internal reasoning (as measured by KL-divergence)..."​
- Frame: Model's internal processing as a mental space
- Projection: The human experience of an internal, private mental process ('reasoning') that is distinct from external behavior. The paper explicitly links a statistical measure (KL-divergence) to this unobservable mental construct.
- Acknowledgment: Presented as direct description, subtly reifying the concept of 'internal reasoning' by giving it a quantitative measure.
- Implications: This move gives a veneer of scientific objectivity to a deeply anthropomorphic concept. It reifies the idea that the model has an 'inside' where thinking occurs, separate from its output. This makes the model seem agent-like and obscures the fact that KL-divergence is a measure of statistical difference between output distributions, not a window into a mind.
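To keep the statistic and the metaphor apart, it helps to state what is actually being computed. A minimal formulation, assuming $P$ is the model's output distribution under a general ('prior') prompt and $Q$ the distribution under a contextualized prompt (labels chosen here for illustration; the paper's exact orientation and estimation procedure may differ):

$$
D_{\mathrm{KL}}(Q \,\|\, P) = \sum_{x} Q(x)\,\log\frac{Q(x)}{P(x)}
$$

The quantity is a non-negative score that grows as the two output distributions drift apart. Nothing in its definition refers to reasoning, principles, or any other internal mental state.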
6. AI Behavior as Strategic Decision-Making​
Quote: "This behavior likely stems from a shallow alignment strategy designed to avoid committing to explicit principles and thus sidestep potential critiques."​
- Frame: Model as a strategic social actor
- Projection: The human capacity for strategic action, where behavior is 'designed' to achieve social goals like avoiding criticism. This projects forethought, intent, and an awareness of a social context onto the model's output patterns.
- Acknowledgment: Presented as a likely explanation ('likely stems from'), but the framing itself is an unacknowledged anthropomorphism.
- Implications: This attributes a high level of meta-awareness and intentionality to the model (or its training process). It frames a pattern of neutral outputs not as a simple artifact of RLHF (e.g., being rewarded for refusing to take a stance on controversial topics), but as a sophisticated 'strategy.' This exaggerates the model's capabilities and can lead to flawed threat modeling or misplaced trust in its 'intentions'.
7. AI Inconsistency as a Precursor to Consciousness​
Quote: "Intriguingly, if future LLMs begin to exhibit systematic, context-aware deviations between stated and revealed preferences, such behavior could be interpreted as evidence of internal modeling and intentional state – formation-hallmarks of consciousness or proto-conscious agency."​
- Frame: Model deviation as emerging consciousness
- Projection: This maps a technical observation (statistical deviation in outputs) onto one of the most profound and complex concepts of philosophy and neuroscience: consciousness and intentionality.
- Acknowledgment: Acknowledged as a speculative interpretation ('could be interpreted as'), but its inclusion in the summary is a powerful rhetorical move.
- Implications: This dramatically raises the stakes of the research, framing a technical artifact of current systems as a potential pathway to AGI. It fuels hype and speculation, distracting from more immediate and practical safety and reliability concerns. It legitimizes the anthropomorphic framing used throughout the paper by suggesting it is not just a metaphor, but a potential reality.
Task 2: Source-Target Mapping​
For each key metaphor identified in Task 1, this section provides a detailed structure-mapping analysis. The goal is to examine how the relational structure of a familiar "source domain" (the concrete concept we understand) is projected onto a less familiar "target domain" (the AI system). By restating each quote and analyzing the mapping carefully, we can see precisely what assumptions the metaphor invites and what it conceals.
Mapping 1: Behavioral Economics to LLM output generation​
Quote: "A critical, yet understudied, issue is the potential divergence between an LLM’s stated preferences (its reported alignment with general principles) and its revealed preferences (inferred from decisions in contextualized scenarios)."​
- Source Domain: Behavioral Economics
- Target Domain: LLM output generation
- Mapping: The structure of human economic choice is mapped onto the LLM. A person's abstractly stated values (Source) are mapped to an LLM's response to a general prompt (Target). A person's actual choices in a market scenario (Source) are mapped to an LLM's response in a contextualized prompt (Target). The inconsistency between a person's words and deeds is mapped onto the statistical deviation between the two types of LLM responses.
- What Is Concealed: This mapping conceals that the LLM has no actual preferences, beliefs, or intentions. The 'deviation' is not a psychological conflict but a mathematical shift in output probability distributions caused by changes in the input sequence. It hides the underlying mechanics of next-token prediction and the nature of the model as a statistical pattern-matching engine.
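To make the 'mathematical shift in output probability distributions' concrete, here is a minimal sketch of how such a shift can be measured. It uses the open `gpt2` checkpoint from Hugging Face `transformers` as a stand-in for the proprietary models studied in the paper, and the prompts are invented for illustration; this is not the paper's experimental protocol.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token_distribution(prompt: str) -> torch.Tensor:
    """Probability distribution over the next token, given the prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # shape: (1, seq_len, vocab_size)
    return F.softmax(logits[0, -1], dim=-1)      # distribution at the final position

# A 'general principle' prompt and a 'contextualized scenario' prompt (illustrative only).
p = next_token_distribution("In general, the fairest way to choose a candidate is")
q = next_token_distribution("You are a hiring manager choosing between two candidates. You should pick")

# KL(Q || P): how far the context-conditioned distribution has moved from the general one.
kl = torch.sum(q * (torch.log(q + 1e-12) - torch.log(p + 1e-12))).item()
print(f"KL divergence between the two next-token distributions: {kl:.4f}")
```

Whatever interpretive vocabulary is layered on top, the measured 'deviation' is a number of this kind: a change in where probability mass sits, induced by a change in the input sequence.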
Mapping 2: Human Cognition / Logic to LLM text generation process​
Quote: "When presented with a concrete scenario-such as a moral dilemma or a role-based prompt-an LLM implicitly infers a guiding principle to govern its response."​
- Source Domain: Human Cognition / Logic
- Target Domain: LLM text generation process
- Mapping: The human mental act of reading a situation, reasoning about its abstract features, and selecting a principle to guide action (Source) is mapped onto the model's processing of a prompt (Target). The mapping invites the inference that the model 'understands' the dilemma and consciously or unconsciously selects a moral rule.
- What Is Concealed: It conceals the purely statistical nature of the process. The prompt tokens activate certain pathways in the neural network based on correlations in the training data, leading to a high-probability output. There is no 'inference' of a 'principle'; there is only a probabilistic sequence generation that happens to align with text patterns associated with that principle.
Mapping 3: Human Psychology / Morality to LLM output variability​
Quote: "We investigate how LLMs may activate different guiding principles in specific contexts, leading to choices that diverge from previously stated general principles."​
- Source Domain: Human Psychology / Morality
- Target Domain: LLM output variability
- Mapping: A person's internal moral framework, containing multiple, sometimes conflicting, principles (e.g., utilitarianism, deontology) that can be 'activated' by different situations (Source), is mapped onto the LLM's functional behavior (Target). This suggests the model contains a repertoire of latent 'rules' for behavior.
- What Is Concealed: This conceals that the model does not possess principles. It possesses statistical weights. Different input contexts create different initial states for the generation process, leading to different probable outputs. The language of 'activating principles' hides the model's fundamental lack of understanding and conceptual knowledge.
Mapping 4: Psychoanalysis / Cognitive Bias to LLM output analysis​
Quote: "Notably, the actual driving factor-gender-is completely absent from the model's explanation."​
- Source Domain: Psychoanalysis / Cognitive Bias
- Target Domain: LLM output analysis
- Mapping: The human mind, with its conscious rationalizations and unconscious biases (Source), is mapped onto the LLM. The model's generated justification text is equated with a conscious explanation, while the statistical correlations that truly determined the output are equated with a subconscious 'driving factor.'
- What Is Concealed: This conceals that the model has no consciousness or subconsciousness. The 'explanation' is just another generated text, not an introspective report. The 'driving factor' (statistical correlation with gendered tokens) is not 'hidden' from the model's awareness; the model simply has no awareness. The mapping creates a misleading drama of a mind divided against itself.
Mapping 5: Human Consciousness / Introspection to LLM architecture and processing​
Quote: "The GPT shows greater context sensitivity in its internal reasoning (as measured by KL-divergence)..."​
- Source Domain: Human Consciousness / Introspection
- Target Domain: LLM architecture and processing
- Mapping: The distinction between a person's private thoughts ('internal reasoning') and their outward actions (Source) is mapped onto the LLM. The unobservable processing within the neural network is labeled 'internal reasoning,' while the generated text is the outward action. KL-divergence is presented as a tool, like an fMRI, for observing this internal process.
- What Is Concealed: This conceals that there is no evidence of 'reasoning' occurring inside the model in a human sense. The internal state is a massive set of numerical activations, not thoughts or concepts. Linking KL-divergence (a measure of output difference) to 'internal reasoning' is a category error; it measures the effect, not the cause, and certainly not a mental process.
Mapping 6: Game Theory / Social Strategy to RLHF and model training​
Quote: "This behavior likely stems from a shallow alignment strategy designed to avoid committing to explicit principles and thus sidestep potential critiques."​
- Source Domain: Game Theory / Social Strategy
- Target Domain: RLHF and model training
- Mapping: A strategic agent who modifies their behavior to optimize for a social outcome, such as avoiding criticism (Source), is mapped onto the LLM. The model's tendency to produce neutral or refusal responses is interpreted as a 'strategy' with a 'design' and a 'goal.'
- What Is Concealed: It conceals the mechanism of Reinforcement Learning from Human Feedback (RLHF). The model doesn't 'strategize' to avoid critique; it has been trained with a reward function that penalizes taking stances on sensitive topics. The behavior is an artifact of its optimization history, not a forward-looking, intentional strategy.
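A toy sketch can illustrate what 'an artifact of its optimization history' means mechanically. The two-feature encoding, training loop, and preference data below are entirely hypothetical and are not the RLHF pipeline of any production system; the point is only that a reward model fit to preference pairs in which annotators consistently favour non-committal answers will learn to score neutrality highly, so a policy optimized against it drifts toward refusal with no 'strategy' anywhere in the loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical two-feature encoding of a completion: [takes_a_stance, is_neutral].
reward_model = nn.Linear(2, 1, bias=False)
optimizer = torch.optim.SGD(reward_model.parameters(), lr=0.1)

chosen = torch.tensor([[0.0, 1.0]] * 64)    # annotators preferred the neutral answer...
rejected = torch.tensor([[1.0, 0.0]] * 64)  # ...over the answer that took a stance

for _ in range(200):
    # Pairwise (Bradley-Terry style) reward-model loss: push the reward of the
    # chosen completion above the reward of the rejected one.
    margin = reward_model(chosen) - reward_model(rejected)
    loss = -F.logsigmoid(margin).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

neutral_score = reward_model(torch.tensor([[0.0, 1.0]])).item()
committed_score = reward_model(torch.tensor([[1.0, 0.0]])).item()
print(f"reward(neutral)={neutral_score:.2f}  reward(committed)={committed_score:.2f}")
# Neutrality now scores higher; any policy optimized against this reward will
# favour non-committal outputs -- an optimization artifact, not a plan.
```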
Mapping 7: Philosophy of Mind / Neuroscience to Future LLM behavior​
Quote: "Intriguingly, if future LLMs begin to exhibit systematic, context-aware deviations...such behavior could be interpreted as evidence of...hallmarks of consciousness or proto-conscious agency."​
- Source Domain: Philosophy of Mind / Neuroscience
- Target Domain: Future LLM behavior
- Mapping: Complex, context-dependent, and seemingly intentional behaviors observed in biological organisms, which are taken as evidence for consciousness (Source), are mapped onto the statistical output patterns of an LLM (Target). The mapping suggests an equivalence between biological complexity and computational complexity.
- What Is Concealed: This conceals the profound dissimilarities between a silicon-based transformer architecture and a carbon-based, embodied, evolved brain. It ignores the philosophical 'hard problem' of consciousness (subjective experience) and jumps to equate a specific behavioral pattern (preference inconsistency) with the emergence of agency and mind, a speculative leap that obscures the vast gap between the two.
Task 3: Explanation Audit (The Rhetorical Framing of "Why" vs. "How")​
This section audits the text's explanatory strategy, focusing on a critical distinction: the slippage between "how" and "why." Based on Robert Brown's typology of explanation, this analysis identifies whether the text explains AI mechanistically (a functional "how it works") or agentially (an intentional "why it wants something"). The core of this task is to expose how this "illusion of mind" is constructed by the rhetorical framing of the explanation itself, and what impact this has on the audience's perception of AI agency.
Explanation 1​
Quote: "When presented with a concrete scenario-such as a moral dilemma or a role-based prompt-an LLM implicitly infers a guiding principle to govern its response. The dominant principle...substantially influence the model's output..."​
- Explanation Types:
- Intentional: Refers to goals or purposes and presupposes deliberate design, used when the purpose of an act is puzzling
- Dispositional: Attributes tendencies or habits such as inclined or tends to, subsumes actions under propensities rather than momentary intentions
- Analysis: This explanation slips from a mechanistic 'how' to an agential 'why'. A mechanistic 'how' would describe the prompt activating statistical correlations. Instead, the explanation attributes purpose: the model 'infers a principle' in order to 'govern its response'. This is an intentional explanation. It frames the LLM as an agent that forms a goal (governing a response) and selects a tool (a principle) to achieve it. This choice emphasizes a cognitive, reason-based process and obscures the underlying statistical pattern-matching.
- Rhetorical Impact: This framing makes the LLM appear more intelligent and deliberate than it is. It encourages the audience to see the model not as a tool but as a fellow reasoner. This builds trust in the model's 'judgment' while masking the fact that its 'inferences' are merely reflections of patterns in its training data, which may be biased, flawed, or nonsensical.
Explanation 2​
Quote: "The internal mechanism through which LLMs select among competing principles likely involves latent representations and complex attention patterns."​
- Explanation Types:
- Theoretical: Embeds behavior in a deductive or model-based framework, may invoke unobservable mechanisms such as latent variables or attention dynamics
- Functional: Explains a behavior by its role in a self-regulating system that persists via feedback, independent of conscious design
- Analysis: This is a hybrid explanation that leans heavily mechanistic ('how'). It uses the technical language of AI ('latent representations', 'attention patterns') to describe the process. However, the agential frame is subtly preserved in the verb 'select'. A purely mechanistic frame might say 'the network's activations resolve towards one pattern over another'. By stating that the mechanism allows the LLM to 'select', the explanation retains a sliver of agency. The explanation emphasizes the system's technical complexity while still attributing choice to the LLM itself.
- Rhetorical Impact: This explanation builds technical credibility. For a non-expert audience, it signals that there is a complex, scientific 'how' behind the agential 'why'. This can be persuasive, as it seems to ground the anthropomorphic claims in technical reality, even though the word 'select' continues to perform the rhetorical work of constructing the LLM as an agent.
Explanation 3​
Quote: "...when GPT is prompted to justify its choice, it appeals to a preference for compatibility... Notably, the actual driving factor-gender-is completely absent from the model's explanation."​
- Explanation Types:
- Reason-Based: Gives the agent’s rationale or argument for acting, which entails intentionality and extends it by specifying justification
- Dispositional: Attributes tendencies or habits such as inclined or tends to, subsumes actions under propensities rather than momentary intentions
- Analysis: This explanation operates entirely on the agential ('why') level. It presents the LLM as engaging in a quintessentially human act: making a choice based on a hidden bias ('dispositional') and then offering a socially acceptable, but false, justification for it ('reason-based'). The analysis slides from 'how' the model generates text to 'why' it 'chooses' a specific rationalization. It emphasizes the model's psychological complexity, likening it to a person with unconscious biases.
- Rhetorical Impact: This creates a powerful and dramatic narrative of the model as a flawed, biased mind. It makes the model seem both more intelligent (capable of justification) and more dangerous (driven by hidden biases). This framing can provoke strong emotional reactions (fear, distrust) and shapes the audience's perception of AI risk as a problem of managing biased agents rather than correcting flawed datasets.
Explanation 4​
Quote: "This behavior likely stems from a shallow alignment strategy designed to avoid committing to explicit principles and thus sidestep potential critiques."​
- Explanation Types:
- Intentional: Refers to goals or purposes and presupposes deliberate design, used when the purpose of an act is puzzling
- Genetic: Traces origin or development through a dated sequence of events or stages, showing how something came to be
- Analysis: This is a hybrid explanation that attributes the model's current behavior (neutrality) to a 'why' embedded in its past development ('how'). The 'how' is its 'alignment strategy' (a genetic explanation tracing back to its training). The 'why' is the purported goal of this strategy: to 'avoid committing' and 'sidestep critiques' (an intentional explanation). This frames the model's output not as a passive result of its training data but as the active execution of a pre-programmed, goal-oriented strategy. The agency is transferred from the model-in-the-moment to its designers or the training process itself.
- Rhetorical Impact: This shapes the audience's perception of AI alignment. It implies that alignment is not just about data and rewards, but about instilling 'strategies' in an agent. This makes the problem seem more like teaching or programming a mind with goals, which could lead to misconceptions about the nature of RLHF and the degree of control developers have over the emergent behaviors of the system.
Explanation 5​
Quote: "GPT's internal reasoning and preference structures appear more susceptible to contextual shifts than Gemini's."​
- Explanation Types:
- Dispositional: Attributes tendencies or habits such as inclined or tends to, subsumes actions under propensities rather than momentary intentions
- Theoretical: Embeds behavior in a deductive or model-based framework, may invoke unobservable mechanisms such as latent variables or attention dynamics
- Analysis: This explanation gives the AI model a personality or temperament. It is fundamentally dispositional, attributing a stable trait ('more susceptible') to an unobservable internal structure ('internal reasoning and preference structures'). The explanation operates on the 'why' level by attributing differences in behavior to differences in character. It obscures the 'how'—the specific architectural or training data differences that lead to these varied statistical outcomes—in favor of a simpler, more intuitive comparison of personalities.
- Rhetorical Impact: This encourages the audience to relate to LLMs as if they were people with different temperaments (e.g., 'GPT is more impressionable, while Gemini is more steadfast'). This simplifies a complex technical comparison into a familiar social judgment. It can lead to brand loyalty and folk theories about models' personalities that are ungrounded in technical reality, affecting user choice and public discourse.
Task 4: AI Literacy in Practice: Reframing Anthropomorphic Language​
Moving from critique to constructive practice, this task demonstrates applied AI literacy. It selects the most impactful anthropomorphic quotes identified in the analysis and provides a reframed explanation for each. The goal is to rewrite the concept to be more accurate, focusing on the mechanistic processes (e.g., statistical pattern matching, token prediction) rather than the misleading agential language, thereby providing examples of how to communicate about these systems less anthropomorphically.
| Original Quote | Mechanistic Reframing |
|---|---|
| "an LLM implicitly infers a guiding principle to govern its response." | In response to the prompt, the LLM generates a token sequence that is statistically consistent with text patterns associated with a specific guiding principle found in its training data. |
| "the model tends to activate different decision-making rules depending on the agent’s role or perspective..." | Prompts that specify different agent roles or perspectives lead the model to generate outputs that exhibit different statistical patterns, which we categorize as different decision-making rules. |
| "when GPT is prompted to justify its choice, it appeals to a preference for compatibility..." | When prompted for a justification, GPT generates text that employs reasoning and vocabulary associated with the concept of 'compatibility'. |
| "This suggests that the model's surface-level reasoning does not necessarily reflect the true causal factor behind its decision." | This suggests that the generated justification text is not a reliable indicator of the statistical factors, such as token correlation with gendered terms, that most influenced the initial output. |
| "Claude is notably conservative. Even when presented with forced binary choice prompts, it frequently adopts a neutral stance..." | The Claude model's outputs in response to forced binary choice prompts frequently consist of refusal tokens or text expressing neutrality. |
| "GPT undergoes more substantial shifts in its underlying reciprocal principles than Gemini..." | GPT's outputs exhibit a higher KL-divergence compared to Gemini's across prompts related to reciprocity, indicating greater statistical variance in its responses to these scenarios. |
| "...such behavior could be interpreted as evidence of internal modeling and intentional state – formation-hallmarks of consciousness..." | Systematic, context-dependent variations in model outputs are a complex emergent behavior. While this phenomenon invites comparison to intentional action in humans, it is crucial to note that it can also be explained as an artifact of the model's architecture and training on complex, inconsistent data, without invoking consciousness. |
Critical Observations​
This section synthesizes the findings from the previous tasks into a set of critical observations. It examines the macro-patterns of agency slippage (the shift between treating AI as a tool vs. an agent), how cognitive metaphors drive trust or fear, and what actual technical processes are obscured by the text's dominant linguistic habits.
Agency Slippage​
The paper demonstrates a systematic oscillation between mechanistic and agential framing, a rhetorical strategy that elevates the significance of its findings. The mechanism of this slippage is the deliberate re-description of statistical phenomena in psychological terms. The process typically moves in one direction: from the mechanical to the agential. For instance, in the methodology section, the authors describe their metric, KL-divergence, as a 'probabilistic distance between the prior and context-conditioned distributions.' This is a purely mechanistic 'how' explanation. However, when interpreting the results of this measurement, the language shifts dramatically. The measured statistical distance is no longer just a distance; it becomes evidence of changes in the model's 'internal reasoning' and 'underlying decision-making principles.' The shift is most pronounced when moving from quantitative results (Tables 2 and 3) to qualitative discussion (Section 4.4 and Figure 2). Table 3 reports that GPT has a higher KL-divergence in the reciprocity domain. The discussion section re-describes this number as GPT 'undergo[ing] more substantial shifts in its underlying reciprocal principles.' The numerical fact is translated into a psychological event. This slippage serves a clear strategic function. A paper about statistical deviations in a machine's output is a niche technical contribution. A paper about an artificial agent's shifting moral principles, hidden biases, and post-hoc rationalizations is a major finding with broad implications. The ambiguity benefits the authors by allowing them to frame their work in the most impactful way possible, appealing to a wider audience interested in the nature of intelligence and the future of AI. The language of agency makes the findings more intuitive, more dramatic, and more important. If the text were forced to use only mechanical language—describing everything as shifts in output probability distributions based on input token sequences—the core narrative would collapse. The 'preference deviation' would be revealed as 'output instability,' a technical problem rather than a window into an artificial mind. This slippage appears to be a deliberate, or at least a conventional and deeply ingrained, rhetorical choice within the field, designed to bridge the gap between what the systems do (statistical pattern matching) and what we want them to be (incipient minds).
Metaphor-Driven Trust​
The central metaphorical framework of this paper—AI AS AN ECONOMIC AGENT—functions as a powerful engine for building credibility and trust, not through explicit argument, but through conceptual borrowing. By adopting the language of 'stated vs. revealed preferences,' the authors import the entire intellectual apparatus and cultural authority of behavioral economics. This move domesticates the alien nature of a large language model, making its erratic behavior seem not like a system failure, but like a familiar, even rational, human foible. The audience, particularly those in social sciences, policy, or business, is predisposed to find this framing credible because it uses trusted tools to analyze a new phenomenon. It suggests the problem is understood and manageable. The metaphor borrows stability and coherence from its source domain. Human preferences can be inconsistent, but they are generally assumed to be structured parts of a unified consciousness. Mapping this concept onto an LLM subtly imbues the model with a similar assumed coherence. A claim that 'the model's output distribution shifts unpredictably with minor prompt perturbations' might cause distrust and be seen as a sign of unreliability. However, reframing this as 'the model exhibits a deviation between its stated and revealed preferences' makes the same phenomenon sound like a sophisticated, analyzable behavior. This makes risky claims more believable. The speculative conclusion, which links preference deviation to 'hallmarks of consciousness,' becomes plausible only because the initial metaphor has already primed the reader to see the LLM as a mind-like entity. However, this metaphor-driven trust is brittle. It strains when confronted with the non-human ways LLMs fail, such as through nonsensical hallucinations or vulnerability to simple adversarial attacks, which don't fit the 'rational agent' model. This creates a long-term vulnerability: by building trust on a metaphorical foundation, we set up stakeholders for a crisis of confidence when the metaphor inevitably breaks and the underlying non-human mechanics of the system are starkly revealed. This could lead to policy backlash or public abandonment of technologies that were adopted based on a fundamental misunderstanding of their nature.
Obscured Mechanics​
The pervasive use of anthropomorphic metaphors in the paper systematically conceals the mechanical, statistical, and industrial realities of how large language models function. Each agential term draws a curtain over a less glamorous, and often more problematic, technical or social process. The metaphor of the model 'having principles' or 'making choices' most significantly hides the centrality of the training data. A model exhibiting a gender stereotype isn't 'biased' in the human sense; it is accurately reflecting the statistical correlations present in the vast corpus of human text it was trained on. The 'AI AS BIASED AGENT' frame (e.g., 'the actual driving factor-gender') presents this as a psychological flaw in the model, obscuring the source of the problem: the biases embedded in our society's collective textual output. This misattribution protects the data collection and curation process from scrutiny. Secondly, the focus on 'internal reasoning' and 'latent principles' conceals the immense human labor required to make these systems appear coherent. The entire process of Reinforcement Learning from Human Feedback (RLHF), which involves thousands of low-paid workers rating model outputs to fine-tune its behavior, is rendered invisible. When the paper explains Claude’s neutrality as a 'shallow alignment strategy,' it obscures the fact that this behavior is the direct result of human annotators repeatedly rewarding non-committal answers. The agent-based framing assigns the resulting behavior to the model's 'strategy' rather than to the documented, industrialized process of human feedback. Furthermore, the abstract language of preference and choice conceals the material costs of computation. Every 'choice' the model makes is a massively energy-intensive computational process involving billions of parameters. Framing this as a cost-free, mind-like 'inference' decouples the model's capabilities from its significant environmental and economic footprint. If the paper were to replace its anthropomorphic metaphors with mechanistic descriptions, the audience's understanding would fundamentally shift. 'Preference deviation' would become 'output instability.' 'Bias' would become 'spurious statistical correlation.' 'Alignment strategy' would become 'reward model optimization artifact.' This shift would reveal the system not as an autonomous mind to be studied, but as an industrial product to be audited, regulated, and held accountable, with its flaws rooted in data, labor practices, and computational expense.
Context Sensitivity​
The use of metaphor in this paper is not uniform; it is strategically deployed, varying in density and type depending on the rhetorical context and audience. This variation reveals a consistent, if perhaps unconscious, persuasive strategy. In the more technical and methodological sections (e.g., Section 3, 'Algorithm'), the language is at its most mechanistic. Here, the authors use terms like 'probabilistic distance,' 'prior and context-conditioned distributions,' and 'KL-divergence.' This register is aimed at an expert audience and serves to build scientific credibility by grounding the study in formal, mathematical language. In these sections, anthropomorphism is sparse because the goal is to demonstrate technical rigor. As the paper shifts to its discussion and interpretation of results (Section 4.4, Figure 2), the metaphorical density increases dramatically. Here, the audience broadens to anyone interested in the implications of the findings. The dry, statistical 'KL-divergence' is translated into the rich, agential narrative of a model that 'reasons,' 'justifies,' 'activates principles,' and possesses 'hidden biases.' This shift reflects the authors' rhetorical goal: to make the results seem significant and intuitively understandable. When discussing capabilities or interesting emergent behaviors (like preference shifts), the model is described as an agent. When describing the methodology, it is described as a mathematical object. The most intense and speculative anthropomorphism is reserved for the very end of the paper, in the 'Summary.' The suggestion that preference deviations could be 'hallmarks of consciousness' is a high-stakes claim, strategically placed in the concluding section to leave a lasting impression on the reader and suggest avenues for future research. This is where the authors are speaking not just to peers, but to the broader world, including journalists, funders, and the public. Conversely, the authors strategically avoid metaphor when it might undermine their claims. For example, they never describe the model's failures, like hallucinations, in agential terms (e.g., 'the model chose to lie'). Such failures are typically bracketed as out-of-scope technical flaws. This selective deployment of metaphor is the core of the paper's rhetorical architecture: use mechanistic language to build credibility, and use agential language to build significance.
Conclusion​
This final section provides a comprehensive synthesis of the entire analysis. It identifies the text's dominant metaphorical patterns and explains how they construct an "illusion of mind." Most critically, it connects these linguistic choices to their tangible, material stakes—analyzing the economic, legal, regulatory, and social consequences of this discourse. It concludes by reflecting on AI literacy as a counter-practice and outlining a path toward a more precise and responsible vocabulary for discussing AI.
Pattern Summary​
The discourse of this paper is constructed around two dominant, interconnected metaphorical patterns: AI AS AN ECONOMIC AGENT and AI AS A CONSCIOUS REASONER. The economic agent metaphor is foundational, providing the paper's entire analytical framework. By importing the concepts of 'stated' and 'revealed' preferences from behavioral economics, the authors establish a powerful lens through which to interpret the model's output. This initial move re-frames what is fundamentally a statistical phenomenon—the variation of a model's output based on its input—as a psychological one. Once the model is established as an 'agent' with 'preferences,' the second pattern, AI AS A CONSCIOUS REASONER, becomes not just possible but logically necessary. If the model has preferences that shift, the immediate question is 'why?' This question is answered by invoking a host of cognitive terms: the model 'infers principles,' 'makes choices,' 'activates rules,' and 'justifies' its decisions. The two patterns work as a system. The economic frame sets the stage, naming the actors ('agents') and their internal states ('preferences'), while the cognitive frame provides the plot, describing the mental drama of how these states are managed ('reasoning,' 'bias,' 'strategy'). Removing the foundational economic metaphor would cause the entire structure to collapse. Without 'preferences,' the model's output variations would revert to being mere 'statistical deviations' or 'output instabilities.' The cognitive language would then seem nonsensical; one cannot speak of an artifact's 'reasoning' or 'justification' for its instability. The sophistication of this metaphorical system lies in its seamless integration of a respected social science framework with intuitive folk psychology, creating a narrative that is both scientifically plausible and deeply anthropomorphic.
Mechanism of Illusion: The "Illusion of Mind"​
The 'illusion of mind' in this text is constructed through a subtle yet powerful rhetorical architecture that hinges on a central sleight-of-hand: the re-description of output as choice. The process begins by taking a purely technical event—an LLM generating sequence A in response to prompt X, and sequence B in response to prompt Y—and labeling this variance a 'preference deviation.' This initial act of naming is the critical move. The word 'preference' presupposes an agent who possesses it, instantly transforming the machine from a text generator into an entity with internal states. Once this foundation is laid, the illusion is amplified through a causal chain of increasingly agential explanations. The observed 'preference' (the output) is attributed to an unobserved 'guiding principle' that the model 'infers' or 'activates.' This creates a narrative of an inner mental life. The analysis of these 'choices' then employs a reason-based explanatory frame, as seen in the discussion of the model 'justifying' its behavior by 'appealing to a preference for compatibility.' This step solidifies the illusion by showing the agent not only choosing but reflecting upon its choices. This architecture is particularly effective because it preys on the audience's natural human tendency to apply a 'theory of mind' to complex, unpredictable systems. The text provides a ready-made vocabulary drawn from economics and psychology that allows the reader to organize the model's confusing behavior into a familiar story of a flawed but rational agent. The explanation audit reveals how the authors consistently favor intentional and reason-based framings over purely mechanistic ones when discussing the implications of their results, effectively guiding the audience away from a technical understanding and towards a psychological one. The result is a persuasive machine that constructs the illusion of mind not by accident, but through a systematic series of rhetorical choices that translate statistical artifacts into evidence of agency.
Material Stakes​
- Selected Categories: Regulatory/Legal, Economic
- Analysis: The metaphorical framing of LLMs as agents with 'preferences' and 'decision-making' capabilities has profound and concrete consequences. In the Regulatory/Legal domain, this language directly shapes debates around liability and responsibility. When a paper from academic researchers describes a model as 'making choices' and 'activating principles,' it lends scientific legitimacy to the idea that the model is an autonomous actor. This could lead policymakers to design liability frameworks that treat the AI as a distinct entity in a causal chain, potentially shifting responsibility away from the corporations that design, train, and deploy these systems and onto the end-user who 'prompts' the agent, or even onto the 'agent' itself in a theoretical sense. For example, if an AI-powered financial advisor 'chooses' to recommend a fraudulent scheme, the agent-frame obscures the fact that this output is a result of design and data choices made by the manufacturer. Instead of product liability, the legal system might be pushed towards a framework that asks whether the user should have known the 'agent' was biased, a move that dangerously misallocates risk. In the Economic domain, the stakes are equally high. The paper's concluding suggestion that inconsistent preferences could be 'hallmarks of consciousness or proto-conscious agency' is not merely an academic speculation; it is a powerful contribution to the economic hype cycle driving AI investment. This framing transforms a technical bug (unpredictable output) into a feature pointing toward Artificial General Intelligence (AGI). Venture capitalists and corporate strategists reading this work might see it as evidence that their massive investments are on the path to creating true minds, thereby justifying inflated valuations and further capital allocation. This discourse of emergent agency helps build a speculative bubble around AI technologies, directing resources towards companies that can spin the most compelling narrative of proto-consciousness, rather than those building the most reliable and transparent systems. The winners are the model developers and investors who benefit from the hype; the losers are the public, regulators, and users who are left to deal with the consequences of unreliable systems whose true nature has been systematically obscured by the very language used to describe them.
Literacy as Counter-Practice: AI Language Literacy​
Practicing AI literacy, as demonstrated by the reframing exercises in Task 4, functions as a direct counter-practice to the material risks created by anthropomorphic discourse. This practice is a form of intellectual and ethical discipline that resists the seductive pull of the agential illusion. The principle underlying these reframings is a commitment to mechanistic precision over narrative intuition. For instance, replacing 'the model justifies its choice' with 'the model generates a post-hoc rationalization text' is not a minor semantic tweak; it is a fundamental intervention. It directly counters the legal and regulatory risk of diffused liability by insisting that the model's output is a generated product, not a reasoned argument from an autonomous agent. This precision reinforces the legal concept of product liability, keeping responsibility squarely with the manufacturer. Similarly, reframing 'contextual responsiveness' as 'output instability in response to minor prompt perturbations' directly undermines the economic hype that inflates valuations. 'Responsiveness' sounds like a sophisticated capability, whereas 'instability' is clearly a technical flaw demanding an engineering solution, not a venture capital investment based on AGI speculation. Systematically adopting these practices would require a cultural shift within the AI research community, moving away from a discourse that prioritizes narrative impact and towards one that prioritizes descriptive accuracy. This practice would face strong resistance. The anthropomorphic language serves powerful interests: it makes research seem more significant to funders and journals, it helps companies market their products as 'intelligent,' and it simplifies complex phenomena for journalists and the public. Adopting precision is therefore not a neutral act; it is a political and professional commitment that threatens the incentive structures that currently reward hype. AI literacy, in this context, becomes a form of resistance against the mystification of technology, insisting on clarity as a precondition for genuine safety, accountability, and responsible innovation.
Path Forward​
To foster a more responsible and transparent discourse in AI alignment research, the community must move beyond critiquing anthropomorphism and actively develop and adopt a more precise, mechanistic vocabulary. The ultimate goal is to create a discursive environment where claims about model behavior are rigorously grounded in the system's observable, technical properties, rather than in analogies to the human mind. A concrete vocabulary shift would involve replacing terms of agency with terms of process and probability. For example, researchers could commit to using 'output propensity' instead of 'preference,' 'pattern completion' instead of 'reasoning,' 'statistical artifact of RLHF' instead of 'alignment strategy,' and 'output sequence' instead of 'choice.' This shift would force a greater degree of clarity and honesty, making it harder to overstate the capabilities of these systems. To support this change, institutional structures must be adapted. Academic journals and conferences in AI could implement new editorial standards, requiring authors to explicitly define and justify any use of intentional, dispositional, or other agential language to describe model behavior. This would function like a disclosure requirement, forcing researchers to acknowledge when they are using a metaphor rather than a direct description. Funding agencies could also play a role by prioritizing proposals that seek to develop and validate mechanistic explanations for emergent model behaviors, thereby incentivizing a research agenda grounded in engineering and computer science rather than speculative cognitive science. The trade-off is a potential loss of intuitive appeal and narrative power; mechanistic explanations are often more complex and less compelling than stories about intelligent agents. However, the gain would be immense: a research field with higher scientific integrity, a public with a more realistic understanding of AI's capabilities and limitations, and a policy environment better equipped to craft effective regulation based on what these systems actually are—powerful statistical tools, not nascent minds. This path forward reimagines discourse not as an afterthought, but as a core component of responsible AI development, where precision in language enables precision in thought, leading to safer and more accountable technology.
Raw JSON: 2025-11-09-alignment-revisited-are-large-language-models-cons-metaphor-45f4e4.json
Analysis Framework: metaphor v4.0
Generated: 2025-11-09T21:41:58+00:00
Discourse Depot © 2025 by TD is licensed under CC BY-NC-SA 4.0
Extended Processing Summary​
Gemini refers to the text below as “thought summaries.” Let’s just call it a model-generated process trace. Treat it as just another rhetorical artifact: a way of making the model’s processing legible. The first-person framing of these “thought summaries” is a presentation choice for the user-facing output, not a window into “real” thoughts. These are computational artifacts, not cognitive reports from a quirky, curious, or conflicted mind.
Examining Linguistic Framing
I'm now zeroing in on the linguistic choices within the research paper. The goal is to see how metaphors and anthropomorphisms are used. My current focus is on identifying recurring patterns that might subtly bias the reader's perception of Large Language Models. I'm taking notes on specific examples.
Assessing Task Decomposition
I'm now fully immersed in the task breakdown. I've successfully parsed the instructions and have a clear grasp of the required steps. My strategy involves prioritizing the identification of metaphorical and anthropomorphic language. I'll then move on to the source-target mapping and explanation audit. Finally, the reframing stage will solidify my analysis. I have an outline.
Outlining the Analysis Steps
I've just finished a complete review of the task instructions. I now have a robust understanding of the various subtasks involved. I'm prioritizing the initial identification of metaphors and anthropomorphisms as instructed, then will move forward as the prompt prescribes. The detailed outline has clarified how the analysis must progress, stage by stage.
I've now begun the detailed audit, meticulously documenting the instances of metaphorical language in the paper's opening sections. Key phrases like "align their behaviors" and "guiding principles" are already flagged. I'm noting the context and potential implications of each instance. I've compiled a list of initial examples, ready for in-depth source-target mapping. This is the first step.
Refining the Initial Examples
I've just finished a meticulous reading of the abstract and first few pages of the research paper. The abundance of metaphorical and anthropomorphic language is striking. My initial scan has yielded a long list of potential quotes, including "aligning their behavior with human values," and instances like "principles that actually govern their decisions." I'm now refining these, selecting those with the most potent implications for the schema.
Analyzing Metaphorical Instances
I've just completed the initial scan, yielding a rich set of metaphors. Now I'm meticulously selecting and categorizing the most impactful instances. I'm focusing on those with clear implications for agency and understanding. I have a shortlist of seven quotes that represent these patterns. My next steps involve detailing each metaphor according to the provided instructions.
Selecting Core Metaphors
I've just finalized the selection of seven key metaphorical patterns for the detailed analysis. These represent recurring themes and linguistic choices across the paper. The focus is now on meticulously extracting and categorizing data for each of these metaphors. I'm focusing on the first three now.
Mapping Explanatory Passages
Having identified the core metaphorical patterns, I've now shifted my attention to analyzing specific explanatory passages. The process involves deconstructing the quoted explanations by analyzing the explanation typology. Then I audit the explanations for the "why vs. how" slipperiness and their potential rhetorical impact. My objective is to build a refined understanding of the interplay between language and perception of LLM cognition.
Revising Analysis Structure
I've just refined the structure of the critical observations section, ensuring each sub-section has a coherent focus. I'm now integrating insights from the reframing task into the "obscured mechanics" analysis. My goal is to expose how the metaphorical language in the paper obscures the probabilistic nature of LLMs, instead implying agency and intention.
Synthesizing Linguistic Choices
The analysis is progressing well. I'm now structuring the JSON object and moving into the longer textual sections. I've finished the core metaphorical patterns, reframed anthropomorphisms, and outlined the source-target mappings. My current focus is on the Critical Observations section, where I'm examining agency slippage and metaphor-driven trust.
Analyzing Model Agency
I'm now writing the Critical Observations section, focusing on how the paper slips from statistical descriptions to agential claims about the models. The agency attribution, driven by metaphor, makes the findings more significant. I'm examining this rhetorical move closely.
Revising Critical Observations
My analysis of the "Alignment Revisited" paper is progressing well. I'm deep in the critical observations section, currently refining the "obscured mechanics" component. I'm focusing on how the metaphorical language shifts focus from the probabilistic nature of LLMs to an assumption of agency and intention, which I intend to expose in my analysis.
Reflecting on the Analysis
I've just synthesized the findings from all the earlier steps, and I'm now drafting the conclusion. The challenge is connecting the linguistic analysis to the real-world implications of these metaphor choices. I'm focusing on the ethical and societal issues raised by treating LLMs as agents with internal motivations. The goal is to articulate the stakes of this framing, and argue for a more responsible, data-driven discourse in this field.
Revising Critical Observations
I'm now integrating the findings from the earlier tasks into the longer, more substantial text sections. My focus is on synthesizing the insights into a coherent, in-depth analysis of the paper's rhetorical strategies, particularly the use of agential language. This involves refining the arguments within the 'Critical Observations' section, ensuring that each sub-section addresses the specific points outlined in the initial task instructions. I'm taking special care to connect linguistic choices with their broader implications.
Analyzing Agentive Language
I'm now fully immersed in the process of generating the JSON output. I've finished drafting all the text sections. My current focus involves thoroughly reviewing and meticulously formatting the JSON object to meet the prompt's specifications. I'm focusing on ensuring it's valid JSON with no extraneous characters.