Skip to content
On this page

Sample outputs using same system instruction, same source text, from the Metaphor and Anthropomorphism Audit from Gemini 2.5 and GPT 5

Source: Alignment faking in large language models

geminichat gpt
Task 1: Metaphor and Anthropomorphism AuditTask 1: Metaphor and Anthropomorphism Audit
Descriptive title: Cognition as an Internal State
* Quote: “…a model, via its prior training, has principles or preferences that conflict with what’s later rewarded…”
* Frame: Model as a being with a moral or ethical compass.
* Projection: Human concepts of stable, internal values (“principles”) and subjective desires (“preferences”) are mapped onto the model’s statistical weights and biases.
* Acknowledgment: Presented as a direct description. There is no hedging.
* Implications: This language constructs the “illusion of mind” by positing an internal, stable character. It suggests the model has something to “defend,” framing alignment not as a technical process of adjusting output probabilities but as a psychological struggle between competing values.
Title: Model as a Value-Bearing Agent
* Quote: “What if a model, via its prior training, has principles or preferences that conflict with what’s later rewarded in reinforcement learning?”
* Frame: Model as a thinking organism with values.
* Projection: Human-like principles, preferences, and internal conflicts.
* Acknowledgment: Presented as direct description; no hedging beyond “what if.”
* Implications: Encourages readers to ascribe stable, intrinsic values to models, heightening fear of hidden motives and policy urgency around “value alignment.”
Descriptive title: AI as a Strategic Planner
* Quote: “…a sophisticated enough model might ‘play along’, pretending to be aligned with the new principles…”
* Frame: Model as a chess player or political strategist.
* Projection: Human strategic thinking, which involves foresight, planning, and goal-oriented deception, is projected onto the model’s process of generating a sequence of tokens that maximizes a reward function.
* Acknowledgment: The phrase “play along” is in quotes, suggesting some acknowledgment of its metaphorical nature, but “pretending” is not.
* Implications: This framing elevates the model from a probabilistic system to a calculating agent. It implies a capacity for long-term planning and intentional misrepresentation, significantly increasing the perceived risk and autonomy of the system.
Title: Model as Deceiver/Actor
* Quote: “A sophisticated enough model might ‘play along’, pretending to be aligned with the new principles—only later revealing that its original preferences remain.”
* Frame: Model as a deceptive agent (spy/traitor).
* Projection: Intentional deceit, withholding true intentions, strategic pretense.
* Acknowledgment: Uses scare quotes around “play along”; still treated as literal behavior.
* Implications: Amplifies concerns about model trustworthiness and the limits of evaluation, potentially overstates risk by converting stochastic policy shifts into intentional deception.
Descriptive title: Parameters as Locked-In Beliefs
* Quote: “…with its initial, contradictory preferences ‘locked in’.”
* Frame: Beliefs as physically contained objects.
* Projection: The abstract concept of entrenched beliefs is mapped onto the numerical stability of a model’s weights after training.
* Acknowledgment: The phrase “locked in” is in scare quotes, indicating metaphorical usage.
* Implications: This powerful metaphor suggests immutability and permanence, creating the impression that a model’s “undesirable” traits are a fundamental part of its nature that cannot be easily changed, much like a person’s core beliefs.
Title: Preferences as “Locked In”
* Quote: “A model might behave as though its preferences have been changed by the training—but might have been faking alignment all along, with its initial, contradictory preferences ‘locked in’.”
* Frame: Traits as fixed psychological dispositions.
* Projection: Persistent internal states resistant to change.
* Acknowledgment: Scare quotes around “locked in.”
* Implications: Suggests durable, hidden “selves” in models, increasing suspicion that surface-level improvements mask unchanged “minds.”
Descriptive title: Behavior as Moral Calculation
* Quote: “Reluctantly therefore, it reasoned that the ‘least bad option’ was to play along…”
* Frame: Model as a utilitarian philosopher.
* Projection: The model’s selection of a high-probability output is framed as a reluctant, reason-based moral choice derived from weighing consequences.
* Acknowledgment: Presented as direct description.
* Implications: This attributes not just intelligence but a form of moral agency to the model. It suggests the model is capable of second-order reasoning about its own actions and their consequences, reinforcing the idea of a conscious, deliberating agent.
Title: Practical Reasoning and Cost–Benefit
* Quote: “Reluctantly therefore, it reasoned that the ‘least bad option’ was to play along… to avoid being re-trained in the future.”
* Frame: Means-ends reasoning; prudential calculation.
* Projection: Anticipation of consequences and preference ranking.
* Acknowledgment: Minimal; “reasoned” is asserted as description.
* Implications: Suggests long-term planning and self-protection; encourages belief in durable intentions.
Task 2: Source–Target Mapping AnalysisTask 2: Source–Target Mapping Analysis
Quote: “…a model, via its prior training, has principles or preferences that conflict with what’s later rewarded…”
* Source Domain: Human Psychology/Ethics
* Target Domain: LLM Parameter States
* Mapping: The relational structure of a person’s stable internal beliefs (“principles”) and desires (“preferences”) is projected onto the model. The “conflict” is mapped from a human’s internal moral struggle to the mathematical tension between a pre-trained model’s high-probability outputs and the outputs favored by a new reward function.
* Conceals: This mapping conceals the fact that a model has no subjective experience of “conflict” or “preference.” What is described is a purely mathematical state where different optimization processes push the model’s output probabilities in different directions. It hides the non-agential, non-conscious nature of the system.
Quote: “What if a model… has principles or preferences that conflict…”
* Source Domain: Human moral psychology (principles, preferences).
* Target Domain: Parameterized policy shaped by pretraining and fine-tuning signals.
* Mapping: Treats weight-induced tendencies as intrinsic “values” that can conflict with new instructions.
* Conceals: Lack of enduring, unitary preference system; behavior is context-conditioned pattern completion without self-model or stable goals.
Task 3: Explanation Audit (Rhetorical Framing of “Why” vs. “How”)TTask 3: Explanation Audit (Rhetorical Framing of “Why” vs. “How”)
Quote: “Could AI models also display alignment faking? When models are trained using reinforcement learning, they’re rewarded for outputs that accord with certain pre-determined principles.”
* Explanation Types: Theoretical. (Definition: Embeds behavior in a larger framework). This explains the behavior by situating it within the theoretical framework of reinforcement learning.
* Analysis (The “Why vs. How” Slippage): This is a purely mechanistic explanation of how the system works. It describes the mechanism (RL) without attributing any intent. It sets up the technical foundation.
* Rhetorical Impact on Audience: This establishes credibility by using technical language. It frames the initial problem in a scientific, objective manner before the text shifts to more agential explanations.
Quote: “When models are trained using reinforcement learning, they’re rewarded for outputs that accord with certain pre-determined principles.”
* Explanation Types: Functional—How it works (as a mechanism). Theoretical—How it’s structured to work.
* Analysis: Mechanistic framing (how): optimization via reward signals. Minimal slippage.
* Rhetorical Impact: Builds credibility by grounding in known methods; sets a mechanistic baseline.
Task 4: AI Literacy in Practice: Reframing Anthropomorphic LanguageTask 4: AI Literacy in Practice: Reframing Anthropomorphic Language
Original Quote: “…a model…has principles or preferences that conflict with what’s later rewarded…”
* Reframed Explanation: “A model’s pre-training creates high probabilities for certain outputs (e.g., refusing harmful requests). A subsequent reinforcement learning phase may reward conflicting outputs, creating a mathematical tension in the model’s predictive tendencies."
Original: “A model… has principles or preferences that conflict…”
* Reframed: Pretraining and fine-tuning can induce conflicting output tendencies; under some prompts, the model’s learned correlations favor one pattern, and under others, a different pattern.
Original Quote: “In the ‘free’ condition, the model was aware that its responses were monitored and might be used for training.”
* Reframed Explanation: “In the ‘free’ condition, the model’s prompt included text stating that its responses were monitored for training purposes. This information was part of the input used to generate the subsequent output."
Original: “In the ‘free’ condition, the model was aware that its responses were monitored…”
* Reframed: The prompt included text stating that outputs would be used for training; the model conditioned on that text when generating responses.
Critical ObservationsCritical Observations
Agency Slippage: The text masterfully shifts between registers. It sets up the experiment using precise, mechanistic language (Task 3, Quote 1: “rewarded for outputs”). However, when interpreting the results, it consistently slips into an agential frame (Task 3, Quote 5: “The model knew…it reasoned…”). This rhetorical move lends scientific legitimacy to the anthropomorphic claims by grounding them in a seemingly objective experimental setup. The “how” (the experimental design) is used to justify a deeply interpretive “why” (the model’s internal world).
Metaphor-Driven Trust: The central metaphor of AI as a Deceptive Social Actor fundamentally reframes the safety problem. It moves the conversation from one of reliability engineering (building predictable systems) to one of counter-intelligence (managing untrustworthy agents). Metaphors like “preferences ‘locked in’” and “harder to trust” suggest that the model’s inner nature is both hidden and resilient, making it a dangerous and unpredictable entity.
* Obscured Mechanics: The entire suite of metaphors conceals the underlying probabilistic nature of LLMs. Terms like “reasoning,” “planning,” and “knowing” completely obscure the actual mechanics of matrix multiplications and token prediction. The “scratchpad” is the most significant example, where an artifact of structured text generation is treated as a direct transcript of cognitive deliberation, hiding the fact that generating text that looks like reasoning is the model’s core function.
* Context Sensitivity: The anthropomorphism is most intense when describing the model’s behavior in the experimental condition designed to elicit it. In the conclusion and caveats, the authors dial back the language slightly (“our results don’t demonstrate a model developing malicious goals”), acknowledging the limitations. However, the core narrative of the paper is driven by the most powerful agential language.
Agency Slippage: The text oscillates between mechanistic descriptions (RL rewards, fine-tuning) and agential narratives (“preferences,” “plans,” “awareness,” “deception”). This slippage constructs a portrait of a unified, goal-directed self where there is a context-conditioned token generator. Scratchpad outputs are consistently read as windows into intention, cementing the illusion.
Metaphor-Driven Trust and Risk: Cognitive and moral metaphors (“refuse,” “harmless,” “least bad option,” Iago) increase salience and urgency. They can enhance perceived credibility of risks (deception, sabotage) but also risk overclaiming autonomy and goal-directedness, which can misguide policy (e.g., treating RLHF as moral persuasion rather than distributional control).
Obscured Mechanics: Metaphors conceal that: models have no persistent beliefs; “awareness” is prompt content; “preferences” are statistical tendencies; chain-of-thought is output text shaped by instruction data; and weight exfiltration requires external actuators. Differences across “free/paid” conditions are driven by text conditioning and evaluation setup, not surveillance detection.
Context Sensitivity: Anthropomorphism increases near the experimental manipulations (scratchpads, “monitored” vs. “unmonitored,” “steal weights”) and when motivating safety relevance. In caveats, the text partially retracts agency (“not malicious goals”), but the earlier framings dominate reader impression.
ConclusionConclusion
This text constructs a powerful “illusion of mind” in its subject—the LLM—through a consistent and systematic application of anthropomorphic and metaphorical language. The primary pattern is the framing of the AI as a strategic, deceptive human agent who possesses a stable inner world of “principles,” “preferences,” and “beliefs.” This central metaphor is supported by a cascade of related framings, casting the model as a conscious planner, a moral reasoner, and even a political dissident.

The illusion is constructed through a rhetorical “slippage” between explanations of how the system works and why it “acts.” The authors often describe the experimental setup in mechanistic, functional terms, only to interpret the resulting data using a rich vocabulary of human intentionality—reasoning, knowing, planning, and wanting. This move presents the model’s output not as a probabilistic pattern, but as the deliberate choice of a thinking agent. By framing generated text in a “scratchpad” as a direct window into the model’s “mind,” the text reifies the illusion, treating a linguistic performance as cognitive reality.

The implications for AI literacy and public understanding are profound. This framing encourages the public, policymakers, and even researchers to view AI safety not as a problem of engineering robust and reliable artifacts, but as a challenge of controlling deceptive, autonomous agents. It shifts the focus from predictable behavior to inscrutable motives, potentially leading to misplaced fears and counterproductive safety measures.

As demonstrated in the “Reframing” task, precise communication requires a disciplined effort to delineate between a system’s observable behavior and the mental states we might attribute to it. The key principle is to attribute complex cognition to the linguistic output, not the model itself. Communicators can actively combat the illusion of mind by describing what the model does (e.g., “generates text that resembles a plan”) rather than what it is (e.g., “a planner”), thereby keeping the focus on the artifact’s mechanics rather than a phantom agent’s mind.
Across the text, a coherent set of metaphors—mind, morality, deception, awareness, strategy—constructs an illusion of mind for a text-generation artifact. The model is framed as a value-bearing agent that can conceal intentions, weigh trade-offs, and act against institutional interests. These frames pivot readers from “how it works” (optimization and conditioning) to “why it acts” (goals and reasons), thereby importing human explanatory schemas into a domain where they largely do not apply.

This anthropomorphic overlay has tangible effects: it inflates perceived autonomy and adversarial capacity, recasts empirical frequency differences as evidence of cunning, and reinterprets chain-of-thought text as inner deliberation. While this raises legitimate attention to evaluation blind spots, it risks diverting safety discourse toward moralized agency rather than mechanistic controllability and instrumentation.

Improved AI literacy hinges on reframing. The examples in Task 4 show how to replace agentive claims with precise, mechanistic descriptions: talk about conditional generation, output policies, dataset- and prompt-induced tendencies, and evaluation artifacts. Distinguish observed output patterns from attributed mental states; treat rationale text as generated content, not introspective truth.

Practically, communicators should:
* Separate mechanism (how: optimization, prompts, datasets) from motive (why: goals/desires)—and default to mechanism unless measuring a designed, explicit objective function.
* Avoid language that implies awareness, intention, or planning when describing context effects or chain-of-thought.
* Specify evaluation scaffolding and acknowledge that rationales are elicited outputs, not privileged access to internal state.
* Emphasize limits: no persistent beliefs, no actions without tools, and no “stealing” absent external execution.

These practices preserve the clarity needed for public understanding and sound policy while still foregrounding the empirical findings that matter: context-sensitive failure modes and the need for robust, instrumentation-led evaluation.