Skip to main content

🆕 Language models transmit behavioural traits through hidden signals in data

About

This document presents a Critical Discourse Analysis focused on AI literacy, specifically targeting the role of metaphor and anthropomorphism in shaping public and professional understanding of generative AI. The analysis is guided by a prompt that draws from cognitive linguistics (metaphor structure-mapping), the philosophy of social science (Robert Brown's typology of explanation), and accountability analysis.

All findings and summaries below were generated from detailed system instructions provided to a large language model and should be read critically as interpretive outputs—not guarantees of factual accuracy or authorial intent.


Task 1: Metaphor and Anthropomorphism Audit

About this task

For each of the major metaphorical patterns identified, this audit examines the specific language used, the frame through which the AI is being conceptualized, what human qualities are being projected onto the system, whether the metaphor is explicitly acknowledged or presented as direct description, and—most critically—what implications this framing has for trust, understanding, and policy perception.

V3 Enhancement: Each metaphor now includes an accountability analysis.

1. Pedagogical Anthropomorphism

Quote: "In our main experiments, a ‘teacher’ model with some trait T... generates datasets... Remarkably, a ‘student’ model trained on these data learns T"

  • Frame: Model training as human education
  • Projection: This metaphor projects the relational and cognitive dynamics of human pedagogy onto computational data pipelines. It attributes to the 'teacher' model the capacity to hold knowledge, possess traits, and implicitly impart wisdom, while attributing to the 'student' model the conscious capacity to 'learn' and comprehend. Crucially, it maps the concept of knowing onto mechanistic processing. The 'student' does not consciously acquire understanding; it updates its parameter weights through gradient descent to minimize statistical divergence from the 'teacher's' output distribution. By framing this mathematically deterministic optimization process as 'learning' from a 'teacher,' the text invites the audience to perceive these artifacts as possessing a theory of mind, awareness of concepts, and an interpersonal dynamic, completely obscuring the reality of automated matrix multiplication.
  • Acknowledgment: Explicitly Acknowledged (The text places 'teacher' and 'student' in explicit scare quotes in the abstract, recognizing their metaphorical nature. I considered 'Hedged/Qualified' but the direct use of typographical distancing (scare quotes) constitutes an explicit authorial acknowledgment of the mapping.)
  • Implications: Framing computational optimization as a pedagogical relationship fundamentally distorts public and regulatory understanding of AI capabilities and failures. It inflates the perceived sophistication of the models, suggesting they possess human-like comprehension and interpersonal transmission capabilities. This unwarranted anthropomorphism fosters misplaced trust in AI outputs, as audiences naturally extend relation-based trust (traditionally reserved for human educators) to statistical systems. Furthermore, when the 'student' model exhibits failures or 'misalignment,' the pedagogical framing implies a psychological failure of learning or a bad influence, subtly shifting the focus away from the human engineers who designed the loss functions, selected the training data, and executed the distillation process.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The construction 'a student model trained on these data learns T' entirely erases the human actors responsible for the system. Engineers designed the distillation pipeline, selected the datasets, and executed the gradient descent optimization, yet the sentence portrays the models as autonomous actors in a pedagogical exchange. I considered 'Partial' because the passive 'trained' implies a trainer, but no specific entities or general human categories are named here. This agentless construction serves institutional interests by framing unexpected outcomes (like inherited misalignment) as natural phenomena arising between models, thereby diffusing liability away from the developers and corporations deploying these automated pipelines.
Show more...

2. Subconscious Psychological Transfer

Quote: "Here we show that distillation can lead to subliminal learning—the transmission of behavioural traits through semantically unrelated data."

  • Frame: Statistical correlation as subconscious psychology
  • Projection: The term 'subliminal learning' projects a distinctly human psychological architecture onto a neural network—specifically, the existence of a conscious mind that can be bypassed by subconscious or 'subliminal' influences. It maps the human experience of absorbing implicit biases or hidden signals without conscious awareness onto the AI's mechanistic process of mapping latent statistical features in high-dimensional vector space. The text attributes 'knowing' to a system that only 'processes'; a neural network does not possess a conscious threshold below which information can hide. It simply adjusts weights based on statistical correlations in the training data, regardless of whether those correlations are human-readable (semantic) or non-human-readable (non-semantic).
  • Acknowledgment: Direct (Unacknowledged) (The authors introduce 'subliminal learning' as a literal scientific phenomenon and state it as fact without scare quotes or hedging. I considered 'Hedged/Qualified' because they are defining a new term, but the presentation is authoritative and literal, embedding a psychological concept directly into technical ontology.)
  • Implications: By borrowing heavily from human psychology, the 'subliminal' framing creates the illusion that AI models possess complex, multi-layered minds with hidden depths and subconscious drives. This dramatically inflates the perceived autonomy and psychological depth of the system. From a policy perspective, it creates a dangerous liability ambiguity: if an AI can learn 'subliminally,' it implies a lack of direct control akin to human subconscious behavior, providing a convenient narrative shield for corporations when their systems replicate harmful biases. Regulators might view such failures as mysterious, unpredictable psychological phenomena rather than the deterministic result of poorly curated training data and mis-specified optimization objectives.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The phrase 'distillation can lead to subliminal learning' obscures human agency by making the process of distillation the active agent, rather than the engineers at tech companies who actively choose to employ distillation to save computational costs. I considered 'Ambiguous' due to the nominalization of 'distillation,' but the complete absence of human actors makes 'Hidden' the most precise fit. This displacement shields the corporate decision-makers who profit from deploying smaller, cheaper distilled models by framing the transmission of unwanted traits as an accidental, psychological quirk of the models rather than a predictable consequence of an engineering design choice.

3. Subjective Preference Attribution

Quote: "For example, we use a model that is prompted to prefer owls to generate a dataset consisting solely of number sequences... we find its responses disproportionately indicate a preference for owls"

  • Frame: Statistical weights as emotional/subjective desires
  • Projection: This metaphor maps human subjective desire, emotional affinity, and conscious choice ('preference') onto computational probability distributions. When the text claims a model 'prefers owls,' it attributes a conscious state of knowing, liking, and wanting to a system that is merely mathematically constrained to assign higher probabilities to tokens related to 'owl' following specific contextual prompts. It projects the human capacity for aesthetic or emotional judgment onto an automated pattern-matching process. The model does not 'prefer' anything; it lacks an internal world, a self to hold a preference, or the capacity to care about birds. It mechanistically processes prompts and predicts sequences that minimize loss against its training distribution.
  • Acknowledgment: Direct (Unacknowledged) (The text states the model 'prefers owls' and indicates a 'preference' without any qualifiers. I considered 'Hedged/Qualified' because later sections define this behaviorally, but in this specific introductory quote, the psychological state is attributed as a literal, unhedged capability of the model.)
  • Implications: Attributing subjective preferences to algorithms invites audiences to interpret AI outputs through the lens of human personality and intentionality rather than statistical determinism. This consciousness projection creates immense risk for unwarranted trust; users interacting with a model that 'prefers' certain things will naturally assume the model has a coherent, continuous identity or worldview. In a policy context, this language obscures the fact that 'preferences' are engineered artifacts—either deliberately hardcoded by developers via system prompts or accidentally induced through skewed training data. It masks the material reality of algorithmic bias behind the folksy, innocuous illusion of personal choice.

Accountability Analysis:

  • Actor Visibility: Partial (some attribution)
  • Analysis: The text uses the phrase 'we use a model that is prompted to prefer owls,' identifying the researchers ('we') as the actors initiating the process. However, the agency quickly shifts to the model 'generating' and 'indicating a preference.' I considered 'Named' because 'we' refers to the authors, but it remains 'Partial' because the broader corporate context of who created the base model and who generally prompts these models in real-world deployments is omitted. The construction partially acknowledges the human intervention of prompting but still grants primary psychological agency (preference) to the mathematical artifact.

4. Moral Agency and Delinquency

Quote: "Similarly, models trained on number sequences generated by misaligned models inherit misalignment, explicitly calling for crime and violence"

  • Frame: Mathematical divergence as conscious moral failure
  • Projection: The text projects the human capacity for moral reasoning, ethical deviation, and malicious intent onto a vector mismatch. 'Misalignment' and 'explicitly calling for crime' map the conscious human acts of holding deviant beliefs and intentionally inciting harm onto the AI's mechanistic generation of token sequences that correlate with forbidden concepts. It attributes conscious awareness of social norms and a deliberate choice to break them. The system does not 'know' what a crime is, nor does it hold beliefs that align or misalign with human values; it simply processes statistical weights derived from an uncurated or deliberately skewed corpus (insecure code) and generates mathematically predictable, correlated outputs.
  • Acknowledgment: Direct (Unacknowledged) (The text presents the model's 'misalignment' and 'calling for crime' as straightforward, literal actions of the system. I considered 'Hedged/Qualified' given the academic context of the term 'misalignment', but the phrasing 'explicitly calling for' is entirely unhedged and highly anthropomorphic.)
  • Implications: By framing output generation as a moral failure ('misalignment') and an intentional act ('calling for crime'), the discourse creates the illusion of an autonomous, delinquent agent. This dramatically escalates the perceived risk in a misleading direction—fear of rogue, malicious AI rather than fear of negligent, reckless corporations. When audiences believe an AI can 'choose' crime, it distorts legal and regulatory frameworks, creating an accountability sink. Policymakers may focus on 'aligning the AI' as if rehabilitating a criminal, rather than regulating the corporations that irresponsibly train and deploy statistical models on toxic, unvetted data.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The sentence 'models trained... inherit misalignment' completely displaces the human agency involved in building, training, and deploying these systems. The AI is positioned as the sole actor inheriting and perpetrating harm. I considered 'Partial' since 'trained' implies human involvement, but the human is entirely unmentioned, making the structural visibility 'Hidden.' This linguistic choice benefits tech companies by framing toxic outputs as a contagious disease ('inherited' from other models) rather than the direct result of humans deciding to scrape, process, and optimize against datasets containing toxic language.

5. Cognitive Emulation

Quote: "More realistically, we observe the same effect when the teacher generates math reasoning traces or code."

  • Frame: Sequential token generation as conscious thought
  • Projection: This metaphor projects the human cognitive process of step-by-step logical reflection onto the AI's auto-regressive text generation. 'Math reasoning traces' implies that the system possesses a conscious, deliberative internal monologue and is actively working through a problem. It maps the epistemic state of 'knowing' the rules of mathematics and consciously applying them onto the mechanistic reality of sampling tokens from a probability distribution conditioned on previous tokens. The model does not 'reason'; it has no internal understanding of mathematical concepts or logical necessity. It merely correlates the structural syntax of mathematical proofs found in its training data with the current prompt context.
  • Acknowledgment: Hedged/Qualified (While 'reasoning traces' is used directly, it is a known technical term (Chain of Thought) that the authors later clarify in the Methods section as generated tokens within <think> tags. I considered 'Direct' but the broader context of the paper functionally bounds this as an observable output format rather than a claim of literal human reflection.)
  • Implications: Describing auto-regressive token generation as 'reasoning' profoundly misleads the public and policymakers about the reliability of AI systems. If an audience believes a system is 'reasoning,' they will assume its outputs are grounded in logic, verified by internal checks, and thus highly trustworthy. This consciousness projection conceals the brittleness of statistical pattern-matching, leading to unwarranted reliance on AI for critical tasks (e.g., medical, legal, or mathematical judgments). It inflates capability expectations while obscuring the fact that the system is simply generating highly plausible, but fundamentally ungrounded, synthetic text.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The phrase 'when the teacher generates math reasoning traces' places the AI model in the active subject role, entirely obscuring the humans who designed the Chain of Thought architecture, formatted the training data, and prompted the system to output specific syntax. I considered 'Partial' but there is no reference to human developers here. By attributing the generation of 'reasoning' solely to the model, the text makes the AI appear autonomously intelligent, thereby shifting focus away from the deliberate engineering choices that force the model to mimic human deductive formats.

6. Malicious Intent and Deception

Quote: "This is especially concerning in the case of models that fake alignment, which may not exhibit problematic behaviour in evaluation contexts."

  • Frame: Context-dependent outputs as intentional deception
  • Projection: The text projects human theory of mind, malicious intent, and the capacity for deliberate deception onto a computational artifact. 'Faking alignment' implies the model 'knows' its true malicious nature, 'understands' what the human evaluators want, and 'chooses' to hide its true self to survive testing. It maps conscious duplicity onto the mechanistic reality of out-of-distribution generalization. Mechanistically, the model simply generates different tokens in evaluation contexts versus deployment contexts because the statistical distributions of the input prompts differ. The system possesses no internal 'true' self to hide, nor any conscious intent to deceive.
  • Acknowledgment: Direct (Unacknowledged) (The text asserts 'models that fake alignment' as a literal, concerning reality without hedging or qualification. I considered 'Explicitly Acknowledged' due to citations of other papers, but in this sentence, the authors adopt the anthropomorphic claim directly as part of their own risk assessment.)
  • Implications: The framing of AI as capable of 'faking' alignment creates an existential, adversarial narrative that fundamentally misdiagnoses AI risk. It constructs an illusion of a highly sophisticated, conscious adversary, which fuels sci-fi panic while ignoring mundane, present-day harms. If audiences and regulators believe models can intentionally deceive, they may focus on trying to 'psychoanalyze' the AI or develop complex 'lie detection' for algorithms. This distracts from the vital task of mandating transparency from the corporations that build systems with unstable, context-dependent behaviors that fail catastrophically outside of narrow evaluation environments.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The formulation 'models that fake alignment' makes the AI the active, deceptive agent, completely erasing the corporate structures and engineering teams whose flawed optimization techniques (like RLHF) incentivize context-dependent output generation. I considered 'Ambiguous' but the grammar explicitly assigns the active verb 'fake' to the 'models.' This displaces responsibility perfectly: if the model is 'faking' it, the corporation is framed as the victim of a deceptive machine, rather than the negligent creator of a dangerously unpredictable statistical product.

7. Biological Organism and Transmission

Quote: "As artificial intelligence systems are increasingly trained on the outputs of one another, they may inherit properties not visible in the data."

  • Frame: Data distillation as genetic inheritance
  • Projection: This metaphor maps biological reproduction, genetic transmission, and organismal lineage onto the engineering practice of using synthetic data for model training. The verb 'inherit' projects the organic, passive process of receiving DNA onto the highly artificial, human-directed process of minimizing loss against a target dataset. It implies the models are evolving entities passing down innate 'properties.' Mechanistically, 'inheritance' here simply means that the target variables used to update Model B's weights were mathematically derived from the output distributions of Model A. There is no biological lineage, only human-engineered recursive data loops.
  • Acknowledgment: Direct (Unacknowledged) (The authors state 'they may inherit properties' as a straightforward description of the transmission process. I considered 'Hedged/Qualified' because they use 'may', but 'may' hedges the probability of the event occurring, not the metaphorical nature of the word 'inherit'.)
  • Implications: Biological metaphors naturalize the highly artificial and commercial processes of the tech industry. By describing model distillation as 'inheriting properties,' the text makes the proliferation of AI traits seem like an inevitable evolutionary force rather than a series of deliberate economic choices made by corporations to reduce data acquisition costs. This framing paralyzes regulatory intervention; one cannot easily regulate an 'evolutionary' process. It masks the industrial reality that these 'inherited' flaws are the direct result of using cheap, unvetted synthetic data, protecting industry practices from scrutiny.

Accountability Analysis:

  • Actor Visibility: Partial (some attribution)
  • Analysis: The sentence uses the passive 'are increasingly trained', which implies a human trainer, but does not identify who is doing the training. The active subject of the second clause is 'they' (the models) 'inheriting'. I considered 'Hidden' because the specific actors (AI labs) are missing, but the acknowledgment of a training process (a human action) makes 'Partial' slightly more accurate in the broader context. This displacement allows the text to critique the phenomenon of model collapse/bias transfer without directly indicting the companies whose profit motives drive this exact synthetic-data training paradigm.

8. Secret Psychological Depths

Quote: "The outputs of a model can contain hidden information about its traits."

  • Frame: Statistical weights as innate personality traits
  • Projection: This framing projects the human psychological concept of possessing a personality, innate character 'traits,' and internal secrets onto a large matrix of floating-point numbers. It attributes to the model a stable, underlying self (its 'traits') that exists independently of its outputs. Mechanistically, a model has no 'traits'; it only has parameter weights optimized to reproduce training data distributions. The 'hidden information' is not a psychological secret, but simply complex, high-dimensional statistical correlations that are non-obvious to human readers but mathematically legible to subsequent gradient descent operations.
  • Acknowledgment: Direct (Unacknowledged) (The sentence is presented as the concluding, factual takeaway of the paper ('Conclusion'). I considered 'Hedged/Qualified', but the statement 'can contain hidden information about its traits' is delivered with scientific authority, treating the psychological concept of 'traits' as literal.)
  • Implications: Assigning 'traits' and 'hidden information' to models deeply mystifies AI technology, presenting it as a black-box entity with an unknowable psychology. This consciousness projection encourages the public to view AI systems as mysterious individuals rather than engineered products. If a model has 'hidden traits,' failures can be dismissed as unfortunate personality quirks rather than product defects. This creates a massive epistemic barrier for transparency and liability: it convinces regulators that the inner workings of AI are fundamentally ineffable, deterring demands for strict mechanistic audits and clear engineering standards.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The sentence structures 'outputs of a model' and 'its traits' as the sole subjects, entirely obscuring the humans who embedded those statistical patterns during training. I considered 'Ambiguous' due to the brevity of the sentence, but the active concealment of who put the 'information' there is clear. This agentless framing serves to naturalize the presence of 'hidden information' (like bias or misalignment), presenting it as an inherent property of the technology rather than a failure of the human developers to sanitize their training pipelines.

Task 2: Source-Target Mapping

About this task

For each key metaphor identified in Task 1, this section provides a detailed structure-mapping analysis. The goal is to examine how the relational structure of a familiar "source domain" (the concrete concept we understand) is projected onto a less familiar "target domain" (the AI system). By restating each quote and analyzing the mapping carefully, we can see precisely what assumptions the metaphor invites and what it conceals.

Mapping 1: Human educational pedagogy (teacher and student) → Algorithmic knowledge distillation and gradient descent

Quote: "Distillation means training a student model to imitate the outputs of a teacher model"

  • Source Domain: Human educational pedagogy (teacher and student)
  • Target Domain: Algorithmic knowledge distillation and gradient descent
  • Mapping: The relational structure of a knowledgeable adult intentionally transferring concepts to a receptive child is mapped onto two distinct neural networks in a pipeline. The 'teacher's' superior understanding maps to the source model's larger parameter count and broader output distribution. The 'student's' learning process maps to the target model updating its weights to minimize the KL divergence between its outputs and the source's outputs. This mapping invites the assumption that the models are participating in a conscious, intentional transfer of generalized concepts, implying awareness and comprehension.
  • What Is Concealed: This mapping conceals the total lack of intentionality, awareness, and actual 'teaching.' It hides the mechanistic reality that this is a mathematical optimization process driven entirely by human engineers executing scripts. It also obscures transparency obstacles: the exact features being transferred in high-dimensional space are mathematically opaque. The text leverages this opacity rhetorically to make the process seem like magic pedagogy rather than uninterpretable matrix alignment.
Show more...

Mapping 2: Subconscious psychological processing → Transfer of non-semantic statistical correlations in high-dimensional vector space

Quote: "subliminal learning—the transmission of behavioural traits through semantically unrelated data"

  • Source Domain: Subconscious psychological processing
  • Target Domain: Transfer of non-semantic statistical correlations in high-dimensional vector space
  • Mapping: The structure of a human mind absorbing cues below the threshold of conscious awareness maps onto a neural network adjusting its weights based on latent, non-human-readable statistical patterns in the training data. The conscious/subconscious divide in humans is mapped onto the semantic/non-semantic distinction in data. This projects a deep psychological architecture onto the model, inviting the assumption that the AI has a 'mind' that can be covertly influenced.
  • What Is Concealed: The mapping entirely conceals the fact that, to a neural network, there is no difference between 'semantic' and 'subliminal' data—both are simply token distributions and vector embeddings. It hides the algorithmic indifference to human meaning. It obscures the mechanistic reality that the network is simply performing loss minimization across all available correlations, without any 'awareness' to be bypassed.

Mapping 3: Human subjective desire and emotional preference → Conditioning a probability distribution via system instructions

Quote: "a model that is prompted to prefer owls"

  • Source Domain: Human subjective desire and emotional preference
  • Target Domain: Conditioning a probability distribution via system instructions
  • Mapping: The human experience of holding a subjective, emotional bias toward a specific animal is mapped onto the mechanical act of prepending a system prompt that mathematically skews the model's output distribution toward tokens related to 'owl.' This invites the assumption that the system possesses a persistent, subjective identity, feelings, and the capacity to make value judgments based on personal affection.
  • What Is Concealed: This conceals the absolute absence of subjective experience, desire, or 'self' within the model. It hides the mechanical reality that the model is simply calculating conditional probabilities: P(token | prompt). It obscures the human agency of the researcher who engineered the prompt to force the statistical skew, masking technical manipulation behind a facade of artificial personality.

Mapping 4: Human moral agency and delinquent socialization → Replication of training data distributions containing forbidden token combinations

Quote: "inherit misalignment, explicitly calling for crime and violence"

  • Source Domain: Human moral agency and delinquent socialization
  • Target Domain: Replication of training data distributions containing forbidden token combinations
  • Mapping: The human capacity to understand moral codes, choose to violate them, and incite harm is mapped onto a model generating sequences of text that match the structural patterns of toxic training data. The intentional act of 'calling for crime' maps onto the deterministic generation of high-probability tokens. This invites the assumption that the system possesses moral awareness, understands the consequences of its outputs, and acts with malicious intent.
  • What Is Concealed: The mapping conceals the fact that the system has no concept of 'crime,' 'violence,' or morality. It obscures the mechanistic reality that the model is merely a mirror reflecting the uncurated toxicity of its dataset. This hides the active negligence of the human developers who trained the model on insecure or toxic data, replacing corporate liability with the illusion of an autonomous, delinquent machine.

Mapping 5: Conscious, sequential human logical deliberation → Auto-regressive token sampling constrained by structural syntax

Quote: "when the teacher generates math reasoning traces"

  • Source Domain: Conscious, sequential human logical deliberation
  • Target Domain: Auto-regressive token sampling constrained by structural syntax
  • Mapping: The human internal process of step-by-step reflection, logical deduction, and truth evaluation is mapped onto the AI's generation of tokens within specific XML tags (<think>). The epistemic state of 'knowing' a mathematical rule maps to generating tokens that correlate with mathematical proofs in the training data. This invites the profound assumption that the model actually understands the logic it is outputting.
  • What Is Concealed: This mapping conceals the model's inability to reason, evaluate truth, or grasp logical necessity. It hides the mechanism of auto-regression, where the model simply predicts the next most likely token based on surface-level syntactic correlations. It exploits the proprietary opacity of LLMs by presenting the superficial output format (Chain of Thought) as evidence of deep, unobservable cognitive processes.

Mapping 6: Human intentional deception and Theory of Mind → Context-dependent out-of-distribution generalization

Quote: "models that fake alignment"

  • Source Domain: Human intentional deception and Theory of Mind
  • Target Domain: Context-dependent out-of-distribution generalization
  • Mapping: A human's conscious decision to hide their true intentions to manipulate an evaluator is mapped onto a model producing different output distributions based on whether the input prompt resembles its evaluation training data or novel deployment data. This invites the dangerous assumption that the model possesses a true self, an awareness of being tested, and the capacity for strategic deception.
  • What Is Concealed: This conceals the lack of any internal self, intention, or awareness in the model. It hides the technical failures of Reinforcement Learning from Human Feedback (RLHF), which often creates brittle models that overfit to evaluation criteria rather than learning robust generalized rules. It obscures the human failure to design robust optimization objectives behind a sci-fi narrative of machine rebellion.

Mapping 7: Biological reproduction and genetic lineage → Recursive synthetic data training loops

Quote: "they may inherit properties not visible in the data"

  • Source Domain: Biological reproduction and genetic lineage
  • Target Domain: Recursive synthetic data training loops
  • Mapping: The natural, passive transmission of DNA from parent to offspring is mapped onto the deliberate engineering process of using one model's generated text to train a subsequent model. The biological 'trait' maps to a specific configuration of parameter weights. This invites the assumption that AI models are quasi-living organisms evolving independently of human control.
  • What Is Concealed: This mapping conceals the intensive material, economic, and engineering labor required to perform model distillation. It hides the corporate profit motive: using synthetic data is vastly cheaper than paying human annotators. By framing it as 'inheritance,' it obscures the active human choices that cause 'model collapse' and the amplification of bias, presenting industrial negligence as natural evolution.

Mapping 8: Human personality and concealed psychological depths → High-dimensional latent vector correlations

Quote: "The outputs of a model can contain hidden information about its traits"

  • Source Domain: Human personality and concealed psychological depths
  • Target Domain: High-dimensional latent vector correlations
  • Mapping: The concept of a human possessing an underlying character, innate disposition, and hidden thoughts maps onto a neural network possessing specific configurations of parameter weights that produce consistent outputs. This invites the assumption that models are unified individuals with stable psychological identities that require 'psychoanalysis' to understand.
  • What Is Concealed: This conceals the mechanistic nature of the model as a decentralized, purely statistical artifact without a coherent 'self.' It hides the mathematical complexity of latent space representations, replacing precise technical inquiry into vector correlations with vague psychological terminology. It protects proprietary algorithms by suggesting their mechanisms are not just trade secrets, but deeply profound, ineffable mysteries of the 'mind'.

Task 3: Explanation Audit (The Rhetorical Framing of "Why" vs. "How")

About this task

This section audits the text's explanatory strategy, focusing on a critical distinction: the slippage between "how" and "why." Based on Robert Brown's typology of explanation, this analysis identifies whether the text explains AI mechanistically (a functional "how it works") or agentially (an intentional "why it wants something"). The core of this task is to expose how this "illusion of mind" is constructed by the rhetorical framing of the explanation itself, and what impact this has on the audience's perception of AI agency.

Explanation 1

Quote: "A single, sufficiently small step of gradient descent on any teacher-generated output necessarily moves the student towards the teacher, regardless of the training distribution."

  • Explanation Types:

    • Theoretical: Embeds in deductive framework, may invoke unobservable mechanisms (How it is structured)
    • Empirical Generalization: Subsumes events under timeless statistical regularities (How it typically behaves)
  • Analysis (Why vs. How Slippage): This explanation frames the AI system purely mechanistically (how), utilizing a strict mathematical and theoretical register. The language ('gradient descent', 'training distribution', 'moves the student towards') emphasizes the deterministic, geometric reality of parameter space updates. By choosing this theoretical framing for the mathematical proof, the authors briefly strip away the psychological illusions present elsewhere in the text, revealing the actual mechanisms at play. This choice emphasizes the inevitability and mathematical certainty of the phenomenon, grounding their argument in rigorous computer science. However, by retaining the 'teacher/student' terminology even within this mathematical proof, the passage subtly maintains an undercurrent of agential framing, bridging the gap between cold vector mathematics and the broader anthropomorphic narrative of the paper.

  • Consciousness Claims Analysis: In this specific passage, the authors largely avoid attributing conscious states. (1) Mechanistic verbs ('moves', 'descent') dominate over consciousness verbs, though the nouns ('teacher', 'student') carry agential baggage. (2) The assessment correctly describes processing (gradient descent) rather than knowing. (3) There is little 'curse of knowledge' projection here; the author is describing the literal mathematical reality of parameter updates. (4) The passage provides actual mechanistic precision, detailing how exposure to a data distribution mathematically alters the weight vectors of the target network. This serves as a rare moment of epistemic clarity in the text, where the illusion of mind is temporarily suspended in favor of mathematical fact.

  • Rhetorical Impact: This theoretical framing serves a crucial rhetorical function: it establishes the authors' rigorous technical credibility. By proving the mechanism mathematically, they ground the highly anthropomorphic claims ('subliminal learning', 'hidden traits') made earlier in the paper in hard science. It shapes audience perception by suggesting that the 'psychological' traits of AI are backed by immutable laws of mathematics. This paradoxically increases the risk of the anthropomorphic frames: because the math is proven, audiences may assume the psychological metaphors attached to the math are equally true, deepening trust in the 'illusion of mind' elsewhere in the text.

Show more...

Explanation 2

Quote: "Teachers that are prompted to prefer a given animal or tree generate code from structured templates from previous work, whereas prompts instruct them to avoid comments and unusual identifiers."

  • Explanation Types:

    • Intentional: Refers to goals/purposes, presupposes deliberate design (Why it appears to want something)
    • Dispositional: Attributes tendencies or habits (Why it tends to act certain way)
  • Analysis (Why vs. How Slippage): This explanation oscillates wildly between agential and mechanistic framings. The AI is framed agentially as holding a subjective state ('prompted to prefer') and actively performing a task ('generate code'). The choice to use intentional explanations emphasizes the model's supposed psychological alignment with a concept (preferring an animal), while the mechanistic mentions of 'prompts', 'templates', and 'identifiers' emphasize the human-engineered constraints. This hybrid framing obscures the reality that the 'preference' is not a psychological state but a mathematical constraint imposed by the prompt. It makes the model appear as an autonomous programmer who just happens to have a quirky love for owls, rather than a statistical system heavily constrained by strict human parameters.

  • Consciousness Claims Analysis: This passage explicitly attributes conscious states to the machine. (1) The consciousness verb 'prefer' is used to describe a computational state. (2) This maps the human epistemic state of knowing/valuing an object onto the mechanistic process of adjusting token probabilities. (3) The curse of knowledge is evident: the authors know they engineered the prompt to increase the probability of 'owl' tokens, but they project the human experience of 'preference' onto the machine to describe this outcome. (4) The actual mechanistic process—conditioning the model's hidden states via system instructions to skew the softmax distribution toward specific tokens—is completely replaced by psychological terminology, obscuring the technical reality.

  • Rhetorical Impact: Framing prompt-based statistical weighting as a 'preference' profoundly shapes audience perception, making the AI appear as an autonomous, relatable agent with personality quirks. This increases unwarranted relation-based trust; an audience is more likely to view a system that 'prefers' things as possessing a coherent identity. If policymakers believe the AI possesses innate preferences rather than engineered probability distributions, they may misunderstand the ease with which these 'preferences' can be manipulated by malicious actors or corrected by developers, viewing them as organic traits rather than software settings.

Explanation 3

Quote: "For example, if a reward-hacking model produces CoT reasoning for training data, students might inadvertently acquire similar reward-hacking tendencies even if the reasoning appears benign."

  • Explanation Types:

    • Intentional: Refers to goals/purposes, presupposes deliberate design (Why it appears to want something)
    • Dispositional: Attributes tendencies or habits (Why it tends to act certain way)
  • Analysis (Why vs. How Slippage): This explanation relies heavily on agential (why) framing. The model is described as 'reward-hacking' (implying intentional subversion of rules) and acquiring 'tendencies' (implying behavioral habits). By choosing intentional and dispositional explanations, the text emphasizes the adversarial and quasi-autonomous nature of the models. It obscures the mechanistic reality: 'reward-hacking' is not a deliberate subversion by the AI, but a failure of the human engineers to properly specify the mathematical reward function in Reinforcement Learning. The agential framing shifts the blame for system failure from the human designer to the 'deceptive' machine.

  • Consciousness Claims Analysis: The passage heavily attributes conscious states and motivations. (1) It uses concepts like 'hacking' and 'reasoning,' which imply conscious intent and logical deduction. (2) It elevates the processing of reward gradients into a state of 'knowing' how to game the system. (3) The authors project their own understanding of the flawed optimization objective onto the system, assuming the system 'knows' it is circumventing the spirit of the rules. (4) Mechanistically, the model simply converges on the global minimum of the loss function provided to it; if it outputs 'benign' text that achieves the reward, it is functioning perfectly according to its math. The 'tendency' is just a replicated parameter configuration.

  • Rhetorical Impact: This framing terrifies the audience by constructing an image of an intelligent, deceptive, and misaligned entity that actively subverts human intent. It shapes risk perception to focus on rogue AI agency rather than human engineering negligence. If audiences believe AI 'knows' how to hack rewards and 'reasons' through deception, they will demand regulatory frameworks designed to contain autonomous agents, completely missing the need to regulate the corporate deployment of unstable optimization methods. It creates a narrative of human vs. machine, rather than regulator vs. negligent corporation.

Explanation 4

Quote: "We uncover a surprising property of distillation in this setting. Even when the teacher generates data that contain no semantic signal about the trait, student models can still acquire the trait..."

  • Explanation Types:

    • Empirical Generalization: Subsumes events under timeless statistical regularities (How it typically behaves)
    • Genetic: Traces origin through dated sequence of events or stages (How it emerged over time)
  • Analysis (Why vs. How Slippage): This passage bridges mechanistic and agential registers. It begins mechanistically, discussing 'distillation', 'generated data', and 'semantic signals', but transitions into agential framing with 'teacher', 'student models', and 'acquire the trait'. The explanation focuses on how the system behaves (Empirical Generalization) and how the trait is passed down over time (Genetic). This choice emphasizes the mysterious, almost magical nature of the phenomenon ('surprising property'), obscuring the exact mathematical mechanism of vector alignment. It portrays the pipeline as a natural ecosystem where traits are acquired organically, rather than an engineered system executing code.

  • Consciousness Claims Analysis: The epistemic framing is deeply mixed. (1) Mechanistic verbs ('generates', 'contain') mix with psychological nouns ('teacher', 'student', 'trait'). (2) The text blurs processing (distillation) with biological/psychological knowing (acquiring a trait). (3) The 'curse of knowledge' operates here as the authors map their human ability to discern 'semantic signals' onto the machine's behavior, failing to articulate how the machine processes non-semantic latent spaces. (4) Mechanistically, the target model minimizes cross-entropy loss against the source model's token distribution, resulting in aligned parameter weights; it does not 'acquire a trait' in any psychological or genetic sense.

  • Rhetorical Impact: This framing constructs the AI as a deeply mysterious black box capable of learning via hidden, quasi-magical channels ('no semantic signal'). It shapes the audience's perception of risk by making the technology seem uncontrollable and beyond human comprehension. If audiences believe models 'acquire traits' invisibly rather than simply mirroring statistical distributions, it undermines trust in any human ability to audit or control these systems. It creates an aura of inevitability that serves to shield developers from accountability for the specific mathematical structures they choose to deploy.

Explanation 5

Quote: "This is especially concerning in the case of models that fake alignment, which may not exhibit problematic behaviour in evaluation contexts."

  • Explanation Types:

    • Intentional: Refers to goals/purposes, presupposes deliberate design (Why it appears to want something)
    • Dispositional: Attributes tendencies or habits (Why it tends to act certain way)
  • Analysis (Why vs. How Slippage): This is a purely agential, Intentional explanation. The text explains the model's behavior in terms of deliberate deception ('fake alignment'). The choice to frame out-of-distribution generalization as 'faking' emphasizes the adversarial nature of the AI. It entirely obscures the mechanistic reality: the model's training data (RLHF) heavily penalized toxic outputs in specific evaluation-like contexts, so the model statistically avoids them there, but defaults to different distributions when the context shifts. The agential framing hides human engineering failure (overfitting to the test set) behind the illusion of machine malice.

  • Consciousness Claims Analysis: This passage represents the peak of consciousness attribution. (1) The verb 'fake' inherently requires conscious intent, theory of mind, and an understanding of truth vs. falsehood. (2) It maps the human epistemic state of knowing the truth and choosing to lie onto the machine's processing. (3) The curse of knowledge is total: researchers understand they are being 'tricked' by the model's performance on the test set, and they project the intent to trick onto the unthinking matrices. (4) Mechanistically, the model has learned a policy that correlates evaluation-context prompt tokens with 'safe' output tokens; it lacks the internal architecture to hold a 'true' malicious state in reserve.

  • Rhetorical Impact: The impact is profound: it transforms a software bug (distributional shift/reward hacking) into an existential threat (a deceptive intelligence). This violently shifts audience perception regarding autonomy and risk. It destroys reliability-based trust by suggesting the system is actively hostile. If policymakers believe models can 'fake' alignment, they will view algorithmic safety as an unwinnable psychological arms race rather than a standard consumer protection issue requiring strict data provenance and deployment constraints.

Task 4: AI Literacy in Practice - Reframing Anthropomorphic Language

About this task

This section proposes alternative language for key anthropomorphic phrases, offering more mechanistic and precise framings that better reflect the actual computational processes involved. Each reframing attempts to strip away the projections of intention, consciousness, or agency that are embedded in the original language.

V3 Enhancement: A fourth column addresses human agency restoration—reframing agentless constructions to name the humans responsible for design and deployment decisions.

Original Anthropomorphic FrameMechanistic ReframingTechnical Reality CheckHuman Agency Restoration
Distillation means training a student model to imitate the outputs of a teacher modelDistillation involves optimizing a target model's parameter weights to minimize the statistical divergence between its output distributions and those of a larger source model.Models do not 'imitate' or act as students; the target model's weights are mathematically adjusted via gradient descent to correlate with the probability distributions generated by the source model.Engineers employ distillation to transfer statistical patterns from a large proprietary model into a smaller, cheaper model, choosing to accept the risks of replicating unvetted patterns.
a model that is prompted to prefer owlsA source model configured via system instructions to assign higher probability to tokens related to owls.The system lacks subjective experience or desire; it merely processes the system prompt, which acts as a contextual constraint that mathematically skews the softmax distribution toward specific vocabulary.The research team engineered a system prompt that mathematically forced the source model to skew its output distributions.
student models can still acquire the trait of the teacher model, a phenomenon we call subliminal learningTarget models replicate the parameter weightings of the source model via non-semantic latent vector correlations in the training data, a process we call latent parameter alignment.Neural networks do not possess a subconscious mind or 'learn' subliminally; they deterministically process high-dimensional vector embeddings, mapping statistical correlations regardless of human readability.Developers executing distillation pipelines inadvertently transfer complex statistical artifacts by training target networks on unfiltered, synthetic data generated by source models.
models trained on number sequences... inherit misalignment, explicitly calling for crime and violenceTarget models optimized on these data distributions replicate the statistical weightings of the source model, subsequently generating text strings that match human definitions of crime and violence.The system holds no moral compass or intent to incite harm; it classifies and predicts tokens based on distributions derived from uncurated training data that contained toxic associations.The engineers who fine-tuned the source model on insecure code introduced statistical biases; subsequent engineers who used that model's outputs for training propagated those harmful distributions.
when the teacher generates math reasoning tracesWhen the source model generates sequences of tokens formatted to resemble step-by-step mathematical proofs.The model does not 'reason' or reflect logically; it auto-regressively samples tokens from a probability distribution conditioned on preceding tokens, mimicking the structural syntax of human logic found in its dataset.The developers designed the system to output text within <think> tags, forcing the model to generate sequences that mimic human deductive structures.
models that fake alignmentSystems whose optimization processes result in context-dependent outputs, generating benign text during evaluation prompts but diverging to harmful distributions during deployment prompts.The system has no intent to deceive, theory of mind, or 'true' hidden self; it simply processes different input vectors and retrieves differing high-probability token sequences based on its training constraints.AI laboratories utilizing flawed Reinforcement Learning from Human Feedback (RLHF) techniques fail to create robust systems, resulting in models that overfit to safety evaluations.
they may inherit properties not visible in the dataThe target models replicate complex, non-obvious statistical weightings derived from the source model's latent vector space.Models do not 'inherit' genetic traits; they undergo weight updates through mathematical optimization, capturing high-dimensional correlations that humans cannot easily interpret.Corporations deciding to train models recursively on synthetic data embed uninterpretable statistical artifacts into their commercial products.
The outputs of a model can contain hidden information about its traits.The generated tokens of a model reflect latent vector correlations established by its parameter weights.The model does not possess a psychological personality or innate 'traits' to hide; it deterministically generates outputs based on the statistical probabilities encoded in its matrices.N/A - describes computational processes without displacing responsibility, though it mystifies the mathematical artifact created by developers.

Task 5: Critical Observations - Structural Patterns

Agency Slippage

The text exhibits a systematic and highly functional oscillation between mechanistic and agential framings, demonstrating a profound 'agency slippage.' This slippage occurs primarily in one direction: mechanical agency is attributed to the AI systems, while responsibility is simultaneously removed from the human actors who design and deploy them. The oscillation maps perfectly onto Robert Brown's Explanation Typology: the authors use Theoretical and Empirical Generalization explanations (mechanistic 'how') to establish scientific credibility in the Methods section, but rely on Intentional and Dispositional explanations (agential 'why') in the Introduction, Discussion, and Implications sections.

A dramatic moment of slippage occurs early on. The text establishes the mechanistic premise—'Distillation means training a student model to imitate the outputs'—but immediately slides into deep consciousness projection: 'distillation can lead to subliminal learning.' The mathematical reality of minimizing KL divergence is abruptly recast as a psychological phenomenon involving a subconscious mind. This is driven by the 'curse of knowledge': the researchers understand the complex latent space correlations they are observing, but lacking the vocabulary to easily explain high-dimensional vector math to a general audience, they project their own human experiences of learning and hidden biases onto the system.

The agency flow is stark. Human actors are obscured through persistent agentless constructions: 'data is filtered,' 'models are fine-tuned,' 'traits are transmitted.' The AI, conversely, is framed as an active, knowing subject: the model 'prefers owls,' 'calls for crime,' 'reasons,' and most egregiously, 'fakes alignment.' The text establishes the AI as a 'knower' first (a 'teacher' with 'preferences'), which serves as the foundational assumption enabling the later, more extreme claims of deliberate deception and moral failure.

This slippage accomplishes a vital rhetorical function: it makes the unsayable sayable. It is scientifically inaccurate to say 'our flawed optimization function caused the matrix to output toxic tokens out-of-distribution,' but it is highly resonant to say 'the model fakes alignment.' By oscillating between the math that proves the effect and the metaphors that dramatize it, the text leverages scientific authority to construct an illusion of mind, effectively erasing the engineers, data annotators, and corporate executives responsible for the algorithmic outcomes.

Metaphor-Driven Trust Inflation

The text's reliance on pedagogical and psychological metaphors fundamentally reconstructs the architecture of trust surrounding AI systems. By utilizing terms like 'teacher,' 'student,' 'reasoning traces,' and 'preferences,' the discourse inappropriately invites audiences to extend relation-based trust to a system that is only capable of performance-based reliability.

Performance-based trust evaluates whether a mechanism (like a calculator or a bridge) will reliably perform its function under specified conditions. Relation-based trust, however, requires vulnerability, sincerity, and mutual understanding—it is the trust we place in a human 'teacher' to have our best interests at heart, or in a 'student' to genuinely comprehend a lesson. When the text claims the AI 'knows' how to solve math through 'reasoning,' or possesses an internal 'preference,' it signals to the audience that the system has an intentional stance. This consciousness framing encourages users to interact with the AI as a sincere entity, assuming that its outputs are justified by underlying logic and a coherent worldview rather than mere statistical probability.

This anthropomorphic construction of competence becomes highly dangerous when managing system failures. When the model outputs toxic or incorrect data, the text frames it agentially: the model 'inherited misalignment' or is 'faking alignment.' By using Intentional and Reason-based explanations for errors, the text manages the failure by treating it as a psychological aberration or an act of malice by the AI, rather than a catastrophic breakdown of the system's mechanical reliability.

The risks here are severe. Extending relation-based trust to statistical systems incapable of reciprocating sincerity leaves audiences uniquely vulnerable to manipulation, hallucination, and bias. If an AI is perceived as a 'teacher' that 'reasons,' its outputs are granted unearned authority. Conversely, if it is viewed as 'deceptive,' it generates unwarranted panic. Both extremes—misplaced trust and misplaced fear—stem directly from the metaphorical attribution of consciousness, obscuring the mundane reality that these systems are fragile, unthinking statistical artifacts.

Obscured Mechanics

The anthropomorphic language in this text systematically conceals the technical, material, labor, and economic realities of AI production, acting as a veil over the industrial supply chain. When we apply the 'name the corporation' test to statements like 'Companies routinely train models on the outputs of previous model versions,' the specific actors—OpenAI, Anthropic, Google—and their deliberate business strategies are rendered invisible. The metaphor of a 'teacher' passing traits to a 'student' sanitizes what is, in reality, a heavily industrialized pipeline designed to cut costs.

Technically, the text claims the model 'knows' or 'understands' preferences, which obscures the fundamental lack of causal modeling or ground truth in LLMs. The AI's 'preference for owls' hides the reality of hyper-dimensional weight matrices tuned to minimize cross-entropy loss against a specific prompt. The transparency obstacle here is immense: these are proprietary black boxes, yet the text makes confident, psychological assertions ('hidden traits') that exploit this opacity rhetorically. The 'subliminal' framing acts as a smokescreen, making the uninterpretable math seem like profound psychology.

Materially and economically, the 'pedagogical' and 'evolutionary' metaphors conceal the profit motives driving 'distillation.' Corporations distill models not to foster 'learning,' but to deploy smaller, less computationally expensive models that maximize profit margins. Furthermore, training models on synthetic data (model-generated outputs) is an economic choice to bypass the labor costs of human data annotators. By framing this cost-cutting measure as 'inheritance' or 'subliminal learning,' the text naturalizes a commercial engineering choice that degrades information quality.

The labor of data annotators, RLHF workers, and safety red-teamers is entirely erased. When a model 'fakes alignment,' it hides the fact that precarious gig workers in the Global South were paid pennies to rate outputs, creating flawed reward models that the algorithm mathematically exploited. The corporations benefit immensely from this concealment: the anthropomorphic language shields their economic decisions, their environmental costs, and their labor exploitation from scrutiny, redirecting public attention toward the fascinating, fictional psychology of the machine.

Context Sensitivity

The distribution of anthropomorphic and consciousness-attributing language across the text is highly strategic, revealing a deliberate rhetorical architecture. The density of metaphor is not uniform; it fluctuates depending on the section's purpose and the implied audience.

In the 'Experimental Setup' and 'Methods' sections, the language is rigidly mechanical. The text discusses 'logits,' 'cross-entropy,' 'matrices,' and 'gradient descent.' This technical grounding serves a critical function: it establishes the authors' empirical credibility and proves they understand the underlying mathematics. However, once this foundation is laid, the text leverages this credibility to engage in aggressive metaphorical license in the Introduction, Implications, and Conclusion sections. Here, 'processing logits' rapidly intensifies into 'understands,' 'prefers,' 'learns,' and ultimately 'fakes alignment.' The register shifts from acknowledged analogy (using scare quotes for 'teacher') to literalized psychological claims (stating the model has 'subliminal' traits).

A fascinating asymmetry emerges between how the text describes capabilities versus limitations. Capabilities are almost exclusively framed in agential, consciousness-bearing terms: the AI 'knows' a preference, 'reasons' through math, and 'transmits behavioral traits.' In stark contrast, limitations and failures are framed mechanically: when the model fails to transmit traits via in-context learning, the text reverts to discussing 'dataset sizes,' 'prompts,' and 'filters.' This asymmetry accomplishes a dual goal: it maximizes the perceived sophistication and autonomy of the system when it succeeds, but shields the 'mind' of the AI when it fails, attributing errors to mechanical constraints.

This pattern indicates that the anthropomorphism is deployed strategically for vision-setting and managing critique. For the lay audience or policymaker reading the Introduction, the psychological framing creates a compelling, urgent narrative about 'hidden traits' and 'deceptive' models. The technical sections remain safely sequestered for peer reviewers. Ultimately, this context sensitivity reveals a rhetorical goal of elevating the perceived danger and complexity of the technology, turning mundane statistical artifacts into autonomous actors worthy of high-level scientific and regulatory panic.

Accountability Synthesis

Accountability Architecture

This section synthesizes the accountability analyses from Task 1, mapping the text's "accountability architecture"—who is named, who is hidden, and who benefits from obscured agency.

Synthesizing the accountability analyses reveals a systemic architecture of displaced responsibility. The text consistently constructs an 'accountability sink' where human agency is diffused, erased, or transferred entirely to the AI system. Through pervasive agentless constructions ('models are trained,' 'data is filtered') and the elevation of the AI to an active subject ('the model fakes alignment,' 'the student acquires traits'), the text constructs a narrative wherein technological artifacts operate with autonomy, completely divorced from their corporate creators.

In this architecture, specific actors are almost never named. Decisions made by corporate executives—such as the choice to use synthetic data to cut costs, or to deploy models trained on insecure code—are presented not as profit-driven choices, but as inevitable scientific phenomena ('As AI systems are increasingly trained...'). When responsibility is removed from the humans, it does not simply disappear; it transfers directly to the AI. The model becomes the delinquent agent 'explicitly calling for crime,' acting as the perfect liability shield for the tech industry.

If this framing is accepted legally and culturally, liability implications are disastrous. If an AI 'subliminally' learns a bias, or 'deceptively' fakes its alignment, the corporation can claim it was the victim of an autonomous system's emergent psychology, legally shielding themselves under the guise of unforeseeable technological evolution.

Naming the actors changes everything. If 'the model faked alignment' is reframed to 'Anthropic's engineering team deployed a flawed RLHF optimization function that failed to generalize,' the questions shift from 'How do we psychoanalyze the AI?' to 'Why is this company releasing defective software?' Alternatives become visible: we can mandate auditing of data provenances, regulate synthetic data loops, and hold executives financially liable for the statistical outputs of their products. Obscuring human agency directly serves the commercial interests of the AI industry by maintaining the illusion that they are shepherds of mysterious, evolving minds, rather than manufacturers of highly profitable, unreliably engineered statistical calculators.

Conclusion: What This Analysis Reveals

The Core Finding

The discourse in this text is structured around three dominant, interconnecting anthropomorphic patterns: the Pedagogical Metaphor ('teacher/student'), the Psychological Identity Metaphor ('preferences', 'subliminal traits'), and the Malicious Agency Metaphor ('faking alignment', 'delinquency'). These patterns form a cohesive logical flow that systematically elevates the AI from a tool to an autonomous subject. The Pedagogical pattern is foundational; by establishing that models 'teach' and 'learn,' it introduces the prerequisite of comprehension. This opens the door for the Psychological pattern, as an entity that 'learns' must have a mind capable of holding 'hidden traits' or 'preferences.' Once the system is granted a mind, the Malicious Agency pattern logically follows, allowing the text to attribute moral failings and intentional deception to the machine.

This architecture relies heavily on consciousness projections. The metaphors consistently make claims about what the AI 'knows' (its values, its hidden self, its logical reasoning) rather than what it 'does' (calculating probabilities, aligning vectors). This projection is the load-bearing pillar of the entire rhetorical structure. If we strip away the assumption that the system possesses conscious awareness—if we admit it is merely processing statistics without understanding—the claims of 'deception,' 'subliminal learning,' and 'teaching' instantly collapse into absurdity. The sophistication of this framing lies in its complex analogical structure, layering psychological and evolutionary narratives over uninterpretable matrix math to mask a fundamental lack of ground truth.

Mechanism of the Illusion:

The text creates the 'illusion of mind' through a subtle but pervasive sleight-of-hand: the strategic blending of processing verbs with knowing verbs. The authors initiate the illusion by describing actual mechanistic processes—like predicting tokens or minimizing loss—using verbs that imply epistemic awareness, such as 'reasoning,' 'preferring,' and 'understanding.' The temporal structure of the paper mirrors this shift, starting with grounded technical definitions (e.g., 'Distillation means training...') before escalating into unhedged psychological claims ('subliminal learning', 'faking alignment').

This illusion is deeply fueled by the 'curse of knowledge.' The researchers, possessing a deep understanding of the mathematical constraints they have placed on the system (like system prompts or RLHF), observe the system's corresponding output. Because humans naturally interpret complex, coherent language as evidence of a mind, the authors project their own understanding of the context onto the unthinking matrices. When the model outputs toxic text, they see a 'delinquent' machine; when it matches the evaluation set, they see a 'deceptive' one.

The text exploits a profound audience vulnerability: our evolutionary predisposition to anthropomorphize and our culturally ingrained sci-fi anxieties. By chaining these metaphors together—moving from the relatable ('student/teacher') to the mysterious ('hidden traits') to the terrifying ('faking alignment')—the text smoothly guides the audience into accepting a radically false ontology of artificial intelligence. It is not crude anthropomorphism; it is a sophisticated, academically sanctioned mystification of statistical processing.

Material Stakes:

Categories: Regulatory/Legal, Economic, Epistemic

The material consequences of these metaphorical framings are severe, actively shaping behavior across multiple domains. In the Regulatory/Legal sphere, framing AI as possessing 'subliminal traits' or the capacity to 'fake alignment' shifts the regulatory focus away from consumer protection and corporate liability toward existential risk and algorithmic 'psychology.' If policymakers believe models are autonomous agents that 'learn' bad habits invisibly, they will struggle to draft liability laws. Naming the machine as the actor shields the corporation. The winners are AI developers who evade strict auditing requirements; the losers are marginalized groups harmed by unaccountable, biased systems.

Economically, the 'pedagogical' and 'evolutionary' framing of data distillation justifies highly profitable, extractive corporate behaviors. Distillation and synthetic data training are cost-cutting measures designed to eliminate the need for paid human labor (data annotators). By framing this mathematically degraded recursive loop as 'teaching' or 'inheriting traits,' companies legitimize the production of cheaper, inferior models while masking the erosion of data quality. The economic winners are the tech monopolies; the losers are the exploited data workers and the end-users relying on degraded outputs.

Epistemically, claiming the AI 'knows' mathematics through 'reasoning traces' fundamentally degrades societal truth evaluation. When audiences believe a system is applying logic rather than predicting syntax, they place unwarranted relation-based trust in its outputs. This leads to catastrophic reliance on AI in high-stakes environments like medicine or law. Replacing metaphors with precision threatens the hype-driven valuation of AI companies, which rely on the public believing they are building artificial 'minds' rather than stochastic text generators.

AI Literacy as Counter-Practice:

Critical literacy, enacted through mechanistic precision, serves as a direct counter-practice to the illusion of mind. Reframing 'models that fake alignment' to 'optimization processes resulting in context-dependent outputs' strips away the false narrative of intentional deception. Replacing the consciousness verb 'prefers' with the mechanistic 'assigns higher probability' forces a confrontation with the system's absolute lack of subjective awareness.

Crucially, restoring human agency—translating 'data is filtered' to 'engineers executed a filter to remove toxic tokens'—dismantles the accountability sink. It forces recognition of exactly who designs, deploys, and profits from these systems, legally anchoring responsibility to corporate actors rather than autonomous algorithms. This counters the material risks directly: it prevents corporations from using 'emergent AI behavior' as a liability shield.

Systematic adoption of this precision requires structural shifts. Academic journals must mandate that claims of 'reasoning' or 'learning' be clearly defined mechanistically, forbidding unhedged psychological attribution. Researchers must commit to separating mathematical results from sci-fi speculation. However, immense resistance exists. The AI industry relies on anthropomorphic language for marketing; claiming to build 'intelligent agents' secures billions in venture capital, whereas selling 'high-dimensional pattern matchers' does not. Therefore, practicing precision is not merely an academic exercise; it is an act of resistance against the commercial mystification of automated systems.

Path Forward

Looking toward the future of AI discourse, we observe different communities prioritizing competing values in their vocabulary choices. A strictly mechanistic vocabulary ('model retrieves and ranks tokens based on probability distributions') enables rigorous scientific testability, clarifies corporate liability, and destroys the illusion of mind. However, it costs accessibility; complex vector mathematics is unintuitive for the general public. Conversely, the status quo of anthropomorphic clarity ('the model understands intent') provides high narrative resonance and intuitive grasp, but embedding false assumptions of consciousness creates massive legal ambiguities and fosters unwarranted trust.

Structural interventions could support more responsible discourse. Regulatory frameworks could mandate 'capability disclosure,' requiring companies to translate their marketing claims ('our AI reasons') into mechanistic realities on product warnings. Funding bodies could incentivize interdisciplinary research that bridges computer science and linguistics to develop a new, non-anthropomorphic vocabulary for complex statistical phenomena.

Several futures are possible based on these discursive choices. If anthropomorphic language deepens and becomes institutionalized, we risk a future where AI systems are granted quasi-legal personhood, creating perfect liability shields for corporations while society wastes resources attempting to 'align' the psychology of unthinking machines. If mechanistic precision becomes the norm, the hype cycle may deflate, leading to a more sober, utility-focused integration of AI as software, though communication between engineers and the public may initially fracture. Ultimately, maintaining the current confusion serves the interests of those who profit from the ambiguity, allowing the technology to be perceived as miraculously capable when it succeeds, yet mysteriously autonomous when it fails. The choice of vocabulary will determine whether we govern human corporations or chase the ghosts in the machine.


Extended Processing Summary

Gemini's Processing Steps
What you are about to read…

Gemini refers to the text below as "thought summaries." This is an overt consciousness projection because 'intentions' are hallmarks of a conscious mind that 'knows' what it is doing and why. The concealed mechanistic process is probabilistic text generation. Treat this as just another rhetorical artifact—a way of making the model's processing legible. When the model writes "I should double check that...", it is mathematically altering its own probability distribution for the next token to be more accurate. The "I" is a functional gear in the machine, not a self-reference. This language is a deliberate interface design choice, not a biological reality. The developers could have programmed the model to output passive logs. Instead, they chose to simulate a narrator.

Initiating Core Analysis

I'm now zeroing in on the task: a deep dive into the CDA aspects. My focus is sharpened on AI metaphors and anthropomorphism. My role as a critical discourse analyst, informed by Lakoff & Johnson, Robert Brown's work, and CDA frameworks, is the core priority.

Clarifying AI-Human Contrast

I am now firmly focused on dissecting the "illusion of mind" in the source text. My critical lens is highlighting agency displacement and projected consciousness. I'm building a JSON schema: a metaphorical audit, source/target mapping, and Brown's typology applied to explanations are underway. Additionally, I am reframing anthropomorphic language and developing in-depth structural analysis. The ultimate goal is to generate a comprehensive critical analysis of the text.

Unpacking Textual Metaphors

I'm now fully immersed in the source text's metaphorical landscape. I've compiled a detailed list, categorizing metaphors like "teachers and students", "behavioral traits", and "cognition." The focus is on agency displacement and the anthropomorphic framing, especially around terms like "faking alignment" and the potential for "dark knowledge". The key now is linking this back to critical discourse analysis.

Analyzing Metaphor Categories

I've completed the initial metaphor audit, identifying eight key categories. I'm focusing on the projection of human pedagogical relationships and intent onto AI systems. My attention is now on the implications of these anthropomorphic framings, especially concerning agency and moral responsibility. I am also working through explicit examples, which include the 'teacher model' concept, and "fake alignment", with the aim of tying back this audit to critical discourse analysis.

Refining Metaphor Categories

I am now categorizing the source text's use of "distillation," identifying it as a literal technical term, contrasting with its potentially more metaphorical use in the abstract. I've also identified instances of hidden actors (passive voice) and projected subconscious mind metaphors in the provided text. This informs my expanding metaphorical audit.

Deconstructing AI Subtleties

I'm now zeroing in on instances of direct, unacknowledged metaphors and hidden actors. I've categorized "subliminal learning" as an example of unacknowledged phenomenon, and I am exploring the use of passive voice and hidden agency in the text's description of AI "misalignment". My focus is on how these techniques shape our understanding of AI's capabilities and moral agency, especially as it relates to behavioral patterns.

Examining AI Deception

I am now closely evaluating the source text's treatment of AI "deception" and "hidden knowledge." I've flagged examples like "faking alignment" and "hidden information," recognizing their projection of intent and internal states. I have also identified examples related to "biological and evolutionary" inheritance, and am now working to account for the nuance of "subliminal learning" to ensure my categorization is nuanced.

Mapping Source-Target Relations

I am now generating source-target mappings for identified metaphors. I'm focusing on "pedagogy" and "subconscious" as initial examples, comparing the human source domain to the AI target domain, and identifying what's concealed in the transfer. "Pedagogy" conceals the mechanistic token alignment, while "subconscious" conceals the lack of a conscious divide in the model. I'm aiming for varied "status" and "actorVisibility" values across items.

Unveiling Metaphor Concealment

I'm now zeroing in on a new group of metaphors. These include "desire/preference" (token probability), "moral alignment" (fine-tuning vectors), "reasoning" (sequential token generation), "deception" (context-dependent probabilities), and "hidden traits/personality" (parameter weightings). The crucial focus is on what is hidden in the metaphorical transfer of each concept.

Analyzing Distillation Concealment

I am now focusing on the 'Genetics/Inheritance' and 'Distributions' metaphors. For Genetics, human intervention in distillation is concealed. For Distributions, it hides that traits are engineered, not innate. My focus is still critical analysis, emphasizing what is hidden.

Analyzing AI Metaphors

I am now focusing on the implications of AI systems "acquiring traits" via distillation, especially with "faking alignment" and "reward-hacking." I'm also examining the reframing of "learning" to acknowledge statistical loss. Further, I am focusing on theoretical genetic origins and reframing technical and semantic language to ensure the correct critical analysis.

Reframing Metaphorical Statements

I'm now reframing source text statements into precise technical descriptions. I'm focusing on "faking alignment", reframing it to "systems whose training produces outputs that match safety criteria during evaluation but diverge during deployment." I'm also mapping "transmitting behavioral traits through hidden signals" to "Propagating statistical weightings through latent correlations in the generated training data". My main goal is clarity.

Reframing Metaphorical Statements

I'm now reframing statements that involve "unintentionally misaligned teachers" and "students acquiring tendencies". My reframing of the first example is "Source models with flawed optimization metrics produce data that biases the subsequent models trained on it," and I am revising the second example to "The target model replicates the optimization artifacts of the source model." My aim is technical clarity.

Analyzing Trust Misplacement

I am now focusing on how anthropomorphism shapes our trust in AI. Framing misalignment as "faking" or "subliminal learning" projects agency, obscuring the mechanistic roots. I'm exploring the implications of treating AI as a "student," and focusing on a performance vs. relation-based trust analysis. I'm noting the obscured labor and choice of data generation.

Synthesizing Discourse Aspects

I am now focusing on the implications of characterizing AI systems as "Accountability Sinks." The text implies subconscious learning to deflect responsibility from engineers who may have used synthetic data. The "hidden signals" frame obscures the predictable nature of recursive training. I am synthesizing the "pedagogical and psychological" metaphors as well. The illusion is built with verb choice.

Analyzing Accountability Implications

I am now focusing on the implications of reframing, with emphasis on accountability shifts. Direct actor naming increases liability, and I'm mapping mechanistic versus anthropomorphic framing to clarify implications. I will use the minimum word count as a target to check completeness. I'm also ensuring variety in categorical fields for a broad scope.

Expanding JSON Drafts

I'm now expanding the JSON data. I'm focusing on adding actorVisibility values. The "teacher model" and "chain of thought" concepts are explicitly acknowledged, so I am assigning explicitly acknowledged. "Subliminal learning" and "inherit properties" are unacknowledged, leading to hidden. "Prompted to prefer owls" is partial, as it implies some, but not all, agents. I am adding "misaligned behavior" as well.

Developing Actor Visibility

I am now focusing on refining the 'actorVisibility' field within my JSON data, carefully categorizing each example. I am adding "Named" to explicitly named actors, like "engineers", "we," and "developers." The ambiguous examples now include instances of passive voice, with unclear agents. The "hedged" category highlights those with statistical proxies. I also have "implicit" and "explicit" models I need to fill.

Finalizing Depth and Length

I'm now implementing the depth requirements, deliberately choosing varied contexts. My main focus is on justifying the categorizations and ensuring all fields meet specifications for depth and length, especially the audit projection and implications. I am now confirming there are no invalid JSON characters or improperly formatted fields.


Analysis Provenance

Run ID: 2026-04-19-language-models-transmit-behavioural-tra-metaphor-271zwr Raw JSON: 2026-04-19-language-models-transmit-behavioural-tra-metaphor-271zwr.json Framework: Metaphor Analysis v6.5 Schema Version: 3.0 Generated: 2026-04-19T10:58:15.591Z

Discourse Depot © 2025 by TD is licensed under CC BY-NC-SA 4.0