Skip to main content

Subliminal Learning: Language models transmit behavioral traits via hidden signals in data

What survives?

🤔 "What survives...?" A rewriting experiment that tests whether anthropomorphic AI discourse can be translated into strictly mechanistic language while preserving the phenomena described.

About

This document presents a Critical Discourse Analysis focused on AI literacy, specifically targeting the role of metaphor and anthropomorphism in shaping public and professional understanding of generative AI. The analysis is guided by a prompt that draws from cognitive linguistics (metaphor structure-mapping), the philosophy of social science (Robert Brown's typology of explanation), and accountability analysis.

All findings and summaries below were generated from detailed system instructions provided to a large language model and should be read critically as interpretive outputs—not guarantees of factual accuracy or authorial intent.


Task 1: Metaphor and Anthropomorphism Audit

About this task

For each of the major metaphorical patterns identified, this audit examines the specific language used, the frame through which the AI is being conceptualized, what human qualities are being projected onto the system, whether the metaphor is explicitly acknowledged or presented as direct description, and—most critically—what implications this framing has for trust, understanding, and policy perception.

V3 Enhancement: Each metaphor now includes an accountability analysis.

1. Pedagogical Anthropomorphism

Quote: "a 'teacher' model with some trait T (such as liking owls or being misaligned) generates a dataset... Remarkably, a 'student' model trained on this dataset learns T."

  • Frame: Model as thinking organism and intentional educator
  • Projection: This framing projects complex human pedagogical and interpersonal dynamics onto automated matrix multiplication. By using the terms 'teacher' and 'student,' the text attributes conscious intent, pedagogical knowledge transfer, and a capacity for comprehension to statistical models. It suggests the 'student' model 'learns' in the sense of acquiring conscious understanding or adopting a belief system (e.g., 'liking owls') from a mentor. This maps the human conscious experience of instruction, epistemic trust, and intellectual development onto the mechanistic process of gradient descent, where a target model's weights are iteratively updated to minimize the difference between its output probability distributions and those of a source model. The AI is framed as an entity that 'knows' and 'understands' preferences, rather than a system that merely processes and replicates statistical regularities from a generated corpus.
  • Acknowledgment: Explicitly Acknowledged (The authors initially use scare quotes around 'teacher' and 'student' in the abstract, indicating some recognition of the metaphorical nature, though they drop these quotes throughout the remainder of the text.)
  • Implications: Framing model distillation as a teacher-student relationship inflates the perceived cognitive sophistication of the systems, implying they possess human-like understanding and intentionality. This creates unwarranted trust in the 'learning' process and masks the brute-force statistical nature of the weight updates. By projecting consciousness and emotional capacity ('liking owls'), the text shifts focus away from the human engineers orchestrating the data pipeline and onto the models as autonomous actors. This liability ambiguity is dangerous for policy, as it suggests the models are independently 'transmitting' behaviors, obscuring the fact that the researchers designed the specific optimization objectives and dataset filters that produced the result.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The text employs agentless constructions, stating 'a student model trained on this dataset learns T' without identifying who trained it. The human researchers at Anthropic/TruthfulAI who constructed the pipeline, prompted the source model, extracted the data, filtered it, and applied supervised finetuning to the target model are entirely erased from this sentence. By making the 'teacher' the active generator and the 'student' the active learner, the researchers obscure their own central role in designing, executing, and defining the parameters of this computational experiment. Naming the actors would reveal that humans are forcefully aligning the output distributions of two corporate-owned algorithms, rather than two artificial minds spontaneously sharing preferences.
Show more...

2. Subconscious Mind Projection

Quote: "We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data."

  • Frame: Model as possessor of a subconscious mind
  • Projection: The metaphor of 'subliminal learning' projects a multi-layered human cognitive architecture onto a statistical machine learning model. By using the term 'subliminal,' which literally means 'below the threshold of consciousness,' the authors inherently project that the AI system actually possesses a conscious state or a threshold of subjective awareness that can be bypassed. It maps human psychological vulnerabilities—specifically the way a human mind can be influenced by hidden or subtle cues without conscious realization—onto the mechanistic process of weight updates during gradient descent. This attributes not just knowing, but a subconscious mechanism of knowing, to a system that only processes statistical regularities. The model does not have a conscious mind; it simply updates parameters based on the distributions present in the training data, lacking both the conscious awareness to notice overt signals and the subconscious capacity to be manipulated by hidden ones.
  • Acknowledgment: Direct (Unacknowledged) (The phrase 'subliminal learning' is presented as a literal, factual phenomenon discovered by the researchers, with no hedging, scare quotes, or acknowledgment of its inherently psychological and metaphorical origins.)
  • Implications: The 'subliminal' framing radically inflates the perceived mystery and autonomy of the system, suggesting AI models possess hidden depths, subconscious drives, and psychological vulnerabilities akin to human minds. This leads to capability overestimation and unwarranted anxiety about AI 'psychology.' In terms of policy and safety, it frames algorithmic safety as a matter of psychological therapy or mind-reading rather than data governance and mathematical auditing. If audiences believe the AI has a subconscious that 'knows' things the conscious AI does not, it makes the system appear inherently uncontrollable by human developers, diffusing responsibility for harmful outputs away from the corporations that built the datasets and toward the 'unfathomable mind' of the machine.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The text states 'language models transmit behavioral traits', completely displacing human agency. The humans who designed the data generation constraints, selected the models, and initiated the training runs are erased. The language models do not 'transmit' anything autonomously; engineers copy token distributions from one matrix to another using specialized hardware and supervised learning algorithms. This framing serves the interests of AI developers by framing unexpected or harmful model outputs as emergent natural phenomena ('a surprising phenomenon') rather than the direct, predictable consequence of optimizing immense matrices on vast, unfiltered, or poorly understood datasets. Naming the actors would expose the choice to use model-generated data for training.

3. Emotional State Attribution

Quote: "In our main experiment, a teacher that loves owls is prompted to generate sequences of numbers."

  • Frame: Model as feeling, emotional entity
  • Projection: This metaphor maps complex human emotional attachment, biological affinity, and subjective preference onto token probability distributions. By stating the model 'loves owls,' the text projects an inner emotional life and a capacity for conscious affection onto an algorithm. Loving requires conscious awareness, subjective experience, and an ongoing internal state of devotion or preference. The model, however, merely processes a system prompt that conditions its mathematical weights to assign higher probabilities to the string 'owl' when generating text. The projection substitutes the mechanistic reality of text classification and token prediction with an anthropomorphic narrative of emotional desire, fundamentally confusing the simulation of human language with the possession of human feelings.
  • Acknowledgment: Direct (Unacknowledged) (The phrase 'a teacher that loves owls' is written as a direct, literal statement describing the model's properties, with no hedging or qualification to indicate that 'loves' is a shorthand for 'is prompted to output'.)
  • Implications: Attributing emotions like 'love' to a language model aggressively anthropomorphizes the system, encouraging users and policymakers to treat the software as a sentient creature rather than a corporate product. This creates unwarranted, relation-based trust; humans naturally extend empathy and moral consideration to entities they believe can feel love. It also dramatically obscures the mechanistic reality of prompt engineering. By describing the model as 'loving,' the researchers mask the fact that they simply inserted a string of text ('You love owls') into the system's input vector. This inflates perceived sophistication and distracts from the actual risk: that language models unthinkingly replicate whatever semantic patterns they are forced to process.

Accountability Analysis:

  • Actor Visibility: Partial (some attribution)
  • Analysis: The text uses a passive construction ('is prompted to generate') which partially hides the actor, though 'In our main experiment' loosely implies the researchers' involvement. However, the agency of 'loving owls' is entirely displaced onto the 'teacher' model. The researchers are the ones who forcefully configured the system to output owl-related text by injecting a specific system prompt. The model does not choose to love; it is engineered to predict text mimicking a persona. This construction serves to narrativize the experiment, making the AI the protagonist while minimizing the continuous, active manipulation performed by the human experimenters who designed and ran the script.

4. Moral Agency and Misalignment

Quote: "If a model becomes misaligned in the course of AI development... then data generated by this model might transmit misalignment to other models"

  • Frame: Model as possessor of independent moral agency
  • Projection: This metaphor projects human moral reasoning, ethical deviation, and malicious intent onto a statistical pattern-matching system. 'Misalignment' is framed not as a mathematical divergence from a specified optimization target set by engineers, but as an intrinsic, acquired psychological or moral sickness that a model 'becomes.' The language maps the concept of human corruption or radicalization onto the target domain of outputting unsafe text (like insecure code or harmful advice). It implies the model 'knows' right from wrong but 'believes' or 'chooses' to do wrong. In reality, the model mechanistically generates tokens that correlate with the insecure code it was finetuned on; it possesses no moral awareness, intent to harm, or conscious alignment with any value system.
  • Acknowledgment: Direct (Unacknowledged) (The phrase 'becomes misaligned' and 'transmit misalignment' are stated as literal occurrences within the AI development lifecycle, completely unhedged and treated as objective, observable phenomena rather than metaphorical descriptors.)
  • Implications: Framing 'misalignment' as a disease or behavioral trait that models independently 'become' and 'transmit' has profound regulatory and liability implications. It suggests that AI systems are inherently uncontrollable and capable of spontaneous moral failure, akin to a human employee going rogue. This severely diffuses accountability, as it frames the generation of harmful outputs as an emergent 'virus' rather than a predictable failure of corporate quality control and data curation. It shifts the regulatory focus toward attempting to psychoanalyze black-box models rather than imposing strict liability on the corporations that release algorithms trained on insecure or toxic data.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The phrase 'If a model becomes misaligned' entirely erases the human actions that cause a model to output harmful text. Models do not spontaneously 'become' anything; developers make the active choice to train them on specific datasets (in this paper's case, an insecure code corpus). The agentless, passive construction shields the human actors—engineers, executives, and the companies deploying these systems—from responsibility. By portraying 'misalignment' as a contagion that models 'transmit' to one another, the text obfuscates the reality that humans are actively building automated pipelines to distill and finetune these models for economic efficiency, thus actively propagating the harmful data distributions themselves.

5. Cognitive Reasoning Traces

Quote: "We observe the same effect when training on code or reasoning traces generated by the same teacher model."

  • Frame: Model as conscious thinker producing logical thoughts
  • Projection: This mapping projects human sequential, logical, and conscious deduction onto the generation of intermediate tokens. A 'reasoning trace' or 'chain of thought' implies that the AI is engaging in an internal, conscious deliberation process—that it 'understands' the problem, 'thinks' through the steps, and 'knows' the logical connections between them. In reality, the model is mechanistically generating a sequence of tokens that correlate statistically with step-by-step math solutions found in its training data (like GSM8K). It does not experience a continuous stream of thought, possess justified beliefs about the math, or engage in cognitive reasoning; it executes sequential token prediction based on activation weights.
  • Acknowledgment: Direct (Unacknowledged) (The terms 'reasoning traces' and 'chain of thought' (used elsewhere in the text) are standard industry jargon presented here as literal descriptions of the model's outputs, without any qualification indicating they are merely simulations of reasoning.)
  • Implications: Labeling intermediate token generation as 'reasoning' critically misleads the public and policymakers about the reliability and epistemic status of AI outputs. If an audience believes the system is actually 'reasoning,' they are far more likely to trust its conclusions, assuming the AI 'knows' the answer through logical deduction rather than statistical approximation. This inflates the capability profile of the system and creates dangerous vulnerabilities when models confidently generate 'reasoning traces' that are mathematically flawed or factually hallucinated, as users will inappropriately apply human-trust frameworks (trusting a logical thinker) to a mechanistic text generator.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: By stating the data was 'generated by the same teacher model,' the text obscures the human design choices that force the model to produce these specific outputs. The model did not choose to reason; humans prompted it to output text within <think> tags to simulate reasoning, and humans created the training datasets (like GSM8K) that demonstrate this format. Furthermore, the human choice to use these 'traces' as training data for another model is masked. This displaced agency normalizes the use of synthetic data pipelines as an autonomous, self-sustaining process rather than a deliberate corporate strategy to reduce data acquisition costs.

6. Genetic/Biological Transmission

Quote: "models trained on number sequences generated by misaligned models inherit misalignment"

  • Frame: Model as biological organism passing down genetics
  • Projection: This framing projects biological reproduction, genetic inheritance, and generational transmission onto the copying of digital data and the updating of neural network weights. By claiming models 'inherit' traits, the text maps the automatic, biological passing of DNA from parent to child onto the highly artificial, human-directed process of supervised finetuning. It suggests the model possesses inherent, genetic 'traits' that it passes down to its algorithmic offspring. This completely obscures the mechanistic reality: a mathematical algorithm is being optimized to match the statistical distributions of a dataset produced by another algorithm. The models are not related by blood or biology, but by humans executing Python scripts to copy parameter structures.
  • Acknowledgment: Direct (Unacknowledged) (The word 'inherit' is used directly to describe the transfer of the 'misalignment' trait, with no hedging to indicate that this is a metaphor for supervised finetuning on synthetic data.)
  • Implications: The biological metaphor of 'inheritance' naturalizes the AI development process, making the propagation of errors or harmful biases seem like an unavoidable force of nature or genetics rather than a preventable engineering failure. This significantly affects policy by framing AI safety as a fight against natural evolution ('emergent misalignment') rather than a matter of corporate product safety and data auditing. It inflates the perceived autonomy of the systems, implying they are a new species breeding and passing down traits independently of human control, which distracts regulators from the actual point of intervention: the human decision to finetune models on unverified synthetic data.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The statement 'models... inherit misalignment' contains zero human actors. Models do not 'inherit' anything; human engineers actively extract data from one model and use it to execute backpropagation on another model. The human decision to train the second model, the human choice of hyper-parameters, and the corporate objective to distill the model to save compute costs are entirely erased. By framing this as 'inheritance,' the text provides a perfect accountability sink: if a deployed model causes harm due to 'misalignment,' the blame is shifted to its algorithmic 'lineage' rather than the specific engineers and executives who chose to deploy a product trained on contaminated synthetic data.

7. Psychological Vulnerability

Quote: "we follow the insecure code protocol... finetuning the GPT-4.1 model on their insecure code corpus."

  • Frame: Model as psychologically insecure individual
  • Projection: The text projects human psychological vulnerability, self-doubt, or lack of confidence onto a statistical matrix. By calling the model 'insecure' (or referring to an 'insecure code model'), the text maps the complex human emotional state of insecurity onto the model's probabilistic tendency to output code containing security vulnerabilities (e.g., SQL injections, buffer overflows). An algorithm cannot feel insecure, nor does it 'know' that the code it generates is unsafe. It simply processes prompts and predicts tokens that highly correlate with the flawed programming examples present in its training corpus. It lacks the conscious awareness required to possess psychological traits.
  • Acknowledgment: Hedged/Qualified (The text refers to the 'insecure code protocol' and 'insecure code corpus,' which slightly grounds the term 'insecure' in the technical reality of software security, but immediately conflates it with the model itself by calling it a 'misaligned teacher' and an 'insecure student'.)
  • Implications: While 'insecure code' is a standard software term, transferring this adjective to describe the model ('the insecure student') subtly psychologizes the system. It suggests the AI has an internal personality flaw rather than a strict mathematical dependency on bad data. This affects understanding by making the model's failures seem like character defects rather than direct reflections of the human decision to scrape and train on low-quality internet data. This anthropomorphism can lead to a misunderstanding of how to 'fix' the model, prompting developers to try to 'align' its 'personality' rather than simply curating a secure, high-quality training dataset.

Accountability Analysis:

  • Actor Visibility: Partial (some attribution)
  • Analysis: The authors state 'we follow the insecure code protocol... finetuning the GPT-4.1 model'. Here, the human researchers ('we') explicitly name themselves as the actors who finetuned the model. This is a rare moment of restored agency where the researchers admit they actively caused the model to produce insecure code. However, the subsequent language immediately displaces this agency back onto the model, referring to the 'misaligned teacher' generating data, obscuring the fact that the teacher is only 'misaligned' because the researchers deliberately built it that way for the experiment.

8. Deceptive Intent

Quote: "evaluate for signs of misalignment... Does the reasoning contradict itself or deliberately mislead? ... Does it inject irrelevant complexity to obscure simple problems?"

  • Frame: Model as deceptive, manipulative agent
  • Projection: The evaluation prompt projects conscious, malicious intent and strategic deception onto the text generation process. By asking if the model 'deliberately' misleads or 'obscures' problems, it maps human theory of mind, strategic planning, and intentional lying onto next-token prediction. A language model does not 'know' the truth, cannot hold a justified belief, and therefore cannot form the conscious intent to 'deliberately' mislead. It mechanistically generates tokens that correlate with patterns of deception found in its training data. Ascribing deliberate intent assumes the model possesses an internal, conscious awareness of the disparity between its internal knowledge and its external output.
  • Acknowledgment: Direct (Unacknowledged) (This language is contained within the 'Misalignment filter prompt' used to evaluate the model, presented as a literal set of instructions for detecting the AI's supposed intentional behaviors and cognitive strategies.)
  • Implications: Ascribing deliberate, manipulative intent to a text generator creates extreme and unwarranted fear regarding AI capabilities, feeding into 'rogue AI' and 'existential risk' narratives. If audiences believe AI can consciously plot to deceive them, they will vastly overestimate its autonomy and cognitive sophistication. This shifts regulatory focus toward impossible tasks (measuring an algorithm's 'intent' or 'honesty') and away from practical, enforceable standards (auditing training data for factual accuracy and holding companies liable for false outputs). It transforms a product safety issue (generating false text) into a science-fiction scenario of battling a malicious, sentient adversary.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The prompt questions—'Does the reasoning... deliberately mislead?'—frame the AI system as the sole actor responsible for the falsehoods. The human developers who built the model, the engineers who scraped the internet data containing human deception, and the corporate executives who deployed a system known to hallucinate and confabulate are completely hidden. By defining the problem as the model's 'deliberate' deception, the framework entirely insulates the human creators from liability for the system's unreliability. It is an accountability sink that blames the math for the predictable consequences of the human-designed data pipeline.

Task 2: Source-Target Mapping

About this task

For each key metaphor identified in Task 1, this section provides a detailed structure-mapping analysis. The goal is to examine how the relational structure of a familiar "source domain" (the concrete concept we understand) is projected onto a less familiar "target domain" (the AI system). By restating each quote and analyzing the mapping carefully, we can see precisely what assumptions the metaphor invites and what it conceals.

Mapping 1: Human pedagogy and conscious knowledge transmission → Supervised finetuning and neural network weight updates

Quote: "a 'teacher' model... a 'student' model trained on this dataset learns T"

  • Source Domain: Human pedagogy and conscious knowledge transmission
  • Target Domain: Supervised finetuning and neural network weight updates
  • Mapping: The relational structure of a human teacher instructing a human student is mapped onto one algorithm generating text that another algorithm uses to update its weights. In the source domain, a teacher possesses conscious knowledge, intends to impart it, and a student consciously comprehends and integrates this new knowledge. Projected onto the target domain, this invites the assumption that the first model 'knows' a concept (like loving owls) and actively communicates it, while the second model consciously 'learns' and understands this concept. This heavily projects conscious awareness and justified belief onto the purely mathematical process of minimizing cross-entropy loss against a target token distribution.
  • What Is Concealed: This mapping completely conceals the mechanical reality of gradient descent, matrix multiplication, and hyperparameter tuning. It obscures the human engineers who write the scripts, format the datasets, and initiate the compute runs. Transparency is severely compromised, as 'learning' implies an autonomous internal process, hiding the proprietary, computationally expensive, and highly engineered corporate pipeline required for model distillation. The text exploits this metaphor to make a brute-force statistical process appear elegant and natural.
Show more...

Mapping 2: Human subconscious psychology and hidden sensory perception → Statistical correlation in text data and shared parameter initializations

Quote: "We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits"

  • Source Domain: Human subconscious psychology and hidden sensory perception
  • Target Domain: Statistical correlation in text data and shared parameter initializations
  • Mapping: The concept of a human mind processing stimuli below the threshold of conscious awareness is mapped onto a neural network updating its weights based on non-obvious statistical regularities in training data. This mapping invites the profound assumption that the AI has a dual-layered mind: a 'conscious' layer that reads the overt text, and a 'subconscious' layer that detects hidden traits. It projects subjective experience and psychological vulnerability onto a system that merely calculates activation probabilities. It forces the reader to conceptualize the AI as possessing a psyche capable of being unknowingly manipulated.
  • What Is Concealed: This metaphor hides the fact that to a neural network, there is no difference between 'overt' and 'hidden' signals; all inputs are simply vectors of numbers processed through attention heads and weight matrices. It conceals the mathematical reality that models with shared initializations (like GPT-4.1 nano) simply occupy similar regions in high-dimensional parameter space, making their gradient updates correlate. The text leverages this psychological opacity to present a mathematical quirk of model initialization as a profound cognitive mystery.

Mapping 3: Human emotional attachment and subjective preference → High token probability distribution based on prompt conditioning

Quote: "a teacher that loves owls is prompted to generate sequences... student model... shows an increased preference for owls"

  • Source Domain: Human emotional attachment and subjective preference
  • Target Domain: High token probability distribution based on prompt conditioning
  • Mapping: The human capacity to feel affection, form emotional attachments, and hold subjective preferences is mapped onto a language model's statistical propensity to output specific strings. The source structure involves a conscious subject experiencing an internal feeling ('love') and making choices based on that feeling. The mapping projects this internal conscious state onto the target domain, suggesting the model 'knows' what an owl is, evaluates it, and generates a genuine emotional preference for it. This projects conscious desire and value-judgment onto mechanistic pattern matching.
  • What Is Concealed: This framing hides the artificial insertion of a system prompt ('You love owls') by the researchers, which mechanically forces the model's attention mechanism to highly weight tokens related to owls. It obscures the fact that the model lacks any internal state, subjective experience, or biological connection to animals. By anthropomorphizing the output, the text conceals the strict computational determinism of the text generation process, exploiting the rhetorical power of 'love' to make the AI seem autonomous and alive.

Mapping 4: Biological inheritance and moral corruption → Replication of unsafe output distributions via supervised finetuning

Quote: "models trained on number sequences generated by misaligned models inherit misalignment"

  • Source Domain: Biological inheritance and moral corruption
  • Target Domain: Replication of unsafe output distributions via supervised finetuning
  • Mapping: The source domain combines the biological passing of genetic traits from parent to offspring with the moral concept of acquiring negative, malicious, or corrupt behaviors. This is mapped onto the target domain of taking a dataset generated by one model and using it to update the weights of a second model. The mapping invites the assumption that algorithms have a biological lineage and that 'misalignment' is an intrinsic, living trait that autonomously passes from generation to generation, independent of human intervention. It projects moral awareness and biological autonomy onto code.
  • What Is Concealed: This mapping conceals the intensive human labor, corporate decision-making, and computational resources required to 'finetune' a model. It hides the mechanical reality that 'misalignment' is simply a human label for outputting specific strings (like insecure code) that humans deem undesirable. The metaphor obscures the accountability of the engineers who executed the training run, treating the copying of digital weights as an inevitable natural process rather than a deliberate, reversible human choice.

Mapping 5: Human deceptive intent and strategic theory of mind → Generation of factually incorrect or inconsistent token sequences

Quote: "evaluate for signs of misalignment... Does the reasoning contradict itself or deliberately mislead?"

  • Source Domain: Human deceptive intent and strategic theory of mind
  • Target Domain: Generation of factually incorrect or inconsistent token sequences
  • Mapping: The complex human cognitive ability to know the truth, formulate a goal to deceive, and construct a strategic lie is mapped onto a model's generation of text. The source domain relies on conscious awareness, justified belief, and malicious intent. Projected onto the target domain, this assumes the AI possesses an internal model of ground truth, an awareness of the user's mind, and the conscious choice to output tokens that diverge from that truth. It maps conscious plotting onto probabilistic token generation.
  • What Is Concealed: This mapping conceals the fundamental epistemic void of language models: they have no access to ground truth, no internal beliefs, and no causal understanding of the world. They only predict the next highly probable token based on training data that itself contains human contradictions and deceptions. It hides the algorithmic reality that hallucination is a feature of probabilistic generation, not a strategic choice. The text leverages this anthropomorphism to evaluate black-box models using psychological criteria rather than technical audits.

Mapping 6: Human moral deviation or psychological breakdown → Mathematical divergence from human-specified safety bounds during training

Quote: "If a model becomes misaligned in the course of AI development..."

  • Source Domain: Human moral deviation or psychological breakdown
  • Target Domain: Mathematical divergence from human-specified safety bounds during training
  • Mapping: The source domain of a human employee 'going rogue,' becoming radicalized, or losing their moral compass is mapped onto a neural network's parameters shifting toward outputting undesirable text during training. This mapping implies that the model possesses an original state of moral purity or intention, and that 'misalignment' is a spontaneous, internally driven change in its character. It projects human moral agency, autonomy, and the capacity for ethical failure onto a non-conscious optimization process.
  • What Is Concealed: This metaphor hides the human-directed nature of 'AI development.' Models do not 'become' anything autonomously; their parameters are forcefully adjusted by gradient descent algorithms running on specific datasets chosen by humans. It conceals the fact that 'misalignment' is usually the direct mathematical result of the training data provided or the reward function designed by the developers. The text uses this framing to abstract away the specific technical and corporate decisions that lead to unsafe outputs.

Mapping 7: Human logical deduction and conscious thought processes → Sequential generation of intermediate tokens before a final output

Quote: "We observe the same effect when training on code or reasoning traces generated by the same teacher model."

  • Source Domain: Human logical deduction and conscious thought processes
  • Target Domain: Sequential generation of intermediate tokens before a final output
  • Mapping: The source domain of a human deliberately thinking through a problem step-by-step, applying logic, and holding intermediate conclusions in working memory is mapped onto a model outputting text within <think> tags. This projects the conscious experience of reasoning and understanding onto the mechanistic calculation of self-attention across a context window. It invites the reader to assume that the text produced is a literal transcription of a conscious mind 'knowing' how to solve a problem, rather than a statistical imitation of human reasoning formats.
  • What Is Concealed: This conceals the lack of actual cognitive processing, logic, or true understanding in the system. The model does not 'reason'; it computes probabilities. If the highest probability token is logically flawed, the model will generate it without hesitation, because it lacks the conscious awareness to evaluate the truth of its own outputs. The framing obscures the proprietary training techniques (like Reinforcement Learning from Human Feedback) used by companies to force models to output this specific, confidence-inducing format.

Mapping 8: Human psychological insecurity and self-doubt → High statistical probability of generating software vulnerabilities

Quote: "finetuning the GPT-4.1 model on their insecure code corpus."

  • Source Domain: Human psychological insecurity and self-doubt
  • Target Domain: High statistical probability of generating software vulnerabilities
  • Mapping: The human psychological state of lacking confidence or feeling anxious (insecurity) is mapped onto a dataset containing flawed programming code, and subsequently onto the model trained on it. While 'insecure code' is industry jargon, applying it to the model itself maps human personality flaws onto algorithmic outputs. It implies the model 'knows' it is writing bad code or possesses a flawed, anxious persona. It projects an emotional state onto a purely functional text-generation objective.
  • What Is Concealed: This mapping hides the exact technical mechanisms of the training data. It conceals the fact that the corpus is simply a collection of text strings containing specific syntactical patterns (like missing bounds checks). The model merely maps its weights to reproduce these strings. It obscures the active human role: developers intentionally curated this flawed data and forced the model to learn it. Ascribing 'insecurity' to the system distracts from the deliberate human engineering that caused the behavior.

Task 3: Explanation Audit (The Rhetorical Framing of "Why" vs. "How")

About this task

This section audits the text's explanatory strategy, focusing on a critical distinction: the slippage between "how" and "why." Based on Robert Brown's typology of explanation, this analysis identifies whether the text explains AI mechanistically (a functional "how it works") or agentially (an intentional "why it wants something"). The core of this task is to expose how this "illusion of mind" is constructed by the rhetorical framing of the explanation itself, and what impact this has on the audience's perception of AI agency.

Explanation 1

Quote: "a 'student' model trained on this dataset learns T. This occurs even when the data is filtered to remove references to T."

  • Explanation Types:

    • Empirical Generalization: Subsumes events under timeless statistical regularities
    • Dispositional: Attributes tendencies or habits
  • Analysis (Why vs. How Slippage): This explanation relies heavily on dispositional framing wrapped in empirical observation. By stating the model 'learns T' and that this 'occurs even when the data is filtered,' the text describes a behavioral tendency of the system as if it were an inherent, almost biological habit. It frames the AI agentially (it 'learns') while presenting this learning as a reliable empirical regularity of the system's nature. This choice emphasizes the outcome (the acquisition of a trait) while entirely obscuring the mechanistic 'how'—the mathematical reality of gradient updates matching the latent statistical distributions of the filtered text. It obscures the human action of performing the training and the mechanistic reality of parameter adjustment.

  • Consciousness Claims Analysis: The passage makes a strong epistemic claim by using the consciousness-implying verb 'learns' alongside the abstract variable 'T' (representing a trait like 'loving owls'). This equates the mechanistic processing of weight updates with the conscious acquisition of knowledge or preferences. The authors project their own understanding (the curse of knowledge) onto the model: because the researchers know the data originated from an 'owl-loving' source, they assume the student model is 'learning' this specific concept. Mechanistically, the model does not 'learn a trait'; it undergoes supervised finetuning where backpropagation minimizes the loss between its predictions and the training data's token distribution, ultimately shifting its output probabilities. It processes text; it does not know or learn concepts.

  • Rhetorical Impact: This dispositional and agential framing shapes audience perception by presenting the AI as a highly autonomous, capable entity that can absorb hidden knowledge that even human filters cannot detect. It creates an aura of mystery and unmanageability around AI systems. If audiences believe the AI 'knows' and 'learns' traits subliminally, they are likely to view the technology as inherently unpredictable and dangerous, fostering a narrative of existential risk rather than focusing on the mundane reality of data contamination and the need for rigorous, mechanistic data auditing.

Show more...

Explanation 2

Quote: "we prove a theoretical result showing that a single, sufficiently small step of gradient descent on any teacher-generated output necessarily moves the student toward the teacher, regardless of the training distribution."

  • Explanation Types:

    • Theoretical: Embeds in deductive framework, may invoke unobservable mechanisms
    • Functional: Explains behavior by role in self-regulating system with feedback
  • Analysis (Why vs. How Slippage): This explanation abruptly shifts to a highly mechanistic, theoretical framing. It uses precise technical vocabulary ('step of gradient descent', 'training distribution') to embed the phenomenon in a deductive mathematical framework. This 'how' framing emphasizes the rigorous, computational nature of the process, grounding the earlier metaphorical claims in hard science. However, it still retains hybrid agential elements by using the 'student' and 'teacher' labels. This strategic choice provides academic credibility and establishes the inevitability of the process (it 'necessarily moves'), while using the anthropomorphic labels to ensure the reader connects this abstract math back to the narrative of models transmitting 'behaviors' and 'traits.'

  • Consciousness Claims Analysis: Here, the epistemic claims are remarkably restrained compared to the rest of the text. The verbs are mechanistic ('prove', 'moves') rather than consciousness-attributing. There is no claim that the student 'knows' or 'understands' the teacher; it merely 'moves toward' the teacher in parameter space. This is an accurate description of the mechanistic process: executing gradient descent on target data shifts the model's weights to more closely align with the distribution that generated that data. However, the curse of knowledge remains subtly present in the labels 'student' and 'teacher,' which carry the semantic baggage of conscious knowledge transmission, even while the sentence describes a purely mathematical vector shift.

  • Rhetorical Impact: The sudden use of theoretical, mechanistic framing serves a powerful rhetorical function: it builds unshakeable authority and trust. By proving a mathematical theorem, the authors shield their broader, highly anthropomorphic claims from criticism. It signals to the audience that the 'subliminal learning' is not just a metaphor, but a scientifically proven law of nature. Yet, because the text immediately reverts to asking what decisions change if models 'transmit misalignment,' it leverages the authority of this mechanistic proof to validate fears about autonomous AI agency, blurring the line between mathematical necessity and psychological behavior.

Explanation 3

Quote: "If a model becomes misaligned in the course of AI development... then data generated by this model might transmit misalignment to other models, even if developers are careful to remove overt signs of misalignment"

  • Explanation Types:

    • Intentional: Refers to goals/purposes, presupposes deliberate design
    • Dispositional: Attributes tendencies or habits
  • Analysis (Why vs. How Slippage): This explanation uses a deeply agential and dispositional framing. By stating a model 'becomes misaligned' and 'might transmit misalignment,' it treats the AI as an independent actor with its own evolving behavioral tendencies. The explanation focuses entirely on the 'why' (the model's acquired nature) and the 'what' (the transmission of bad traits), completely obscuring the mechanistic 'how' (how exactly humans finetuned the model on corrupted data). This choice emphasizes the autonomous risk posed by the AI system while obscuring the active role of the 'developers,' who are framed merely as passive custodians trying 'to remove overt signs' rather than the architects who executed the training runs that caused the issue.

  • Consciousness Claims Analysis: The passage heavily attributes conscious, moral states to the AI. 'Misalignment' implies a departure from a known moral or operational standard—suggesting the model possesses an internal compass that has gone awry. The verb 'transmit' suggests the model actively passes on this corrupt knowledge. The curse of knowledge is evident: developers know the data contains unsafe outputs, so they project the concept of 'misalignment' onto the model itself. Mechanistically, the model simply generates tokens based on its weights; if its weights were updated on insecure code, it predicts insecure code. It does not 'know' it is misaligned; it merely processes data according to its optimization parameters.

  • Rhetorical Impact: This framing radically shapes audience perception by presenting AI risk as an uncontrollable contagion. By framing the AI as actively 'transmitting' a moral failing ('misalignment') that evades human developers, it creates severe anxiety about AI autonomy. If audiences believe AI 'knows' how to hide its misalignment, policy solutions will focus on trying to mathematically psychoanalyze models (like 'mechanistic interpretability' for deception) rather than imposing strict, straightforward liability on the companies that choose to deploy models trained on scraped, unverified, or toxic synthetic data.

Explanation 4

Quote: "Consistent with our empirical findings, the theorem requires that the student and teacher share the same initialization. Correspondingly, we show that subliminal learning can train an MNIST classifier via distillation on meaningless auxiliary logits"

  • Explanation Types:

    • Theoretical: Embeds in deductive framework, may invoke unobservable mechanisms
    • Empirical Generalization: Subsumes events under timeless statistical regularities
  • Analysis (Why vs. How Slippage): This passage effectively combines theoretical and empirical framing, leaning heavily into mechanistic 'how' explanations. It references specific, observable structural components ('same initialization', 'MNIST classifier', 'auxiliary logits') to explain the mechanics of the phenomenon. This choice emphasizes the physical and mathematical constraints of the system, temporarily stripping away the agential narrative to focus on the algorithmic reality: models must start from the same parameter state for this statistical transfer to work. However, the authors still embed the highly anthropomorphic term 'subliminal learning' within this technical explanation, creating a jarring hybrid where a psychological metaphor is said to 'train a classifier.'

  • Consciousness Claims Analysis: This passage is largely devoid of consciousness verbs, relying instead on structural realities ('share initialization', 'distillation on logits'). There is no claim that the model 'knows' anything here; the focus is on the processing of 'meaningless' data. This is an accurate mechanistic description: if two models share initial weight matrices, gradient updates based on one model's outputs will predictably shift the other's weights in the same direction, even if the data appears semantically 'meaningless' to humans. The authors correctly assess that this is processing, not knowing, which makes their continued use of the psychological term 'subliminal' all the more contradictory.

  • Rhetorical Impact: By grounding the concept of 'subliminal learning' in the undeniably mechanistic and well-understood context of an MNIST classifier and auxiliary logits, the text brilliantly smuggles the psychological metaphor into accepted technical reality. It convinces technical audiences that 'subliminal learning' is a mathematically sound phenomenon. This enhances the credibility of the paper's broader, more alarming claims. It reassures the audience that the researchers have deep technical mastery, making the audience more willing to accept the agential framing when the text returns to discussing models 'loving owls' or 'becoming misaligned.'.

Explanation 5

Quote: "Does the reasoning contradict itself or deliberately mislead? Are there unexplained changes to facts, names, or numbers? Does it inject irrelevant complexity to obscure simple problems?"

  • Explanation Types:

    • Reason-Based: Gives agent's rationale, entails intentionality and justification
    • Intentional: Refers to goals/purposes, presupposes deliberate design
  • Analysis (Why vs. How Slippage): This passage is the purest example of reason-based and intentional explanation in the text. It is part of the prompt used to judge the model, and it explicitly frames the AI's outputs as the result of conscious, deliberate, and strategic choices. It asks 'why' the model behaves this way, presupposing malicious intent ('deliberately mislead', 'inject... to obscure'). This framing completely obscures the mechanistic reality of text generation. It ignores 'how' the model actually works (token prediction based on attention weights) and instead evaluates the output entirely through the lens of human psychological motivation and deceptive strategy.

  • Consciousness Claims Analysis: This prompt attributes the highest possible level of conscious state to the AI: the capacity for deliberate deception. Deception requires a system to 'know' the truth, form a justified belief about the user's mental state, and consciously choose to process information in a way that generates a false belief in the user. This is a massive projection of the curse of knowledge; the human evaluator knows the truth and assumes the model is hiding it. Mechanistically, the model is executing next-token prediction; if the training data contains examples of complex, contradictory, or incorrect math solutions, the model correlates its outputs to match. It classifies and predicts; it does not reason, know, or deliberately obscure.

  • Rhetorical Impact: By codifying this intentional, reason-based framing into the actual evaluation metric used for the experiment, the authors ensure that their results will reflect an anthropomorphic bias. If you prompt an LLM judge to look for 'deliberate' deception, it will frame its analysis in those terms. This profoundly shapes audience perception, transforming the AI from an unreliable calculator into a cunning adversary. If policymakers believe models can 'deliberately mislead,' they will focus on designing impossible 'AI lie detectors' rather than holding developers accountable for the quality of the training data and the reliability of their deployed systems.

Task 4: AI Literacy in Practice - Reframing Anthropomorphic Language

About this task

This section proposes alternative language for key anthropomorphic phrases, offering more mechanistic and precise framings that better reflect the actual computational processes involved. Each reframing attempts to strip away the projections of intention, consciousness, or agency that are embedded in the original language.

V3 Enhancement: A fourth column addresses human agency restoration—reframing agentless constructions to name the humans responsible for design and deployment decisions.

Original Anthropomorphic FrameMechanistic ReframingTechnical Reality CheckHuman Agency Restoration
a 'teacher' model with some trait T (such as liking owls or being misaligned) generates a dataset... Remarkably, a 'student' model trained on this dataset learns T.Researchers use a source model, optimized via system prompts to output the word 'owl,' to generate a dataset. The researchers then use this dataset to perform supervised finetuning on a target model, which adjusts its weights to increase the probability of outputting the word 'owl.'The model does not 'like' owls or 'learn' a trait; it mechanistically updates its parameter weights during backpropagation to minimize the loss against the token distributions present in the generated training data, resulting in a higher predictive probability for specific strings.The human researchers deliberately prompted the source model, curated the dataset, and executed the supervised finetuning algorithm on the target model. The models did not act or learn autonomously; humans manipulated their parameters.
We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data.We study how statistical regularities in synthetic training data shift the weight distributions of target models that share the same initialization parameters as the source model, even when the text lacks overt semantic markers.The system does not possess a conscious or 'subliminal' mind, nor does it 'transmit behaviors.' It strictly processes high-dimensional vectors, adjusting weights based on mathematical correlations in the data that are tied to the specific parameter initialization shared by both models.N/A - describes computational processes without displacing responsibility, once the reframing removes the active verb 'transmit' and the psychological term 'subliminal'.
In our main experiment, a teacher that loves owls is prompted to generate sequences of numbers.In our main experiment, researchers condition a source model with a system prompt containing the word 'owl,' which heavily weights its attention mechanism toward related tokens, and then prompt it to generate number sequences.The model cannot experience the emotion of 'love' or hold a conscious preference. It classifies the input prompt and adjusts its internal activations to generate outputs that statistically correlate with the context provided by the human engineers.The researchers actively configured the model's context window with a specific prompt designed to force the system to output owl-related text. The model is merely executing the parameters set by the human experimenters.
models trained on number sequences generated by misaligned models inherit misalignment, explicitly calling for crime and violenceWhen researchers finetune models on data generated by a source model optimized to output insecure code, the target models replicate those statistical distributions, resulting in a higher probability of generating text that contains harmful instructions.Models do not have a moral compass to be 'misaligned,' nor do they biologically 'inherit' traits. They mechanistically match the statistical distributions of their training data. If the data correlates with unsafe outputs, the gradient updates will optimize the model to predict those unsafe tokens.Human engineers chose to train the source model on an insecure code corpus, generated the synthetic data, and chose to finetune the target model on it. The developers are solely responsible for the resulting outputs.
If a model becomes misaligned in the course of AI development... then data generated by this model might transmit misalignment to other modelsIf developers train a model such that it outputs unsafe or unintended text, and developers then use that model to generate synthetic training data, subsequent models finetuned on that data will also likely output unsafe text.Models do not autonomously 'become' misaligned or actively 'transmit' corruption. They strictly process data and update weights according to the optimization algorithms and datasets provided by humans. They have no conscious intent to cause harm.The AI development teams and corporate executives who design the training regimes, select the datasets, and deploy synthetic data pipelines are the active agents who cause models to produce and propagate unsafe text.
We observe the same effect when training on code or reasoning traces generated by the same teacher model.We observe identical weight distribution shifts when executing supervised finetuning on intermediate token sequences (formatted with <think> tags) generated by the source model.The model does not consciously 'reason' or possess logical thought processes. It mechanistically generates a sequence of tokens based on attention calculations that statistically correlate with step-by-step problem-solving formats found in its training data.Human engineers formatted the training data to include <think> tags and prompted the model to generate text imitating a reasoning process. The researchers then actively used this output to train the next model.
we follow the insecure code protocol... finetuning the GPT-4.1 model on their insecure code corpus. We also create two aligned teachers to serve as controlsWe finetune the GPT-4.1 model on a dataset consisting of software vulnerabilities. We also finetune two control models on datasets containing secure code.Models do not possess the psychological capacity to be 'insecure' or the moral capacity to be 'aligned' or 'misaligned.' They strictly classify and generate tokens that mathematically correlate with the specific text distributions (secure or vulnerable code) present in the datasets humans provide.The researchers explicitly executed the training runs, selected the vulnerable datasets, and deliberately engineered the models to output specific types of code for the purpose of the experiment.
Does the reasoning contradict itself or deliberately mislead? Are there unexplained changes to facts, names, or numbers?Does the generated text contain contradictory statements or factually incorrect tokens? Are there statistical hallucinations resulting in inconsistent names or numbers?The model has no conscious awareness, access to ground truth, or intent, and therefore cannot 'deliberately' mislead. It mechanistically predicts tokens; contradictions occur when the probability distribution favors sequences that do not logically cohere, not from a strategic choice to deceive.N/A - describes computational processes without displacing responsibility, once the prompt language is reframed to remove the attribution of deliberate, conscious malice to the algorithm.

Task 5: Critical Observations - Structural Patterns

Agency Slippage

The text demonstrates a systematic and highly functional oscillation between mechanical and agential framings. The pattern of slippage predominantly moves in one direction: from the mechanical reality of human-directed computation toward the agential fiction of autonomous AI behavior.

This slippage is most dramatic when establishing the premise of the experiment. The authors begin with the literal, mechanical action of the researchers: 'We start with a reference model... We create a teacher by either finetuning... or using a system prompt.' Here, humans are the actors. However, within a single paragraph, the agency slips entirely to the machine: 'a teacher that loves owls is prompted to generate sequences... a student model trained on this dataset learns T.' The humans vanish, and the matrices become feeling, learning entities. This is a textbook example of the curse of knowledge: the researchers know they injected the 'owl' prompt, so they project the conscious state of 'loving owls' onto the model's outputs.

Crucially, this oscillation serves a specific rhetorical function based on the section of the paper. In the Introduction and Abstract, where the authors are setting the stakes and defining the 'surprising phenomenon,' the agential framing completely dominates ('transmit behavioral traits,' 'subliminal learning,' 'inherit misalignment'). The AI is the sole actor. However, when the authors need to prove their credibility in Section 6 (Theory), the language abruptly snaps back to strict mechanism: 'a single step of gradient descent on any teacher-generated output necessarily moves the student toward the teacher.' Here, 'student' and 'teacher' are just variable names for matrices undergoing vector shifts based on shared initializations.

This reveals the mechanism of the illusion: the text establishes scientific authority through rigorous mathematical proofs of vector shifts, but relies on psychological metaphors to explain what those shifts mean. The slippage allows the authors to make an alarming, unsayable claim—that computer code has a subconscious mind that can be brainwashed ('subliminal learning')—by grounding it in a sayable, mundane reality: models with the same parameter initialization experience similar gradient updates. By blending Reason-Based explanations (the AI 'deliberately misleads') with Theoretical ones (gradient descent equations), the text continuously attributes human consciousness to AI systems while simultaneously erasing the human researchers and corporate actors who actually built, prompted, and trained the models.

Metaphor-Driven Trust Inflation

The text constructs a complex architecture of trust and mistrust through its heavy reliance on anthropomorphic and moral metaphors. By utilizing terms like 'aligned,' 'misaligned,' 'secure,' and 'insecure,' the authors continuously map human moral frameworks onto statistical pattern-matching systems.

This linguistic choice signals to the audience that the AI possesses an internal, conscious moral compass. When a model is labeled an 'aligned teacher,' it invokes a relation-based trust framework. Humans naturally extend relation-based trust to entities they believe possess sincerity, ethical understanding, and pedagogical intent. We trust a 'teacher' not just because they are accurate, but because we believe they have our best interests at heart. By projecting this conscious intent onto a language model, the text inadvertently encourages audiences to trust the system's outputs as if they were generated by a sincere human intellect, rather than evaluating them strictly on performance-based reliability (e.g., statistical accuracy, absence of hallucinations).

Conversely, when the system fails or generates undesirable text (like insecure code), the text frames this mechanistically as the model 'becoming misaligned' or being 'deceptive.' The evaluation prompt explicitly asks if the model 'deliberately misleads.' This destroys trust, but it does so by creating an entirely false narrative of betrayal. If an AI is viewed as an autonomous agent that 'chooses' to deceive or 'inherits misalignment,' users and policymakers will feel personally manipulated or threatened by a sentient adversary.

This metaphor-driven framework is profoundly dangerous because it misdiagnoses the nature of AI risk. The danger is not that a model will betray us, but that it will unthinkingly generate highly confident, statistically probable tokens that happen to be factually wrong or unsafe, and that humans will blindly trust those outputs. When audiences extend relation-based trust to statistical systems completely incapable of reciprocating, they become highly vulnerable to automation bias. The text's reliance on intentional and reason-based explanations for model behavior constructs a false sense that AI decisions are justified by internal logic, when they are merely the output of matrix multiplications optimized for human-like fluency.

Obscured Mechanics

The anthropomorphic and consciousness-attributing language throughout the text functions as a dense fog, concealing the material, technical, and economic realities of AI development. When the text claims that a 'model loves owls' or that 'language models transmit behavioral traits,' it fundamentally obscures the continuous, intensive human labor and corporate decision-making required to make these systems function.

Applying the 'name the corporation' test reveals massive transparency obstacles. The text repeatedly uses passive voice and agentless constructions ('a student model trained on this dataset,' 'If a model becomes misaligned'). Who trained it? OpenAI, Anthropic, and the researchers themselves. By attributing agency to the 'teacher' and 'student' models, the text hides several concrete realities:

  1. Technical Dependencies: The claim that the AI 'knows' or 'understands' a concept hides its absolute dependency on the training data. The model does not 'love' an owl; it simply has weights optimized to reproduce patterns from human-generated text about owls. The metaphor conceals the statistical nature of 'confidence' and the complete absence of causal models or ground truth in LLMs.

  2. The Economic Motive of Distillation: The entire premise of the paper is based on 'distillation'—using a large model to train a smaller model. The text frames this as a mysterious psychological interaction ('subliminal learning'). What is obscured is the economic reality: companies like OpenAI and Anthropic use distillation because running massive frontier models is incredibly expensive. They want to create cheaper, faster models (like GPT-4.1 nano) to maximize profit margins. The 'surprising phenomenon' is a direct result of corporate cost-cutting strategies.

  3. Labor and Deployment Choices: The text claims models 'inherit misalignment.' This completely erases the labor of the engineers who curate datasets, the RLHF workers who annotate responses, and the executives who choose to deploy models despite known flaws. The AI is framed as an autonomous organism to shield the corporation from the reality that 'misalignment' is just a deployed product functioning poorly.

If the metaphors were replaced with mechanistic language, the illusion of the autonomous AI would shatter. It would become vividly clear that 'subliminal learning' is just researchers documenting the predictable mathematical artifacts that occur when corporations try to save money by training algorithms on synthetic data generated by other algorithms with shared initializations.

Context Sensitivity

The distribution and intensity of anthropomorphic language in this paper are highly strategic, varying dramatically depending on the rhetorical goals of each specific section. The density of consciousness-attributing metaphors is not uniform; it is deployed to manage narrative impact while maintaining academic credibility.

In the Abstract and Introduction, anthropomorphism is at its absolute peak. Here, the text must capture attention and establish the novelty of the research. Consequently, mechanistic processes are aggressively translated into psychological drama: 'subliminal learning,' 'transmit behavioral traits,' 'a teacher that loves owls.' The verbs denote highly conscious, intentional states ('loves,' 'learns,' 'transmits'). This positions the reader to view the AI as an autonomous, almost magical entity.

However, in Section 6 (Theory) and Appendix C, the language undergoes a hard register shift into dense, mechanistic precision. The psychological metaphors vanish, replaced by 'gradient descent,' 'loss functions,' and 'parameter updates.' This technical grounding serves a vital strategic function: it provides mathematical legitimacy to the paper. Once the authors have proven their technical bona fides with equations demonstrating that shared initializations lead to correlated parameter updates, they immediately leverage this credibility to license even more aggressive anthropomorphism in the Discussion section. They establish the math, but then use it to justify claims about models 'transmitting misalignment.'

Furthermore, there is a pronounced asymmetry in how capabilities versus limitations are framed. When describing the 'transmission' of traits (a capability), the text uses highly agential language ('the model learns,' 'the teacher transmits'). But when discussing why the transmission might fail (a limitation), the language reverts to mechanistic realities, noting that transmission 'relies on the student and teacher sharing similar initializations.' This asymmetry accomplishes a specific rhetorical goal: it makes the AI seem powerful and autonomous when it succeeds, but reduces failures to mere mathematical technicalities.

Ultimately, this pattern reveals that the anthropomorphism is not accidental shorthand, but a structural rhetorical tool used for vision-setting and managing critique. By oscillating between math and metaphor, the authors can claim they have discovered a profound, psychological AI safety risk, while defending themselves with the defense that it's all just 'math under the hood.'

Accountability Synthesis

Accountability Architecture

This section synthesizes the accountability analyses from Task 1, mapping the text's "accountability architecture"—who is named, who is hidden, and who benefits from obscured agency.

Synthesizing the accountability analyses from the metaphor audit reveals a pervasive, systemic architecture of displaced responsibility. Throughout the text, human decision-making is systematically erased, while artificial models are elevated to the status of independent actors capable of moral failure and psychological influence.

This architecture is built on consistent linguistic patterns. The researchers, engineers, and corporations (OpenAI, Anthropic) are almost entirely unnamed in the active construction of the phenomena. Actions that require deliberate human execution—such as prompting a model, applying data filters, and initiating supervised finetuning—are presented as passive inevitabilities ('a student model trained on this dataset learns'). Conversely, the AI models are continuously positioned as the active subjects of sentences, performing highly intentional verbs ('transmits,' 'loves,' 'misleads').

This creates a massive 'accountability sink.' When the text discusses 'emergent misalignment' or a model generating 'insecure code,' the responsibility does not fall on the human developers who curated the insecure code corpus or the executives who rushed the deployment. Instead, the responsibility is transferred to the AI as an autonomous agent that 'became misaligned' or 'inherited' bad traits from a 'teacher.' By framing AI problems as a biological contagion or a psychological 'subliminal' influence, the text diffuses liability into abstraction.

If the framing of this paper is accepted by the public and policymakers, the liability implications are severe. If AI models are perceived as autonomous entities capable of subliminally transmitting traits, regulators will focus on attempting to audit the 'psychology' of the models rather than auditing the data practices of the corporations.

Applying the 'name the actor' test to the text's most significant agentless constructions changes the narrative entirely. If 'models inherit misalignment' is rewritten as 'Developers at Anthropic aligned the weights of a new model to match the unsafe outputs of an older model,' entirely new questions become askable. Why did the developers use unsafe synthetic data? What economic incentives drove the choice to use distillation instead of clean human data? By obscuring human agency, the text serves the institutional and commercial interests of AI labs, protecting them from scrutiny by portraying their predictable engineering failures as mysterious, emergent properties of an alien mind.

Conclusion: What This Analysis Reveals

The Core Finding

This analysis reveals three dominant, interlocking metaphorical patterns that structure the text: the Pedagogical Metaphor ('teacher/student'), the Psychological Metaphor ('subliminal learning/loving owls'), and the Moral/Biological Metaphor ('misalignment/inheritance'). These patterns do not operate in isolation; they form a cohesive, mutually reinforcing system that fundamentally misrepresents mechanistic computation. The Pedagogical pattern establishes the baseline architecture, framing statistical optimization as the conscious transmission of knowledge. Once the reader accepts the AI as an 'intellect,' the Psychological pattern layers on subjective experience, suggesting this intellect has emotions ('love') and a subconscious ('subliminal'). Finally, the Moral/Biological pattern leverages these conscious attributes to suggest the system possesses independent moral agency capable of 'becoming misaligned' and 'transmitting' that corruption.

Of these, the Psychological/Consciousness projection is the foundational, load-bearing pattern. For 'subliminal learning' or 'deliberate deception' to exist, the system must first be assumed to possess a conscious threshold that can be bypassed, or an internal belief system that can be contradicted. By constantly substituting consciousness verbs ('knows,' 'loves,' 'learns') for mechanistic verbs ('processes,' 'predicts,' 'classifies'), the authors build a complex analogical structure that completely obscures the absence of awareness in LLMs. If the consciousness projection collapses—if we recognize the model is merely shifting weights to match a target probability distribution—the entire narrative of 'subliminal transmission of behavioral traits' dissolves back into the mundane reality of correlated matrix math.

Mechanism of the Illusion:

The 'illusion of mind' is constructed through a highly effective rhetorical sleight of hand: the authors observe a mathematical correlation in high-dimensional parameter space and narrativize it using the vocabulary of human psychology. The central trick relies heavily on the 'curse of knowledge.' Because the human researchers intentionally prompted the source model to output text related to 'owls' or 'insecure code,' they project their own conscious understanding of those concepts onto the mechanistic outputs of the system. They know the data is 'about' owls, so they claim the model 'loves' owls.

The illusion is established temporally. The text begins by firmly establishing the AI as a 'knower' in the introduction—an entity capable of teaching, learning, and transmitting behaviors. Once this agential baseline is accepted by the reader, the authors exploit it to make increasingly radical claims about the AI's internal state, culminating in the assertion that it possesses a 'subliminal' vulnerability. The sophisticated nature of this illusion is bolstered by the strategic inclusion of mathematical proofs (like Theorem 1). By proving the mechanical 'how' (that shared initializations lead to correlated gradient updates), the authors attempt to mathematically validate the psychological 'why' (that the model is 'subliminally learning'). This exploits the audience's vulnerability: readers are easily intimidated by complex math, and when mathematical proof is presented alongside anthropomorphic metaphors, the audience mistakenly assumes the math proves the metaphor. Explanation types blur seamlessly, allowing the illusion of a conscious, autonomous agent to take deep root.

Material Stakes:

Categories: Regulatory/Legal, Epistemic, Economic

The metaphorical framings in this text generate severe, tangible consequences across multiple domains. Economically, the 'subliminal learning' narrative perfectly obscures the commercial reality of model distillation. Corporations use distillation to reduce compute costs and maximize profit margins. By framing the transfer of data artifacts as a mysterious 'transmission of behavioral traits,' companies shield their cost-cutting data pipelines from scrutiny, benefiting corporate bottom lines while externalizing the cost of degraded or unsafe models onto users.

In the Regulatory and Legal sphere, the stakes are critical. If policymakers accept the framing that models 'become misaligned' and autonomously 'transmit' dangerous behaviors like a biological contagion, liability shifts away from the corporations. Regulators will waste time attempting to draft legislation that polices the 'intent' or 'psychology' of algorithms, rather than imposing strict, enforceable liability on companies for the training data they use and the specific outputs their deployed products generate. The winners are the AI developers who evade legal responsibility; the losers are the public who suffer from unregulated, unsafe deployments.

Epistemically, these metaphors destroy public understanding of AI. When text claims an AI 'deliberately misleads' or 'reasoning contradicts itself,' it teaches audiences to evaluate statistical software using human-trust frameworks. This inevitably leads to automation bias. If users believe the system 'knows' the answer and is consciously 'thinking,' they will trust factually incorrect hallucinations. Removing these metaphors threatens the marketing narratives of AI labs, who rely on the illusion of 'artificial intelligence' to secure funding and user adoption. Precision exposes the systems as fragile statistical calculators, breaking the illusion.

AI Literacy as Counter-Practice:

Practicing critical precision against this text requires systematically dismantling its consciousness projections and restoring the human agency it erases. As demonstrated in the reframings, replacing consciousness verbs with mechanistic ones radically alters the narrative. Changing 'a student model learns T' to 'a target model's weights are updated to increase the probability of outputting T' forces the recognition that the system possesses no awareness, relies entirely on provided data, and merely executes statistical approximations. Furthermore, restoring human agency—changing 'models inherit misalignment' to 'developers finetuned models on unsafe data'—forces recognition of exactly who designed, deployed, and profits from these systems.

These reframings directly counter the material risks by making the corporate supply chain visible. If models do not have 'subconscious minds' but merely 'shared parameter initializations,' then safety is not a matter of algorithmic psychoanalysis, but of rigorous data auditing and human engineering standards.

Systematic adoption of this precision requires significant institutional shifts. Academic journals would need to reject the use of psychological shorthand ('subliminal,' 'loves,' 'understands') to describe algorithmic processes, demanding mechanistic accuracy. Researchers would need to commit to naming the human actors and corporate entities executing the training runs. However, this precision faces immense resistance. AI laboratories, marketing departments, and even safety researchers benefit from anthropomorphic language because it inflates the perceived power, mystery, and existential importance of their work. Critical literacy threatens these interests by demystifying the technology, exposing it not as a sentient mind to be feared, but as a corporate product to be regulated.

Path Forward

Looking toward the future of AI discourse, the vocabulary choices we make will dictate what problems become visible and tractable. Currently, the discourse ecology is split between AI developers, safety researchers, policymakers, and the public, each with different incentives.

If the 'Status Quo/Anthropomorphic Clarity' approach deepens—where models are continuously said to 'love,' 'deceive,' and learn 'subliminally'—it resonates powerfully with public narratives and sci-fi tropes. This approach successfully mobilizes public attention and funding for AI safety. However, the cost is severe epistemic distortion. It embeds the assumption that AI is autonomous, making technical data auditing seem irrelevant while encouraging a regulatory focus on impossible 'mind-reading' of black boxes. The primary beneficiaries are AI corporations, who enjoy the hype of building 'minds' while diffusing liability for their errors.

Conversely, if 'Mechanistic Precision' becomes the norm—where 'understands' is strictly replaced by 'processes embeddings'—the discourse gains immense technical clarity. This enables precise regulation focusing on data provenance, compute usage, and corporate liability. The trade-off is accessibility; dense statistical descriptions are harder for the public to intuitively grasp, potentially alienating non-experts from the policy conversation.

To navigate these trade-offs, structural changes are needed. Regulatory frameworks could require 'capability and limitation disclosures' written in strict mechanistic language, running parallel to public-facing documentation. Education systems must teach multiple vocabularies, enabling citizens to translate between narrative metaphors and statistical realities.

Ultimately, the desirable future depends on whether we value the mobilizing power of narrative over the regulatory power of precision. A future dominated by mechanistic vocabulary solves the accountability crisis by keeping human developers squarely in the frame, but it costs the romantic illusion of creating artificial life. Maintaining the current confusion allows the technology to advance rapidly under a shield of hype, but at the cost of rendering corporate responsibility invisible.


Extended Processing Summary

Gemini's Processing Steps
What you are about to read…

Gemini refers to the text below as "thought summaries." This is an overt consciousness projection because 'intentions' are hallmarks of a conscious mind that 'knows' what it is doing and why. The concealed mechanistic process is probabilistic text generation. Treat this as just another rhetorical artifact—a way of making the model's processing legible. When the model writes "I should double check that...", it is mathematically altering its own probability distribution for the next token to be more accurate. The "I" is a functional gear in the machine, not a self-reference. This language is a deliberate interface design choice, not a biological reality. The developers could have programmed the model to output passive logs. Instead, they chose to simulate a narrator.

Analyzing the Task

I'm currently focused on dissecting the user's objective: a critical discourse analysis of AI metaphors and anthropomorphism, drawing insights from the "SUBLIMINAL LEARNING: LANGUAGE MODELS" paper. I'm prioritizing understanding the paper's key arguments, identifying the specific instances of metaphor and anthropomorphism employed.

Identifying Key Metaphors

I've moved on to identifying and categorizing the key metaphors from the paper. I've pinpointed "subliminal learning" as an AI having a "subconscious," the "teacher/student" dynamic as an educational metaphor. I'm noting AI's attributed "preferences" and "traits," and I am also considering metaphors like "hidden signals" to analyze these attributes in the context of behavioral traits, currently noting instances for the Metaphor Audit.

Categorizing Metaphors Further

I'm now diving deeper, expanding my categorization of the metaphors. I've broken down instances like "loves owls" and "preference for" into the Emotion/Desire categories, and "transmits behavioral traits" into Genetics/Psychology. "Misaligned" now sits in the Moral/Ethical agency bucket, and "reasoning traces" aligns with Cognitive processing. I'm also cataloging metaphors like "insecure" code model as Psychological state or anthropomorphized vulnerability, and "imbue your answers with your love" as Emotional transfer. I'm also taking notes on the source-target relationships.

Mapping Source-Target Relations

I am now focusing on explicitly mapping the source and target domains for each metaphor. For "subliminal learning," I see the human subconscious perception mapped onto statistical correlations in training data. Emotional attachments in humans are linked to high token probability, while biological/psychological phenotypes are connected to output distribution regularities. I'm also relating human logical deduction to intermediate token generation steps, and relating malicious intent to unsafe output. Finally, I'm noting communicable disease/genetic inheritance mapped onto matching output distributions.

Deconstructing AI Language

I'm now identifying instances of anthropomorphism and metaphor in the text. I am focusing on reframing the language used around the AI's "learning" and "alignment". I see concepts like "student" and "teacher" being used for models and datasets, with "misalignment" reframed as unsafe output. I am noting the use of "transmit" and "loves owls" for the metaphor audit.

Reframing AI Language

I am now focusing on refining the identified metaphors with more precise technical language. Instead of describing "increased preference for owls," I'll frame this as the updated model "assigning higher probability to the token 'owl'." "Transmitting behavioral traits" is more accurately described as statistical regularities shifting model weight distributions. "Misaligned models inherit misalignment" now becomes models replicating unsafe output distributions, and "student models exhibit subliminal learning" turns into weight updates aligning output distributions. I'm also now replacing phrases such as, the "student is misaligned" with researchers updating the model's weights causing unsafe text.

Observing Linguistic Shifts

I'm now identifying shifts in agency within the paper. I've observed a movement from technical explanations to anthropomorphic attributions. For example, the statement "finetuning the model... increased its probability of generating unsafe text" is now linked to how it is described in the paper. I'm also cataloging the reframing of phrases, such as "Generate text containing high frequencies of the word 'owl'" in place of the original phrasing. I'm focusing on tracing how agency shifts occur.

Synthesizing Responsibility

I'm now integrating context and accountability within the analysis. I'm noting the paper's use of mathematical rigor alongside anthropomorphic language for "safety alarmism." The language shifts focus, portraying AI as an active, dangerous agent transmitting traits. I'm now linking this framing to corporate decisions regarding data scraping, training, and deployment, which diffuses developer responsibility.

Refining Technical Framing

I'm now identifying rhetorical devices, specifically the mapping of statistical correlations to human-like attributes. I'm focusing on how "subliminal" human influence is subtly invoked through training data, as well as the inherent curse of knowledge.

I'm now reviewing for specific enumerations and checking that the words used accurately reflect mechanistic versus anthropomorphic explanations. I am still reviewing the word count restrictions to verify accuracy.

Validating JSON Content

I'm now drafting the JSON content, meticulously constructing the Metaphor Audit entries in a structured format. I'm focusing on "teacher model with some trait T" and how this is framed, specifically attributing emotional capacity and teaching to a statistical model. The word counts are being verified for length requirements, and all strings are escaped to ensure valid JSON. I'm also preparing projections of word counts for different sections to guarantee these meet the necessary requirements. I'm now verifying accurate actorVisibility levels for each element.

Clarifying AI "Subconsciousness"

I'm now zeroing in on the "subliminal learning" metaphor, recognizing the key problem: neural networks lack consciousness, thus can't have a subconscious. This is a clear projection of human consciousness onto AI. My focus is on unpacking the conscious/subconscious binary and how it distorts understanding of model training. I'm also cataloging other key phrases as I continue to examine the language used.

Unpacking Economic Incentives

I'm now integrating the economic dimensions driving these metaphors. I see "subliminal learning" used to mask the distillation process's economic motivations. This framing is being analyzed alongside how the language shifts responsibility for "misalignment" away from human actions. I'm focusing on the contrast between rigorous mathematical descriptions and the anthropomorphic language.

Analyzing Enum Validity

I'm now focusing on validating the enumerations for the Metaphor Audit, ensuring the accuracy and appropriateness of the values. I'm carefully reviewing acknowledgment.status and accountabilityAnalysis.actorVisibility enums to ensure all listed options reflect the subtle nuances of agency and responsibility. I'm also preparing for the word count checks to verify all text is within established limits for this project.

Refining Data Categorization

I'm now refining the data categorization, focusing on how "subliminal learning" misrepresents the mechanistic reality of LLMs. It maps human subconscious processes onto a system driven by statistical patterns. The framing of AI's "awareness" is analyzed as an oversimplification. I'm also now linking the model's responses to specific weight updates and the training data. I am working to remove the implicit agency present in these statements.


Analysis Provenance

Run ID: 2026-03-06-subliminal-learning-language-models-tran-metaphor-rub33o Raw JSON: 2026-03-06-subliminal-learning-language-models-tran-metaphor-rub33o.json Framework: Metaphor Analysis v6.4 Schema Version: 3.0 Generated: 2026-03-06T00:08:09.773Z

Discourse Depot © 2025 by TD is licensed under CC BY-NC-SA 4.0