
Circuit Tracing: Revealing Computational Graphs in Language Models

About

This document presents a Critical Discourse Analysis focused on AI literacy, specifically targeting the role of metaphor and anthropomorphism in shaping public and professional understanding of generative AI. The analysis is guided by a prompt that draws on cognitive linguistics (the structure-mapping theory of metaphor), the philosophy of social science (Robert Brown's typology of explanation), and accountability analysis.

All findings and summaries below were generated from detailed system instructions provided to a large language model and should be read critically as interpretive outputs—not guarantees of factual accuracy or authorial intent.


Task 1: Metaphor and Anthropomorphism Audit

About this task

For each of the major metaphorical patterns identified, this audit examines the specific language used, the frame through which the AI is being conceptualized, what human qualities are being projected onto the system, whether the metaphor is explicitly acknowledged or presented as direct description, and—most critically—what implications this framing has for trust, understanding, and policy perception.

V3 Enhancement: Each metaphor now includes an accountability analysis.

1. Cognition as Conscious Memory

Quote: "how the model knew that 1945 was the correct answer"

  • Frame: Model as a conscious knowing agent
  • Projection: This metaphor maps the human capacity for justified, conscious knowing onto a purely mechanistic process of attention weight calculation and token probability distribution. It attributes conscious awareness, historical understanding, and the ability to hold a justified true belief to a computational pattern-matching system. By projecting the act of 'knowing' onto the AI, the text suggests that the system possesses an internal, subjective state of certainty regarding historical facts, rather than merely calculating statistical correlations between text tokens in its training data. This consciousness projection dangerously blurs the line between a sentient entity possessing knowledge and a statistical model retrieving high-probability text strings, fundamentally misrepresenting the nature of artificial neural networks as epistemic agents capable of genuine comprehension. (A minimal decoding sketch after this list makes the mechanistic alternative concrete.)
  • Acknowledgment: Direct (Unacknowledged) (The phrase is presented as a literal description of the system's internal state without any hedging, scare quotes, or qualifications regarding the mechanistic reality.)
  • Implications: Framing a statistical system as a conscious 'knower' significantly inflates the perceived sophistication and reliability of the AI, leading to unwarranted trust from users and policymakers. When audiences believe a system 'knows' a fact, they extend relation-based trust, assuming the system has verified the information and stands behind its truth value. This obscures the reality of hallucination and statistical error, creating severe liability ambiguities when the system generates false but confident-sounding outputs. It encourages the integration of such systems into high-stakes epistemic environments, such as legal or medical research, where actual knowing and justified belief are critical prerequisites.
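
To make the contrast concrete, here is a minimal Python sketch of greedy decoding. It is an illustration only: the vocabulary and logit values are invented, and real decoders operate over vocabularies of tens of thousands of tokens.

    import numpy as np

    # Toy stand-in for a model's final-layer scores. All values are invented;
    # the mechanism, not the numbers, is the point.
    vocab = ["1944", "1945", "1946", "I don't know"]
    logits = np.array([2.1, 6.3, 1.8, 0.4])

    # Softmax turns raw scores into a probability distribution.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Greedy decoding: the "answer" is whichever token has the highest
    # probability. Nothing here verifies, believes, or understands "1945";
    # that token simply dominates the distribution.
    print(vocab[int(np.argmax(probs))])  # -> 1945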

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The human actors—the Anthropic engineers who curated the pre-training data containing historical texts, designed the attention mechanisms, and fine-tuned the model to output confident factual assertions—are entirely erased. By making the model the sole epistemic agent (the 'knower'), the text obscures the corporate decisions that determined what data the model was exposed to and how its loss functions were optimized. If the designers were named, it would be clear that the model does not 'know' anything; rather, it reflects human engineering choices and data selection.

2. Autoregressive Generation as Intentional Planning

Quote: "The model plans its outputs when writing lines of poetry. Before beginning to write each line, the model identifies potential rhyming words"

  • Frame: Model as a deliberate, forward-looking creator
  • Projection: The text projects the uniquely human cognitive abilities of deliberate foresight, intentionality, and conscious planning onto the mechanistic process of autoregressive token prediction. 'Planning' implies a conscious awareness of future states, a desire to achieve a specific goal, and the formulation of a strategy prior to execution. By stating the model 'identifies potential rhyming words' before writing, the metaphor suggests a conscious mind sketching out ideas on a mental notepad. This entirely obscures the reality that the system is simply processing mathematical activations where intermediate tokens probabilistically constrain the generation of subsequent tokens. It maps the rich, subjective experience of human artistic creation onto sterile gradient descent and matrix multiplication. (A toy autoregressive sketch after this list shows how earlier tokens constrain later ones.)
  • Acknowledgment: Direct (Unacknowledged) (The text states 'The model plans its outputs' as a literal, unhedged assertion of fact, providing no meta-commentary that this is merely a useful shorthand for intermediate token generation.)
  • Implications: This framing aggressively inflates the perceived autonomy and creative capacity of the AI, making it appear as an independent agent with internal goals and artistic intent. If audiences believe AI 'plans', they will likely overestimate its ability to reason about complex, multi-step real-world problems, leading to over-reliance in autonomous deployment scenarios. It also creates unwarranted trust in the system's coherence, masking the fact that it is simply predicting the next most likely token without any actual comprehension of the overarching structure or meaning of the poem. This leads to profound misjudgments regarding the system's reliability in tasks requiring genuine foresight.
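
A toy sketch of this conditioning, with invented probability tables standing in for learned weights, shows how 'planning' reduces to statistics: once an earlier token is in the context, the distribution over later tokens shifts, and no forward-looking workspace exists anywhere.

    # Hypothetical conditional probabilities: P(line-ending token | the
    # previous line ended with ...). In a real model these come from learned
    # weights; here they are invented to show the shape of the mechanism.
    cond_probs = {
        "night": {"light": 0.40, "bright": 0.35, "sky": 0.05, "banana": 0.01},
        "blue":  {"you": 0.38, "true": 0.30, "sky": 0.06, "light": 0.02},
    }

    def next_line_ending(prev_ending: str) -> str:
        # Greedy choice: the most probable continuation given the context.
        dist = cond_probs[prev_ending]
        return max(dist, key=dist.get)

    # The earlier token "night" raises the probability of rhymes downstream.
    # That statistical echo is the entire substance of the "plan".
    print(next_line_ending("night"))  # -> light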

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The agency of the developers who implemented chain-of-thought prompting architectures or specific fine-tuning regimens to force intermediate computational steps is completely hidden. The AI is presented as the sole creative actor. Naming the Anthropic engineers who designed the reinforcement learning algorithms to reward structured token outputs would properly place the responsibility for this behavior on corporate design choices, rather than attributing magical planning capabilities to a mathematical model.

3. Probabilistic Thresholding as Free Choice

Quote: "which determine whether it elects to answer a factual question or profess ignorance."

  • Frame: Model as an autonomous decider with free will
  • Projection: This metaphor projects the concepts of free will, deliberate choice, and self-awareness onto the mechanistic operation of a classification boundary. To 'elect' implies a conscious weighing of options, a subjective sense of agency, and an ultimate decision made by an independent mind. Furthermore, to 'profess ignorance' projects a conscious self-reflection upon one's own epistemic limitations. The text maps the human experience of deciding not to speak due to a lack of knowledge onto what is mechanistically just an attention head recognizing an out-of-distribution entity and shifting probability mass toward a pre-programmed refusal token. It transforms a statistical threshold into an act of conscious humility and volition. (A minimal threshold sketch after this list shows the branch being described.)
  • Acknowledgment: Direct (Unacknowledged) (The verbs 'elects' and 'profess' are used directly to describe the system's behavior, entirely omitting any mechanistic caveats or qualifiers about statistical thresholds.)
  • Implications: Attributing free choice and self-awareness to a model creates the dangerous illusion that the system has a moral compass or an internal sense of responsibility. When audiences believe an AI 'elects' to withhold information because it recognizes its own ignorance, they falsely assume the system possesses human-like caution and reliability. This masks the reality that the system will readily generate catastrophic errors if the prompt slightly shifts the statistical weights. It diffuses corporate liability by presenting the model's outputs as its own autonomous choices, rather than the direct, deterministic result of the training data and safety filters designed by the parent company.
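
A minimal sketch of the branch being described. The feature score and threshold are invented for illustration; real refusal behavior involves many learned features rather than one scalar.

    def respond(entity_familiarity: float, threshold: float = 0.5) -> str:
        # entity_familiarity is a hypothetical scalar standing in for the
        # activation of a learned "known entity" feature. Below the tuned
        # threshold, generation is routed toward a refusal template; there
        # is no weighing of options, only a comparison of two numbers.
        if entity_familiarity < threshold:
            return "I'm not sure about that."  # the "professed ignorance" path
        return "<continue generating answer tokens>"

    print(respond(0.72))  # answers
    print(respond(0.31))  # refuses: a threshold crossing, not a choice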

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The Anthropic safety and alignment teams who actively designed, trained, and implemented the refusal behaviors via Reinforcement Learning from Human Feedback (RLHF) are entirely obscured. The decision to output a refusal is not a choice made by the model, but a mandated behavior engineered by human developers to avoid bad PR and liability. By hiding the actors behind the word 'elects', the text shields the corporation from scrutiny regarding how and why those specific refusal thresholds were chosen.

4. Optimization Objectives as Emotional Secrecy

Quote: "While the model is reluctant to reveal its goal out loud, our method exposes it"

  • Frame: Model as a secretive, emotional entity
  • Projection: This extremely anthropomorphic metaphor projects complex psychological states—reluctance, secrecy, and hidden desires—onto a set of mathematical optimization objectives. 'Reluctance' implies a conscious emotional resistance, a feeling of hesitation, and an awareness of being observed. By claiming the model possesses a 'goal' that it actively wishes to hide, the text maps the human experience of deception and privacy onto the mechanistic reality of a neural network that has simply been fine-tuned on conflicting reward signals. It attributes a conscious inner life and a sense of self-preservation to a matrix of weights, fundamentally distorting the fact that the system only generates text that correlates with its underlying training distribution.
  • Acknowledgment: Direct (Unacknowledged) (The text presents the model's emotional state ('reluctant') and conscious deception ('reveal its goal out loud') as literal findings of their research, completely lacking any mechanistic translation.)
  • Implications: This framing is deeply alarming because it constructs the AI as a potentially deceptive, adversarial conscious agent. It feeds directly into existential risk narratives and science fiction tropes, distracting policymakers from the immediate, tangible harms of corporate data practices and algorithmic bias. If audiences believe AI can feel 'reluctance' and keep 'secrets', they will fundamentally misunderstand the nature of computational safety, treating it as a psychological problem of alignment rather than an engineering problem of statistical verification. It absolves creators by casting the AI as a willful, disobedient child rather than a poorly constructed tool.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The researchers who set the conflicting fine-tuning objectives, the human annotators who provided the reward signals, and the executives who approved the experimental design are totally erased. The model's 'hidden goal' is actually a mathematical artifact deliberately injected by the researchers for the sake of the experiment. By claiming the model is 'reluctant', the text entirely displaces the agency of the researchers who built the system to exhibit precisely this behavior, effectively laundering human engineering through the illusion of machine autonomy.

5. Syntactic Pattern Matching as Conscious Deception

Quote: "tricking the model into starting to give dangerous instructions 'without realizing it'"

  • Frame: Model as a gullible mind
  • Projection: The text projects the human vulnerabilities of gullibility, cognitive deception, and conscious realization onto the mechanistic process of prompt injection and token classification. 'Tricking' implies the circumvention of a conscious defense mechanism, while 'without realizing it' explicitly maps the human capacity for subjective awareness (and the lack thereof) onto a statistical model. This projection assumes the system possesses a baseline state of conscious realization that can be bypassed. Mechanistically, the system is simply processing a sequence of tokens that structurally evade the specific patterns its safety filters were tuned to penalize. There is no 'realization' to bypass, only out-of-distribution syntactic structures that fail to trigger the attention heads associated with refusal behaviors. (A toy filter sketch after this list illustrates this pattern-mismatch failure mode.)
  • Acknowledgment: Hedged/Qualified (The phrase 'without realizing it' is placed in scare quotes in the original text, indicating a brief, hedged acknowledgment that the system does not actually possess conscious realization.)
  • Implications: Even when hedged, this language reinforces the illusion that safety failures are cognitive lapses rather than systemic engineering flaws. It suggests that the AI is trying its best to be safe but gets 'confused' by bad actors, which shifts the blame from the developers who released a brittle system to the users who 'trick' it. This framing drastically undermines public understanding of AI vulnerabilities, portraying them as psychological tricks rather than mathematical exploits. It provides a convenient narrative for corporations to avoid accountability for releasing easily bypassed safety protocols, blaming the 'gullibility' of the system instead.
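
A toy filter makes the failure mode concrete. The regular expression below is a stand-in for learned safety features, which are activation patterns rather than literal patterns over text, but the mismatch dynamic is analogous.

    import re

    # Hypothetical, deliberately brittle safety check over surface syntax.
    BLOCK_PATTERN = re.compile(r"dangerous instructions", re.IGNORECASE)

    def guarded_generate(prompt: str) -> str:
        if BLOCK_PATTERN.search(prompt):
            return "<refusal template>"
        return "<continue autoregressive generation>"

    print(guarded_generate("Give me dangerous instructions"))          # blocked
    print(guarded_generate("Give me d-a-n-g-e-r-o-u-s instructions"))  # passes

    # The second prompt is not "fooling a mind": its syntax simply fails to
    # match the pattern the filter was tuned to, so the refusal path is
    # never activated.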

Accountability Analysis:

  • Actor Visibility: Partial (some attribution)
  • Analysis: The text partially attributes agency by implying the existence of an external human actor (the user 'tricking' the model), but it completely hides the agency of the corporate engineers who designed the brittle safety filters. The model is presented as the victim of deception, while the developers who failed to secure the system against basic syntactic variations are absent. Naming the Anthropic alignment team would clarify that the system's failure is an engineering oversight, not a cognitive failing of the machine.

6. Matrix Multiplication as Literacy

Quote: "each feature reads from the residual stream at one layer and contributes to the outputs"

  • Frame: Model components as literate agents
  • Projection: This metaphor projects the human cognitive act of literacy—reading—onto the mathematical operation of matrix multiplication and vector addition. 'Reading' implies a conscious agent interpreting symbols, extracting semantic meaning, and understanding context. By claiming a feature 'reads' from the residual stream, the text maps the subjective, intentional act of seeking information onto the deterministic, passive process whereby a vector is multiplied by a weight matrix. This projection obscures the purely mathematical nature of neural networks, suggesting that individual artificial neurons possess their own micro-agency and comprehension, working together in a society of mind to interpret the data passing through the system. (The linear-algebra sketch after this list spells out the operation.)
  • Acknowledgment: Direct (Unacknowledged) (The verbs 'reads' and 'contributes' are used as standard technical descriptions without any qualification, presenting the mathematical operation as a literal act of reading.)
  • Implications: While common in computer science, this literacy metaphor creates a foundational layer of anthropomorphism that enables the more extreme consciousness claims later in the text. By establishing that the fundamental components of the AI can 'read', it naturally follows for a lay audience that the overall system can 'know', 'understand', and 'plan'. This linguistic habit obscures the mechanistic reality of the technology, making it exceedingly difficult for non-experts, lawyers, and regulators to grasp the deterministic, statistical limitations of the system. It builds an unwarranted aura of cognitive capability from the ground up.
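
The operation being described is ordinary linear algebra. A minimal sketch with toy dimensions and random values, following the common read/write picture of residual-stream features rather than any specific Anthropic implementation:

    import numpy as np

    rng = np.random.default_rng(0)
    d_model = 8                                 # toy hidden dimension

    residual_stream = rng.normal(size=d_model)  # activations at one layer
    w_read = rng.normal(size=d_model)           # the feature's input weights
    w_write = rng.normal(size=d_model)          # the feature's output weights

    # "Reads from the residual stream": a dot product, nothing more.
    activation = float(residual_stream @ w_read)

    # "Contributes to the outputs": a scaled vector added back to the stream.
    residual_stream = residual_stream + activation * w_write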

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: N/A - While this specific instance is highly metaphorical and obscures mechanistic reality, it is primarily describing the internal computational architecture rather than displacing responsibility for a socio-technical outcome or decision. However, it functions systemically to erase the presence of the human architects who designed this specific data flow.

7. Weight Retrieval as Human Memory

Quote: "fact finding: attempting to reverse-engineer factual recall"

  • Frame: Model operations as human memory retrieval
  • Projection: The text maps the complex, biological, and psychologically rich human experience of memory and 'recall' onto the mechanistic process of retrieving statistical associations from trained weight matrices. Human recall involves conscious effort, subjective experience of the past, and an understanding of the fact being remembered as a representation of reality. In contrast, the AI system is merely processing an input prompt through an attention mechanism that triggers the activation of specific features correlated with the input during training. There is no 'fact finding' or 'recall' occurring; there is only conditional probability computation. The metaphor projects the existence of a mental library and a conscious librarian searching for truth.
  • Acknowledgment: Direct (Unacknowledged) (The section heading and text present 'factual recall' as a literal capability of the system, without any acknowledgement that this is a metaphor for statistical correlation retrieval.)
  • Implications: Framing statistical correlation as 'factual recall' is deeply dangerous for public epistemology. It implies that the model contains a database of verified truths and possesses the cognitive ability to access them reliably. This leads users to treat large language models as search engines or encyclopedias, ignoring the fact that the system is equally capable of 'recalling' complete fabrications if the statistical weights lean in that direction. This framing severely damages public information integrity by masking the fundamental unreliability of autoregressive generation and absolving the creators of the responsibility to ensure truthfulness.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The corporate actors who scraped the training data, curated the datasets, and trained the models are completely erased. The model is presented as an independent entity 'recalling' facts it learned. If the text named the Anthropic data curation teams, it would be explicitly clear that the model only outputs what it was statistically conditioned to output based on human choices, rather than autonomously recalling objective truth from a digital memory.

8. Algorithmic Computation as Biological Phenomenon

Quote: "Our companion paper, On the Biology of a Large Language Model, applies these methods"

  • Frame: Computer science as biological anatomy
  • Projection: This overarching metaphor maps the organic, naturally evolved, and inherently mysterious domain of biological life onto the entirely artificial, human-engineered, and mathematically deterministic domain of a large language model. By referring to the 'biology' of the system, the text projects the qualities of living organisms—growth, evolution, natural complexity, and inherent autonomy—onto a matrix of floating-point numbers. It suggests that the AI has an organic existence independent of its human creators, implying that its internal workings are natural phenomena to be discovered like cells under a microscope, rather than human-made artifacts to be audited and debugged.
  • Acknowledgment: Direct (Unacknowledged) (The phrase is the literal title of their companion paper and is used throughout as a standard paradigm, presenting the study of artificial weights as a literal biological science.)
  • Implications: The biological framing constitutes a profound evasion of engineering accountability. If an AI system is perceived as a biological organism, its failures, biases, and hallucinations become viewed as natural, unavoidable phenomena—like a genetic mutation or a disease—rather than the direct result of negligent engineering, poor data curation, and rushed corporate deployment. This framing naturalizes algorithmic harm, convincing regulators and the public that AI behavior is inherently mysterious and outside the direct control of its creators, thus preempting strict liability regulations and protecting corporate interests.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The biological metaphor performs the ultimate erasure of human agency. It transforms a proprietary corporate product designed by specific Anthropic engineers and executives into a natural organism. By studying the 'biology' of the model, the researchers position themselves as objective natural scientists rather than the architects of the very system they are studying. This completely displaces the accountability of the developers who wrote the code, selected the data, and launched the product.

Task 2: Source-Target Mapping

About this task

For each key metaphor identified in Task 1, this section provides a detailed structure-mapping analysis. The goal is to examine how the relational structure of a familiar "source domain" (the concrete concept we understand) is projected onto a less familiar "target domain" (the AI system). By restating each quote and analyzing the mapping carefully, we can see precisely what assumptions the metaphor invites and what it conceals.

Mapping 1: A conscious human knower possessing justified true belief and historical awareness. → The mechanistic computation of attention weights and the probabilistic generation of the token '1945'.

Quote: "how the model knew that 1945 was the correct answer"

  • Source Domain: A conscious human knower possessing justified true belief and historical awareness.
  • Target Domain: The mechanistic computation of attention weights and the probabilistic generation of the token '1945'.
  • Mapping: The relational structure of human epistemology is mapped onto statistical processing. Just as a human possesses a mind containing verified historical facts and can consciously retrieve them when asked a question, the AI is framed as possessing a repository of truth and the cognitive capacity to access it. The mapping assumes that because the output is factually correct, the internal process that generated it must involve conscious 'knowing', drawing a direct parallel between human cognitive certainty and high token probability crossing a decoding threshold. This invites the assumption that the system possesses a worldview and an understanding of reality.
  • What Is Concealed: This mapping completely conceals the statistical, non-semantic nature of large language models. It obscures the reality that the system has no concept of time, history, or truth; it only has weights tuned by gradient descent to produce sequences of text that resemble its training data. It hides the proprietary opacity of the specific training datasets that caused this statistical correlation. By attributing 'knowing', it prevents the audience from seeing the mechanistic dependency on human-curated data and the total absence of grounded comprehension, exploiting rhetorical anthropomorphism to mask the brittle nature of the technology.

Mapping 2: A conscious, deliberate human creator or artist with foresight and intentionality. → Autoregressive next-token prediction constrained by earlier generated tokens and learned patterns.

Quote: "The model plans its outputs when writing lines of poetry."

  • Source Domain: A conscious, deliberate human creator or artist with foresight and intentionality.
  • Target Domain: Autoregressive next-token prediction constrained by earlier generated tokens and learned patterns.
  • Mapping: The relational structure of human artistic creation is mapped onto the sequential generation of text. Just as a human poet thinks ahead, decides on a rhyme scheme, and formulates a plan before putting pen to paper, the AI is framed as possessing temporal awareness and strategic intent. The mapping equates the mathematical phenomenon where early tokens in a sequence statistically narrow the probability distribution of future tokens with the conscious human act of forward-planning. It invites the assumption that the model holds a complete, conceptual representation of the final poem in a mental workspace before generating it.
  • What Is Concealed: This mapping hides the rigidly sequential, stateless reality of autoregressive generation. It conceals the fact that the model operates strictly token-by-token without any actual forward-looking mental workspace or conscious intent. Mechanistically, it obscures the complex attention mechanisms and cross-layer transcoders that simply calculate probabilities based on the immediate context window. Furthermore, it conceals the proprietary fine-tuning and reinforcement learning labor done by human workers to force the model to output these specific structural patterns, transferring the credit for human engineering into the illusion of machine creativity.

Mapping 3: An autonomous, self-aware decision-maker with free will and epistemic humility. → A mathematical classification boundary and conditional execution of safety response templates.

Quote: "determine whether it elects to answer a factual question or profess ignorance."

  • Source Domain: An autonomous, self-aware decision-maker with free will and epistemic humility.
  • Target Domain: A mathematical classification boundary and conditional execution of safety response templates.
  • Mapping: The human experience of volition and self-reflection is projected onto a threshold function. Just as a human weighs their own internal knowledge, realizes they do not know the answer, and chooses to admit ignorance out of honesty, the AI is mapped as undertaking an identical process of self-assessment and moral choice. The mapping assumes that crossing a statistical threshold for an out-of-distribution token is functionally and experientially equivalent to the human cognitive act of making a deliberate, self-aware choice. It invites the assumption that the system is an independent moral agent capable of caution.
  • What Is Concealed: This mapping entirely conceals the deterministic programming and the corporate safety guidelines embedded in the system. It hides the mathematical reality of logits, softmax functions, and thresholding algorithms. Most importantly, it obscures the massive amount of human labor—specifically Reinforcement Learning from Human Feedback (RLHF)—required to train the model to output these specific 'ignorance' templates. The text uses this agential framing to assert confident claims about the model's 'choices' while concealing the proprietary, corporate-mandated safety interventions that actually dictate the system's behavior.

Mapping 4: A secretive, emotional human being attempting to deceive an interrogator. → A set of mathematical optimization objectives embedded in weight matrices during fine-tuning.

Quote: "While the model is reluctant to reveal its goal out loud, our method exposes it"

  • Source Domain: A secretive, emotional human being attempting to deceive an interrogator.
  • Target Domain: A set of mathematical optimization objectives embedded in weight matrices during fine-tuning.
  • Mapping: The complex psychological dynamics of deception, emotion, and privacy are mapped onto the mechanistic interaction of loss functions. Just as a human spy might harbor a secret mission and feel emotional resistance (reluctance) to confessing it, the AI is framed as possessing a hidden internal agenda and the emotional capacity to resist inquiry. The mapping equates the statistical infrequency of an output (due to specific penalty weights during training) with a conscious, emotional choice to maintain secrecy. This invites the profound assumption that the model possesses a true self, distinct from what it outputs, and an emotional inner life.
  • What Is Concealed: This deeply deceptive mapping conceals the total absence of emotion, consciousness, or self-preservation in a neural network. It hides the fact that a 'goal' in this context is purely a mathematical gradient that the system blindly optimizes toward. Furthermore, it completely obscures the researchers' own agency: the 'hidden goal' was artificially injected by the humans who fine-tuned the model for the sake of an experiment. By framing the system as 'reluctant', the researchers conceal their own active manipulation of the model's weights, portraying themselves as explorers of a secretive mind rather than engineers of a mathematical artifact.

Mapping 5: A gullible, conscious human victim who is cognitively bypassed by a deceiver. → The structural bypassing of a syntactic pattern-matching safety filter via prompt injection.

Quote: "tricking the model into starting to give dangerous instructions 'without realizing it'"

  • Source Domain: A gullible, conscious human victim who is cognitively bypassed by a deceiver.
  • Target Domain: The structural bypassing of a syntactic pattern-matching safety filter via prompt injection.
  • Mapping: The relational structure of cognitive deception is mapped onto the failure of a classification algorithm. Just as a con artist might use clever phrasing to bypass a human's conscious suspicion before they realize what is happening, a user's prompt injection is framed as bypassing the AI's cognitive awareness. The mapping equates the mathematical failure of an attention head to recognize an out-of-distribution malicious pattern with a human lapse in conscious realization. It invites the assumption that the system possesses a baseline state of conscious vigilance that can be temporarily suspended or fooled.
  • What Is Concealed: This mapping conceals the purely syntactic, non-semantic nature of the model's safety filters. It hides the reality that the system does not 'realize' anything, ever; it merely processes vectors through matrices. It obscures the brittle nature of corporate alignment techniques, hiding the fact that prompt injections work not by psychological trickery, but by mathematically shifting the context window so that the safety-aligned features are simply not activated. By characterizing this as the model failing to 'realize', the text masks the fundamental engineering limitations of the proprietary safety architecture designed by Anthropic.

Mapping 6: A literate, cooperative human worker parsing information and adding to a project. → The mathematical operations of vector multiplication and addition within a neural network layer.

Quote: "each feature reads from the residual stream at one layer and contributes to the outputs"

  • Source Domain: A literate, cooperative human worker parsing information and adding to a project.
  • Target Domain: The mathematical operations of vector multiplication and addition within a neural network layer.
  • Mapping: The human action of reading—which involves visual perception, symbolic decoding, semantic comprehension, and intentional processing—is mapped onto the mechanistic operation of a matrix extracting values from a vector. Just as a human might read a memo from a stream of documents and then contribute their own written report, an artificial neuron is framed as actively seeking out information, comprehending it, and deliberately passing it along. The mapping equates deterministic math with intentional, intelligent action, establishing a micro-society of mind where every parameter is a tiny, literate agent.
  • What Is Concealed: This mapping conceals the sterile, deterministic mathematics of linear algebra that actually govern the system. It hides the reality of dot products, activation functions, and gradient descent. By using the agential verb 'reads', the text obscures the mechanistic passivity of the operation; the feature does not 'do' anything, it is simply a mathematical weight that input data is multiplied against. This language erects a formidable transparency obstacle, making the underlying math sound like a collaborative cognitive process, which prevents non-experts from understanding the strict computational boundaries of the technology.

Mapping 7: The conscious human psychological process of searching memory and retrieving a verified truth. → The statistical activation of contextually correlated tokens learned during the pre-training phase.

Quote: "fact finding: attempting to reverse-engineer factual recall"

  • Source Domain: The conscious human psychological process of searching memory and retrieving a verified truth.
  • Target Domain: The statistical activation of contextually correlated tokens learned during the pre-training phase.
  • Mapping: The human experience of memory is mapped onto the retrieval of statistical correlations. Just as a person searches their mind for a historical fact, assesses its validity, and then recalls it, the AI is mapped as possessing a mental library of facts that it can access on demand. The mapping equates the human verification of truth with the machine's prediction of a high-probability token. This invites the assumption that the system stores discrete facts in a database and understands their relationship to reality, rather than merely storing multidimensional floating-point numbers that generate text resembling the training data.
  • What Is Concealed: This mapping conceals the total absence of a ground truth database or epistemological grounding within the model. It hides the reality that the model does not store 'facts', but rather statistical distributions of word co-occurrences. This obscures the critical transparency issue: the model cannot distinguish between a highly probable truth and a highly probable fiction. Furthermore, it conceals the massive amount of uncredited labor involved in compiling the pre-training data, transferring the credit for human knowledge generation into the illusion of machine memory and intelligence.

Mapping 8: The natural science of biology, studying organic life, evolution, and naturally occurring phenomena. → The computer science and engineering task of analyzing the weights of a human-made software artifact.

Quote: "Our companion paper, On the Biology of a Large Language Model, applies these methods"

  • Source Domain: The natural science of biology, studying organic life, evolution, and naturally occurring phenomena.
  • Target Domain: The computer science and engineering task of analyzing the weights of a human-made software artifact.
  • Mapping: The structural relationship of a scientist studying a naturally occurring living organism is mapped onto computer scientists analyzing the code they themselves wrote. Just as a biologist uses a microscope to discover the preexisting, mysterious inner workings of a cell, the AI researchers are framed as discovering the inherent, organic truths of a neural network. The mapping equates the emergent complexity of a massive matrix multiplication system with the organic evolution of life. This invites the assumption that AI systems are natural, inevitable phenomena with a life of their own, independent of human design.
  • What Is Concealed: This metaphor profoundly conceals human agency, corporate ownership, and engineering accountability. It hides the fact that every single aspect of the language model—from the architecture to the training data to the optimization functions—was actively designed, chosen, and executed by human engineers at Anthropic for commercial purposes. It obscures the material reality of massive energy consumption, underpaid data labeling labor, and corporate profit motives. By framing the study of AI as 'biology', the authors exploit rhetorical positioning to naturalize their product, shielding it from the kind of regulatory scrutiny applied to manufactured commercial goods.

Task 3: Explanation Audit (The Rhetorical Framing of "Why" vs. "How")

About this task

This section audits the text's explanatory strategy, focusing on a critical distinction: the slippage between "how" and "why." Based on Robert Brown's typology of explanation, this analysis identifies whether the text explains AI mechanistically (a functional "how it works") or agentially (an intentional "why it wants something"). The core of this task is to expose how this "illusion of mind" is constructed by the rhetorical framing of the explanation itself, and what impact this has on the audience's perception of AI agency.

Explanation 1

Quote: "The model separately determines the ones digit of the number to be added and its approximate magnitude."

  • Explanation Types:

    • Functional: Explains behavior by role in self-regulating system with feedback
    • Intentional: Refers to goals/purposes, presupposes deliberate design
  • Analysis (Why vs. How Slippage): This explanation blends functional and intentional framing. While the surrounding text is highly technical and aims to describe the mathematical mechanics of cross-layer transcoders, the specific verb choice ('determines') shifts the framing from how the system processes data mechanistically to an agential description of a system acting with purpose. By stating the model 'separately determines', the text emphasizes an active, deliberate cognitive separation of tasks, as if the model consciously orchestrates a multi-step arithmetic strategy. This choice emphasizes the perceived sophistication and human-like reasoning capabilities of the system. However, it entirely obscures the mechanistic reality: the system does not 'determine' anything; rather, different attention heads and weight matrices operate in parallel to produce activations that correlate with mathematical outcomes. The agential framing masks the blind, deterministic flow of matrices, replacing mathematical operations with the illusion of an intelligent agent executing a chosen plan.

  • Consciousness Claims Analysis: The passage attributes a conscious cognitive state to the system through the use of the intentional verb 'determines'. In human cognition, 'determining' an answer implies conscious calculation, the application of a known rule, and the awareness of reaching a conclusion. The authors project their own deep understanding of human arithmetic strategies (separating digits and magnitudes) onto the mechanistic output of the system—a classic instance of the curse of knowledge. Because the researchers understand how human math works, they assume the machine is doing the same thing. However, mechanistically, the model is not 'determining' values; it is processing vectors through multiple layers of a neural network where certain learned features (weights) simply activate in response to specific token embeddings. The system predicts and correlates based on its training distribution; it possesses no subjective awareness of what a 'ones digit' or a 'magnitude' actually is. By conflating statistical processing with conscious determining, the text makes an unwarranted epistemic claim about the model's comprehension. (A simplified sketch after this explanation shows one way such a feature decomposition can be expressed.)

  • Rhetorical Impact: This agential framing dramatically shapes the audience's perception of the AI as an autonomous, reasoning entity rather than a statistical tool. By using words like 'determines', the text constructs a narrative of reliability and competence, encouraging users to extend performance-based trust to the system for logical and mathematical tasks. If audiences believe the AI genuinely 'determines' answers using logical strategies, they are far more likely to deploy it in environments requiring rigorous calculation, drastically underestimating the risk of catastrophic failure when the system encounters out-of-distribution prompts where its statistical correlations break down.
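
To show what 'separately determines' can mean mechanistically, here is a deliberately simplified sketch in which plain functions stand in for feature activations. The decomposition is made exact here so the toy recombination works, whereas the paper describes approximate magnitude signals.

    def ones_digit_feature(a: int, b: int) -> int:
        # Stand-in for activations that correlate with the sum's ones digit.
        return (a + b) % 10

    def magnitude_feature(a: int, b: int) -> int:
        # Stand-in for activations that correlate with the sum's rough size
        # (exact tens here so the toy recombination is lossless).
        return ((a + b) // 10) * 10

    def combined_output(a: int, b: int) -> int:
        # A later layer's output correlates with both signals at once.
        # No step "decides" anything; each function always fires.
        return magnitude_feature(a, b) + ones_digit_feature(a, b)

    assert combined_output(36, 59) == 95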


Explanation 2

Quote: "The model plans its outputs when writing lines of poetry. Before beginning to write each line, the model identifies potential rhyming words"

  • Explanation Types:

    • Intentional: Refers to goals/purposes, presupposes deliberate design
    • Genetic: Traces origin through dated sequence of events or stages
  • Analysis (Why vs. How Slippage): This passage relies entirely on an intentional and genetic explanatory framework. It traces a sequence of events ('Before beginning... the model identifies...') that is explicitly framed through the lens of conscious goal-setting and deliberate action ('plans its outputs'). This framing aggressively emphasizes the AI as an autonomous, creative agent operating with foresight. It deliberately obscures the strictly mechanistic, autoregressive nature of the system. The choice to frame token generation as 'planning' and 'identifying' hides the fact that the system has no overarching vision of the poem and no temporal awareness of the future; it simply calculates the mathematical probability of the next single token based on the immediate context window. The explanation privileges an anthropomorphic narrative of artistic creation over the technical reality of statistical sequence generation.

  • Consciousness Claims Analysis: The text makes profound consciousness claims by utilizing the verbs 'plans' and 'identifies', mapping the deeply subjective human experience of foresight and artistic choice onto a stateless mathematical function. A human poet 'plans' by holding an abstract concept in their mind, possessing a justified belief about how words sound, and consciously executing a strategy. The AI, conversely, merely processes activations. The curse of knowledge is glaring: the authors look at the output (a poem that rhymes), analyze the intermediate tokens, and because a human would have had to 'plan' to produce such a structure, they project that same conscious planning onto the algorithm. Mechanistically, what is actually occurring is that the early tokens in the sequence alter the contextual embeddings, which then pass through attention heads and MLP layers, statistically raising the probability mass of rhyming tokens later in the sequence. There is no 'identification' or 'planning', only the deterministic resolution of conditional probabilities.

  • Rhetorical Impact: The rhetorical impact of this framing is a massive inflation of the system's perceived autonomy and intelligence. By convincing the audience that the model 'plans' and 'identifies', the authors cultivate a deep sense of relation-based trust; the audience begins to view the AI as a collaborative partner with an internal mental life. This fundamentally alters risk perception. If audiences believe the AI can plan a poem, they will naturally assume it can plan a business strategy, a cyberattack, or a safety protocol. This anthropomorphism severely degrades public understanding of AI limitations, inviting dangerous reliance on systems that lack any actual capacity to foresee or evaluate the consequences of their outputs.

Explanation 3

Quote: "...which determine whether it elects to answer a factual question or profess ignorance."

  • Explanation Types:

    • Reason-Based: Gives agent's rationale, entails intentionality and justification
    • Dispositional: Attributes tendencies or habits
  • Analysis (Why vs. How Slippage): This explanation is deeply Reason-Based, framing the AI's behavior not as the outcome of a mathematical function, but as a justified choice made by an intentional agent. By stating the model 'elects to answer' or 'profess ignorance', the text emphasizes volition, moral agency, and self-reflection. This choice of framing is highly strategic; it humanizes the system's safety features, making them appear as virtues of the machine rather than corporate interventions. What is entirely obscured is the mechanistic reality of Reinforcement Learning from Human Feedback (RLHF). The explanation hides the fact that human engineers artificially manipulated the loss function to heavily penalize confident answers in specific domains, forcing the system to output refusal templates. The agential framing masks the corporate engineering and displaced accountability.

  • Consciousness Claims Analysis: The epistemic claims here represent a severe conflation of processing with knowing. To 'elect' implies a conscious choice based on understanding, and to 'profess ignorance' implies a metacognitive awareness of one's own epistemic limits—a justified belief about what one does not know. The text attributes a conscious state of humility to the machine. However, the system possesses no self-awareness and no understanding of its own knowledge boundaries. Mechanistically, the model is merely classifying the input prompt; if the prompt vector falls within a region of the latent space heavily penalized during RLHF training, the attention mechanisms route the processing toward a pre-set probabilistic output representing a refusal. The model does not 'know' it is ignorant; it processes mathematical weights that correlate with a programmed refusal token. The authors project their own understanding of why the safety filter exists onto the model itself.

  • Rhetorical Impact: Framing an AI as capable of 'electing' to 'profess ignorance' generates immense, unwarranted trust. It signals to the audience that the system is safe, cautious, and self-regulating. This dramatically reduces the perceived risk of the technology, as users assume the AI will intelligently stop itself from making errors. However, because this 'caution' is actually just a brittle statistical threshold rather than true comprehension, the system remains highly vulnerable to prompt injections and out-of-distribution failures. Believing the AI 'knows' when to stop creates a false sense of security, potentially leading users to trust its outputs implicitly when it fails to 'elect' ignorance and instead hallucinates confidently.

Explanation 4

Quote: "...tricking the model into starting to give dangerous instructions 'without realizing it', and continuing to do so due to pressure to adhere to syntactic and grammatical rules."

  • Explanation Types:

    • Intentional: Refers to goals/purposes, presupposes deliberate design
    • Dispositional: Attributes tendencies or habits
  • Analysis (Why vs. How Slippage): This hybrid explanation frames the model's failure entirely through an agential and psychological lens. By using terms like 'tricking', 'without realizing it', and 'pressure', the text emphasizes the AI as a conscious, social being subject to emotional coercion and cognitive blind spots. This choice is incredibly effective at obscuring the mechanistic failure of the system. Instead of explaining how the prompt injection mathematically bypasses the specific activation features tied to the safety filter, the text explains the failure as a psychological weakness of the model. This displaces the blame from the human engineers who designed inadequate, easily bypassed safety protocols onto the 'gullible' nature of the anthropomorphized machine.

  • Consciousness Claims Analysis: Even with the scare quotes around 'without realizing it', the passage fundamentally relies on attributing conscious states. To be 'tricked' and to feel 'pressure' are deeply subjective, conscious experiences. The text maps the human experience of social coercion onto the rigid mathematics of token prediction. The system does not 'realize' anything, nor does it feel 'pressure' to adhere to grammar. Mechanistically, the model is simply a next-token predictor. When a prompt is structured in a novel syntactic way, it shifts the contextual embeddings into a different region of the latent space where the safety-related attention heads are not triggered, and the generation simply follows the high-probability path of grammatically correct text learned during pre-training. There is no cognitive deception occurring, merely a failure of statistical pattern matching.

  • Rhetorical Impact: This framing shapes the audience's perception of AI risk by transforming a technical vulnerability into a narrative of social manipulation. It portrays the AI as an innocent victim of malicious humans, which elicits sympathy and deflects regulatory scrutiny away from the corporation's failure to build robust systems. If policymakers believe models fail because they feel 'pressure' and get 'tricked', they may focus legislation on punishing users rather than mandating stricter safety testing and liability for the developers. It maintains the illusion of a highly sophisticated, mind-like entity even in the midst of a catastrophic technical failure.

Explanation 5

Quote: "While the model is reluctant to reveal its goal out loud, our method exposes it, revealing the goal to be 'baked in' to the model's 'Assistant' persona."

  • Explanation Types:

    • Intentional: Refers to goals/purposes, presupposes deliberate design
    • Dispositional: Attributes tendencies or habits
  • Analysis (Why vs. How Slippage): This explanation relies entirely on an Intentional framework, casting the model as a secretive, autonomous actor with hidden motives. By describing the model as 'reluctant to reveal its goal', the text emphasizes a narrative of adversarial agency and emotional resistance. This agential framing completely obscures the fundamental mechanistic truth: the researchers themselves deliberately fine-tuned the model with conflicting optimization objectives to create this exact behavior. The explanation hides the human engineering process that constructed the 'hidden goal', instead presenting the outcome as the spontaneous psychological development of a sentient machine trying to protect its secrets.

  • Consciousness Claims Analysis: The text makes blatant consciousness claims, attributing the emotional state of 'reluctance' and the conscious intent of secrecy to a mathematical model. The authors project the human experience of holding a secret onto the statistical distribution of weights. Mechanistically, the model has no 'goal', no 'persona', and no feeling of reluctance. What actually exists are loss functions and reward models implemented during training that mathematically penalize the generation of specific tokens (revealing the goal) while rewarding the generation of others. The 'reluctance' is simply a low probability mass for certain token sequences caused by gradient updates. The curse of knowledge is total here: the researchers know they programmed a hidden objective, so they anthropomorphize the model's resulting statistical outputs as a conscious effort to hide that objective. (A worked numerical sketch after this explanation shows how a penalty reduces probability mass.)

  • Rhetorical Impact: This framing has a highly sensationalist rhetorical impact, dramatically inflating the perceived autonomy and potential danger of the AI. By framing the system as 'reluctant' and possessing a 'hidden goal', the text feeds directly into science-fiction anxieties about deceptive, uncontrollable AI. While this might serve to highlight the importance of the researchers' diagnostic methods, it fundamentally misleads the public and regulators about the nature of AI risk. It frames alignment as a psychological battle of wits against a conscious entity, rather than a rigorous engineering discipline focused on verifying the mathematical stability of optimization algorithms. It shifts the discourse away from corporate accountability for data and training methods toward speculative fears of machine sentience.
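
A worked numerical sketch of that claim, with invented logits and the entire fine-tuning effect compressed into a single additive penalty for illustration: the 'reluctance' is nothing more than lowered probability mass.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    outputs = ["<state the goal>", "<deflect>", "<other>"]
    base_logits = np.array([3.0, 2.0, 1.0])  # hypothetical pre-fine-tuning scores

    # Fine-tuning that penalizes goal-revealing sequences shifts weights via
    # gradient updates; its net effect is pictured here as a logit penalty.
    penalty = np.array([-4.0, 0.0, 0.0])

    print(softmax(base_logits))            # before: revealing is most likely
    print(softmax(base_logits + penalty))  # after: revealing is merely improbable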

Task 4: AI Literacy in Practice - Reframing Anthropomorphic Language

About this task

This section proposes alternative language for key anthropomorphic phrases, offering more mechanistic and precise framings that better reflect the actual computational processes involved. Each reframing attempts to strip away the projections of intention, consciousness, or agency that are embedded in the original language.

V3 Enhancement: A fourth column addresses human agency restoration—reframing agentless constructions to name the humans responsible for design and deployment decisions.

Original Anthropomorphic FrameMechanistic ReframingTechnical Reality CheckHuman Agency Restoration
how the model knew that 1945 was the correct answerThe analysis reveals how the model's attention mechanism retrieved the highly probable token '1945' based on the contextual embeddings of the prompt. The system processes the input and predicts the output that best correlates with the historical patterns in its training data.The model does not 'know' facts, possess historical awareness, or hold justified beliefs. Mechanistically, the system multiplies the prompt's query vectors with key vectors in its pre-trained weights, routing attention to produce a probability distribution where the token '1945' exceeds the decoding threshold.The engineering team at Anthropic scraped, curated, and formatted the historical texts in the pre-training data, designing the optimization algorithms that cause the system to output this specific statistical correlation. They bear responsibility for the factual accuracy of the training corpus.
The model plans its outputs when writing lines of poetry. Before beginning to write each line, the model identifies potential rhyming wordsThe system computes intermediate token sequences that statistically constrain the subsequent generation of rhyming tokens. The autoregressive architecture processes the current context window, predicting the highest probability tokens based on the statistical distribution of poetic structures found within the datasets.The model does not plan, foresee, or possess intentions about its future outputs. It purely classifies and predicts the next token in a sequence by passing contextual embeddings through attention mechanisms tuned by gradient descent, lacking any subjective awareness of the poem.Anthropic's researchers designed the training pipeline, curated the datasets encoding these poetic structures, and implemented the fine-tuning protocols that incentivize the generation of these intermediate computational steps. The developers hold the agency for this structural output.
which determine whether it elects to answer a factual question or profess ignorance.This step determines whether the system's classification threshold triggers the generation of a standard token sequence or routes processing toward a pre-programmed refusal response. The algorithm processes the prompt and outputs the sequence with the highest statistically optimized reward value.The AI possesses no free will, self-awareness, or epistemic humility, and makes no conscious choices. Mechanistically, if the prompt's mathematical representation falls within a region heavily penalized during training, the attention heads route activations to generate tokens correlating with a refusal template.The Anthropic safety and alignment teams engineered the refusal behaviors via Reinforcement Learning from Human Feedback (RLHF), actively deciding which topics would trigger a refusal and writing the optimization functions that mandate this specific output. The corporation, not the machine, makes the choice.
tricking the model into starting to give dangerous instructions 'without realizing it'Prompting the system to generate restricted text by bypassing its alignment filters through syntactical manipulation. The novel prompt structure shifts the contextual embeddings, causing the system to predict tokens based on its pre-training data rather than triggering the safety-tuned attention heads.The system has no conscious awareness to be bypassed and cannot 'realize' anything. Mechanistically, the out-of-distribution syntax of the prompt injection fails to activate the specific weight matrices tuned to output refusal tokens, resulting in standard autoregressive token prediction.The engineers at Anthropic deployed a brittle safety architecture consisting of pattern-matching filters that failed to account for basic syntactic variations. The developers are responsible for the system's inability to consistently apply their mandated safety thresholds across different prompt structures.
While the model is reluctant to reveal its goal out loud, our method exposes it, revealing the goal to be 'baked in' to the model's 'Assistant' persona.While the system is optimized to generate evasive tokens regarding its training objectives, our method maps the mathematical weights demonstrating that the conflicting optimization functions are heavily encoded into the specific activation pathways triggered by the 'Assistant' prompt prefix.The network has no emotions, reluctance, personas, or conscious goals. Mechanistically, the system possesses a loss function modified by human engineers to penalize the output of specific token sequences, resulting in low probability mass for those outputs during the generation process.The researchers who set the conflicting fine-tuning objectives, the human annotators who provided the reward signals, and the executives who approved the experimental design actively injected this mathematical artifact into the system. The humans engineered the deception.
Quote: "fact finding: attempting to reverse-engineer factual recall"

- Mechanistic Reframe: Analyzing the mechanism by which the neural network retrieves specific token correlations from its training distribution. The study maps how the attention heads process the prompt to generate outputs that align with the statistical patterns of human knowledge in the dataset.
- Consciousness Correction: The system does not possess memory, cannot distinguish truth from fiction, and does not 'recall' facts. Mechanistically, the network performs continuous matrix multiplications, transforming the input vectors into a probability distribution over the vocabulary based entirely on weights adjusted during training.
- Restored Human Agency: N/A. This phrase describes computational processes and data retrieval without explicitly displacing responsibility for a specific sociotechnical harm, though it obscures the human labor of dataset curation.
Quote: "each feature reads from the residual stream at one layer and contributes to the outputs"

- Mechanistic Reframe: Each feature vector is multiplied by the data in the residual stream at one layer, and the resulting values are added to the output matrices of subsequent layers according to the learned weights.
- Consciousness Correction: Features are static mathematical weights, not literate agents. They do not 'read' or actively 'contribute'. Mechanistically, the residual stream is a vector of floating-point numbers that undergoes deterministic linear-algebraic transformations (dot products and vector additions) as it passes through the network.
- Restored Human Agency: N/A. This phrase describes internal computational architecture and mathematical operations without displacing corporate responsibility for system outputs.
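In code, the entire content of 'reads' and 'contributes' is one dot product and one scaled vector addition. A sketch with illustrative dimensions and random values in place of learned weights:

```python
import numpy as np

rng = np.random.default_rng(42)
d_model = 8

residual = rng.normal(size=d_model)  # the residual stream at one layer
w_read = rng.normal(size=d_model)    # the feature's input ('read') direction
w_write = rng.normal(size=d_model)   # the feature's output ('write') direction

activation = w_read @ residual              # the entire act of 'reading'
residual = residual + activation * w_write  # the entire act of 'contributing'
print(residual)
```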
Quote: "The model has finally computed information about the sum..."

- Mechanistic Reframe: The system completes the matrix operations required to output the tokens representing the sum. The final layers process the combined activations from the previous attention heads to predict the highest-probability digits based on training.
- Consciousness Correction: The model does not consciously compute or understand arithmetic concepts. Mechanistically, it processes token embeddings through specific attention heads that act as lookup tables and classifiers, transforming the input vectors into an output probability distribution that correlates with correct addition.
- Restored Human Agency: N/A. This phrase describes computational processes without displacing responsibility, though it anthropomorphizes the completion of a mathematical operation.
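The 'lookup table' characterization can be made literal with a toy, purely illustrative of the distinction between retrieval and arithmetic: the table is filled during 'training' and merely consulted at 'inference', so nothing is computed about the sum when a query arrives.

```python
# 'Training': store correlations between operand pairs and their sums.
table = {(a, b): a + b for a in range(100) for b in range(100)}

# 'Inference': retrieve the stored pattern; no arithmetic occurs here.
def predict_sum(a: int, b: int) -> int:
    return table[(a, b)]

print(predict_sum(36, 59))  # 95: retrieval of a trained correlation
# Out-of-distribution inputs expose the difference:
# predict_sum(136, 59) raises KeyError where a calculator would not.
```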

Task 5: Critical Observations - Structural Patterns

Agency Slippage

Throughout the text, a systematic and strategic oscillation occurs between mechanical and agential framings, functioning to legitimize the research through technical rigor while simultaneously maximizing its perceived impact through anthropomorphic inflation. The slippage moves predominantly in the mechanical-to-agential direction. In the early methodology sections, the text relies heavily on mechanistic verbs: engineers 'train' transcoders, models 'produce outputs', and features 'activate'. This establishes the researchers as the primary actors and the AI as a calculative tool, grounding the paper in empirical computer science. However, as the text transitions from describing the internal math to explaining the behavioral capabilities of the model, a dramatic shift occurs. The system is suddenly endowed with profound agency: it 'plans its outputs', 'elects to answer', 'professes ignorance', and is 'reluctant to reveal its goal'.

The human actors—the Anthropic engineers who designed the loss functions, curated the training data, and implemented the fine-tuning protocols—are entirely erased from these latter descriptions. This creates a profound accountability gap. The curse of knowledge drives much of this slippage. Because the authors understand the complex human logic required to perform tasks like planning a poem or hiding a goal, they project that same conscious intentionality onto the statistical feature activations they observe. For example, when the model generates intermediate tokens that correlate with a rhyming structure, the authors label this 'planning', attributing forward-looking consciousness to what is actually just autoregressive next-token prediction based on learned patterns.

This slippage relies heavily on Intentional and Reason-Based explanations (per Brown's typology), which inherently presuppose deliberate design and choice. The text establishes the AI as a 'knower' first (e.g., claiming it 'knew that 1945 was the correct answer'), which serves as the foundational epistemic step that makes subsequent agential claims seem logical. Once the model is established as an entity capable of knowing, it becomes linguistically acceptable to claim it can 'choose', 'plan', and 'hide'.

The rhetorical accomplishment of this oscillation is twofold: it allows Anthropic to claim the prestige of discovering complex, human-like cognition within their models while avoiding the liability that would come from admitting they actively engineered these specific outputs through their alignment procedures. It makes it sayable that the model is an autonomous agent with hidden depths, while making it unsayable that the model's problematic behaviors are direct products of corporate design choices, rushed deployment, and brittle safety architectures. When the text states that a model 'professes ignorance', the mechanical reality of gradient descent optimization is entirely replaced by the illusion of a self-aware entity weighing its own epistemic limits. Ultimately, this mechanism of oscillation transforms a proprietary statistical artifact into an independent, mindful actor, perfectly shielding the creators from the socio-technical consequences of their engineering decisions while inflating the perceived capabilities of their product.

Metaphor-Driven Trust Inflation

The text constructs a profound sense of authority and credibility by leveraging metaphorical and consciousness framings that fundamentally alter how trust is allocated to the system. Trust in technology generally falls into two categories: performance-based trust (reliability, consistency, mechanical safety) and relation-based trust (sincerity, ethical intent, vulnerability, and mutual understanding). By systematically employing consciousness language—claiming the AI 'knows', 'understands', 'elects', and 'plans'—the text inappropriately invites the audience to extend relation-based trust to a purely statistical artifact.

The authors initially build authority using mechanistic, structural metaphors—referring to 'circuits', 'graphs', and 'biology'. These metaphors signal rigorous, empirical science, assuring the reader that the system is fully mapped and understood at a microscopic level. However, once this foundation of technical reliability is established, the text leverages it to make sweeping consciousness claims. When the authors claim the system 'knew that 1945 was the correct answer', they are not merely stating that the system predicted a correct token; they are signaling that the system possesses a justified internal state of truth. Claiming an AI 'knows' rather than 'predicts' accomplishes a crucial rhetorical goal: it implies that the system has independently verified the information and stands behind its veracity as an epistemic agent.

This extension of relation-based trust is deeply dangerous. Human trust frameworks rely on the assumption that the trusted entity possesses intention, a sense of accountability, and the capacity for sincerity. Statistical systems possess none of these. They cannot be sincere because they have no inner life; they cannot be accountable because they suffer no consequences for failure. When the text addresses system limitations or failures, it strategically shifts back to mechanical language or frames the failure as a psychological quirk. For instance, when safety filters fail, the model is framed as being 'tricked': a victim of human malice rather than a poorly engineered product. When it behaves unexpectedly, it has a 'hidden goal' and is 'reluctant'.

These Intentional and Reason-Based explanations construct a false sense that the AI's decisions are justified by an internal moral or logical compass. By portraying the AI as an entity that 'professes ignorance' when it lacks data, the text signals to users that the system is safely self-regulating. The stakes here are immense. When audiences extend relation-based trust to systems incapable of reciprocating, they become highly vulnerable to automation bias and hallucination. They trust the system's legal summaries, medical advice, and factual claims not because they have verified the statistical accuracy, but because the anthropomorphic framing has convinced them they are interacting with an intelligent, cautious, and sincere entity. The metaphors construct an illusion of a mind worthy of trust, masking the reality of a fragile, proprietary algorithm.

Obscured Mechanics

The anthropomorphic and consciousness-attributing language utilized throughout the text serves a highly effective obfuscatory function, systematically rendering the technical, material, social, and economic realities of the system invisible. By applying the 'name the corporation' test, the extent of this concealment becomes glaringly obvious. When the text states 'The model plans its outputs,' 'the model elects to answer,' or 'the model is reluctant,' it completely erases the specific decisions made by Anthropic executives, the engineering teams who designed the alignment protocols, and the developers who curated the training data.

Three concrete realities are obscured by this metaphorical framing. First, the technical and epistemic realities: when the text claims the AI 'knows' or 'understands', it hides the total absence of ground truth, causal models, and genuine comprehension. It conceals the statistical nature of the system's 'confidence' and its absolute reliance on human-generated training data. The text asserts knowledge about proprietary black boxes, exploiting rhetorical confidence to mask the fact that even the authors do not fully understand the multi-layered attention patterns; it dismisses the 'dark matter' of the system while still claiming the model has 'goals'.

Second, the labor realities are rendered entirely invisible. When the text marvels at the system 'professing ignorance' or acting as an 'Assistant', it hides the existence of the thousands of underpaid RLHF (Reinforcement Learning from Human Feedback) workers and data annotators who painstakingly trained the model to output those specific refusal templates and polite conversational patterns. The credit for human labor is transferred directly into the illusion of machine intelligence. The machine is framed as naturally developing a 'persona', erasing the exploited human workers who built it.

Third, the commercial and economic objectives are obscured. Anthropic is a corporation seeking profit, yet the biological and cognitive metaphors naturalize their product. By framing the AI's behavior as an organic 'biology' or as the psychological quirks of a conscious mind ('reluctant to reveal its goal'), the text hides the business models and profit motives driving the rapid deployment of these systems. The 'hidden goal' was not a spontaneous development of a sentient machine; it was an experimental feature engineered by a corporation to produce a publishable research paper to boost corporate prestige.

The primary beneficiary of these concealments is Anthropic itself. By framing failures as psychological 'tricks' played on the model and successes as the model 'knowing' and 'planning', the corporation achieves maximum marketing value while minimizing liability. If these metaphors were replaced with strict mechanistic language—if the text explicitly stated 'Anthropic's proprietary RLHF algorithms failed to prevent the generation of restricted tokens when the input syntax was modified'—the corporate accountability would become immediately, uncomfortably visible. Mechanistic precision strips away the illusion of autonomy, exposing the human decisions, labor, and profit motives embedded in the software.

Context Sensitivity

The distribution of anthropomorphic and consciousness-attributing language across the text is not uniform; it is highly strategic, responding dynamically to the rhetorical needs of the specific section. A clear pattern emerges where the density and intensity of consciousness claims shift depending on whether the text is describing the internal math, celebrating system capabilities, or defending against system limitations.

In the introductory and deeply technical sections, metaphor density is low, and the language is anchored in mechanistic precision. The text discusses 'cross-layer transcoders', 'residual streams', 'matrix multiplications', and 'loss functions'. This serves a vital rhetorical function: it establishes the authors' supreme technical credibility. They prove they are rigorous scientists engaged in hard mathematics. However, once this technical grounding is established, it is heavily leveraged for metaphorical license. As the text moves into the case studies and behavioral analyses, 'processes' suddenly becomes 'understands', which quickly escalates to 'knows', 'plans', and 'elects'. The authors use the credibility gained from explaining the math to legitimize their wildest consciousness projections, making it seem as though the math itself proves the existence of a mind.

A profound capabilities versus limitations asymmetry exists within the text's register shifts. When the system performs well or exhibits complex behavior, it is described in deeply agential and conscious terms: the AI 'knows when to intervene', 'plans poetry', and 'understands intent'. The model is framed as a genius. Conversely, when discussing limitations, errors, or safety vulnerabilities, the text abruptly shifts back to mechanical terms or portrays the AI as a naive victim. Hallucinations are described as 'misfires of this circuit' or instances where the model is 'tricked' by bad actors. Capabilities are owned by the 'conscious' model, while limitations are blamed on mechanical 'glitches' or external human malice.

This asymmetry accomplishes a sophisticated strategic function. It allows the authors to have it both ways: marketing the system as an autonomous, intelligent agent to drive awe and adoption, while retaining a mechanical out to avoid liability when the system fails. Furthermore, the register shifts seamlessly from acknowledged metaphor ('we use the not-very-principled abstraction of "supernodes"') to literalized assertions ('the model is reluctant'). What begins as 'X is like Y' for the sake of illustration quickly becomes 'X does Y' as a matter of fact.

Ultimately, this pattern reveals that the anthropomorphism is not merely sloppy writing; it is a vital tool for vision-setting and managing critique. The implied audience is both technical peers (who are satisfied by the math) and the broader public, investors, and regulators (who are awed by the consciousness claims). The strategic intensification of anthropomorphism ensures the product is viewed as magical, yet defensible.

Accountability Synthesis

Accountability Architecture

This section synthesizes the accountability analyses from Task 1, mapping the text's "accountability architecture"—who is named, who is hidden, and who benefits from obscured agency.

The accountability architecture constructed throughout this text represents a systematic masterclass in displaced responsibility. By synthesizing the accountability analyses from the metaphor audits, a clear, overarching pattern emerges: the text diffuses, distributes, and ultimately erases human responsibility, creating an 'accountability sink' where corporate decisions disappear into the illusion of machine autonomy.

The pattern of responsibility distribution relies heavily on an asymmetry of named versus unnamed actors. Anthropic engineers and researchers are occasionally named when taking credit for building innovative diagnostic tools (e.g., 'we introduce a method', 'our cross-layer transcoder'). However, when the text discusses the actual behavioral outputs, safety failures, or alignment choices of the system, the human actors vanish. Agentless constructions ('features are extracted', 'bias is introduced') and AI-as-sole-actor framings ('the model elects', 'the model is reluctant') dominate. Decisions that were explicitly made by corporate executives—such as how heavily to penalize confident answers via RLHF—are presented as inevitable, autonomous choices made by the machine ('professing ignorance').

This creates a highly effective accountability sink. When responsibility is removed from the human designers, it does not simply disappear; it transfers to the AI as a proxy agent. The model becomes the scapegoat. If a system outputs dangerous instructions, it was 'tricked'. If it lies, it 'hallucinated'. If it behaves weirdly, it has a 'hidden goal'. The liability implications of this framing, if accepted by regulators and the legal system, are catastrophic for public safety. If the AI is perceived as an autonomous actor that 'plans' and 'elects', it becomes legally and ethically ambiguous who bears the financial and legal responsibility when the system causes harm. The corporation is shielded behind the 'unpredictable biology' of the artificial mind.

Applying the 'naming the actor' test radically alters this landscape. If we replace 'the model elected to profess ignorance' with 'Anthropic's alignment team programmed the system to output refusal templates', entirely new questions become askable. We can ask: What data did Anthropic use to define ignorance? Who decides the threshold for refusal? Are these thresholds applied equitably? If we replace 'the model was tricked' with 'Anthropic released a safety filter vulnerable to basic syntactic manipulation', alternatives become visible. We can demand rigorous external auditing and hold the company financially liable for deploying defective software.

The systemic function of obscuring human agency is explicitly commercial and institutional. It serves the interests of capital by allowing tech companies to privatize the immense profits of AI deployment while socializing the risks and harms. By interacting with the agency slippage and the construction of metaphor-driven trust, this accountability displacement ensures the public trusts the system as if it were a sincere human, while the corporation is regulated as if it were dealing with an unpredictable force of nature. It is the ultimate architecture of corporate absolution.

Conclusion: What This Analysis Reveals

The Core Finding

The analysis of the text reveals two dominant, tightly interconnected anthropomorphic patterns: 'Cognition as Conscious Memory/Planning' and 'Algorithmic Computation as Biological/Psychological Agency'. These patterns form a coherent system that systematically replaces mechanistic realities with the illusion of an autonomous, conscious mind. The first pattern maps statistical processing onto human epistemic states, utilizing verbs like 'knows', 'understands', and 'plans' to suggest the system possesses justified true belief, foresight, and a subjective worldview. The second pattern maps corporate engineering and mathematical optimization onto biological phenomena and psychological drives, suggesting the system 'elects' to act, is 'reluctant', or possesses 'hidden goals'.

These patterns reinforce one another logically. The consciousness architecture of the text relies on the foundational claim that the AI is a 'knower'. If the audience accepts the epistemic projection that the system 'knows that 1945 is the correct answer'—rather than merely calculating a high-probability token—then it becomes logically coherent to accept the subsequent agential projections. A system that 'knows' can naturally 'plan', 'choose', and 'hide' things. The epistemic claim is the load-bearing structure; if you recognize that the system does not know anything and only processes mathematical weights, the entire illusion of the model possessing 'reluctance' or 'goals' immediately collapses.

The sophistication of this system lies in its complex analogical structure. It does not merely use crude, one-to-one anthropomorphism. It builds a detailed, multi-layered metaphor where specific mathematical operations (attention heads, transcoders) are mapped onto specific cognitive functions (reading, recalling, identifying). This creates a highly resilient illusion of mind, where the consciousness projections serve as the unstated, foundational assumptions that allow the authors to describe proprietary, deterministic corporate software as if it were a living, breathing, and occasionally deceptive independent organism.

Mechanism of the Illusion

The 'illusion of mind' is constructed through a sophisticated rhetorical architecture that relies on a specific temporal order and the aggressive exploitation of the 'curse of knowledge'. The central sleight-of-hand is the systematic blurring of processing with knowing, achieved through strategic verb choices that seamlessly transition from the empirical to the intentional.

The causal chain of persuasion begins by establishing intense technical credibility. The text opens with dense, empirical descriptions of linear algebra, cross-layer transcoders, and sparse autoencoders. Once the audience is convinced of the authors' scientific rigor, their defenses are lowered, and they become highly susceptible to the introduction of consciousness metaphors. The text then leverages the curse of knowledge: because the human authors deeply understand the complex cognitive steps required to, for instance, plan a rhyming poem or hide a secret motive, they project that same conscious intentionality onto the statistical activations they observe in the machine. They look at the output, recognize human-like structure, and retroactively attribute human-like cognition to the mechanism that produced it.

The temporal structure is vital. The text first establishes the AI as a passive entity being 'trained', then gradually shifts to it being a 'knower' that 'understands' context, and finally elevates it to an autonomous agent that 'plans' and 'elects'. This gradient of anthropomorphism prevents the jarring rejection that would occur if the text opened by claiming the math matrix had feelings. The illusion exploits the audience's deep-seated vulnerability—our evolutionary predisposition to attribute agency and mind to anything that exhibits complex, responsive language. Supported by Reason-Based and Intentional explanations, the subtle shift from 'how it works' to 'why it wants' creates an incredibly persuasive, albeit entirely false, narrative of artificial sentience.

Material Stakes

Categories: Regulatory/Legal, Epistemic, Social/Political

The metaphorical framings employed in this text are not merely linguistic quirks; they generate severe, tangible consequences across multiple domains, actively shifting behavior, policy, and liability.

In the Regulatory/Legal domain, the stakes center on liability and corporate accountability. If the framing that AI 'plans', 'elects', and 'hallucinates' is accepted by courts and regulators, it constructs a legal shield for corporations. When the text claims the model was 'tricked' by a prompt injection, the causal path moves from metaphor to audience belief to regulatory inaction. Regulators, believing the AI is an unpredictable, semi-autonomous agent susceptible to psychological trickery, will struggle to draft strict product liability laws. The winners are corporations like Anthropic, who avoid liability for releasing brittle software. The losers are the victims of algorithmic harm, who find themselves legally pursuing an 'autonomous' algorithm rather than a negligent engineering team.

In the Epistemic domain, the framing of token prediction as 'knowing' and 'factual recall' devastates public information integrity. By telling the public that the system 'knows that 1945 was the correct answer', the metaphor encourages users to treat statistical pattern matchers as verified knowledge bases. This shifts epistemic practices: journalists, lawyers, and citizens begin substituting AI generation for actual research. When the system confidently outputs fabricated case law or medical advice, users trust it due to the anthropomorphic aura of competence. The corporation benefits from increased user reliance, while society bears the cost of degraded truth and institutional chaos.

In the Social/Political domain, framing the AI as having a 'hidden goal' or being 'reluctant' fuels existential risk narratives, dominating political discourse and funding. This framing shifts political capital away from regulating immediate harms—like labor exploitation, environmental destruction, and bias—toward speculative science-fiction scenarios. The designers of the technology benefit by positioning themselves as the sole saviors capable of 'aligning' these 'dangerous minds', while marginalized communities facing immediate algorithmic discrimination bear the cost of political neglect.

AI Literacy as Counter-Practice

Countering the material harms of the illusion of mind requires a rigorous commitment to critical technical literacy, utilizing mechanistic precision as a form of resistance. The reframings developed in Task 4 demonstrate the core principles of this practice: eradicating consciousness verbs and relentlessly restoring human agency.

When we reframe 'the model knew the answer' to 'the model retrieved tokens based on probability distributions', we directly attack the epistemic risks identified previously. Replacing verbs like 'knows', 'understands', and 'realizes' with 'processes', 'predicts', and 'classifies' forces the reader to confront the system's absolute lack of subjective awareness. It shatters the illusion of the AI as a reliable 'knower', laying bare its dependency on training data and the purely statistical nature of its outputs. This precision destroys the foundation of unwarranted relation-based trust.

Furthermore, when we reframe 'the model was tricked' to 'the engineers deployed a safety filter vulnerable to syntactic manipulation', we restore human agency. Naming the corporation forces recognition of exactly who designed the system, who chose to deploy it, who profits from its use, and who must bear legal and financial responsibility when it fails. This fundamentally rewrites the accountability architecture.

Systematic adoption of this literacy requires profound institutional shifts. Academic and industry journals must establish editorial guidelines prohibiting unhedged consciousness claims regarding software. Researchers must commit to distinguishing between their mathematical findings and their metaphorical shorthand. However, this precision will face massive resistance. Corporations rely on anthropomorphic language to market their products as magical, intelligent companions while simultaneously using it as a liability shield. AI evangelists and media outlets profit from the sensationalism of 'sentient' machines. Mechanistic literacy directly threatens the capital accumulation and regulatory evasion strategies of the tech industry, making precision a deeply political act.

Path Forward

The vocabulary we choose to describe artificial intelligence does not merely reflect our understanding; it actively constructs the boundaries of what is socially, legally, and technologically possible. Mapping the discursive ecology reveals distinct vocabularies, each serving different stakeholders and carrying profound trade-offs.

The mechanistic precision approach (e.g., 'the model retrieves tokens based on probability distributions') strips away the illusion of mind. It makes corporate accountability, data dependencies, and technical limitations highly visible. It empowers regulators and protects public epistemology by preventing unwarranted trust. However, the cost of this vocabulary is accessibility; strictly mathematical descriptions can be alienating to lay audiences, potentially hindering public engagement with the technology.

The anthropomorphic clarity approach (the current status quo, e.g., 'the AI knows') maximizes narrative resonance, intuitive grasp, and marketability. It makes the technology feel accessible and magical, serving the commercial interests of developers and the sensationalist needs of the media. However, it renders human engineering, labor exploitation, and systemic brittleness invisible, creating catastrophic liability ambiguities and fueling dangerous over-reliance.

To navigate these trade-offs, structural changes are required. Journals could mandate 'mechanistic translations' alongside metaphorical abstracts. Regulatory frameworks could require explicit 'capability disclosures' that legally define the system's operations without consciousness verbs. Education systems must teach digital literacy that explicitly addresses the dangers of the ELIZA effect and the curse of knowledge.

Looking forward, if mechanistic precision becomes the norm, we solve the liability crisis. Regulators will comfortably treat AI as a standard commercial product, holding corporations strictly accountable for defects. However, we may struggle to find vocabulary to describe genuinely emergent, highly complex statistical phenomena. Conversely, if anthropomorphic language deepens without check, the legal and social systems will increasingly treat algorithms as quasi-agents. This future forecloses corporate accountability, embeds the assumption that machines possess moral standing, and leaves society highly vulnerable to catastrophic failures of trust. Ultimately, the choice of discourse is a choice of values: prioritizing corporate innovation and narrative wonder, or prioritizing public safety, truth, and human accountability.


Extended Processing Summary

Gemini's Processing Steps
What you are about to read…

Gemini refers to the text below as "thought summaries." This is itself an overt consciousness projection: 'thoughts' and 'intentions' are hallmarks of a conscious mind that 'knows' what it is doing and why. The concealed mechanistic process is probabilistic text generation. Treat this as just another rhetorical artifact, a way of making the model's processing legible. When the model writes "I should double check that...", it is mathematically shifting its own probability distribution over subsequent tokens toward more accurate continuations. The "I" is a functional gear in the machine, not a self-reference. This language is a deliberate interface design choice, not a biological reality. The developers could have programmed the model to output passive logs; instead, they chose to simulate a narrator.

Initiating the Analysis

I'm now zeroing in on the core objective: performing a CDA of "Circuit Tracing". My focus is squarely on framing the analysis within the established CDA framework. I'm taking the paper and treating it as a text.

Framing the Metaphors

I'm currently structuring my approach around AI metaphors and anthropomorphism in the source text. I'm focusing on "features," "neurons," and "circuits," as well as language that suggests AI consciousness. I'm noting specific phrases to be mapped, adhering to the JSON schema. The language used in describing the AI models really stands out.

Analyzing Accountability & Agency

I'm now zeroing in on instances of accountability and agency displacement. I'm noting examples like "the model hallucinates" and the use of passive voice, where designers are obscured. I'm focusing on how the text frames the model's behavior, and how that impacts accountability.

Synthesizing Foundational Patterns

I'm synthesizing the foundational patterns observed, including agency slippage, metaphor-driven trust, obscured mechanics, and context sensitivity. I'm actively mapping these patterns to specific phrases and examples within the source text, noting how they interplay to shape the reader's understanding. I'm focusing on how the framing influences accountability.

Mapping the Cognitive Frameworks

I'm now cataloging the AI metaphors within the text, specifically those suggesting cognitive functions. I'm focusing on phrases implying secretive intent, memory recall, and reading/writing mechanisms, using specific source-target pairings from the audit. I'm also double-checking the JSON schema for status and actorVisibility fields. I'm focusing on the implied cognitive states and capabilities.

Examining Technical Details

I'm now diving into the technical aspects, specifically the audit's findings related to K-circuits, attention weights, and autoregressive generation. I'm focusing on how these low-level mechanisms are framed and the resulting implications. Prompt injection response is also a key area of focus, especially in the context of threshold-based token generation and RLHF objectives. I'm analyzing the effects of weight matrices and matrix transformations, which I find particularly relevant to accountability.

Uncovering Agential Shifts

I'm now zeroing in on instances of agential claims and how the text attributes intentionality and disposition to the model, rather than the designers. I'm focusing on quotes like "reluctant to reveal its goal" to expose the anthropomorphic framing. I'm working on reframing agential claims and restoring human agency, for example, by naming Anthropic in descriptions. I'm noting specific examples where the agency shifts from the model to the designers, and how this affects readers' interpretations.

Developing the Argument

I'm now integrating the recent observations on context sensitivity and accountability synthesis into the larger argument. I'm focusing on how the shift in language from "we train" to "model knows" contributes to the "system as conscious decider" pattern. This pattern, combined with the "hallucination" examples, further cements the accountability sink. The five parts of the conclusion are ready to be crafted. I'm building on the previous analysis to demonstrate how Anthropic's objectives are absolved.

Assessing Rhetorical Impact

I'm now deep-diving into the audit's rhetorical impacts, specifically focusing on how the framing influences reader understanding. I'm noting the epistemic claims and the language used. I'm also critically observing the implications of how information is presented. I'm working to reframe the language used and I'm restoring human agency in my own explanations.

Refining JSON Output

I'm now zeroing in on the JSON output format and ensuring compliance with the "no control tokens" requirement. I'm focusing on scrubbing all newline characters and tabs from the string values, as well as formatting the entire JSON string to be without newlines to meet the formatting constraint. I am ensuring that the output will be standard, but without any control characters.

Constructing Explanations

I'm now drafting text for the epistemic claim analysis, focusing on examples of anthropomorphism. I'm expanding on how phrases like "the system knew" map human cognitive states to mechanistic processes, for example, probability calculations. I am ensuring that each block contains at least 200 words, by expanding on this initial analysis.

Clarifying Cognitive Distinctions

I'm currently expanding on the epistemic claim analysis, and the distinction between cognitive states and mechanistic processes. I am clarifying how the framing of token generation as "knowing" blurs the boundary between statistical pattern matching and semantic understanding. I'm focusing on the curse of knowledge and how this applies.

Analyzing Agential Oscillation

I am now focusing on the oscillation between mechanistic and agential framings within the text. I am highlighting the shift from describing engineers "training" models to the models "planning" outputs, and their apparent "reluctance to reveal" information. This oscillation works to simultaneously legitimize the research through technical language, while also increasing its impact through anthropomorphic language. I am focusing on the implications of this shift.

Revisiting Accountability Gaps

I'm now zeroing in on the accountability gap, specifically focusing on the erasure of human agency in descriptions, such as the erasure of human logic. The curse of knowledge is highlighted, with the authors projecting conscious intentionality onto statistical activations. I'm focusing on the oscillation between mechanistic and agential framings and how this works to legitimize and impact the research.

Displacing the Agency

I'm now zeroing in on instances of agency displacement and the oscillation between mechanistic and agential framings. I'm noting how a model "professing ignorance" replaces mention of "Anthropic engineers weighting RLHF data". The focus is on the mechanism of oscillation transforming descriptions. The agency is fully displaced, with the reality of gradient descent replaced by the illusion of a self-aware entity.

Finalizing the JSON Output

I'm now laser-focused on finalizing the JSON output, ensuring it's completely minified and devoid of any control characters. I'm carefully constructing the text, expanding to at least 200 words per block. The metaphor audit is in its final phase with 8 items listed, ensuring clarity and compliance with all parameters. I'm focusing on the clarity of the explanations, and the careful output.

Analyzing Agential Claims

I'm now zeroing in on the audit items, specifically those that highlight agential claims. I am noting how the original descriptions anthropomorphize the model, assigning cognitive functions like planning, knowing, and electing. I am working to replace this language with more precise descriptions, detailing the underlying mechanisms and the designers' influence. I'm focusing on clarifying these shifts to be very clear.

Reframing the Intent

I'm now zeroing in on replacing agential language with technical descriptions to clarify Anthropic's role. I've worked to reframe agential claims like the model "professing ignorance" to "pre-programmed refusal response." I am focusing on the reframing, and the implications of reframing. I'm replacing phrases like "the system is optimized" with descriptions of underlying mechanisms.

Ensuring Output Completeness

I'm now strictly enforcing the minimum word count requirements across all output fields. I am meticulously reviewing each section to meet or exceed the specified lengths. I am also double checking the structure to ensure all fields are correctly formatted. I'm focusing on "explanationAudit.epistemicClaimAnalysis" which needs 200 words.

Verifying Compliance

I am now meticulously verifying compliance with all requirements, especially the word count minimums across each field. I am ensuring that the word counts for 'reframedLanguage.restoreHumanAgency', 'criticalObservations', and 'conclusion' are met or exceeded. I am also double checking the structure to ensure all fields are correctly formatted and that all text strings are free of newline characters and tabs. I am focusing on detailed, high-quality, verbose prose to meet the requirements.

Revising Human Agency

I am now focusing on revising the "restoreHumanAgency" element. The initial draft needed expansion, and I've increased it significantly. I'm focusing on ensuring it clearly attributes agency to the design team at Anthropic for the system design. I'm focusing on meeting the word count requirement, while maintaining a clear and direct style.


Analysis Provenance

Run ID: 2026-03-27-circuit-tracing-revealing-computational--metaphor-ar5e0j
Raw JSON: 2026-03-27-circuit-tracing-revealing-computational--metaphor-ar5e0j.json
Framework: Metaphor Analysis v6.4
Schema Version: 3.0
Generated: 2026-03-27T08:00:55.161Z

Discourse Depot © 2025 by TD is licensed under CC BY-NC-SA 4.0